Feature: Install GCP Ops Agent Automatically #1296

Open
evamaxfield opened this issue Dec 22, 2022 · 6 comments
Labels: cloud-gcp (Google Cloud), cml-runner (Subcommand), enhancement (New feature or request), external-request (You asked, we did), p2-nice-to-have (Low priority)

Comments

@evamaxfield

When creating a GCP runner with CML, it would be great to have memory utilization, disk utilization, and generally better logging available. GCP Ops Agent seems to be the way to do that.

It would be great to install Ops Agent on GCP runners automatically as part of the startup script.

DavidGOrtega added the enhancement, cml-runner, and cloud-gcp labels Dec 26, 2022
@DavidGOrtega
Contributor

👋 @evamaxfield thanks for your feedback!

Runners also accept a base64-encoded setup script via the --cloud-startup-script parameter.
According to the docs, installing the Ops Agent would be:

curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh
sudo bash add-google-cloud-ops-agent-repo.sh --also-install

The cml command could be:

cml runner launch \
      --labels=cml-runner \
      --cloud=gcp \
      --cloud-startup-script=Y3VybCAtc1NPIGh0dHBzOi8vZGwuZ29vZ2xlLmNvbS9jbG91ZGFnZW50cy9hZGQtZ29vZ2xlLWNsb3VkLW9wcy1hZ2VudC1yZXBvLnNoCnN1ZG8gYmFzaCBhZGQtZ29vZ2xlLWNsb3VkLW9wcy1hZ2VudC1yZXBvLnNoIC0tYWxzby1pbnN0YWxs
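
For anyone who wants to change the startup script, here is a minimal sketch of how a value like the one passed to --cloud-startup-script can be produced and checked. The variable names STARTUP_SCRIPT, STARTUP_B64 and DECODED are made up for illustration; the sketch only assumes a base64 tool that accepts --decode.

```shell
# The two install commands quoted above, joined by a newline (no trailing newline).
STARTUP_SCRIPT=$(printf '%s\n%s' \
  'curl -sSO https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.sh' \
  'sudo bash add-google-cloud-ops-agent-repo.sh --also-install')

# Encode to a single line; tr strips the line wraps that some base64 tools insert.
STARTUP_B64=$(printf '%s' "$STARTUP_SCRIPT" | base64 | tr -d '\n')

# Round-trip check before handing the value to --cloud-startup-script.
DECODED=$(printf '%s' "$STARTUP_B64" | base64 --decode)
[ "$DECODED" = "$STARTUP_SCRIPT" ] && echo "round trip ok"

# Print the encoded value so it can be pasted into the cml command.
echo "$STARTUP_B64"
```

Decoding the value before launching is a cheap way to confirm that the runner will execute exactly the commands you expect.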

Disclaimer: I have not tried it yet, but it's the counterpart of the already known AWS scenario.

DavidGOrtega self-assigned this Dec 26, 2022
@evamaxfield
Author

Hey! I actually gave that a try a couple of days ago, but it didn't work: the --also-install step of the setup script failed. I will try it again and forward the logs.

Unfortunately, the only way for me to try it and get the logs is to open a cloud SSH connection and run it there rather than in the startup script. Is that okay?

@dacbd
Contributor

dacbd commented Dec 29, 2022

You might find this helpful for viewing the instance's startup logs:

tail -n 10000 -f /var/log/syslog | awk 'match($0, /startup-script:/){print substr($0,RSTART+16) }'

Optionally, append | more to page through the output.

⚠️ Note: these logs may contain credentials, so I wouldn't blindly copy and paste them.
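
To see what the awk filter keeps, here is a small sketch that runs it on a single made-up line. The input is synthetic and only mimics the "startup-script:" marker; it is not real syslog output, and FILTERED is a hypothetical variable name.

```shell
# Synthetic input line for illustration only (not a real log entry).
SAMPLE='Jan  1 00:00:00 example-host example-proc: startup-script: hello world'

# match() sets RSTART to where "startup-script:" begins; skipping 16 characters
# drops the 15-character marker plus the space that follows it.
FILTERED=$(printf '%s\n' "$SAMPLE" |
  awk 'match($0, /startup-script:/){print substr($0,RSTART+16) }')

echo "$FILTERED"
```

Lines without the marker are dropped entirely, so the output contains only what the startup script itself printed.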

@evamaxfield
Author

evamaxfield commented Jan 8, 2023

Update: I am trying to look at the logs but am not entirely sure what I should be looking for. I also copied much of one of the cml-playground YAML examples, but instead of sleeping for 30s I sleep for 10 minutes so that I can watch the logs and explore the machine from a separate SSH session.

The machine crashes regardless. Taken together with all of my other failed runs, it looks like the machine can't last longer than ~10 minutes?

Example with just sleeping / cycling for 10 minutes and crashing: https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3868218689
Example with python code attempting to run: https://github.com/evamaxfield/gcloud-whisper-testing/actions/runs/3868130080

Actually, a correction: across all the runs, none succeed if the actual usage of the created runner lasts longer than 5 minutes. The first job, which creates the GCP runner, works just fine. The second job, which uses that runner, has never lasted more than 5 minutes and typically crashes at 4:59 (or 5:00 ± 10 seconds).

Lots of examples: https://github.com/evamaxfield/gcloud-whisper-testing/actions

@evamaxfield
Author

This seems more related to #1291 than to this feature. If you want me to reopen that issue or move the discussion there, let me know.

@evamaxfield
Author

Having found that, it seems like most of my issues stem from #1255.

I increased the idle-timeout and my action is working. I will run a few more tests to make sure.
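
For readers hitting the same 5-minute cutoff, a sketch of what raising the timeout might look like. It assumes the --idle-timeout flag of cml runner takes a number of seconds; the value 3600 and the names IDLE_TIMEOUT and CML_CMD are illustrative, not what was used in this thread.

```shell
# Hypothetical value: one hour, in seconds. Pick something longer than your longest job.
IDLE_TIMEOUT=3600

# Assemble the launch command as a string so it can be inspected first.
CML_CMD="cml runner launch --labels=cml-runner --cloud=gcp --idle-timeout=${IDLE_TIMEOUT}"

# Dry run: print the command instead of executing it
# (actually running it requires cml and GCP credentials).
echo "$CML_CMD"
```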

casperdcl added the external-request label Jan 12, 2023
0x2b3bfa0 added the p2-nice-to-have label Jan 17, 2023