Do not use the same runner when the script failed #1514

Merged
enescakir merged 2 commits into main from failed-runner on Apr 30, 2024

enescakir
Member

Add a helper to create a spare runner

Our runners are job-agnostic, meaning they can run any job with a matching label. This lets us create a spare runner with the same label if the initial one doesn't function properly.

This helper is also useful for on-call engineers.
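
As a rough illustration of why job-agnostic runners make a spare cheap to create, here is a minimal Python sketch. It is not the code from this PR: the `Runner` fields (`label`, `repository`, `id`) and the in-memory id counter are assumptions made for the example. Only the helper name `provision_spare_runner` comes from this conversation.

```python
from dataclasses import dataclass, field
from itertools import count

# Hypothetical id source; a real system would use database-generated ids.
_next_id = count(1)


@dataclass
class Runner:
    label: str        # the label jobs are matched against
    repository: str   # hypothetical: the repository the runner serves
    id: int = field(default_factory=lambda: next(_next_id))


def provision_spare_runner(original: Runner) -> Runner:
    """Create a fresh runner able to pick up any job the original could.

    Because runners are job-agnostic, copying the label (and, in this
    sketch, the repository) is enough; no job state is carried over.
    """
    return Runner(label=original.label, repository=original.repository)


if __name__ == "__main__":
    broken = Runner(label="standard-2", repository="example/repo")
    spare = provision_spare_runner(broken)
    print(spare.label == broken.label, spare.id != broken.id)
```

An on-call engineer could call such a helper by hand for a runner that looks stuck, without needing to know which job it was meant to pick up.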

Do not use the same runner when the script failed

Currently, when the script fails, we assume it failed because of an initialization error and try to register the same runner again. That assumption does not always hold: the script might also fail while the workflow is running. In that case we should create a new spare runner and destroy the failed one.
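
The change in behavior can be sketched as follows. This is a simplified Python model under stated assumptions, not the implementation in this PR: `RunnerPool`, `provision`, and `on_script_failed` are hypothetical names, and destruction is modeled as removal from an in-memory dictionary.

```python
from dataclasses import dataclass


@dataclass
class Runner:
    name: str
    label: str
    destroyed: bool = False


class RunnerPool:
    """Hypothetical in-memory stand-in for the runner management layer."""

    def __init__(self) -> None:
        self.runners: dict[str, Runner] = {}
        self._seq = 0

    def provision(self, label: str) -> Runner:
        self._seq += 1
        runner = Runner(name=f"runner-{self._seq}", label=label)
        self.runners[runner.name] = runner
        return runner

    def on_script_failed(self, failed: Runner) -> Runner:
        # Old behavior (not shown): re-register the same runner, assuming
        # the failure happened during initialization.
        # New behavior: do not guess the cause. Provision a spare with the
        # same label, then destroy the failed runner.
        spare = self.provision(failed.label)
        failed.destroyed = True
        del self.runners[failed.name]
        return spare


if __name__ == "__main__":
    pool = RunnerPool()
    first = pool.provision("standard-2")
    spare = pool.on_script_failed(first)
    print(first.destroyed, spare.name, sorted(pool.runners))
```

Provisioning the spare before destroying the failed runner means that a matching runner is requested even if the cleanup step itself goes wrong.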

I could implement additional checks, such as verifying whether the runner has already completed its job and skipping the spare runner if it has. However, runner script failures are uncommon. Even when errors occur, the script generally exits with a zero exit code rather than a non-zero one. Therefore, I believe it's currently unnecessary to add more checks. If script failures increase, we can reconsider adding them.
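
For reference, the deferred check might look roughly like the Python sketch below. It is not part of this PR. The function name `needs_spare_runner` and both parameters are hypothetical, and it assumes that a "failed" script means a non-zero exit code and that the job's completion status can be looked up (`None` when it is unknown).

```python
from typing import Optional


def needs_spare_runner(script_exit_code: int, job_completed: Optional[bool]) -> bool:
    """Decide whether a spare runner should be created after the script exits.

    script_exit_code: exit code of the runner script (assumed: non-zero means failed).
    job_completed:    True or False if known, None if the status is unavailable.
    """
    if script_exit_code == 0:
        # The script did not fail, so there is nothing to replace.
        return False
    # The script failed. Skip the spare only when the job is known to be
    # finished; an unknown status is treated as "not finished" to stay safe.
    return job_completed is not True


if __name__ == "__main__":
    print(needs_spare_runner(1, False))
    print(needs_spare_runner(1, True))
```

Treating an unknown job status as unfinished errs on the side of creating an extra runner rather than leaving a job without one.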

Contributor

@bsatzger bsatzger left a comment

Thanks! It would be great if you could update the wiki to explain how on-call engineers should use provision_spare_runner.

Base automatically changed from runner-script to main April 30, 2024 07:18
@enescakir enescakir merged commit 18b4836 into main Apr 30, 2024
6 checks passed
@enescakir enescakir deleted the failed-runner branch April 30, 2024 07:31
@github-actions github-actions bot locked and limited conversation to collaborators Apr 30, 2024