Job random stuck with "Job is waiting for a runner from XXX to come online" #3501

Closed
4 tasks done
cheskayang opened this issue May 6, 2024 · 3 comments
Labels
bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers

Comments

@cheskayang

Checks

Controller Version

0.9.1

Deployment Method

ArgoCD

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

This error happens randomly.

Describe the bug

A job randomly gets stuck with the message "Job is waiting for a runner from XXX to come online".
Cancelling the job and rerunning it fixes the problem.

Observations:
1. Compared with jobs that run normally, the stuck job never produces a "job started" message on the listener pod (i.e. the listener only receives "job available" and "job assigned", and then the job stays stuck until it is cancelled manually).
2. For the stuck job, the EphemeralRunnerSet gets patched to 1 replica and then immediately gets patched to null; see the logs below.

2024-05-06T13:44:52Z INFO listener-app.worker.kubernetesworker Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":1,\"replicas\":1}}"}
2024-05-06T13:44:52Z INFO listener-app.worker.kubernetesworker Scaling ephemeral runner set {"assigned job": 1, "decision": 1, "min": 0, "max": 20, "currentRunnerCount": 0, "jobsCompleted": 0}
2024-05-06T13:44:52Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runners", "name": "basic-runner-622tx", "replicas": 1}
2024-05-06T13:44:52Z INFO listener-app.listener Getting next message {"lastMessageID": 248}
2024-05-06T13:45:42Z INFO listener-app.worker.kubernetesworker Created merge patch json for EphemeralRunnerSet update {"json": "{\"spec\":{\"patchID\":0,\"replicas\":null}}"}
2024-05-06T13:45:42Z INFO listener-app.worker.kubernetesworker Scaling ephemeral runner set {"assigned job": 0, "decision": 0, "min": 0, "max": 20, "currentRunnerCount": 1, "jobsCompleted": 0}
2024-05-06T13:45:42Z INFO listener-app.worker.kubernetesworker Ephemeral runner set scaled. {"namespace": "arc-runners", "name": "basic-runner-622tx", "replicas": 0}
2024-05-06T13:45:42Z INFO listener-app.listener Getting next message {"lastMessageID": 248}
3. The following log message seems to appear on the controller side when this issue happens:
Getting runner jit config failed with conflict error, trying to get the runner by name
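
For anyone trying to confirm the same pattern in their own listener logs, here is a minimal Python sketch. It is not part of ARC; the helper names (`parse_line`, `find_suspect_scale_downs`) are hypothetical, and the line layout (`<timestamp> <level> <logger> <message> <json fields>`) is only inferred from the excerpt above. It flags a scale-up that is followed by a scale-down to 0 while `jobsCompleted` is still 0.

```python
import json
import re
from datetime import datetime

# Assumed layout, inferred from the log excerpt in this issue:
# "<timestamp> <level> <logger> <message> <json fields>"
LINE_RE = re.compile(r"^(\S+)\s+(\S+)\s+(\S+)\s+(.*?)\s*(\{.*\})\s*$")


def parse_line(line):
    """Split one log line into (timestamp, message, fields); None if it does not fit."""
    match = LINE_RE.match(line)
    if match is None:
        return None
    ts, _level, _logger, message, raw_fields = match.groups()
    try:
        timestamp = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        fields = json.loads(raw_fields)
    except ValueError:
        # Covers both a malformed timestamp and malformed JSON.
        return None
    return timestamp, message, fields


def find_suspect_scale_downs(lines):
    """Return (scale_up_time, scale_down_time) pairs where a runner was
    requested and then released again without any job having completed."""
    suspects = []
    pending_scale_up = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None or parsed[1] != "Scaling ephemeral runner set":
            continue
        timestamp, _message, fields = parsed
        if fields.get("decision", 0) > 0:
            # Remember the first scale-up that has not been resolved yet.
            if pending_scale_up is None:
                pending_scale_up = timestamp
        else:
            if pending_scale_up is not None and fields.get("jobsCompleted", 0) == 0:
                suspects.append((pending_scale_up, timestamp))
            pending_scale_up = None
    return suspects


# Sample lines copied from the logs in this issue.
SAMPLE = [
    '2024-05-06T13:44:52Z INFO listener-app.worker.kubernetesworker Scaling ephemeral runner set {"assigned job": 1, "decision": 1, "min": 0, "max": 20, "currentRunnerCount": 0, "jobsCompleted": 0}',
    '2024-05-06T13:44:52Z INFO listener-app.listener Getting next message {"lastMessageID": 248}',
    '2024-05-06T13:45:42Z INFO listener-app.worker.kubernetesworker Scaling ephemeral runner set {"assigned job": 0, "decision": 0, "min": 0, "max": 20, "currentRunnerCount": 1, "jobsCompleted": 0}',
]

for up, down in find_suspect_scale_downs(SAMPLE):
    gap = (down - up).total_seconds()
    print(f"scaled up at {up}, scaled down {gap:.0f}s later with no completed job")
```

On the sample lines this reports a single suspect pair, with the scale-down 50 seconds after the scale-up. A match only shows that the listener released the runner before any job completed; it does not by itself explain why the job never started.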

Describe the expected behavior

The job should not get stuck.

Additional Context

- Cancelling the job and rerunning it fixes the issue
- The issue occurs on both 0.9.0 and 0.9.1
- Running on GKE

Controller Logs

see description

Runner Pod Logs

N/A
@cheskayang cheskayang added bug Something isn't working gha-runner-scale-set Related to the gha-runner-scale-set mode needs triage Requires review from the maintainers labels May 6, 2024
Contributor

github-actions bot commented May 6, 2024

Hello! Thank you for filing an issue.

The maintainers will triage your issue shortly.

In the meantime, please take a look at the troubleshooting guide for bug reports.

If this is a feature request, please review our contribution guidelines.

@cheskayang
Author

This is the same issue as mentioned in #3499 and #3420.
This report provides more details on the observed behavior.

@nikola-jokic
Member

Closing this one as a duplicate. Thank you for linking it!
