"Job is waiting for a runner from XXX to come online" with 0.9.0 and 0.9.1 (works with 0.8.3) #3499
Comments
Same behaviour here as well. Appreciate you posting the workaround(s)! Killing the listener pod got things online, but I will revert to 0.8.3 in order to have some semblance of reliability in the interim. 🙏
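For reference, a minimal sketch of that workaround, assuming the quickstart default `arc-systems` controller namespace (yours may differ):

```bash
# Find the listener pod (named like <scale-set>-<hash>-listener) and delete it;
# the controller recreates it automatically. "arc-systems" is an assumption.
kubectl get pods -n arc-systems
kubectl delete pod <listener-pod-name> -n arc-systems
```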
We're facing the same issue. Reverted to 0.8.3 as well.
having the same issue with 0.9.0 and 0.9.1; added some observations in this ticket. thx for the tip! will revert to 0.8.3 for now
Same issue, reverting to 0.8.3. Thanks for the tip!
Tagging @nikola-jokic, who was active in #3420: would you be able to help with this issue?
For those of you that are experiencing this issue: are you using ArgoCD, by chance? Just trying to see if there are any other correlations between my setup and others that are experiencing issues. I've noticed that the listeners seem to need restarting around the time of a sync with Argo, and I'm curious if that has an effect.
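One rough way to check that correlation, assuming the `argocd` CLI is available; the app name `arc` and the `arc-systems` namespace are placeholders, not from this thread:

```bash
# Compare the listener pod's age against the timestamps of recent Argo syncs.
kubectl get pods -n arc-systems        # note the AGE of the *-listener pod
argocd app history arc                 # note when the last syncs happened
```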
I have the same issue. I am using FluxCD. I will try to use the 0.8.2 version to see if it helps. EDIT: I opted for version 0.8.3, and it's performing beautifully! To downgrade, I removed all the namespaces, CRDs, roles, role bindings, service accounts, deployments, Helm charts, etc. Omitting this step appears to cause problems with the listener not starting up.
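For reference, a rough sketch of such a clean-slate downgrade; the release and namespace names (`arc`, `arc-runner-set`, `arc-systems`, `arc-runners`) follow the ARC quickstart defaults and are assumptions here:

```bash
# Uninstall the runner scale set and the controller (names assumed).
helm uninstall arc-runner-set -n arc-runners
helm uninstall arc -n arc-systems
# Helm can leave the ARC CRDs behind; delete any that remain explicitly.
kubectl delete $(kubectl get crd -o name | grep actions.github.com)
# Remove the namespaces (and any other leftovers), then reinstall 0.8.3 fresh.
kubectl delete namespace arc-runners arc-systems
```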
Appears to be happening with 0.8.3 as well now.
happening here as well; when inspecting the ephemeral runner, we're getting:
I'm able to reproduce the problem pretty consistently in 0.9.0 and 0.9.1 by kicking off around a dozen jobs simultaneously. With this many pods spinning up at the same time, it takes longer for them to all initialize, which seems to trigger the bug. A little less than a minute after EphemeralRunnerSet started scaling up, I see logs like this:
All the runners get killed and never report any status back to GitHub. Rolling back to 0.8.3 seems to fix the problem for me. Update: After doing a bit more searching, the problem I'm seeing sounds exactly like #3450.
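For anyone who wants to try the same repro, a hypothetical sketch using the `gh` CLI; the workflow file `ci.yml` and repo `my-org/my-repo` are placeholders:

```bash
# Kick off ~12 workflow runs at once so a dozen runner pods initialize
# simultaneously. Assumes an authenticated gh CLI.
for i in $(seq 1 12); do
  gh workflow run ci.yml --repo my-org/my-repo &
done
wait
```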
I am facing this issue on version 0.7.0. What's the solution?
Hey @alecor191, could you please let us know if the issue is resolved in version 0.9.2?
Hi @nikola-jokic. Sure, happy to. I just upgraded to 0.9.2. Update 2024-05-21: No issues after running it for 12+ hours. However, then we started observing stuck pipelines; this could very well be due to an ongoing GHA incident. Update 2024-05-22: No issues since the GHA incident mentioned above was resolved. IOW, 0.9.2 has been working for us.
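For anyone else trying the upgrade, a sketch of pinning the controller chart to 0.9.2; the release name `arc` and namespace `arc-systems` follow the quickstart defaults and may differ in your setup:

```bash
# Upgrade the gha-runner-scale-set controller chart to a pinned version.
helm upgrade arc \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.9.2 \
  -n arc-systems
```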
I'm facing the same issue. I was previously on 0.5.1 and updated to 0.8.3 as suggested, but the problem remains; so I updated to 0.9.2 and nothing seems to change.
0.9.2 suffers from the same issues. Either the job is waiting for a runner and never starts, although the pod is up and running and the job is assigned, or it times out in the middle from time to time. Very dissatisfying... In fact, ARC has apparently not been usable for us since 0.9.0.
To put in my 2 cents: I have to say we've been facing these issues for the past few weeks. Any idea what's going on?
Hey @zdenekpizl-foxdeli, just to make sure I understand: the job lands on the runner, but it never finishes?
I am all for debugging it thoroughly, but I have no idea what else to debug on my side. Please propose a root-cause-analysis approach for this issue...
A few more observations (I've redeployed the self-hosted runners and controller once again, just to be sure everything is clean and installed neatly):
According to the output in the Web UI, the job is waiting for:
So there is a new runner/worker for DinD created, but the job is assigned to the old one. And of course, it does not even start, much less run or finish. Interestingly, the Kubernetes mode (k8s-prefixed runners) works fine; no issues when invoking workflows/jobs. Why is there such a dichotomy?
Hey @zdenekpizl-foxdeli, I'm not sure; is it possible that Docker or something kills the runner? It doesn't look like an ARC issue, since the pod is up, ARC scales, and the runner takes the job. What I'm not sure about is what causes the runner to exit. What is the pod lifecycle after the job is accepted by the runner?
Hmm, that's a good question and I have no answer for it. I found some traces that the container in the pod had been restarted, but did not find the reason. So maybe resource exhaustion, who knows... Anyway, my problem has disappeared because I uninstalled ARC's K8S Helm chart and deleted the CRDs, namespaces, and other related leftovers from the cluster. I also re-created the runner group in the Organization's settings and then installed a clean ARC version 0.8.3. So I would say there was some mixture of troubles resulting in a non-functional deployment.
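In case it helps the next person hitting this, one way to dig out why a container was restarted; the pod name and the `arc-runners` namespace are placeholders:

```bash
# Show the last terminated state of each container in the runner pod.
kubectl describe pod <runner-pod-name> -n arc-runners | grep -A 5 "Last State"
# Or pull the reason and exit code directly:
kubectl get pod <runner-pod-name> -n arc-runners -o \
  jsonpath='{range .status.containerStatuses[*]}{.name}{": "}{.lastState.terminated.reason}{" exitCode="}{.lastState.terminated.exitCode}{"\n"}{end}'
```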
Hey @alecor191, thank you for writing updates! I think this issue is now safe to close. We can always re-open it or create a new one. And I want to thank you all for providing more information!
Checks
Controller Version
0.9.0 and 0.9.1
Deployment Method
Helm
Checks
To Reproduce
Describe the bug
Job is waiting for a runner from XXX to come online
Describe the expected behavior
No scheduled GHA run gets stuck waiting for a runner. All jobs eventually get a runner assigned.
Additional Context
Workaround: Kill the `listener` pod. When K8S restarts the listener, it also spins up new runner pods and everything keeps working for a while (i.e. a couple more CI pipelines can run until eventually we hit the issue again).
We reverted our ARC to version 0.8.3 and have been successfully running it for the past 3 days. Not once did we run into the issue above, whereas we had 100% repro with 0.9.0 and 0.9.1 within a few hours.
The symptoms are very similar to Pipeline gets stuck randomly with "Job is waiting for a runner from XXX to come online" #3420. However, that issue was fixed in 0.9.1. Unfortunately, we also ran into the issue after upgrading to 0.9.1.
It may be worth mentioning that we're quite aggressively scaling the K8S cluster node pool hosting our runners in and out; i.e., quite frequently a new node is required to be able to run a runner pod. We're on AKS, and adding a new node takes about 2-3 minutes before the pod can be scheduled on it.
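A quick way to see whether the stuck jobs line up with runner pods waiting on node provisioning; the `arc-runners` namespace is an assumption, not from the original report:

```bash
# List runner pods still Pending (e.g. waiting for a new AKS node).
kubectl get pods -n arc-runners --field-selector=status.phase=Pending
# Recent events typically show FailedScheduling while the node pool scales up.
kubectl get events -n arc-runners --sort-by=.lastTimestamp | tail -n 20
```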
Controller Logs
Search for `// ISSUE` in the following controller logs to find the time where we observed the issue. Link to Gist
Esp. in file `manager-medium.log`, searching for `// ISSUE` will show that there were no logs for minutes. Only when we killed the listener did new logs start to show up, indicating that more runners were needed.
Runner Pod Logs
N/A