
"Job is waiting for a runner from XXX to come online" with 0.9.0 and 0.9.1 (works with 0.8.3) #3499

Closed
4 tasks done
alecor191 opened this issue May 6, 2024 · 24 comments
Labels
bug (Something isn't working) · gha-runner-scale-set (Related to the gha-runner-scale-set mode) · needs triage (Requires review from the maintainers)

Comments

@alecor191

alecor191 commented May 6, 2024

Checks

Controller Version

0.9.0 and 0.9.1

Deployment Method

Helm

Checks

  • This isn't a question or user support case (For Q&A and community support, go to Discussions).
  • I've read the Changelog before submitting this issue and I'm sure it's not due to any recently-introduced backward-incompatible changes

To Reproduce

1. We've set up an AKS cluster with ARC 0.9.0 and one with ARC 0.9.1.
2. Schedule a couple dozen workflow runs that trigger cluster scale-out (and back in).

Describe the bug

  • After a couple dozen runs, some GHA runs get stuck waiting for a runner (Job is waiting for a runner from XXX to come online).
  • At that time, no runner was running, i.e. the job was waiting, but ARC didn't spin up a runner pod.
  • Reproducible with both 0.9.0 and 0.9.1 (in fact, 0.9.1 was somewhat worse, as we experienced the issue much more quickly than on 0.9.0; but that may be a coincidence).

Describe the expected behavior

No scheduled GHA run gets stuck waiting for a runner. All jobs eventually get a runner assigned.

Additional Context

  • Workaround: Kill the listener pod. When K8S restarts the listener, it also spins up new runner pods and everything keeps working for a while (i.e. a couple more CI pipelines can run until we eventually hit the issue again). A command sketch follows this list.

  • We reverted ARC to version 0.8.3 and have been running it successfully for the past 3 days. We haven't run into the issue above a single time, whereas we had a 100% repro with 0.9.0 and 0.9.1 within a few hours.

  • The symptoms are very similar to Pipeline gets stuck randomly with "Job is waiting for a runner from XXX to come online" #3420. However, that issue is fixed in 0.9.1, and unfortunately we still ran into the issue after upgrading to 0.9.1.

  • It may be worth mentioning that we scale the K8S cluster node pool hosting runners in and out quite aggressively, i.e. quite frequently a new node is required before a runner pod can be scheduled. We're on AKS, and adding a new node takes about 2-3 minutes before the pod can be scheduled on it.
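For reference, a minimal sketch of the listener-restart workaround, assuming the listener pod lives in the arc-systems namespace (the default from the ARC quickstart; adjust names to your installation):

# Find the listener pod for the affected scale set (namespace is an assumption)
kubectl get pods -n arc-systems

# Delete it; Kubernetes recreates the listener, which then spins up the missing runner pods
kubectl delete pod -n arc-systems <listener-pod-name>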

Controller Logs

Search for // ISSUE in the following controller logs to find the time where we observed the issue.

Link to Gist

In particular, searching for // ISSUE in the file manager-medium.log shows that there were no logs for several minutes. Only after we killed the listener did new logs start to show up, indicating that more runners were needed.
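For example, to locate those markers in the downloaded gist (file name taken from the comment above):

# Show each marker with a few lines of surrounding context
grep -n -C 5 '// ISSUE' manager-medium.log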

Runner Pod Logs

N/A

@alecor191 added the bug, gha-runner-scale-set, and needs triage labels on May 6, 2024
@jonwest

jonwest commented May 6, 2024

Same behaviour here as well. Appreciate you posting the workaround(s)! Killing the listener pod got things online, but I will revert to 0.8.3 in order to have some semblance of reliability in the interim. 🙏

@thiago-juro

We're facing the same issue. Reverted to 0.8.3 as well.

@cheskayang

Having the same issue with 0.9.0 and 0.9.1; added some observations in this ticket:
#3501

Thanks for the tip! Will revert to 0.8.3 for now.

@Mahmoudkassry

Same issue, reverting to 0.8.3. Thanks for the tip!

@alecor191
Author

Tagging @nikola-jokic who was active in #3420: Would you be able to help with this issue?

@jonwest

jonwest commented May 9, 2024

For those of you that are experiencing this issue: are you using ArgoCD, by chance? Just trying to see if there are any other correlations between my setup and others that are experiencing issues. I've noticed that the listeners seem to need a restart around the time of a sync with Argo, and I'm curious if that has an effect.

@Tycale

Tycale commented May 9, 2024

I have the same issue. I am using FluxCD. I will try to use the 0.8.2 version to see if it helps.

EDIT: I opted for version 0.8.3, and it's performing beautifully! To downgrade, I removed all the namespaces, CRDs, roles, role bindings, service accounts, deployments, helm charts, etc. Omitting this step appears to cause problems with the listener not starting up.
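A rough sketch of that kind of cleanup and downgrade, assuming the default chart locations and the arc-systems / arc-runners namespaces from the ARC quickstart (release names, namespaces, and values are placeholders; adjust to your setup and double-check before deleting anything):

# Uninstall the scale set release(s) first, then the controller
helm uninstall arc-runner-set -n arc-runners
helm uninstall arc -n arc-systems

# Remove leftover CRDs and namespaces (destructive)
kubectl delete crd autoscalinglisteners.actions.github.com autoscalingrunnersets.actions.github.com ephemeralrunners.actions.github.com ephemeralrunnersets.actions.github.com
kubectl delete namespace arc-runners arc-systems

# Reinstall with the chart version pinned to 0.8.3
helm install arc \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set-controller \
  --version 0.8.3 --namespace arc-systems --create-namespace

helm install arc-runner-set \
  oci://ghcr.io/actions/actions-runner-controller-charts/gha-runner-scale-set \
  --version 0.8.3 --namespace arc-runners --create-namespace \
  --set githubConfigUrl=<your-org-or-repo-url> --set githubConfigSecret=<your-github-config-secret>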

@Mahmoudkassry

Mahmoudkassry commented May 10, 2024

Appears to be happening with 0.8.3 as well now.

@jnogol

jnogol commented May 13, 2024

Happening here as well. When inspecting the ephemeral runner, we're getting: Failed to create the pod: pods "<my-runner-name>-hr7bs-runner-rjlp9" is forbidden: exceeded quota: <myNS>-ghr-resource-quota, requested: limits.memory=8Gi, used: limits.memory=16Gi, limited: limits.memory=16Gi. Interestingly, there are no pods running in that namespace, so I'm not sure what is using up the quota.
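In case it helps with debugging that symptom, a quick way to see what Kubernetes is counting against the quota (namespace and quota name are taken from the error message above; adjust to yours):

# Show current usage vs. limits for the quota from the error message
kubectl describe resourcequota <myNS>-ghr-resource-quota -n <myNS>

# List all pods in the namespace; pods stuck in Terminating still count against the quota
kubectl get pods -n <myNS> -o wide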

@mpjarvis

mpjarvis commented May 13, 2024

I'm able to reproduce the problem pretty consistently in 0.9.0 and 0.9.1 by kicking off around a dozen jobs simultaneously. With this many pods spinning up at the same time, it takes longer for them to all initialize, which seems to trigger the bug.
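For anyone else trying to reproduce it this way, a minimal sketch using the GitHub CLI to kick off a dozen runs at once (the workflow file name and repository are placeholders, and the workflow is assumed to have a workflow_dispatch trigger and to target the scale set's runners):

# Trigger 12 runs of the same workflow back to back; requires an authenticated gh CLI
for i in $(seq 1 12); do
  gh workflow run test.yml --repo <owner>/<repo>
done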

A little less than a minute after EphemeralRunnerSet started scaling up, I see logs like this:

May 13 13:11:50 arc-gha-rs-controller-6bb495b57-l8hhf manager INFO EphemeralRunnerSet	Ephemeral runner counts	{"ephemeralrunnerset": {"name":"my-runner-zdldx","namespace":"my-runners"}, "pending": 5, "running": 7, "finished": 0, "failed": 0, "deleting": 0}
May 13 13:11:50 arc-gha-rs-controller-6bb495b57-l8hhf manager INFO EphemeralRunnerSet	Scaling comparison	{"ephemeralrunnerset": {"name":"my-runner-zdldx","namespace":"my-runners"}, "current": 12, "desired": 0}
May 13 13:11:50 arc-gha-rs-controller-6bb495b57-l8hhf manager INFO EphemeralRunnerSet	Deleting ephemeral runners (scale down)	{"ephemeralrunnerset": {"name":"my-runner-zdldx","namespace":"my-runners"}, "count": 12}

All the runners get killed and never report any status back to GitHub. Note, I'm using containerMode.type=dind in my gha-runner-scale-set, so each runner needs to wait for dind to initialize before jobs will start executing.

Rolling back to 0.8.3 seems to fix the problem for me.

Update... After doing a bit more searching, the problem I'm seeing sounds exactly like #3450.

@krupakar1329

I am facing this issue on version 0.7.0. What's the solution?
INFO EphemeralRunner EphemeralRunner has failed more than 5 times. Marking it as failed {"ephemeralrunner": {"name":"pd-cluster-xxxx-runner-9nk4l","namespace":"github-arc-runner-xxxx"}}

@nikola-jokic
Member

Hey @alecor191,

Could you please let us know if the issue is resolved in version 0.9.2?

@alecor191
Author

alecor191 commented May 20, 2024

Hi @nikola-jokic. Sure, happy to. I just upgraded to 0.9.2 and will share results in ~1 day.

Update 2024-05-21: No issues after running it for 12+ hours. Since then, however, we started observing stuck pipelines again; this could very well be due to an ongoing GHA incident.

Update 2024-05-22: No issues since the GHA incident mentioned above was resolved. In other words, 0.9.2 has been working fine for us for 48 hours (with 0.9.0/0.9.1 we ran into the issue within minutes to a few hours).

@CarlosHenriqueDamasceno

I'm facing the same issue. I was previously using 0.5.1 and updated to 0.8.3 as suggested, but the problem remained, so I updated to 0.9.2 and nothing seems to have changed.

@shanehull

0.9.2 is working for us so far.

@zdenekpizl-foxdeli

0.9.2 suffers from the same issues. Either the job waits for a runner and never starts, even though the pod is up and running and the job is assigned, or it occasionally times out in the middle. Very dissatisfying...

In fact, ARC has apparently not been usable for us since 0.9.0.

@zdenekpizl-foxdeli

To put in my 2 cents:

**** Web UI Output of the job:
Requested runner group: Foxdeli-runners-dind
Job defined at: redacted
Reusable workflow chain: redacted -> redacted, not relevant
Waiting for a runner to pick up this job...
Job is waiting for a runner from 'dind-tools-prod-europe-west3' to come online.
Job is about to start running on the runner: dind-tools-prod-europe-west3-gc62k-runner-6gp7f (organization)


**** and the POD in arc-runners ns reports:
16:26 $ k logs pods/dind-tools-prod-europe-west3-gc62k-runner-6gp7f
Defaulted container "runner" out of: runner, dind, init-dind-externals (init)

√ Connected to GitHub

Current runner version: '2.316.1'
2024-05-21 14:24:20Z: Listening for Jobs
2024-05-21 14:26:10Z: Running job: PR / Run integration tests
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ it "runs" a job but it has never started


**** Listener in arc-systems ns:
Name": "foxdeli", "repoName": "repository redacted", "workflowRef": "redacted", "workflowRunId": 9162243747, "jobDisplayName": "PR / Run integration tests", "requestId": 11376}
2024-05-21T14:26:16Z	INFO	listener-app.worker.kubernetesworker	Updating ephemeral runner with merge patch	{"json": "{\"status\":{\"jobDisplayName\":\"PR / Run integration tests\",\"jobRepositoryName\":\"foxdeli/redacted\",\"jobRequestId\":11376,\"jobWorkflowRef\":\"redacted, not relevant\",\"workflowRunId\":9162243747}}"}
2024-05-21T14:26:16Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner status updated with the merge patch successfully.
2024-05-21T14:26:16Z	INFO	listener-app.worker.kubernetesworker	Created merge patch json for EphemeralRunnerSet update	{"json": "{\"spec\":{\"replicas\":2}}"}
2024-05-21T14:26:16Z	INFO	listener-app.worker.kubernetesworker	Scaling ephemeral runner set	{"assigned job": 1, "decision": 2, "min": 1, "max": 30, "currentRunnerCount": -1}
2024-05-21T14:26:16Z	INFO	listener-app.worker.kubernetesworker	Ephemeral runner set scaled.	{"namespace": "arc-runners", "name": "dind-tools-prod-europe-west3-gc62k", "replicas": 2}

^^^^^^^^^^^^^^^^^^^^^ so it scales up the set, but nothing happens.

I have to say we've been facing these issues for the past few weeks. Any idea what's going on?

@nikola-jokic
Member

Hey @zdenekpizl-foxdeli,

Just to make sure I understand, the job lands on the runner, but it never finishes?

@zdenekpizl-foxdeli

zdenekpizl-foxdeli commented May 21, 2024

Hi, it never even starts. There is just a message that the job is about to start, but that's all.
[screenshot: job]

@zdenekpizl-foxdeli

I am all for debugging this thoroughly, but I have no idea what else to debug on my side. Please propose a root-cause-analysis approach for hunting down this issue...

@zdenekpizl-foxdeli

zdenekpizl-foxdeli commented May 21, 2024

Another observation (I've redeployed the self-hosted runners and the controller once again, just to be sure everything is clean and neatly installed):

19:32 $ k get pods -n arc-runners
NAME                                                READY   STATUS    RESTARTS   AGE
dind-tools-prod-europe-west3-d7bzs-runner-227zh     2/2     Running   0          89m
^^^^^^^^^^^^^ min count of runners is 1, this one is there from the beginning

dind-tools-prod-europe-west3-d7bzs-runner-wsjv9     2/2     Running   0          79s
^^^^^^^^^^^^^ this one has been created just after a workflow started

k8s-tools-prod-europe-west3-dtk8p-runner-8bklr      1/1     Running   0          35m
nodejs-tools-prod-europe-west3-jkg5p-runner-hsc6w   2/2     Running   0          89m
nodejs-tools-prod-europe-west3-jkg5p-runner-w9r8z   2/2     Running   0          3m21s

According to the output within Web UI, the job is waiting for:

Job is waiting for a runner from 'dind-tools-prod-europe-west3' to come online.
Job is about to start running on the runner: dind-tools-prod-europe-west3-d7bzs-runner-227zh (organization)

So a new DinD runner is created, but the job is assigned to the old one. And of course, it does not even start, much less run or finish.

Interestingly, the Kubernetes mode (k8s- prefixed runners) works fine, with no issues when invoking workflows/jobs. Why is there such a dichotomy?

@nikola-jokic
Member

Hey @zdenekpizl-foxdeli,

I'm not sure; is it possible that Docker or something else kills the runner? It must not be an ARC issue, since the pod is up, the set scales, and the runner takes the job. What I'm not sure about is what causes the runner to exit. What is the pod lifecycle after the job is accepted by the runner?
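A few commands that might help capture what happens to the runner pod after the job is accepted (pod name and namespace are taken from the earlier comments; adjust to your setup):

# Watch the pod's status and restart count while the job is supposed to be running
kubectl get pod -n arc-runners dind-tools-prod-europe-west3-gc62k-runner-6gp7f -w

# Look at container state, last termination reason (e.g. OOMKilled), and restart counts
kubectl describe pod -n arc-runners dind-tools-prod-europe-west3-gc62k-runner-6gp7f

# Logs of the previous instance of the runner container, if it was restarted
kubectl logs -n arc-runners dind-tools-prod-europe-west3-gc62k-runner-6gp7f -c runner --previous

# Recent events in the runner namespace (evictions, probe failures, scheduling issues)
kubectl get events -n arc-runners --sort-by=.lastTimestamp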

@zdenekpizl-foxdeli

Hmm, that's a good question and I have no answer for it. I found some traces that a container in the pod had been restarted, but did not find the reason. So maybe resource exhaustion, who knows...

Anyway, my problem has disappeared: I uninstalled ARC's Helm charts and deleted the CRDs, namespaces, and other related leftovers from the cluster. I also re-created the runner group in the organization's settings and then installed a clean ARC 0.8.3.
Now it works like a charm.

So I would say there was some mixture of troubles resulting in a non-functional deployment.

@nikola-jokic
Member

Hey @alecor191,

Thank you for writing updates! I think this issue is now safe to close. We can always re-open it, or create a new one. And I want to thank you all for providing more information!
