all cluster-agent pods are scheduled on same node #45422

Open
vatsalparekh opened this issue May 8, 2024 · 1 comment
Assignees
vatsalparekh
Labels
kind/bug (Issues that are defects reported by users or that we know have reached a real release) · priority/2 · team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
Milestone
v2.9-Next1

Comments

@vatsalparekh
Contributor

On a freshly baked downstream cluster with 3 nodes, I see that all cluster-agent pods are scheduled on the same node. I created multiple clusters to verify this behaviour.

➜  aws git:(master) ✗ k get pods -n cattle-system -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE     IP              NODE                                   NOMINATED NODE   READINESS GATES
apply-system-agent-upgrader-on-test-vatsalp-ds-pool1-b9af-9n7xk   0/1     Completed   0          3m3s    172.31.43.103   test-vatsalp-ds-pool1-b9af3f3d-46wft   <none>           <none>
apply-system-agent-upgrader-on-test-vatsalp-ds-pool1-b9af-x6pgw   0/1     Completed   0          3m3s    172.31.46.135   test-vatsalp-ds-pool1-b9af3f3d-z6r87   <none>           <none>
apply-system-agent-upgrader-on-test-vatsalp-ds-pool1-b9af-xq52t   0/1     Completed   0          3m3s    172.31.38.67    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
cattle-cluster-agent-66c9bd8544-dmnpn                             1/1     Running     0          3m37s   10.42.74.215    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
cattle-cluster-agent-66c9bd8544-q8td2                             1/1     Running     0          3m39s   10.42.74.214    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
helm-operation-sjlvb                                              0/2     Completed   0          4m4s    10.42.74.211    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
rancher-webhook-85b57d6bf8-bflvz                                  1/1     Running     0          3m49s   10.42.74.212    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
system-upgrade-controller-68d57657cb-mjnqn                        1/1     Running     0          4m29s   10.42.74.210    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
➜  aws git:(master) ✗ k get node                         
NAME                                   STATUS   ROLES                              AGE     VERSION
test-vatsalp-ds-pool1-b9af3f3d-46wft   Ready    control-plane,etcd,master,worker   4m17s   v1.26.15+rke2r1
test-vatsalp-ds-pool1-b9af3f3d-tv8cq   Ready    control-plane,etcd,master,worker   8m23s   v1.26.15+rke2r1
test-vatsalp-ds-pool1-b9af3f3d-z6r87   Ready    control-plane,etcd,master,worker   3m41s   v1.26.15+rke2r1

The issue seems to be with the following line, which only expresses a preference that the pods not share a node. The scheduler likely places all replicas as soon as the first node is up; this can be confirmed (repeatedly) by checking that the pods all land on the node that came up first. Changing this to required should fix it:

"preferredDuringSchedulingIgnoredDuringExecution": [

@vatsalparekh
Contributor Author

Also, this is likely the cause of the related problem where, with all agent pods on the same node, draining that node for maintenance causes a brief loss of connectivity between Rancher and the downstream cluster.

@vatsalparekh self-assigned this May 8, 2024
vatsalparekh added a commit to vatsalparekh/rancher that referenced this issue May 8, 2024
As described in issue rancher#45422, preferredDuringSchedulingIgnoredDuringExecution lets the scheduler place all the pods on the first available node; changing it to requiredDuringSchedulingIgnoredDuringExecution makes the extra replicas wait in Pending until the 2nd node comes up.

Signed-off-by: Vatsal Parekh <vatsalparekh@outlook.com>
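For reference, the required form proposed by the commit would look roughly like this (again a sketch with an assumed label and topology key, not the exact change in the commit); note that entries under requiredDuringSchedulingIgnoredDuringExecution are plain pod affinity terms with no weight:

{
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            { "key": "app", "operator": "In", "values": ["cattle-cluster-agent"] }
          ]
        },
        "topologyKey": "kubernetes.io/hostname"
      }
    ]
  }
}

With this hard constraint, a second replica stays Pending until another node registers, which matches the behaviour described in the commit message.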
@Sahota1225 added the kind/bug and team/hostbusters labels May 9, 2024
@Sahota1225 added this to the v2.9-Next1 milestone May 13, 2024