all cluster-agent pods are scheduled on same node #45422

Open
vatsalparekh opened this issue May 8, 2024 · 1 comment
Assignees
vatsalparekh
Labels
kind/bug (Issues that are defects reported by users or that we know have reached a real release) · priority/2 · team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
Milestone
v2.9-Next1

Comments

@vatsalparekh
Contributor

On a freshly baked downstream cluster with 3 nodes, I see that all cluster-agent pods are scheduled on the same node. I created multiple clusters to verify this behaviour.

➜  aws git:(master) ✗ k get pods -n cattle-system -o wide
NAME                                                              READY   STATUS      RESTARTS   AGE     IP              NODE                                   NOMINATED NODE   READINESS GATES
apply-system-agent-upgrader-on-test-vatsalp-ds-pool1-b9af-9n7xk   0/1     Completed   0          3m3s    172.31.43.103   test-vatsalp-ds-pool1-b9af3f3d-46wft   <none>           <none>
apply-system-agent-upgrader-on-test-vatsalp-ds-pool1-b9af-x6pgw   0/1     Completed   0          3m3s    172.31.46.135   test-vatsalp-ds-pool1-b9af3f3d-z6r87   <none>           <none>
apply-system-agent-upgrader-on-test-vatsalp-ds-pool1-b9af-xq52t   0/1     Completed   0          3m3s    172.31.38.67    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
cattle-cluster-agent-66c9bd8544-dmnpn                             1/1     Running     0          3m37s   10.42.74.215    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
cattle-cluster-agent-66c9bd8544-q8td2                             1/1     Running     0          3m39s   10.42.74.214    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
helm-operation-sjlvb                                              0/2     Completed   0          4m4s    10.42.74.211    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
rancher-webhook-85b57d6bf8-bflvz                                  1/1     Running     0          3m49s   10.42.74.212    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
system-upgrade-controller-68d57657cb-mjnqn                        1/1     Running     0          4m29s   10.42.74.210    test-vatsalp-ds-pool1-b9af3f3d-tv8cq   <none>           <none>
➜  aws git:(master) ✗ k get node                         
NAME                                   STATUS   ROLES                              AGE     VERSION
test-vatsalp-ds-pool1-b9af3f3d-46wft   Ready    control-plane,etcd,master,worker   4m17s   v1.26.15+rke2r1
test-vatsalp-ds-pool1-b9af3f3d-tv8cq   Ready    control-plane,etcd,master,worker   8m23s   v1.26.15+rke2r1
test-vatsalp-ds-pool1-b9af3f3d-z6r87   Ready    control-plane,etcd,master,worker   3m41s   v1.26.15+rke2r1

The issue seems to be with the following line, which only expresses a preference that the pods not share a node. The scheduler likely places all replicas as soon as the first node is up; this can be confirmed (repeatedly) by checking that the pods all land on the node that came up first. Changing this to required should fix it:

"preferredDuringSchedulingIgnoredDuringExecution": [

@vatsalparekh
Contributor Author

Also, this is likely the cause of the related problem where, with all agent pods on the same node, draining that node for maintenance causes a brief loss of connectivity between Rancher and the downstream cluster.

@vatsalparekh self-assigned this May 8, 2024
vatsalparekh added a commit to vatsalparekh/rancher that referenced this issue May 8, 2024
As described in issue rancher#45422, preferredDuringSchedulingIgnoredDuringExecution lets the scheduler place all the pods on the first available node; changing it to requiredDuringSchedulingIgnoredDuringExecution makes the extra replicas wait in Pending until the 2nd node comes up.

Signed-off-by: Vatsal Parekh <vatsalparekh@outlook.com>
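For reference, the required form proposed by the commit would look roughly like this (again a sketch with an assumed label and topology key, not the exact change in the commit); note that entries under requiredDuringSchedulingIgnoredDuringExecution are plain pod affinity terms with no weight:

{
  "podAntiAffinity": {
    "requiredDuringSchedulingIgnoredDuringExecution": [
      {
        "labelSelector": {
          "matchExpressions": [
            { "key": "app", "operator": "In", "values": ["cattle-cluster-agent"] }
          ]
        },
        "topologyKey": "kubernetes.io/hostname"
      }
    ]
  }
}

With this hard constraint, a second replica stays Pending until another node registers, which matches the behaviour described in the commit message.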
@Sahota1225 added the kind/bug and team/hostbusters labels May 9, 2024
@Sahota1225 added this to the v2.9-Next1 milestone May 13, 2024