[BUG] RKE2 control plane stuck in reconciliation and deleting when a control plane node is being deleted #45449

Open
sarahhenkens opened this issue May 12, 2024 · 0 comments
Labels
kind/bug Issues that are defects reported by users or that we know have reached a real release

Comments

@sarahhenkens

Rancher Server Setup

  • Rancher version: v2.8.2
  • Installation option (Docker install/Helm Chart): Helm chart (local cluster is running k3s)

Information about the Cluster

  • Kubernetes version: v1.27.8+rke2r1
  • Cluster Type (Local/Downstream): Downstream Harvester Cluster

User Information

  • What is the role of the user logged in? Admin

Describe the bug

I wanted to rotate all of the control plane nodes of a downstream Harvester RKE2 cluster.

I have 3 control plane nodes running (all more than 100 days old). When I deleted one of the 3 nodes, the cluster got stuck: the node being deleted stays in "Deleting" and one other node stays in "Reconciling".

Error logs from the rancher pod:

2024/05/12 18:29:38 [ERROR] [rkebootstrap] fleet-default/home-cluster-bootstrap-template-kjc45: cluster fleet-default/home-cluster machine fleet-default/home-cluster-worker-7c7c446fd4xnkxx6-rts2f was still joined to deleting etcd machine fleet-default/home-cluster-control-plane-5845cc685dxck4bd-p4rjj
2024/05/12 18:29:43 [ERROR] [rkebootstrap] fleet-default/home-cluster-bootstrap-template-kjc45: cluster fleet-default/home-cluster machine fleet-default/home-cluster-worker-7c7c446fd4xnkxx6-rts2f was still joined to deleting etcd machine fleet-default/home-cluster-control-plane-5845cc685dxck4bd-p4rjj
2024/05/12 18:29:48 [ERROR] [rkebootstrap] fleet-default/home-cluster-bootstrap-template-kjc45: cluster fleet-default/home-cluster machine fleet-default/home-cluster-worker-7c7c446fd4xnkxx6-rts2f was still joined to deleting etcd machine fleet-default/home-cluster-control-plane-5845cc685dxck4bd-p4rjj
2024/05/12 18:29:48 [INFO] [planner] rkecluster fleet-default/home-cluster: configuring bootstrap node(s) home-cluster-control-plane-5845cc685dxck4bd-5tlr6: Node condition MemoryPressure is Unknown. Node condition DiskPressure is Unknown. Node condition PIDPressure is Unknown. Node condition Ready is Unkno
2024/05/12 18:29:53 [ERROR] [rkebootstrap] fleet-default/home-cluster-bootstrap-template-kjc45: cluster fleet-default/home-cluster machine fleet-default/home-cluster-control-plane-5845cc685dxck4bd-7b6f4 was still joined to deleting etcd machine fleet-default/home-cluster-control-plane-5845cc685dxck4bd-p4rj
2024/05/12 18:29:58 [ERROR] [rkebootstrap] fleet-default/home-cluster-bootstrap-template-kjc45: cluster fleet-default/home-cluster machine fleet-default/home-cluster-worker-7c7c446fd4xnkxx6-s98hx was still joined to deleting etcd machine fleet-default/home-cluster-control-plane-5845cc685dxck4bd-p4rjj
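
The rkebootstrap errors above repeat every few seconds, so it can help to collapse them into a list of which machines are reported as still joined to the deleting etcd machine. The following is a minimal sketch for doing that from a saved copy of the rancher pod log. The function name `group_joined_machines` and the regular expression are my own assumptions based only on the message format shown above; they are not part of Rancher.

```python
import re
from collections import defaultdict

# Matches the "was still joined to deleting etcd machine" message shown above.
# Machine names look like namespace/name; \S+ also accepts names that were
# cut off when the log was pasted.
PATTERN = re.compile(
    r"machine (?P<joined>\S+) was still joined to deleting etcd machine (?P<deleting>\S+)"
)


def group_joined_machines(log_text):
    """Map each deleting etcd machine to the set of machines still joined to it."""
    grouped = defaultdict(set)
    for line in log_text.splitlines():
        match = PATTERN.search(line)
        if match:
            grouped[match.group("deleting")].add(match.group("joined"))
    return dict(grouped)
```

Calling `group_joined_machines(open("rancher.log").read())` (the file name is hypothetical) returns a dictionary keyed by the deleting machine. Names that were truncated in the log show up as separate keys, since the sketch does not try to guess the missing characters.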

journalctl -u rke2-server logs

home-cluster-control-plane-5845cc685dxck4bd-5tlr6 (the one stuck in reconciling)

May 12 18:34:15 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:15Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
May 12 18:34:17 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:17Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
May 12 18:34:22 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:22Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
May 12 18:34:23 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:23Z" level=info msg="Waiting for etcd server to become available"
May 12 18:34:26 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: {"level":"warn","ts":"2024-05-12T18:34:26.336362Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007faa80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: i/o timeout\""}
May 12 18:34:26 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: {"level":"info","ts":"2024-05-12T18:34:26.337205Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/client.go:210","msg":"Auto sync endpoints failed.","error":"context deadline exceeded"}
May 12 18:34:27 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:27Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
May 12 18:34:30 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: {"level":"warn","ts":"2024-05-12T18:34:30.210507Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007faa80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = context deadline exceeded"}
May 12 18:34:30 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:30Z" level=error msg="Failed to check local etcd status for learner management: context deadline exceeded"
May 12 18:34:32 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:32Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
May 12 18:34:36 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: {"level":"warn","ts":"2024-05-12T18:34:36.654792Z","logger":"etcd-client","caller":"v3@v3.5.9-k3s1/retry_interceptor.go:62","msg":"retrying of unary invoker failed","target":"etcd-endpoints://0xc0007faa80/127.0.0.1:2379","attempt":0,"error":"rpc error: code = DeadlineExceeded desc = latest balancer error: last connection error: connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:2379: i/o timeout\""}
May 12 18:34:36 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:36Z" level=info msg="Failed to test data store connection: context deadline exceeded"
May 12 18:34:37 home-cluster-control-plane-0524511f-v98l9 rke2[3391846]: time="2024-05-12T18:34:37Z" level=info msg="Waiting to retrieve kube-proxy configuration; server is not ready: https://127.0.0.1:9345/v1-rke2/readyz: 500 Internal Server Error"
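
The etcd-client warnings above show the local dial to 127.0.0.1:2379 timing out. As a quick way to confirm from the node whether anything accepts TCP connections on that port, here is a minimal sketch. The helper name `can_connect` and the 3 second timeout are assumptions of mine; the check only tests TCP reachability and says nothing about etcd health or TLS.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection raises OSError (including timeouts) on failure.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running `can_connect("127.0.0.1", 2379)` on the node stuck in reconciling would show whether the etcd client port is reachable at all.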

To Reproduce

See "Describe the bug" above: delete one of the 3 control plane nodes of a downstream Harvester RKE2 cluster.

Result

Rancher fails to move the control plane from 3 nodes to 2 when a node is being deleted.

Expected Result

Rancher should be able to delete a control plane node and recreate it, resulting in a 3-node control plane.

sarahhenkens added the kind/bug label May 12, 2024