[BUG] [CAPR] etcd restoration fails if using Calico and node being restored to has different hostname but same IP #45443

Oats87 commented May 10, 2024

Rancher Server Setup

  • Rancher version: v2.8.3
  • Installation option (Docker install/Helm Chart): N/A
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): N/A
  • Proxy/Cert Details: N/A

Information about the Cluster

  • Kubernetes version: v1.26.15+rke2r1
  • Cluster Type (Local/Downstream): Downstream v2prov/CAPR

Describe the bug
When performing an etcd restoration onto a new node with v2prov/CAPR and Calico as the CNI for the cluster, the etcd snapshot restoration can fail if the new etcd node has a different hostname but a duplicate (i.e. reused) IP address from a previous node in the cluster.

To Reproduce
You can reproduce this with a custom cluster. There are some manual steps (taking a copy of the etcd snapshot, etc.).

  1. Create an RKE2 cluster with Calico as the CNI
  2. Take an etcd snapshot of the cluster in a steady state
  3. Copy the etcd snapshot from the node to a safe place, e.g. the home directory of your user (cp /var/lib/rancher/rke2/server/db/snapshots/<snapshot> ~)
  4. Delete all of the machines in the cluster from the cluster management page and clean them thoroughly (rke2-uninstall.sh && rancher-system-agent-uninstall.sh)
  5. Reboot all of the machines in the cluster
  6. Change the hostname of your desired restore machine to something else, e.g. hostnamectl hostname new-hostname
  7. Restart the shell session to ensure the new hostname has taken effect, then run the registration command on that node
  8. mkdir -p /var/lib/rancher/rke2/server/db/snapshots and cp ~/on-demand* /var/lib/rancher/rke2/server/db/snapshots
  9. Set the etcdSnapshotRestore.name in the cluster spec to the name of the etcd snapshot file (a condensed sketch of steps 3–9 follows this list)
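
For reference, a condensed sketch of steps 3–9 as run on the restore node and against the cluster object. The fleet-default namespace, the cluster object name, and the etcdSnapshotRestore field layout (generation/name) are assumptions based on a typical v2prov setup and should be verified against your environment:

# step 3: before wiping the node, stash the snapshot somewhere safe
cp /var/lib/rancher/rke2/server/db/snapshots/<snapshot> ~

# step 6: after cleaning and rebooting, give the node a new hostname
hostnamectl hostname new-hostname

# step 8: once the node is re-registered, put the snapshot back where RKE2 expects it
mkdir -p /var/lib/rancher/rke2/server/db/snapshots
cp ~/on-demand* /var/lib/rancher/rke2/server/db/snapshots

# step 9: trigger the restore by setting etcdSnapshotRestore.name on the provisioning cluster
# (field path and namespace assumed; bump generation to re-trigger a restore)
kubectl -n fleet-default edit clusters.provisioning.cattle.io <cluster-name>
#   spec:
#     rkeConfig:
#       etcdSnapshotRestore:
#         generation: 1
#         name: <snapshot-file-name>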

Result
The restore gets stuck on "Waiting for etcd restore probes"
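
The condition carrying that message is visible on the provisioning cluster object; one hedged way to inspect it (object name and fleet-default namespace assumed):

kubectl -n fleet-default get clusters.provisioning.cattle.io <cluster-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'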

Expected Result
Restoration is successful

Screenshots

root@ck-ub2340-a-0a:~# kubectl get nodes -o wide
NAME             STATUS     ROLES                              AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
ck-ub2304-a-0    NotReady   control-plane,etcd,master,worker   17h     v1.26.15+rke2r1   172.19.1.215   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
ck-ub2304-a-1    NotReady   control-plane,etcd,master,worker   17h     v1.26.15+rke2r1   172.19.1.225   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
ck-ub2304-a-2    NotReady   control-plane,etcd,master,worker   17h     v1.26.15+rke2r1   172.19.1.230   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
ck-ub2340-a-0a   Ready      control-plane,etcd,master          8m27s   v1.26.15+rke2r1   172.19.1.215   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
root@ck-ub2340-a-0a:~#
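
Note that ck-ub2304-a-0 (the old, NotReady node) and ck-ub2340-a-0a (the freshly restored node) both report INTERNAL-IP 172.19.1.215. A quick, Rancher-agnostic way to surface such duplicates rather than scanning the table by eye:

# print "InternalIP <tab> node name", sorted so duplicate IPs end up adjacent
kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\t"}{.metadata.name}{"\n"}{end}' | sort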

Additional context
The failure is due to the calico-node pod being in CrashLoopBackOff, citing:

2024-05-10 16:56:20.527 [WARNING][9] startup/winutils.go 144: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-05-10 16:56:20.538 [INFO][9] startup/startup.go 503: Initialize BGP data
2024-05-10 16:56:20.539 [INFO][9] startup/autodetection_methods.go 103: Using autodetected IPv4 address on interface ens192: 172.19.1.215/23
2024-05-10 16:56:20.539 [INFO][9] startup/startup.go 579: Node IPv4 changed, will check for conflicts
2024-05-10 16:56:20.545 [WARNING][9] startup/startup.go 1016: Calico node 'ck-ub2304-a-0' is already using the IPv4 address 172.19.1.215.
2024-05-10 16:56:20.545 [INFO][9] startup/startup.go 409: Clearing out-of-date IPv4 address from this node IP="172.19.1.215/23"
2024-05-10 16:56:20.553 [WARNING][9] startup/utils.go 48: Terminating
Calico node failed to start
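
This can be cross-checked on the stale node object: when Calico runs with the Kubernetes datastore it records the claimed address as a projectcalico.org/IPv4Address annotation on the Node (annotation name assumed from that mode), so the old node should still show it:

# confirm which (stale) node object still claims 172.19.1.215
kubectl get node ck-ub2304-a-0 -o yaml | grep projectcalico.org/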

Workaround
The workaround for this issue is to delete the old node that has the same IP as your new one; in the example above, that would be kubectl delete node ck-ub2304-a-0. If the calico-node pod in the calico-system namespace has gone into CrashLoopBackOff, you can also delete it at this point to speed things up. You may also need to restart the corresponding capi-controller-manager-* pod in the cattle-provisioning-capi-system namespace if the cluster gets stuck waiting for non-ready bootstrap node and join-url to be available.
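
Sketched as commands for the example above (pod names are illustrative placeholders; substitute whatever is actually running in your cluster):

# 1. delete the stale node object that still claims the duplicate IP
kubectl delete node ck-ub2304-a-0

# 2. optionally delete the crash-looping calico-node pod so it retries immediately
kubectl -n calico-system delete pod <crash-looping-calico-node-pod>

# 3. if provisioning stays stuck waiting for a non-ready bootstrap node and join-url,
#    bounce the CAPI controller by deleting its pod
kubectl -n cattle-provisioning-capi-system delete pod <capi-controller-manager-pod>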
