[BUG] [CAPR] etcd restoration fails if using Calico and node being restored to has different hostname but same IP #45443
Labels
area/capr
Provisioning issues that involve cluster-api-provider-rancher
kind/bug
Issues that are defects reported by users or that we know have reached a real release
team/hostbusters
The team that is responsible for provisioning/managing downstream clusters + K8s version support
Rancher Server Setup
v2.8.3
Information about the Cluster
v1.26.15+rke2r1
Describe the bug
When performing an etcd restoration onto a new node with v2prov/CAPR in a cluster that uses Calico as the CNI, the etcd snapshot restoration can fail if the new etcd node has a different hostname but reuses an IP address that already belongs to a node in the cluster.
To Reproduce
You can reproduce this with a custom cluster. There are some manual steps (taking a copy of the etcd snapshot, etc.):
- Copy the etcd snapshot (`cp /var/lib/rancher/rke2/server/db/snapshots/<snapshot> ~`)
- Uninstall RKE2 and the system agent (`rke2-uninstall.sh && rancher-system-agent-uninstall.sh`)
- Change the hostname (`hostnamectl hostname new-hostname`)
- Put the snapshot back in place: `mkdir -p /var/lib/rancher/rke2/server/db/snapshots` and `cp ~/on-demand* /var/lib/rancher/rke2/server/db/snapshots`
- Set `etcdSnapshotRestore.name` in the cluster spec to the name of the etcd snapshot file
Result
The restore gets stuck on `Waiting for etcd restore probes`.
Expected Result
Restoration is successful
Screenshots
Additional context
The failure is due to `calico` being in `CrashLoopBackOff`, citing:
Workaround
The workaround for this issue is to delete the old node whose IP duplicates that of your new node. In the example above, I would run `kubectl delete node ck-ub2304-a-0`. If your `calico-node` pod in the `calico-system` namespace has been in `CrashLoopBackOff`, you can also delete it at this point to make things happen a little quicker. You may also need to restart the corresponding `capi-controller-manager-*` pod in the `cattle-provisioning-capi-system` namespace if you get stuck waiting for non-ready bootstrap node and join-url to be available.
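The workaround can be sketched as a pair of shell functions. This is only an illustration under stated assumptions: the function names are hypothetical, the `k8s-app=calico-node` label and the `capi-controller-manager` deployment name are assumed rather than taken from the report, and the CAPI controller is assumed to live in the Rancher management cluster, so the second function would be run with a kubeconfig pointing there. Setting `KUBECTL=echo` prints the commands instead of running them.

```shell
#!/bin/sh
# KUBECTL can be overridden (e.g. KUBECTL=echo) to preview the commands.
KUBECTL="${KUBECTL:-kubectl}"

# Run against the downstream cluster.
# $1: stale node whose IP is being reused (e.g. ck-ub2304-a-0)
# $2: new node name, used to find the crash-looping calico-node pod
cleanup_duplicate_node() {
  old_node="$1"
  new_node="$2"
  if [ -z "$old_node" ] || [ -z "$new_node" ]; then
    echo "usage: cleanup_duplicate_node <old-node> <new-node>" >&2
    return 1
  fi
  # Remove the stale node object that still holds the reused IP.
  $KUBECTL delete node "$old_node" || return 1
  # Optional: delete the calico-node pod on the new node so it restarts sooner.
  # The k8s-app=calico-node label is an assumption about the Calico install.
  $KUBECTL delete pod --namespace calico-system \
    --selector k8s-app=calico-node \
    --field-selector "spec.nodeName=$new_node"
}

# Run against the Rancher management cluster (assumption), only if the
# restore stays stuck waiting on the bootstrap node and join-url.
restart_capi_controller() {
  # Deployment name is assumed from the capi-controller-manager-* pod name.
  $KUBECTL rollout restart deployment capi-controller-manager \
    --namespace cattle-provisioning-capi-system
}
```

For the example in this issue, that would be `cleanup_duplicate_node ck-ub2304-a-0 new-hostname`, followed by `restart_capi_controller` only if provisioning remains stuck.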