[BUG] [CAPR] etcd restoration fails if using Calico and node being restored to has different hostname but same IP #45443

Oats87 commented May 10, 2024

Rancher Server Setup

  • Rancher version: v2.8.3
  • Installation option (Docker install/Helm Chart): N/A
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): N/A
  • Proxy/Cert Details: N/A

Information about the Cluster

  • Kubernetes version: v1.26.15+rke2r1
  • Cluster Type (Local/Downstream): Downstream v2prov/CAPR

Describe the bug
When performing an etcd restoration onto a new node with v2prov/CAPR and Calico as the CNI for the cluster, the etcd snapshot restoration can fail if the new etcd node has a different hostname but a duplicate (i.e. reused) IP address from a previous node in the cluster.

To Reproduce
You can reproduce this with a custom cluster. There are some manual steps (taking a copy of the etcd snapshot, etc.).

  1. Create an RKE2 cluster with Calico as the CNI
  2. Take an etcd snapshot of the cluster in a steady state
  3. Copy the etcd snapshot from the node to a safe place, e.g. the home directory of your user (cp /var/lib/rancher/rke2/server/db/snapshots/<snapshot> ~)
  4. Delete all of the machines in the cluster from the cluster management page and clean them thoroughly (rke2-uninstall.sh && rancher-system-agent-uninstall.sh)
  5. Reboot all of the machines in the cluster
  6. Change the hostname of your desired restore machine to something else, e.g. hostnamectl hostname new-hostname
  7. Restart the shell session to ensure the new hostname has taken effect, then run the registration command on that node
  8. mkdir -p /var/lib/rancher/rke2/server/db/snapshots and cp ~/on-demand* /var/lib/rancher/rke2/server/db/snapshots
  9. Set the etcdSnapshotRestore.name in the cluster spec to the name of the etcd snapshot file (a condensed sketch of steps 3–9 follows this list)
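
For reference, a condensed sketch of steps 3–9 as run on the restore node and against the cluster object. The fleet-default namespace, the cluster object name, and the etcdSnapshotRestore field layout (generation/name) are assumptions based on a typical v2prov setup and should be verified against your environment:

# step 3: before wiping the node, stash the snapshot somewhere safe
cp /var/lib/rancher/rke2/server/db/snapshots/<snapshot> ~

# step 6: after cleaning and rebooting, give the node a new hostname
hostnamectl hostname new-hostname

# step 8: once the node is re-registered, put the snapshot back where RKE2 expects it
mkdir -p /var/lib/rancher/rke2/server/db/snapshots
cp ~/on-demand* /var/lib/rancher/rke2/server/db/snapshots

# step 9: trigger the restore by setting etcdSnapshotRestore.name on the provisioning cluster
# (field path and namespace assumed; bump generation to re-trigger a restore)
kubectl -n fleet-default edit clusters.provisioning.cattle.io <cluster-name>
#   spec:
#     rkeConfig:
#       etcdSnapshotRestore:
#         generation: 1
#         name: <snapshot-file-name>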

Result
The restore gets stuck on "Waiting for etcd restore probes"
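
The condition carrying that message is visible on the provisioning cluster object; one hedged way to inspect it (object name and fleet-default namespace assumed):

kubectl -n fleet-default get clusters.provisioning.cattle.io <cluster-name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'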

Expected Result
Restoration is successful

Screenshots

root@ck-ub2340-a-0a:~# kubectl get nodes -o wide
NAME             STATUS     ROLES                              AGE     VERSION           INTERNAL-IP    EXTERNAL-IP   OS-IMAGE       KERNEL-VERSION     CONTAINER-RUNTIME
ck-ub2304-a-0    NotReady   control-plane,etcd,master,worker   17h     v1.26.15+rke2r1   172.19.1.215   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
ck-ub2304-a-1    NotReady   control-plane,etcd,master,worker   17h     v1.26.15+rke2r1   172.19.1.225   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
ck-ub2304-a-2    NotReady   control-plane,etcd,master,worker   17h     v1.26.15+rke2r1   172.19.1.230   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
ck-ub2340-a-0a   Ready      control-plane,etcd,master          8m27s   v1.26.15+rke2r1   172.19.1.215   <none>        Ubuntu 23.04   6.2.0-39-generic   containerd://1.7.11-k3s2
root@ck-ub2340-a-0a:~#
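
Note that ck-ub2304-a-0 (the old, NotReady node) and ck-ub2340-a-0a (the freshly restored node) both report INTERNAL-IP 172.19.1.215. A quick, Rancher-agnostic way to surface such duplicates rather than scanning the table by eye:

# print "InternalIP <tab> node name", sorted so duplicate IPs end up adjacent
kubectl get nodes -o jsonpath='{range .items[*]}{.status.addresses[?(@.type=="InternalIP")].address}{"\t"}{.metadata.name}{"\n"}{end}' | sort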

Additional context
The failure is due to the calico-node pod being in CrashLoopBackOff, citing:

2024-05-10 16:56:20.527 [WARNING][9] startup/winutils.go 144: Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
2024-05-10 16:56:20.538 [INFO][9] startup/startup.go 503: Initialize BGP data
2024-05-10 16:56:20.539 [INFO][9] startup/autodetection_methods.go 103: Using autodetected IPv4 address on interface ens192: 172.19.1.215/23
2024-05-10 16:56:20.539 [INFO][9] startup/startup.go 579: Node IPv4 changed, will check for conflicts
2024-05-10 16:56:20.545 [WARNING][9] startup/startup.go 1016: Calico node 'ck-ub2304-a-0' is already using the IPv4 address 172.19.1.215.
2024-05-10 16:56:20.545 [INFO][9] startup/startup.go 409: Clearing out-of-date IPv4 address from this node IP="172.19.1.215/23"
2024-05-10 16:56:20.553 [WARNING][9] startup/utils.go 48: Terminating
Calico node failed to start
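
This can be cross-checked on the stale node object: when Calico runs with the Kubernetes datastore it records the claimed address as a projectcalico.org/IPv4Address annotation on the Node (annotation name assumed from that mode), so the old node should still show it:

# confirm which (stale) node object still claims 172.19.1.215
kubectl get node ck-ub2304-a-0 -o yaml | grep projectcalico.org/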

Workaround
The workaround for this issue is to delete the old node that has the same IP as your new one; in the example above, that would be kubectl delete node ck-ub2304-a-0. If the calico-node pod in the calico-system namespace has gone into CrashLoopBackOff, you can also delete it at this point to speed things up. You may also need to restart the corresponding capi-controller-manager-* pod in the cattle-provisioning-capi-system namespace if the cluster gets stuck waiting for non-ready bootstrap node and join-url to be available.
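
Sketched as commands for the example above (pod names are illustrative placeholders; substitute whatever is actually running in your cluster):

# 1. delete the stale node object that still claims the duplicate IP
kubectl delete node ck-ub2304-a-0

# 2. optionally delete the crash-looping calico-node pod so it retries immediately
kubectl -n calico-system delete pod <crash-looping-calico-node-pod>

# 3. if provisioning stays stuck waiting for a non-ready bootstrap node and join-url,
#    bounce the CAPI controller by deleting its pod
kubectl -n cattle-provisioning-capi-system delete pod <capi-controller-manager-pod>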
