Question: Migrate init-node from CustomMachine to Amazonec2Machine #45426

Open
kgtw opened this issue May 9, 2024 · 1 comment

kgtw commented May 9, 2024

Environmental Info:
RKE2 Version: v1.27.6+rke2r1
Rancher Version: v2.8.3

Cluster Configuration:

  • self-registering CustomMachine worker nodes, backed by a per-az ASG deployment in AWS.
  • 3x control-plane + etcd CustomMachine nodes, backed by a per-az ASG deployment in AWS (these are what we want to remove).
  • 3x control-plane + etcd Amazonec2Machine nodes for a per-az deployment, managed by Rancher.

Context:
We currently have a custom setup of both control-plane/etcd nodes and worker nodes backed by AWS ASGs. As part of our company's policy for security and patching upgrades, we need to frequently roll out new AMIs. This approach works extremely well for our "worker" nodes, where we have configured the AWS ASG with an instance TTL of 3 days (see the example command below).
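For illustration, this is roughly how such a TTL can be applied to a worker ASG (the ASG name is a placeholder; 259200 seconds = 3 days):

$ aws autoscaling update-auto-scaling-group --auto-scaling-group-name <worker-asg-name> --max-instance-lifetime 259200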

When it comes to the control-plane/etcd nodes it's slightly more problematic, because of the destruction of the "init-node" and Rancher's inability to re-designate a new init-node when the previous one has been deleted.

To mitigate this we have moved to using Amazonec2Machine managed node pools for our control-plane/etcd nodes, where Rancher maintains the lifecycle of those nodes and can gracefully re-assign an existing control-plane/etcd node to be the new init-node.

How to migrate?

For a large portion of our clusters the init-node is currently assigned to a CustomMachine control-plane/etcd node managed by an AWS ASG, and we want to move it to an Amazonec2Machine instance managed by Rancher.

This is the approach we have validated and are hoping to perform for all clusters:

  1. Retrieve the cattle-id from the instance:

$ cat /etc/rancher/agent/cattle-id
4a8d613dcd212daa87ef31f8964870e9fc10e94b8a506d439c1b8f9c57d6507

  2. Identify the "machine plan" secret resource name for the Rancher-managed control-plane/etcd instance that we want to become the new init-node: it is the value of .spec.bootstrap.configRef.name on the corresponding machine.cluster.x-k8s.io resource (see the example command below).
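For example, something like the following should print the plan secret name for the target machine (the machine name here is a placeholder):

$ kubectl get machines.cluster.x-k8s.io -n fleet-default <machine-name> -o jsonpath='{.spec.bootstrap.configRef.name}'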

  3. Add the rke.cattle.io/machine-id label to the machine plan secret resource from step 2:

$ kubectl label secret -n fleet-default <resource-name> rke.cattle.io/machine-id=4a8d613dcd212daa87ef31f8964870e9fc10e94b8a506d439c1b8f9c57d6507

  4. Update the clusters.provisioning.cattle.io resource with the rke.cattle.io/init-node-machine-id label:

$ kubectl label clusters.provisioning.cattle.io -n fleet-default <cluster> rke.cattle.io/init-node-machine-id=4a8d613dcd212daa87ef31f8964870e9fc10e94b8a506d439c1b8f9c57d6507

At this point, Rancher automatically starts updating all nodes and reconfiguring them to connect to the init-node that we have defined.
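Before relying on that rollout, we sanity-check that the labels landed where expected (resource names are placeholders for our environment):

$ kubectl get secret -n fleet-default <resource-name> --show-labels
$ kubectl get clusters.provisioning.cattle.io -n fleet-default <cluster> --show-labels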

We are setting the rke.cattle.io/machine-id label on the machine plan secret because it is used within the following function to select/filter the nodes that are eligible to become the init-node.

https://github.com/rancher/rancher/blob/release/v2.8/pkg/capr/planner/initnode.go#L48-L54
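For reference, the plan secrets that carry this label (and are therefore candidates in that selection) can be listed with a label-existence selector:

$ kubectl get secrets -n fleet-default -l rke.cattle.io/machine-id -o name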

Questions

  • We noticed that for CustomMachine nodes the label rke.cattle.io/machine-id is set, whereas for Amazonec2Machine nodes the label is absent. Is this expected, or a bug?
  • As a follow-up: by setting the rke.cattle.io/machine-id label on the Amazonec2Machine nodes, are we potentially breaking some other functionality?
  • Is the process I've outlined above suitable for forcing a new node to be the "init-node"? There seems to be a lack of operational tooling to handle such a use case.
brandond (Contributor) commented May 9, 2024

I am moving this to rancher/rancher, as cluster provisioning is not part of RKE2.

brandond transferred this issue from rancher/rke2 on May 9, 2024