Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a None value to ClusterSyncMode #798

Closed
chxcode opened this issue Apr 11, 2024 · 10 comments
Closed

Add a None value to ClusterSyncMode #798

chxcode opened this issue Apr 11, 2024 · 10 comments
Labels
kind/feature New feature or request

Comments

@chxcode
Copy link
Contributor

chxcode commented Apr 11, 2024

What would you like to be added:

Add a None value to ClusterSyncMode to prevent or pause updates and distribution of applications on a certain mcls, while preserving the existing applications on the mcls.

Why is this needed:

The 'None' value indicates that the synchronization mode is unset, which means applications will not be synchronized. Introducing 'None' as a value allows ClusterSyncMode to accommodate a wider range of scenarios.

We manage hundreds of clusters distributed across the country, and there is an occasional need to halt the synchronization of applications in certain clusters while retaining the original applications on the cluster. However, after halting synchronization, we must still maintain connectivity to the subclusters, thus the connection must remain active.

If this requirement is feasible, I think I can try submitting a PR.

@chxcode chxcode added the kind/feature New feature or request label Apr 11, 2024
@dixudx
Copy link
Member

dixudx commented Apr 11, 2024

Add a None value to ClusterSyncMode to prevent or pause updates and distribution of applications on a certain mcls, while preserving the existing applications on the mcls.

The 'None' value indicates that the synchronization mode is unset, which means applications will not be synchronized. Introducing 'None' as a value allows ClusterSyncMode to accommodate a wider range of scenarios.

This looks interesting and reasonable. I don't know whether adding a taint for the cluster will do some help in this scenario. Maybe it can only cover a small part.

If this requirement is feasible, I think I can try submitting a PR.

I am okay for this feature. But before we move on, we need to make some clarifications.

  1. Will we change the ClusterSyncMode from None to Push, Push or Dual? And vice versa.

    If we add such a None sync mode, this may bring some side effects, such as the inconsistency of clusternent-agent flags with those settings in MangedCluster. This may lead to chaos when restarting.

  2. Maybe we should introduce a new boolean field, such as syncPaused, to ManagedCluster objects. This will help get rid of above problem. I'd prefer this option.

  3. Shall we opt-out such "paused" clusters for scheduling? Since they are not allowed to sync and run with newer configurations. It is better not to schedule new workloads to such clusters. This should have the same effect with cluster taints.

@chxcode
Copy link
Contributor Author

chxcode commented Apr 11, 2024

Add a None value to ClusterSyncMode to prevent or pause updates and distribution of applications on a certain mcls, while preserving the existing applications on the mcls.

The 'None' value indicates that the synchronization mode is unset, which means applications will not be synchronized. Introducing 'None' as a value allows ClusterSyncMode to accommodate a wider range of scenarios.

This looks interesting and reasonable. I don't know whether adding a taint for the cluster will do some help in this scenario. Maybe it can only cover a small part.

I have tried adding taints to the cluster, but it didn't work as well as I had hoped (I was expecting the apps on the tainted cluster to continue running normally, not to be evicted). Below are the steps and details of what I did at the time. Could you tell me what went wrong?

edit-mcls pod-status

If this requirement is feasible, I think I can try submitting a PR.

I am okay for this feature. But before we move on, we need to make some clarifications.

  1. Will we change the ClusterSyncMode from None to Push, Push or Dual? And vice versa.

Yes.

If we add such a None sync mode, this may bring some side effects, such as the inconsistency of clusternent-agent flags with those settings in MangedCluster. This may lead to chaos when restarting.

You are right.

  1. Maybe we should introduce a new boolean field, such as syncPaused, to ManagedCluster objects. This will help get rid of above problem. I'd prefer this option.

The introduction of a new boolean field, syncPaused, as you mentioned, I think would be best. The reason I mentioned adding the None field was because I thought it should be relatively easy to implement at the time.

  1. Shall we opt-out such "paused" clusters for scheduling? Since they are not allowed to sync and run with newer configurations. It is better not to schedule new workloads to such clusters. This should have the same effect with cluster taints.

Eventually, if adding taints could work as expected, then there wouldn't be a need to introduce syncPaused.

@chxcode
Copy link
Contributor Author

chxcode commented Apr 11, 2024

In our environment, we use the "Deploying Applications to Multiple Clusters with Replication Scheduling" method for distributing applications, so we would prefer that setting taints or introducing syncPaused leads to a sort of "lockdown" state.

@dixudx
Copy link
Member

dixudx commented Apr 12, 2024

I have tried adding taints to the cluster, but it didn't work as well as I had hoped (I was expecting the apps on the tainted cluster to continue running normally, not to be evicted)

When adding a taint, it will trigger a rescheduling.

The reason I mentioned adding the None field was because I thought it should be relatively easy to implement at the time.

If so, we need to firstly make sure there won't be any inconsistencies between clusternet-agent flags and ManagedCluster object. For example, we may need to introduce a check on clusternet-agent side when starting.

@chxcode
Copy link
Contributor Author

chxcode commented Apr 12, 2024

If so, we need to firstly make sure there won't be any inconsistencies between clusternet-agent flags and ManagedCluster object. For example, we may need to introduce a check on clusternet-agent side when starting.

Regarding the inconsistency between clusternet-agent flags and the ManagedCluster object, I understand that it already exists. I also have a question: from an overall design perspective, is it recommended to modify the values of the spec-related fields of a ManagedCluster after it has been created in the parent cluster?

However, I now have a rather interesting 🌟 solution💡! I think I can set the syncMode of the selected child cluster mcls to Pull on the parent cluster, and set the startup parameter syncMode of the clusternet-agent on the child cluster to Push. In this way, the child cluster will not actively Pull, and the parent cluster will not actively Push, which should also meet my needs.

Or rather, the inconsistency of syncMode is not an issue because it seems that once it's created in mcls, the child cluster won't update it anymore. In mcls, it's for the parent cluster to use, and on the clusternet-agent flags, it's for the clusternet-agent to use. Let me brainstorm here! What do you think if we add a method like updateClusterStatus that periodically takes the syncMode from the clusternet-agent and updates the syncMode in mcls? Would that be feasible?

@dixudx
Copy link
Member

dixudx commented Apr 13, 2024

I think I can set the syncMode of the selected child cluster mcls to Pull on the parent cluster, and set the startup parameter syncMode of the clusternet-agent on the child cluster to Push. In this way, the child cluster will not actively Pull, and the parent cluster will not actively Push, which should also meet my needs.

Better not. And we won't want to change these settings/flags back and forth, since we may want to enable the synchronization.

it seems that once it's created in mcls, the child cluster won't update it anymore

What do you think if we add a method like updateClusterStatus that periodically takes the syncMode from the clusternet-agent and updates the syncMode in mcls? Would that be feasible?

Currently clusternet-agent will periodically report heartbeats with updateClusterStatus. But we did not modify syncMode.

Now the question comes to whether we should allow modifying syncMode and how could we modify it. For example, directly editing ManagedCluster object, or changing flags of clusternet-agent, or other annotations. And moreover, should we need to restart clusternet-agent on such changes. This is part of the reasons why I introduce Dual mode to help mitigate this problem.

If we only want to pause the synchronization, I'd prefer to add a new boolean field syncPaused.

@chxcode
Copy link
Contributor Author

chxcode commented Apr 14, 2024

Better not. And we won't want to change these settings/flags back and forth, since we may want to enable the synchronization.
If we only want to pause the synchronization, I'd prefer to add a new boolean field syncPaused.

Alright, let's discuss the details of how to add the syncPaused field. Once syncPaused is added, it should be relatively convenient to handle in the parent cluster, as we can easily retrieve the syncPaused value from a certain mcls object to decide whether synchronization is needed. However, in the child cluster, when triggering the synchronization of applications, should we also query the syncPaused value in real-time on the mcls object to determine whether to synchronize the application? What are your thoughts on this?

@dixudx
Copy link
Member

dixudx commented Apr 15, 2024

Once syncPaused is added, it should be relatively convenient to handle in the parent cluster, as we can easily retrieve the syncPaused value from a certain mcls object to decide whether synchronization is needed.

Right. We only need to add a check for this on clusternet-controller-manager side.

However, in the child cluster, when triggering the synchronization of applications, should we also query the syncPaused value in real-time on the mcls object to determine whether to synchronize the application? What are your thoughts on this?

We can add a mcls informer only watching itself by cluster id. Once the field syncPaused is changed, we can receive the events to determine whether to synchronize the applications.

@chxcode
Copy link
Contributor Author

chxcode commented Apr 16, 2024

We can add a mcls informer only watching itself by cluster id. Once the field syncPaused is changed, we can receive the events to determine whether to synchronize the applications.

I don't have extensive knowledge about k8s, and I want to confirm if my understanding is correct: On the clusternet-agent side, an informer is used to monitor changes in the mcls of the corresponding sub-cluster. If syncPaused in the event is false, then application synchronization is triggered.

Additionally, is it necessary to fetch the cached syncPaused value in the current clusternet-agent synchronization logic to determine whether to sync?

@dixudx
Copy link
Member

dixudx commented Apr 17, 2024

On the clusternet-agent side, an informer is used to monitor changes in the mcls of the corresponding sub-cluster. If syncPaused in the event is false, then application synchronization is triggered.

Yes. And clusternet-agent only synchronize applications when the syncMode is Pull mode or Dual mode with FeatureGate AppPusher=False.

is it necessary to fetch the cached syncPaused value in the current clusternet-agent synchronization logic to determine whether to sync?

We've got informers. In the informers, these objects have already been cached. We just need to retrieve the mcls object from the listers cache.

@chxcode chxcode closed this as completed May 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/feature New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants