
fix: Deployment generation/observedGeneration bug #4867

Open
wants to merge 1 commit into master from bugfix/deployment-generation

Conversation


@veophi veophi commented Apr 24, 2024

What type of PR is this?

What this PR does / why we need it:

Which issue(s) this PR fixes:
Part of #4870

Special notes for your reviewer:

Does this PR introduce a user-facing change?:


@karmada-bot karmada-bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 24, 2024
@karmada-bot
Collaborator

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign kevin-wangzefeng after the PR has been reviewed.
You can assign the PR to them by writing /assign @kevin-wangzefeng in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@karmada-bot karmada-bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Apr 24, 2024
@codecov-commenter

codecov-commenter commented Apr 24, 2024

Codecov Report

Attention: Patch coverage is 24.00000%, with 19 lines in your changes missing coverage. Please review.

Project coverage is 53.04%. Comparing base (6e5a602) to head (62b976f).
Report is 16 commits behind head on master.

Files Patch % Lines
pkg/controllers/status/work_status_controller.go 0.00% 9 Missing ⚠️
...esourceinterpreter/default/native/reflectstatus.go 0.00% 8 Missing ⚠️
...ourceinterpreter/default/native/aggregatestatus.go 71.42% 1 Missing and 1 partial ⚠️


Additional details and impacted files
@@            Coverage Diff             @@
##           master    #4867      +/-   ##
==========================================
+ Coverage   52.98%   53.04%   +0.05%     
==========================================
  Files         250      251       +1     
  Lines       20421    20411      -10     
==========================================
+ Hits        10820    10826       +6     
+ Misses       8881     8871      -10     
+ Partials      720      714       -6     
Flag Coverage Δ
unittests 53.04% <24.00%> (+0.05%) ⬆️

Flags with carried forward coverage won't be shown.


@XiShanYongYe-Chang
Member

Hi @veophi, can you help fix the lint error?

@XiShanYongYe-Chang
Member

/assign @yike21

@veophi veophi force-pushed the bugfix/deployment-generation branch 2 times, most recently from f380479 to cf79f5b on April 25, 2024 02:32
@karmada-bot karmada-bot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/S Denotes a PR that changes 10-29 lines, ignoring generated files. labels Apr 25, 2024
@veophi
Author

veophi commented Apr 25, 2024

Hi @veophi, can you help fix the lint error?

@XiShanYongYe-Chang fixed.

@XiShanYongYe-Chang
Member

Hi @veophi, is this PR ready? If it is, you can remove the WIP marker from the title; the work-in-progress label will then be removed, indicating that the PR is ready for review.

In addition, you can add the issue number of the current PR after Fixes, like this:

**Which issue(s) this PR fixes:**
Fixes #4866

and GitHub will automatically associate it with the issue. After the PR is merged, the associated issue will be closed.

@veophi veophi changed the title [WIP]: fix Deployment generation/observedGeneration logic [buggix] fix Deployment generation/observedGeneration logic Apr 25, 2024
@karmada-bot karmada-bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Apr 25, 2024
@veophi veophi changed the title [buggix] fix Deployment generation/observedGeneration logic [bugfix] fix Deployment generation/observedGeneration logic Apr 25, 2024
@veophi veophi changed the title [bugfix] fix Deployment generation/observedGeneration logic fix: Deployment generation/observedGeneration bug Apr 25, 2024
@@ -171,6 +173,7 @@ func mergeAnnotations(workload *unstructured.Unstructured, workNamespace string,
annotations := make(map[string]string)
util.MergeAnnotation(workload, workv1alpha2.WorkNameAnnotation, names.GenerateWorkName(workload.GetKind(), workload.GetName(), workload.GetNamespace()))
util.MergeAnnotation(workload, workv1alpha2.WorkNamespaceAnnotation, workNamespace)
util.MergeAnnotation(workload, workv1alpha2.KarmadaWorkloadGenerationAnnotationKey, strconv.Itoa(int(workload.GetGeneration())))
Member

Does int64 to int conversion lose information on 32-bit systems?

Author

Does int64 to int conversion lose information on 32-bit systems?

@XiShanYongYe-Chang fixed.
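The concern above can be shown with a minimal, standalone sketch (the helper name is illustrative, not the PR's code): `int` is 32 bits on some platforms, so `strconv.Itoa(int(gen))` can truncate a large int64 generation, while `strconv.FormatInt` cannot.

```go
package main

import (
	"fmt"
	"strconv"
)

// formatGeneration renders a Kubernetes generation (int64) without narrowing.
// strconv.Itoa(int(gen)) would truncate values above math.MaxInt32 on targets
// where int is 32 bits; strconv.FormatInt is safe on all platforms.
func formatGeneration(gen int64) string {
	return strconv.FormatInt(gen, 10)
}

func main() {
	// 3000000000 does not fit in a 32-bit int.
	fmt.Println(formatGeneration(3000000000))
}
```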

@veophi veophi force-pushed the bugfix/deployment-generation branch from cf79f5b to fa2364f on April 25, 2024 03:28
Member

@yike21 yike21 left a comment

/lgtm
There are now stricter requirements for aligning the status.observedGeneration of a federated resource (Deployment) with its metadata.generation.
Sometimes you may find that the status.observedGeneration of a federated resource (Deployment) is smaller than its metadata.generation, meaning that some resources in member clusters have not yet been updated by the deployment controller, which is normal.
Thanks for your work! @veophi

@karmada-bot karmada-bot added the lgtm Indicates that a PR is ready to be merged. label Apr 25, 2024
@@ -58,6 +58,10 @@ const (
// - Manifest in Work object: describes the name of ClusterResourceBinding which the manifest derived from.
ClusterResourceBindingAnnotationKey = "clusterresourcebinding.karmada.io/name"

// KarmadaWorkloadGenerationAnnotationKey records the generation of karmada workload to member workload.
// This annotation is helpful to generate observedGeneration for karmada workload.
KarmadaWorkloadGenerationAnnotationKey = "workload.karmada.io/generation"
Member

The resource wrapped in work is not necessarily a workload resource. It may be more appropriate to name it resourcetemplate.

Author

The resource wrapped in work is not necessarily a workload resource. It may be more appropriate to name it resourcetemplate.

resourcetemplate.karmada.io/generation?

Member

+1

Author

fixed

Member

I have a question. Does this annotation record the resource template generation of the host cluster or the workload generation of the member cluster?

@@ -80,14 +81,23 @@ func aggregateDeploymentStatus(object *unstructured.Unstructured, aggregatedStat
// which is the generation Karmada 'observed'.
// The 'observedGeneration' is mainly used by GitOps tools(like 'Argo CD') to assess the health status.
// For more details, please refer to https://argo-cd.readthedocs.io/en/stable/operator-manual/health/.
newStatus.ObservedGeneration = deploy.Generation
Member

This comment may no longer be appropriate:

// always set 'observedGeneration' with current generation(.metadata.generation)
// which is the generation Karmada 'observed'.

Author

fixed

if deploymentStatus.ObservedGeneration < object.GetGeneration() {
klog.Errorf("%s(%s/%s) latest generation is not observed by its controller, current status is untrustworthy, ignore reflect status",
object.GetKind(), object.GetNamespace(), object.GetName())
return nil, nil
Member

If nil is directly returned, the upper-layer logic sets the entire status to nil when processing status collection:

So, can we avoid an unnecessary update to the Work by making this determination before starting status collection?

klog.Infof("reflecting %s(%s/%s) status to Work(%s/%s)", observedObj.GetKind(), observedObj.GetNamespace(), observedObj.GetName(), workNamespace, workName)
return c.reflectStatus(workObject, observedObj)

Author

make sense


Author

fixed

@karmada-bot
Collaborator

New changes are detected. LGTM label has been removed.

@XiShanYongYe-Chang
Member

Part of #4870
/kind feature
Hi @veophi, can you help add the release note?

@karmada-bot karmada-bot added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 25, 2024
@veophi veophi force-pushed the bugfix/deployment-generation branch 4 times, most recently from c3eed4d to e4fff06 on April 26, 2024 01:43
@@ -251,10 +256,27 @@ func (c *WorkStatusController) syncWorkStatus(key util.QueueKey) error {
// status changes.
}

if reason, ready := c.isReflectStatusReady(observedObj); !ready {
// Just return nil and no need to reflect the status, waiting for the next correct event.
klog.Infof("Skip %s/%s/%s reflect status, reason: %s", observedObj.GetKind(), observedObj.GetNamespace(), observedObj.GetName(), reason)
Member

just a formatting problem: klog.Infof("Skip %s(%s/%s) reflect status, reason: %s", observedObj.GetKind(), observedObj.GetNamespace(), observedObj.GetName(), reason)

Author

done

func (c *WorkStatusController) isReflectStatusReady(observedObj *unstructured.Unstructured) (string, bool) {
observedGeneration, ok, err := unstructured.NestedInt64(observedObj.Object, "status", "observedGeneration")
if err == nil && ok {
// We compare them iff `status.observedGeneration` exists.
Member

typo: iff -> if

Member

@XiShanYongYe-Chang XiShanYongYe-Chang left a comment

Thanks~

@@ -251,10 +256,27 @@ func (c *WorkStatusController) syncWorkStatus(key util.QueueKey) error {
// status changes.
}

if reason, ready := c.isReflectStatusReady(observedObj); !ready {
// Just return nil and no need to reflect the status, waiting for the next correct event.
Member

How about this comment: When the generation in the resource does not reach a consistent state, skip reflect status and wait for the next update event.

klog.Infof("reflecting %s(%s/%s) status to Work(%s/%s)", observedObj.GetKind(), observedObj.GetNamespace(), observedObj.GetName(), workNamespace, workName)
return c.reflectStatus(workObject, observedObj)
}

func (c *WorkStatusController) isReflectStatusReady(observedObj *unstructured.Unstructured) (string, bool) {
Member

Is it enough for the current scenario to just return a bool value?

Author

@veophi veophi Apr 26, 2024

This is for extensibility considerations.

Member

Currently there is only one cause, and the only handling for it is log output; there is no other processing logic. Once such logic exists, we can design how to add reasons.

@@ -68,12 +69,17 @@ func reflectDeploymentStatus(object *unstructured.Unstructured) (*runtime.RawExt
return nil, fmt.Errorf("failed to convert DeploymentStatus from map[string]interface{}: %v", err)
}

resourceTemplateGenerationInt := int64(0)
resourceTemplateGenerationStr := util.GetAnnotationValue(object.GetAnnotations(), v1alpha2.ResourceTemplateGenerationAnnotationKey)
_ = runtime.Convert_string_To_int64(&resourceTemplateGenerationStr, &resourceTemplateGenerationInt, nil)
Member

If the annotation value is deleted or set to an invalid value due to misoperations, will an error occur?
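One defensive answer to that question, sketched as a hypothetical helper (this is not the PR's code; the annotation key mirrors the one discussed above): parse with explicit error handling and fall back to zero when the annotation is missing or malformed.

```go
package main

import (
	"fmt"
	"strconv"
)

// parseGenerationAnnotation returns the int64 generation stored in an
// annotation map, falling back to 0 when the key is missing (deleted by a
// misoperation) or the value is not a valid integer (edited by hand).
func parseGenerationAnnotation(annotations map[string]string, key string) int64 {
	raw, ok := annotations[key]
	if !ok {
		return 0 // annotation deleted or never set
	}
	gen, err := strconv.ParseInt(raw, 10, 64)
	if err != nil {
		return 0 // malformed value
	}
	return gen
}

func main() {
	anns := map[string]string{"resourcetemplate.karmada.io/generation": "7"}
	fmt.Println(parseGenerationAnnotation(anns, "resourcetemplate.karmada.io/generation")) // 7
	fmt.Println(parseGenerationAnnotation(anns, "missing/key"))                            // 0
}
```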

@veophi veophi force-pushed the bugfix/deployment-generation branch 2 times, most recently from 7230b2d to 87be253 on April 26, 2024 02:09
Signed-off-by: veophi <vec.g.sun@gmail.com>
@veophi veophi force-pushed the bugfix/deployment-generation branch from 87be253 to 62b976f on April 26, 2024 02:41
if observedGeneration < observedObj.GetGeneration() {
return false
}
}
Member

Do you mean that the deployment has reached a consistent state if .status.observedGeneration is absent?
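To make the semantics being questioned concrete, here is a minimal sketch; a plain map stands in for unstructured.Unstructured (the real code uses unstructured.NestedInt64), and the helper names are illustrative. An absent status.observedGeneration passes the check, i.e. the object is treated as consistent.

```go
package main

import "fmt"

// nestedInt64 mimics unstructured.NestedInt64 for a plain map: it returns
// the value and whether the nested field exists as an int64.
func nestedInt64(obj map[string]interface{}, fields ...string) (int64, bool) {
	var cur interface{} = obj
	for _, f := range fields {
		m, ok := cur.(map[string]interface{})
		if !ok {
			return 0, false
		}
		cur, ok = m[f]
		if !ok {
			return 0, false
		}
	}
	v, ok := cur.(int64)
	return v, ok
}

// isConsistentGenerationState reports whether status.observedGeneration has
// caught up with metadata.generation; objects without the field are treated
// as consistent, which is the behavior questioned in this review thread.
func isConsistentGenerationState(obj map[string]interface{}, generation int64) bool {
	observed, ok := nestedInt64(obj, "status", "observedGeneration")
	if !ok {
		return true // no observedGeneration: nothing to compare against
	}
	return observed >= generation
}

func main() {
	obj := map[string]interface{}{
		"status": map[string]interface{}{"observedGeneration": int64(2)},
	}
	fmt.Println(isConsistentGenerationState(obj, 3)) // false: controller lagging
	fmt.Println(isConsistentGenerationState(obj, 2)) // true
}
```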

grabStatus := appsv1.DeploymentStatus{
Replicas: deploymentStatus.Replicas,
UpdatedReplicas: deploymentStatus.UpdatedReplicas,
ReadyReplicas: deploymentStatus.ReadyReplicas,
AvailableReplicas: deploymentStatus.AvailableReplicas,
UnavailableReplicas: deploymentStatus.UnavailableReplicas,
ObservedGeneration: resourceTemplateGenerationInt,
Member

I don't get why we don't use deploymentStatus.ObservedGeneration; this seems confusing. According to the previous logic, when the function reflectDeploymentStatus is executed, status.observedGeneration is already equal to metadata.generation.

Member

Users or controllers in the member cluster may modify resources. If deploymentStatus.ObservedGeneration from the member cluster is used, it cannot be kept consistent with the generation of the resource template on the control plane.

Member

I got you, but let's take a look at a specific example.
(screenshot: deploy-status)

We assume that the reconciliation of member1 will take some time. Before the deployment of member1 reaches a consistent state, is the deployment status of Karmada credible?

Member

Yes, this will happen; I have talked to @veophi about this. Resources in this state will not be considered updated. After the controllers in the member clusters finish processing, the generations will become consistent, which means there may be some delay. But in my opinion, lagging behind is better than getting ahead, and this should be a low-probability event.

I don't know if I described it clearly. Let's see if the author has anything to add. @veophi

Member

I have a question: there doesn't seem to be a way to reflect the actual .status.observedGeneration of a Deployment in member clusters. If status.ObservedGeneration changes in a member cluster, a judgment is made at AggregateStatus time: only status.ObservedGeneration values greater than or equal to the .metadata.generation of the resource template are counted.

Author

@veophi veophi Apr 26, 2024

I got you, but let's take a look at a specific example.

We assume that the reconciliation of member1 will take some time. Before the deployment of member1 reaches a consistent state, is the deployment status of Karmada credible?

@whitewindmills Karmada Deployment will keep the old status instead of aggregating inconsistent status of member1. Why do you think the deployment status of Karmada is credible?

Member

I have a question: there doesn't seem to be a way to reflect the actual .status.observedGeneration of deployment in member clusters.

I think the value of .status.observedGeneration in the member cluster is not usable by the control plane, because changes to resources in the Karmada control plane and in the member cluster may not be consistent.

Will the retention operation correct it? I find that retention just retains replicas for Deployment.

You're right: deployment retention only deals with replicas; the other fields will be overwritten by the control plane. The deployment controller in the member cluster maintains the .status.observedGeneration field, and the generation eventually becomes consistent in the member cluster (unless an error occurs).

Member

Why do you think the deployment status of Karmada is credible?

No, we do not have to guarantee it's trustworthy. What I mean is that I hope karmada can collect the state in real time instead of waiting until it reaches a consistent state.

Member

What I mean is that I hope karmada can collect the state in real time instead of waiting until it reaches a consistent state.

My main objection is that this PR undermines this behavior.

Member

@yike21 yike21 Apr 26, 2024

You're right: deployment retention only deals with replicas; the other fields will be overwritten by the control plane. The deployment controller in the member cluster maintains the .status.observedGeneration field, and the generation eventually becomes consistent in the member cluster (unless an error occurs).

Thanks. According to c.ObjectWatcher.NeedsUpdate, if users change the .spec of a Deployment in some member cluster, the code here will overwrite the resource in the member cluster.

No, we do not have to guarantee it's trustworthy. What I mean is that I hope karmada can collect the state in real time instead of waiting until it reaches a consistent state.

We can see that the overwrite happens before reflectStatus: users change the spec of a Deployment in a member cluster, then the Karmada control plane corrects it, and then we get the status of the Deployment in the member cluster via reflectStatus.

Maybe just reflect .status.observedGeneration, because .metadata.generation will be overwritten eventually.

@@ -251,10 +251,27 @@ func (c *WorkStatusController) syncWorkStatus(key util.QueueKey) error {
// status changes.
}

if !c.isConsistentGenerationState(observedObj) {
Member

Obviously, skipping state collection is not the better approach; even if the collected state is not reliable until a consistent state is reached, it is better than no state at all.
For users, it is clearly friendlier for Karmada to provide continuous state changes before reaching a consistent state.

Previous: 0 -> 1 -> 2 -> 3 -> 4
Now: 0 -> 4

Author

1. Skipping reflect status doesn't mean we do not aggregate statuses.

2. Keeping the old status is better than aggregating untrustworthy statuses.

Member

1. Skipping reflect status doesn't mean we do not aggregate statuses.

If status is not reflected, where does the aggregated data come from?

2. Keeping the old status is better than aggregating untrustworthy statuses.

It's not an untrustworthy status but the real status. And the old status is also untrustworthy.

Author

@whitewindmills Perhaps this check condition can be deleted, but we should keep status.observedGeneration < generation for Karmada deployment if its member status is inconsistent.

Member

What about reflecting .status.observedGeneration to express the real status, and keeping the status.observedGeneration < generation condition in the AggregateStatus operation of Deployment, not in work_status_controller.go?

Member

I agree with @yike21
The work status controller just triggers status collection, but how status is collected is defined by the resource interpreter. In addition, not all resources have .status.observedGeneration, so it's not appropriate to check it for all kinds of resources.

@XiShanYongYe-Chang
Member

I see that there is a lot of discussion about this PR and the issues are quite in-depth, which means this PR's solution is very important to Karmada. Can we organize a meeting or discussion group to discuss this issue?

@veophi @whitewindmills @yike21

@RainbowMango
Member

By the way, I updated the PR description (removed the Fixes indicator), as it tries to fix #4866 but it just focuses on Deployment.

@XiShanYongYe-Chang
Member

/assign

Labels
kind/feature Categorizes issue or PR as related to a new feature. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

8 participants