Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

flux doesn't uninstall a HelmRelease if the containing namespace is deleted #4793

Open
1 task done
gprossliner opened this issue May 15, 2024 · 5 comments
Open
1 task done

Comments

@gprossliner
Copy link

Describe the bug

A HelmRelease object is created in a namespace. This causes flux to reconcile the object, which will install the configured HelmChart.
If you delete the containing namespace, flux doesn't seem to proper uninstall the HelmRelease. All namespaced objects are gone, because the namespace is gone. But global resources (a ValidatingWebhookConfiguration in my case) are left behind. So I assume the uninstall never happend.

Steps to reproduce

  • Create a namespace
  • Create a HelmRelease object for a chart which will install clusterwide resources, like the nginx-ingress chart
  • See that global resources (e.g. ValidatingWebhookConfigurations) are created
  • Delete the namespace
  • See the global resources still there

If requested I can make a detailed repo.

Expected behavior

The uninstall should be performed, allowing helm to delete all resources and run other uninstallation jobs.

Screenshots and recordings

No response

OS / Distro

23.5.0 Darwin Kernel Version 23.5.0

Flux version

flux: v2.2.3

Flux check

► checking prerequisites
✗ flux 2.2.3 <2.3.0 (new CLI version is available, please upgrade)
✗ Kubernetes version v1.24.17 does not match >=1.26.0-0
► checking version in cluster
✔ distribution: flux-v2.2.2
✔ bootstrapped: false
► checking controllers
✔ helm-controller: deployment ready
► ghcr.io/fluxcd/helm-controller:v0.37.2
✔ kustomize-controller: deployment ready
► ghcr.io/fluxcd/kustomize-controller:v1.2.1
✔ notification-controller: deployment ready
► ghcr.io/fluxcd/notification-controller:v1.2.3
✔ source-controller: deployment ready
► ghcr.io/fluxcd/source-controller:v1.2.3
► checking crds
✔ alerts.notification.toolkit.fluxcd.io/v1beta3
✔ buckets.source.toolkit.fluxcd.io/v1beta2
✔ gitrepositories.source.toolkit.fluxcd.io/v1
✔ helmcharts.source.toolkit.fluxcd.io/v1beta2
✔ helmreleases.helm.toolkit.fluxcd.io/v2beta2
✔ helmrepositories.source.toolkit.fluxcd.io/v1beta2
✔ kustomizations.kustomize.toolkit.fluxcd.io/v1
✔ ocirepositories.source.toolkit.fluxcd.io/v1beta2
✔ providers.notification.toolkit.fluxcd.io/v1beta3
✔ receivers.notification.toolkit.fluxcd.io/v1
✗ check failed

Git provider

No response

Container Registry provider

No response

Additional context

No response

Code of Conduct

  • I agree to follow this project's Code of Conduct
@gprossliner
Copy link
Author

My personal thoughts on this:

  • The controller adds a finalizer to the HelmRelease, so it has a chance to act on deletion
  • If obviously handles the finalizer, because the DELETE namespace operation runs to completion, which would not be the case if any object within the namespace still has finalizer
  • I think at the moment the controller tries to uninstall the release using the helm packages, it needs to read the state of the release. This is done by list / read the `sh.helm.release.v1.XXX.vXX' secrets
  • These secrets will most likely already be gone because there is no finalizer on them
  • So the helm operation (list, uninstall) will fail
  • The controller ignores this error, and removes the finalizer, allowing the namespace to be successfully deleted

@gprossliner
Copy link
Author

A potential PR could be implemented like:

Note: Whenever there is HelmRelease mentioned in this comment, it is not a flux HelmRelease object, but a HelmRelease object from helm.sh/helm/v3/pkg/release. So there will be multiple HelmReleases for a single flux HelmRelease, whenever an Upgrade action is performed.

  • After any action creating a new HelmRelease, like "Install" or "Upgrade", flux retrieves the secrets used for storage of the HelmRelease
  • flux adds a finalizer to the Secret
  • Flux uninstalls the HelmRelease if the corresponding HelmRelease (flux HelmRelease) is deleted. This may be caused by a DELETE operation on the object, or the namespace. Helm will delete the Secret within it's uninstall implementation, but it will be only marked for deletion bc the flux finalizer is still in place
    • How the helm will behave on uninstall if there is a finalizer on the secret. If it waits for the secret to be actually deleted on uninstall it may fail. The DELETE operation will succeed even if a finalizer is present, but helm should not wait for actual deletion. This should be tested in front.
  • After the uninstall flux action, the finalizer on the Secret is removed, allowing kubernetes to delete the Secret
  • Because there may be HelmReleases already present before this feature is implemented / enabled, we need to add the finalizer to existing Secret instances as well.

Questions for the maintainers:

  • Is there a need to get this fixed?
  • What do you think about the overall operation of the proposed PR?
  • Do you want me to contribute a PR for this?

@souleb
Copy link
Member

souleb commented May 16, 2024

If we add a finalizer to the release secret, people won't be able to use the helm cli anymore to do such operation, and we won't do that. Flux should be able to progress from any action done with helm cli and vice versa.

There are also many reasons why one would want to delete a secret without triggering an uninstall action. Which will no longer be possible with your proposal.

The proper way to uninstall a helm release installed by Flux, is through the Flux HelmRelease which has a finalizer, not by manually deleting underlying k8s objects.

Why don't handle the namespace lifecycle with Flux? Within the HelmRelease or the Kustomization that installs it?

@gprossliner
Copy link
Author

gprossliner commented May 16, 2024

Thank you for you response @souleb!

If we add a finalizer to the release secret, people won't be able to use the helm cli anymore to do such operation, and we won't do that. Flux should be able to progress from any action done with helm cli and vice versa.

Understood. What actions are to be expected to be done with the helm cli? Wouldn't flux basically undo those operations on the next reconciliation?

For example if I uninstall a release with the helm cli, flux would find the release not installed and install if afterwards?
If I change values for a release, basically the same should happen because flux has authority over the values used?

One was to mitigate this problem could be to watch for the HelmRelease secrets for deletion, and remove the finalizer if the namespace / HelmRelease is not in finalization. So any manual operation should be unaffected.

There are also many reasons why one would want to delete a secret without triggering an uninstall action. Which will no longer be possible with your proposal.

I didn't meant to say that. Flux should only, ever, uninstall the Release on the finalizer of the flux HelmRelease object. The finalizer on the Secret only ensures that it is still there when the uninstall happended because of that. I personally also needed to delete those Secrets to recover from errors / compatibility problems.

The proper way to uninstall a helm release installed by Flux, is through the Flux HelmRelease which has a finalizer, not by manually deleting underlying k8s objects.

Yes, this would not change.

Why don't handle the namespace lifecycle with Flux? Within the HelmRelease or the Kustomization that installs it?

For sure this would be a way to go. But not all clusters are 100% managed with flux. And there is a real possible problem here. If things are "left behind", like ClusterRoles / Bindings that may only be a problem if a new installation will take place.

But we had a real operational issue, rendering this cluster in a degraded state:

  • The stated nginx-ingress created a ValidationWebhookConfiguration, to validate Ingress objects
  • An used deleted the namespace, assuming a clean Uninstall would be performed
  • Because only namespaced objects are deleted, the webhook stayed registered, but the referenced objects like the Service were gone
  • This caused that modifications of all Ingress objects had failed until we manually resolved the issue by deleting all left over objects
  • This was no problem in our case, but if there are lifecycle hooks registered, those would not be called as well

@souleb
Copy link
Member

souleb commented May 21, 2024

Deleting a namespace without cleaning in it first is generally a bad idea:

  • kubernetes performs a cascading deletion of all objects in the namespace
  • When a CR dependents on a webhook served in the same namespace, the webhook might be deleted before the CR. Since the webhook is unavailable, the namespace deletion is blocked because the API server is unable to reach the webhook to process the CR.
    So cleaning your NS first should be a best practice even without Flux.

Also once a deletion start, there is no going back. So by adding the finalizer, it means that the controller has somehow to perform an action if the secret is deleted. In most cases this will be a no-op because next reconciliation will just upgrade the release and recreate the secret.

This seems complicated to me for the benefits.

A better solution is being discussed in https://github.com/kubernetes/enhancements/pull/2840/files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants