Webhook certs generation failing due to readonly volumeMount #2347

kreeuwijk opened this issue Mar 29, 2024 · 5 comments
kreeuwijk commented Mar 29, 2024

MetalLB Version

0.13.11

Deployment method

Charts

Main CNI

Calico

Kubernetes Version

1.27.1

Cluster Distribution

kubeadm

Describe the bug

When deploying MetalLB on a fresh cluster, the controller pod is sometimes unable to generate the set of certificates for the webhook. The contents of /tmp/k8s-webhook-server/serving-certs stay empty. This eventually results in the caBundle not getting injected into the webhook config, and calls to the webhook fail with an x509: certificate signed by unknown authority error.

Upon troubleshooting this, I found that the helm charts for all versions of MetalLB mount this directory as readOnly for the controller deployment:

        volumeMounts:
        - mountPath: /tmp/k8s-webhook-server/serving-certs
          name: cert
          readOnly: true

If I shell into the controller pod and try to touch /tmp/k8s-webhook-server/serving-certs/test.txt, I get a Read-only file system error. It seems obvious that this prevents the certRotator from saving a set of certificates there.
If I set readOnly: false, the certificates are generated without a problem and the webhook calls succeed normally.
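
For reference, this is roughly what the working configuration looks like once the flag is flipped. The volume name cert comes from the chart; the secretName shown here is only illustrative, and the chart's actual secret reference should stay unchanged:

        volumeMounts:
        - mountPath: /tmp/k8s-webhook-server/serving-certs
          name: cert
          readOnly: false       # flipped from true so the rotator can write the generated certs
      volumes:
      - name: cert
        secret:
          secretName: metallb-webhook-cert    # illustrative; keep whatever secret name the chart references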

It seems to be a bug in Kubernetes that the pod can sometimes still write to the volumeMount even though it is set as readOnly. But why is readOnly: true used in the first place?

To Reproduce

  1. Deploy a cluster and apply the MetalLB Helm chart (any recent version)
  2. Shell into the controller pod and check whether the /tmp/k8s-webhook-server/serving-certs location is read-only
  3. If it is read-only, no certificates can be generated and the controller starts logging certificate failures
  4. If the volumeMount is changed to not be read-only, certificate generation works normally

Expected Behavior

Certificate generation works normally when the controller pod starts

Additional Context

It is unclear to me why certificate generation sometimes still succeeds, even though it shouldn't. When this happens, the rotator logs

{"level":"error","ts":"2024-03-29T10:27:53Z","logger":"cert-rotation","msg":"secret is not well-formed, cannot update webhook configurations","error":"Cert secret is not well-formed, missing ca.crt","errorVerbose":"Cert secret is not well-formed, missing ca.crt\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.buildArtifactsFromSecret\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.7.0/pkg/rotator/rotator.go:428\ngithub.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile\n\t/go/pkg/mod/github.com/open-policy-agent/cert-controller@v0.7.0/pkg/rotator/rotator.go:693\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1594","stacktrace":"github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile\n\t/go/pkg/mod/github.com/
open-policy-agent/cert-controller@v0.7.0/pkg/rotator/rotator.go:695\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:122\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:323\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:274\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/go/pkg/mod/sigs.k8s.io/controller-runtime@v0.14.6/pkg/internal/controller/controller.go:235"}

Even though it then logs

{"level":"info","ts":"2024-03-29T10:27:53Z","logger":"cert-rotation","msg":"server certs refreshed"}
{"level":"info","ts":"2024-03-29T10:27:54Z","logger":"cert-rotation","msg":"certs are ready in /tmp/k8s-webhook-server/serving-certs"}
{"level":"info","ts":"2024-03-29T10:27:54Z","logger":"cert-rotation","msg":"CA certs are injected to webhooks"}

I've read and agree with the following

  • I've checked all open and closed issues and my request is not there.
  • I've checked all open and closed pull requests and my request is not there.

I've read and agree with the following

  • I've checked all open and closed issues and my issue is not there.
  • This bug is reproducible when deploying MetalLB from the main branch
  • I have read the troubleshooting guide and I am still not able to make it work
  • I checked the logs and MetalLB is not discarding the configuration as not valid
  • I enabled the debug logs, collected the information required from the cluster using the collect script and will attach them to the issue
  • I will provide the definition of my service and the related endpoint slices and attach them to this issue
kreeuwijk added the bug label Mar 29, 2024
@cyclinder (Contributor)

I don't think this is related to readOnly: false; Kubernetes recommends that the secret always be mounted as readOnly (see https://kubernetes.io/docs/concepts/storage/volumes/#secret). Since the cert file comes from the secret, it should not be changed. We should figure out why the certificate failed to be generated.
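
For context, the pattern the Kubernetes documentation describes is a Secret projected into the pod through a read-only mount. A minimal, generic sketch (names are made up, not taken from the MetalLB chart):

    apiVersion: v1
    kind: Pod
    metadata:
      name: secret-consumer            # generic example pod, not part of MetalLB
    spec:
      containers:
      - name: app
        image: registry.k8s.io/pause:3.9
        volumeMounts:
        - name: certs
          mountPath: /etc/certs
          readOnly: true               # consumers are expected to only read the projected data
      volumes:
      - name: certs
        secret:
          secretName: my-cert-secret   # made-up secret name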

@kreeuwijk (Author)


The secret starts out empty, though, and is mounted at /tmp/k8s-webhook-server/serving-certs. The controller pod is then responsible for generating a set of certificates (using the rotator) and storing them in this folder, which populates the secret.

It seems to me like a chicken-and-egg problem: the folder should be read-only once the certs are generated, but writable when the certs need to be generated (initially) or rotated (at expiration).
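
For what it's worth, a populated cert secret produced by the rotator would look roughly like the sketch below; the secret name and the key names (ca.crt, ca.key, tls.crt, tls.key) are my assumption based on the "missing ca.crt" error above, not taken from the chart:

    apiVersion: v1
    kind: Secret
    metadata:
      name: metallb-webhook-cert            # assumed name; use the secret your chart actually creates
      namespace: metallb-system
    type: Opaque
    data:
      ca.crt: <base64 CA certificate>       # the key the rotator reports as missing in the error above
      ca.key: <base64 CA private key>
      tls.crt: <base64 serving certificate>
      tls.key: <base64 serving private key>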

@fedepaol (Member) commented Apr 3, 2024


@kreeuwijk can you clarify the issue you are having a bit better (logs aside)?
Is the controller never able to read the generated secret, or does it eventually reconcile?

If it sometimes never reconciles, can you provide the logs for that case?

@fedepaol (Member) commented Apr 3, 2024


It's also strange that we never hit this in CI (nor have users reported it).

@kreeuwijk (Author)


It is an intermittent issue: sometimes the certificates do get generated normally, even though that shouldn't be possible on a read-only filesystem. When the issue does occur, however, no amount of restarting the controller pod will solve it. For some reason, repaving the control plane (these are CAPI clusters) makes the problem go away. If I don't perform that workaround, the only thing that works is setting the readOnly option for the volumeMount to false and running the controller once in that configuration, so that it can successfully store the keys in /tmp/k8s-webhook-server/serving-certs.
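
A minimal sketch of that workaround, expressed as a strategic merge patch for the controller Deployment; the container, deployment, and namespace names are assumptions to adjust to your release, and it could be applied with kubectl patch --patch-file or carried as a kustomize patch:

    # readonly-cert-mount-patch.yaml -- sketch only; names below are assumed
    spec:
      template:
        spec:
          containers:
          - name: controller                                   # assumed container name in the chart
            volumeMounts:
            - name: cert
              mountPath: /tmp/k8s-webhook-server/serving-certs
              readOnly: false                                  # let the rotator write the generated certs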
