
Linkerd CNI pods not aware of the OIDC signing key auto-rotation by AKS #12573

Open
Peeyush1989 opened this issue May 8, 2024 · 0 comments
Labels: bug, env/aks Microsoft AKS

What is the issue?

We are using a private AKS cluster on version 1.26.x, with Linkerd stable-2.14.2 installed and linkerd-cni enabled.

The AKS cluster has the OIDC issuer enabled, which auto-rotates the service account signing keys periodically.
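The OIDC issuer configuration can be confirmed with the Azure CLI; `<resource-group>` and `<cluster-name>` below are placeholders, not our actual resource names:

```bash
# Show the cluster's OIDC issuer profile (enabled flag and issuer URL)
az aks show \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --query "oidcIssuerProfile" \
  -o json
```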

After the OIDC signing keys were auto-rotated, all newly created pods got stuck with the following error:

```
FailedCreatePodSandBox (x556 over ) kubelet (combined from similar events): Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "3756782430d4016076288c700b871e4325ca2d5d6bdd7a422697c7d3b54d23e6": plugin type="linkerd-cni" name="linkerd-cni" failed (add): Unauthorized
```

  • We found that the issue started after an automatic RotateServiceAccountSigningKeys operation.
  • We tried reconciling the cluster by running `az aks update`, but the issue persisted.
  • We tried creating a new token for the default service account in the default namespace and creating a new pod with it, but the issue persisted.
  • We then ran `az aks oidc-issuer rotate-signing-keys` twice, but the issue persisted.
  • Lastly, since the new pods were failing with an Unauthorized error from the linkerd-cni plugin, we concluded the problem was in the linkerd-cni pods themselves. We deleted the linkerd-cni DaemonSet pods so they picked up fresh tokens, which resolved the sandbox creation failures (see the command sketch after this list).
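For reference, a minimal sketch of the commands involved in the steps above. The resource group, cluster name, and the linkerd-cni namespace/DaemonSet names are placeholders, assuming a default linkerd-cni install:

```bash
# Reconcile the cluster (no additional parameters)
az aks update --resource-group <resource-group> --name <cluster-name>

# Issue a fresh token for the default service account (Kubernetes 1.24+)
kubectl create token default --namespace default

# Manually rotate the OIDC service account signing keys
az aks oidc-issuer rotate-signing-keys \
  --resource-group <resource-group> --name <cluster-name>

# Restart the linkerd-cni DaemonSet so its pods pick up fresh tokens
# (assumes the chart's default namespace and DaemonSet name)
kubectl -n linkerd-cni rollout restart daemonset/linkerd-cni
```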

After restarting the linkerd-cni DaemonSet we were able to deploy new pods, but the existing pods in Linkerd-meshed namespaces started reporting invalid certificate errors and pod-to-pod communication was impacted.

We checked the issuer certificate and it was still valid. We had to redeploy Linkerd to get rid of this issue; a sketch of the relevant checks follows below.
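For completeness, a rough sketch of the checks and restarts that apply at this point, assuming the `linkerd` CLI is installed and `<meshed-namespace>` stands in for each meshed namespace:

```bash
# Verify control-plane and data-plane certificate health
linkerd check
linkerd check --proxy

# Restart meshed workloads so their proxies obtain fresh identity certificates
kubectl -n <meshed-namespace> rollout restart deployment
```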

We need your help troubleshooting Linkerd's behavior when the AKS OIDC signing keys are rotated.

How can it be reproduced?

Manually rotate the OIDC signing keys on a fresh cluster to reproduce this issue; a sketch follows below.
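A hedged sketch of the reproduction, assuming a cluster with the OIDC issuer and linkerd-cni enabled (`repro-test` is just an illustrative pod name, and the resource names are placeholders):

```bash
# Trigger the same rotation that AKS performs automatically
az aks oidc-issuer rotate-signing-keys \
  --resource-group <resource-group> --name <cluster-name>

# Create a new pod on a node running linkerd-cni and watch its events for
# the FailedCreatePodSandBox / Unauthorized error shown above
kubectl run repro-test --image=nginx
kubectl describe pod repro-test
```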

Logs, error output, etc

Linkerd control plane

[ 0.105506s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.306969s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 0.710647s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 1.211775s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 1.713047s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 2.215585s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 2.716391s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]
[ 3.217705s] WARN ThreadId(01) watch{port=8086}:controller{addr=localhost:8090}:endpoint{addr=127.0.0.1:8090}: linkerd_reconnect: Failed to connect error=endpoint 127.0.0.1:8090: Connection refused (os error 111) error.sources=[Connection refused (os error 111)]

Output of `linkerd check -o short`

N/A

Environment

  • Kubernetes version: 1.26
  • Env: AKS

Possible solution

No response

Additional context

No response

Would you like to work on fixing this bug?

Yes

@Peeyush1989 Peeyush1989 added the bug label May 8, 2024
@olix0r olix0r added the env/aks Microsoft AKS label May 16, 2024