Linkerd destination policy container stalls after connection timeout with API server #12469

Open
bc185174 opened this issue Apr 19, 2024 · 1 comment

bc185174 commented Apr 19, 2024

What is the issue?

The policy container in the Linkerd destination pod briefly lost its connection to the API server and then stalled. In this scenario the policy container never fully recovers and is never restarted.

The last log from the policy container was at around 2024-04-19T08:16:28Z. Two hours later there were still no new logs, and the linkerd-proxy containers in workload pods started to crash.

Restarting the linkerd destination pod resolved the issue.
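(A plain restart of the deployment, e.g. kubectl -n linkerd rollout restart deploy/linkerd-destination, assuming the default deployment name, works as a temporary workaround.)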

How can it be reproduced?

Temporarily block egress from the linkerd destination pod to the API server, for example with a NetworkPolicy that denies traffic to the API server's cluster IP (10.96.0.1:443 here):

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-egress
  namespace: linkerd
spec:
  podSelector:
    matchLabels:
      linkerd.io/control-plane-component: destination
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 10.96.0.1/32
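
Applying this policy for a minute or two and then deleting it (for example with kubectl apply -f and kubectl delete networkpolicy default-deny-egress -n linkerd) simulates the temporary outage; the list requests from the destination pod to 10.96.0.1:443 then fail with the i/o timeouts shown below.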

Logs, error output, etc

{"level":"info","msg":"pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1569713355]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.191) (total time: 30001ms):\nTrace[1569713355]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.193)\nTrace[1569713355]: [30.001686889s] [30.001686889s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.Job: failed to list *v1.Job: Get \"[https://10.96.0.1:443/apis/batch/v1/jobs?resourceVersion=803143](https://10.96.0.1/apis/batch/v1/jobs?resourceVersion=803143)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[1348786945]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229 (19-Apr-2024 08:15:58.231) (total time: 30001ms):\nTrace[1348786945]: ---\"Objects listed\" error:Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout 30001ms (08:16:28.232)\nTrace[1348786945]: [30.001709201s] [30.001709201s] END","time":"2024-04-19T08:16:28Z"}
{"error":null,"level":"error","msg":"pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229: Failed to watch *v1.PartialObjectMetadata: failed to list *v1.PartialObjectMetadata: Get \"[https://10.96.0.1:443/apis/apps/v1/replicasets?resourceVersion=803115](https://10.96.0.1/apis/apps/v1/replicasets?resourceVersion=803115)\": dial tcp 10.96.0.1:443: i/o timeout","time":"2024-04-19T08:16:28Z"}
{"level":"info","msg":"Trace[477063074]: \"Reflector ListAndWatch\" name:pkg/mod/k8s.io/client-go@v0.29.0/tools/cache/reflector.go:229 (19-Apr-2024 08:16:29.773) (total time: 59380ms):\nTrace[477063074]: ---\"Objects listed\" error:\u003cnil\u003e 59380ms (08:17:29.153)\nTrace[477063074]: [59.380218341s] [59.380218341s] END","time":"2024-04-19T08:17:29Z"}

Output of linkerd check -o short

N/A

Environment

  • k8s version: 1.27.7
  • linkerd version: 2.14.10
  • environment: ubuntu-distro

Possible solution

Ideally, readiness/liveness probes should detect this condition and restart the container when it happens.
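
As a sketch only, assuming the policy controller's admin server exposes /live and /ready endpoints on its admin port (9990 by default), a tightened probe configuration on the policy container of the linkerd-destination Deployment could look like the following; the exact path, port, and thresholds should be checked against the Helm chart in use:

containers:
- name: policy
  # Restart the container if the admin endpoint stops answering for ~30s.
  livenessProbe:
    httpGet:
      path: /live
      port: 9990
    initialDelaySeconds: 10
    periodSeconds: 10
    failureThreshold: 3
  # Take the pod out of rotation while it cannot serve policy lookups.
  readinessProbe:
    httpGet:
      path: /ready
      port: 9990
    periodSeconds: 10
    failureThreshold: 7

Note that this only helps if the admin endpoints actually reflect the stalled watches; if /live keeps returning 200 while the informers are wedged, a probe alone will not trigger a restart.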

Additional context

No response

Would you like to work on fixing this bug?

yes

@bc185174 bc185174 added the bug label Apr 19, 2024
@bc185174 bc185174 changed the title from Linkerd destination controller stalls after connection loss with API server to Linkerd destination policy container stalls after connection timeout with API server Apr 19, 2024
alpeb (Member) commented Apr 25, 2024

Can you provide the logs from the policy container when that happened? (The ones you provided are from a Go-based container, probably the destination container.)
