CPU Spikes when upgrading to 2.14.10 from 2.14.0 #12535

Open · pryorda opened this issue May 2, 2024 · 3 comments

pryorda commented May 2, 2024

What is the issue?

We observed CPU spikes in the linkerd/controller container after upgrading from Linkerd 2.14.0 to 2.14.10 using the Linkerd Helm charts.

How can it be reproduced?

Not fully known.

Logs, error output, etc

Not sure which logs would be beneficial. Please tell me which logs you'd like and I will obtain them.

Output of linkerd check -o short:

kubernetes-api
--------------
√ can initialize the client
√ can query the Kubernetes API

kubernetes-version
------------------
√ is running the minimum Kubernetes API version

linkerd-existence
-----------------
√ 'linkerd-config' config map exists
√ heartbeat ServiceAccount exist
√ control plane replica sets are ready
√ no unschedulable pods
√ control plane pods are ready
√ cluster networks contains all pods
√ cluster networks contains all services

linkerd-config
--------------
√ control plane Namespace exists
√ control plane ClusterRoles exist
√ control plane ClusterRoleBindings exist
√ control plane ServiceAccounts exist
√ control plane CustomResourceDefinitions exist
√ control plane MutatingWebhookConfigurations exist
√ control plane ValidatingWebhookConfigurations exist
√ proxy-init container runs as root user if docker container runtime is used

linkerd-identity
----------------
√ certificate config is valid
√ trust anchors are using supported crypto algorithm
√ trust anchors are within their validity period
√ trust anchors are valid for at least 60 days
√ issuer cert is using supported crypto algorithm
√ issuer cert is within its validity period
√ issuer cert is valid for at least 60 days
√ issuer cert is issued by the trust anchor

linkerd-webhooks-and-apisvc-tls
-------------------------------
√ proxy-injector webhook has valid cert
√ proxy-injector cert is valid for at least 60 days
√ sp-validator webhook has valid cert
√ sp-validator cert is valid for at least 60 days
√ policy-validator webhook has valid cert
√ policy-validator cert is valid for at least 60 days

linkerd-version
---------------
√ can determine the latest version
‼ cli is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-cli for hints

control-plane-version
---------------------
√ can retrieve the control plane version
‼ control plane is up-to-date
    unsupported version channel: stable-2.14.10
    see https://linkerd.io/2.14/checks/#l5d-version-control for hints
√ control plane and cli versions match

linkerd-control-plane-proxy
---------------------------
√ control plane proxies are healthy
‼ control plane proxies are up-to-date
    some proxies are not running the current version:
        * linkerd-destination-6567df76b5-2flsn (stable-2.14.10)
        * linkerd-destination-6567df76b5-7rpd7 (stable-2.14.10)
        * linkerd-destination-6567df76b5-8pd9s (stable-2.14.10)
        * linkerd-destination-6567df76b5-bj6x8 (stable-2.14.10)
        * linkerd-destination-6567df76b5-d9lxd (stable-2.14.10)
        * linkerd-destination-6567df76b5-gdbt2 (stable-2.14.10)
        * linkerd-destination-6567df76b5-grn76 (stable-2.14.10)
        * linkerd-destination-6567df76b5-hnb4d (stable-2.14.10)
        * linkerd-destination-6567df76b5-hvqcs (stable-2.14.10)
        * linkerd-destination-6567df76b5-hzbrb (stable-2.14.10)
        * linkerd-destination-6567df76b5-klt4v (stable-2.14.10)
        * linkerd-destination-6567df76b5-ksltv (stable-2.14.10)
        * linkerd-destination-6567df76b5-l7hqg (stable-2.14.10)
        * linkerd-destination-6567df76b5-nfj6q (stable-2.14.10)
        * linkerd-destination-6567df76b5-sjbb2 (stable-2.14.10)
        * linkerd-destination-6567df76b5-sktps (stable-2.14.10)
        * linkerd-destination-6567df76b5-tbd6c (stable-2.14.10)
        * linkerd-destination-6567df76b5-vhc4r (stable-2.14.10)
        * linkerd-destination-6567df76b5-vk48q (stable-2.14.10)
        * linkerd-destination-6567df76b5-vk4nj (stable-2.14.10)
        * linkerd-identity-65f4ccc9b6-2wj47 (stable-2.14.10)
        * linkerd-identity-65f4ccc9b6-8c5hp (stable-2.14.10)
        * linkerd-identity-65f4ccc9b6-vlf4p (stable-2.14.10)
        * linkerd-proxy-injector-6658c78c79-nzwp9 (stable-2.14.10)
        * linkerd-proxy-injector-6658c78c79-ppq8b (stable-2.14.10)
        * linkerd-proxy-injector-6658c78c79-z4ptw (stable-2.14.10)
    see https://linkerd.io/2.14/checks/#l5d-cp-proxy-version for hints
√ control plane proxies and cli versions match

linkerd-ha-checks
-----------------
√ pod injection disabled on kube-system

Status check results are √

Environment

  • Kubernetes Version: v1.28.7-eks-b9c9ed7
  • Cluster Environment: EKS
  • Host OS: Amazon Linux 2
  • Linkerd Version: 2.14.0 -> 2.14.10
  • Linkerd Installation Method: Helm Charts

Possible solution

Downgrade
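
Since the install is via Helm, the downgrade would look roughly like the sketch below. The release name linkerd-control-plane and the linkerd namespace are assumptions; they depend on how the charts were installed.

    # Find the Helm revision that installed 2.14.0, then roll back to it.
    # Release name and namespace are assumptions; adjust to your install.
    helm -n linkerd history linkerd-control-plane
    helm -n linkerd rollback linkerd-control-plane <REVISION>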

Additional context

[screenshot: controller CPU usage graph]

Purple is after downgrade to 2.14.0

Would you like to work on fixing this bug?

Yes.

pryorda added the bug label May 2, 2024
mateiidavid (Member) commented May 3, 2024

@pryorda thanks for raising this! The baseline CPU usage seems almost identical. It's true that the peaks you're seeing are higher on 2.14.10, but they don't look exaggerated to me. It would help us to know whether you're seeing any other symptoms, or whether this spike causes a concrete problem for you. From a high-level perspective, a third of a CPU doesn't indicate any pathological behaviour. If we can narrow down the scope of the issue (whether it's more about optimising the code, or about fixing what might be a regression in our codebase), we can carve out a more detailed plan to investigate it, or assist you with your own investigation.

On that note, I think more data would also be helpful here. There have been a lot of changes to the codebase between 2.14.0 and 2.14.10: we've been addressing some reports of staleness in discovery data, and as a result introduced more work in the destination container. If you want to try to root-cause this, here are my suggestions (a rough sketch of the first two follows the list):

  • linkerd diagnostics: a command you can use to get a metrics dump from a pod or a deployment. I'd suggest getting a dump of the destination container on both versions to see whether they differ in volume of traffic and informer cache sizes.
  • pprof: the Go toolchain makes it easy to collect profiling information, and we expose an endpoint in the destination container (you might have to turn it on explicitly). Profiling can tell us whether we're spending more time in certain functions than we previously did, which could help us root-cause the difference in CPU utilization.
  • bisect: as mentioned, there have been a few releases between the two versions. I would look at whether anything significant was introduced in between; is there anything that stands out? We could also use a smaller version interval, e.g. between 2.14.7 and 2.14.10: have we introduced any new informers, watchers, buffers, or caches?
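
For concreteness, here is a minimal sketch of the first two steps. It assumes the control plane is in the linkerd namespace, that the destination container's admin port is 9996, and that pprof has been enabled (e.g. via the enablePprof Helm value); treat all three as assumptions to verify against your install, not guaranteed defaults.

    # Metrics dump from the control plane containers; run on both versions
    # and compare for traffic volume and informer cache sizes.
    linkerd diagnostics controller-metrics > controller-metrics-2.14.10.txt

    # Proxy metrics for the destination pods, for the data-plane side.
    linkerd diagnostics proxy-metrics -n linkerd deploy/linkerd-destination \
        > proxy-metrics-2.14.10.txt

    # 30-second CPU profile of the destination container via pprof.
    # Port 9996 and the enablePprof value are assumptions; adjust to your setup.
    kubectl -n linkerd port-forward deploy/linkerd-destination 9996:9996 &
    go tool pprof -top 'http://localhost:9996/debug/pprof/profile?seconds=30'

Capturing the same profile on 2.14.0 and diffing the two (go tool pprof has a -diff_base flag for exactly this) should show whether a particular code path accounts for the extra CPU.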

Hope this all makes sense!

pryorda (Author) commented May 3, 2024

Thank you for responding. I'll see what information I can gather to help diagnose the issue. I don't know how to fully debug the Linkerd side of things, so any advice you have is a good start. As for bisecting the code to find differences, that's a bit out of my realm and might be a steep ask. I've looked at the changelog and nothing stood out to me. Here is what it looks like on our prod cluster since we upgraded.

[screenshot: controller CPU usage on the prod cluster since the upgrade]

kflynn (Member) commented May 30, 2024

@pryorda Did you get a chance to look any further here?
