
[eks] [request]: Support specifying API Server Metric Cardinality Enforcement flag #2333

Open
sidewinder12s opened this issue Apr 19, 2024 · 4 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@sidewinder12s

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

Let users configure the API server metric cardinality enforcement flags introduced in kubernetes/enhancements#2305.
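
For context, KEP-2305 (metric cardinality enforcement) surfaces this as two kube-apiserver flags, --disabled-metrics and --allow-metric-labels. On a self-managed control plane they would be set roughly as in the sketch below (the metric and label values are illustrative only; EKS does not expose these flags today, which is the point of this request):

    # Illustrative kube-apiserver args as they'd appear in a static pod
    # manifest on a self-managed cluster (metric/label values are examples)
    command:
      - kube-apiserver
      # escape hatch: disable a misbehaving metric entirely (fully qualified name)
      - --disabled-metrics=apiserver_watch_events_sizes
      # constrain a label to an allow-list; per the KEP, values outside the
      # list are collapsed to "unknown", bounding the series count at the source
      - --allow-metric-labels=apiserver_request_total,group='apps,batch'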

Which service(s) is this request for?

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

API Server metrics are extremely high cardinality to begin with, and the cardinality only grows with every CRD you install into the control plane. The upstream community has provided this flag as a source-side control for that cardinality.

This is extremely problematic for many Prometheus installations, and especially for managed Prometheus providers where you pay for every series generated, such as Amazon Managed Service for Prometheus or Grafana Cloud.

Are you currently working around this issue?

Attempting to write Prometheus relabel rules to drop the series we don't need.

@sidewinder12s sidewinder12s added the Proposed Community submitted issue label Apr 19, 2024
@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Apr 20, 2024
@stevehipwell

@sidewinder12s the kube-prometheus-stack (KPS) K8s API ServiceMonitor implementation has an example of a relabeling config; is this what you're after?

@sidewinder12s
Author

No, we are specifically interested in the Kubernetes Enhancement. Certain types of metric labels are hard to drop using relabel rules, and I'd rather be able to configure exactly what we want exposed.

Specifically, CRD metrics and histogram buckets were pretty hard, if not impossible, to get right with relabel rules.

@stevehipwell

@sidewinder12s I know that being able to configure this directly would be simpler and more consistent, but so far AWS has resisted letting end users customise control plane components. Even without customisation, it'd still be good to confirm how EKS is configured, even if it's one size fits all.

Specifically, CRD metrics and histogram buckets were pretty hard, if not impossible, to get right with relabel rules.

Could you elaborate a bit more on why this is the case? We're about to spend a bit of time reducing our metrics overhead, and I'd assumed that this would be hard work but achievable with relabeling.

@sidewinder12s
Author


It largely comes down to inconsistency between the metrics and labels that the API Server exposes.

When I summed up metrics/labels for the API Server job, there were 50+ different histogram bucket layouts of varying size, and that does not lend itself to a single relabel regex.

Here is an example of what we'd done so far, without those histogram drops:

            metricRelabelings:
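              # Low-value gauges and size/duration histograms we never query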
              - sourceLabels: [__name__]
                regex: "^(kubernetes_feature_enabled)$"
                action: drop
              - sourceLabels: [__name__]
                regex: "^(apiserver_watch_events_sizes_(bucket|count|sum))$"
                action: drop
              - sourceLabels: [__name__]
                regex: "^(apiserver_response_sizes_(bucket|count|sum)|apiserver_request_body_size_.*)$"
                action: drop
              - sourceLabels: [__name__]
                regex: "^(apiserver_storage_list_duration_seconds_(bucket|count|sum))$"
                action: drop
              # Aggregator drops
              - sourceLabels: [__name__]
                regex: "^(aggregator_.*)$"
                action: drop
              # Unclear value
              - sourceLabels: [__name__]
                regex: "^(apiextensions_apiserver_.*|apiextensions_openapi_v2_regeneration_count|apiextensions_openapi_v3_regeneration_count|apiserver_encryption_.*|apiserver_envelope_.*|disabled_metrics_total|hidden_metrics_total|registered_metrics_total|get_token_.*)$"
                action: drop
              # Go metrics (others are used)
              - sourceLabels: [__name__]
                regex: "^(go_cgo_.*|go_cpu_.*|go_gc_.*|go_memory_.*|go_memstats_.*|go_sched_.*|go_sync_.*|go_threads|go_godebug.*)$"
                action: drop
              # grpc metrics
              - sourceLabels: [__name__]
                regex: "^(grpc_.*)$"
                action: drop
              # process metrics
              - sourceLabels: [__name__]
                regex: "^(process_max_fds|process_open_fds|process_start_time_seconds|process_virtual_.*)$"
                action: drop
              # Same as below rules, but etcd metrics had this weird label
              - sourceLabels: [__name__, "type"]
                separator: "@"
                regex: "(etcd_.*)@(/registry/.*)"
                action: drop
              ### The following 3 rules should likely keep a consistent drop list as we're just trying to drop metrics related to CRDs
              # The CRD metrics end up being surfaced through these 3 different source labels
              - sourceLabels: [__name__, "group"]
                separator: "@"
                regex: "(apiserver_.*)@(.*gatekeeper.sh|.*velero.io|.*istio.io|.*coreos.com|.*cert-manager.io|.*keda.sh|.*karpenter.sh|.*k8s.aws|.*amazonaws.com|kiali.io|.*grafana.com)"
                action: drop
              - sourceLabels: [__name__, "type"]
                separator: "@"
                regex: "(etcd_.*)@(.*gatekeeper.sh|.*velero.io|.*istio.io|.*coreos.com|.*cert-manager.io|.*keda.sh|.*karpenter.sh|.*k8s.aws|.*amazonaws.com|kiali.io|.*grafana.com)"
                action: drop
              - sourceLabels: [__name__, "resource"]
                separator: "@"
                regex: "(apiserver_.*|etcd_.*|watch_cache_capacity)@(.*gatekeeper.sh|.*velero.io|.*istio.io|.*coreos.com|.*cert-manager.io|.*keda.sh|.*karpenter.sh|.*k8s.aws|.*amazonaws.com|kiali.io|.*grafana.com)"
                action: drop

I had not yet tried explicitly enumerating every exported histogram bucket per metric, but I think that's what it would take (versus trying to write one regex covering all buckets over 10s/under 100ms). That work would also have to be repeated after every upgrade, since metrics, especially alpha metrics, can change at any time. Either way, figuring out what to drop is not easy in Prometheus; it usually requires familiarity with the metrics source, which most users don't have.
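
As a sketch of what that per-metric bucket pruning would look like (the le values below are hypothetical; each histogram ships its own bucket layout, so a rule like this has to be re-derived per metric and re-checked after every upgrade):

    # Hypothetical: drop the sub-100ms and over-10s buckets of one histogram.
    # The le values are specific to each metric, and the +Inf bucket must be
    # kept or histogram_quantile() breaks.
    - sourceLabels: [__name__, le]
      separator: "@"
      regex: "^(apiserver_request_duration_seconds_bucket)@(0.005|0.01|0.025|0.05|15|20|25|30|40|50|60)$"
      action: drop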
