
[eks] [request]: Support specifying API Server Metric Cardinality Enforcement flag #2333

Open
sidewinder12s opened this issue Apr 19, 2024 · 4 comments
Labels
EKS Amazon Elastic Kubernetes Service Proposed Community submitted issue

Comments

@sidewinder12s

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Tell us about your request

Let users configure the API server metric cardinality enforcement flags introduced in kubernetes/enhancements#2305.
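
For context, KEP-2305 (metric cardinality enforcement) surfaces this as two kube-apiserver flags, --disabled-metrics and --allow-metric-labels. On a self-managed control plane they would be set roughly as in the sketch below (the metric and label values are illustrative only; EKS does not expose these flags today, which is the point of this request):

    # Illustrative kube-apiserver args as they'd appear in a static pod
    # manifest on a self-managed cluster (metric/label values are examples)
    command:
      - kube-apiserver
      # escape hatch: disable a misbehaving metric entirely (fully qualified name)
      - --disabled-metrics=apiserver_watch_events_sizes
      # constrain a label to an allow-list; per the KEP, values outside the
      # list are collapsed to "unknown", bounding the series count at the source
      - --allow-metric-labels=apiserver_request_total,group='apps,batch'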

Which service(s) is this request for?

EKS

Tell us about the problem you're trying to solve. What are you trying to do, and why is it hard?

API Server metrics are extremely high cardinality to begin with, and the cardinality only grows with every CRD you install into the control plane. The upstream community has provided this flag as a source-side control for that cardinality.

This is extremely problematic for many Prometheus installations, and especially for managed Prometheus providers where you pay for every series generated, such as Amazon Managed Service for Prometheus or Grafana Cloud.

Are you currently working around this issue?

Attempting to write Prometheus relabel rules to drop the series we don't need.

@sidewinder12s sidewinder12s added the Proposed Community submitted issue label Apr 19, 2024
@mikestef9 mikestef9 added the EKS Amazon Elastic Kubernetes Service label Apr 20, 2024
@stevehipwell

@sidewinder12s the kube-prometheus-stack (KPS) K8s API ServiceMonitor implementation has an example of a relabeling config; is this what you're after?

@sidewinder12s
Author

No, we are specifically interested in the Kubernetes Enhancement. Certain types of metric labels are hard to drop using relabel rules, and I'd rather be able to configure exactly what we want exposed.

Specifically, CRD metrics and histogram buckets were pretty hard, if not impossible, to get right with relabel rules.

@stevehipwell

@sidewinder12s I know that being able to configure this directly would be simpler and more consistent, but so far AWS has resisted letting end users customise control plane components. Even without customisation, it'd still be good to confirm how EKS is configured, even if it's one size fits all.

Specifically, CRD metrics and histogram buckets were pretty hard, if not impossible, to get right with relabel rules.

Could you elaborate a bit more on why this is the case? We're about to spend a bit of time reducing our metrics overhead, and I'd assumed that this would be hard work but achievable with relabeling.

@sidewinder12s
Author


It largely comes down to inconsistency between the metrics and labels that the API Server exposes.

When I summed up metrics/labels for the API Server job, there were 50+ different histogram bucket layouts of varying size, and that does not lend itself to a single relabel regex.

Here is an example of what we'd done so far, without those histogram drops:

            metricRelabelings:
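              # Low-value gauges and size/duration histograms we never query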
              - sourceLabels: [__name__]
                regex: "^(kubernetes_feature_enabled)$"
                action: drop
              - sourceLabels: [__name__]
                regex: "^(apiserver_watch_events_sizes_(bucket|count|sum))$"
                action: drop
              - sourceLabels: [__name__]
                regex: "^(apiserver_response_sizes_(bucket|count|sum)|apiserver_request_body_size_.*)$"
                action: drop
              - sourceLabels: [__name__]
                regex: "^(apiserver_storage_list_duration_seconds_(bucket|count|sum))$"
                action: drop
              # Aggregator drops
              - sourceLabels: [__name__]
                regex: "^(aggregator_.*)$"
                action: drop
              # Unclear value
              - sourceLabels: [__name__]
                regex: "^(apiextensions_apiserver_.*|apiextensions_openapi_v2_regeneration_count|apiextensions_openapi_v3_regeneration_count|apiserver_encryption_.*|apiserver_envelope_.*|disabled_metrics_total|hidden_metrics_total|registered_metrics_total|get_token_.*)$"
                action: drop
              # Go metrics (others are used)
              - sourceLabels: [__name__]
                regex: "^(go_cgo_.*|go_cpu_.*|go_gc_.*|go_memory_.*|go_memstats_.*|go_sched_.*|go_sync_.*|go_threads|go_godebug.*)$"
                action: drop
              # grpc metrics
              - sourceLabels: [__name__]
                regex: "^(grpc_.*)$"
                action: drop
              # process metrics
              - sourceLabels: [__name__]
                regex: "^(process_max_fds|process_open_fds|process_start_time_seconds|process_virtual_.*)$"
                action: drop
              # Same as below rules, but etcd metrics had this weird label
              - sourceLabels: [__name__, "type"]
                separator: "@"
                regex: "(etcd_.*)@(/registry/.*)"
                action: drop
              ### The following 3 rules should likely keep a consistent drop list as we're just trying to drop metrics related to CRDs
              # The CRD metrics end up being surfaced through these 3 different source labels
              - sourceLabels: [__name__, "group"]
                separator: "@"
                regex: "(apiserver_.*)@(.*gatekeeper.sh|.*velero.io|.*istio.io|.*coreos.com|.*cert-manager.io|.*keda.sh|.*karpenter.sh|.*k8s.aws|.*amazonaws.com|kiali.io|.*grafana.com)"
                action: drop
              - sourceLabels: [__name__, "type"]
                separator: "@"
                regex: "(etcd_.*)@(.*gatekeeper.sh|.*velero.io|.*istio.io|.*coreos.com|.*cert-manager.io|.*keda.sh|.*karpenter.sh|.*k8s.aws|.*amazonaws.com|kiali.io|.*grafana.com)"
                action: drop
              - sourceLabels: [__name__, "resource"]
                separator: "@"
                regex: "(apiserver_.*|etcd_.*|watch_cache_capacity)@(.*gatekeeper.sh|.*velero.io|.*istio.io|.*coreos.com|.*cert-manager.io|.*keda.sh|.*karpenter.sh|.*k8s.aws|.*amazonaws.com|kiali.io|.*grafana.com)"
                action: drop

I had not yet tried explicitly enumerating every exported histogram bucket per metric, but I think that's what it would take (versus trying to write one regex covering all buckets over 10s/under 100ms). That work would also have to be repeated after every upgrade, since metrics, especially alpha metrics, can change at any time. Either way, figuring out what to drop is not easy in Prometheus; it usually requires familiarity with the metrics source, which most users don't have.
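
As a sketch of what that per-metric bucket pruning would look like (the le values below are hypothetical; each histogram ships its own bucket layout, so a rule like this has to be re-derived per metric and re-checked after every upgrade):

    # Hypothetical: drop the sub-100ms and over-10s buckets of one histogram.
    # The le values are specific to each metric, and the +Inf bucket must be
    # kept or histogram_quantile() breaks.
    - sourceLabels: [__name__, le]
      separator: "@"
      regex: "^(apiserver_request_duration_seconds_bucket)@(0.005|0.01|0.025|0.05|15|20|25|30|40|50|60)$"
      action: drop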
