PodMonitor linkerd-proxy - Creates duplicate timestamp metric labels #12592

Open
jseiser opened this issue May 14, 2024 · 0 comments
jseiser commented May 14, 2024

What is the issue?

When using something like Mimir for long-term metric retention, the metrics from this PodMonitor are scraped by Prometheus and remote-written directly to Mimir. Mimir rejects a large portion of these samples with the following error:

failed pushing to ingester mimir-distributed-ingester-zone-b-0: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested

Mimir Docs: https://grafana.com/docs/mimir/latest/manage/mimir-runbooks/#err-mimir-sample-duplicate-timestamp

Prometheus relabelling has been configured and it causes series to clash after the relabelling. Check the error message for information about which series has received a duplicate sample.

Disabling this PodMonitor stops the errors.
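For context on the failure mode (this is not a claim about which specific rule in the linkerd-proxy PodMonitor is responsible): if a relabel rule drops or rewrites the only label that distinguishes two scraped series, both samples reach the remote-write endpoint with identical label sets and the same scrape timestamp, and Mimir rejects the second one. A minimal sketch, using an invented label name:

# Hypothetical metric_relabel_configs; `target_port` is an invented label used
# only to illustrate the collapse, not a label from the real PodMonitor.
metric_relabel_configs:
  - action: labeldrop
    regex: target_port
# Before relabeling:
#   tcp_close_total{pod="foo", target_port="4143"}  value=10
#   tcp_close_total{pod="foo", target_port="4191"}  value=7
# After relabeling (label sets now identical):
#   tcp_close_total{pod="foo"}  value=10
#   tcp_close_total{pod="foo"}  value=7   <- same timestamp, different value -> rejected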

How can it be reproduced?

  1. Install Prometheus via kube-prometheus-stack (the Prometheus Operator is what consumes PodMonitors)
  2. Install Mimir
  3. Install Linkerd with its PodMonitor resources enabled, so the linkerd-proxy PodMonitor is created
  4. Configure Prometheus to remote write to Mimir (see the sketch after this list)
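For the remote-write step, a minimal sketch assuming kube-prometheus-stack (prometheus.prometheusSpec.remoteWrite is passed through to the Prometheus CR's remoteWrite field); the URL is the one from the error log below:

# kube-prometheus-stack Helm values (sketch)
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: http://mimir-distributed-nginx.mimir.svc:80/api/v1/push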

Logs, error output, etc

{
  "caller": "dedupe.go:112",
  "component": "remote",
  "count": 2000,
  "err": "server returned HTTP status 400 Bad Request: failed pushing to ingester mimir-distributed-ingester-zone-b-0: user=anonymous: the sample has been rejected because another sample with the same timestamp, but a different value, has already been ingested (err-mimir-sample-duplicate-timestamp). The affected sample has timestamp 2024-05-14T22:16:17.61Z and is from series tcp_close_total{app_kubernetes_io_instance=\"kube-prometheus-stack-prometheus\", app_kubernetes_io_managed_by=\"prometheus-operator\", app_kubernetes_io_name=\"prometheus\", app_kubernetes_io_version=\"2.51.2\", apps_kubernetes_io_pod_index=\"0\", container=\"linkerd-proxy\", control_plane_ns=\"linkerd\", controller_revision_hash=\"prometheus-kube-prometheus-stack-prometheus-647889d8c\", direction=\"outbound\", dst_control_plane_ns=\"linkerd\", dst_daemonset=\"promtail\", dst_namespace=\"promtail\", dst_pod=\"promtail-f4kms\", dst_serviceaccount=\"promtail\", instance=\"10.2.25.220:4191\", job=\"linkerd/linkerd-proxy\", namespace=\"monitoring\", operator_prometheus_io_name=\"kube-prometheus-stack-prometheus\", operator_promethe",
  "exemplarCount": 0,
  "level": "error",
  "msg": "non-recoverable error",
  "remote_name": "2cbc3b",
  "ts": "2024-05-14T22:16:19.070Z",
  "url": "http://mimir-distributed-nginx.mimir.svc:80/api/v1/push"
}

output of linkerd check -o short

❯ linkerd check -o short
linkerd-config
--------------
× control plane CustomResourceDefinitions exist
    missing grpcroutes.gateway.networking.k8s.io
    see https://linkerd.io/2/checks/#l5d-existence-crd for hints

linkerd-jaeger
--------------
‼ jaeger extension proxies are up-to-date
    some proxies are not running the current version:
        * jaeger-injector-7566699689-44tfd (stable-2.14.10)
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cp-version for hints
‼ jaeger extension proxies and cli versions match
    jaeger-injector-7566699689-44tfd running stable-2.14.10 but cli running edge-24.5.2
    see https://linkerd.io/2/checks/#l5d-jaeger-proxy-cli-version for hints

linkerd-viz
-----------
‼ viz extension proxies are up-to-date
    some proxies are not running the current version:
        * metrics-api-7fd4bb899-5wczd (edge-24.5.1)
        * metrics-api-7fd4bb899-srcxk (edge-24.5.1)
        * tap-988849cc4-5drh4 (edge-24.5.1)
        * tap-988849cc4-htdg5 (edge-24.5.1)
        * tap-injector-84f85cb756-gglv7 (edge-24.5.1)
        * tap-injector-84f85cb756-zhs2n (edge-24.5.1)
        * web-5d484bb4f-xvzfs (edge-24.5.1)
        * web-5d484bb4f-zmfbh (edge-24.5.1)
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cp-version for hints
‼ viz extension proxies and cli versions match
    metrics-api-7fd4bb899-5wczd running edge-24.5.1 but cli running edge-24.5.2
    see https://linkerd.io/2/checks/#l5d-viz-proxy-cli-version for hints
‼ prometheus is installed and configured correctly
    missing ClusterRoles: linkerd-linkerd-viz-prometheus
    see https://linkerd.io/2/checks/#l5d-viz-prometheus for hints

Status check results are ×

Environment

EKS 1.28

Possible solution

I honestly do not know enough about Prometheus metric relabeling to propose a fix, but of the 40+ ServiceMonitors we have, only this specific PodMonitor causes these errors.
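As a possible stopgap (not a verified fix, and not the manifest Linkerd actually ships), a metricRelabelings stanza on the PodMonitor endpoint could drop the series Mimir keeps rejecting until the relabeling clash is understood. All names below are assumptions:

apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: linkerd-proxy          # assumed to match the existing PodMonitor (job="linkerd/linkerd-proxy")
  namespace: linkerd
spec:
  selector: {}                 # selector/namespaceSelector details elided for brevity
  podMetricsEndpoints:
    - port: linkerd-admin      # assumed port name for the proxy admin endpoint (4191)
      metricRelabelings:
        - action: drop
          sourceLabels: [__name__]
          regex: tcp_close_total   # the metric rejected in the log above; extend the regex if other series clash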

Additional context

No response

Would you like to work on fixing this bug?

no

jseiser added the bug label May 14, 2024