vmagent job flapping up/down with no errors #6203

k0nstantinv · 2024-04-29T10:11:14Z

Is your question request related to a specific component?

vmagent

Describe the question in detail

VictoriaMetrics cluster 1.96

VictoriaMetrics vmagent 1.96
clustered vmagent with 6 members

vmagent:
  enabled: true
  replicaCount: 1
  shardCount: 6
  scrapeInterval: 30s
  spec:
    image:
      tag: v1.96.0
    updateStrategy: Recreate
    extraArgs:
      loggerFormat: json
      promscrape.cluster.name: prod-euno1-devpl-0
      promscrape.disableCompression: "false"
      promscrape.discovery.concurrency: "200"
      promscrape.maxDroppedTargets: "5000"
      promscrape.maxScrapeSize: 4GB
      promscrape.noStaleMarkers: "true"
      promscrape.streamParse: "true"
      promscrape.suppressScrapeErrors: "true"
      remoteWrite.maxBlockSize: 200MB
      remoteWrite.maxRowsPerBlock: "11000"
      remoteWrite.queues: "170"
      remoteWrite.tlsInsecureSkipVerify: "true"
    externalLabels:
      cluster: prod-euno1-devpl-0
    nodeSelector:
      dedicated-to: victoriametrics
    tolerations:
    - effect: NoSchedule
      key: dedicated-to
      operator: Equal
      value: victoriametrics
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - vmagent
            topologyKey: kubernetes.io/hostname
    resources:
      limits:
        cpu: 5100m
        memory: 7Gi
      requests:
        cpu: 2100m
        memory: 4Gi
  ingress:
    enabled: false
  additionalScrapeConfigs: |
    - job_name: opencost
      honor_labels: true
      scrape_interval: 1m
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      dns_sd_configs:
      - names:
        - opencost.opencost
        type: 'A'
        port: 9003

Everything seems to be fine, but the built-in dashboard always shows the vmagent job flapping up/down.

The yellow bars on the graph represent this query: up{job="vmagent-vm-stack"}

And VMUI really shows this job going up and down all the time.

But I can't find the reason at all. There are no pod/container restarts:

$ k -n monitoring get pods -o wide | grep vmagent
vmagent-vm-stack-0-955dfdb6c-4lfk6                 2/2     Running   0               61m     10.33.91.209     ip-10-105-149-218.eu-north-1.compute.internal   <none>           <none>
vmagent-vm-stack-1-7db4f47479-c5c26                2/2     Running   0               64m     10.33.85.121     ip-10-105-153-69.eu-north-1.compute.internal    <none>           <none>
vmagent-vm-stack-2-68c6548d75-dnfp7                2/2     Running   0               35m     10.33.105.61     ip-10-105-179-11.eu-north-1.compute.internal    <none>           <none>
vmagent-vm-stack-3-57745d6b67-l9thn                2/2     Running   0               45m     10.33.96.213     ip-10-105-179-85.eu-north-1.compute.internal    <none>           <none>
vmagent-vm-stack-4-586b756895-wz9pw                2/2     Running   0               40m     10.33.101.169    ip-10-105-152-189.eu-north-1.compute.internal   <none>           <none>
vmagent-vm-stack-5-544d5b9dc-wrk6x                 2/2     Running   0               5d1h    10.33.222.245    ip-10-105-178-188.eu-north-1.compute.internal   <none>           <none>

And the pods don't even log any errors:

$ k -n monitoring logs -f vmagent-vm-stack-3-57745d6b67-l9thn vmagent | grep -v info
(empty)
^C

Please help me debug this.

Troubleshooting docs

General - https://docs.victoriametrics.com/troubleshooting/
vmagent - https://docs.victoriametrics.com/vmagent/#troubleshooting
vmalert - https://docs.victoriametrics.com/vmalert/#troubleshooting


Haleygo · 2024-04-29T12:53:21Z

Hello,
how do you scrape vmagent here? Could you attach the scrape job config?
And how many vmagent instances are shown on the http://vmagent:8429/targets page? Do they change a lot?

k0nstantinv · 2024-04-30T06:40:41Z

@Haleygo thanks for your response! As I posted, vmagent has 6 shards. Unfortunately it's not possible to post all the targets from the /targets endpoint, since each vmagent scrapes only its own part of the targets. The overall target count is about 9000 (from the vmagent dashboard) and they really do change a lot. It is a dev cluster where any developer can create their own test environment. I'm ready to post anything you need to be able to help.

Haleygo · 2024-04-30T06:59:15Z

No problem, I'm only concerned about the vmagent scrape job.
How do you scrape vmagent: using service discovery or static configs? How many instances are shown under the vmagent scrape job, and are they all healthy? If not, the reason should be shown on the /targets page as well.
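
For reference, the configuration in the original post sets promscrape.suppressScrapeErrors: "true", which tells vmagent not to log scrape errors, so an empty log does not by itself mean that scrapes succeed. The last error for each target is still kept on the /targets page. A minimal sketch for pulling it from the command line, assuming curl and jq are installed and the shard's port 8429 has been port-forwarded to 127.0.0.1 (the helper names filter_job_targets and job_targets are made up for illustration):

```shell
#!/bin/sh
# Sketch: list health and last error for one scrape job on a single vmagent shard.
# Assumes the shard's HTTP port is reachable locally, e.g. after:
#   kubectl -n monitoring port-forward pod/vmagent-vm-stack-3-57745d6b67-l9thn 8429:8429

# Keep only the targets of the given job from an /api/v1/targets response on stdin.
# Prints one tab-separated line per target: instance, health, last error.
filter_job_targets() {
  jq -r --arg job "$1" '
    .data.activeTargets[]
    | select(.labels.job == $job)
    | [.labels.instance, .health, .lastError]
    | @tsv'
}

# Fetch the targets of the given job from a vmagent base URL.
job_targets() {
  curl -fsS "$1/api/v1/targets" | filter_job_targets "$2"
}

# Usage (hypothetical local address after port-forwarding):
#   job_targets http://127.0.0.1:8429 vmagent-vm-stack
```

With 6 shards, the targets of the vmagent job are spread across the shards, so the same call may need to be repeated against each pod.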

k0nstantinv · 2024-04-30T08:29:53Z

We use the vm-stack chart for our setup.

vmagent is scraped via VMServiceScrape (Kubernetes SD):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
...
  labels:
....
    managed-by: vm-operator
  name: vmagent-vm-stack
  namespace: monitoring
  ownerReferences:
  - apiVersion: operator.victoriametrics.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: VMAgent
    name: vm-stack
    uid: dff72d61-3e36-4968-bd25-9f48cac8f339
  resourceVersion: "1104257"
  uid: 642c3ef7-fca1-4cee-b146-ad54260bab09
spec:
  endpoints:
  - attach_metadata: {}
    path: /metrics
    port: http
  namespaceSelector: {}
  selector:
    matchExpressions:
    - key: operator.victoriametrics.com/additional-service
      operator: DoesNotExist
    matchLabels:
      app.kubernetes.io/component: monitoring
      app.kubernetes.io/instance: vm-stack
      app.kubernetes.io/name: vmagent
      managed-by: vm-operator

I don't understand what I should post here. Can you provide the exact command or query, please?

The UI targets page is really huge since the target count is near 9000.

We definitely have unhealthy targets, actually hundreds of them. Is that related to the vmagent job's up/down flapping?

Haleygo · 2024-04-30T09:39:43Z

UI targets page is really huge due to targets count near 9000

Targets can be filtered in the UI like this; please share the result for the vmagent job.

k0nstantinv · 2024-04-30T12:06:03Z

Thanks! I can see now. Sometimes it is down

sometimes it is up

no idea why it's happening

k0nstantinv · 2024-05-03T06:30:39Z

@Haleygo could you please tell me whether there is any recommendation to fix this? I've tried increasing the scrape timeout with no luck. What else could "context deadline exceeded" mean?

Haleygo · 2024-05-08T05:26:41Z

I can see now. Sometimes it is down

Looks like you have a lot of scrape failures (3062/4783; 2733/4846). They could be caused by resource pressure or a slow network. Could you also check vmagent's CPU usage?
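
To put the numbers above in perspective, here is a small sketch that turns each pair into a failure percentage, reading them as failed/total scrapes (an assumption about what the targets page shows), plus a helper that prints per-container CPU usage. The function names are hypothetical, and kubectl top requires metrics-server in the cluster.

```shell
#!/bin/sh
# Percentage of failed scrapes, given failed and total counts.
scrape_failure_pct() {
  awk -v failed="$1" -v total="$2" 'BEGIN {
    if (total <= 0) { print "n/a"; exit }   # avoid division by zero
    printf "%.1f\n", 100 * failed / total
  }'
}

# Per-container CPU and memory usage for the vmagent pods (needs metrics-server).
vmagent_cpu() {
  kubectl -n monitoring top pods --containers | grep vmagent
}

# Usage with the two pairs quoted above, read as failed/total (an assumption):
#   scrape_failure_pct 3062 4783
#   scrape_failure_pct 2733 4846
#   vmagent_cpu
```

Read this way, well over half of the scrapes of the vmagent job failed on both of these targets, which matches the flapping seen in the up series.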

k0nstantinv · 2024-05-13T06:53:28Z

@Haleygo thanks! vmagent shows extremely high CPU usage, and the nodes are at almost 100% CPU usage. I didn't expect that scrape failures could be caused by high CPU usage. How can I determine that a scrape failure is caused specifically by a lack of resources?
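
One possible way to approach this question, offered as a sketch rather than a definitive test: compare CPU throttling of the vmagent container with the scrape durations of the affected job. The queries below assume that cAdvisor metrics (container_cpu_cfs_throttled_periods_total, container_cpu_cfs_periods_total) are being collected and that a Prometheus-compatible query endpoint is reachable; the function names and the base URL in the usage note are illustrative only.

```shell
#!/bin/sh
# Sketch: queries that may help tell resource starvation from other causes.

# Share of CFS periods in which a container was CPU-throttled (cAdvisor metrics).
throttle_ratio_query() {
  printf 'rate(container_cpu_cfs_throttled_periods_total{namespace="%s",container="%s"}[5m]) / rate(container_cpu_cfs_periods_total{namespace="%s",container="%s"}[5m])\n' "$1" "$2" "$1" "$2"
}

# Longest recent scrape duration per target of a job, to compare with the scrape timeout.
scrape_duration_query() {
  printf 'max_over_time(scrape_duration_seconds{job="%s"}[15m])\n' "$1"
}

# Run an instant query against a Prometheus-compatible HTTP API.
instant_query() {
  curl -fsS --get --data-urlencode "query=$2" "$1/api/v1/query"
}

# Usage (the vmselect address is an assumption; adjust to your cluster):
#   base=http://vmselect.example:8481/select/0/prometheus
#   instant_query "$base" "$(throttle_ratio_query monitoring vmagent)"
#   instant_query "$base" "$(scrape_duration_query vmagent-vm-stack)"
```

A throttle ratio that stays well above zero, or scrape durations that sit at the configured timeout while CPU is saturated, would point toward resource pressure rather than a problem with the scraped endpoint itself.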

k0nstantinv added the question label Apr 29, 2024

Haleygo self-assigned this Apr 29, 2024
