vmagent job flapping up/down with no errors #6203

Open
3 tasks done
k0nstantinv opened this issue Apr 29, 2024 · 9 comments

k0nstantinv commented Apr 29, 2024

Is your question request related to a specific component?

vmagent

Describe the question in detail

VictoriaMetrics cluster 1.96

VictoriaMetrics vmagent 1.96, clustered with 6 members

vmagent:
  enabled: true
  replicaCount: 1
  shardCount: 6
  scrapeInterval: 30s
  spec:
    image:
      tag: v1.96.0
    updateStrategy: Recreate
    extraArgs:
      loggerFormat: json
      promscrape.cluster.name: prod-euno1-devpl-0
      promscrape.disableCompression: "false"
      promscrape.discovery.concurrency: "200"
      promscrape.maxDroppedTargets: "5000"
      promscrape.maxScrapeSize: 4GB
      promscrape.noStaleMarkers: "true"
      promscrape.streamParse: "true"
      promscrape.suppressScrapeErrors: "true"
      remoteWrite.maxBlockSize: 200MB
      remoteWrite.maxRowsPerBlock: "11000"
      remoteWrite.queues: "170"
      remoteWrite.tlsInsecureSkipVerify: "true"
    externalLabels:
      cluster: prod-euno1-devpl-0
    nodeSelector:
      dedicated-to: victoriametrics
    tolerations:
    - effect: NoSchedule
      key: dedicated-to
      operator: Equal
      value: victoriametrics
    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app.kubernetes.io/name
                  operator: In
                  values:
                    - vmagent
            topologyKey: kubernetes.io/hostname
    resources:
      limits:
        cpu: 5100m
        memory: 7Gi
      requests:
        cpu: 2100m
        memory: 4Gi
  ingress:
    enabled: false
  additionalScrapeConfigs: |
    - job_name: opencost
      honor_labels: true
      scrape_interval: 1m
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      dns_sd_configs:
      - names:
        - opencost.opencost
        type: 'A'
        port: 9003

Everything seems to be fine, but the built-in dashboard always shows the vmagent job flapping up/down:
image

The yellow bars on the graph represent the query up{job="vmagent-vm-stack"}.

VMUI also shows this job going up and down all the time:
image
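
To put a number on the flapping rather than reading it off a graph, the exported samples of up{job="vmagent-vm-stack"} can be scanned for state changes. The sketch below is only an illustration: the sample values are made up, and `count_flaps` is a hypothetical helper, not part of vmagent or VMUI. The PromQL function `changes()` applied to the same series should give a comparable count.

```python
from typing import Sequence


def count_flaps(up_samples: Sequence[int]) -> int:
    """Count how many times consecutive samples switch between 0 and 1."""
    return sum(1 for prev, cur in zip(up_samples, up_samples[1:]) if prev != cur)


# Hypothetical samples for one target, one per scrape (1 = up, 0 = down).
samples = [1, 1, 0, 1, 0, 0, 1, 1]
print(count_flaps(samples))  # prints 4
```

A steadily healthy or steadily failing target yields 0; a high count over a short window points at intermittent failures such as timeouts.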

However, I can't find the reason. There are no pod/container restarts at all:

$ k -n monitoring get pods -o wide | grep vmagent
vmagent-vm-stack-0-955dfdb6c-4lfk6                 2/2     Running   0               61m     10.33.91.209     ip-10-105-149-218.eu-north-1.compute.internal   <none>           <none>
vmagent-vm-stack-1-7db4f47479-c5c26                2/2     Running   0               64m     10.33.85.121     ip-10-105-153-69.eu-north-1.compute.internal    <none>           <none>
vmagent-vm-stack-2-68c6548d75-dnfp7                2/2     Running   0               35m     10.33.105.61     ip-10-105-179-11.eu-north-1.compute.internal    <none>           <none>
vmagent-vm-stack-3-57745d6b67-l9thn                2/2     Running   0               45m     10.33.96.213     ip-10-105-179-85.eu-north-1.compute.internal    <none>           <none>
vmagent-vm-stack-4-586b756895-wz9pw                2/2     Running   0               40m     10.33.101.169    ip-10-105-152-189.eu-north-1.compute.internal   <none>           <none>
vmagent-vm-stack-5-544d5b9dc-wrk6x                 2/2     Running   0               5d1h    10.33.222.245    ip-10-105-178-188.eu-north-1.compute.internal   <none>           <none>

The pods don't log any errors either:

$ k -n monitoring logs -f vmagent-vm-stack-3-57745d6b67-l9thn vmagent | grep -v info
(empty)
^C

Please help me debug this.

Troubleshooting docs

k0nstantinv added the question label Apr 29, 2024
Haleygo (Collaborator) commented Apr 29, 2024

Hello,
How do you scrape vmagent here? Could you attach the scrape job config?
Also, how many vmagent instances are shown on the http://vmagent:8429/targets page, and do they change a lot?

Haleygo self-assigned this Apr 29, 2024
k0nstantinv (Author) commented

@Haleygo thanks for your response! As I posted, vmagent has 6 shards. Unfortunately, it's not possible to post all the targets from the /targets endpoint, since each vmagent only scrapes its own part of the targets. The overall target count is about 9000 (from the vmagent dashboard), and the targets really do change a lot: it is a dev cluster where any developer can create their own test environment. I'm ready to post anything you need to be able to help.
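
To illustrate why no single /targets page shows everything in a sharded setup, here is a minimal sketch of splitting targets across 6 shards with a stable hash. It assumes a hash-modulo scheme purely for illustration: the thread does not show vmagent's actual assignment function, and `shard_for` and the target addresses below are hypothetical.

```python
import hashlib
from collections import Counter


def shard_for(target: str, shard_count: int) -> int:
    """Map a target to one shard with a stable hash (illustrative only)."""
    digest = hashlib.sha256(target.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


# Hypothetical target addresses; 9000 mirrors the approximate count in the thread.
targets = [f"10.0.{i // 250}.{i % 250}:8080" for i in range(9000)]
per_shard = Counter(shard_for(t, 6) for t in targets)

# Each shard ends up with its own slice of the targets.
print(sorted(per_shard.items()))
```

Under such a scheme every member sees only its own slice, so the full picture has to be assembled from all members or from aggregated metrics.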

Haleygo (Collaborator) commented Apr 30, 2024

No problem, I'm only concerned about the vmagent scrape job.
How do you scrape vmagent: using service discovery or static configs? How many instances are shown under the vmagent scrape job, and are they all healthy? If not, the reason should be shown on the /targets page as well.

k0nstantinv (Author) commented Apr 30, 2024

We use the vm-stack chart for our setup.

vmagent is scraped via VMServiceScrape (Kubernetes SD):

apiVersion: operator.victoriametrics.com/v1beta1
kind: VMServiceScrape
metadata:
...
  labels:
....
    managed-by: vm-operator
  name: vmagent-vm-stack
  namespace: monitoring
  ownerReferences:
  - apiVersion: operator.victoriametrics.com/v1beta1
    blockOwnerDeletion: true
    controller: true
    kind: VMAgent
    name: vm-stack
    uid: dff72d61-3e36-4968-bd25-9f48cac8f339
  resourceVersion: "1104257"
  uid: 642c3ef7-fca1-4cee-b146-ad54260bab09
spec:
  endpoints:
  - attach_metadata: {}
    path: /metrics
    port: http
  namespaceSelector: {}
  selector:
    matchExpressions:
    - key: operator.victoriametrics.com/additional-service
      operator: DoesNotExist
    matchLabels:
      app.kubernetes.io/component: monitoring
      app.kubernetes.io/instance: vm-stack
      app.kubernetes.io/name: vmagent
      managed-by: vm-operator

I don't understand what I should post here. Can you provide the exact command or query, please?

UI targets page is really huge due to targets count near 9000

We definitely have unhealthy targets, actually hundreds of them. Is that related to the vmagent job flapping up/down?

Haleygo (Collaborator) commented Apr 30, 2024

UI targets page is really huge due to targets count near 9000

Targets can be filtered in the UI like this; please share the result for the vmagent job:
image

k0nstantinv (Author) commented Apr 30, 2024

Thanks! I can see now. Sometimes it is down
image
sometimes it is up
image

No idea why it's happening.

k0nstantinv (Author) commented

@Haleygo can you please tell me whether there is any recommendation for fixing this? I've tried increasing the scrape timeout with no luck. What else could "context deadline exceeded" mean?

Haleygo (Collaborator) commented May 8, 2024

I can see now. Sometimes it is down

Looks like you have a lot of scrape failures (3062/4783; 2733/4846). They could be caused by resource pressure or a slow network. Could you also check vmagent's CPU usage?
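
For scale, the two pairs quoted above can be turned into failure ratios. This small sketch assumes each pair means failed scrapes over total scrapes, which is how the comment reads but is not stated explicitly; `failure_ratio` is a hypothetical helper.

```python
def failure_ratio(failed: int, total: int) -> float:
    """Return the fraction of scrapes that failed."""
    if total <= 0:
        raise ValueError("total must be positive")
    return failed / total


# The (failed, total) pairs quoted in the comment above.
pairs = [(3062, 4783), (2733, 4846)]
for failed, total in pairs:
    print(f"{failed}/{total} = {failure_ratio(failed, total):.1%}")
# prints:
# 3062/4783 = 64.0%
# 2733/4846 = 56.4%
```

Under that reading, more than half of the scrapes of these targets failed, which is consistent with the up series toggling constantly.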

k0nstantinv (Author) commented

@Haleygo thanks! vmagent shows extremely high CPU usage, and the nodes are at almost 100% CPU usage. I didn't expect that scrape failures could be caused by high CPU usage. How can I determine whether a scrape failure is caused specifically by a lack of resources?
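
One rough heuristic, offered as a sketch and not as a method confirmed in this thread: check whether the scrapes that time out coincide with CPU throttling of the vmagent container. The inputs below are assumed to come from the cAdvisor counters container_cpu_cfs_throttled_periods_total and container_cpu_cfs_periods_total (increase over the same window) and from the target's scrape duration. The function names and the 0.25 threshold are hypothetical.

```python
def cpu_throttle_ratio(throttled_periods: float, total_periods: float) -> float:
    """Fraction of CFS periods in which the container was throttled."""
    if total_periods <= 0:
        return 0.0
    return throttled_periods / total_periods


def likely_resource_bound(
    scrape_duration_s: float,
    scrape_timeout_s: float,
    throttle_ratio: float,
    throttle_threshold: float = 0.25,  # hypothetical cut-off, tune per environment
) -> bool:
    """Flag a scrape that hit its timeout while the scraper was heavily throttled."""
    timed_out = scrape_duration_s >= scrape_timeout_s
    return timed_out and throttle_ratio >= throttle_threshold


# Hypothetical numbers: a scrape that hit a 10s timeout while half of the
# CFS periods in the window were throttled.
ratio = cpu_throttle_ratio(throttled_periods=300, total_periods=600)
print(likely_resource_bound(10.0, 10.0, ratio))  # prints True
```

A timeout with little or no throttling would point elsewhere, for example at a slow target or network. Node-level CPU saturation can also slow scrapes without any container being throttled, so this check covers only one of the possible causes.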
