
vmalert pod restart when promrules refresh #6201

Open · ALEX-yinhao opened this issue Apr 28, 2024 · 6 comments
Labels: question

@ALEX-yinhao commented Apr 28, 2024

Is your question request related to a specific component?

vmalert

Describe the question in detail

I find that vmalert restarts whenever the PrometheusRules are refreshed, but sometimes I don't want the vmalert pod to restart. How can I fix this?
[screenshots]

My vmalert version is v1.89.1.


ALEX-yinhao added the question label on Apr 28, 2024

@Haleygo (Collaborator) commented Apr 29, 2024

Hello,
vmalert supports hot config reload by calling the /-/reload endpoint or using the -configCheckInterval flag.
I'd recommend adding a config reloader sidecar to your vmalert pod, which watches the rule files and calls /-/reload when there is a config update.
You can also use vm-operator to manage vmalert, which includes a config reloader by default.
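For illustration, here is a minimal sidecar sketch along the lines described above, reusing the jimmidyson/configmap-reload image that appears later in this thread. The volume name, mount path, ConfigMap name, and ports are assumptions, not values from this issue:

# Hypothetical excerpt from a vmalert Deployment pod spec. Assumes the rule
# ConfigMap is mounted at /etc/vmalert/rules and vmalert listens on :8080
# (the service port used elsewhere in this thread).
containers:
  - name: vmalert
    image: victoriametrics/vmalert:v1.89.1
    args:
      - -rule=/etc/vmalert/rules/*.yaml
      - -httpListenAddr=:8080
      # -datasource.url / -notifier.config flags omitted for brevity
    volumeMounts:
      - name: rules
        mountPath: /etc/vmalert/rules
  - name: config-reloader
    # Watches the mounted ConfigMap directory and calls vmalert's
    # hot-reload endpoint whenever a file changes.
    image: jimmidyson/configmap-reload:v0.3.0
    args:
      - --volume-dir=/etc/vmalert/rules
      - --webhook-url=http://localhost:8080/-/reload
    volumeMounts:
      - name: rules
        mountPath: /etc/vmalert/rules
        readOnly: true
volumes:
  - name: rules
    configMap:
      name: my-rule-files  # hypothetical ConfigMap name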

Haleygo self-assigned this on Apr 29, 2024

@ALEX-yinhao (Author) replied:

Thanks for your reply.

Unfortunately, it doesn't help. My vmalert is already managed by vm-operator, and the -configCheckInterval flag is set, but the pod still restarts.

My Helm chart config looks like this:

  # bare k8s deployment for vmalert
  vmalert:
    enable: true
    serviceAccount:
      # Specifies whether a service account should be created
      create: true
      # Annotations to add to the service account
      annotations: {}
      # The name of the service account to use.
      # If not set and create is true, a name is generated using the fullname template
      name: ""
    autoscaling:
      enabled: false
      minReplicas: 1
      maxReplicas: 100
      targetCPUUtilizationPercentage: 80
      # targetMemoryUtilizationPercentage: 80
    spec:
      replicaCount: 2
      image:
        repository:
        pullPolicy: Always
        tag: "v1.89.1"
      imagePullSecrets: []
      podAnnotations: {}
      podSecurityContext: {}
      #  fsGroup: 2000
      securityContext: {}
      #  capabilities:
      #    drop:
      #    - ALL
      #  readOnlyRootFilesystem: true
      #  runAsNonRoot: true
      #  runAsUser: 1000
      resources:
        limits:
          cpu: 2
          memory: 2Gi
        requests:
          cpu: 100m
          memory: 128Mi
      # Allowed values: `soft` or `hard`
      podAntiAffinityPreset: hard
      # configMap name of the prometheusRules
      promRules:
      - prometheus-app-telemetry-middleware-prometheus-rulefiles-.+
      extraArgs: {}
        # Lookback defines how far into the past to look when evaluating queries. For example, if the datasource.lookback=5m then param "time" with value now()-5m will be added to every query.
        # datasource.lookback: 5m
        # How far a value can fallback to when evaluating queries. For example, if -datasource.queryStep=15s then param "step" with value "15s" will be added to every query. If set to 0, rule's evaluation interval will be used instead. (default 5m0s)
        # datasource.queryStep: 5m
      # Interval for checking for changes in '-rule' or '-notifier.config' files. 
      # By default the checking is disabled. Send SIGHUP signal in order to force config check for changes.
      configCheckInterval: 60s
      # How often to evaluate the rules (default 1m0s)
      evaluationInterval: 30s
      # External label to be applied for each rule
      externalLabels: []
      # - "prometheus=plat-diamond-metric/diamond-monitor-prometheus"
      # - "prometheus_replica=prometheus-cluster-monitor-diamond-mo-prometheus-0"
    service:
      type: ClusterIP
      port: 8080
    notifierConfig:
      dns_sd_configs:
      - names:
          - alertmanager-operated
        type: 'A'
        port: 9093

@Haleygo (Collaborator) commented May 9, 2024

@ALEX-yinhao, that's not expected.

[screenshot]

From this picture, I don't see the vmalert pods getting restarted when rules are modified; rather, the prometheus-rulefiles- pods (I assume they are not vmalert) restarted. What do the prometheus-rulefiles- pods do here?
Do you see any logs when the vmalert pod gets terminated?

@ALEX-yinhao (Author) commented May 9, 2024

prometheus-app-telemetry-middleware-prometheus-rulefiles- is a ConfigMap created by prometheus-operator. In the past I used Prometheus for alerting; now I use vmalert instead of Prometheus, but I want to keep using the Prometheus rules.
prometheus-operator sometimes regenerates these rule files wholesale, and when that happens vmalert is stopped and a new pod is started, which is why the pod's restart count shows 0.

In the vmalert pod I can't see any error message. I only see a log line like this:

2024-05-09T03:20:48.670Z	info	VictoriaMetrics/app/vmalert/main.go:189	service received signal terminated

@Haleygo (Collaborator) commented May 10, 2024

From the log, something is sending a terminate signal to vmalert. And since there is no config-reloader in the vmalert pod, I'd guess you have some external service doing it.

My Helm chart config looks like this:
...
# configMap name of the prometheusRules
promRules:
- prometheus-app-telemetry-middleware-prometheus-rulefiles-.+

If you already mount all the rule ConfigMaps in the vmalert pod, you can just call the /-/reload endpoint.

And if you're using vm-operator, I'd suggest using VMRule (vm-operator can auto-convert PrometheusRule to VMRule) and enabling ruleSelector in VMAlertSpec, which brings automatic config reload.
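For concreteness, a minimal sketch of that setup; all names, labels, and URLs here are placeholders rather than values from this issue:

# Hypothetical VMAlert custom resource whose ruleSelector picks up VMRule
# objects by label; the operator then hot-reloads vmalert when the selected
# rules change, instead of recreating the pod.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMAlert
metadata:
  name: example-vmalert
spec:
  replicaCount: 2
  evaluationInterval: 30s
  datasource:
    url: http://vmselect.example.svc:8481/select/0/prometheus  # placeholder URL
  notifier:
    url: http://alertmanager-operated:9093
  ruleSelector:
    matchLabels:
      managed-by: vm-operator  # placeholder label
---
# Hypothetical VMRule of the kind vm-operator produces when converting a
# PrometheusRule; it carries the label that ruleSelector above matches.
apiVersion: operator.victoriametrics.com/v1beta1
kind: VMRule
metadata:
  name: example-rules
  labels:
    managed-by: vm-operator
spec:
  groups:
    - name: example
      rules:
        - alert: HighErrorRate
          expr: rate(http_requests_errors_total[5m]) > 0.1  # placeholder expr
          for: 5m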

@ALEX-yinhao (Author) replied:

Yes, about the config-reloader: I set the vmalert config reloader env vars in my vm-operator chart, but in the vmalert pod I can't find the config-reloader. The env vars look like this:

env:
  - name: VM_VMAGENTDEFAULT_CONFIGRELOADIMAGE
    value: registry.sensetime.com/diamond/prometheus-operator/prometheus-config-reloader:v0.48.1
  - name: VM_VMAUTHDEFAULT_CONFIGRELOADIMAGE
    value: registry.sensetime.com/diamond/prometheus-operator/prometheus-config-reloader:v0.48.1
  - name: VM_VMALERTDEFAULT_CONFIGRELOADIMAGE
    value: registry.sensetime.com/diamond/jimmidyson/configmap-reload:v0.3.0
  - name: VM_PODWAITREADYTIMEOUT
    value: "180s"
  - name: VM_PODWAITREADYINTERVALCHECK
    value: "15s"
  - name: VM_PODWAITREADYINITDELAY
    value: "30s"
[screenshot]
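For reference, when the operator does inject the reloader, the pod spec should contain a second container roughly like the one below. This is only a sketch: the image comes from VM_VMALERTDEFAULT_CONFIGRELOADIMAGE above, while the container name, volume dir, and webhook port are assumptions:

# Hypothetical shape of an operator-injected reloader sidecar.
- name: config-reloader
  image: registry.sensetime.com/diamond/jimmidyson/configmap-reload:v0.3.0
  args:
    - --volume-dir=/etc/vmalert/config   # assumed mount path
    - --webhook-url=http://localhost:8080/-/reload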
