[kube-prometheus-stack] Getting false alert of CoreDNS Down #4532

Closed
dk03051996 opened this issue May 10, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@dk03051996

dk03051996 commented May 10, 2024

[Screenshot attached: 2024-05-10 18-09-12]

Describe the bug (a clear and concise description of what the bug is).

I have configured kube-prometheus-stack with Thanos and then configured PagerDuty with Alertmanager. I am seeing false alerts for
KubeControllerManagerDown, KubeProxyDown, KubeSchedulerDown and kube-state-metrics. I can also see the CoreDNS target as unhealthy, showing 0/2 up.

I am using a GKE 1.28.2 cluster.

What's your helm version?

3.14.2

What's your kubectl version?

1.29.3

Which chart?

kube-prometheus-stack

What's the chart version?

55.8.1

What happened?

While configuring Alertmanager, I am seeing multiple issues.

  1. I set resolve_timeout to 5 min, but resolution takes about 10 min (it seems to recheck whether the issue persists before resolving), and triggering an alert takes around 20 minutes.
  2. I am getting false alarms, as explained above.
  3. If I create a manifest of kind PrometheusRule to configure alerts, I cannot see any alerts based on it.
    Also, it is a bit confusing where all the default alerts are managed and how they can be modified.

What you expected to happen?

I expected resolution to happen within 5 min instead of 10 min; alerts also take longer than expected to trigger.

How to reproduce it?

You can use the values.yaml below to configure Alertmanager, then check the alerts on port 9093 and the targets on port 9090.

Enter the changed values of values.yaml?

```yaml
# configuration for prometheus and thanos sidecar setup
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    disableCompaction: true
    retention: 2d
    retentionSize: "1GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard # you can change the storage type depending on the cloud
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi # you can change the storage size
    thanos:
      image: quay.io/thanos/thanos:v0.28.1 # image of thanos to run as sidecar
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore-config # secret that we will create further below
          key: key.json

  thanosService:
    enabled: true # this will enable a service for service discovery
    annotations: {}
    labels: {}
    externalTrafficPolicy: Cluster
    type: ClusterIP
    portName: grpc
    port: 10901
    targetPort: "grpc"
    httpPortName: http
    httpPort: 10902
    targetHttpPort: "http"
    clusterIP: ""
    nodePort: 30901
    httpNodePort: 30902

grafana:
  enabled: false # we are using grafana as a separate chart

# configuration for alertmanager
alertmanager:
  alertmanagerSpec:
    replicas: 1
    forceEnableClusterMode: true # this should be true for the alertmanager status to become ready
  config:
    global:
      pagerduty_url: https://events.pagerduty.com/v2/enqueue
      resolve_timeout: 5m
    receivers:
      - name: "null"
      - name: pagerduty-notifications
        pagerduty_configs:
          - send_resolved: true # resolves the PagerDuty incident once the alert is resolved in Prometheus
            service_key: <> # this key should be your PagerDuty integration (service) key
    route:
      receiver: pagerduty-notifications
      routes: # the routes below can be used to configure multiple receivers based on match
        - match:
            alertname: "Watchdog" # the always-firing default alert is routed to the null receiver
          receiver: 'null'
    templates:
      - /etc/alertmanager/config/*.tmpl
```
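
For reference, here is a minimal sketch of the thanos-objstore-config Secret referenced above, assuming a GCS bucket; the namespace matches the helm command below, the bucket name and service-account JSON are placeholders, and the key.json value must be a valid Thanos object-storage configuration:

```yaml
# Hedged sketch: Secret holding the Thanos object storage configuration.
# The bucket name and credentials below are placeholders, not real values.
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
stringData:
  key.json: |
    type: GCS
    config:
      bucket: my-thanos-bucket   # placeholder bucket name
      service_account: |-
        {"type": "service_account", "project_id": "my-project"}
```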

Enter the command that you execute that is failing/misfunctioning.

helm upgrade --install prometheus --create-namespace -n monitoring prometheus-community/kube-prometheus-stack --version 55.8.1 --values values-custom.yaml

Anything else we need to know?

No response

@dk03051996 added the bug (Something isn't working) label May 10, 2024
@jkroepke
Member

jkroepke commented May 14, 2024

The default configuration is designed for a standard self-managed Kubernetes installation.

For managed clusters like GKE, you need to adjust it and manually disable the affected alerts.

If you create PrometheusRules, ensure the manifests have a release label whose value equals the Helm release name.
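
A minimal sketch of such a PrometheusRule, assuming the Helm release is named prometheus as in the helm command above; the rule name and the always-firing alert are placeholders, only meant to verify that rules carrying the release label are picked up:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-custom-alerts          # placeholder name
  namespace: monitoring
  labels:
    release: prometheus           # must match the Helm release name
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: ExampleAlwaysFiring   # placeholder alert to test rule discovery
          expr: vector(1)
          for: 1m
          labels:
            severity: info
          annotations:
            summary: Test alert to confirm the PrometheusRule is loaded
```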

@dk03051996
Author

hi @jkroepke, can we configure Prometheus to scrape metrics for kube-scheduler, kube-controller-manager and kube-proxy?
I set them to false to remove the alerts:

```yaml
kubeScheduler:
  enabled: false
kubeControllerManager:
  enabled: false
kubeProxy:
  enabled: false
```

@dk03051996
Author

Also, can you please help me understand the behaviour of resolve_timeout? Issues are auto-resolved after 10 min once I have fixed them, but I expect them to resolve within 5 min as configured in values.yaml.

@jkroepke
Member

For example, with kubeControllerManager.enabled you can disable scraping of the controller manager. You will find the other values in the values.yaml.
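
As a hedged sketch, a few other toggles from the chart's values.yaml that are commonly adjusted on managed clusters; this assumes a GKE control plane that does not expose etcd and a cluster running kube-dns rather than CoreDNS, so verify against your cluster before applying:

```yaml
coreDns:
  enabled: false   # drop the CoreDNS target/alert if the cluster runs kube-dns
kubeDns:
  enabled: true    # optionally scrape kube-dns instead, if its metrics sidecars are exposed
kubeEtcd:
  enabled: false   # etcd is not reachable on a managed control plane
```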

> Also, can you please help me understand the behaviour of resolve_timeout? Issues are auto-resolved after 10 min once I have fixed them, but I expect them to resolve within 5 min as configured in values.yaml.

For the Alertmanager question, I would recommend asking the community on Slack (cloud-native.slack.com); there is a channel named #prometheus-alertmanager.

@dk03051996
Author

hi @jkroepke, thanks for responding. I have joined the CNCF Slack workspace, but I can only see the general, hallway and random channels.

@dk03051996
Author

hi @jkroepke, most of my Alertmanager issues are solved. I need one more bit of help understanding the resolve-time behaviour: is it based on the expression we write in the PrometheusRule, since the for: duration decides how long before an alert is sent? And can we change the auto-resolve behaviour once the issue is resolved on our side?
