[kube-prometheus-stack] Getting false alert of CoreDNS Down #4532

Closed
dk03051996 opened this issue May 10, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@dk03051996

dk03051996 commented May 10, 2024

[Screenshot attached: 2024-05-10 18-09-12]

Describe the bug (a clear and concise description of what the bug is).

I have configured kube-prometheus-stack with Thanos and then configured PagerDuty with Alertmanager. I am seeing false alerts for
KubeControllerManagerDown, KubeProxyDown, KubeSchedulerDown and kube-state-metrics. I can also see the CoreDNS target as unhealthy, showing 0/2 up.

I am using a GKE 1.28.2 cluster.

What's your helm version?

3.14.2

What's your kubectl version?

1.29.3

Which chart?

kube-prometheus-stack

What's the chart version?

55.8.1

What happened?

While configuring Alertmanager, I am seeing multiple issues.

  1. I set resolve_timeout to 5 min, but resolution takes about 10 min (it seems to recheck whether the issue persists before resolving), and triggering an alert takes around 20 minutes.
  2. I am getting false alarms, as explained above.
  3. If I create a manifest of kind PrometheusRule to configure alerts, I cannot see any alerts based on it.
    Also, it is a bit confusing where all the default alerts are managed and how they can be modified.

What you expected to happen?

I expected resolution to happen within 5 min instead of 10 min; alerts also take longer than expected to trigger.

How to reproduce it?

You can use the values.yaml below to configure Alertmanager, then check the alerts on port 9093 and the targets on port 9090.

Enter the changed values of values.yaml?

```yaml
# configuration for prometheus and thanos sidecar setup
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    disableCompaction: true
    retention: 2d
    retentionSize: "1GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard # you can change the storage type depending on the cloud
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi # you can change the storage size
    thanos:
      image: quay.io/thanos/thanos:v0.28.1 # image of thanos to run as sidecar
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore-config # secret that we will create further below
          key: key.json

  thanosService:
    enabled: true # this will enable a service for service discovery
    annotations: {}
    labels: {}
    externalTrafficPolicy: Cluster
    type: ClusterIP
    portName: grpc
    port: 10901
    targetPort: "grpc"
    httpPortName: http
    httpPort: 10902
    targetHttpPort: "http"
    clusterIP: ""
    nodePort: 30901
    httpNodePort: 30902

grafana:
  enabled: false # we are using grafana as a separate chart

# configuration for alertmanager
alertmanager:
  alertmanagerSpec:
    replicas: 1
    forceEnableClusterMode: true # this should be true for the alertmanager status to become ready
  config:
    global:
      pagerduty_url: https://events.pagerduty.com/v2/enqueue
      resolve_timeout: 5m
    receivers:
      - name: "null"
      - name: pagerduty-notifications
        pagerduty_configs:
          - send_resolved: true # resolves the PagerDuty incident once the alert is resolved in Prometheus
            service_key: <> # this key should be your PagerDuty integration (service) key
    route:
      receiver: pagerduty-notifications
      routes: # the routes below can be used to configure multiple receivers based on match
        - match:
            alertname: "Watchdog" # the always-firing default alert is routed to the null receiver
          receiver: 'null'
    templates:
      - /etc/alertmanager/config/*.tmpl
```
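
For reference, here is a minimal sketch of the thanos-objstore-config Secret referenced above, assuming a GCS bucket; the namespace matches the helm command below, the bucket name and service-account JSON are placeholders, and the key.json value must be a valid Thanos object-storage configuration:

```yaml
# Hedged sketch: Secret holding the Thanos object storage configuration.
# The bucket name and credentials below are placeholders, not real values.
apiVersion: v1
kind: Secret
metadata:
  name: thanos-objstore-config
  namespace: monitoring
stringData:
  key.json: |
    type: GCS
    config:
      bucket: my-thanos-bucket   # placeholder bucket name
      service_account: |-
        {"type": "service_account", "project_id": "my-project"}
```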

Enter the command that you execute that is failing/misfunctioning.

helm upgrade --install prometheus --create-namespace -n monitoring prometheus-community/kube-prometheus-stack --version 55.8.1 --values values-custom.yaml

Anything else we need to know?

No response

@dk03051996 added the bug (Something isn't working) label May 10, 2024
@jkroepke
Member

jkroepke commented May 14, 2024

The default configuration is designed for a standard self-managed Kubernetes installation.

For managed clusters like GKE, you need to adjust it and manually disable the affected alerts.

If you create PrometheusRules, ensure the manifests have a release label whose value equals the Helm release name.
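
A minimal sketch of such a PrometheusRule, assuming the Helm release is named prometheus as in the helm command above; the rule name and the always-firing alert are placeholders, only meant to verify that rules carrying the release label are picked up:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-custom-alerts          # placeholder name
  namespace: monitoring
  labels:
    release: prometheus           # must match the Helm release name
spec:
  groups:
    - name: custom.rules
      rules:
        - alert: ExampleAlwaysFiring   # placeholder alert to test rule discovery
          expr: vector(1)
          for: 1m
          labels:
            severity: info
          annotations:
            summary: Test alert to confirm the PrometheusRule is loaded
```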

@dk03051996
Author

hi @jkroepke, can we configure Prometheus to scrape metrics for kube-scheduler, kube-controller-manager and kube-proxy?
I set them to false to remove the alerts:

```yaml
kubeScheduler:
  enabled: false
kubeControllerManager:
  enabled: false
kubeProxy:
  enabled: false
```

@dk03051996
Author

Also, can you please help me understand the behaviour of resolve_timeout? Issues are auto-resolved after 10 min once I have fixed them, but I expect them to resolve within 5 min as configured in values.yaml.

@jkroepke
Member

For example, with kubeControllerManager.enabled you can disable scraping of the controller manager. You will find the other values in the values.yaml.
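
As a hedged sketch, a few other toggles from the chart's values.yaml that are commonly adjusted on managed clusters; this assumes a GKE control plane that does not expose etcd and a cluster running kube-dns rather than CoreDNS, so verify against your cluster before applying:

```yaml
coreDns:
  enabled: false   # drop the CoreDNS target/alert if the cluster runs kube-dns
kubeDns:
  enabled: true    # optionally scrape kube-dns instead, if its metrics sidecars are exposed
kubeEtcd:
  enabled: false   # etcd is not reachable on a managed control plane
```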

> Also, can you please help me understand the behaviour of resolve_timeout? Issues are auto-resolved after 10 min once I have fixed them, but I expect them to resolve within 5 min as configured in values.yaml.

For the Alertmanager question, I would recommend asking the community on Slack (cloud-native.slack.com); there is a channel named #prometheus-alertmanager.

@dk03051996
Author

hi @jkroepke, thanks for responding. I have joined the CNCF Slack workspace, but I can only see the general, hallway and random channels.

@dk03051996
Author

hi @jkroepke, most of my Alertmanager issues are solved. I need one more bit of help understanding the resolve-time behaviour: is it based on the expression we write in the PrometheusRule, since the for: duration decides how long before an alert is sent? And can we change the auto-resolve behaviour once the issue is resolved on our side?
