[kube-prometheus-stack] Getting false alert of CoreDNS Down #4532
The default configuration is designed for a standard self-managed Kubernetes installation. For managed clusters like GKE, you need to adjust the values and manually disable the affected alerts. If you create your own PrometheusRules, ensure the manifests have the expected label.
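On GKE the control-plane components (kube-scheduler, kube-controller-manager, kube-proxy, etcd) are not exposed to workloads, so a common approach is to disable both their scrape targets and the corresponding default rules in values.yaml. A sketch along those lines is below; the exact `defaultRules.rules` key names vary between kube-prometheus-stack chart versions, so verify them against the values.yaml of your chart version (55.8.1 here):

```yaml
# values.yaml fragment: disable targets and default rules for
# control-plane components that are unreachable on managed clusters.
kubeControllerManager:
  enabled: false
kubeScheduler:
  enabled: false
kubeProxy:
  enabled: false
kubeEtcd:
  enabled: false
defaultRules:
  rules:
    kubeControllerManager: false
    kubeProxy: false
    kubeSchedulerAlerting: false  # older chart versions use a single kubeScheduler key
    etcd: false
```

With the targets disabled, the chart no longer creates the Services/ServiceMonitors for those components, so the `*Down` alerts stop firing spuriously.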
hi @jkroepke , can we configure Prometheus to scrape metrics for kube-scheduler, kube-controller-manager and kube-proxy?
Also, can you please help me understand the behaviour of resolve_timeout? Alerts are auto-resolving 10 minutes after I have fixed the issue, but I expect them to resolve within 5 minutes, as configured in values.yaml.
For the Alertmanager question, I would recommend asking the community on Slack (cloud-native.slack.com); there is a channel named #prometheus-alertmanager
hi @jkroepke , thanks for responding. I have joined the CNCF Slack workspace, but I can only see the general, hallway and random channels.
hi @jkroepke , most of my Alertmanager issues are solved. I need one piece of help understanding the resolve-time behaviour: is it based on the expression we write in the PrometheusRule, since the `for` duration per component decides how long before an alert fires? And can we change how quickly an alert auto-resolves once the issue is fixed on our side?
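One detail worth noting here: Alertmanager's global `resolve_timeout` only applies to alerts that arrive without an `EndsAt` timestamp. Prometheus sets `EndsAt` on every alert it sends (roughly a few evaluation/resend intervals into the future), so for Prometheus-originated alerts `resolve_timeout` is effectively ignored, and the observed time-to-resolve is governed by when Prometheus stops refreshing the alert plus Alertmanager's route batching. A sketch of the route timing knobs that actually matter (values shown are the assumed defaults, not recommendations):

```yaml
# values.yaml fragment: a "resolved" notification is sent on the next
# group_interval tick after Prometheus marks the alert resolved, so the
# total delay can exceed the configured resolve_timeout.
alertmanager:
  config:
    route:
      group_wait: 30s      # wait before the first notification for a new group
      group_interval: 5m   # batching interval; also delays resolved notifications
      repeat_interval: 4h  # re-notification cadence for still-firing alerts
```

The `EndsAt` expiry on the Prometheus side plus a `group_interval` tick on the Alertmanager side can plausibly add up to the ~10 minutes observed in this issue.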
Describe the bug
I have configured kube-prometheus-stack with Thanos and then configured PagerDuty with Alertmanager. I am seeing false alerts for
KubeControllerManagerDown, KubeProxyDown, KubeSchedulerDown and kube-state-metrics. I can also see the target for CoreDNS as unhealthy, showing 0/2 endpoints up.
I am using a GKE 1.28.2 cluster.
What's your helm version?
3.14.2
What's your kubectl version?
1.29.3
Which chart?
kube-prometheus-stack
What's the chart version?
55.8.1
What happened?
While configuring Alertmanager, I am seeing multiple issues.
Also, it's a bit confusing where all the default alerts are managed and how they can be modified.
What you expected to happen?
I expected the resolve timeout to take effect within 5 minutes instead of 10, and alerts are also taking longer than expected to trigger.
How to reproduce it?
You can use the YAML below to configure Alertmanager, then see the alerts on port 9093 and the targets on port 9090.
Enter the changed values of values.yaml?
#configuration for prometheus and thanos sidecar setup
prometheus:
  prometheusSpec:
    serviceMonitorSelectorNilUsesHelmValues: false
    podMonitorSelectorNilUsesHelmValues: false
    disableCompaction: true
    retention: 2d
    retentionSize: "1GB"
    storageSpec:
      volumeClaimTemplate:
        spec:
          storageClassName: standard # change the storage class to match your cloud
          accessModes: ["ReadWriteOnce"]
          resources:
            requests:
              storage: 20Gi # adjust the storage size as needed
    thanos:
      image: quay.io/thanos/thanos:v0.28.1 # image of thanos to run as sidecar
      objectStorageConfig:
        existingSecret:
          name: thanos-objstore-config # secret that we will create further
          key: key.json
  thanosService:
    enabled: true # this enables a service for service discovery
grafana:
  enabled: false # we are using grafana as a separate chart
#configuration for alert-manager
alertmanager:
  alertmanagerSpec:
    replicas: 1
    forceEnableClusterMode: true # this should be true for the alertmanager status to become ready
  config:
    global:
      pagerduty_url: https://events.pagerduty.com/v2/enqueue
      resolve_timeout: 5m
    receivers:
      - name: "null"
      - name: pagerduty-notifications
        pagerduty_configs:
          - send_resolved: true # resolves the pagerduty incident when the alert resolves in prometheus
            service_key: <> # this key should be
    route:
      receiver: pagerduty-notifications
      routes: # the below configuration can be used if you want multiple receivers based on match
        - match:
            alertname: Watchdog # the default alert name is "Watchdog" (case-sensitive); route it to null
          receiver: "null"
    templates:
      - /etc/alertmanager/config/*.tmpl
Enter the command that you execute that is failing/misfunctioning.
helm upgrade --install prometheus --create-namespace -n monitoring prometheus-community/kube-prometheus-stack --version 55.8.1 --values values-custom.yaml
Anything else we need to know?
No response