Alerts OOB
Monitoring-operator
Heartbeat
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
DeadMansSwitch | An always-firing Dead Man's Switch alert (instance {{ $labels.instance }}) | 3m | information | vector(1) | This is an alert meant to ensure that the entire alerting pipeline is functional. This alert should always be firing. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
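For reference, a rule such as DeadMansSwitch would look roughly like the sketch below when expressed as a prometheus-operator PrometheusRule manifest. This is an illustrative sketch only: the object name and selector label are hypothetical and must match your Prometheus ruleSelector.

```yaml
# Illustrative sketch only: metadata name and labels are hypothetical.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: heartbeat-rules          # hypothetical name
  labels:
    role: alert-rules            # assumed label; must match the Prometheus ruleSelector
spec:
  groups:
    - name: Heartbeat
      rules:
        - alert: DeadMansSwitch
          expr: vector(1)
          for: 3m
          labels:
            severity: information
          annotations:
            summary: "An always-firing Dead Man's Switch alert (instance {{ $labels.instance }})"
            description: "This is an alert meant to ensure that the entire alerting pipeline is functional. This alert should always be firing."
```

An external monitor (or an Alertmanager heartbeat integration) is then expected to raise the alarm when this always-firing signal stops arriving, which is what makes it a dead man's switch.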
SelfMonitoring
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
PrometheusJobMissing | Prometheus job missing (instance {{ $labels.instance }}) | 5m | warning | absent(up{job=~".*prometheus-pod-monitor"}) | A Prometheus job has disappeared VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTargetMissing | Prometheus target missing (instance {{ $labels.instance }}) | 5m | high | up == 0 | A Prometheus target has disappeared. An exporter might have crashed. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusAllTargetsMissing | Prometheus all targets missing (job {{ $labels.job }}) | 5m | critical | count by(job) (up) == count by(job) (up == 0) | A Prometheus job no longer has any living targets. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusConfigurationReloadFailure | Prometheus configuration reload failure (instance {{ $labels.instance }}) | 5m | warning | prometheus_config_last_reload_successful != 1 | Prometheus configuration reload error VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTooManyRestarts | Prometheus too many restarts (instance {{ $labels.instance }}) | 5m | warning | changes(process_start_time_seconds{job=~".*prometheus-pod-monitor"}[15m]) > 2 | Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusRuleEvaluationFailures | Prometheus rule evaluation failures (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_rule_evaluation_failures_total[3m]) > 0 | Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTemplateTextExpansionFailures | Prometheus template text expansion failures (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_template_text_expansion_failures_total[3m]) > 0 | Prometheus encountered {{ $value }} template text expansion failures VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusRuleEvaluationSlow | Prometheus rule evaluation slow (instance {{ $labels.instance }}) | 5m | warning | prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds | Prometheus rule evaluation took more time than the scheduled interval. This indicates slower storage backend access or an overly complex query. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusNotificationsBacklog | Prometheus notifications backlog (instance {{ $labels.instance }}) | 5m | warning | min_over_time(prometheus_notifications_queue_length[10m]) > 0 | The Prometheus notification queue has not been empty for 10 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTargetEmpty | Prometheus target empty (instance {{ $labels.instance }}) | 5m | critical | prometheus_sd_discovered_targets == 0 | Prometheus has no target in service discovery VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTargetScrapingSlowTwoMinutes | Prometheus target scraping slow (instance {{ $labels.instance }}) for 2 minutes | 5m | warning | (prometheus_target_interval_length_seconds{interval="2m0s", quantile="0.9"}) > 135 | Prometheus is scraping exporters slowly VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTargetScrapingSlowOneMinute | Prometheus target scraping slow (instance {{ $labels.instance }}) for 1 minute | 5m | warning | (prometheus_target_interval_length_seconds{interval="1m0s", quantile="0.9"}) > 70 | Prometheus is scraping exporters slowly VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTargetScrapingSlowThirtySeconds | Prometheus target scraping slow (instance {{ $labels.instance }}) for 30 seconds | 5m | warning | (prometheus_target_interval_length_seconds{interval="30s", quantile="0.9"}) > 35 | Prometheus is scraping exporters slowly VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusLargeScrape | Prometheus large scrape (instance {{ $labels.instance }}) | 5m | warning | increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10 | Prometheus has many scrapes that exceed the sample limit VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTargetScrapeDuplicate | Prometheus target scrape duplicate (instance {{ $labels.instance }}) | 5m | warning | increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0 | Prometheus has many samples rejected due to duplicate timestamps but different values VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTsdbCheckpointCreationFailures | Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0 | Prometheus encountered {{ $value }} checkpoint creation failures VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTsdbCheckpointDeletionFailures | Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0 | Prometheus encountered {{ $value }} checkpoint deletion failures VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTsdbCompactionsFailed | Prometheus TSDB compactions failed (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_tsdb_compactions_failed_total[3m]) > 0 | Prometheus encountered {{ $value }} TSDB compaction failures VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTsdbHeadTruncationsFailed | Prometheus TSDB head truncations failed (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0 | Prometheus encountered {{ $value }} TSDB head truncation failures VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTsdbReloadFailures | Prometheus TSDB reload failures (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_tsdb_reloads_failures_total[3m]) > 0 | Prometheus encountered {{ $value }} TSDB reload failures VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTsdbWalCorruptions | Prometheus TSDB WAL corruptions (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0 | Prometheus encountered {{ $value }} TSDB WAL corruptions VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusTsdbWalTruncationsFailed | Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }}) | 5m | critical | increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0 | Prometheus encountered {{ $value }} TSDB WAL truncation failures VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
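Rules like these can be exercised offline with promtool before they reach the cluster. The snippet below is a minimal sketch of a `promtool test rules` unit test for PrometheusTargetMissing; the file names, job, and instance values are hypothetical.

```yaml
# self-monitoring-test.yaml -- hypothetical unit test, run with:
#   promtool test rules self-monitoring-test.yaml
rule_files:
  - self-monitoring.rules.yaml      # assumed file containing the rules above
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'up{job="example-exporter", instance="example:9100"}'
        values: '0x10'              # the target reports down for 10 evaluation steps
    alert_rule_test:
      - eval_time: 6m               # past the 5m "for" duration
        alertname: PrometheusTargetMissing
        exp_alerts:
          - exp_labels:
              severity: high
              job: example-exporter
              instance: example:9100
```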
AlertManager
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
PrometheusAlertmanagerConfigurationReloadFailure | Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }}) | 5m | warning | alertmanager_config_last_reload_successful != 1 | AlertManager configuration reload error VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusNotConnectedToAlertmanager | Prometheus not connected to alertmanager (instance {{ $labels.instance }}) | 5m | critical | prometheus_notifications_alertmanagers_discovered < 1 | Prometheus cannot connect to the alertmanager VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
PrometheusAlertmanagerNotificationFailing | Prometheus AlertManager notification failing (instance {{ $labels.instance }}) | 5m | high | rate(alertmanager_notifications_failed_total[2m]) > 0 | Alertmanager is failing to send notifications VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
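The severity labels used throughout these tables (warning, high, critical) are what Alertmanager routes on. As an illustration only, a route fragment like the one below could page on critical alerts and send everything else to a ticket queue; the receiver names are hypothetical and the matchers syntax assumes Alertmanager v0.22 or later.

```yaml
# Hypothetical Alertmanager routing fragment; receiver names are made up.
route:
  receiver: team-tickets            # default receiver
  routes:
    - matchers:
        - severity = "critical"
      receiver: oncall-pager
    - matchers:
        - severity =~ "high|warning"
      receiver: team-tickets
receivers:
  - name: oncall-pager              # a real config attaches pagerduty_configs, etc.
  - name: team-tickets
```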
KubernetesAlerts
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
KubernetesNodeReady | Kubernetes Node ready (instance {{ $labels.instance }}) | 5m | critical | kube_node_status_condition{condition="Ready",status="true"} == 0 | Node {{ $labels.node }} has been unready for a long time VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesMemoryPressure | Kubernetes memory pressure (instance {{ $labels.instance }}) | 5m | critical | kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 | {{ $labels.node }} has MemoryPressure condition VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesDiskPressure | Kubernetes disk pressure (instance {{ $labels.instance }}) | 5m | critical | kube_node_status_condition{condition="DiskPressure",status="true"} == 1 | {{ $labels.node }} has DiskPressure condition VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesOutOfDisk | Kubernetes out of disk (instance {{ $labels.instance }}) | 5m | critical | kube_node_status_condition{condition="OutOfDisk",status="true"} == 1 | {{ $labels.node }} has OutOfDisk condition VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesJobFailed | Kubernetes Job failed (instance {{ $labels.instance }}) | 5m | warning | kube_job_status_failed > 0 | Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesCronjobSuspended | Kubernetes CronJob suspended (instance {{ $labels.instance }}) | 5m | warning | kube_cronjob_spec_suspend != 0 | CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesPersistentvolumeclaimPending | Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }}) | 5m | warning | kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1 | PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesPersistentvolumeError | Kubernetes PersistentVolume error (instance {{ $labels.instance }}) | 5m | critical | (kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"}) > 0 | Persistent volume is in bad state VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesVolumeOutOfDiskSpaceWarning | Kubernetes Volume out of disk space (instance {{ $labels.instance }}) | 2m | warning | (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 25 | Volume is almost full (< 25% left) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesVolumeOutOfDiskSpaceHigh | Kubernetes Volume out of disk space (instance {{ $labels.instance }}) | 2m | high | (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 10 | Volume is almost full (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesVolumeFullInFourDays | Kubernetes Volume full in four days (instance {{ $labels.instance }}) | 10m | warning | predict_linear(kubelet_volume_stats_available_bytes[6h], 345600) < 0 | {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesStatefulsetDown | Kubernetes StatefulSet down (instance {{ $labels.instance }}) | 5m | critical | kube_statefulset_replicas - kube_statefulset_status_replicas_ready != 0 | A StatefulSet went down VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesPodNotHealthy | Kubernetes Pod not healthy (instance {{ $labels.instance }}) | 5m | critical | min_over_time(sum by (exported_namespace, exported_pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) > 0 | Pod has been in a non-ready state for longer than an hour. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesPodCrashLooping | Kubernetes pod crash looping (instance {{ $labels.instance }}) | 5m | warning | (rate(kube_pod_container_status_restarts_total[15m]) * 60) * 5 > 5 | Pod {{ $labels.pod }} is crash looping VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesReplicassetMismatch | Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }}) | 5m | warning | kube_replicaset_spec_replicas - kube_replicaset_status_ready_replicas != 0 | ReplicaSet replicas mismatch VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesDeploymentReplicasMismatch | Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }}) | 5m | warning | kube_deployment_spec_replicas - kube_deployment_status_replicas_available != 0 | Deployment Replicas mismatch VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesStatefulsetReplicasMismatch | Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }}) | 5m | warning | kube_statefulset_status_replicas_ready - kube_statefulset_status_replicas != 0 | A StatefulSet has not matched the expected number of replicas for longer than 15 minutes. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesDeploymentGenerationMismatch | Kubernetes Deployment generation mismatch (instance {{ $labels.instance }}) | 5m | critical | kube_deployment_status_observed_generation - kube_deployment_metadata_generation != 0 | A Deployment has failed but has not been rolled back. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesStatefulsetGenerationMismatch | Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }}) | 5m | critical | kube_statefulset_status_observed_generation - kube_statefulset_metadata_generation != 0 | A StatefulSet has failed but has not been rolled back. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesStatefulsetUpdateNotRolledOut | Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }}) | 5m | critical | max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated) | StatefulSet update has not been rolled out. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesDaemonsetRolloutStuck | Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }}) | 5m | critical | (((kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled) * 100) < 100) or (kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0) | Some Pods of DaemonSet are not scheduled or not ready VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesDaemonsetMisscheduled | Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }}) | 5m | critical | kube_daemonset_status_number_misscheduled > 0 | Some DaemonSet Pods are running where they are not supposed to run VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesCronjobTooLong | Kubernetes CronJob too long (instance {{ $labels.instance }}) | 5m | warning | time() - kube_cronjob_next_schedule_time > 3600 | CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesJobCompletion | Kubernetes job completion (instance {{ $labels.instance }}) | 5m | critical | (kube_job_spec_completions - kube_job_status_succeeded > 0) or (kube_job_status_failed > 0) | Kubernetes Job failed to complete VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesApiServerErrors | Kubernetes API server errors (instance {{ $labels.instance }}) | 5m | critical | (sum(rate(apiserver_request_count{job="kube-apiserver",code=~"(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="kube-apiserver"}[2m]))) * 100 > 3 | Kubernetes API server is experiencing high error rate VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
ApiServerRequestsSlow | API Server requests are slow (instance {{ $labels.instance }}) | 5m | warning | histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) > 0.2 | HTTP requests slowing down, 99th quantile is over 0.2s for 5 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
ControllerWorkQueueDepth | Controller work queue depth is more than 10 (instance {{ $labels.instance }}) | 5m | warning | sum(workqueue_depth) > 10 | Controller work queue depth is more than 10 VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesApiClientErrors | Kubernetes API client errors (instance {{ $labels.instance }}) | 5m | critical | (sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 5 | Kubernetes API client is experiencing high error rate VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesClientCertificateExpiresNextWeek | Kubernetes client certificate expires next week (instance {{ $labels.instance }}) | 5m | warning | (apiserver_client_certificate_expiration_seconds_count{job="kubelet"}) > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="kubelet"}[5m]))) < 604800 | A client certificate used to authenticate to the apiserver is expiring next week. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
KubernetesClientCertificateExpiresSoon | Kubernetes client certificate expires soon (instance {{ $labels.instance }}) | 5m | critical | (apiserver_client_certificate_expiration_seconds_count{job="kubelet"}) > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="kubelet"}[5m]))) < 86400 | A client certificate used to authenticate to the apiserver is expiring in less than 24.0 hours. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
NodeProcesses
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
CountPidsAndThreadOutOfLimit | Host high PIDs and Threads usage (instance {{ $labels.instance }}) | 5m | high | (sum(container_processes) by (node) + on (node) label_replace(node_processes_threads * on(instance) group_left(nodename) (node_uname_info), "node", "$1", "nodename", "(.+)")) / on (node) label_replace(node_processes_max_processes * on(instance) group_left(nodename) (node_uname_info), "node", "$1", "nodename", "(.+)") * 100 > 80 | The sum of the node's PIDs and threads is approaching the limit (< 20% left) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
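The expression above joins container metrics to node_exporter metrics by copying the nodename label from node_uname_info into a node label. If you want to inspect the intermediate values, the join can be split into recording rules; the sketch below is only an illustration and the rule names are hypothetical.

```yaml
# Hypothetical recording rules that break the CountPidsAndThreadOutOfLimit
# expression into inspectable pieces; rules in a group are evaluated in order.
groups:
  - name: node-processes-helpers
    rules:
      - record: node:threads:count
        # Copy the nodename label from node_uname_info into a "node" label
        # so node_exporter series line up with container_processes.
        expr: label_replace(node_processes_threads * on(instance) group_left(nodename) node_uname_info, "node", "$1", "nodename", "(.+)")
      - record: node:pid_max:limit
        expr: label_replace(node_processes_max_processes * on(instance) group_left(nodename) node_uname_info, "node", "$1", "nodename", "(.+)")
      - record: node:pids_and_threads:usage_percent
        # Same ratio the alert evaluates: (container PIDs + host threads) / PID limit.
        expr: (sum(container_processes) by (node) + on(node) node:threads:count) / on(node) node:pid_max:limit * 100
```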
NodeExporters
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
NodeDiskUsageIsMoreThanThreshold | Disk usage on node > 70% (instance {{ $labels.node }}) | 5m | warning | (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"})) > 70 | Node {{ $labels.node }} disk usage of {{ $labels.mountpoint }} is VALUE = {{ $value }}% | {} | {}
NodeDiskUsageIsMoreThanThreshold | Disk usage on node > 90% (instance {{ $labels.node }}) | 5m | high | (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"})) > 90 | Node {{ $labels.node }} disk usage of {{ $labels.mountpoint }} is VALUE = {{ $value }}% | {} | {}
HostOutOfMemory | Host out of memory (instance {{ $labels.instance }}) | 5m | warning | ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) * on(instance) group_left(nodename) node_uname_info < 10 | Node memory is filling up (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostMemoryUnderMemoryPressure | Host memory under memory pressure (instance {{ $labels.instance }}) | 5m | warning | rate(node_vmstat_pgmajfault[2m]) * on(instance) group_left(nodename) node_uname_info > 1000 | The node is under heavy memory pressure. High rate of major page faults VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostUnusualNetworkThroughputIn | Host unusual network throughput in (instance {{ $labels.instance }}) | 5m | warning | ((sum by (instance) (irate(node_network_receive_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024) / 1024 > 100 | Host network interfaces are probably receiving too much data (> 100 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostUnusualNetworkThroughputOut | Host unusual network throughput out (instance {{ $labels.instance }}) | 5m | warning | ((sum by (instance) (irate(node_network_transmit_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024) / 1024 > 100 | Host network interfaces are probably sending too much data (> 100 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostUnusualDiskReadRate | Host unusual disk read rate (instance {{ $labels.instance }}) | 5m | warning | (sum by (instance) (irate(node_disk_read_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024 / 1024 > 50 | Disk is probably reading too much data (> 50 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostUnusualDiskWriteRate | Host unusual disk write rate (instance {{ $labels.instance }}) | 5m | warning | ((sum by (instance) (irate(node_disk_written_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024) / 1024 > 50 | Disk is probably writing too much data (> 50 MB/s) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostOutOfDiskSpace | Host out of disk space (instance {{ $labels.instance }}) | 5m | warning | ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"}) * on(instance) group_left(nodename) node_uname_info < 10 | Disk is almost full (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostDiskWillFillIn4Hours | Host disk will fill in 4 hours (instance {{ $labels.instance }}) | 5m | warning | predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 14400) * on(instance) group_left(nodename) node_uname_info < 0 | Disk will fill in 4 hours at current write rate VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostOutOfInodes | Host out of inodes (instance {{ $labels.instance }}) | 5m | warning | ((node_filesystem_files_free{mountpoint ="/"} / node_filesystem_files{mountpoint ="/"}) * 100) * on(instance) group_left(nodename) node_uname_info < 10 | Disk is almost running out of available inodes (< 10% left) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostUnusualDiskReadLatency | Host unusual disk read latency (instance {{ $labels.instance }}) | 5m | warning | (rate(node_disk_read_time_seconds_total[2m]) / rate(node_disk_reads_completed_total[2m])) * on(instance) group_left(nodename) node_uname_info > 100 | Disk latency is growing (read operations > 100ms) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostUnusualDiskWriteLatency | Host unusual disk write latency (instance {{ $labels.instance }}) | 5m | warning | (rate(node_disk_write_time_seconds_total[2m]) / rate(node_disk_writes_completed_total[2m])) * on(instance) group_left(nodename) node_uname_info > 100 | Disk latency is growing (write operations > 100ms) VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HostHighCpuLoad | Host high CPU load (instance {{ $labels.instance }}) | 5m | warning | 100 - ((avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) * on (instance) group_left (nodename) node_uname_info) > 80 | CPU load is > 80% VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
DockerContainers
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
ContainerKilled | Container killed (instance {{ $labels.instance }}) | 5m | warning | time() - container_last_seen > 60 | A container has disappeared VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
ContainerVolumeUsage | Container Volume usage (instance {{ $labels.instance }}) | 5m | warning | (1 - (sum(container_fs_inodes_free) BY (node) / sum(container_fs_inodes_total) BY (node))) * 100 > 80 | Container Volume usage is above 80% VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
ContainerVolumeIoUsage | Container Volume IO usage (instance {{ $labels.instance }}) | 5m | warning | (sum(container_fs_io_current) BY (node, name) * 100) > 80 | Container Volume IO usage is above 80% VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
ContainerHighThrottleRate | Container high throttle rate (instance {{ $labels.instance }}) | 5m | warning | rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1 | Container is being throttled VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HAmode
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
NotHAKubernetesDeploymentAvailableReplicas | Not HA mode: Deployment Available Replicas < 2 (instance {{ $labels.instance }}) | 5m | warning | kube_deployment_status_replicas_available < 2 | Not HA mode: Kubernetes Deployment has less than 2 available replicas VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
NotHAKubernetesStatefulSetAvailableReplicas | Not HA mode: StatefulSet Available Replicas < 2 (instance {{ $labels.instance }}) | 5m | warning | kube_statefulset_status_replicas_available < 2 | Not HA mode: Kubernetes StatefulSet has less than 2 available replicas VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
NotHAKubernetesDeploymentDesiredReplicas | Not HA mode: Deployment Desired Replicas < 2 (instance {{ $labels.instance }}) | 5m | warning | kube_deployment_status_replicas < 2 | Not HA mode: Kubernetes Deployment has less than 2 desired replicas VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
NotHAKubernetesStatefulSetDesiredReplicas | Not HA mode: StatefulSet Desired Replicas < 2 (instance {{ $labels.instance }}) | 5m | warning | kube_statefulset_status_replicas < 2 | Not HA mode: Kubernetes StatefulSet has less than 2 desired replicas VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
NotHAKubernetesDeploymentMultiplePodsPerNode | Not HA mode: Deployment Has Multiple Pods per Node (instance {{ $labels.instance }}) | 5m | warning | count(sum(kube_pod_info{node=~".+", created_by_kind="ReplicaSet"}) by (namespace, node, created_by_name) > 1) > 0 | Not HA mode: Kubernetes Deployment has 2 or more replicas on the same node VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
NotHAKubernetesStatefulSetMultiplePodsPerNode | Not HA mode: StatefulSet Has Multiple Pods per Node (instance {{ $labels.instance }}) | 5m | warning | count(sum(kube_pod_info{node=~".+", created_by_kind="StatefulSet"}) by (namespace, node, created_by_name) > 1) > 0 | Not HA mode: Kubernetes StatefulSet has 2 or more replicas on the same node VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HAproxy
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
HaproxyDown | HAProxy down (instance {{ $labels.instance }}) | 5m | critical | haproxy_up == 0 | HAProxy down VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyBackendConnectionErrors | HAProxy backend connection errors (instance {{ $labels.instance }}) | 5m | critical | sum by (backend) (rate(haproxy_backend_connection_errors_total[2m])) > 10 | Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 10 req/s). Request throughput may be too high. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyServerResponseErrors | HAProxy server response errors (instance {{ $labels.instance }}) | 5m | critical | sum by (server) (rate(haproxy_server_response_errors_total[2m])) > 5 | Too many response errors to {{ $labels.server }} server (> 5 req/s). VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyServerConnectionErrors | HAProxy server connection errors (instance {{ $labels.instance }}) | 5m | critical | sum by (server) (rate(haproxy_server_connection_errors_total[2m])) > 10 | Too many connection errors to {{ $labels.server }} server (> 10 req/s). Request throughput may be too high. VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyPendingRequests | HAProxy pending requests (instance {{ $labels.instance }}) | 5m | warning | sum by (backend) (haproxy_backend_current_queue) > 0 | Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyHttpSlowingDown | HAProxy HTTP slowing down (instance {{ $labels.instance }}) | 5m | warning | avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 2 | Average request time is increasing VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyRetryHigh | HAProxy retry high (instance {{ $labels.instance }}) | 5m | warning | sum by (backend) (rate(haproxy_backend_retry_warnings_total[5m])) > 10 | High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyBackendDown | HAProxy backend down (instance {{ $labels.instance }}) | 5m | critical | haproxy_backend_up == 0 | HAProxy backend is down VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyServerDown | HAProxy server down (instance {{ $labels.instance }}) | 5m | critical | haproxy_server_up == 0 | HAProxy server is down VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyFrontendSecurityBlockedRequests | HAProxy frontend security blocked requests (instance {{ $labels.instance }}) | 5m | warning | sum by (frontend) (rate(haproxy_frontend_requests_denied_total[5m])) > 10 | HAProxy is blocking requests for security reasons VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
HaproxyServerHealthcheckFailure | HAProxy server healthcheck failure (instance {{ $labels.instance }}) | 5m | warning | increase(haproxy_server_check_failures_total[5m]) > 0 | Some server healthchecks are failing on {{ $labels.server }} VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
Etcd
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
EtcdInsufficientMembers | Etcd insufficient Members (instance {{ $labels.instance }}) | 5m | critical | count(etcd_server_id{job="etcd"}) % 2 == 0 | Etcd cluster should have an odd number of members VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdNoLeader | Etcd no Leader (instance {{ $labels.instance }}) | 5m | critical | etcd_server_has_leader == 0 | Etcd cluster has no leader VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdHighNumberOfLeaderChanges | Etcd high number of leader changes (instance {{ $labels.instance }}) | 5m | warning | increase(etcd_server_leader_changes_seen_total[1h]) > 3 | Etcd leader changed more than 3 times during the last hour VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdHighNumberOfFailedGrpcRequests | Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) | 5m | warning | sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method) > 0.01 | More than 1% GRPC request failure detected in Etcd for 5 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdHighNumberOfFailedGrpcRequests | Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) | 5m | critical | sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method) > 0.05 | More than 5% GRPC request failure detected in Etcd for 5 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdGrpcRequestsSlow | Etcd GRPC requests slow (instance {{ $labels.instance }}) | 5m | warning | histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd",grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15 | GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdMemberCommunicationSlow | Etcd member communication slow (instance {{ $labels.instance }}) | 5m | warning | histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job="etcd"}[5m])) > 0.15 | Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdHighNumberOfFailedProposals | Etcd high number of failed proposals (instance {{ $labels.instance }}) | 5m | warning | increase(etcd_server_proposals_failed_total[1h]) > 5 | Etcd server got more than 5 failed proposals in the past hour VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdHighFsyncDurations | Etcd high fsync durations (instance {{ $labels.instance }}) | 5m | warning | histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 | Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
EtcdHighCommitDurations | Etcd high commit durations (instance {{ $labels.instance }}) | 5m | warning | histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25 | Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes VALUE = {{ $value }} LABELS: {{ $labels }} | {} | {}
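Several rules above exist in a warning and a critical variant over the same expression (for example EtcdHighNumberOfFailedGrpcRequests). To avoid double notifications, the critical form can inhibit the warning form in Alertmanager; the fragment below is a sketch only and assumes the v0.22+ matchers syntax.

```yaml
# Hypothetical inhibit rule: a firing critical alert mutes the warning alert
# that shares the same alertname and instance.
inhibit_rules:
  - source_matchers:
      - severity = "critical"
    target_matchers:
      - severity = "warning"
    equal: ["alertname", "instance"]
```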
NginxIngressAlerts
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
NginxHighHttp4xxErrorRate | Nginx high HTTP 4xx error rate (node: {{ $labels.node }}, namespace: {{ $labels.exported_namespace }}, ingress: {{ $labels.ingress }}) | 1m | high | sum by (ingress, exported_namespace, node) (rate(nginx_ingress_controller_requests{status=~"^4.."}[2m])) / sum by (ingress, exported_namespace, node)(rate(nginx_ingress_controller_requests[2m])) * 100 > 5 | Too many HTTP requests with status 4xx (> 5%) VALUE = {{ $value }} LABELS = {{ $labels }} | {} | {}
NginxHighHttp5xxErrorRate | Nginx high HTTP 5xx error rate (node: {{ $labels.node }}, namespace: {{ $labels.exported_namespace }}, ingress: {{ $labels.ingress }}) | 1m | high | sum by (ingress, exported_namespace, node) (rate(nginx_ingress_controller_requests{status=~"^5.."}[2m])) / sum by (ingress, exported_namespace, node) (rate(nginx_ingress_controller_requests[2m])) * 100 > 5 | Too many HTTP requests with status 5xx (> 5%) VALUE = {{ $value }} LABELS = {{ $labels }} | {} | {}
NginxLatencyHigh | Nginx latency high (node: {{ $labels.node }}, host: {{ $labels.host }}) | 2m | warning | histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3 | Nginx p99 latency is higher than 3 seconds VALUE = {{ $value }} LABELS = {{ $labels }} | {} | {}
CoreDnsAlerts
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
CorednsPanicCount | CoreDNS Panic Count (instance {{ $labels.instance }}) | 0m | critical | increase(coredns_panics_total[1m]) > 0 | Number of CoreDNS panics encountered VALUE = {{ $value }} LABELS = {{ $labels }} | {} | {}
CoreDNSLatencyHigh | CoreDNS has high latency | 5m | critical | histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[2m])) by(server, zone, le)) > 3 | CoreDNS has 99th percentile latency of {{ $value }} seconds for server {{ $labels.server }} zone {{ $labels.zone }} | {} | {}
CoreDNSForwardHealthcheckFailureCount | CoreDNS health checks have failed to upstream server | 5m | warning | sum(rate(coredns_forward_healthcheck_broken_total[2m])) > 0 | CoreDNS health checks have failed to upstream server {{ $labels.to }} | {} | {} |
CoreDNSForwardHealthcheckBrokenCount | CoreDNS health checks have failed for all upstream servers | 5m | warning | sum(rate(coredns_forward_healthcheck_broken_total[2m])) > 0 | CoreDNS health checks failed for all upstream servers LABELS = {{ $labels }} | {} | {} |
CoreDNSErrorsHigh | CoreDNS is returning SERVFAIL | 5m | critical | sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_dns_responses_total[2m])) > 0.03 | CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests | {} | {}
CoreDNSErrorsHigh | CoreDNS is returning SERVFAIL | 5m | warning | sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_dns_responses_total[2m])) > 0.01 | CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests | {} | {}
CoreDNSForwardLatencyHigh | CoreDNS has 99th percentile latency for forwarding requests | 5m | critical | histogram_quantile(0.99, sum(rate(coredns_forward_request_duration_seconds_bucket[2m])) by(to, le)) > 3 | CoreDNS has 99th percentile latency of {{ $value }} seconds forwarding requests to {{ $labels.to }} | {} | {} |
CoreDNSForwardErrorsHigh | CoreDNS is returning SERVFAIL for forward requests | 5m | critical | sum(rate(coredns_forward_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_forward_responses_total[2m])) > 0.03 | CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }} | {} | {}
CoreDNSForwardErrorsHigh | CoreDNS is returning SERVFAIL for forward requests | 5m | warning | sum(rate(coredns_forward_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_forward_responses_total[2m])) > 0.01 | CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }} | {} | {}
DRAlerts
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
ProbeFailed | Probe failed (instance: {{ $labels.instance }}) | 5m | critical | probe_success == 0 | Probe failed\n VALUE = {{ $value }}\n LABELS: {{ $labels }} | {} | {} |
SlowProbe | Slow probe (instance: {{ $labels.instance }}) | 5m | warning | avg_over_time(probe_duration_seconds[1m]) > 1 | Blackbox probe took more than 1s to complete\n VALUE = {{ $value }}\n LABELS: {{ $labels }} | {} | {} |
HttpStatusCode | HTTP Status Code (instance: {{ $labels.instance }}) | 5m | high | probe_http_status_code <= 199 OR probe_http_status_code >= 400 | HTTP status code is not 200-399\n VALUE = {{ $value }}\n LABELS: {{ $labels }} | {} | {} |
HttpSlowRequests | HTTP slow requests (instance: {{ $labels.instance }}) | 5m | warning | avg_over_time(probe_http_duration_seconds[1m]) > 1 | HTTP request took more than 1s\n VALUE = {{ $value }}\n LABELS: {{ $labels }} | {} | {} |
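The DR alerts above consume probe_* metrics produced by the blackbox exporter. For orientation only, a minimal module definition that yields probe_success, probe_duration_seconds, and probe_http_status_code might look like the sketch below; the module name is hypothetical and the modules actually shipped with the platform may differ.

```yaml
# Hypothetical blackbox_exporter module; targets are passed at scrape time
# via the exporter's /probe?module=http_2xx&target=... endpoint.
modules:
  http_2xx:
    prober: http
    timeout: 5s
    http:
      method: GET
      valid_status_codes: []    # empty list defaults to 2xx
```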
BackupAlerts
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
Last Backup Failed | Last backup made by pod {{ $labels.pod }} in namespace {{ $labels.namespace }} failed. | 1m | warning | backup_storage_last_failed != 0 | Last backup made by pod {{ $labels.pod }} in namespace {{ $labels.namespace }} failed.\n VALUE = {{ $value }}\n LABELS: {{ $labels }} | {} | {} |
Cert-exporter
Cert-exporter rules
Alerting rules
Name | Summary | For | Severity | Expression | Description | Other labels | Other annotations |
---|---|---|---|---|---|---|---|
FileCerts30DaysRemaining | Certificates from files expire within 30 days | 10m | warning | count(86400 * 7 < cert_exporter_cert_expires_in_seconds <= 86400 * 30) > 0 | Some certificates from files expire within 30 days. | {} | {} |
FileCerts7DaysRemaining | Certificates from files expire within 7 days | 10m | high | count(0 < cert_exporter_cert_expires_in_seconds <= 86400 * 7) > 0 | Some certificates from files expire within 7 days. | {} | {} |
FileCertsExpired | Certificates from files expired | 10m | critical | count(cert_exporter_cert_expires_in_seconds <= 0) > 0 | Some certificates from files already expired. | {} | {} |
KubeconfigCerts30DaysRemaining | Certificates from kubeconfig expire within 30 days | 10m | warning | count(86400 * 7 < cert_exporter_kubeconfig_expires_in_seconds <= 86400 * 30) > 0 | Some certificates from kubeconfig expire within 30 days. | {} | {} |
KubeconfigCerts7DaysRemaining | Certificates from kubeconfig expire within 7 days | 10m | high | count(0 < cert_exporter_kubeconfig_expires_in_seconds <= 86400 * 7) > 0 | Some certificates from kubeconfig expire within 7 days. | {} | {} |
KubeconfigCertsExpired | Certificates from kubeconfig expired | 10m | critical | count(cert_exporter_kubeconfig_expires_in_seconds <= 0) > 0 | Some certificates from kubeconfig already expired. | {} | {} |
SecretCerts30DaysRemaining | Certificates from secrets expire within 30 days | 10m | warning | count(86400 * 7 < cert_exporter_secret_expires_in_seconds <= 86400 * 30) > 0 | Some certificates from secrets expire within 30 days. | {} | {} |
SecretCerts7DaysRemaining | Certificates from secrets expire within 7 days | 10m | high | count(0 < cert_exporter_secret_expires_in_seconds <= 86400 * 7) > 0 | Some certificates from secrets expire within 7 days. | {} | {} |
SecretCertsExpired | Certificates from secrets expired | 10m | critical | count(cert_exporter_secret_expires_in_seconds <= 0) > 0 | Some certificates from secrets already expired. | {} | {} |