Alerts OOB

Monitoring-operator

Heartbeat

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
DeadMansSwitch An always-firing Dead Man's Switch alert (instance {{ $labels.instance }}) 3m information vector(1) This is an alert meant to ensure that the entire alerting pipeline is functional. This alert should always be firing.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
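
Read each row as the fields of a Prometheus alerting rule: Name is the alert name, For is the pending period, Severity is a label, Expression is the PromQL query, and Summary/Description are annotations. As a minimal sketch (the metadata name, namespace, and group layout are assumptions, not the shipped manifest), the DeadMansSwitch row above maps onto a PrometheusRule object like this:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: heartbeat-rules        # assumption: illustrative name
  namespace: monitoring        # assumption: illustrative namespace
spec:
  groups:
    - name: Heartbeat
      rules:
        - alert: DeadMansSwitch
          expr: vector(1)      # always returns 1, so the alert never stops firing
          for: 3m
          labels:
            severity: information
          annotations:
            summary: An always-firing Dead Man's Switch alert (instance {{ $labels.instance }})
            description: |-
              This is an alert meant to ensure that the entire alerting pipeline is functional. This alert should always be firing.
              VALUE = {{ $value }}
              LABELS: {{ $labels }}
```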

SelfMonitoring

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
PrometheusJobMissing Prometheus job missing (instance {{ $labels.instance }}) 5m warning absent(up{job=~".*prometheus-pod-monitor"}) A Prometheus job has disappeared
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTargetMissing Prometheus target missing (instance {{ $labels.instance }}) 5m high up == 0 A Prometheus target has disappeared. An exporter might have crashed.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusAllTargetsMissing Prometheus all targets missing (job {{ $labels.job }}) 5m critical count by(job) (up) == count by(job) (up == 0) A Prometheus job does not have any living targets anymore.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusConfigurationReloadFailure Prometheus configuration reload failure (instance {{ $labels.instance }}) 5m warning prometheus_config_last_reload_successful != 1 Prometheus configuration reload error
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTooManyRestarts Prometheus too many restarts (instance {{ $labels.instance }}) 5m warning changes(process_start_time_seconds{job=~".*prometheus-pod-monitor"}[15m]) > 2 Prometheus has restarted more than twice in the last 15 minutes. It might be crashlooping.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusRuleEvaluationFailures Prometheus rule evaluation failures (instance {{ $labels.instance }}) 5m critical increase(prometheus_rule_evaluation_failures_total[3m]) > 0 Prometheus encountered {{ $value }} rule evaluation failures, leading to potentially ignored alerts.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTemplateTextExpansionFailures Prometheus template text expansion failures (instance {{ $labels.instance }}) 5m critical increase(prometheus_template_text_expansion_failures_total[3m]) > 0 Prometheus encountered {{ $value }} template text expansion failures
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusRuleEvaluationSlow Prometheus rule evaluation slow (instance {{ $labels.instance }}) 5m warning prometheus_rule_group_last_duration_seconds > prometheus_rule_group_interval_seconds Prometheus rule evaluation took more time than the scheduled interval. It indicates slower storage backend access or a query that is too complex.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusNotificationsBacklog Prometheus notifications backlog (instance {{ $labels.instance }}) 5m warning min_over_time(prometheus_notifications_queue_length[10m]) > 0 The Prometheus notification queue has not been empty for 10 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTargetEmpty Prometheus target empty (instance {{ $labels.instance }}) 5m critical prometheus_sd_discovered_targets == 0 Prometheus has no target in service discovery
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTargetScrapingSlowTwoMinutes Prometheus target scraping slow (instance {{ $labels.instance }}) for 2 minutes 5m warning (prometheus_target_interval_length_seconds{interval="2m0s", quantile="0.9"}) > 135 Prometheus is scraping exporters slowly
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTargetScrapingSlowOneMinute Prometheus target scraping slow (instance {{ $labels.instance }}) for 1 minute 5m warning (prometheus_target_interval_length_seconds{interval="1m0s", quantile="0.9"}) > 70 Prometheus is scraping exporters slowly
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTargetScrapingSlowThirtySeconds Prometheus target scraping slow (instance {{ $labels.instance }}) for 30 seconds 5m warning (prometheus_target_interval_length_seconds{interval="30s", quantile="0.9"}) > 35 Prometheus is scraping exporters slowly
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusLargeScrape Prometheus large scrape (instance {{ $labels.instance }}) 5m warning increase(prometheus_target_scrapes_exceeded_sample_limit_total[10m]) > 10 Prometheus has many scrapes that exceed the sample limit
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTargetScrapeDuplicate Prometheus target scrape duplicate (instance {{ $labels.instance }}) 5m warning increase(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0 Prometheus has many samples rejected due to duplicate timestamps but different values
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTsdbCheckpointCreationFailures Prometheus TSDB checkpoint creation failures (instance {{ $labels.instance }}) 5m critical increase(prometheus_tsdb_checkpoint_creations_failed_total[3m]) > 0 Prometheus encountered {{ $value }} checkpoint creation failures
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTsdbCheckpointDeletionFailures Prometheus TSDB checkpoint deletion failures (instance {{ $labels.instance }}) 5m critical increase(prometheus_tsdb_checkpoint_deletions_failed_total[3m]) > 0 Prometheus encountered {{ $value }} checkpoint deletion failures
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTsdbCompactionsFailed Prometheus TSDB compactions failed (instance {{ $labels.instance }}) 5m critical increase(prometheus_tsdb_compactions_failed_total[3m]) > 0 Prometheus encountered {{ $value }} TSDB compaction failures
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTsdbHeadTruncationsFailed Prometheus TSDB head truncations failed (instance {{ $labels.instance }}) 5m critical increase(prometheus_tsdb_head_truncations_failed_total[3m]) > 0 Prometheus encountered {{ $value }} TSDB head truncation failures
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTsdbReloadFailures Prometheus TSDB reload failures (instance {{ $labels.instance }}) 5m critical increase(prometheus_tsdb_reloads_failures_total[3m]) > 0 Prometheus encountered {{ $value }} TSDB reload failures
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTsdbWalCorruptions Prometheus TSDB WAL corruptions (instance {{ $labels.instance }}) 5m critical increase(prometheus_tsdb_wal_corruptions_total[3m]) > 0 Prometheus encountered {{ $value }} TSDB WAL corruptions
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusTsdbWalTruncationsFailed Prometheus TSDB WAL truncations failed (instance {{ $labels.instance }}) 5m critical increase(prometheus_tsdb_wal_truncations_failed_total[3m]) > 0 Prometheus encountered {{ $value }} TSDB WAL truncation failures
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
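
The expressions in this group can be exercised with promtool's rule-testing format before they reach a cluster. A small sketch for the PrometheusTooManyRestarts expression, with invented job and instance labels and three simulated restarts (value changes) inside the 15-minute window:

```yaml
# Run with: promtool test rules <this-file>.yaml
rule_files: []            # expression-only test; point this at the rule file to test alerts end to end
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # Three changes of process_start_time_seconds within 15 minutes simulate three restarts.
      - series: 'process_start_time_seconds{instance="prometheus-0", job="demo-prometheus-pod-monitor"}'
        values: '100 200 300 400 400 400 400 400 400 400 400'
    promql_expr_test:
      - expr: changes(process_start_time_seconds{job=~".*prometheus-pod-monitor"}[15m]) > 2
        eval_time: 10m
        exp_samples:
          # changes() counts 3 restarts, which is above the threshold of 2.
          - labels: '{instance="prometheus-0", job="demo-prometheus-pod-monitor"}'
            value: 3
```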

AlertManager

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
PrometheusAlertmanagerConfigurationReloadFailure Prometheus AlertManager configuration reload failure (instance {{ $labels.instance }}) 5m warning alertmanager_config_last_reload_successful != 1 AlertManager configuration reload error
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusNotConnectedToAlertmanager Prometheus not connected to alertmanager (instance {{ $labels.instance }}) 5m critical prometheus_notifications_alertmanagers_discovered < 1 Prometheus cannot connect to the Alertmanager
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
PrometheusAlertmanagerNotificationFailing Prometheus AlertManager notification failing (instance {{ $labels.instance }}) 5m high rate(alertmanager_notifications_failed_total[2m]) > 0 Alertmanager is failing to send notifications
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
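
These alerts only reach anyone if Alertmanager routes the severities used throughout this page (information, warning, high, critical) to receivers. A minimal routing sketch, with receiver names and webhook URLs as placeholders rather than shipped configuration; note that the DeadMansSwitch alert from the Heartbeat group is usually forwarded to an external watchdog that pages when it stops arriving:

```yaml
route:
  receiver: default-notifications
  group_by: ['alertname', 'namespace']
  routes:
    # Always-firing heartbeat goes to an external dead-man's-switch endpoint.
    - matchers: ['alertname = "DeadMansSwitch"']
      receiver: watchdog
      repeat_interval: 1m
    - matchers: ['severity =~ "high|critical"']
      receiver: oncall
    - matchers: ['severity = "warning"']
      receiver: team-chat
receivers:
  - name: default-notifications
    webhook_configs:
      - url: http://alert-sink.example.internal/default   # assumption
  - name: watchdog
    webhook_configs:
      - url: http://deadman.example.internal/ping         # assumption
  - name: oncall
    webhook_configs:
      - url: http://alert-sink.example.internal/oncall    # assumption
  - name: team-chat
    webhook_configs:
      - url: http://alert-sink.example.internal/chat      # assumption
```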

KubernetesAlerts

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
KubernetesNodeReady Kubernetes Node ready (instance {{ $labels.instance }}) 5m critical kube_node_status_condition{condition="Ready",status="true"} == 0 Node {{ $labels.node }} has been unready for a long time
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesMemoryPressure Kubernetes memory pressure (instance {{ $labels.instance }}) 5m critical kube_node_status_condition{condition="MemoryPressure",status="true"} == 1 {{ $labels.node }} has MemoryPressure condition
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesDiskPressure Kubernetes disk pressure (instance {{ $labels.instance }}) 5m critical kube_node_status_condition{condition="DiskPressure",status="true"} == 1 {{ $labels.node }} has DiskPressure condition
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesOutOfDisk Kubernetes out of disk (instance {{ $labels.instance }}) 5m critical kube_node_status_condition{condition="OutOfDisk",status="true"} == 1 {{ $labels.node }} has OutOfDisk condition
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesJobFailed Kubernetes Job failed (instance {{ $labels.instance }}) 5m warning kube_job_status_failed > 0 Job {{$labels.namespace}}/{{$labels.exported_job}} failed to complete
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesCronjobSuspended Kubernetes CronJob suspended (instance {{ $labels.instance }}) 5m warning kube_cronjob_spec_suspend != 0 CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is suspended
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesPersistentvolumeclaimPending Kubernetes PersistentVolumeClaim pending (instance {{ $labels.instance }}) 5m warning kube_persistentvolumeclaim_status_phase{phase="Pending"} == 1 PersistentVolumeClaim {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is pending
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesPersistentvolumeError Kubernetes PersistentVolume error (instance {{ $labels.instance }}) 5m critical (kube_persistentvolume_status_phase{phase=~"Failed|Pending",job="kube-state-metrics"}) > 0 PersistentVolume is in a bad state
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesVolumeOutOfDiskSpaceWarning Kubernetes Volume out of disk space (instance {{ $labels.instance }}) 2m warning (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 25 Volume is almost full (< 25% left)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesVolumeOutOfDiskSpaceHigh Kubernetes Volume out of disk space (instance {{ $labels.instance }}) 2m high (kubelet_volume_stats_available_bytes / kubelet_volume_stats_capacity_bytes) * 100 < 10 Volume is almost full (< 10% left)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesVolumeFullInFourDays Kubernetes Volume full in four days (instance {{ $labels.instance }}) 10m warning predict_linear(kubelet_volume_stats_available_bytes[6h], 345600) < 0 {{ $labels.namespace }}/{{ $labels.persistentvolumeclaim }} is expected to fill up within four days. Currently {{ $value | humanize }}% is available.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesStatefulsetDown Kubernetes StatefulSet down (instance {{ $labels.instance }}) 5m critical kube_statefulset_replicas - kube_statefulset_status_replicas_ready != 0 A StatefulSet went down
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesPodNotHealthy Kubernetes Pod not healthy (instance {{ $labels.instance }}) 5m critical min_over_time(sum by (exported_namespace, exported_pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) > 0 Pod has been in a non-ready state for longer than an hour.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesPodCrashLooping Kubernetes pod crash looping (instance {{ $labels.instance }}) 5m warning (rate(kube_pod_container_status_restarts_total[15m]) * 60) * 5 > 5 Pod {{ $labels.pod }} is crash looping
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesReplicassetMismatch Kubernetes ReplicaSet mismatch (instance {{ $labels.instance }}) 5m warning kube_replicaset_spec_replicas - kube_replicaset_status_ready_replicas != 0 ReplicaSet replicas mismatch
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesDeploymentReplicasMismatch Kubernetes Deployment replicas mismatch (instance {{ $labels.instance }}) 5m warning kube_deployment_spec_replicas - kube_deployment_status_replicas_available != 0 Deployment Replicas mismatch
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesStatefulsetReplicasMismatch Kubernetes StatefulSet replicas mismatch (instance {{ $labels.instance }}) 5m warning kube_statefulset_status_replicas_ready - kube_statefulset_status_replicas != 0 A StatefulSet has not matched the expected number of replicas for longer than 15 minutes.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesDeploymentGenerationMismatch Kubernetes Deployment generation mismatch (instance {{ $labels.instance }}) 5m critical kube_deployment_status_observed_generation - kube_deployment_metadata_generation != 0 A Deployment has failed but has not been rolled back.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesStatefulsetGenerationMismatch Kubernetes StatefulSet generation mismatch (instance {{ $labels.instance }}) 5m critical kube_statefulset_status_observed_generation - kube_statefulset_metadata_generation != 0 A StatefulSet has failed but has not been rolled back.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesStatefulsetUpdateNotRolledOut Kubernetes StatefulSet update not rolled out (instance {{ $labels.instance }}) 5m critical max without (revision) (kube_statefulset_status_current_revision unless kube_statefulset_status_update_revision) * (kube_statefulset_replicas != kube_statefulset_status_replicas_updated) StatefulSet update has not been rolled out.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesDaemonsetRolloutStuck Kubernetes DaemonSet rollout stuck (instance {{ $labels.instance }}) 5m critical (((kube_daemonset_status_number_ready / kube_daemonset_status_desired_number_scheduled) * 100) < 100) or (kube_daemonset_status_desired_number_scheduled - kube_daemonset_status_current_number_scheduled > 0) Some Pods of the DaemonSet are not scheduled or not ready
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesDaemonsetMisscheduled Kubernetes DaemonSet misscheduled (instance {{ $labels.instance }}) 5m critical kube_daemonset_status_number_misscheduled > 0 Some DaemonSet Pods are running where they are not supposed to run
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesCronjobTooLong Kubernetes CronJob too long (instance {{ $labels.instance }}) 5m warning time() - kube_cronjob_next_schedule_time > 3600 CronJob {{ $labels.namespace }}/{{ $labels.cronjob }} is taking more than 1h to complete.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesJobCompletion Kubernetes job completion (instance {{ $labels.instance }}) 5m critical (kube_job_spec_completions - kube_job_status_succeeded > 0) or (kube_job_status_failed > 0) Kubernetes Job failed to complete
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesApiServerErrors Kubernetes API server errors (instance {{ $labels.instance }}) 5m critical (sum(rate(apiserver_request_count{job="kube-apiserver",code=~"(?:5..)$"}[2m])) / sum(rate(apiserver_request_count{job="kube-apiserver"}[2m]))) * 100 > 3 Kubernetes API server is experiencing high error rate
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
ApiServerRequestsSlow API Server requests are slow (instance {{ $labels.instance }}) 5m warning histogram_quantile(0.99, rate(apiserver_request_duration_seconds_bucket{verb!="WATCH"}[5m])) > 0.2 HTTP requests are slowing down; the 99th percentile is over 0.2s for 5 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
ControllerWorkQueueDepth Controller work queue depth is more than 10 (instance {{ $labels.instance }}) 5m warning sum(workqueue_depth) > 10 Controller work queue depth is more than 10
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesApiClientErrors Kubernetes API client errors (instance {{ $labels.instance }}) 5m critical (sum(rate(rest_client_requests_total{code=~"(4|5).."}[2m])) by (instance, job) / sum(rate(rest_client_requests_total[2m])) by (instance, job)) * 100 > 5 Kubernetes API client is experiencing high error rate
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesClientCertificateExpiresNextWeek Kubernetes client certificate expires next week (instance {{ $labels.instance }}) 5m warning (apiserver_client_certificate_expiration_seconds_count{job="kubelet"}) > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="kubelet"}[5m]))) < 604800 A client certificate used to authenticate to the apiserver is expiring next week.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
KubernetesClientCertificateExpiresSoon Kubernetes client certificate expires soon (instance {{ $labels.instance }}) 5m critical (apiserver_client_certificate_expiration_seconds_count{job="kubelet"}) > 0 and histogram_quantile(0.01, sum by (job, le) (rate(apiserver_client_certificate_expiration_seconds_bucket{job="kubelet"}[5m]))) < 86400 A client certificate used to authenticate to the apiserver is expiring in less than 24 hours.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
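
As with the Heartbeat example, any row in this group expands to an ordinary rule-file entry. A sketch for KubernetesPodNotHealthy (group name and file layout are assumed):

```yaml
groups:
  - name: KubernetesAlerts
    rules:
      - alert: KubernetesPodNotHealthy
        # Subquery: for each namespace/pod, take the minimum over the last hour; a result > 0
        # means the pod was in Pending, Unknown, or Failed for the whole hour.
        expr: min_over_time(sum by (exported_namespace, exported_pod) (kube_pod_status_phase{phase=~"Pending|Unknown|Failed"})[1h:1m]) > 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: Kubernetes Pod not healthy (instance {{ $labels.instance }})
          description: |-
            Pod has been in a non-ready state for longer than an hour.
            VALUE = {{ $value }}
            LABELS: {{ $labels }}
```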

NodeProcesses

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
CountPidsAndThreadOutOfLimit Host high PIDs and Threads usage (instance {{ $labels.instance }}) 5m high (sum(container_processes) by (node) + on (node) label_replace(node_processes_threads * on(instance) group_left(nodename) (node_uname_info), "node", "$1", "nodename", "(.+)")) / on (node) label_replace(node_processes_max_processes * on(instance) group_left(nodename) (node_uname_info), "node", "$1", "nodename", "(.+)") * 100 > 80 The sum of the node's PIDs and threads is approaching the limit (< 20% left)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
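
The CountPidsAndThreadOutOfLimit expression is dense because node-exporter series are keyed by instance while cAdvisor's container_processes is keyed by node. A sketch (the record names are invented, not part of the shipped rules) of how the label_replace() step re-keys the node-exporter side so the two sides can be combined on node:

```yaml
groups:
  - name: node-processes-helpers   # assumption: illustrative helper group
    rules:
      # Join node_uname_info to pick up the nodename label, then copy it into a
      # "node" label so the series aligns with container_processes.
      - record: node_threads:by_node
        expr: label_replace(node_processes_threads * on(instance) group_left(nodename) node_uname_info, "node", "$1", "nodename", "(.+)")
      - record: node_max_processes:by_node
        expr: label_replace(node_processes_max_processes * on(instance) group_left(nodename) node_uname_info, "node", "$1", "nodename", "(.+)")
```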

NodeExporters

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
NodeDiskUsageIsMoreThanThreshold Disk usage on node > 70% (instance {{ $labels.node }}) 5m warning (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"})) > 70 Node {{ $labels.node }} disk usage of {{ $labels.mountpoint }} is
VALUE = {{ $value }}%
{} {}
NodeDiskUsageIsMoreThanThreshold Disk usage on node > 90% (instance {{ $labels.node }}) 5m high (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"}) * 100 / (node_filesystem_avail_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} + (node_filesystem_size_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"} - node_filesystem_free_bytes{fstype=~"ext.*|xfs", mountpoint !~".*pod.*"})) > 90 Node {{ $labels.node }} disk usage of {{ $labels.mountpoint }} is
VALUE = {{ $value }}%
{} {}
HostOutOfMemory Host out of memory (instance {{ $labels.instance }}) 5m warning ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) * on(instance) group_left(nodename) node_uname_info < 10 Node memory is filling up (< 10% left)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostMemoryUnderMemoryPressure Host memory under memory pressure (instance {{ $labels.instance }}) 5m warning rate(node_vmstat_pgmajfault[2m]) * on(instance) group_left(nodename) node_uname_info > 1000 The node is under heavy memory pressure. High rate of major page faults
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostUnusualNetworkThroughputIn Host unusual network throughput in (instance {{ $labels.instance }}) 5m warning ((sum by (instance) (irate(node_network_receive_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024) / 1024 > 100 Host network interfaces are probably receiving too much data (> 100 MB/s)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostUnusualNetworkThroughputOut Host unusual network throughput out (instance {{ $labels.instance }}) 5m warning ((sum by (instance) (irate(node_network_transmit_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024) / 1024 > 100 Host network interfaces are probably sending too much data (> 100 MB/s)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostUnusualDiskReadRate Host unusual disk read rate (instance {{ $labels.instance }}) 5m warning (sum by (instance) (irate(node_disk_read_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024 / 1024 > 50 Disk is probably reading too much data (> 50 MB/s)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostUnusualDiskWriteRate Host unusual disk write rate (instance {{ $labels.instance }}) 5m warning ((sum by (instance) (irate(node_disk_written_bytes_total[2m])) * on(instance) group_left(nodename) node_uname_info) / 1024) / 1024 > 50 Disk is probably writing too much data (> 50 MB/s)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostOutOfDiskSpace Host out of disk space (instance {{ $labels.instance }}) 5m warning ((node_filesystem_avail_bytes{mountpoint="/"} * 100) / node_filesystem_size_bytes{mountpoint="/"}) * on(instance) group_left(nodename) node_uname_info < 10 Disk is almost full (< 10% left)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostDiskWillFillIn4Hours Host disk will fill in 4 hours (instance {{ $labels.instance }}) 5m warning predict_linear(node_filesystem_free_bytes{fstype!~"tmpfs"}[1h], 14400) * on(instance) group_left(nodename) node_uname_info < 0 Disk will fill in 4 hours at current write rate
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostOutOfInodes Host out of inodes (instance {{ $labels.instance }}) 5m warning ((node_filesystem_files_free{mountpoint ="/"} / node_filesystem_files{mountpoint ="/"}) * 100) * on(instance) group_left(nodename) node_uname_info < 10 Disk is almost running out of available inodes (< 10% left)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostUnusualDiskReadLatency Host unusual disk read latency (instance {{ $labels.instance }}) 5m warning (rate(node_disk_read_time_seconds_total[2m]) / rate(node_disk_reads_completed_total[2m])) * on(instance) group_left(nodename) node_uname_info > 100 Disk latency is growing (read operations > 100ms)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostUnusualDiskWriteLatency Host unusual disk write latency (instance {{ $labels.instance }}) 5m warning (rate(node_disk_write_time_seconds_total[2m]) / rate(node_disk_writes_completed_total[2m])) * on(instance) group_left(nodename) node_uname_info > 100 Disk latency is growing (write operations > 100ms)
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HostHighCpuLoad Host high CPU load (instance {{ $labels.instance }}) 5m warning 100 - ((avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100) * on (instance) group_left (nodename) node_uname_info) > 80 CPU load is > 80%
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
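
Most expressions in this group multiply by on(instance) group_left(nodename) node_uname_info only to attach a human-readable nodename label, since node_uname_info always has the value 1. A promtool sketch for HostOutOfMemory with invented instance and nodename values, 5 GiB available out of 100 GiB:

```yaml
rule_files: []
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'node_memory_MemAvailable_bytes{instance="10.0.0.1:9100", job="node-exporter"}'
        values: '5368709120x10'     # 5 GiB available
      - series: 'node_memory_MemTotal_bytes{instance="10.0.0.1:9100", job="node-exporter"}'
        values: '107374182400x10'   # 100 GiB total, so 5% free
      - series: 'node_uname_info{instance="10.0.0.1:9100", job="node-exporter", nodename="worker-1"}'
        values: '1x10'
    promql_expr_test:
      - expr: ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100) * on(instance) group_left(nodename) node_uname_info < 10
        eval_time: 5m
        exp_samples:
          # The join leaves the value untouched (5%) and adds nodename to the result labels.
          - labels: '{instance="10.0.0.1:9100", job="node-exporter", nodename="worker-1"}'
            value: 5
```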

DockerContainers

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
ContainerKilled Container killed (instance {{ $labels.instance }}) 5m warning time() - container_last_seen > 60 A container has disappeared
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
ContainerVolumeUsage Container Volume usage (instance {{ $labels.instance }}) 5m warning (1 - (sum(container_fs_inodes_free) BY (node) / sum(container_fs_inodes_total) BY (node))) * 100 > 80 Container Volume usage is above 80%
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
ContainerVolumeIoUsage Container Volume IO usage (instance {{ $labels.instance }}) 5m warning (sum(container_fs_io_current) BY (node, name) * 100) > 80 Container Volume IO usage is above 80%
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
ContainerHighThrottleRate Container high throttle rate (instance {{ $labels.instance }}) 5m warning rate(container_cpu_cfs_throttled_seconds_total[3m]) > 1 Container is being throttled
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}

HAmode

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
NotHAKubernetesDeploymentAvailableReplicas Not HA mode: Deployment Available Replicas < 2 (instance {{ $labels.instance }}) 5m warning kube_deployment_status_replicas_available < 2 Not HA mode: Kubernetes Deployment has less than 2 available replicas
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
NotHAKubernetesStatefulSetAvailableReplicas Not HA mode: StatefulSet Available Replicas < 2 (instance {{ $labels.instance }}) 5m warning kube_statefulset_status_replicas_available < 2 Not HA mode: Kubernetes StatefulSet has less than 2 available replicas
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
NotHAKubernetesDeploymentDesiredReplicas Not HA mode: Deployment Desired Replicas < 2 (instance {{ $labels.instance }}) 5m warning kube_deployment_status_replicas < 2 Not HA mode: Kubernetes Deployment has less than 2 desired replicas
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
NotHAKubernetesStatefulSetDesiredReplicas Not HA mode: StatefulSet Desired Replicas < 2 (instance {{ $labels.instance }}) 5m warning kube_statefulset_status_replicas < 2 Not HA mode: Kubernetes StatefulSet has less than 2 desired replicas
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
NotHAKubernetesDeploymentMultiplePodsPerNode Not HA mode: Deployment Has Multiple Pods per Node (instance {{ $labels.instance }}) 5m warning count(sum(kube_pod_info{node=~".+", created_by_kind="ReplicaSet"}) by (namespace, node, created_by_name) > 1) > 0 Not HA mode: Kubernetes Deployment has 2 or more replicas on the same node
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
NotHAKubernetesStatefulSetMultiplePodsPerNode Not HA mode: StatefulSet Has Multiple Pods per Node (instance {{ $labels.instance }}) 5m warning count(sum(kube_pod_info{node=~".+", created_by_kind="StatefulSet"}) by (namespace, node, created_by_name) > 1) > 0 Not HA mode: Kubernetes StatefulSet has 2 or more replicas on the same node
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
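
The two MultiplePodsPerNode rules fire when several replicas of one workload land on the same node, which defeats the purpose of running more than one replica. A common way to keep replicas apart is pod anti-affinity; a minimal Deployment sketch (workload name, label, and image are placeholders):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service                  # assumption: an example HA workload
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-service
  template:
    metadata:
      labels:
        app: my-service
    spec:
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            # Refuse to schedule two pods with this label on the same node.
            - labelSelector:
                matchLabels:
                  app: my-service
              topologyKey: kubernetes.io/hostname
      containers:
        - name: my-service
          image: registry.example.internal/my-service:latest   # assumption
```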

HAproxy

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
HaproxyDown HAProxy down (instance {{ $labels.instance }}) 5m critical haproxy_up == 0 HAProxy down
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyBackendConnectionErrors HAProxy backend connection errors (instance {{ $labels.instance }}) 5m critical sum by (backend) (rate(haproxy_backend_connection_errors_total[2m])) > 10 Too many connection errors to {{ $labels.fqdn }}/{{ $labels.backend }} backend (> 10 req/s). Request throughput may be too high.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyServerResponseErrors HAProxy server response errors (instance {{ $labels.instance }}) 5m critical sum by (server) (rate(haproxy_server_response_errors_total[2m])) > 5 Too many response errors to {{ $labels.server }} server (> 5 req/s).
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyServerConnectionErrors HAProxy server connection errors (instance {{ $labels.instance }}) 5m critical sum by (server) (rate(haproxy_server_connection_errors_total[2m])) > 10 Too many connection errors to {{ $labels.server }} server (> 10 req/s). Request throughput may be too high.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyPendingRequests HAProxy pending requests (instance {{ $labels.instance }}) 5m warning sum by (backend) (haproxy_backend_current_queue) > 0 Some HAProxy requests are pending on {{ $labels.fqdn }}/{{ $labels.backend }} backend
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyHttpSlowingDown HAProxy HTTP slowing down (instance {{ $labels.instance }}) 5m warning avg by (backend) (haproxy_backend_http_total_time_average_seconds) > 2 Average request time is increasing
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyRetryHigh HAProxy retry high (instance {{ $labels.instance }}) 5m warning sum by (backend) (rate(haproxy_backend_retry_warnings_total[5m])) > 10 High rate of retry on {{ $labels.fqdn }}/{{ $labels.backend }} backend
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyBackendDown HAProxy backend down (instance {{ $labels.instance }}) 5m critical haproxy_backend_up == 0 HAProxy backend is down
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyServerDown HAProxy server down (instance {{ $labels.instance }}) 5m critical haproxy_server_up == 0 HAProxy server is down
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyFrontendSecurityBlockedRequests HAProxy frontend security blocked requests (instance {{ $labels.instance }}) 5m warning sum by (frontend) (rate(haproxy_frontend_requests_denied_total[5m])) > 10 HAProxy is blocking requests for security reason
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HaproxyServerHealthcheckFailure HAProxy server healthcheck failure (instance {{ $labels.instance }}) 5m warning increase(haproxy_server_check_failures_total[5m]) > 0 Some server healthchecks are failing on {{ $labels.server }}
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}

Etcd

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
EtcdInsufficientMembers Etcd insufficient Members (instance {{ $labels.instance }}) 5m critical count(etcd_server_id{job="etcd"}) % 2 == 0 Etcd cluster should have an odd number of members
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdNoLeader Etcd no Leader (instance {{ $labels.instance }}) 5m critical etcd_server_has_leader == 0 Etcd cluster has no leader
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdHighNumberOfLeaderChanges Etcd high number of leader changes (instance {{ $labels.instance }}) 5m warning increase(etcd_server_leader_changes_seen_total[1h]) > 3 Etcd leader changed more than 3 times during the last hour
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdHighNumberOfFailedGrpcRequests Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) 5m warning sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method) > 0.01 More than 1% GRPC request failure detected in Etcd for 5 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdHighNumberOfFailedGrpcRequests Etcd high number of failed GRPC requests (instance {{ $labels.instance }}) 5m critical sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method) / sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method) > 0.05 More than 5% GRPC request failure detected in Etcd for 5 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdGrpcRequestsSlow Etcd GRPC requests slow (instance {{ $labels.instance }}) 5m warning histogram_quantile(0.99, sum(rate(grpc_server_handling_seconds_bucket{job="etcd",grpc_type="unary"}[5m])) by (grpc_service, grpc_method, le)) > 0.15 GRPC requests slowing down, 99th percentile is over 0.15s for 5 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdMemberCommunicationSlow Etcd member communication slow (instance {{ $labels.instance }}) 5m warning histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket{job="etcd"}[5m])) > 0.15 Etcd member communication slowing down, 99th percentile is over 0.15s for 5 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdHighNumberOfFailedProposals Etcd high number of failed proposals (instance {{ $labels.instance }}) 5m warning increase(etcd_server_proposals_failed_total[1h]) > 5 Etcd server got more than 5 failed proposals in the past hour
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdHighFsyncDurations Etcd high fsync durations (instance {{ $labels.instance }}) 5m warning histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m])) > 0.5 Etcd WAL fsync duration increasing, 99th percentile is over 0.5s for 5 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
EtcdHighCommitDurations Etcd high commit durations (instance {{ $labels.instance }}) 5m warning histogram_quantile(0.99, rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])) > 0.25 Etcd commit duration increasing, 99th percentile is over 0.25s for 5 minutes
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
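
Both EtcdHighNumberOfFailedGrpcRequests rows evaluate the same error ratio and differ only in the threshold (0.01 vs 0.05). Where rule-evaluation cost matters, the ratio could be computed once as a recording rule that both alert thresholds then reference; a sketch with an invented record name:

```yaml
groups:
  - name: etcd-helpers               # assumption: illustrative helper group
    rules:
      # Per-method ratio of non-OK gRPC responses over the last 5 minutes.
      - record: etcd:grpc_server_handled:error_ratio
        expr: |
          sum(rate(grpc_server_handled_total{job="etcd",grpc_code!="OK"}[5m])) BY (grpc_service, grpc_method)
            /
          sum(rate(grpc_server_handled_total{job="etcd"}[5m])) BY (grpc_service, grpc_method)
```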

NginxIngressAlerts

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
NginxHighHttp4xxErrorRate Nginx high HTTP 4xx error rate (node: {{ $labels.node }}, namespace: {{ $labels.exported_namespace }}, ingress: {{ $labels.ingress }}) 1m high sum by (ingress, exported_namespace, node) (rate(nginx_ingress_controller_requests{status=~"^4.."}[2m])) / sum by (ingress, exported_namespace, node)(rate(nginx_ingress_controller_requests[2m])) * 100 > 5 Too many HTTP requests with status 4xx (> 5%)
VALUE = {{ $value }}
LABELS = {{ $labels }}
{} {}
NginxHighHttp5xxErrorRate Nginx high HTTP 5xx error rate (node: {{ $labels.node }}, namespace: {{ $labels.exported_namespace }}, ingress: {{ $labels.ingress }}) 1m high sum by (ingress, exported_namespace, node) (rate(nginx_ingress_controller_requests{status=~"^5.."}[2m])) / sum by (ingress, exported_namespace, node) (rate(nginx_ingress_controller_requests[2m])) * 100 > 5 Too many HTTP requests with status 5xx (> 5%)
VALUE = {{ $value }}
LABELS = {{ $labels }}
{} {}
NginxLatencyHigh Nginx latency high (node: {{ $labels.node }}, host: {{ $labels.host }}) 2m warning histogram_quantile(0.99, sum(rate(nginx_ingress_controller_request_duration_seconds_bucket[2m])) by (host, node, le)) > 3 Nginx p99 latency is higher than 3 seconds
VALUE = {{ $value }}
LABELS = {{ $labels }}
{} {}

CoreDnsAlerts

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
CorednsPanicCount CoreDNS Panic Count (instance {{ $labels.instance }}) 0m critical increase(coredns_panics_total[1m]) > 0 Number of CoreDNS panics encountered
VALUE = {{ $value }}
LABELS = {{ $labels }}
{} {}
CoreDNSLatencyHigh CoreDNS has high latency 5m critical histogram_quantile(0.99, sum(rate(coredns_dns_request_duration_seconds_bucket[2m])) by(server, zone, le)) > 3 CoreDNS has 99th percentile latency of {{ $value }} seconds for server {{ $labels.server }} zone {{ $labels.zone }} {} {}
CoreDNSForwardHealthcheckFailureCount CoreDNS health checks to an upstream server have failed 5m warning sum(rate(coredns_forward_healthcheck_broken_total[2m])) > 0 CoreDNS health checks to upstream server {{ $labels.to }} have failed {} {}
CoreDNSForwardHealthcheckBrokenCount CoreDNS health checks have failed for all upstream servers 5m warning sum(rate(coredns_forward_healthcheck_broken_total[2m])) > 0 CoreDNS health checks failed for all upstream servers LABELS = {{ $labels }} {} {}
CoreDNSErrorsHigh CoreDNS is returning SERVFAIL 5m critical sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_dns_responses_total[2m])) > 0.03 CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests {} {}
CoreDNSErrorsHigh CoreDNS is returning SERVFAIL 5m warning sum(rate(coredns_dns_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_dns_responses_total[2m])) > 0.01 CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of requests {} {}
CoreDNSForwardLatencyHigh CoreDNS has 99th percentile latency for forwarding requests 5m critical histogram_quantile(0.99, sum(rate(coredns_forward_request_duration_seconds_bucket[2m])) by(to, le)) > 3 CoreDNS has 99th percentile latency of {{ $value }} seconds forwarding requests to {{ $labels.to }} {} {}
CoreDNSForwardErrorsHigh CoreDNS is returning SERVFAIL for forward requests 5m critical sum(rate(coredns_forward_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_forward_responses_total[2m])) > 0.03 CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }} {} {}
CoreDNSForwardErrorsHigh CoreDNS is returning SERVFAIL for forward requests 5m warning sum(rate(coredns_forward_responses_total{rcode="SERVFAIL"}[2m])) / sum(rate(coredns_forward_responses_total[2m])) > 0.01 CoreDNS is returning SERVFAIL for {{ $value | humanizePercentage }} of forward requests to {{ $labels.to }} {} {}

DRAlerts

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
ProbeFailed Probe failed (instance: {{ $labels.instance }}) 5m critical probe_success == 0 Probe failed
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
SlowProbe Slow probe (instance: {{ $labels.instance }}) 5m warning avg_over_time(probe_duration_seconds[1m]) > 1 Blackbox probe took more than 1s to complete
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HttpStatusCode HTTP Status Code (instance: {{ $labels.instance }}) 5m high probe_http_status_code <= 199 OR probe_http_status_code >= 400 HTTP status code is not 200-399
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
HttpSlowRequests HTTP slow requests (instance: {{ $labels.instance }}) 5m warning avg_over_time(probe_http_duration_seconds[1m]) > 1 HTTP request took more than 1s
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}
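
The probe_success, probe_http_status_code, and probe_duration_seconds series in this group come from a blackbox exporter, so these alerts only have data if probe targets are scraped. A minimal scrape-config sketch, assuming an exporter reachable at blackbox-exporter:9115, an http_2xx module, and an example target URL:

```yaml
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]             # assumption: module defined in the blackbox exporter config
    static_configs:
      - targets:
          - https://example.com/healthz   # assumption: endpoint to probe
    relabel_configs:
      # Pass the target URL to the exporter and keep it as the instance label.
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      # Scrape the exporter itself rather than the probed endpoint.
      - target_label: __address__
        replacement: blackbox-exporter:9115   # assumption: exporter address
```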

BackupAlerts

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
LastBackupFailed Last backup made by pod {{ $labels.pod }} in namespace {{ $labels.namespace }} failed. 1m warning backup_storage_last_failed != 0 Last backup made by pod {{ $labels.pod }} in namespace {{ $labels.namespace }} failed.
VALUE = {{ $value }}
LABELS: {{ $labels }}
{} {}

Cert-exporter

Cert-exporter rules

Alerting rules

Name Summary For Severity Expression Description Other labels Other annotations
FileCerts30DaysRemaining Certificates from files expire within 30 days 10m warning count(86400 * 7 < cert_exporter_cert_expires_in_seconds <= 86400 * 30) > 0 Some certificates from files expire within 30 days. {} {}
FileCerts7DaysRemaining Certificates from files expire within 7 days 10m high count(0 < cert_exporter_cert_expires_in_seconds <= 86400 * 7) > 0 Some certificates from files expire within 7 days. {} {}
FileCertsExpired Certificates from files expired 10m critical count(cert_exporter_cert_expires_in_seconds <= 0) > 0 Some certificates from files already expired. {} {}
KubeconfigCerts30DaysRemaining Certificates from kubeconfig expire within 30 days 10m warning count(86400 * 7 < cert_exporter_kubeconfig_expires_in_seconds <= 86400 * 30) > 0 Some certificates from kubeconfig expire within 30 days. {} {}
KubeconfigCerts7DaysRemaining Certificates from kubeconfig expire within 7 days 10m high count(0 < cert_exporter_kubeconfig_expires_in_seconds <= 86400 * 7) > 0 Some certificates from kubeconfig expire within 7 days. {} {}
KubeconfigCertsExpired Certificates from kubeconfig expired 10m critical count(cert_exporter_kubeconfig_expires_in_seconds <= 0) > 0 Some certificates from kubeconfig already expired. {} {}
SecretCerts30DaysRemaining Certificates from secrets expire within 30 days 10m warning count(86400 * 7 < cert_exporter_secret_expires_in_seconds <= 86400 * 30) > 0 Some certificates from secrets expire within 30 days. {} {}
SecretCerts7DaysRemaining Certificates from secrets expire within 7 days 10m high count(0 < cert_exporter_secret_expires_in_seconds <= 86400 * 7) > 0 Some certificates from secrets expire within 7 days. {} {}
SecretCertsExpired Certificates from secrets expired 10m critical count(cert_exporter_secret_expires_in_seconds <= 0) > 0 Some certificates from secrets already expired. {} {}
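
The expressions in this group rely on PromQL's left-to-right evaluation: 86400 * 7 < cert_exporter_cert_expires_in_seconds first keeps only series with more than 7 days of validity left, and the <= 86400 * 30 step then keeps those with at most 30 days, so each certificate falls into exactly one bucket. A promtool sketch with two invented certificate series, one inside and one outside the 7-30 day window:

```yaml
rule_files: []
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      - series: 'cert_exporter_cert_expires_in_seconds{path="/etc/ssl/a.pem"}'
        values: '864000x3'     # about 10 days left: inside the 7-30 day window
      - series: 'cert_exporter_cert_expires_in_seconds{path="/etc/ssl/b.pem"}'
        values: '8640000x3'    # about 100 days left: outside the window
    promql_expr_test:
      - expr: count(86400 * 7 < cert_exporter_cert_expires_in_seconds <= 86400 * 30) > 0
        eval_time: 2m
        exp_samples:
          # Only a.pem survives both comparisons, so the count is 1.
          - labels: '{}'
            value: 1
```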