This document describes how to configure Prometheus alerts in Monitoring.

Platform Monitoring alerts¶

Out-of-box alerts delivered along with Monitoring.

Enabling and disabling alerts¶

OOB alerts installation can be customised by changing prometheusRules section.

You can enable or disable installation of all alerts by changing prometheusRules.install option. The default value is true which means that OOB alerts will be enabled in general by default.

Alerts divided into groups. Alerts are grouped by meaning for monitoring purposes: SelfMonitoring, NodeExporters, Etcd, etc. Each alert group can be enabled or disabled individually by changing prometheusRules.ruleGroups parameter. This parameter contains a list of alert groups that should be installed.

OOB alerts include the groups mentioned in the table below, but not all alert groups are enabled by default, therefore you should pay attention if you want to use alerts disabled by default (e.g. Heartbeat alert):

Alert group	Enabled by default
Heartbeat	✗ No
SelfMonitoring	✓ Yes
AlertManager	✓ Yes
KubebernetesAlerts	✓ Yes
NodeProcesses	✓ Yes
NodeExporters	✓ Yes
DockerContainers	✓ Yes
HAmode	✗ No
HAproxy	✗ No
Etcd	✓ Yes
NginxIngressAlerts	✓ Yes
CoreDnsAlerts	✓ Yes
DRAlerts	✓ Yes
BackupAlerts	✓ Yes

Full list of all alerts can be found in the alerts-oob document.

If you want to enable alerts for HAmode or HAproxy, you should add HAmode or HAproxy respectively to prometheusRules.ruleGroups parameter.

If you want to enable Dead Man's Switch (Heartbeat) alert, you should add Heartbeat to prometheusRules.ruleGroups parameter.

You can find examples of configuration in the appropriate section.

Dead Man's Switch alert¶

Dead Man's Switch alert is a special always-firing alert that meant to ensure that the entire alerting pipeline is functional. If this alert stops firing, it means that some of the critical monitoring and/or alerting components have failed.

Platform Monitoring's Dead Man's Switch alert is placed in the Heartbeat alert group and called DeadMansSwitch. It uses the simplest expression possible under the hood: vector(1), and the lowest severity: information. Even the simplest expressions are calculated on the monitoring back-end (Prometheus/VMSingle), so the alert checks that side of the monitoring. The alert has for: 0s, so it should start fire immediately since all base monitoring and alerting components are installed.

Fields for, expr and severity can be overridden as well as every other OOB alert.

This type of alert is disabled by default. You can enable the Dead Man's Switch alert by adding Heartbeat alert group to the prometheusRules.ruleGroups parameter. Example of configuration with enabled Dead Man's Switch can be found here.

Attention: If you want to enable Dead Man's Switch alert and have a connected notification system for alerts, proactively make sure that your system is ready for a constantly firing alert and will not encounter a flood of notifications because of this. This can be done, for example, by ignoring alerts with a severity below warning.

HAmode alerts¶

HAmode is an alert group includes alert rules that report that some Deployments and StatefulSets do not comply with working conditions in HA (High Availability) mode. Services in HA mode should have 2 or more available or at least desired replicas, and these replicas should be placed on different nodes.

HAmode will be triggered if:

Some Deployments or StatefulSets have less than 2 desired replicas
Some Deployments or StatefulSets have less than 2 available replicas
Some Deployments or StatefulSets have 2 or more replicas placed on the same node

This alert group is disabled by default in case if clusters have a lot of services that don't follow conditions for the HA mode.

Platform Monitoring has a dashboard called HA services for the same purposes as the HAmode alert group.

Alerts overriding¶

Monitoring has a mechanism that allows changing specific alert(-s) from the OOB set. If you want to override an alert, you can use prometheusRules.override parameter. This parameter includes list of objects with definitions for alerts that should be overridden.

You can override fields for, expr and severity. The prometheusRules.override parameter looks like this:

prometheusRules:
  override:
    - group: SelfMonitoring
      alert: PrometheusNotificationsBacklog
      for: 0s
      expr: "min_over_time(prometheus_notifications_queue_length[20m]) > 0"
      severity: high
    - ...

You can either override all 3 fields for the alert or only one of them. Every item in the prometheusRules.override parameter may have the following fields:

Parameter	Description	Required
group	Name of alert group where the overridden alert from.	true
alert	Name of the overridden alert.	true
for	Alerts are considered firing once they have been returned for this long. Alerts which have not yet fired for long enough are considered pending.	false
expr	How long an alert will continue firing after the condition that triggered it has cleared.	false
severity	Shows the level of importance for the alert. Recommended levels: critical, high, warning, information.	false

You can find more information about alert configuring process in the alert best practice document.

Examples¶

Default configuration example¶

prometheusRules:
  install: true
  ruleGroups:
    - SelfMonitoring
    - AlertManager
    - KubebernetesAlerts
    - NodeProcesses
    - NodeExporters
    - DockerContainers
    - Etcd
    - NginxIngressAlerts
    - CoreDnsAlerts
    - DRAlerts
    - BackupAlerts

Configuration with enabled additional groups¶

prometheusRules:
  install: true
  ruleGroups:
    - Heartbeat
    - HAmode
    - HAproxy
    - SelfMonitoring
    - AlertManager
    - KubebernetesAlerts
    - NodeProcesses
    - NodeExporters
    - DockerContainers
    - Etcd
    - NginxIngressAlerts
    - CoreDnsAlerts
    - DRAlerts
    - BackupAlerts

Configuration with overridden alert¶

The following configuration changes for to 0s, expr to min_over_time(prometheus_notifications_queue_length[20m]) > 0 and severity to high for alert PrometheusNotificationsBacklog name from SelfMonitoring group.

prometheusRules:
  install: true
  ruleGroups:
    - SelfMonitoring
    - AlertManager
    - KubebernetesAlerts
    - NodeProcesses
    - NodeExporters
    - DockerContainers
    - Etcd
    - NginxIngressAlerts
    - CoreDnsAlerts
    - DRAlerts
    - BackupAlerts
  override:
    - group: SelfMonitoring
      alert: PrometheusNotificationsBacklog
      for: 0s
      expr: "min_over_time(prometheus_notifications_queue_length[20m]) > 0"
      severity: high