Alerting¶

Kube-burner includes an alerting feature that evaluates Prometheus expressions in order to fire and index alerts.

Configuration¶

Alerting is configured through a configuration file pointed to by the flag --alert-profile or -a. It is a YAML-formatted file with the following shape:

- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
  description: 5 minutes avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}
  severity: error

- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))[5m:]) > 0.1
  description: 5 minutes avg. etcd network peer round trip on {{$labels.pod}} higher than 100ms {{$value}}
  severity: error

- expr: increase(etcd_server_leader_changes_seen_total[2m]) > 0
  description: etcd leader changes observed
  severity: error

Here, expr holds the PromQL expression to evaluate and description holds a description of the alert, which is printed and indexed when the alert fires. In the description field, you can reference Prometheus labels with the syntax {{$labels.<label_name>}} to make alerts more readable, and print the value that fired the alarm with {{$value}}.
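The placeholder expansion described above can be illustrated with Go's text/template package. This is a minimal sketch, not kube-burner's actual code: the alertData type and the renderDescription function are hypothetical names, and declaring $labels and $value in a template prefix is an assumption about how such variables could be made available.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// alertData is a hypothetical container for one sample that fired an alert.
type alertData struct {
	Labels map[string]string
	Value  float64
}

// renderDescription expands {{$labels.<label_name>}} and {{$value}}
// in an alert description.
func renderDescription(description string, data alertData) (string, error) {
	// Assumption: declare $labels and $value ahead of the user-provided text
	// so the description can reference them as template variables.
	prefix := "{{$labels := .Labels}}{{$value := .Value}}"
	tmpl, err := template.New("description").Parse(prefix + description)
	if err != nil {
		return "", err
	}
	var out strings.Builder
	if err := tmpl.Execute(&out, data); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	msg, err := renderDescription(
		"5 minutes avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}",
		alertData{Labels: map[string]string{"pod": "etcd-0"}, Value: 0.0125},
	)
	if err != nil {
		panic(err)
	}
	fmt.Println(msg)
}
```

The pod name etcd-0 and the value 0.0125 are made-up inputs used only to show the substitution.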

You can configure alerts with a severity. Each severity level has different effects. These are:

info: Prints an info message with the alarm description to stdout. This is the default severity for all expressions.
warning: Prints a warning message with the alarm description to stdout.
error: Prints an error message with the alarm description to stdout and sets kube-burner's return code (rc) to 1.
critical: Prints a fatal message with the alarm description to stdout and aborts execution immediately with rc != 0.
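The severity levels above can be summarized as a small lookup. The following is only an illustrative sketch: the effect type and the severityEffect function are hypothetical and do not reflect how kube-burner is implemented internally.

```go
package main

import "fmt"

// effect is a hypothetical summary of what a severity level does.
type effect struct {
	LogLevel string // level of the message printed to stdout
	FailsRun bool   // true when the return code becomes non-zero
	Aborts   bool   // true when execution stops immediately
}

// severityEffect maps a severity from the alert profile to its effect.
// An empty severity is treated as info, the documented default.
func severityEffect(severity string) (effect, error) {
	switch severity {
	case "", "info":
		return effect{LogLevel: "info"}, nil
	case "warning":
		return effect{LogLevel: "warning"}, nil
	case "error":
		return effect{LogLevel: "error", FailsRun: true}, nil
	case "critical":
		return effect{LogLevel: "fatal", FailsRun: true, Aborts: true}, nil
	default:
		return effect{}, fmt.Errorf("unknown severity %q", severity)
	}
}

func main() {
	for _, s := range []string{"info", "warning", "error", "critical"} {
		e, err := severityEffect(s)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-8s log=%-7s failsRun=%-5t aborts=%t\n", s, e.LogLevel, e.FailsRun, e.Aborts)
	}
}
```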

Using the elapsed variable¶

A special Go template variable, elapsed, can be used within the Prometheus expression. It is set to the job duration (or to the range given to check-alerts). This variable is especially useful in expressions that use aggregation-over-time functions. For example:

- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[{{ .elapsed }}:]) > 0.01
  description: avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}
  severity: error

Checking alerts¶

It is possible to evaluate alerts without triggering a kube-burner workload by using the check-alerts subcommand. Similar to the index subcommand, it accepts the flags --start and --end to evaluate the alerts over a given time range.

$ kube-burner check-alerts -u https://prometheus.url.com -t ${token} -a alert-profile.yml
INFO[2020-12-10 11:47:23] 👽 Initializing prometheus client
INFO[2020-12-10 11:47:24] 🔔 Initializing alert manager
INFO[2020-12-10 11:47:24] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01'
ERRO[2020-12-10 11:47:24] Alert triggered at 2020-12-10 11:01:53 +0100 CET: '5 minutes avg. etcd fsync latency on etcd-ip-10-0-213-209.us-west-2.compute.internal higher than 10ms 0.010281314285714311'
INFO[2020-12-10 11:47:24] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))[5m:]) > 0.1'
INFO[2020-12-10 11:47:24] Evaluating expression: 'increase(etcd_server_leader_changes_seen_total[2m]) > 0'
INFO[2020-12-10 11:47:24] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb=~"POST|PUT|DELETE|PATCH|CREATE"}) by (verb,resource,subresource,le))[5m:]) > 1'
INFO[2020-12-10 11:47:25] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb="GET",scope="resource"}[2m])) by (verb,resource,subresource,le))[5m:]) > 1'
INFO[2020-12-10 11:47:25] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb="LIST",scope="namespace"}[2m])) by (verb,resource,subresource,le))[5m:]) > 5'
INFO[2020-12-10 11:47:26] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb="LIST",scope="cluster"}[2m])) by (verb,resource,subresource,le))[5m:]) > 30'

Indexing alerts¶

When indexing is enabled, the alerts sent by kube-burner are automatically indexed by the provided indexer. The documents generated by these alerts have the following structure:

{
  "timestamp": "2023-01-19T22:20:10+01:00",
  "uuid": "c0dd0d60-ddf5-488e-bf2f-b8960fc2b5ab",
  "severity": "warning",
  "description": "5 minutes avg. 99th etcd fsync latency on etcd-ip-10-0-133-30.us-west-2.compute.internal higher than 10ms. 0.004s",
  "metricName": "alert"
}