Alerting
Kube-burner includes an alerting feature able to evaluate Prometheus expressions in order to fire and index alerts.
Configuration
Alerting is configured through a YAML file passed with the --alert-profile (or -a) flag. The file has the following shape:
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01
  description: 5 minutes avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}
  severity: error
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))[5m:]) > 0.1
  description: 5 minutes avg. etcd network peer round trip on {{$labels.pod}} higher than 100ms {{$value}}
  severity: error
- expr: increase(etcd_server_leader_changes_seen_total[2m]) > 0
  description: etcd leader changes observed
  severity: error
Here, expr holds the PromQL expression to evaluate and description holds a description of the alert, which is printed and indexed when the alert fires. In the description field, you can use Prometheus labels to improve alert readability with the syntax {{$labels.<label_name>}}, and print the value that fired the alert with {{$value}}.
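To illustrate how the description placeholders expand, here is a minimal Python sketch. It is not kube-burner's implementation (kube-burner is written in Go); the function name render_description, the regular expression, and the sample label and value are assumptions made only for this example.

```python
import re


def render_description(template: str, labels: dict[str, str], value: float) -> str:
    """Expand {{$labels.<name>}} and {{$value}} placeholders in a description."""

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name == "value":
            return str(value)
        # name looks like "labels.pod"; in this sketch, unknown labels
        # expand to an empty string.
        return labels.get(name.split(".", 1)[1], "")

    return re.sub(r"\{\{\s*\$(value|labels\.\w+)\s*\}\}", replace, template)


description = (
    "5 minutes avg. etcd fsync latency on {{$labels.pod}} "
    "higher than 10ms {{$value}}"
)
# Hypothetical label set and sample value for a single alerting series.
rendered = render_description(description, {"pod": "etcd-0"}, 0.0125)
print(rendered)
```

Each series returned by the expression carries its own labels and value, so one alert entry can produce several messages, one per series.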
You can configure alerts with a severity. Each severity level has different effects. These are:
info: Prints an info message with the alarm description to stdout. By default all expressions have this severity.
warning: Prints a warning message with the alarm description to stdout.
error: Prints an error message with the alarm description to stdout and makes kube-burner rc = 1
critical: Prints a fatal message with the alarm description to stdout and aborts execution immediately with rc != 0
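The effect of each severity can be sketched as follows. This is an illustrative Python model of the behavior described above, not kube-burner's code; handle_alert and AbortExecution are hypothetical names, and the caller is assumed to turn the abort into a non-zero exit code.

```python
import logging
import sys

# The text says messages go to stdout, so point the logger there.
logging.basicConfig(stream=sys.stdout, level=logging.INFO,
                    format="%(levelname)s %(message)s")
log = logging.getLogger("alerts")

LOG_LEVELS = {
    "info": logging.INFO,
    "warning": logging.WARNING,
    "error": logging.ERROR,
    "critical": logging.CRITICAL,
}


class AbortExecution(Exception):
    """Raised for critical alerts to stop the run immediately."""


def handle_alert(description: str, severity: str = "info", rc: int = 0) -> int:
    """Log a fired alert and return the updated return code."""
    if severity not in LOG_LEVELS:
        raise ValueError(f"unknown severity: {severity}")
    log.log(LOG_LEVELS[severity], description)
    if severity == "critical":
        # Assumption: the caller maps this to a non-zero exit code.
        raise AbortExecution(description)
    if severity == "error":
        return 1  # the run continues, but finishes with rc = 1
    return rc  # info and warning leave the return code untouched


rc = 0
rc = handle_alert("etcd leader changes observed", "warning", rc)
rc = handle_alert("etcd leader changes observed", "error", rc)
```

The return code is sticky in this sketch: once an error alert sets it to 1, later info or warning alerts do not reset it.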
Using the elapsed variable
A special Go template variable, elapsed, can be used within the Prometheus expression. It is set to the job duration (or to the range given to check-alerts). This variable is especially useful in expressions that aggregate over time. For example:
- expr: avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[{{ .elapsed }}:]) > 0.01
  description: avg. etcd fsync latency on {{$labels.pod}} higher than 10ms {{$value}}
  severity: error
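The substitution can be pictured with the short Python sketch below. It assumes that elapsed is rendered as a whole number of minutes in Prometheus duration syntax (for example 25m); the exact format kube-burner uses is not stated here, and render_expr is a hypothetical helper.

```python
import re
from datetime import datetime, timedelta


def render_expr(expr: str, start: datetime, end: datetime) -> str:
    """Replace {{ .elapsed }} with the length of the [start, end] range."""
    # Assumption: whole minutes; the floor of 1 avoids an empty 0m range
    # and is a choice made for this sketch only.
    minutes = max(1, int((end - start).total_seconds() // 60))
    return re.sub(r"\{\{\s*\.elapsed\s*\}\}", f"{minutes}m", expr)


start = datetime(2020, 12, 10, 11, 0, 0)
end = start + timedelta(minutes=25)
expr = (
    "avg_over_time(histogram_quantile(0.99, "
    "rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))"
    "[{{ .elapsed }}:]) > 0.01"
)
rendered_expr = render_expr(expr, start, end)
print(rendered_expr)
```

With a 25-minute job, the subquery range becomes [25m:], so the average covers the whole run rather than a fixed window.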
Checking alerts
It is possible to look for alerts without triggering a kube-burner workload by using the check-alerts subcommand. Similar to index, this subcommand accepts the --start and --end flags to evaluate the alerts over a given time range.
$ kube-burner check-alerts -u https://prometheus.url.com -t ${token} -a alert-profile.yml
INFO[2020-12-10 11:47:23] 👽 Initializing prometheus client
INFO[2020-12-10 11:47:24] 🔔 Initializing alert manager
INFO[2020-12-10 11:47:24] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[2m]))[5m:]) > 0.01'
ERRO[2020-12-10 11:47:24] Alert triggered at 2020-12-10 11:01:53 +0100 CET: '5 minutes avg. etcd fsync latency on etcd-ip-10-0-213-209.us-west-2.compute.internal higher than 10ms 0.010281314285714311'
INFO[2020-12-10 11:47:24] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[5m]))[5m:]) > 0.1'
INFO[2020-12-10 11:47:24] Evaluating expression: 'increase(etcd_server_leader_changes_seen_total[2m]) > 0'
INFO[2020-12-10 11:47:24] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb=~"POST|PUT|DELETE|PATCH|CREATE"}) by (verb,resource,subresource,le))[5m:]) > 1'
INFO[2020-12-10 11:47:25] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb="GET",scope="resource"}[2m])) by (verb,resource,subresource,le))[5m:]) > 1'
INFO[2020-12-10 11:47:25] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb="LIST",scope="namespace"}[2m])) by (verb,resource,subresource,le))[5m:]) > 5'
INFO[2020-12-10 11:47:26] Evaluating expression: 'avg_over_time(histogram_quantile(0.99, sum(rate(apiserver_request_duration_seconds_bucket{apiserver="kube-apiserver",verb="LIST",scope="cluster"}[2m])) by (verb,resource,subresource,le))[5m:]) > 30'
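To evaluate alerts over an explicit window, the --start and --end flags can be added to the same command. The Python sketch below only builds the argument list and does not run kube-burner. It assumes both flags take Unix epoch timestamps in seconds, which should be confirmed with kube-burner check-alerts --help; check_alerts_cmd and the TOKEN placeholder are hypothetical.

```python
import shlex
from datetime import datetime, timezone


def check_alerts_cmd(url: str, token: str, profile: str,
                     start: datetime, end: datetime) -> list[str]:
    """Build a check-alerts invocation for the [start, end] range."""
    return [
        "kube-burner", "check-alerts",
        "-u", url,
        "-t", token,
        "-a", profile,
        # Assumption: --start and --end expect epoch seconds.
        "--start", str(int(start.timestamp())),
        "--end", str(int(end.timestamp())),
    ]


start = datetime(2020, 12, 10, 10, 0, tzinfo=timezone.utc)
end = datetime(2020, 12, 10, 11, 0, tzinfo=timezone.utc)
cmd = check_alerts_cmd("https://prometheus.url.com", "TOKEN",
                       "alert-profile.yml", start, end)
print(shlex.join(cmd))
```

Building the list first and printing it with shlex.join makes it easy to review the command before handing it to subprocess.run.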
Indexing alerts
When indexing is enabled, the alerts sent by kube-burner are automatically indexed by the provided indexer. The documents generated by these alerts have the following structure:
{
  "timestamp": "2023-01-19T22:20:10+01:00",
  "uuid": "c0dd0d60-ddf5-488e-bf2f-b8960fc2b5ab",
  "severity": "warning",
  "description": "5 minutes avg. 99th etcd fsync latency on etcd-ip-10-0-133-30.us-west-2.compute.internal higher than 10ms. 0.004s",
  "metricName": "alert"
}
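A consumer of these documents might load them into a typed structure, as in the following Python sketch. The AlertDocument class and parse_alert function are hypothetical names for this example; only the field names come from the document above.

```python
import json
from dataclasses import dataclass
from datetime import datetime


@dataclass
class AlertDocument:
    timestamp: datetime
    uuid: str
    severity: str
    description: str
    metric_name: str


def parse_alert(raw: str) -> AlertDocument:
    """Parse one indexed alert document from its JSON form."""
    doc = json.loads(raw)
    return AlertDocument(
        # The timestamp is ISO 8601 with a UTC offset.
        timestamp=datetime.fromisoformat(doc["timestamp"]),
        uuid=doc["uuid"],
        severity=doc["severity"],
        description=doc["description"],
        metric_name=doc["metricName"],
    )


raw = """
{
  "timestamp": "2023-01-19T22:20:10+01:00",
  "uuid": "c0dd0d60-ddf5-488e-bf2f-b8960fc2b5ab",
  "severity": "warning",
  "description": "5 minutes avg. 99th etcd fsync latency on etcd-ip-10-0-133-30.us-west-2.compute.internal higher than 10ms. 0.004s",
  "metricName": "alert"
}
"""
alert = parse_alert(raw)
print(alert.severity, alert.timestamp.isoformat())
```

Filtering on metricName equal to "alert" together with the run's uuid is one way to retrieve only the alerts fired by a given execution.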