# 08 - Writing Alert Rules
## 8.1 Alert Rule Syntax

### Rule File Structure

```yaml
# /etc/prometheus/rules/alerts.yml
groups:
  - name: <group_name>          # rule group name
    interval: <duration>        # evaluation interval (optional, overrides the global setting)
    rules:
      - alert: <alert_name>     # alert name
        expr: <promql_expr>     # PromQL expression
        for: <duration>         # how long the condition must hold
        labels:                 # extra labels
          severity: warning
        annotations:            # annotations (used in notification templates)
          summary: "..."
          description: "..."
```
### Fields

| Field | Required | Description |
|---|---|---|
| alert | ✅ | Alert name, UpperCamelCase by convention |
| expr | ✅ | PromQL expression; any non-empty result fires (one alert per result series) |
| for | ❌ | How long the condition must hold before firing (default: immediately) |
| labels | ❌ | Extra labels (e.g. severity) |
| annotations | ❌ | Descriptive text (used in notification templates) |
### Basic Example

```yaml
groups:
  - name: instance
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "Instance {{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 1 minute"
## 8.2 Template Syntax

### Variables

| Variable | Description |
|---|---|
| $labels | The alert's label key/value pairs |
| $value | The query value at firing time |
| $externalLabels | The global external_labels from the Prometheus config |
```yaml
annotations:
  summary: "High CPU usage: {{ $labels.instance }}"
  description: "CPU usage on instance {{ $labels.instance }} is {{ $value | printf \"%.1f\" }}%"
```
### Template Functions

```
# Label references
{{ $labels.instance }}
{{ $labels.job }}

# Value formatting
{{ $value }}
{{ $value | printf "%.2f" }}

# Conditionals
{{ if eq $labels.severity "critical" }}🔴{{ else }}🟡{{ end }}

# Loops
{{ range $label, $value := $labels }}
  {{ $label }}={{ $value }}
{{ end }}

# Built-in functions
{{ humanize $value }}            # 1.23k, 4.56M
{{ humanize1024 $value }}        # 1.23Ki, 4.56Mi
{{ humanizePercentage $value }}  # 0.85 -> 85%
{{ humanizeDuration $value }}    # 2h 30m
{{ toUpper $labels.alertname }}
{{ toLower $labels.alertname }}
{{ title $labels.alertname }}
{{ match ".*error.*" $labels.alertname }}
```
### Complete Template Example

```yaml
groups:
  - name: cpu
    rules:
      - alert: HighCPU
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: |
            CPU usage on instance {{ $labels.instance }} has reached {{ $value | printf "%.1f" }}%
            for more than 5 minutes.
            Host: {{ $labels.instance }}
            Suggested action: check the process list and identify the top CPU consumers.
```
## 8.3 The for Duration

The for field defines how long the alert condition must hold continuously before the alert fires (here, for: 1m):

```
Timeline ──────────────────────────────────────────────►

expr: up == 0 holds:
│ T=0    holds         → pending (no notification)
│ T=30s  holds         → pending
│ T=1m   holds         → firing! (notification sent)
│ T=2m   holds         → still firing
│ T=3m   no longer     → resolved (recovery notification sent)

expr: up == 0 holds but recovers quickly:
│ T=0    holds         → pending
│ T=20s  no longer     → pending cancelled (nothing sent)
```

### Recommended for Values

| Severity | Suggested for | Rationale |
|---|---|---|
| critical | 1-5 minutes | fast response |
| warning | 5-15 minutes | avoid flapping |
| info | 15-60 minutes | informational only |

Note: setting for to 0s (or omitting it) fires immediately and can generate a flood of transient alerts. A minimum of 1 minute is recommended.
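The pending/firing transitions can be observed directly in PromQL via the synthetic ALERTS series that Prometheus maintains for every active alert:

```promql
# All active states for one rule
ALERTS{alertname="InstanceDown"}

# Everything still in the pending phase (condition met, `for` not yet elapsed)
ALERTS{alertstate="pending"}
```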
## 8.4 Severity Design

### Recommended Levels

| Level | Name | Meaning | Response Requirement |
|---|---|---|---|
| P0 | critical | service unavailable / data loss | immediate response (phone/SMS) |
| P1 | warning | service degraded / approaching a threshold | respond within 30 minutes |
| P2 | info | worth watching, not urgent | handle next business day |
### Label Conventions

```yaml
labels:
  severity: critical    # critical / warning / info
  priority: P0          # P0 / P1 / P2
  team: backend         # owning team
  service: api          # owning service
  environment: prod     # environment
```
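These labels pay off when the Alertmanager routing tree keys on them. A minimal routing sketch, with hypothetical receiver names:

```yaml
# alertmanager.yml (fragment) -- receiver names are hypothetical
route:
  receiver: team-backend-chat        # default for everything else
  routes:
    - matchers: ['severity = critical']
      receiver: pagerduty-oncall     # P0: page immediately
    - matchers: ['severity = warning']
      receiver: team-backend-chat    # P1: chat notification
```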
## 8.5 Common Alert Rules

### Infrastructure

```yaml
groups:
  - name: infrastructure
    rules:
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been unreachable for more than 1 minute"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage"
          description: "{{ $labels.instance }} CPU usage is {{ $value | printf \"%.1f\" }}%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage"
          description: "{{ $labels.instance }} memory usage is {{ $value | printf \"%.1f\" }}%"

      # Low disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space"
          description: "{{ $labels.instance }} root filesystem has {{ $value | humanizePercentage }} space left"

      # Disk predicted to fill within 4 hours
      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "Disk filling up"
          description: "{{ $labels.instance }} is predicted to run out of disk space within 4 hours"

      # High system load
      - alert: HighSystemLoad
        expr: node_load15 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "High system load"
          description: "{{ $labels.instance }} 15-minute load per CPU is {{ $value | printf \"%.2f\" }}"
```
### Network

```yaml
groups:
  - name: network
    rules:
      - alert: HighNetworkTraffic
        expr: |
          rate(node_network_receive_bytes_total{device!="lo"}[5m]) > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal inbound network traffic"
          description: "{{ $labels.instance }} {{ $labels.device }} inbound traffic is {{ $value | humanize }}B/s"

      - alert: NetworkInterfaceDown
        expr: node_network_up{device!="lo"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Network interface down"
          description: "Interface {{ $labels.device }} on {{ $labels.instance }} is down"
```
### Application

```yaml
groups:
  - name: application
    rules:
      # Request error rate
      - alert: HighErrorRate
        expr: |
          sum by(job) (rate(http_requests_total{status=~"5.."}[5m]))
            / sum by(job) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "High request error rate"
          description: "{{ $labels.job }} 5xx error rate is {{ $value | humanizePercentage }}"

      # High P99 latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by(job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High P99 latency"
          description: "{{ $labels.job }} P99 latency is {{ $value | printf \"%.2f\" }}s"

      # Sudden drop in request rate
      - alert: LowRequestRate
        expr: |
          sum by(job) (rate(http_requests_total[5m]))
            < 0.5 * sum by(job) (rate(http_requests_total[5m] offset 1h))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Abnormal drop in request rate"
          description: "{{ $labels.job }} QPS has dropped more than 50% compared to 1 hour ago"
```
### Database

```yaml
groups:
  - name: database
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL instance down"

      - alert: MySQLHighConnections
        expr: mysql_global_status_threads_connected > 200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL connection count too high"
          description: "Current connections: {{ $value }}"

      - alert: MySQLSlowQueries
        expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Too many MySQL slow queries"
          description: "Slow query rate: {{ $value }}/s"

      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis instance down"

      # Requires maxmemory to be set in Redis; otherwise redis_memory_max_bytes is 0
      - alert: RedisHighMemory
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage too high"
          description: "Usage: {{ $value | humanizePercentage }}"
```
### Prometheus Self-Monitoring

```yaml
groups:
  - name: prometheus
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scrape target unreachable"

      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus config reload failed"

      - alert: PrometheusTSDBCompactionsFailing
        expr: increase(prometheus_tsdb_compactions_failed_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "TSDB compactions failing"

      - alert: PrometheusHighQueryLoad
        # rate() on the summary's _count gives queries per second;
        # avg_over_time on the raw counter would just grow without bound
        expr: rate(prometheus_engine_query_duration_seconds_count[5m]) > 100
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus query load too high"

      - alert: PrometheusStorageNearFull
        # 100 GiB is a hard-coded capacity assumption -- adjust to your volume size
        expr: |
          prometheus_tsdb_storage_blocks_bytes
            / (1024 * 1024 * 1024 * 100) > 0.85
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus storage approaching capacity"
```
## 8.6 Notification Templates

### Email Template (Alertmanager)

```
{{/* /etc/alertmanager/templates/email.tmpl */}}
{{ define "email.subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.SortedPairs.Values | join " " }}
{{ end }}

{{ define "email.html" }}
<html>
<body>
  <h2>{{ if eq .Status "resolved" }}✅ Resolved{{ else }}🔴 Firing{{ end }}</h2>
  <table border="1" cellpadding="5">
    <tr>
      <th>Alert</th>
      <th>Instance</th>
      <th>Severity</th>
      <th>Description</th>
      <th>Started At</th>
    </tr>
    {{ range .Alerts }}
    <tr>
      <td>{{ .Labels.alertname }}</td>
      <td>{{ .Labels.instance }}</td>
      <td>{{ .Labels.severity }}</td>
      <td>{{ .Annotations.description }}</td>
      <td>{{ .StartsAt.Format "2006-01-02 15:04:05" }}</td>
    </tr>
    {{ end }}
  </table>
  <p>Group labels: {{ .GroupLabels.SortedPairs.Values | join ", " }}</p>
</body>
</html>
{{ end }}
```
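The defined templates must be loaded and referenced from the Alertmanager config before they render anywhere. A sketch with a hypothetical receiver (name and address are placeholders):

```yaml
# alertmanager.yml (fragment) -- receiver name and address are placeholders
templates:
  - /etc/alertmanager/templates/*.tmpl

receivers:
  - name: email-ops
    email_configs:
      - to: ops@example.com
        headers:
          Subject: '{{ template "email.subject" . }}'
        html: '{{ template "email.html" . }}'
```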
### DingTalk Template

```
{{/* /etc/alertmanager/templates/dingtalk.tmpl */}}
{{ define "dingtalk.message" }}
{{ if eq .Status "firing" }}🔴 Alert firing{{ else }}✅ Alert resolved{{ end }}
{{ range .Alerts }}
**Alert**: {{ .Labels.alertname }}
**Severity**: {{ .Labels.severity }}
**Instance**: {{ .Labels.instance }}
**Description**: {{ .Annotations.description }}
**Started**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if eq .Status "resolved" }}
**Resolved**: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
---
{{ end }}
{{ end }}
```
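DingTalk has no native Alertmanager integration; the usual approach is a webhook bridge such as prometheus-webhook-dingtalk, which consumes a template like the one above. A sketch of the Alertmanager side (the bridge address and URL path are assumptions about your deployment):

```yaml
# alertmanager.yml (fragment) -- assumes prometheus-webhook-dingtalk on :8060
receivers:
  - name: dingtalk
    webhook_configs:
      - url: http://localhost:8060/dingtalk/webhook1/send
        send_resolved: true
```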
## 8.7 Testing Alert Rules

```yaml
# test_alerts.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="node1:9090"}'
        values: '1 1 1 0 0 0 0'
    alert_rule_test:
      # At 2m the target is still up -> no alert expected
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []
      # Down since 3m, for: 1m elapsed -> firing by 5m
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: api
              instance: "node1:9090"
              severity: critical
            exp_annotations:
              summary: "Instance node1:9090 is down"
              description: "Instance node1:9090 of job api has been unreachable for more than 1 minute"

  - interval: 1m
    input_series:
      # input_series must be raw series, not expressions like rate(...) --
      # the rule's own rate()/sum() does the math. 6 increments/min of 5xx
      # against 60/min of 2xx gives an error rate of 0.1/1.1 ≈ 9.1% (> 5%).
      - series: 'http_requests_total{job="api", status="500"}'
        values: '0+6x10'
      - series: 'http_requests_total{job="api", status="200"}'
        values: '0+60x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              job: api
              severity: critical
            exp_annotations:
              summary: "High request error rate"
              description: "api 5xx error rate is 9.091%"
```
### Running the Tests

```bash
# Run the unit tests
promtool test rules test_alerts.yml
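# Syntax-check the rule file itself (catches YAML/PromQL errors early)
promtool check rules alerts.yml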
# Check which route an alert would match
amtool config routes test \
  alertname=InstanceDown \
  severity=critical \
  cluster=production

# Validate the Alertmanager config
amtool check-config /etc/alertmanager/alertmanager.yml
```
## 8.8 Summary

| Topic | Key Points |
|---|---|
| Rule syntax | groups → rules → alert/expr/for/labels/annotations |
| Template variables | $labels, $value, $externalLabels |
| for field | filters transient alerts; ≥ 1m recommended |
| Severity levels | critical (immediate), warning (30 min), info (next business day) |
| Testing | promtool test rules |
Further Reading

Previous: 07 - Alert Management
Next: 09 - Recording Rules