
08 - Alert Rule Writing

8.1 Alert Rule Syntax

Rule File Structure

# /etc/prometheus/rules/alerts.yml
groups:
  - name: <group_name>        # Rule group name
    interval: <duration>       # Evaluation interval (optional, overrides the global setting)
    rules:
      - alert: <alert_name>    # Alert name
        expr: <promql_expr>    # PromQL expression
        for: <duration>        # How long the condition must hold
        labels:                # Labels
          severity: warning
        annotations:           # Annotations (used in notification templates)
          summary: "..."
          description: "..."
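
Before Prometheus loads a rule file, its syntax can be checked with promtool (using the path from the sketch above):

# Validate the rule file
promtool check rules /etc/prometheus/rules/alerts.yml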

Field Descriptions

Field        Required  Description
alert        yes       Alert name; PascalCase by convention
expr         yes       PromQL expression; the alert fires for every series in a non-empty result
for          no        How long the condition must hold before firing (default: fire immediately)
labels       no        Additional labels (e.g. severity)
annotations  no        Descriptive information (used in notification templates)

Basic Example

groups:
  - name: instance
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "实例 {{ $labels.instance }} 宕机"
          description: "{{ $labels.job }} 的实例 {{ $labels.instance }} 已不可达超过 1 分钟"

8.2 Template Syntax

Variable References

Variable         Description
$labels          Label key-value pairs of the firing alert
$value           The query value at the time the alert triggered
$externalLabels  The global external_labels
annotations:
  summary: "CPU 使用率过高: {{ $labels.instance }}"
  description: "实例 {{ $labels.instance }} CPU 使用率为 {{ $value | printf \"%.1f\" }}%"

Template Functions

# Label references
{{ $labels.instance }}
{{ $labels.job }}

# Value formatting
{{ $value }}
{{ $value | printf "%.2f" }}

# Conditionals
{{ if eq $labels.severity "critical" }}🔴{{ else }}🟡{{ end }}

# Loops
{{ range $label, $value := $labels }}
  {{ $label }}={{ $value }}
{{ end }}

# Built-in functions
{{ humanize $value }}          # 1.23K, 4.56M
{{ humanize1024 $value }}      # 1.23Ki, 4.56Mi
{{ humanizeDuration $value }}  # 2h 30m
{{ toUpper $labels.alertname }}
{{ toLower $labels.alertname }}
{{ title $labels.alertname }}
{{ match ".*error.*" $labels.alertname }}
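
These functions are typically combined inside annotations; an illustrative sketch (here $value is assumed to carry a bytes-per-second rate):

annotations:
  summary: "{{ toUpper $labels.severity }} on {{ $labels.instance }}"
  description: "Current rate {{ humanize $value }}B/s for job {{ $labels.job }}"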

Complete Template Example

groups:
  - name: cpu
    rules:
      - alert: HighCPU
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU 使用率过高"
          description: |
            实例 {{ $labels.instance }} CPU 使用率已达到 {{ $value | printf "%.1f" }}%,
            持续时间超过 5 分钟。
            
            当前主机: {{ $labels.instance }}
            建议操作: 检查进程列表,排查 CPU 占用最高的进程。

8.3 The for Duration

The for field defines how long the alert condition must hold continuously before the alert fires.

Timeline ──────────────────────────────────────────────►

expr: up == 0 holds (with for: 1m):
  │ T=0    true  → pending (nothing sent)
  │ T=30s  true  → pending
  │ T=1m   true  → firing! (alert sent)
  │ T=2m   true  → still firing
  │ T=3m   false → resolved (resolution notice sent)

expr: up == 0 holds but recovers quickly:
  │ T=0    true  → pending
  │ T=20s  false → pending cancelled (nothing sent)
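
The pending/firing lifecycle can be observed in Prometheus itself through the built-in ALERTS metric:

# One series per active alert; it disappears once the alert resolves
ALERTS{alertname="InstanceDown", alertstate="pending"}
ALERTS{alertname="InstanceDown", alertstate="firing"}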

Recommended for Settings

Severity  Suggested for   Notes
critical  1-5 minutes     fast response
warning   5-15 minutes    avoids flapping
info      15-60 minutes   informational only

Note: setting for to 0s (or omitting it) makes the alert fire immediately, which can produce a flood of transient alerts. Setting it to at least 1 minute is recommended.


8.4 Alert Severity Design

Recommended Severity Levels

Level  Severity  Meaning                                    Response requirement
P0     critical  Service unavailable / data loss            Immediate (phone/SMS)
P1     warning   Service degraded / approaching thresholds  Within 30 minutes
P2     info      Needs attention, not urgent                Next business day

Label Conventions

labels:
  severity: critical    # critical / warning / info
  priority: P0          # P0 / P1 / P2
  team: backend         # owning team
  service: api          # owning service
  environment: prod     # environment
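
Alertmanager routing is driven by exactly these labels; a minimal routing sketch under this convention (receiver names are illustrative):

# alertmanager.yml (excerpt)
route:
  receiver: default
  routes:
    - matchers:
        - severity = "critical"
      receiver: oncall-phone     # P0: paged immediately
    - matchers:
        - team = "backend"
      receiver: backend-chat     # routed to the owning team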

8.5 Common Alert Rules

Infrastructure Alerts

groups:
  - name: infrastructure
    rules:
      # Instance down
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "实例 {{ $labels.instance }} 宕机"
          description: "{{ $labels.job }} 的 {{ $labels.instance }} 已不可达超过 1 分钟"

      # High CPU usage
      - alert: HighCPUUsage
        expr: |
          100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU 使用率过高"
          description: "{{ $labels.instance }} CPU 使用率 {{ $value | printf \"%.1f\" }}%"

      # High memory usage
      - alert: HighMemoryUsage
        expr: |
          (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "内存使用率过高"
          description: "{{ $labels.instance }} 内存使用率 {{ $value | printf \"%.1f\" }}%"

      # Low disk space
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"}) < 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "磁盘空间不足"
          description: "{{ $labels.instance }} 根分区剩余空间 {{ $value | humanizePercentage }}"

      # Disk predicted to fill within 4 hours
      - alert: DiskWillFillIn4Hours
        expr: |
          predict_linear(node_filesystem_avail_bytes{mountpoint="/"}[1h], 4*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "磁盘即将耗尽"
          description: "{{ $labels.instance }} 预计 4 小时内磁盘空间将耗尽"

      # High system load
      - alert: HighSystemLoad
        expr: node_load15 / count without(cpu, mode) (node_cpu_seconds_total{mode="idle"}) > 2
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "系统负载过高"
          description: "{{ $labels.instance }} 15分钟负载 {{ $value | printf \"%.2f\" }}"

Network Alerts

groups:
  - name: network
    rules:
      - alert: HighNetworkTraffic
        expr: |
          rate(node_network_receive_bytes_total{device!="lo"}[5m]) > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "网络入流量异常"
          description: "{{ $labels.instance }} {{ $labels.device }} 入流量 {{ $value | humanize }}B/s"

      - alert: NetworkInterfaceDown
        expr: node_network_up{device!="lo"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "网络接口断开"
          description: "{{ $labels.instance }} 的 {{ $labels.device }} 接口已断开"

Application Alerts

groups:
  - name: application
    rules:
      # Request error rate
      - alert: HighErrorRate
        expr: |
          sum by(job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by(job) (rate(http_requests_total[5m]))
          > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "请求错误率过高"
          description: "{{ $labels.job }} 5xx 错误率 {{ $value | humanizePercentage }}"

      # High P99 latency
      - alert: HighLatency
        expr: |
          histogram_quantile(0.99,
            sum by(job, le) (rate(http_request_duration_seconds_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 延迟过高"
          description: "{{ $labels.job }} P99 延迟 {{ $value | printf \"%.2f\" }}s"

      # Sudden drop in request rate
      - alert: LowRequestRate
        expr: |
          sum by(job) (rate(http_requests_total[5m]))
          < 0.5 * sum by(job) (rate(http_requests_total[5m] offset 1h))
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "请求速率异常下降"
          description: "{{ $labels.job }} 当前 QPS 比 1 小时前下降超过 50%"

Database Alerts

groups:
  - name: database
    rules:
      - alert: MySQLDown
        expr: mysql_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "MySQL 实例宕机"

      - alert: MySQLHighConnections
        expr: mysql_global_status_threads_connected > 200
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL 连接数过高"
          description: "当前连接数 {{ $value }}"

      - alert: MySQLSlowQueries
        expr: rate(mysql_global_status_slow_queries[5m]) > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "MySQL 慢查询过多"
          description: "慢查询速率 {{ $value }}/s"

      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis 实例宕机"

      - alert: RedisHighMemory
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis 内存使用率过高"
          description: "使用率 {{ $value | humanizePercentage }}"

Prometheus Self-Monitoring Alerts

groups:
  - name: prometheus
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "监控目标不可达"

      - alert: PrometheusConfigReloadFailed
        expr: prometheus_config_last_reload_successful == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Prometheus 配置重载失败"

      - alert: PrometheusTSDBCompactionsFailing
        expr: increase(prometheus_tsdb_compactions_failed_total[1h]) > 0
        labels:
          severity: warning
        annotations:
          summary: "TSDB 压缩失败"

      - alert: PrometheusHighQueryLoad
        expr: rate(prometheus_engine_query_duration_seconds_count[5m]) > 100  # queries per second
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus 查询负载过高"

      - alert: PrometheusStorageNearFull
        expr: |
          prometheus_tsdb_storage_blocks_bytes
          / (1024 * 1024 * 1024 * 100) > 0.85  # assumes a fixed 100 GiB storage budget
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Prometheus 存储空间接近上限"

8.6 Alert Notification Templates

Email Template (Alertmanager Templates)

{{/* /etc/alertmanager/templates/email.tmpl */}}
{{ define "email.subject" }}
[{{ .Status | toUpper }}{{ if eq .Status "firing" }}:{{ .Alerts.Firing | len }}{{ end }}]
{{ .GroupLabels.SortedPairs.Values | join " " }}
{{ end }}

{{ define "email.html" }}
<html>
<body>
<h2>{{ if eq .Status "resolved" }}✅ Resolved{{ else }}🔴 Firing{{ end }}</h2>

<table border="1" cellpadding="5">
<tr>
  <th>Alert</th>
  <th>Instance</th>
  <th>Severity</th>
  <th>Description</th>
  <th>Started at</th>
</tr>
{{ range .Alerts }}
<tr>
  <td>{{ .Labels.alertname }}</td>
  <td>{{ .Labels.instance }}</td>
  <td>{{ .Labels.severity }}</td>
  <td>{{ .Annotations.description }}</td>
  <td>{{ .StartsAt.Format "2006-01-02 15:04:05" }}</td>
</tr>
{{ end }}
</table>

<p>Group labels: {{ .GroupLabels.SortedPairs.Values | join ", " }}</p>
</body>
</html>
{{ end }}

DingTalk Notification Template

{{/* /etc/alertmanager/templates/dingtalk.tmpl */}}
{{ define "dingtalk.message" }}
{{ if eq .Status "firing" }}🔴 Alert firing{{ else }}✅ Alert resolved{{ end }}

{{ range .Alerts }}
**Alert**: {{ .Labels.alertname }}
**Severity**: {{ .Labels.severity }}
**Instance**: {{ .Labels.instance }}
**Description**: {{ .Annotations.description }}
**Started**: {{ .StartsAt.Format "2006-01-02 15:04:05" }}
{{ if eq .Status "resolved" }}
**Resolved**: {{ .EndsAt.Format "2006-01-02 15:04:05" }}
{{ end }}
---
{{ end }}
{{ end }}
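
These template files only take effect once they are loaded and referenced in alertmanager.yml; a minimal sketch (recipient address is illustrative):

# alertmanager.yml (excerpt)
templates:
  - /etc/alertmanager/templates/*.tmpl

receivers:
  - name: email
    email_configs:
      - to: ops@example.com
        headers:
          Subject: '{{ template "email.subject" . }}'
        html: '{{ template "email.html" . }}'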

8.7 Testing Alert Rules

Testing with promtool

# test_alerts.yml
rule_files:
  - alerts.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    input_series:
      - series: 'up{job="api", instance="node1:9090"}'
        values: '1 1 1 0 0 0 0'
    alert_rule_test:
      - eval_time: 2m
        alertname: InstanceDown
        exp_alerts: []        # up is still 1 at T=2m, so no alert expected
      - eval_time: 5m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              job: api
              instance: "node1:9090"
              severity: critical
            exp_annotations:
              summary: "实例 node1:9090 宕机"

  - interval: 1m
    input_series:
      # input_series must be raw counter samples, not expressions;
      # the rule's rate() is applied to them at evaluation time
      - series: 'http_requests_total{job="api", status="500"}'
        values: '0+6x10'      # +6/min ≈ 0.1 req/s of 5xx
      - series: 'http_requests_total{job="api", status="200"}'
        values: '0+54x10'     # +54/min ≈ 0.9 req/s total with the above → 10% error rate
    alert_rule_test:
      - eval_time: 10m
        alertname: HighErrorRate
        exp_alerts:
          - exp_labels:
              job: api
              severity: critical
# Run the tests
promtool test rules test_alerts.yml

Validating Routes with amtool

# Test which route an alert would match
amtool config routes test \
  alertname=InstanceDown \
  severity=critical \
  cluster=production

# Validate the configuration syntax
amtool check-config /etc/alertmanager/alertmanager.yml

8.8 Chapter Summary

Topic               Summary
Rule syntax         groups → rules → alert/expr/for/labels/annotations
Template variables  $labels, $value
for field           Filters transient alerts; recommend ≥ 1m
Severity levels     critical (immediate), warning (30 min), info (next day)
Testing             promtool test rules
