
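08 · Alerting Configuration

Chapter Goals

  • Understand the architecture of vmalert and how it works
  • Write alerting rules
  • Integrate vmalert with Alertmanager
  • Use recording rules to improve query performance
  • Test and debug alerts

8.1 Introduction to vmalert

vmalert is the alerting engine that ships with VictoriaMetrics. It plays the same role as the rule evaluator built into Prometheus: it periodically runs alerting and recording rules against a data source, sends firing alerts to Alertmanager, and writes recording-rule results back to storage.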

┌──────────────────────────────────────────────┐
│                   vmalert                    │
│                                              │
│  ┌──────────────────┐  ┌──────────────────┐  │
│  │  Alerting Rules  │  │ Recording Rules  │  │
│  └────────┬─────────┘  └────────┬─────────┘  │
│           │                     │            │
│    evaluated periodically with MetricsQL     │
└───────────┼─────────────────────┼────────────┘
            │ firing alerts       │ results
            ▼                     ▼
    ┌──────────────┐     ┌─────────────────┐
    │ Alertmanager │     │ VictoriaMetrics │
    │  (routing)   │     │    (storage)    │
    └──────┬───────┘     └─────────────────┘
           │
     ┌─────┼─────┐
     ▼     ▼     ▼
   Email Slack DingTalk

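vmalert vs. Prometheus built-in rule evaluation

Feature          Prometheus (built-in)   vmalert
---------------  ----------------------  ---------------------------
Integration      Embedded in Prometheus  Standalone process
Query engine     Prometheus (PromQL)     VictoriaMetrics (MetricsQL)
Multi-tenancy    ❌                      ✅ (cluster version)
External labels  Limited                 Fully supported
Recording rules  ✅                      ✅
Backfilling      Requires promtool       Built in (replay mode)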

8.2 Installation and Startup

8.2.1 Download and Install

# vmalert is distributed inside the vmutils release archive as vmalert-prod
VM_VERSION="v1.106.0"
curl -LO "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/vmutils-linux-amd64-${VM_VERSION}.tar.gz"
tar xzf "vmutils-linux-amd64-${VM_VERSION}.tar.gz"
sudo mv vmalert-prod /usr/local/bin/vmalert
sudo chmod +x /usr/local/bin/vmalert

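8.2.2 Basic Startup

# -rule is quoted so that vmalert, not the shell, expands the glob.
# -remoteWrite.url is where recording-rule results and alert state are stored;
# -remoteRead.url lets vmalert restore alert state after a restart.
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -datasource.url=http://localhost:8428 \
    -remoteWrite.url=http://localhost:8428 \
    -remoteRead.url=http://localhost:8428 \
    -notifier.url=http://alertmanager:9093 \
    -external.label=env=prod \
    -external.label=region=cn-north \
    -evaluationInterval=15s \
    -httpListenAddr=:8880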

8.2.3 systemd Service

# /etc/systemd/system/vmalert.service
[Unit]
Description=VictoriaMetrics Alert Engine
After=network.target victoria-metrics.service

[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/vmalert \
    -rule=/etc/vmalert/rules/*.yml \
    -datasource.url=http://localhost:8428 \
    -remoteWrite.url=http://localhost:8428 \
    -remoteRead.url=http://localhost:8428 \
    -notifier.url=http://localhost:9093 \
    -external.label=env=prod \
    -evaluationInterval=15s \
    -httpListenAddr=:8880
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

8.3 Writing Alerting Rules

8.3.1 Rule File Format

# /etc/vmalert/rules/infra.yml
groups:
  - name: infrastructure
    interval: 30s          # evaluation interval (optional; overrides -evaluationInterval)
    concurrency: 2         # number of rules evaluated in parallel (optional)
    rules:
      - alert: HighCPU
        expr: avg by (host) (cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "CPU usage on host {{ $labels.host }} is too high"
          description: "Current CPU usage: {{ $value | printf \"%.1f\" }}%"

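8.3.2 Rule Fields

Field            Required  Description
---------------  --------  ----------------------------------------------------------------------------
alert            Yes       Name of the alert
expr             Yes       MetricsQL query expression
for              No        How long the expression must keep returning results before the alert fires
labels           No        Extra labels attached to the alert
annotations      No        Descriptive information for the alert (templates supported)
keep_firing_for  No        How long the alert keeps firing after the expression stops returning results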
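
The interplay between for and keep_firing_for is easier to follow as a small state machine. The sketch below is a simplified Python model, not vmalert's actual implementation; the function name alert_states and the assumption of exactly one evaluation per interval are illustrative only.

```python
from typing import List


def alert_states(matches: List[bool], interval_s: int, for_s: int,
                 keep_firing_for_s: int = 0) -> List[str]:
    """Return the alert state after each evaluation.

    matches[i] is True when the rule expression returned a result
    at evaluation number i (evaluations are interval_s seconds apart).
    """
    states: List[str] = []
    active_since = None  # time at which the current run of matches started
    last_match = None    # time of the most recent match
    firing = False
    for i, matched in enumerate(matches):
        now = i * interval_s
        if matched:
            if active_since is None:
                active_since = now
            last_match = now
            # pending -> firing once the expression has matched for `for_s`
            if now - active_since >= for_s:
                firing = True
        else:
            still_kept = (firing and last_match is not None
                          and now - last_match <= keep_firing_for_s)
            if not still_kept:
                # a pending alert, or a firing one past its grace period, resets
                active_since = None
                firing = False
        if firing:
            states.append("firing")
        elif active_since is not None:
            states.append("pending")
        else:
            states.append("inactive")
    return states


# Hypothetical timeline: evaluations every 60s, for: 2m
print(alert_states([True, True, True, False, True], 60, 120))
print(alert_states([True, True, True, False, True], 60, 120, keep_firing_for_s=60))
```

In this model a single missed evaluation resets the alert to inactive unless keep_firing_for covers the gap, which is why a short keep_firing_for can suppress resolve/fire flapping on noisy series.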

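8.3.3 Template Variables

The following template variables are available in annotations:

Variable                   Description                                Example
-------------------------  -----------------------------------------  -------------------------
{{ $value }}               Value returned by the query                {{ $value }}
{{ $labels.xxx }}          Value of label xxx                         {{ $labels.host }}
{{ $externalLabels.xxx }}  External label set with -external.label    {{ $externalLabels.env }}
{{ $labels.alertname }}    Alert name, exposed as the alertname label {{ $labels.alertname }}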

8.4 Common Alerting Rules

8.4.1 Infrastructure Alerts

groups:
  - name: infrastructure-alerts
    rules:
      # Instance liveness
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is unreachable"

      # CPU usage
      - alert: HighCPUUsage
        expr: avg by (host) (cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage on {{ $labels.host }} is too high"
          description: "Current value: {{ $value | printf \"%.1f\" }}%"

      # Memory usage
      - alert: HighMemoryUsage
        expr: memory_usage > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage on {{ $labels.host }} is above 90%"

      # Disk space trend: linear forecast 7 days ahead, based on the last 7 days
      - alert: DiskSpaceRunningOut
        expr: predict_linear(disk_usage[7d], 3600*24*7) > 95
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk usage on {{ $labels.host }} is predicted to exceed 95% within 7 days"

      # Disk space, urgent
      - alert: DiskSpaceCritical
        expr: disk_usage > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage on {{ $labels.host }} is above 95%"
8.4.2 Application Alerts

groups:
  - name: application-alerts
    rules:
      # HTTP error rate
      - alert: HighErrorRate
        expr: |
          100 * sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (job) (
            rate(http_requests_total[5m])
          ) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate for {{ $labels.job }} is above 5%"
          description: "Current error rate: {{ $value | printf \"%.2f\" }}%"

      # P99 request latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_duration_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency for {{ $labels.job }} is above 1 second"
          description: "Current P99: {{ $value | printf \"%.3f\" }}s"

      # Service unavailable: no api-server target reports up == 1
      - alert: ServiceDown
        expr: absent(up{job="api-server"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "api-server is unavailable"
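8.4.3 Alerts for VictoriaMetrics Itself

groups:
  - name: vm-alerts
    rules:
      # Ingestion rate fell below half of the rate one hour earlier
      - alert: VMInsertRateDropped
        expr: |
          rate(vm_rows_inserted_total[5m]) <
          rate(vm_rows_inserted_total[5m] offset 1h) * 0.5
        for: 15m
        labels:
          severity: warning
          team: vm-ops
        annotations:
          summary: "VictoriaMetrics ingestion rate dropped sharply"

      # Surge in active time series
      - alert: VMActiveTimeSeriesHigh
        expr: sum(vm_cache_entries{type="storage/hour_metric_ids"}) > 5000000
        for: 30m
        labels:
          severity: warning
          team: vm-ops
        annotations:
          summary: "More than 5 million active time series"

      # Slow queries: vm_slow_queries_total is a counter, so alert on its increase
      - alert: VMSlowQueries
        expr: increase(vm_slow_queries_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
          team: vm-ops
        annotations:
          summary: "More than 10 slow queries in the last 5 minutes"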
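8.5 Recording Rules

8.5.1 What Are Recording Rules

A recording rule evaluates an expensive query on a schedule and stores the result as a new time series, so dashboards and alerts read the precomputed series instead of recomputing the query. vmalert persists these results only when -remoteWrite.url is set.

Original query (complex and slow):
  histogram_quantile(0.99,
    sum by (le, job) (rate(http_duration_bucket[5m]))
  )

After precomputation by a recording rule:
  job:http_request_duration_seconds:p99  ← a plain series lookup, fast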

8.5.2 Configuration Example

groups:
  - name: recording-rules
    interval: 30s
    rules:
      # Precompute P99 latency
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_duration_bucket[5m]))
          )

      # Precompute request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Precompute error ratio (0 to 1)
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      # Precompute CPU usage per host
      - record: host:cpu_usage:avg
        expr: avg by (host) (cpu_usage)

8.5.3 Recording Rule Naming Convention

<level>:<metric>:<aggregation>

Example:
  job:http_request_duration_seconds:p99
  └┬┘ └─────────────┬─────────────┘ └┬┘
   │                │                └── aggregation type
   │                └── source metric name
   └── aggregation level

Levels:
  - job: aggregated by job
  - host: aggregated by host
  - cluster: aggregated by cluster
  - instance: aggregated by instance
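
The level:metric:aggregation convention can be checked mechanically, for example in a CI step that lints rule files. The sketch below is one possible checker; the regular expression and the function name parse_rule_name are assumptions for illustration, not part of vmalert.

```python
import re
from typing import Dict

# One identifier per part: letters, digits and underscores, not starting with a digit.
_PART = r"[a-zA-Z_][a-zA-Z0-9_]*"
RULE_NAME = re.compile(
    rf"^(?P<level>{_PART}):(?P<metric>{_PART}):(?P<aggregation>{_PART})$"
)


def parse_rule_name(name: str) -> Dict[str, str]:
    """Split a recording rule name into level, metric and aggregation.

    Raises ValueError when the name does not have exactly three parts.
    """
    match = RULE_NAME.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow level:metric:aggregation")
    return match.groupdict()


for rule in ("job:http_request_duration_seconds:p99", "host:cpu_usage:avg"):
    print(rule, "->", parse_rule_name(rule))
```

A name such as http_requests_total, which has no level or aggregation part, is rejected with ValueError.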

8.6 Alertmanager Integration

8.6.1 Installing Alertmanager

# Run Alertmanager with Docker
docker run -d \
    --name alertmanager \
    -p 9093:9093 \
    -v /etc/alertmanager:/etc/alertmanager \
    prom/alertmanager:v0.27.0 \
    --config.file=/etc/alertmanager/alertmanager.yml

8.6.2 Alertmanager Configuration

# /etc/alertmanager/alertmanager.yml
# All e-mail addresses below are placeholders; replace them with your own.
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'
  smtp_auth_username: 'alertmanager@example.com'
  smtp_auth_password: 'password'

route:
  # Default route
  receiver: 'default-receiver'
  group_by: ['alertname', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  routes:
    # Alerts about VictoriaMetrics itself (labeled team: vm-ops).
    # Listed first because child routes are tried top-down and the first
    # match wins; placed after the severity routes it would never be reached
    # by an alert that also carries a severity label.
    - match:
        team: vm-ops
      receiver: 'vm-ops-receiver'

    # critical alerts
    - match:
        severity: critical
      receiver: 'critical-receiver'
      repeat_interval: 1h

    # warning alerts
    - match:
        severity: warning
      receiver: 'warning-receiver'
      repeat_interval: 4h

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops@example.com'

  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'
        send_resolved: true

  - name: 'warning-receiver'
    email_configs:
      - to: 'ops@example.com'

  - name: 'vm-ops-receiver'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/vm-ops/send'

# Inhibition rules (optional): while a critical alert is firing, mute the
# warning alert that has the same alertname and instance
inhibit_rules:
  - source_match:
      severity: 'critical'
    target_match:
      severity: 'warning'
    equal: ['alertname', 'instance']
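
Route order matters in a configuration like this one: Alertmanager tries child routes from top to bottom and stops at the first match unless that route sets continue: true. The Python sketch below models only this first-match behavior for flat, label-equality routes; it ignores nested routes, regex matchers and grouping, and the function name pick_receivers is hypothetical.

```python
from typing import Any, Dict, List


def pick_receivers(labels: Dict[str, str], routes: List[Dict[str, Any]],
                   default_receiver: str) -> List[str]:
    """Return the receivers an alert is delivered to.

    Child routes are tried in order; matching stops at the first hit
    unless that route sets "continue": True.
    """
    receivers: List[str] = []
    for route in routes:
        wanted = route.get("match", {})
        if all(labels.get(key) == value for key, value in wanted.items()):
            receivers.append(route["receiver"])
            if not route.get("continue", False):
                break
    # alerts matching no child route fall back to the top-level receiver
    return receivers or [default_receiver]


severity_first = [
    {"match": {"severity": "critical"}, "receiver": "critical-receiver"},
    {"match": {"severity": "warning"}, "receiver": "warning-receiver"},
    {"match": {"team": "vm-ops"}, "receiver": "vm-ops-receiver"},
]
team_first = [severity_first[2], severity_first[0], severity_first[1]]

# Hypothetical alert carrying both a severity and a team label.
alert = {"alertname": "VMSlowQueries", "severity": "warning", "team": "vm-ops"}
print(pick_receivers(alert, severity_first, "default-receiver"))
print(pick_receivers(alert, team_first, "default-receiver"))
```

With the severity routes first, the sample alert stops at warning-receiver; with the team route first it reaches vm-ops-receiver. A team-specific route therefore needs to come before the broader severity routes, or the severity routes need continue: true.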

8.6.3 Multiple Alertmanager Instances

# vmalert can send to several Alertmanager instances (HA):
# repeat -notifier.url once per instance
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -datasource.url=http://localhost:8428 \
    -notifier.url=http://alertmanager1:9093 \
    -notifier.url=http://alertmanager2:9093

8.7 Testing and Debugging Alerts

8.7.1 Validating Rule Syntax

# -dryRun loads and validates the rule files, then exits without evaluating them
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -dryRun

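8.7.2 Testing Expressions in VMUI

# Run the alert expression directly in VMUI.
# Each series the query returns corresponds to one alert, which becomes
# pending and then fires once the `for` duration has elapsed.
avg by (host) (cpu_usage) > 80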

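8.7.3 vmalert API

# List all rules
curl http://localhost:8880/api/v1/rules

# List active alerts
curl http://localhost:8880/api/v1/alerts

# List the rules of one group (Prometheus-compatible rule_group[] filter);
# the URL is quoted for the shell and -g stops curl from treating [] as a glob
curl -g 'http://localhost:8880/api/v1/rules?rule_group[]=infrastructure'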
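
The JSON returned by /api/v1/alerts is convenient to post-process in scripts, for instance to count firing alerts per severity. The sketch below assumes the Prometheus-style response shape (status, data.alerts, and per-alert state and labels fields); the sample payload is invented, so check the field names against a real response from your vmalert version.

```python
import json
from collections import Counter

# Invented sample shaped like a Prometheus-compatible /api/v1/alerts response.
SAMPLE = json.dumps({
    "status": "success",
    "data": {"alerts": [
        {"name": "InstanceDown", "state": "firing",
         "labels": {"severity": "critical", "instance": "node1:9100"}},
        {"name": "HighCPUUsage", "state": "pending",
         "labels": {"severity": "warning", "host": "node2"}},
        {"name": "DiskSpaceCritical", "state": "firing",
         "labels": {"severity": "critical", "host": "node3"}},
    ]},
})


def firing_by_severity(payload: str) -> Counter:
    """Count firing alerts per severity label in an alerts API response."""
    doc = json.loads(payload)
    if doc.get("status") != "success":
        raise ValueError(f"unexpected status: {doc.get('status')!r}")
    alerts = doc.get("data", {}).get("alerts", [])
    return Counter(
        alert.get("labels", {}).get("severity", "unknown")
        for alert in alerts
        if alert.get("state") == "firing"
    )


print(firing_by_severity(SAMPLE))
```

Against a live instance, the payload string would come from an HTTP GET on the alerts endpoint, for example with urllib.request.urlopen.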

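8.7.4 Replaying Rules over Historical Data

# Evaluate the rules over a past time range to see whether they would have fired.
# Replay mode sends no notifications: recording-rule results and the alert
# state series (ALERTS, ALERTS_FOR_STATE) are written to -remoteWrite.url.
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -datasource.url=http://localhost:8428 \
    -remoteWrite.url=http://localhost:8428 \
    -replay.timeFrom=2024-01-01T00:00:00Z \
    -replay.timeTo=2024-01-31T23:59:59Z \
    -replay.maxDatapoints=1000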
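
The max-datapoints setting bounds how much history each range query covers during a replay, so it also determines how many queries a replay needs. The sketch below models this under the assumption that one query spans interval × max datapoints; the function name replay_windows is hypothetical and the real splitting logic in vmalert may differ in detail.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple


def replay_windows(time_from: datetime, time_to: datetime,
                   interval: timedelta,
                   max_datapoints: int) -> List[Tuple[datetime, datetime]]:
    """Split [time_from, time_to] into consecutive query windows,
    each covering at most interval * max_datapoints."""
    if max_datapoints <= 0 or interval <= timedelta(0):
        raise ValueError("interval and max_datapoints must be positive")
    window = interval * max_datapoints
    windows = []
    start = time_from
    while start < time_to:
        end = min(start + window, time_to)
        windows.append((start, end))
        start = end
    return windows


windows = replay_windows(
    datetime(2024, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 31, 23, 59, 59, tzinfo=timezone.utc),
    interval=timedelta(seconds=15),   # assumed evaluation interval
    max_datapoints=1000,
)
print(len(windows), "range queries per rule")
print("first window ends at", windows[0][1].isoformat())
```

Under these assumptions a 15s interval with 1000 datapoints gives windows of 4h10m, so replaying January takes 179 range queries per rule; a group with a 30s interval would need about half as many.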

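8.8 Alerting Best Practices

8.8.1 Severity Levels

Level     Meaning                                    Response time      Notification
--------  -----------------------------------------  -----------------  ----------------------------------
critical  Service outage or data loss                Within 5 minutes   Phone call + SMS + instant message
warning   Degraded performance or resource pressure  Within 30 minutes  Instant message + e-mail
info      Worth attention but not urgent             Next business day  E-mail or ticket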

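8.8.2 Avoiding Alert Fatigue

# ❌ Not recommended: a `for` duration this short fires on every brief spike
- alert: HighCPU
  expr: cpu_usage > 80
  for: 10s  # too short: momentary jitter triggers the alert

# ✅ Recommended: a reasonable duration
- alert: HighCPU
  expr: avg by (host) (cpu_usage) > 80
  for: 5m  # fires only after 5 minutes above the threshold

# ✅ Aggregate to reduce the number of alerts
- alert: HighCPU
  expr: avg by (host) (cpu_usage) > 80  # one alert per host
  # instead of
  # cpu_usage > 80  # one alert per series, possibly hundreds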

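8.8.3 Graceful Degradation

# Use absent() to detect monitoring data that has gone missing
- alert: MonitoringDown
  expr: absent(up{job="victoria-metrics"})
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "No up series for VictoriaMetrics: the monitoring pipeline itself may be down"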

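Chapter Summary

Topic            Key points
---------------  --------------------------------------------------------------------------------------------
vmalert          Standalone alerting engine that evaluates both alerting rules and recording rules
Rule format      Compatible with the Prometheus rule file format, plus vmalert extensions such as concurrency
Alertmanager     Routes alerts to multiple notification channels; vmalert can send to several instances for HA
Recording rules  Precompute expensive queries to improve query performance
Best practices   Use a sensible for duration, aggregate before alerting, avoid alert fatigue

Further Reading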