
08 · Alerting Configuration

Chapter Objectives

Understand the architecture of vmalert and how it works
Write alerting rules
Integrate vmalert with Alertmanager
Use recording rules to speed up queries
Test and debug alerts

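8.1 Introduction to vmalert

vmalert is the alerting engine that ships with VictoriaMetrics. It plays the same role as the alerting and recording rule evaluator built into Prometheus, but runs as a separate process.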

┌──────────────────────────────────────────────┐
│                   vmalert                    │
│                                              │
│  ┌──────────────────┐  ┌──────────────────┐  │
│  │  Alerting Rules  │  │ Recording Rules  │  │
│  └────────┬─────────┘  └────────┬─────────┘  │
│           │                     │            │
│  periodic evaluation with MetricsQL queries  │
└───────────┼─────────────────────┼────────────┘
            │ alerts              │ results
            ▼                     ▼
    ┌──────────────┐     ┌─────────────────┐
    │ Alertmanager │     │ VictoriaMetrics │
    │  (routing)   │     │    (storage)    │
    └──────┬───────┘     └─────────────────┘
           │
     ┌─────┼─────┐
     ▼     ▼     ▼
   Email Slack DingTalk

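vmalert vs. Prometheus built-in rule evaluation

Feature	Prometheus (built-in)	vmalert
Deployment	Embedded in the Prometheus server	Standalone process
Query engine	Prometheus (PromQL)	VictoriaMetrics (MetricsQL)
Multi-tenancy	❌	✅ (cluster version)
External labels	Limited	Full support
Recording Rules	✅	✅
Backfilling	Requires promtool	Built in (replay mode)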

8.2 Installation and Startup

8.2.1 Download and Install

# vmalert ships in the vmutils archive together with vmagent, vmctl, and other tools
VM_VERSION="v1.106.0"
curl -LO "https://github.com/VictoriaMetrics/VictoriaMetrics/releases/download/${VM_VERSION}/vmutils-linux-amd64-${VM_VERSION}.tar.gz"
tar xzf "vmutils-linux-amd64-${VM_VERSION}.tar.gz"
# install copies the binary and sets the executable bit in one step
sudo install -m 0755 vmalert-prod /usr/local/bin/vmalert

8.2.2 Basic Startup

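# -remoteWrite.url: where recording rule results and alert state are written
# -remoteRead.url:  where alert state is restored from after a restart
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -datasource.url=http://localhost:8428 \
    -remoteWrite.url=http://localhost:8428 \
    -remoteRead.url=http://localhost:8428 \
    -notifier.url=http://alertmanager:9093 \
    -external.label=env=prod \
    -external.label=region=cn-north \
    -evaluationInterval=15s \
    -httpListenAddr=:8880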

8.2.3 systemd Service

# /etc/systemd/system/vmalert.service
[Unit]
Description=VictoriaMetrics Alert Engine
After=network.target victoria-metrics.service

[Service]
Type=simple
User=victoriametrics
Group=victoriametrics
ExecStart=/usr/local/bin/vmalert \
    -rule=/etc/vmalert/rules/*.yml \
    -datasource.url=http://localhost:8428 \
    -remoteWrite.url=http://localhost:8428 \
    -remoteRead.url=http://localhost:8428 \
    -notifier.url=http://localhost:9093 \
    -external.label=env=prod \
    -evaluationInterval=15s \
    -httpListenAddr=:8880
Restart=on-failure
RestartSec=10

[Install]
WantedBy=multi-user.target

8.3 Writing Alerting Rules

8.3.1 Rule File Format

# /etc/vmalert/rules/infra.yml
groups:
  - name: infrastructure
    interval: 30s          # evaluation interval (optional, overrides the global value)
    concurrency: 2         # number of rules evaluated concurrently (optional)
    rules:
      - alert: HighCPU
        expr: avg by (host) (cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
          team: platform
        annotations:
          summary: "CPU usage is too high on host {{ $labels.host }}"
          description: "Current CPU usage: {{ $value | printf \"%.1f\" }}%"

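8.3.2 Rule Fields

Field	Required	Description
`alert`	Yes	Name of the alert
`expr`	Yes	MetricsQL expression; every series it returns is a candidate alert
`for`	No	How long the expression must keep returning a series before the alert moves from pending to firing
`labels`	No	Extra labels attached to the alert
`annotations`	No	Descriptive text for the alert (templates supported)
`keep_firing_for`	No	How long the alert keeps firing after the expression stops returning the series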
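
The interaction between `for` and the evaluation interval is easier to follow in code. The block below is a minimal Python sketch of the pending/firing lifecycle, not vmalert's actual implementation: `AlertState`, `evaluate`, and `simulate` are hypothetical names, and `keep_firing_for` and state restoration are deliberately left out.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AlertState:
    """Tracks one alert instance across evaluations (hypothetical helper)."""
    active_since: Optional[float] = None  # time the expression first matched
    state: str = "inactive"               # inactive | pending | firing


def evaluate(alert: AlertState, matched: bool, now: float, for_seconds: float) -> str:
    """Advance the alert by one evaluation and return its new state."""
    if not matched:
        # The expression no longer returns this series: reset the timer.
        alert.active_since = None
        alert.state = "inactive"
        return alert.state
    if alert.active_since is None:
        alert.active_since = now
    held = now - alert.active_since
    alert.state = "firing" if held >= for_seconds else "pending"
    return alert.state


def simulate(matches: List[bool], interval: float, for_seconds: float) -> List[str]:
    """Run one evaluation per entry in `matches`, spaced `interval` seconds apart."""
    alert = AlertState()
    return [evaluate(alert, m, i * interval, for_seconds) for i, m in enumerate(matches)]


if __name__ == "__main__":
    # for: 2m with a 60s evaluation interval
    print(simulate([True, True, True, False], interval=60, for_seconds=120))
    # ['pending', 'pending', 'firing', 'inactive']
```

A single evaluation in which the expression returns nothing resets the timer, which is why a short gap in the data can delay an alert by a full `for` period.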

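8.3.3 Template Variables

The following template variables can be used in annotations:

Variable	Description	Example
`{{ $value }}`	Value returned by the query	`{{ $value }}`
`{{ $labels.xxx }}`	Value of a label	`{{ $labels.host }}`
`{{ $externalLabels.xxx }}`	Value of an external label	`{{ $externalLabels.env }}`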
`{{ $expr }}`	Expression of the rule	`{{ $expr }}`

8.4 Common Alerting Rules

8.4.1 Infrastructure Alerts

groups:
  - name: infrastructure-alerts
    rules:
      # Instance liveness
      - alert: InstanceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Instance {{ $labels.instance }} is unreachable"

      # CPU usage
      - alert: HighCPUUsage
        expr: avg by (host) (cpu_usage) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "CPU usage is too high on {{ $labels.host }}"
          description: "Current value: {{ $value | printf \"%.1f\" }}%"

      # Memory usage
      - alert: HighMemoryUsage
        expr: memory_usage > 90
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memory usage on {{ $labels.host }} is above 90%"

      # Disk space trend: extrapolate the last 7 days of data 7 days ahead
      - alert: DiskSpaceRunningOut
        expr: predict_linear(disk_usage[7d], 3600*24*7) > 95
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.host }} is predicted to exceed 95% within 7 days"

      # Disk space critical
      - alert: DiskSpaceCritical
        expr: disk_usage > 95
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk usage on {{ $labels.host }} is above 95%"
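
To make the `DiskSpaceRunningOut` expression less of a black box, here is a small Python sketch of the idea behind `predict_linear`: fit a straight line through the samples in the lookbehind window and extrapolate it forward. It is an illustration under simplifying assumptions (ordinary least squares over `(timestamp, value)` pairs), and the function below is not the engine's own code.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], now: float, horizon: float) -> float:
    """Least-squares line through (timestamp, value) samples, evaluated at now + horizon."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    if var == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = cov / var                     # growth per second
    intercept = mean_v - slope * mean_t   # value of the fitted line at t = 0
    return intercept + slope * (now + horizon)


if __name__ == "__main__":
    day = 86400.0
    # Disk usage starts at 60% and grows 3 percentage points per day for a week.
    history = [(d * day, 60.0 + 3.0 * d) for d in range(8)]
    forecast = predict_linear(history, now=7 * day, horizon=7 * day)
    print(f"forecast in 7 days: {forecast:.1f}%  alert: {forecast > 95}")
```

With growth of 3 points per day the forecast crosses the 95% threshold, while 2 points per day would not, even though both disks are well below 95% today.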

8.4.2 Application Alerts

groups:
  - name: application-alerts
    rules:
      # HTTP error rate
      - alert: HighErrorRate
        expr: |
          100 * sum by (job) (
            rate(http_requests_total{status=~"5.."}[5m])
          ) / sum by (job) (
            rate(http_requests_total[5m])
          ) > 5
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "HTTP 5xx error rate for {{ $labels.job }} is above 5%"
          description: "Current error rate: {{ $value | printf \"%.2f\" }}%"

      # P99 request latency
      - alert: HighLatencyP99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_duration_bucket[5m]))
          ) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "P99 latency for {{ $labels.job }} is above 1 second"
          description: "Current P99: {{ $value | printf \"%.3f\" }}s"

      # Service unavailable
      - alert: ServiceDown
        expr: absent(up{job="api-server"} == 1)
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "api-server is unavailable"
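
The `HighLatencyP99` rule relies on `histogram_quantile`, whose result is an estimate rather than an exact percentile. The Python sketch below shows the classic interpolation over cumulative `le` buckets, assuming observations are spread evenly inside each bucket. It is a simplified illustration; real engines handle more edge cases, and the bucket values in the example are made up.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate the q-quantile from cumulative (upper_bound, count) buckets.

    The bucket list must include the +Inf bucket, whose count is the total.
    """
    if not 0 <= q <= 1:
        raise ValueError("q must be between 0 and 1")
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        raise ValueError("need at least one finite bucket plus the +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound) or count == prev_count:
                # Open-ended or empty bucket: fall back to the last known bound.
                return prev_bound
            # Linear interpolation inside the bucket that contains the rank.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical per-second bucket rates for one job.
    rates = [(0.1, 50.0), (0.5, 90.0), (1.0, 98.0), (math.inf, 100.0)]
    print(histogram_quantile(0.95, rates))
    print(histogram_quantile(0.99, rates))
```

In this example the 99th percentile falls into the `+Inf` bucket, so the estimate is capped at the highest finite bound. If the largest finite bucket equals the alert threshold, an expression such as `> 1` can never fire, so bucket boundaries should extend past the threshold.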

8.4.3 Alerts for VictoriaMetrics Itself

groups:
  - name: vm-alerts
    rules:
      # Ingestion rate dropped below half of the rate one hour earlier
      - alert: VMInsertRateDropped
        expr: |
          rate(vm_rows_inserted_total[5m]) <
          rate(vm_rows_inserted_total[5m] offset 1h) * 0.5
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "VictoriaMetrics ingestion rate dropped sharply"

      # Too many active time series
      - alert: VMActiveTimeSeriesHigh
        expr: sum(vm_cache_entries{type="storage/hour_metric_ids"}) > 5000000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "More than 5 million active time series"

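      # Slow queries: vm_slow_queries_total is a counter, so alert on its growth, not its raw value
      - alert: VMSlowQueries
        expr: increase(vm_slow_queries_total[5m]) > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "More than 10 slow queries in the last 5 minutes"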
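
Metrics whose names end in `_total` are counters: they only grow until the process restarts. Alert conditions on them are therefore usually written against growth over a window rather than the raw value. The Python sketch below shows how such growth can be derived from raw samples. It is a simplified illustration that ignores window-boundary handling, and the sample values are invented.

```python
from typing import List


def increase(samples: List[float]) -> float:
    """Total growth of a counter across consecutive samples, tolerating resets."""
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        if cur < prev:
            # A drop means the process restarted and the counter began again from zero.
            total += cur
        else:
            total += cur - prev
    return total


if __name__ == "__main__":
    quiet_hour = [5000.0, 5000.0, 5001.0, 5001.0]
    busy_hour = [5000.0, 5004.0, 5009.0, 5015.0]
    for name, window in (("quiet", quiet_hour), ("busy", busy_hour)):
        print(name, "raw > 10:", window[-1] > 10, "increase > 10:", increase(window) > 10)
```

Both windows have a raw value far above 10, so a threshold on the raw counter cannot tell the quiet window from the busy one. Only the increase separates them.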

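8.5 Recording Rules

8.5.1 What Are Recording Rules?

Recording rules evaluate an expensive query on every evaluation interval and store the result as a new time series, so dashboards and alerts read the stored series instead of recomputing the query each time. vmalert persists the results through the endpoint given in `-remoteWrite.url`.

Original query (complex, slow):
  histogram_quantile(0.99,
    sum by (le, job) (rate(http_duration_bucket[5m]))
  )

After precomputation with a recording rule:
  job:http_request_duration_seconds:p99  ← simple, fast lookup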

8.5.2 Configuration Example

groups:
  - name: recording-rules
    interval: 30s
    rules:
      # Precompute P99 latency
      - record: job:http_request_duration_seconds:p99
        expr: |
          histogram_quantile(0.99,
            sum by (le, job) (rate(http_duration_bucket[5m]))
          )

      # Precompute request rate
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Precompute error ratio
      - record: job:http_errors:ratio5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
          / sum by (job) (rate(http_requests_total[5m]))

      # Precompute CPU usage per host
      - record: host:cpu_usage:avg
        expr: avg by (host) (cpu_usage)

8.5.3 Naming Convention for Recording Rules

<level>:<metric>:<aggregation>

Example:
  job:http_request_duration_seconds:p99
  └┬┘ └─────────────┬─────────────┘ └┬┘
   │                │                └── aggregation
   │                └── source metric name
   └── aggregation level

Levels:
  - job: aggregated by job
  - host: aggregated by host
  - cluster: aggregated by cluster
  - instance: aggregated by instance
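
A naming convention is easiest to keep when it is checked automatically. The following Python sketch validates rule names against the three-part pattern above. The regular expression and the `parse_rule_name` helper are assumptions made for illustration and are not part of vmalert.

```python
import re
from typing import NamedTuple, Optional

# Each part may contain letters, digits, and underscores; colons separate the parts.
_PART = r"[a-zA-Z_][a-zA-Z0-9_]*"
_RULE_NAME = re.compile(
    rf"^(?P<level>{_PART}):(?P<metric>{_PART}):(?P<aggregation>{_PART})$"
)


class RuleName(NamedTuple):
    level: str
    metric: str
    aggregation: str


def parse_rule_name(name: str) -> Optional[RuleName]:
    """Split a recording rule name into its parts, or return None if it does not conform."""
    match = _RULE_NAME.match(name)
    return RuleName(**match.groupdict()) if match else None


if __name__ == "__main__":
    for candidate in (
        "job:http_request_duration_seconds:p99",
        "host:cpu_usage:avg",
        "http_requests_total",  # raw metric name, no level or aggregation
    ):
        print(candidate, "->", parse_rule_name(candidate))
```

A check like this could run in CI over every `record:` entry before the rule files are deployed.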

8.6 Alertmanager Integration

8.6.1 Installing Alertmanager

# Run Alertmanager with Docker
docker run -d \
    --name alertmanager \
    -p 9093:9093 \
    -v /etc/alertmanager:/etc/alertmanager \
    prom/alertmanager:v0.27.0 \
    --config.file=/etc/alertmanager/alertmanager.yml

8.6.2 Alertmanager Configuration

# /etc/alertmanager/alertmanager.yml
global:
  resolve_timeout: 5m
  smtp_smarthost: 'smtp.example.com:587'
  smtp_from: 'alertmanager@example.com'           # placeholder address
  smtp_auth_username: 'alertmanager@example.com'  # placeholder address
  smtp_auth_password: 'password'

route:
  # Default route
  receiver: 'default-receiver'
  group_by: ['alertname', 'env']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h

  # Child routes are tried in order and the first match wins,
  # so the team-specific route comes before the severity routes
  routes:
    # VictoriaMetrics' own alerts
    - matchers:
        - team="vm-ops"
      receiver: 'vm-ops-receiver'

    # critical alerts
    - matchers:
        - severity="critical"
      receiver: 'critical-receiver'
      repeat_interval: 1h

    # warning alerts
    - matchers:
        - severity="warning"
      receiver: 'warning-receiver'
      repeat_interval: 4h

receivers:
  - name: 'default-receiver'
    email_configs:
      - to: 'ops-team@example.com'   # placeholder address

  - name: 'critical-receiver'
    email_configs:
      - to: 'oncall@example.com'     # placeholder address
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/ops/send'
        send_resolved: true

  - name: 'warning-receiver'
    email_configs:
      - to: 'ops-team@example.com'   # placeholder address

  - name: 'vm-ops-receiver'
    webhook_configs:
      - url: 'http://dingtalk-webhook:8060/dingtalk/vm-ops/send'

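# Inhibition rules (optional): while a critical alert is firing, suppress
# warning alerts that share the same alertname and instance
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ['alertname', 'instance']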
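
The `inhibit_rules` entry is often confused with silences, so it helps to spell out what it decides. The Python sketch below models one rule: a target alert is suppressed when some firing source alert matches the source side and agrees with the target on every label listed in `equal`. It is a simplified model with hypothetical helper names, using plain equality matchers only.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches(labels: Labels, matchers: Labels) -> bool:
    """True when every matcher label is present with the expected value."""
    return all(labels.get(key) == value for key, value in matchers.items())


def is_inhibited(
    target: Labels,
    firing: List[Labels],
    source_match: Labels,
    target_match: Labels,
    equal: List[str],
) -> bool:
    """Decide whether `target` is suppressed by any alert in `firing`."""
    if not matches(target, target_match):
        return False
    for source in firing:
        if not matches(source, source_match):
            continue
        # A missing label is treated like an empty value on both sides.
        if all(source.get(name, "") == target.get(name, "") for name in equal):
            return True
    return False


if __name__ == "__main__":
    critical = {"alertname": "DiskUsage", "instance": "db-1", "severity": "critical"}
    warning = {"alertname": "DiskUsage", "instance": "db-1", "severity": "warning"}
    rule = dict(
        source_match={"severity": "critical"},
        target_match={"severity": "warning"},
        equal=["alertname", "instance"],
    )
    print(is_inhibited(warning, [critical], **rule))
```

Because `alertname` is listed in `equal`, the rule only suppresses a warning that shares its name with the critical alert. Two differently named rules, such as `DiskSpaceRunningOut` and `DiskSpaceCritical`, would not inhibit each other under this configuration.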

8.6.3 Multiple Alertmanager Instances

# vmalert sends every alert to all configured Alertmanagers (HA)
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -datasource.url=http://localhost:8428 \
    -notifier.url=http://alertmanager1:9093 \
    -notifier.url=http://alertmanager2:9093

8.7 Testing and Debugging Alerts

8.7.1 Validating Rule Syntax

# -dryRun validates the rule files and exits without starting vmalert
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -dryRun

8.7.2 Testing Expressions in VMUI

# Run the alert expression directly in VMUI.
# Every series the query returns would become a pending alert.
avg by (host) (cpu_usage) > 80

8.7.3 vmalert API

# List all rules
curl http://localhost:8880/api/v1/rules

# List active alerts
curl http://localhost:8880/api/v1/alerts

# Show the rules of a single group
curl -G http://localhost:8880/api/v1/rules --data-urlencode 'rule_group[]=infrastructure'
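
The JSON returned by `/api/v1/alerts` is convenient to post-process in scripts. The Python sketch below filters a response body down to the firing alerts. It assumes the Prometheus-style envelope (`status`, `data.alerts`) and the `name`, `state`, and `labels` fields. The sample payload is hypothetical and trimmed, so check the field names against the output of your vmalert version.

```python
import json
from typing import Dict, List


def firing_alerts(payload: str) -> List[Dict[str, str]]:
    """Return the name and labels of every alert whose state is 'firing'."""
    body = json.loads(payload)
    if body.get("status") != "success":
        raise ValueError(f"unexpected status: {body.get('status')!r}")
    alerts = body.get("data", {}).get("alerts", [])
    return [
        {"name": alert.get("name", ""), **alert.get("labels", {})}
        for alert in alerts
        if alert.get("state") == "firing"
    ]


# Hypothetical response body, trimmed to the fields used above.
SAMPLE = """
{
  "status": "success",
  "data": {
    "alerts": [
      {"name": "HighCPU", "state": "firing",
       "labels": {"host": "web-1", "severity": "warning"}},
      {"name": "HighMemoryUsage", "state": "pending",
       "labels": {"host": "web-2", "severity": "warning"}}
    ]
  }
}
"""

if __name__ == "__main__":
    for alert in firing_alerts(SAMPLE):
        print(alert)
```

In practice the payload would come from an HTTP request to the vmalert address used in the `curl` commands above, for example via `urllib.request.urlopen`.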

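8.7.4 Replaying Rules over Historical Data

# Replay the rules over a past time range to see whether they would have fired.
# Replay mode sends no notifications: results are written through -remoteWrite.url
# as the ALERTS and ALERTS_FOR_STATE series.
vmalert \
    -rule="/etc/vmalert/rules/*.yml" \
    -datasource.url=http://localhost:8428 \
    -remoteWrite.url=http://localhost:8428 \
    -replay.timeFrom=2024-01-01T00:00:00Z \
    -replay.timeTo=2024-01-31T23:59:59Z \
    -replay.maxDatapointsPerQuery=1000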

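8.8 Alerting Best Practices

8.8.1 Severity Levels

Level	Meaning	Response time	Notification channel
`critical`	Service outage or data loss	Within 5 minutes	Phone call + SMS + instant message
`warning`	Degraded performance or resource pressure	Within 30 minutes	Instant message + email
`info`	Needs attention but is not urgent	Next business day	Email / ticket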

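8.8.2 Avoiding Alert Fatigue

# ❌ Not recommended: a `for` duration this short makes the alert fire constantly
- alert: HighCPU
  expr: cpu_usage > 80
  for: 10s  # too short: any brief spike triggers it

# ✅ Recommended: a reasonable duration
- alert: HighCPU
  expr: avg by (host) (cpu_usage) > 80
  for: 5m  # fires only after 5 minutes above the threshold

# ✅ Aggregate to reduce the number of alerts
- alert: HighCPU
  expr: avg by (host) (cpu_usage) > 80  # one alert per host
  # instead of
  # cpu_usage > 80  # one alert per series, possibly hundreds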

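8.8.3 Graceful Degradation

# Use absent() to detect monitoring data that has gone missing
- alert: MonitoringDown
  expr: absent(up{job="victoria-metrics"})
  for: 2m
  labels:
    severity: critical
  annotations:
    summary: "VictoriaMetrics monitoring data is missing; the monitoring system itself may be failing"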

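Chapter Summary

Topic	Key points
vmalert	Standalone alerting engine that evaluates both alerting rules and recording rules
Rule format	Compatible with the Prometheus rule file format
Alertmanager	Routes alerts to multiple notification channels and can be deployed in HA
Recording Rules	Precompute expensive queries to speed up reads
Best practices	Sensible `for` durations, aggregation, avoiding alert fatigue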
