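12 · Self-Monitoring
Chapter goals
- Understand the self-monitoring metrics that VictoriaMetrics exposes
- Build monitoring dashboards with Grafana
- Configure the recommended alerting rules
- Master health checks and capacity monitoring
12.1 Self-monitoring metrics
12.1.1 Metrics endpoint
# List all metrics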
curl http://localhost:8428/metrics
# The output is in Prometheus text format; show only the first 50 lines
curl -s http://localhost:8428/metrics | head -50
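The endpoint returns plain text in the Prometheus exposition format, one sample per line. As a minimal sketch of how that text can be consumed, the Python snippet below parses sample lines into tuples and sums one metric across its label sets. The helper names `parse_metrics` and `total` and the values in `SAMPLE` are illustrative assumptions, not real VictoriaMetrics output, and the regular expression only covers the common case (for example, it does not decode escaped characters inside label values).

```python
import re

# Matches sample lines such as: vm_rows{type="storage/big"} 12345
_SAMPLE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(\{.*\})?\s+(\S+)')


def parse_metrics(text):
    """Return a list of (name, labels_string, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        match = _SAMPLE_RE.match(line)
        if match is None:
            continue  # ignore anything that does not look like a sample
        name, labels, value = match.group(1), match.group(2) or "", match.group(3)
        samples.append((name, labels, float(value)))
    return samples


def total(samples, name):
    """Sum the values of every sample whose metric name equals `name`."""
    return sum(value for metric, _, value in samples if metric == name)


# Illustrative input only; the numbers are made up.
SAMPLE = """# made-up values for demonstration
vm_rows{type="storage/big"} 1200
vm_rows{type="storage/small"} 300
process_open_fds 42
"""
```

Calling `total(parse_metrics(SAMPLE), "vm_rows")` adds the two `vm_rows` samples together. In practice the text would come from the `/metrics` response rather than a hard-coded string.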
12.1.2 Key metric categories
| Category | Metric | Description |
|---|---|---|
| Ingestion | vm_rows_inserted_total | Total rows (samples) inserted |
| Ingestion | vm_slow_row_inserts_total | Count of slow row inserts |
| Ingestion | vm_inserts_total | Count of insert requests |
| Query | vm_request_duration_seconds | Request latency summary (per path and quantile) |
| Query | vm_slow_queries_total | Count of slow queries |
| Query | vm_concurrent_select_current | Queries currently executing |
| Storage | vm_cache_entries{type="storage/hour_metric_ids"} | Number of active time series |
| Storage | vm_new_timeseries_created_total | Total number of new series created |
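| Storage | vm_parts | Number of parts |
| Storage | vm_rows | Total number of stored rows |
| Cache | vm_cache_entries | Number of cache entries |
| Cache | vm_cache_size_bytes | Cache size in bytes |
| System | process_resident_memory_bytes | Resident memory (RSS) |
| System | process_cpu_seconds_total | Cumulative CPU time |
| System | process_open_fds | Open file descriptors |
12.1.3 Key metrics in detail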
# Ingestion throughput (samples/s), summed across the type label (ingestion protocol)
sum(rate(vm_rows_inserted_total[5m]))
# P99 query latency (vm_request_duration_seconds is exposed as a summary with a quantile label)
max(vm_request_duration_seconds{path="/api/v1/query", quantile="0.99"})
# Active time series
vm_cache_entries{type="storage/hour_metric_ids"}
# Disk space usage ratio
1 - (vm_free_disk_space_bytes / vm_total_disk_space_bytes)
# Cache hit ratio over the last 5 minutes, per cache type
1 - sum(rate(vm_cache_misses_total[5m])) by (type) / sum(rate(vm_cache_requests_total[5m])) by (type)
# Slow query rate
rate(vm_slow_queries_total[5m])
# Merge rate (completed merges per second; vm_merges_total is a counter)
rate(vm_merges_total[5m])
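To make the arithmetic behind the rate and ratio expressions above concrete, here is a rough Python sketch. It is deliberately simplified: PromQL's `rate()` also extrapolates to the window boundaries, which this version ignores. The function names are hypothetical and exist only for illustration.

```python
def counter_rate(v_start, v_end, seconds):
    """Average per-second increase of a counter between two samples.

    A drop in value is treated as a counter reset to zero, so only the
    value accumulated after the reset is counted.
    """
    increase = v_end - v_start if v_end >= v_start else v_end
    return increase / seconds


def disk_usage_ratio(free_bytes, total_bytes):
    """Fraction of disk space in use, i.e. 1 - free / total."""
    return 1 - free_bytes / total_bytes


def cache_hit_ratio(hits, misses):
    """Share of cache lookups that were hits; 0.0 when there were no lookups."""
    lookups = hits + misses
    return hits / lookups if lookups else 0.0
```

For example, a counter that grows from 1000 to 4000 over a 300-second window averages 10 samples per second. If only total requests and misses are available, hits can be derived as requests minus misses before calling `cache_hit_ratio`.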
12.2 Grafana dashboards
12.2.1 Configure the data source
# Grafana data source provisioning
# /etc/grafana/provisioning/datasources/victoriametrics.yml
apiVersion: 1
datasources:
  - name: VictoriaMetrics
    type: prometheus
    url: http://localhost:8428
    access: proxy
    isDefault: true
    jsonData:
      httpMethod: POST
      timeInterval: "15s"
    editable: true
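With `httpMethod: POST`, Grafana sends queries as form-encoded POST requests to the Prometheus-compatible query API. The sketch below builds the same kind of request by hand, which can help verify the data source URL before wiring it into Grafana. It assumes the instant-query endpoint is `/api/v1/query`; `build_query_request` and `run_query` are hypothetical helper names, not part of any library.

```python
import json
import urllib.parse
import urllib.request


def build_query_request(base_url, promql):
    """Build a form-encoded POST request for an instant query."""
    body = urllib.parse.urlencode({"query": promql}).encode()
    return urllib.request.Request(
        base_url.rstrip("/") + "/api/v1/query",
        data=body,
        method="POST",
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )


def run_query(base_url, promql, timeout=5):
    """Send the query and return the decoded JSON response (needs a reachable server)."""
    request = build_query_request(base_url, promql)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Against a local instance, `run_query("http://localhost:8428", "vm_rows")` would return the parsed JSON response. Building the request separately from sending it keeps the URL and body easy to inspect.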
12.2.2 Recommended dashboards
Official Grafana dashboards are available for each component:
| Dashboard ID | Name | Purpose |
|---|---|---|
| 10229 | VictoriaMetrics - single-node | Single-node monitoring |
| 11176 | VictoriaMetrics - cluster | Cluster monitoring |
| 12683 | VictoriaMetrics - vmagent | vmagent monitoring |
| 14950 | VictoriaMetrics - vmalert | vmalert monitoring |
To import a dashboard:
- In Grafana, go to Dashboards → Import
- Enter the dashboard ID
- Select the VictoriaMetrics data source
- Click Import
12.2.3 Core panels
Ingestion panels:
# Ingestion rate, summed across the type label (ingestion protocol)
sum(rate(vm_rows_inserted_total[5m]))
# Slow insert rate (a counter of slow inserts, not a latency measurement)
rate(vm_slow_row_inserts_total[5m])
# Concurrent inserts
vm_concurrent_insert_current
Query panels:
# Query QPS
sum(rate(vm_http_requests_total{path=~"/api/v1/.*"}[5m]))
# Query latency distribution (P50 and P99, from the summary's quantile label)
max(vm_request_duration_seconds{quantile="0.5"}) by (path)
max(vm_request_duration_seconds{quantile="0.99"}) by (path)
# Slow queries in the last hour
increase(vm_slow_queries_total[1h])
Storage panels:
# Active time series
vm_cache_entries{type="storage/hour_metric_ids"}
# Free disk space
vm_free_disk_space_bytes
# Part count
sum(vm_parts)
System panels:
# Memory usage
process_resident_memory_bytes{job="victoria-metrics"}
# CPU usage (percent of one core)
rate(process_cpu_seconds_total{job="victoria-metrics"}[5m]) * 100
# Goroutine count
go_goroutines{job="victoria-metrics"}
12.3 Recommended alerting rules
VictoriaMetrics publishes a set of recommended alerting rules.
12.3.1 Download the official alerting rules
# Alerting rules for the single-node version
curl -LO https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/deployment/docker/alerts.yml
# Alerting rules for the cluster version
curl -LO https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/deployment/docker/alerts-cluster.yml
# Alerting rules for vmalert
curl -LO https://raw.githubusercontent.com/VictoriaMetrics/VictoriaMetrics/master/deployment/docker/alerts-vmalert.yml
12.3.2 Core alerting rules
# /etc/vmalert/rules/vm-health.yml
groups:
  - name: vm-health
    rules:
      # Frequent restarts
      - alert: TooManyRestarts
        expr: changes(process_start_time_seconds{job=~"victoria.*"}[15m]) > 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.job }} is restarting frequently"
          description: "{{ $labels.instance }} restarted {{ $value }} times in the last 15 minutes"
      # Ingestion rate drop
      - alert: RowsInsertRateDrop
        expr: |
          (
            rate(vm_rows_inserted_total[5m]) < 0.5 *
            (rate(vm_rows_inserted_total[5m] offset 1h))
          ) and (
            rate(vm_rows_inserted_total[5m]) > 0
          )
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Ingestion rate fell below 50% of its value one hour ago"
      # Too many active time series
      - alert: TooHighActiveTimeSeries
        expr: vm_cache_entries{type="storage/hour_metric_ids"} > 5000000
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "More than 5 million active time series"
          description: "Current active series: {{ $value }}"
      # Disk space running out
      - alert: DiskRunsOutOfSpaceIn24h
        expr: |
          predict_linear(vm_free_disk_space_bytes[1h], 24*3600) < 0
        for: 30m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} is predicted to run out of disk space within 24 hours"
      # Disk space critically low
      - alert: DiskRunsOutOfSpace
        expr: vm_free_disk_space_bytes < 10 * 1024 * 1024 * 1024
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has less than 10 GiB of free disk space"
      # High memory usage (RSS relative to the memory available to the process)
      - alert: TooHighMemoryUsage
        expr: process_resident_memory_bytes / vm_available_memory_bytes > 0.9
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} memory usage is above 90%"
      # Too many slow queries
      - alert: TooManySlowQueries
        expr: rate(vm_slow_queries_total[5m]) > 1
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} slow query rate is above 1/s"
      # Request error rate
      - alert: TooManyErrors
        expr: |
          sum(rate(vm_http_request_errors_total[5m])) by (instance) >
          sum(rate(vm_http_requests_total[5m])) by (instance) * 0.01
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} HTTP error rate is above 1%"
      # Slow requests (P99 latency, read from the summary's quantile label)
      - alert: TooSlowQueries
        expr: max(vm_request_duration_seconds{quantile="0.99"}) by (path) > 30
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.path }} P99 request latency is above 30 seconds"
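The `DiskRunsOutOfSpaceIn24h` rule relies on `predict_linear`, which fits a straight line through recent samples and extrapolates it. The Python sketch below shows the idea with an ordinary least-squares fit. It is an approximation for illustration only: it extrapolates from the timestamp of the last sample, whereas PromQL extrapolates from the evaluation time, and the function names and sample values are assumptions.

```python
def predict_linear(samples, horizon_seconds):
    """Least-squares prediction of a value `horizon_seconds` after the last sample.

    `samples` is a list of (timestamp_seconds, value) pairs in time order.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    slope = cov / var_t                    # change in value per second
    intercept = mean_v - slope * mean_t    # fitted value at t = 0
    return intercept + slope * (samples[-1][0] + horizon_seconds)


def disk_fills_within(samples, horizon_seconds):
    """True if the fitted trend of free space drops below zero within the horizon."""
    return predict_linear(samples, horizon_seconds) < 0
```

With made-up samples `[(0, 100.0), (10, 90.0), (20, 80.0)]`, free space shrinks by one unit per second, so `disk_fills_within(samples, 100)` is true while `disk_fills_within(samples, 50)` is false. The `for: 30m` clause in the rule then guards against a short burst of writes triggering the alert.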
12.4 Integrating with Prometheus-style monitoring
12.4.1 Scrape VictoriaMetrics itself with vmagent
# vmagent configuration
scrape_configs:
  - job_name: 'victoria-metrics'
    static_configs:
      - targets: ['localhost:8428']
    scrape_interval: 15s
  - job_name: 'vminsert'
    static_configs:
      - targets: ['vminsert1:8480', 'vminsert2:8480']
  - job_name: 'vmselect'
    static_configs:
      - targets: ['vmselect1:8481', 'vmselect2:8481']
  - job_name: 'vmstorage'
    static_configs:
      - targets: ['vmstorage1:8482', 'vmstorage2:8482', 'vmstorage3:8482']
12.4.2 Self-monitoring architecture
┌─────────────────────────────────────────────┐
│                                             │
│  vmagent ──▶ VictoriaMetrics ──▶ Grafana    │
│     │               ▲                       │
│     │               │ scrapes itself        │
│     └───────────────┘                       │
│                                             │
│  vmalert ──▶ Alertmanager                   │
│     ▲                                       │
│     │ queries                               │
│     └──▶ VictoriaMetrics                    │
└─────────────────────────────────────────────┘
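Best practice: monitor production VictoriaMetrics with a separate VictoriaMetrics instance. A monitoring system that only watches itself cannot alert on its own outage.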
12.5 Runtime information API
# Build information
curl http://localhost:8428/api/v1/status/buildinfo
# TSDB status
curl http://localhost:8428/api/v1/status/tsdb
# Active queries
curl http://localhost:8428/api/v1/status/active_queries
# Health check
curl http://localhost:8428/health
# Process metrics
curl -s http://localhost:8428/metrics | grep "^process_"
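The health endpoint is convenient for scripted probes. Below is a minimal Python sketch of such a probe, assuming that any HTTP 200 response from `/health` means the instance is up. The names `http_get` and `is_healthy` are hypothetical, and the `fetch` parameter exists only so the check can be exercised without a running server.

```python
import urllib.request


def http_get(url, timeout=3):
    """Fetch a URL and return (status_code, body_text)."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status, response.read().decode()


def is_healthy(base_url, fetch=http_get):
    """Return True if <base_url>/health answers with HTTP 200.

    Connection failures, timeouts and HTTP error statuses all surface as
    OSError subclasses from urllib and are treated as "not healthy".
    """
    try:
        status, _body = fetch(base_url.rstrip("/") + "/health")
    except OSError:
        return False
    return status == 200
```

A cron job or container probe could call `is_healthy("http://localhost:8428")` and exit non-zero when it returns false.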
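Chapter summary
| Key point | Details |
|---|---|
| Metrics endpoint | /metrics exposes metrics in Prometheus text format |
| Grafana | Use the official dashboards (IDs 10229 and 11176) |
| Alerting rules | Official recommended rules cover ingestion, queries, storage, and system resources |
| Self-monitoring architecture | Monitor production VictoriaMetrics with a separate instance |
| API | /api/v1/status/* provides runtime information |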