17 - Troubleshooting

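17.1 Quick reference for common problems

Problem	Likely cause	What to do
Target shows DOWN	Network unreachable, or the service is not running	Check connectivity and the port
Query returns nothing	Time range or label selector does not match	Check the selector and the time window
Alert does not fire	for duration too long, or the expression is wrong	Test the PromQL expression
High memory usage	Too many time series	Look for high-cardinality metrics
Query times out	Query is too expensive	Precompute it with recording rules
Disk is filling up	Retention period too long	Adjust retention
Missing data	Failed scrapes, or timestamp alignment	Check scrape_interval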

17.2 Troubleshooting scrapes

Checking target status

# List every active target with its health and last scrape error
curl -s http://localhost:9090/api/v1/targets | jq '.data.activeTargets[] | {instance: .labels.instance, health: .health, lastError: .lastError}'

# Show only the targets that belong to one job
curl -s 'http://localhost:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | select(.labels.job=="my-app")'

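Common scrape errors

Error message	Cause	Fix
`connection refused`	Service not running, or wrong port	Check the service status and port
`context deadline exceeded`	Scrape timed out	Increase `scrape_timeout` (it cannot exceed `scrape_interval`)
`server returned HTTP status 404`	Wrong `metrics_path`	Check the `metrics_path` setting
`no token found`	Authentication failed	Check the auth configuration
`x509: certificate signed by unknown authority`	TLS certificate is not trusted	Configure `tls_config`
`Get "http://...": dial tcp: lookup ...`	DNS resolution failed	Check the DNS configuration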
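When many targets fail at once, it can help to bucket their lastError strings by likely cause. The following is a minimal sketch in POSIX sh; classify_scrape_error is a hypothetical helper name, and its patterns simply mirror the table above, so treat the mapping as a starting point, not a complete list.

```shell
# classify_scrape_error MESSAGE: map a scrape error string to a likely cause.
# Hypothetical helper; the patterns follow the common-errors table.
classify_scrape_error() {
  case "$1" in
    *"connection refused"*)            echo "service down or wrong port" ;;
    *"context deadline exceeded"*)     echo "scrape timeout" ;;
    *"HTTP status 404"*)               echo "wrong metrics_path" ;;
    *"no token found"*)                echo "authentication failure" ;;
    *"certificate signed by unknown"*) echo "untrusted TLS certificate" ;;
    *"dial tcp: lookup"*)              echo "DNS resolution failure" ;;
    *)                                 echo "unknown" ;;
  esac
}
```

One possible use is to feed it each lastError value extracted from /api/v1/targets with jq -r, one message per call.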

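Testing a scrape by hand

# Fetch the target's metrics endpoint directly
curl -v http://target:9090/metrics

# Ask Prometheus for the metric metadata of targets matching a label selector
# (url-encode the selector so neither the shell nor curl interprets the braces and quotes)
curl -G http://localhost:9090/api/v1/targets/metadata \
  --data-urlencode 'match_target={job="my-app"}'

# Test authentication
curl -H "Authorization: Bearer <token>" https://target:8443/metrics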
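17.3 PromQL query problems

Query returns an empty result

# 1. Check that the metric exists at all
curl 'http://localhost:9090/api/v1/query?query=http_requests_total'

# 2. Check the labels (url-encode the selector so curl does not treat the braces as a glob)
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=http_requests_total{job="my-app"}'

# 3. List every metric name through the __name__ label
curl 'http://localhost:9090/api/v1/label/__name__/values'

# 4. Evaluate the query at a specific point in time
curl 'http://localhost:9090/api/v1/query?query=up&time=2024-01-01T00:00:00Z'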

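Common query mistakes

# Problem: rate() returns no data
# Cause: the range window is too small; rate() needs at least two samples inside the window
rate(http_requests_total[10s])  # with scrape_interval=15s, a 10s window cannot hold two samples

# Fix: use a window of at least 4x scrape_interval
rate(http_requests_total[1m])

# Problem: a division returns an empty result or NaN
# Cause: the label sets of the two sides do not match (empty result), or the divisor is 0 (NaN)
rate(http_errors[5m]) / rate(http_requests[5m])

# Fix: aggregate both sides to shared labels and filter out zero divisors
sum by (job) (rate(http_errors[5m])) / (sum by (job) (rate(http_requests[5m])) > 0)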
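The 4x rule for rate() windows is easy to apply mechanically. The sketch below derives the window from a scrape interval given in whole seconds; min_rate_window and rate_expr are hypothetical helper names, and the factor 4 is the rule of thumb stated above, not a hard limit.

```shell
# min_rate_window SECONDS: print a range window of 4x the scrape interval.
min_rate_window() {
  echo "$(( $1 * 4 ))s"
}

# rate_expr METRIC SECONDS: print a rate() expression using that window.
rate_expr() {
  printf 'rate(%s[%s])\n' "$1" "$(min_rate_window "$2")"
}
```

With a 15-second scrape interval, `rate_expr http_requests_total 15` prints `rate(http_requests_total[60s])`.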

Analyzing query performance

# Include execution statistics (timings) in the query response
curl -G 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=sum(rate(http_requests_total[5m]))' \
  --data-urlencode 'stats=true'

# Show TSDB statistics
curl http://localhost:9090/api/v1/status/tsdb

# Show runtime information
curl http://localhost:9090/api/v1/status/runtimeinfo

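17.4 Troubleshooting the TSDB

Disk space

# Size of the data directory
du -sh /var/lib/prometheus/

# Number of blocks (block directories have ULID names and sit directly in the data directory)
find /var/lib/prometheus -maxdepth 1 -type d -name '01*' | wc -l

# Size of the WAL
du -sh /var/lib/prometheus/wal/

# TSDB statistics API
curl http://localhost:9090/api/v1/status/tsdb | jq .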

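Series explosion

# Number of active series in the head block
curl -s http://localhost:9090/api/v1/status/tsdb | jq '.data.headStats.numSeries'

# Metrics with the most series
curl -s 'http://localhost:9090/api/v1/status/tsdb?limit=20' | jq '.data.seriesCountByMetricName | sort_by(-.value) | .[0:20]'

# Labels with the most distinct values
curl -s 'http://localhost:9090/api/v1/status/tsdb?limit=20' | jq '.data.labelValueCountByLabelName | sort_by(-.value) | .[0:20]'

# Number of distinct metric names
curl -s 'http://localhost:9090/api/v1/label/__name__/values' | jq '.data | length'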

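Compaction problems

# Check for failed compactions
curl -G http://localhost:9090/api/v1/query --data-urlencode 'query=prometheus_tsdb_compactions_failed_total'

# Search the log for TSDB errors
grep -i "tsdb" /var/log/prometheus/prometheus.log | grep -i error

# Remove tombstoned (deleted) data from disk; requires --web.enable-admin-api
curl -X POST http://localhost:9090/api/v1/admin/tsdb/clean_tombstones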

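WAL recovery

# Recovery steps when the WAL is corrupted
# 1. Stop Prometheus
sudo systemctl stop prometheus

# 2. Back up the data directory, preserving ownership and permissions
cp -a /var/lib/prometheus /var/lib/prometheus.backup

# 3. Delete the WAL (samples not yet written to a block are lost)
rm -rf /var/lib/prometheus/wal/*

# 4. List the remaining blocks with promtool
promtool tsdb list /var/lib/prometheus/

# 5. Start Prometheus again
sudo systemctl start prometheus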
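A more cautious variant of step 3 is to move the WAL out of the way instead of deleting it, so the files stay available for later inspection. This is an alternative to the steps above, not part of them. The sketch assumes Prometheus is already stopped and that the parent of the data directory is writable; quarantine_wal and the `.wal-corrupt.<timestamp>` naming are hypothetical.

```shell
# quarantine_wal DATA_DIR: move DATA_DIR/wal next to the data directory as
# DATA_DIR.wal-corrupt.<timestamp> and print the new location.
# Run it only while Prometheus is stopped.
quarantine_wal() {
  data_dir=${1%/}    # strip one trailing slash, if any
  if [ ! -d "$data_dir/wal" ]; then
    echo "no WAL directory in $data_dir" >&2
    return 1
  fi
  dest="$data_dir.wal-corrupt.$(date +%Y%m%d%H%M%S)"
  mv "$data_dir/wal" "$dest" && echo "$dest"
}
```

For example, `quarantine_wal /var/lib/prometheus` would leave the data directory without a wal subdirectory. Delete the quarantined copy once the server is healthy again.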

17.5 Troubleshooting alerts

Alert does not fire

# 1. Check the state of the alerting rule
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name=="InstanceDown")'

# 2. Test the expression in the Prometheus web UI
# Open http://localhost:9090/graph and enter the alert's expr

# 3. Check that Prometheus has discovered Alertmanager
curl http://localhost:9090/api/v1/alertmanagers

# 4. List the alerts that Alertmanager has received
curl http://alertmanager:9093/api/v2/alerts

Checking rule syntax

# Validate rule files with promtool
promtool check rules /etc/prometheus/rules/*.yml

# Run rule unit tests
promtool test rules test.yml
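For readers who have not written a rule unit test before, the sketch below generates a rule file and a matching test.yml. The InstanceDown rule body (`up == 0` for 5 minutes with a severity label), the series labels, and the helper name write_rule_test are all assumptions made for illustration; substitute your real rule file and labels.

```shell
# write_rule_test DIR: create DIR/rules.yml and DIR/test.yml.
# The rule definition and the labels are assumptions for this sketch.
write_rule_test() {
  mkdir -p "$1" || return 1
  cat > "$1/rules.yml" <<'EOF'
groups:
  - name: example
    rules:
      - alert: InstanceDown
        expr: up == 0
        for: 5m
        labels:
          severity: critical
EOF
  cat > "$1/test.yml" <<'EOF'
rule_files:
  - rules.yml
evaluation_interval: 1m
tests:
  - interval: 1m
    input_series:
      # '0x10' expands to eleven samples of 0, one per minute from 0m to 10m.
      - series: 'up{job="my-app", instance="app:8080"}'
        values: '0x10'
    alert_rule_test:
      - eval_time: 10m
        alertname: InstanceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: 'app:8080'
EOF
}
```

A possible invocation is `write_rule_test /tmp/ruletest && (cd /tmp/ruletest && promtool test rules test.yml)`. The test expects the alert to be firing at the 10-minute mark because the series has been 0 for longer than the 5-minute `for` duration.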

Alertmanager problems

# Validate the configuration file
amtool check-config /etc/alertmanager/alertmanager.yml

# List current alerts
amtool alert query --alertmanager.url=http://localhost:9093

# List silences
amtool silence query --alertmanager.url=http://localhost:9093

# Show the routing tree
amtool config routes --alertmanager.url=http://localhost:9093

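17.6 Performance tuning

Prometheus performance metrics

# Query latency
prometheus_engine_query_duration_seconds{quantile="0.99"}

# Rule evaluation latency
prometheus_rule_evaluation_duration_seconds{quantile="0.99"}

# Scrape latency (scrape_duration_seconds has one sample per target and no quantile label,
# so take the quantile across targets)
quantile(0.99, scrape_duration_seconds)

# Number of series in the head block
prometheus_tsdb_head_series

# Memory usage
process_resident_memory_bytes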

High-cardinality problems

# Find the metrics with the most series
curl -s http://localhost:9090/api/v1/status/tsdb | \
  jq -r '.data.seriesCountByMetricName[] | "\(.value)\t\(.name)"' | \
  sort -rn | head -20

# Find the label/value pairs with the most series
curl -s http://localhost:9090/api/v1/status/tsdb | \
  jq -r '.data.seriesCountByLabelValuePair[] | "\(.value)\t\(.name)"' | \
  sort -rn | head -20
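The commands above print one `count<TAB>name` line per entry. To turn that into a simple check, the output can be filtered against a series budget. A minimal sketch; over_threshold is a hypothetical helper, and any limit you pass (for example 10000) is an arbitrary illustration, not a recommended value.

```shell
# over_threshold LIMIT: read "count<TAB>name" lines on stdin and print
# only the lines whose count is greater than LIMIT.
over_threshold() {
  awk -F '\t' -v limit="$1" '$1 + 0 > limit + 0 { print }'
}
```

Appending `| over_threshold 10000` to either pipeline lists only the entries that exceed the budget, which makes the result easy to use in a cron job or CI check.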

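Optimization suggestions

Problem	Optimization
Too many series	Remove high-cardinality labels; drop unneeded series with `metric_relabel_configs`
Slow queries	Pre-aggregate with recording rules
Not enough memory	Reduce the number of active series first; lowering `--storage.tsdb.min-block-duration` is an advanced option
Not enough disk	Shorten the retention time, or cap disk usage with `--storage.tsdb.retention.size`
Slow scrapes	Improve exporter performance; increase `scrape_timeout`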

Dropping metrics with metric_relabel_configs

scrape_configs:
  - job_name: 'my-app'
    metric_relabel_configs:
      # Drop metrics that are not needed
      - source_labels: [__name__]
        regex: 'go_.*'
        action: drop

      # Drop histogram bucket series that carry the high-cardinality request_id label
      # (source label values are joined with ";" before the regex is applied)
      - source_labels: [__name__, request_id]
        regex: 'http_request_duration_seconds_bucket;.+'
        action: drop

17.7 Log analysis

Useful log searches

# Show errors
grep -i error /var/log/prometheus/prometheus.log

# Show failed scrapes (these are logged at debug level)
grep -i "scrape failed" /var/log/prometheus/prometheus.log

# Show rule evaluation problems
grep "rule" /var/log/prometheus/prometheus.log | grep -i error

# Docker logs
docker logs prometheus 2>&1 | grep -i error
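Before grepping for individual messages, a tally of lines per log level gives a quick picture of how noisy the log is. The sketch assumes the default logfmt output, where each line carries a `level=...` field; count_log_levels is a hypothetical helper name.

```shell
# count_log_levels FILE: print "<level> <count>" for every log level found,
# sorted by level name. Levels are lower-cased so "ERROR" and "error" merge.
count_log_levels() {
  grep -o 'level=[A-Za-z]*' "$1" |
    tr '[:upper:]' '[:lower:]' |
    sort | uniq -c |
    awk '{ sub(/^level=/, "", $2); print $2, $1 }'
}
```

Running `count_log_levels /var/log/prometheus/prometheus.log` prints one line per level. A sudden jump in the error or warn count is usually the place to start reading.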

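Changing the log level

# Set the log level at startup
prometheus --log.level=debug

# The log level is a command-line flag, so changing it requires a restart.
# SIGHUP only reloads the configuration file and rule files:
kill -HUP $(pgrep -x prometheus)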

17.8 Chapter summary

Area	Key command / API
Scrape status	`/api/v1/targets`
TSDB statistics	`/api/v1/status/tsdb`
Runtime information	`/api/v1/status/runtimeinfo`
Alert status	`/api/v1/rules`
Syntax checks	`promtool check config`, `promtool check rules`
High cardinality	`seriesCountByMetricName`
