# 09 - Monitoring and Alerting
## 9.1 Cluster Health Checks

### Monitoring from the ceph Command Line
```bash
# Cluster status overview (the most frequently used command)
ceph -s
```
Interpreting the output:

```
  cluster:
    id:     xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    health: HEALTH_OK                              ← health status (OK/WARN/ERR)

  services:
    mon: 3 daemons, quorum node1,node2,node3       ← Monitor quorum
    mgr: node1(active)                             ← active Manager
    mds: myfs:1 {0=node1=up:active}                ← MDS status
    osd: 12 osds: 12 up, 12 in                     ← OSD status

  task status:
    scrub status:
        mds.node1: idle

  data:
    pools:   5 pools, 500 pgs
    objects: 150.23k objects, 500 GiB
    usage:   1.5 TiB used, 8.5 TiB / 10 TiB avail  ← capacity usage
    pgs:     500 active+clean                      ← PG states

  io:
    client: 100 MiB/s rd, 50 MiB/s wr, 500 op/s rd, 200 op/s wr
```
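For scripting, the same information is available as JSON via `ceph -s --format json`. A minimal sketch of pulling out a few key fields (the `osdmap` field layout shown here matches recent releases; older versions nest it one level deeper, so verify against your version's output):

```bash
# Extract health, OSD counts, and PG count from `ceph -s` JSON output
ceph -s --format json | jq -r '
  "health:  \(.health.status)",
  "osds up: \(.osdmap.num_up_osds)/\(.osdmap.num_osds)",
  "pgs:     \(.pgmap.num_pgs)"
'
```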
### Health Status Levels
| Status | Meaning | Action |
|---|---|---|
| HEALTH_OK | Everything is working normally | None required |
| HEALTH_WARN | Warning; a minor issue may exist | Check the details and watch the trend |
| HEALTH_ERR | Serious error; data may be at risk | Investigate immediately |
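A common pattern is to turn these levels into exit codes so the check can be wired into cron or an external monitor. A minimal sketch (the 0/1/2/3 convention mirrors Nagios-style checks and is an assumption, not a Ceph standard):

```bash
#!/bin/bash
# Map Ceph health to Nagios-style exit codes: 0=OK, 1=WARNING, 2=CRITICAL
status=$(ceph health --format json | jq -r '.status')
case "$status" in
  HEALTH_OK)   echo "OK: $status";       exit 0 ;;
  HEALTH_WARN) echo "WARNING: $status";  exit 1 ;;
  HEALTH_ERR)  echo "CRITICAL: $status"; exit 2 ;;
  *)           echo "UNKNOWN: $status";  exit 3 ;;
esac
```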
```bash
# Show detailed health information
ceph health detail

# Watch the cluster in real time (streams cluster log events)
ceph -w

# Capacity usage
ceph df
ceph df detail
```
## 9.2 Core Monitoring Commands

### OSD Monitoring
```bash
# OSD status overview
ceph osd stat
ceph osd tree

# OSD utilization distribution
ceph osd df
ceph osd df tree
ceph osd utilization        # min/avg/max OSD utilization summary

# OSD performance statistics (commit/apply latency)
ceph osd perf

# In-flight (slow) operations on a specific OSD
# (uses the admin socket, so run on the host where osd.0 lives)
ceph daemon osd.0 dump_ops_in_flight

# Instruct an OSD to repair inconsistent PGs it holds
# (this triggers a repair; it is not a status query)
ceph osd repair osd.0
```
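To find the fullest OSDs programmatically, the JSON form of `ceph osd df` can be sorted by utilization. A sketch (the `.nodes[].utilization` field is present in recent releases; check your version's JSON output):

```bash
# Top 5 OSDs by utilization percentage
ceph osd df --format json | jq -r '
  .nodes[] | [.name, .utilization] | @tsv
' | sort -k2 -rn | head -5
```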
### PG Monitoring
```bash
# PG state summary
ceph pg stat

# List all PGs that are not active+clean
ceph pg ls | grep -v "active+clean"

# Stuck PG diagnostics
ceph pg dump_stuck unclean
ceph pg dump_stuck inactive
ceph pg dump_stuck stale
ceph pg dump_stuck undersized

# PG-to-OSD mapping details
ceph pg map <pgid>
```
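When only a handful of PGs are stuck, it is often useful to query each one for its full state. A sketch assuming the JSON shape of `ceph pg ls` in recent releases (a `pg_stats` array; older versions return a bare array, so adjust the jq filter accordingly):

```bash
# Query every PG that is not active+clean
for pgid in $(ceph pg ls --format json | jq -r '
    .pg_stats[] | select(.state != "active+clean") | .pgid'); do
  echo "=== $pgid ==="
  ceph pg "$pgid" query | jq '{state, up, acting}'
done
```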
### Pool Monitoring
```bash
# Per-pool I/O statistics
ceph osd pool stats

# Pool details (replica count, pg_num, flags, ...)
ceph osd pool ls detail

# Pool usage
ceph df | grep -A 20 "POOLS"
```
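For machine-readable pool usage, `ceph df --format json` exposes per-pool statistics. A sketch (field names under `.stats` vary across releases, e.g. `bytes_used` vs. the newer `stored`; verify against your version):

```bash
# Name, bytes used, and percent used for every pool
ceph df --format json | jq -r '
  .pools[] | [.name, .stats.bytes_used, .stats.percent_used] | @tsv
'
```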
## 9.3 Prometheus + Grafana Monitoring

### Enabling the Prometheus Module
```bash
# Enable the Prometheus exporter module
ceph mgr module enable prometheus

# Verify the endpoint (the module listens on port 9283 by default)
curl -s http://<mgr-node>:9283/metrics | head -20

# Change the port or bind address if needed
ceph config set mgr mgr/prometheus/server_port 9283
ceph config set mgr mgr/prometheus/server_addr 0.0.0.0
```
### Prometheus Configuration
```yaml
# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'ceph'
    metrics_path: /metrics
    static_configs:
      - targets:
          - 'node1:9283'
          - 'node2:9283'

  # Node Exporter (host-level metrics)
  - job_name: 'node'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
```
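Before (re)loading Prometheus, the configuration can be validated with `promtool`, which ships alongside Prometheus:

```bash
# Validate the configuration file
promtool check config /etc/prometheus/prometheus.yml

# Reload Prometheus without a restart (requires --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
```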
### Key Ceph Metrics
| Metric | Description | Suggested Alert Threshold |
|---|---|---|
| ceph_health_status | Cluster health (0=OK, 1=WARN, 2=ERR) | alert when > 0 |
| ceph_osd_in | Whether an OSD is in | alert when = 0 |
| ceph_osd_up | Whether an OSD is up | alert when = 0 |
| ceph_pg_active | Number of active PGs | - |
| ceph_pg_degraded | Number of degraded PGs | alert when > 0 |
| ceph_pg_stale | Number of stale PGs | alert when > 0 |
| ceph_pool_used_bytes | Space used per pool | - |
| ceph_cluster_total_used_bytes | Total space used cluster-wide | warn above 80% of ceph_cluster_total_bytes |
| ceph_osd_op_r_latency_sum / _count | OSD read latency (average = rate(sum) / rate(count)) | alert above 50 ms |
| ceph_osd_op_w_latency_sum / _count | OSD write latency (average = rate(sum) / rate(count)) | alert above 100 ms |
| ceph_osd_recovery_ops | Recovery operations (counter) | alert when > 1000 |
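These metrics can be spot-checked against the Prometheus HTTP API before wiring up dashboards or alerts. A sketch (the hostname `prometheus:9090` is a placeholder; the average-latency expression assumes the `_sum`/`_count` pair described above):

```bash
# Current cluster health as seen by Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=ceph_health_status' \
  | jq '.data.result[0].value'

# Average OSD read latency over the last 5 minutes, per OSD
QUERY='rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m])'
curl -s --data-urlencode "query=$QUERY" 'http://prometheus:9090/api/v1/query' \
  | jq -r '.data.result[] | [.metric.ceph_daemon, .value[1]] | @tsv'
```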
### Grafana Dashboards
```bash
# Recommended Grafana dashboard IDs (import from grafana.com)
#   5336 - Ceph - Cluster Overview
#   5342 - Ceph - OSD Details
#   7056 - Ceph - Pools Overview
#   1471 - Ceph - Prometheus Overview

# To import:
#   Grafana → + → Import → enter the dashboard ID → Load → select the Prometheus data source
```
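The same import can be scripted against the Grafana HTTP API. A sketch, assuming an admin API token and a data source named `Prometheus` (both placeholders); the grafana.com download URL follows the pattern used by community import scripts, and the datasource input name (`DS_PROMETHEUS` here) depends on the dashboard JSON:

```bash
GRAFANA_URL="http://grafana:3000"
API_TOKEN="<your-admin-api-token>"   # placeholder

# Download dashboard 5336 from grafana.com
curl -s https://grafana.com/api/dashboards/5336/revisions/latest/download \
  -o /tmp/dashboard.json

# Import it, binding the dashboard's Prometheus input to our data source
curl -s -X POST "$GRAFANA_URL/api/dashboards/import" \
  -H "Authorization: Bearer $API_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
        \"dashboard\": $(cat /tmp/dashboard.json),
        \"overwrite\": true,
        \"inputs\": [{\"name\": \"DS_PROMETHEUS\", \"type\": \"datasource\",
                      \"pluginId\": \"prometheus\", \"value\": \"Prometheus\"}]
      }"
```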
## 9.4 Alert Configuration

### Prometheus Alerting Rules

These rules are evaluated by Prometheus; Alertmanager handles routing and notification.
```yaml
# ceph_alerts.yml
groups:
  - name: ceph_alerts
    rules:
      # Cluster in HEALTH_WARN
      - alert: CephHealthWarning
        expr: ceph_health_status == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster is in HEALTH_WARN"

      # Cluster in HEALTH_ERR
      - alert: CephHealthError
        expr: ceph_health_status == 2
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster is in HEALTH_ERR"
          description: "Ceph health status is {{ $value }} (0=OK, 1=WARN, 2=ERR)"

      # OSD down
      - alert: CephOSDDown
        expr: ceph_osd_up == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "OSD {{ $labels.ceph_daemon }} is down"

      # Degraded PGs
      - alert: CephPGDegraded
        expr: ceph_pg_degraded > 0
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "PGs are in a degraded state"
          description: "Number of degraded PGs: {{ $value }}"

      # Capacity alerts
      - alert: CephCapacityWarning
        expr: (ceph_cluster_total_used_bytes / ceph_cluster_total_bytes) > 0.75
        for: 30m
        labels:
          severity: warning
        annotations:
          summary: "Ceph cluster capacity usage is above 75%"

      - alert: CephCapacityCritical
        expr: (ceph_cluster_total_used_bytes / ceph_cluster_total_bytes) > 0.85
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "Ceph cluster capacity usage is above 85%"

      # High OSD read latency (5-minute average from the _sum/_count pair)
      - alert: CephOSDHighLatency
        expr: rate(ceph_osd_op_r_latency_sum[5m]) / rate(ceph_osd_op_r_latency_count[5m]) > 0.05
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "OSD {{ $labels.ceph_daemon }} read latency is high: {{ $value }}s"

      # Recovery in progress
      - alert: CephRecoveryInProgress
        expr: ceph_pg_backfilling > 0 or ceph_pg_recovering > 0
        for: 0m
        labels:
          severity: info
        annotations:
          summary: "Ceph data recovery in progress, PGs affected: {{ $value }}"
```
## 9.5 Ceph Dashboard
```bash
# Enable the dashboard module if it is not already on
ceph mgr module enable dashboard

# Show the dashboard URL
ceph mgr services

# Set credentials (since Nautilus the password must be read from a file)
echo -n 'MySecureP@ss' > /tmp/dashboard_password
ceph dashboard set-login-credentials admin -i /tmp/dashboard_password

# Enable SSL with a self-signed certificate
ceph dashboard create-self-signed-cert

# Prometheus integration
ceph dashboard set-prometheus-api-host 'http://prometheus:9090'

# Grafana integration
ceph dashboard set-grafana-api-url 'http://grafana:3000'
```
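For day-to-day viewing it is safer to hand out a read-only account instead of the admin login. A sketch using the dashboard's built-in `read-only` role:

```bash
# Create a read-only dashboard user (password read from a file)
echo -n 'ViewerP@ss' > /tmp/viewer_password
ceph dashboard ac-user-create viewer -i /tmp/viewer_password read-only
rm -f /tmp/viewer_password
```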
## 9.6 Custom Monitoring Script
```bash
#!/bin/bash
# ceph_health_check.sh - Ceph health-check script

CEPH_CMD="ceph"
LOG_FILE="/var/log/ceph/health_check.log"

# Collect cluster state
HEALTH=$($CEPH_CMD health --format json | jq -r '.status')
OSD_UP=$($CEPH_CMD osd stat --format json | jq '.num_up_osds')
OSD_IN=$($CEPH_CMD osd stat --format json | jq '.num_in_osds')
# Sum the PG counts of every state whose name contains "degraded"
PG_DEGRADED=$($CEPH_CMD -s --format json | jq '
  [.pgmap.pgs_by_state[] | select(.state_name | contains("degraded")) | .count] | add // 0')
USAGE_PCT=$($CEPH_CMD df --format json | jq '.stats.total_used_bytes / .stats.total_bytes * 100 | floor')

echo "=== Ceph Health Check - $(date) ===" | tee -a "$LOG_FILE"
echo "Health: $HEALTH"                     | tee -a "$LOG_FILE"
echo "OSD Up/In: $OSD_UP/$OSD_IN"          | tee -a "$LOG_FILE"
echo "PG Degraded: $PG_DEGRADED"           | tee -a "$LOG_FILE"
echo "Usage: ${USAGE_PCT}%"                | tee -a "$LOG_FILE"

# Alerting logic
if [ "$HEALTH" != "HEALTH_OK" ]; then
  echo "ALERT: Cluster health is $HEALTH" | tee -a "$LOG_FILE"
  # Send an alert here (mail / webhook / IM bot)
fi
if [ "$PG_DEGRADED" -gt 0 ]; then
  echo "ALERT: $PG_DEGRADED degraded PGs" | tee -a "$LOG_FILE"
fi
if [ "$USAGE_PCT" -gt 80 ]; then
  echo "WARNING: Disk usage at ${USAGE_PCT}%" | tee -a "$LOG_FILE"
fi
```
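The alert stub above can be filled in with a simple webhook call. A sketch against a hypothetical incoming-webhook URL (Slack- and DingTalk-style endpoints accept small JSON payloads like this; adjust the field names to your receiver):

```bash
# Hypothetical webhook endpoint - replace with your own
WEBHOOK_URL="https://hooks.example.com/services/ceph-alerts"

send_alert() {
  # $1 = alert message
  curl -s -X POST "$WEBHOOK_URL" \
       -H 'Content-Type: application/json' \
       -d "{\"text\": \"[Ceph] $1\"}"
}

send_alert "Cluster health is $HEALTH"
```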
## 9.7 Log Management
```bash
# Ceph log locations
#   /var/log/ceph/ceph-mon.*.log
#   /var/log/ceph/ceph-osd.*.log
#   /var/log/ceph/ceph-mgr.*.log

# Follow logs in real time
tail -f /var/log/ceph/ceph-mon.node1.log
journalctl -u ceph-osd@0 -f

# Cluster-wide crash reports
ceph crash ls
ceph crash info <crash-id>
ceph crash archive <crash-id>   # archive (stop alerting on it)
ceph crash archive-all          # archive all crash reports

# Adjust log levels (temporary, per daemon)
ceph tell osd.0 config set debug_osd 0/5
ceph tell mon.node1 config set debug_mon 0/5

# Adjust log levels (persistent)
ceph config set osd debug_osd 0/5
```
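When chasing a performance problem, it is common to grep for slow-request messages, which OSDs log when an operation exceeds the complaint threshold:

```bash
# Slow requests logged by osd.0 over the last hour
journalctl -u ceph-osd@0 --since "1 hour ago" | grep -i "slow request"

# Or across the traditional log files
grep -i "slow request" /var/log/ceph/ceph-osd.*.log | tail -20
```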
## 9.8 Use Case: Daily Inspection
```bash
#!/bin/bash
# ceph_daily_check.sh - daily inspection report

echo "==================== Ceph Daily Report ===================="
echo "Date: $(date)"
echo ""
echo "--- Cluster Status ---"
ceph -s
echo ""
echo "--- OSD Utilization ---"
ceph osd df tree | head -20
echo ""
echo "--- Pool Usage ---"
ceph df | grep -A 30 "POOLS"
echo ""
echo "--- Stuck PGs ---"
ceph pg dump_stuck unclean  2>/dev/null | head -5
ceph pg dump_stuck inactive 2>/dev/null | head -5
ceph pg dump_stuck stale    2>/dev/null | head -5
echo ""
echo "--- Slowest OSDs ---"
ceph osd perf | tail -n +2 | sort -k3 -rn | head -10   # skip the header row before sorting
echo ""
echo "--- Recent Crashes ---"
ceph crash ls | head -10
echo ""
echo "==================== End of Report ===================="
```
## Further Reading

Next chapter: 10 - Performance Tuning, covering PG count optimization, OSD tuning, BlueStore configuration, and network optimization strategies.