Chapter 12: Monitoring and Alerting
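Monitoring is the foundation of running RabbitMQ reliably. This chapter builds a complete monitoring stack, from metric collection through visualization to alerting.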
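12.1 Metric categories
| Category | Key metrics | What it tells you |
|---|---|---|
| Node | Memory usage, disk space, Erlang process count | Node health |
| Connection | Connection count, channel count, consumer count | Client state |
| Queue | Queue depth, ready messages, unacknowledged messages | Message backlog |
| Message | Publish rate, delivery rate, acknowledgement rate | Message throughput |
| Cluster | Node status, network partitions, quorum queue status | Cluster health |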
12.2 Monitoring via the management API
Key API endpoints
# System overview
curl -u admin:admin123 http://localhost:15672/api/overview
# Node information
curl -u admin:admin123 http://localhost:15672/api/nodes
# Queue list
curl -u admin:admin123 http://localhost:15672/api/queues
# Connection list
curl -u admin:admin123 http://localhost:15672/api/connections
# Channel list
curl -u admin:admin123 http://localhost:15672/api/channels
# Exchange list
curl -u admin:admin123 http://localhost:15672/api/exchanges
# Health checks
curl -u admin:admin123 http://localhost:15672/api/health/checks/alarms
curl -u admin:admin123 http://localhost:15672/api/health/checks/local-alarms
# The protocol listener check takes the protocol name as a path segment
curl -u admin:admin123 http://localhost:15672/api/health/checks/protocol-listener/amqp
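A script that calls the health check endpoints usually needs a yes/no answer rather than raw JSON. The following is a minimal sketch of that interpretation step, assuming the endpoints answer HTTP 200 when the check passes and 503 when it fails. The function name `interpret_health_check` and the sample bodies are illustrative, not part of RabbitMQ.

```python
import json
from typing import Tuple


def interpret_health_check(status_code: int, body: str) -> Tuple[bool, str]:
    """Return (healthy, detail) for a response from /api/health/checks/*."""
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    if status_code == 200:
        return True, str(payload.get("status", "ok"))
    if status_code == 503:
        # Failed checks are assumed to carry a human-readable "reason" field.
        return False, str(payload.get("reason", "health check failed"))
    # Anything else (401, 404, ...) points at a credentials or URL problem.
    return False, f"unexpected HTTP status {status_code}"


# Sample bodies are illustrative, not captured from a real broker.
print(interpret_health_check(200, '{"status": "ok"}'))
print(interpret_health_check(503, '{"status": "failed", "reason": "resource alarm in effect"}'))
```

Treating every status other than 200 as unhealthy keeps a misconfigured check from silently reporting success.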
Parsing key metrics
import requests

RABBITMQ_API = "http://localhost:15672/api"
AUTH = ("admin", "admin123")


def fetch(path):
    """GET a management API path and return the decoded JSON body."""
    response = requests.get(f"{RABBITMQ_API}{path}", auth=AUTH, timeout=10)
    response.raise_for_status()
    return response.json()


def get_metrics():
    # Node information
    for node in fetch("/nodes"):
        print(f"Node: {node['name']}")
        if not node.get("running", False):
            # Stopped nodes are listed without resource figures
            print("  Not running")
            continue
        print(f"  Memory used: {node['mem_used'] / 1024**3:.2f} GiB")
        print(f"  Disk free: {node['disk_free'] / 1024**3:.2f} GiB")
        print(f"  Erlang processes: {node['proc_used']}/{node['proc_total']}")
        print(f"  File descriptors: {node['fd_used']}/{node['fd_total']}")

    # Queue information; counters may be absent right after a queue is declared
    for q in fetch("/queues"):
        if q.get("messages", 0) > 0:
            print(f"Queue: {q['name']}, messages: {q['messages']}, "
                  f"ready: {q.get('messages_ready', 0)}, "
                  f"unacknowledged: {q.get('messages_unacknowledged', 0)}")

    # Connection information
    print(f"Current connections: {len(fetch('/connections'))}")


if __name__ == "__main__":
    get_metrics()
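Raw byte and descriptor counts are hard to judge on their own; a used/limit ratio is easier to compare against a threshold. The sketch below derives such ratios from one entry of the `/api/nodes` response. It assumes the `mem_limit` field alongside the fields printed above; the helper names, the 80% threshold, and the sample node are hypothetical.

```python
def node_utilization(node: dict) -> dict:
    """Compute used/limit ratios from one entry of GET /api/nodes."""
    pairs = {
        "memory": ("mem_used", "mem_limit"),
        "file_descriptors": ("fd_used", "fd_total"),
        "erlang_processes": ("proc_used", "proc_total"),
    }
    ratios = {}
    for label, (used_key, limit_key) in pairs.items():
        used, limit = node.get(used_key), node.get(limit_key)
        if used is None or not limit:
            continue  # field missing (for example, node down) or limit is zero
        ratios[label] = used / limit
    return ratios


def over_threshold(ratios: dict, threshold: float = 0.8) -> list:
    """Names of the resources whose utilization is at or above the threshold."""
    return sorted(name for name, ratio in ratios.items() if ratio >= threshold)


# Hypothetical node record with round numbers for readability.
sample_node = {
    "name": "rabbit@node1",
    "mem_used": 900, "mem_limit": 1000,
    "fd_used": 100, "fd_total": 1000,
    "proc_used": 500, "proc_total": 1000,
}
print(over_threshold(node_utilization(sample_node)))
```

Skipping missing fields instead of raising lets the same function handle stopped nodes, which appear in the node list without resource figures.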
12.3 Monitoring with Prometheus + Grafana
Enabling the Prometheus plugin
rabbitmq-plugins enable rabbitmq_prometheus
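Prometheus configuration
# prometheus.yml
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
    # Per-object metrics keep the rabbitmq_queue_* names used in this chapter.
    # The `family` query parameter belongs to /metrics/detailed, not to this
    # endpoint, and /metrics/detailed prefixes metric names with rabbitmq_detailed_.
    metrics_path: /metrics/per-object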
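What Prometheus scrapes from the plugin is plain text in the Prometheus exposition format: one sample per line, with optional labels in braces. To check by hand what a scrape would see, a small parser is enough. The sketch below handles the common cases only (it leaves escaped characters in label values as they are), and the sample text is made up for illustration rather than captured from a broker.

```python
import re

# metric_name{label="value",...} value [timestamp]
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_metrics(text: str) -> list:
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and # HELP / # TYPE comments
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        labels = dict(LABEL.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


# Illustrative scrape output; queue names are hypothetical.
SAMPLE_TEXT = """\
# TYPE rabbitmq_queue_messages gauge
rabbitmq_queue_messages{vhost="/",queue="orders"} 42
rabbitmq_queue_messages{vhost="/",queue="emails"} 0
rabbitmq_connections 3
"""

for name, labels, value in parse_metrics(SAMPLE_TEXT):
    print(name, labels, value)
```

For production use, the official `prometheus_client` package ships a complete parser; the regex version here only serves to show the line structure.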
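Key Prometheus metrics
| Metric | Description |
|---|---|
| rabbitmq_queue_messages | Total messages in the queue |
| rabbitmq_queue_messages_ready | Messages ready for delivery |
| rabbitmq_queue_messages_unacked | Messages delivered but not yet acknowledged |
| rabbitmq_queue_messages_published_total | Total messages published |
| rabbitmq_queue_messages_delivered_total | Total messages delivered |
| rabbitmq_connections | Current connection count |
| rabbitmq_channels | Current channel count |
| rabbitmq_process_resident_memory_bytes | Resident memory of the broker process |
| rabbitmq_disk_space_available_bytes | Available disk space |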
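Once these metrics are in Prometheus, they can be read back through its HTTP API with an instant query against `/api/v1/query`. The sketch below builds the query URL and unpacks an instant-vector response. The Prometheus address `http://prometheus:9090`, the helper names, and the sample response are assumptions made for illustration.

```python
from urllib.parse import urlencode


def build_query_url(base_url: str, promql: str) -> str:
    """URL for an instant query against the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def extract_vector(response: dict) -> list:
    """Turn an instant-vector response into (labels, value) pairs."""
    if response.get("status") != "success":
        raise ValueError(f"query failed: {response.get('error', 'unknown error')}")
    data = response["data"]
    if data["resultType"] != "vector":
        raise ValueError(f"expected a vector, got {data['resultType']}")
    # Each sample is [unix_timestamp, "value as a string"].
    return [(item["metric"], float(item["value"][1])) for item in data["result"]]


# Hand-written response in the documented shape, not real broker data.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"queue": "orders"}, "value": [1700000000, "42"]},
        ],
    },
}

print(build_query_url("http://prometheus:9090", "sum(rabbitmq_queue_messages)"))
print(extract_vector(sample_response))
```

Sample values arrive as strings so that special values such as `NaN` and `+Inf` survive JSON encoding, which is why the sketch converts them with `float()`.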
Grafana Dashboard
{
  "dashboard": {
    "title": "RabbitMQ Dashboard",
    "panels": [
      {
        "title": "Queue Depth",
        "type": "stat",
        "targets": [{
          "expr": "sum(rabbitmq_queue_messages)",
          "legendFormat": "Total Messages"
        }]
      },
      {
        "title": "Message Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(rabbitmq_channel_messages_published_total[5m]))",
            "legendFormat": "Publish Rate"
          },
          {
            "expr": "sum(rate(rabbitmq_channel_messages_delivered_total[5m]))",
            "legendFormat": "Deliver Rate"
          }
        ]
      },
      {
        "title": "Connections",
        "type": "stat",
        "targets": [{
          "expr": "rabbitmq_connections"
        }]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [{
          "expr": "rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100",
          "legendFormat": "Memory %"
        }]
      }
    ]
  }
}
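The Message Rate panel relies on `rate(...[5m])`, which turns an ever-growing counter into a per-second rate. The sketch below shows the core idea on a handful of samples, including the case where the counter drops back toward zero after a broker restart. It is a simplification: Prometheus additionally extrapolates to the edges of the window, which this version omits, and the function name `simple_rate` is hypothetical.

```python
def simple_rate(samples: list):
    """Per-second increase of a counter.

    samples: list of (timestamp_seconds, counter_value), sorted by time.
    Returns None when there are too few samples to compute a rate.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, previous), (_, current) in zip(samples, samples[1:]):
        if current >= previous:
            increase += current - previous
        else:
            # A drop means the counter was reset; count up from zero again.
            increase += current
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else None


# Counter grows by 60 per minute: one message per second.
print(simple_rate([(0, 100), (60, 160), (120, 220)]))
# Counter resets between the first and second sample.
print(simple_rate([(0, 100), (60, 10), (120, 70)]))
```

Because the rate is derived from differences between samples, the window in the query needs to span at least two scrapes; with a 15s scrape interval, a 5m window covers about twenty.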
12.4 Alerting rules
Prometheus alerting rules
# rabbitmq_alerts.yml
groups:
  - name: rabbitmq
    rules:
      # Queue backlog
      - alert: RabbitMQQueueDepthHigh
        expr: rabbitmq_queue_messages > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} is backing up: {{ $value }} messages"
      # Too many unacknowledged messages
      - alert: RabbitMQUnackedMessagesHigh
        expr: rabbitmq_queue_messages_unacked > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} has too many unacknowledged messages: {{ $value }}"
      # Memory
      - alert: RabbitMQMemoryHigh
        expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes > 0.8
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ memory usage is above 80%"
      # Disk (2147483648 bytes = 2 GiB)
      - alert: RabbitMQDiskSpaceLow
        expr: rabbitmq_disk_space_available_bytes < 2147483648
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ free disk space is below 2 GiB"
      # Connection count
      - alert: RabbitMQConnectionsHigh
        expr: rabbitmq_connections > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ connection count is too high: {{ $value }}"
      # Node unreachable (the scrape target is down)
      - alert: RabbitMQNodeDown
        expr: up{job="rabbitmq"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ node is down"
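The `for:` clause in each rule is what keeps short spikes from paging anyone: the expression has to stay true on every evaluation for the whole duration before the alert moves from pending to firing. The sketch below replays that behaviour over a list of samples. It is a simplified model of the rule evaluator, and the function name and sample data are hypothetical.

```python
def firing_times(samples: list, threshold: float, for_seconds: int) -> list:
    """Timestamps at which an `expr > threshold` rule with `for:` would fire.

    samples: list of (timestamp_seconds, value), one per rule evaluation,
    sorted by time.
    """
    firing = []
    pending_since = None
    for timestamp, value in samples:
        if value > threshold:
            if pending_since is None:
                pending_since = timestamp  # condition just became true
            if timestamp - pending_since >= for_seconds:
                firing.append(timestamp)
        else:
            pending_since = None  # any false evaluation resets the timer
    return firing


# Queue depth evaluated once a minute; the dip at t=120 resets the timer.
depth = [
    (0, 12000), (60, 13000), (120, 9000), (180, 11000), (240, 11000),
    (300, 11000), (360, 11000), (420, 11000), (480, 11000),
]
print(firing_times(depth, threshold=10000, for_seconds=300))
```

In this run the queue is above the threshold for two minutes, dips once, and only fires after a further uninterrupted five minutes, which is the trade-off `for:` makes between noise and detection delay.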
12.5 Script-based monitoring
Shell monitoring script
#!/bin/bash
# rabbitmq_monitor.sh
API="http://localhost:15672/api"
AUTH="admin:admin123"
echo "=== RabbitMQ monitoring report ==="
echo "Time: $(date)"
echo
# Cluster status
echo "--- Cluster nodes ---"
curl -s -u "$AUTH" "$API/nodes" | jq -r '.[] | "\(.name): \(.running)"'
# Queue backlog
echo -e "\n--- Queue backlog (>100) ---"
curl -s -u "$AUTH" "$API/queues" | jq -r '.[] | select(.messages > 100) | "\(.name): \(.messages) (\(.messages_ready) ready, \(.messages_unacknowledged) unacked)"'
# Connection count
echo -e "\n--- Connections ---"
conn_count=$(curl -s -u "$AUTH" "$API/connections" | jq 'length')
echo "Current connections: $conn_count"
# Memory usage in MiB (rounding to whole GiB would show small nodes as 0);
# stopped nodes report no memory figures, so they are skipped
echo -e "\n--- Memory usage ---"
curl -s -u "$AUTH" "$API/nodes" | jq -r '.[] | select(.running) | "\(.name): \(.mem_used / 1048576 | round) MiB / \(.mem_limit / 1048576 | round) MiB"'
# Alarms
echo -e "\n--- System alarms ---"
alarms=$(curl -s -u "$AUTH" "$API/health/checks/alarms")
echo "Alarm status: $alarms"
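The queue backlog filter in the script can also be written in Python, which is convenient when the report feeds into further processing. This is a minimal sketch that works on the decoded `/api/queues` response; the function name, the default threshold of 100, and the sample queues are illustrative.

```python
def backlogged_queues(queues: list, threshold: int = 100) -> list:
    """Queues holding more than `threshold` messages, deepest first.

    Each row is (name, total, ready, unacknowledged).
    """
    rows = []
    for queue in queues:
        total = queue.get("messages", 0)  # may be absent on brand-new queues
        if total > threshold:
            rows.append((
                queue["name"],
                total,
                queue.get("messages_ready", 0),
                queue.get("messages_unacknowledged", 0),
            ))
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical queue records in the shape returned by /api/queues.
sample_queues = [
    {"name": "emails", "messages": 150, "messages_ready": 100, "messages_unacknowledged": 50},
    {"name": "orders", "messages": 5000, "messages_ready": 4900, "messages_unacknowledged": 100},
    {"name": "audit", "messages": 3, "messages_ready": 3, "messages_unacknowledged": 0},
    {"name": "fresh"},
]

for name, total, ready, unacked in backlogged_queues(sample_queues):
    print(f"{name}: {total} ({ready} ready, {unacked} unacked)")
```

Sorting by depth puts the queue most in need of attention at the top of the report.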
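12.6 Caveats
⚠️ Management API overhead
Polling the management API frequently adds load to the broker. Prefer the Prometheus plugin over polling the API.
⚠️ Metric cardinality
Every queue, connection, and channel produces its own series, so the number of series can explode on a large cluster. Scrape per-object metrics only where you need them; the /metrics/detailed endpoint accepts family and vhost filters for this purpose.
⚠️ Tuning alert thresholds
Thresholds need to be adjusted to your actual traffic to avoid alert storms.
🔥 Best practice: Prometheus + Grafana + Alertmanager together form a complete monitoring and alerting stack.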
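To keep cardinality in check, the Prometheus plugin's `/metrics/detailed` endpoint returns only the metric families and virtual hosts named in the query string, and both parameters can be repeated. The sketch below builds such a URL. The host `rabbitmq:15692` and the helper name are assumptions, and metrics served by this endpoint carry a `rabbitmq_detailed_` prefix, so queries and alert expressions would need to use those names.

```python
from urllib.parse import urlencode


def detailed_metrics_url(base_url: str, families: list, vhosts: tuple = ()) -> str:
    """Build a /metrics/detailed URL restricted to given families and vhosts."""
    params = [("family", family) for family in families]
    params += [("vhost", vhost) for vhost in vhosts]
    return f"{base_url.rstrip('/')}/metrics/detailed?{urlencode(params)}"


# Only queue depth and consumer counts, only for the default vhost.
print(detailed_metrics_url(
    "http://rabbitmq:15692",
    ["queue_coarse_metrics", "queue_consumer_count"],
    vhosts=("/",),
))
```

In a Prometheus scrape configuration the same filters would go under `params:` with `metrics_path: /metrics/detailed`.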
12.7 Further reading
Next chapter: Chapter 13: Messaging Patterns, which covers common message design patterns.