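Chapter 12: Monitoring and Alerting

Monitoring is the foundation of running RabbitMQ reliably. This chapter builds a complete monitoring stack, from metric collection through visualization to alerting.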

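12.1 Metric Categories

Category	Key metrics	Description
Node	Memory usage, disk space, Erlang process count	Node health
Connection	Connection count, channel count, consumer count	Client state
Queue	Queue depth, ready messages, unacknowledged messages	Message backlog
Message	Publish rate, delivery rate, acknowledgement rate	Message throughput
Cluster	Node status, network partitions, quorum queue status	Cluster health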

12.2 Monitoring with the Management API

Key API endpoints

# System overview
curl -u admin:admin123 http://localhost:15672/api/overview

# Node information
curl -u admin:admin123 http://localhost:15672/api/nodes

# Queue list
curl -u admin:admin123 http://localhost:15672/api/queues

# Connection list
curl -u admin:admin123 http://localhost:15672/api/connections

# Channel list
curl -u admin:admin123 http://localhost:15672/api/channels

# Exchange list
curl -u admin:admin123 http://localhost:15672/api/exchanges

# Health checks (the protocol listener check takes the protocol name as a path segment)
curl -u admin:admin123 http://localhost:15672/api/health/checks/alarms
curl -u admin:admin123 http://localhost:15672/api/health/checks/local-alarms
curl -u admin:admin123 http://localhost:15672/api/health/checks/protocol-listener/amqp
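The health check endpoints report their result mainly through the HTTP status code. The sketch below assumes the commonly documented behaviour: 200 with a body of {"status": "ok"} when the check passes, and 503 with a "reason" field when it fails. The helper names interpret_health and check_alarms are illustrative and not part of any RabbitMQ client library.

```python
import base64
import json
import urllib.error
import urllib.request


def interpret_health(status_code, body):
    """Turn a health check response into a (healthy, detail) pair."""
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}
    if status_code == 200 and payload.get("status") == "ok":
        return True, "ok"
    # Fall back to the status code when the body carries no reason
    reason = payload.get("reason") or f"HTTP {status_code}"
    return False, str(reason)


def check_alarms(base_url, user, password, timeout=5.0):
    """Call /health/checks/alarms and interpret the answer (assumed response shape)."""
    request = urllib.request.Request(f"{base_url}/health/checks/alarms")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    request.add_header("Authorization", f"Basic {token}")
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return interpret_health(response.status, response.read().decode())
    except urllib.error.HTTPError as err:
        # urllib raises for 4xx/5xx; a 503 here is a failed check, not a crash
        return interpret_health(err.code, err.read().decode())
```

A caller could run check_alarms("http://localhost:15672/api", "admin", "admin123") from a cron job and exit non-zero when the first element of the result is False.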

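Parsing key metrics

import requests

RABBITMQ_API = "http://localhost:15672/api"
AUTH = ("admin", "admin123")
TIMEOUT = 10  # seconds, so a stalled management API cannot hang the script


def fetch(path):
    """GET a management API resource and return the decoded JSON body."""
    response = requests.get(f"{RABBITMQ_API}/{path}", auth=AUTH, timeout=TIMEOUT)
    response.raise_for_status()  # surface authentication and server errors early
    return response.json()


def get_metrics():
    # Node information
    for node in fetch("nodes"):
        print(f"Node: {node['name']}")
        if not node.get("running"):
            # Stopped nodes are listed without resource statistics
            print("  not running")
            continue
        print(f"  Memory used: {node['mem_used'] / 1024**3:.2f} GB")
        print(f"  Disk free: {node['disk_free'] / 1024**3:.2f} GB")
        print(f"  Erlang processes: {node['proc_used']}/{node['proc_total']}")
        print(f"  File descriptors: {node['fd_used']}/{node['fd_total']}")

    # Queue information; counters can be absent until statistics are emitted
    for q in fetch("queues"):
        if q.get("messages", 0) > 0:
            print(f"Queue: {q['name']}, messages: {q['messages']}, "
                  f"ready: {q.get('messages_ready', 0)}, "
                  f"unacknowledged: {q.get('messages_unacknowledged', 0)}")

    # Connection information
    connections = fetch("connections")
    print(f"Current connections: {len(connections)}")


get_metrics()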
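Raw byte and counter values are easier to act on as utilisation ratios. The following is a minimal sketch that works on one entry of the /api/nodes response; it assumes the fields mem_used, mem_limit, fd_used, fd_total, proc_used, and proc_total are present, and skips any pair that is missing (some versions or stopped nodes may not report all of them). The function names and the 0.8 default are hypothetical.

```python
def node_usage(node):
    """Return used/total ratios for the resources a node reports."""
    pairs = {
        "memory": ("mem_used", "mem_limit"),
        "file_descriptors": ("fd_used", "fd_total"),
        "erlang_processes": ("proc_used", "proc_total"),
    }
    usage = {}
    for label, (used_key, total_key) in pairs.items():
        used, total = node.get(used_key), node.get(total_key)
        # Skip missing fields and zero limits instead of raising
        if isinstance(used, (int, float)) and isinstance(total, (int, float)) and total > 0:
            usage[label] = used / total
    return usage


def over_threshold(usage, threshold=0.8):
    """Names of resources whose utilisation is at or above the threshold."""
    return sorted(name for name, ratio in usage.items() if ratio >= threshold)
```

Feeding each node dictionary through node_usage and then over_threshold gives a short list of resources to look at first.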

12.3 Monitoring with Prometheus and Grafana

Enabling the Prometheus plugin

rabbitmq-plugins enable rabbitmq_prometheus

Prometheus configuration

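# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
    # Per-object series, e.g. rabbitmq_queue_messages{queue="...",vhost="..."}.
    # This endpoint takes no filter parameters. Filtering by `family` and `vhost`
    # is offered by /metrics/detailed, whose metrics use the rabbitmq_detailed_ prefix.
    metrics_path: /metrics/per-object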

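Key Prometheus metrics

Metric	Description
`rabbitmq_queue_messages`	Total messages in the queue (ready plus unacknowledged)
`rabbitmq_queue_messages_ready`	Messages ready for delivery
`rabbitmq_queue_messages_unacked`	Messages delivered but not yet acknowledged
`rabbitmq_queue_messages_published_total`	Total messages published
`rabbitmq_queue_messages_delivered_total`	Total messages delivered
`rabbitmq_connections`	Current connection count
`rabbitmq_channels`	Current channel count
`rabbitmq_process_resident_memory_bytes`	Resident memory of the node process
`rabbitmq_disk_space_available_bytes`	Available disk space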
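To check what the plugin actually exposes before wiring up dashboards, it can help to read the scrape output directly. The sketch below is a deliberately small parser for the Prometheus text exposition format; it ignores HELP/TYPE lines and timestamps and is not a replacement for an official client library. The SAMPLE text is hand-written for illustration, not captured from a real node.

```python
import re

# name, optional {labels}, then the sample value
_SAMPLE_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)')
_LABEL = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')

SAMPLE = """\
# HELP rabbitmq_queue_messages Sum of ready and unacknowledged messages
# TYPE rabbitmq_queue_messages gauge
rabbitmq_queue_messages{vhost="/",queue="orders"} 42
rabbitmq_queue_messages{vhost="/",queue="emails"} 8
rabbitmq_connections 3
"""


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples from exposition text."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = _SAMPLE_LINE.match(line)
        if not match:
            continue
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        try:
            samples.append((name, labels, float(value)))
        except ValueError:
            continue  # ignore values that are not numeric
    return samples


def total(samples, metric_name):
    """Sum every series of one metric, like sum(metric_name) in PromQL."""
    return sum(value for name, _, value in samples if name == metric_name)
```

In practice the text would come from http://rabbitmq:15692/metrics/per-object (the target used in the configuration above) rather than from SAMPLE.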

Grafana Dashboard

{
  "dashboard": {
    "title": "RabbitMQ Dashboard",
    "panels": [
      {
        "title": "Queue Depth",
        "type": "stat",
        "targets": [{
          "expr": "sum(rabbitmq_queue_messages)",
          "legendFormat": "Total Messages"
        }]
      },
      {
        "title": "Message Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(rabbitmq_channel_messages_published_total[5m]))",
            "legendFormat": "Publish Rate"
          },
          {
            "expr": "sum(rate(rabbitmq_channel_messages_delivered_total[5m]))",
            "legendFormat": "Deliver Rate"
          }
        ]
      },
      {
        "title": "Connections",
        "type": "stat",
        "targets": [{
          "expr": "rabbitmq_connections"
        }]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [{
          "expr": "rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100",
          "legendFormat": "Memory %"
        }]
      }
    ]
  }
}

12.4 Alerting Rules

Prometheus alerting rules

# rabbitmq_alerts.yml
groups:
  - name: rabbitmq
    rules:
      # Queue backlog
      - alert: RabbitMQQueueDepthHigh
        expr: rabbitmq_queue_messages > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} backlog: {{ $value }} messages"

      # Too many unacknowledged messages
      - alert: RabbitMQUnackedMessagesHigh
        expr: rabbitmq_queue_messages_unacked > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} has too many unacknowledged messages: {{ $value }}"

      # Memory usage
      - alert: RabbitMQMemoryHigh
        expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes > 0.8
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ memory usage above 80% of the configured limit"

      # Disk space (2147483648 bytes = 2 GiB)
      - alert: RabbitMQDiskSpaceLow
        expr: rabbitmq_disk_space_available_bytes < 2147483648
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ free disk space below 2 GiB"

      # Connection count
      - alert: RabbitMQConnectionsHigh
        expr: rabbitmq_connections > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ connection count too high: {{ $value }}"

      # Node unreachable (scrape target down)
      - alert: RabbitMQNodeDown
        expr: up{job="rabbitmq"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ node is down"
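Each rule above pairs a threshold with a `for` duration: the expression has to stay true for that long before the alert fires, and a single evaluation below the threshold resets the timer. The sketch below is a simplified model of that behaviour for one series, meant only to build intuition; it is not how Prometheus is implemented, and alert_states is a hypothetical name.

```python
def alert_states(samples, threshold, hold_seconds):
    """Classify each (timestamp_seconds, value) sample as inactive, pending, or firing.

    samples must be ordered by timestamp. The alert is pending from the first
    sample above the threshold and fires once it has stayed above it for
    hold_seconds without interruption.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held = timestamp - active_since
            states.append("firing" if held >= hold_seconds else "pending")
        else:
            active_since = None  # any dip below the threshold resets the timer
            states.append("inactive")
    return states


# One sample per minute for a queue depth, checked against "> 10000 for 5m"
depths = [(0, 5000), (60, 12000), (120, 13000), (180, 9000), (240, 11000),
          (300, 11000), (360, 11000), (420, 11000), (480, 11000), (540, 11000)]
states = alert_states(depths, threshold=10000, hold_seconds=300)
```

In this run the short spike at 60 to 120 seconds never fires, while the sustained backlog starting at 240 seconds fires at 540 seconds, which is the reason for choosing a `for` value longer than typical traffic bursts.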

12.5 Script-Based Monitoring

Shell monitoring script

#!/bin/bash
# rabbitmq_monitor.sh

API="http://localhost:15672/api"
AUTH="admin:admin123"

echo "=== RabbitMQ Monitoring Report ==="
echo "Time: $(date)"
echo

# Cluster status
echo "--- Cluster nodes ---"
curl -s -u "$AUTH" "$API/nodes" | jq -r '.[] | "\(.name): \(.running)"'

# Queue backlog
echo -e "\n--- Queue backlog (>100) ---"
curl -s -u "$AUTH" "$API/queues" | jq -r '.[] | select(.messages > 100) | "\(.name): \(.messages) (\(.messages_ready) ready, \(.messages_unacknowledged) unacked)"'

# Connection count
echo -e "\n--- Connections ---"
conn_count=$(curl -s -u "$AUTH" "$API/connections" | jq 'length')
echo "Current connections: $conn_count"

# Memory usage (two decimals; stopped nodes report no memory figures and are skipped)
echo -e "\n--- Memory usage ---"
curl -s -u "$AUTH" "$API/nodes" | jq -r '.[] | select(.running) | "\(.name): \((.mem_used / 1073741824 * 100 | round) / 100) GiB / \((.mem_limit / 1073741824 * 100 | round) / 100) GiB"'

# Alarms
echo -e "\n--- System alarms ---"
alarms=$(curl -s -u "$AUTH" "$API/health/checks/alarms")
echo "Alarm status: $alarms"

12.6 Caveats

⚠️ Performance impact of the management API

Polling the management API frequently adds overhead on the node. Prefer the Prometheus plugin to API polling.
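Where a script still has to read the management API, one way to limit the overhead is to reuse the last response for a minimum interval instead of issuing a request on every call. The following is a minimal sketch of that idea; CachedFetcher and its parameters are hypothetical, and the 30-second default is an arbitrary example value.

```python
import time


class CachedFetcher:
    """Wrap a fetch function so it runs at most once per min_interval seconds."""

    def __init__(self, fetch, min_interval=30.0, clock=time.monotonic):
        self._fetch = fetch
        self._min_interval = min_interval
        self._clock = clock  # injectable so the behaviour can be tested
        self._stamp = None
        self._value = None

    def get(self):
        now = self._clock()
        if self._stamp is None or now - self._stamp >= self._min_interval:
            # Cache is empty or stale: call the real fetch function
            self._value = self._fetch()
            self._stamp = now
        return self._value
```

Several report sections can then share one CachedFetcher for /api/queues rather than each triggering its own request.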

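⚠️ Metric cardinality

Every queue, connection, and channel produces its own series, so the number of series can explode on large clusters. The default /metrics endpoint returns aggregated values. When per-object data is needed at scale, scrape /metrics/detailed and restrict it with the family and vhost query parameters rather than pulling everything from /metrics/per-object.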
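A rough estimate makes the cardinality concern concrete: the series count grows with the number of objects multiplied by the metrics exported for each object type. The sketch below does that arithmetic; the object counts and metrics-per-object figures are made-up illustration values, not numbers taken from RabbitMQ.

```python
def estimate_series(object_counts, metrics_per_object):
    """Estimate the number of time series as sum(count * metrics) per object type."""
    return sum(
        count * metrics_per_object.get(kind, 0)
        for kind, count in object_counts.items()
    )


# Hypothetical cluster: 2000 queues, 500 connections, 1000 channels
counts = {"queue": 2000, "connection": 500, "channel": 1000}
# Hypothetical number of metrics exported per object of each type
per_object = {"queue": 10, "connection": 5, "channel": 15}
series = estimate_series(counts, per_object)
```

With these assumed figures a single scrape carries 37,500 series, and the total scales linearly with every additional queue or channel.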

⚠️ Tuning alert thresholds

Thresholds should be tuned to actual traffic volume to avoid alert storms.
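One possible way to pick a starting threshold is to look at historical values of the metric and place the threshold somewhat above a high percentile. The sketch below uses the nearest-rank percentile; the function names, the 99th percentile, and the headroom factor are assumptions for illustration and should be adjusted to the workload.

```python
import math


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def suggest_threshold(history, pct=99, headroom=1.2):
    """Suggest an alert threshold as a high percentile of past values plus headroom."""
    return percentile(history, pct) * headroom
```

For example, history could be a week of queue depth samples exported from Prometheus, with the result used as the starting value for the RabbitMQQueueDepthHigh rule and revised after observing how often it fires.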

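🔥 Best practice: Prometheus, Grafana, and Alertmanager together form a complete monitoring and alerting stack.

12.7 Further Reading

Next chapter: Chapter 13: Messaging Patterns, covering common message design patterns.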