Chapter 12: Monitoring and Alerting

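Monitoring is the foundation of a stable RabbitMQ deployment. This chapter builds a complete monitoring stack, from metric collection through visualization to alerting.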


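12.1 Metric Categories

Category    Key metrics                                            What it shows
Node        Memory usage, disk space, Erlang process count         Node health
Connection  Connections, channels, consumers                       Client state
Queue       Queue depth, ready messages, unacknowledged messages   Message backlog
Message     Publish rate, delivery rate, acknowledgement rate      Message flow efficiency
Cluster     Node status, network partitions, quorum queue status   Cluster health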

12.2 Monitoring with the Management API

Key API endpoints

# System overview
curl -u admin:admin123 http://localhost:15672/api/overview

# Node information
curl -u admin:admin123 http://localhost:15672/api/nodes

# Queue list
curl -u admin:admin123 http://localhost:15672/api/queues

# Connection list
curl -u admin:admin123 http://localhost:15672/api/connections

# Channel list
curl -u admin:admin123 http://localhost:15672/api/channels

# Exchange list
curl -u admin:admin123 http://localhost:15672/api/exchanges

# Health checks (the protocol-listener check takes the protocol name as a path segment)
curl -u admin:admin123 http://localhost:15672/api/health/checks/alarms
curl -u admin:admin123 http://localhost:15672/api/health/checks/local-alarms
curl -u admin:admin123 http://localhost:15672/api/health/checks/protocol-listener/amqp
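
The health check endpoints report their result through the HTTP status code: 200 when the check passes and 503 when it fails, with a small JSON body carrying a status field. The following is a minimal sketch of how a probe might interpret such a response. The function name interpret_health_check is hypothetical, and the code works on an already-received status code and body, so it makes no network calls.

```python
import json
from typing import Tuple

def interpret_health_check(status_code: int, body: str) -> Tuple[bool, str]:
    """Return (healthy, detail) for a management API health check response."""
    try:
        payload = json.loads(body) if body else {}
    except json.JSONDecodeError:
        payload = {}
    if not isinstance(payload, dict):
        payload = {}

    if status_code == 200:
        # Passing checks answer 200 with {"status": "ok"}.
        return True, str(payload.get("status", "ok"))
    if status_code == 503:
        # Failing checks answer 503 and describe the cause in "reason".
        return False, str(payload.get("reason", "check failed"))
    # Anything else (401, 404, ...) means the probe itself is misconfigured.
    return False, f"unexpected HTTP status {status_code}"

print(interpret_health_check(200, '{"status": "ok"}'))
print(interpret_health_check(503, '{"status": "failed", "reason": "memory alarm"}'))
```

Treating any status other than 200 as unhealthy keeps the probe conservative: an authentication failure should not be mistaken for a healthy broker.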

Parsing key metrics

import requests

RABBITMQ_API = "http://localhost:15672/api"
AUTH = ("admin", "admin123")
GIB = 1024**3

def fetch(path):
    """GET a management API path and return the decoded JSON body."""
    response = requests.get(f"{RABBITMQ_API}/{path}", auth=AUTH, timeout=10)
    response.raise_for_status()  # fail loudly on 401/404/5xx instead of parsing an error body
    return response.json()

def get_metrics():
    # Node information
    for node in fetch("nodes"):
        print(f"Node: {node['name']}")
        print(f"  Memory used: {node['mem_used'] / GIB:.2f} GiB")
        print(f"  Disk free: {node['disk_free'] / GIB:.2f} GiB")
        print(f"  Erlang processes: {node['proc_used']}/{node['proc_total']}")
        print(f"  File descriptors: {node['fd_used']}/{node['fd_total']}")

    # Queue information (a queue without stats may omit the counters, hence .get)
    for q in fetch("queues"):
        if q.get('messages', 0) > 0:
            print(f"Queue: {q['name']}, messages: {q['messages']}, "
                  f"ready: {q.get('messages_ready', 0)}, "
                  f"unacknowledged: {q.get('messages_unacknowledged', 0)}")

    # Connection information
    print(f"Current connections: {len(fetch('connections'))}")

if __name__ == "__main__":
    get_metrics()
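
Raw counters are easier to act on once they are turned into utilization ratios. The sketch below takes one entry from the /api/nodes response and computes how much of each limit is in use. The helper names node_utilization and over_threshold, the sample values, and the 0.8 cutoff are illustrative assumptions rather than part of the management API.

```python
def ratio(used, total):
    """Return used/total, or None when either value is missing or total is zero."""
    if used is None or not total:
        return None
    return used / total

def node_utilization(node: dict) -> dict:
    """Compute usage ratios (0.0 to 1.0) from one /api/nodes entry."""
    return {
        "memory": ratio(node.get("mem_used"), node.get("mem_limit")),
        "processes": ratio(node.get("proc_used"), node.get("proc_total")),
        "file_descriptors": ratio(node.get("fd_used"), node.get("fd_total")),
    }

def over_threshold(utilization: dict, threshold: float = 0.8) -> list:
    """List the resources whose ratio is at or above the threshold."""
    return sorted(name for name, value in utilization.items()
                  if value is not None and value >= threshold)

# Hypothetical node entry, using the same field names as the /api/nodes response.
sample_node = {
    "mem_used": 3 * 1024**3, "mem_limit": 4 * 1024**3,
    "proc_used": 500, "proc_total": 1048576,
    "fd_used": 900, "fd_total": 1024,
}
usage = node_utilization(sample_node)
print(usage["memory"])          # 0.75
print(over_threshold(usage))    # ['file_descriptors']
```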

12.3 Monitoring with Prometheus + Grafana

Enabling the Prometheus plugin

rabbitmq-plugins enable rabbitmq_prometheus

Prometheus configuration

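# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'rabbitmq'
    static_configs:
      - targets: ['rabbitmq:15692']
    # /metrics/per-object returns one series per queue, connection, and
    # channel under the standard rabbitmq_* names. The family query
    # parameter is only recognized by /metrics/detailed, whose series
    # carry the rabbitmq_detailed_ prefix, so it is not set here.
    metrics_path: /metrics/per-object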

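Key Prometheus metrics

Metric                                     Description
rabbitmq_queue_messages                    Total messages in the queue
rabbitmq_queue_messages_ready              Messages ready for delivery
rabbitmq_queue_messages_unacked            Messages delivered but not yet acknowledged
rabbitmq_queue_messages_published_total    Total messages published
rabbitmq_queue_messages_delivered_total    Total messages delivered
rabbitmq_connections                       Current connection count
rabbitmq_channels                          Current channel count
rabbitmq_process_resident_memory_bytes     Resident memory of the broker process
rabbitmq_disk_space_available_bytes        Available disk space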
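
The plugin serves these metrics in the Prometheus text exposition format, one sample per line. As a rough illustration of what a scrape contains, the sketch below parses a small made-up payload and sums the queue depth across queues. The sample text, queue names, and helper names are assumptions. The parser only handles simple name{labels} value lines without timestamps, so a real client library is the better choice for production use.

```python
import re

# Hypothetical scrape output; real payloads contain many more series.
SAMPLE = """\
# TYPE rabbitmq_queue_messages gauge
rabbitmq_queue_messages{vhost="/",queue="orders"} 120
rabbitmq_queue_messages{vhost="/",queue="emails"} 30
rabbitmq_connections 42
"""

LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="((?:[^"\\]|\\.)*)"')

def parse_metrics(text):
    """Return a list of (metric_name, labels_dict, value) tuples."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        match = LINE_RE.match(line)
        if match is None:
            continue  # skip anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples

def total(samples, metric):
    """Sum a metric across all of its label combinations."""
    return sum(value for name, _, value in samples if name == metric)

samples = parse_metrics(SAMPLE)
print(total(samples, "rabbitmq_queue_messages"))  # 150.0
```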

Grafana Dashboard

{
  "dashboard": {
    "title": "RabbitMQ Dashboard",
    "panels": [
      {
        "title": "Queue Depth",
        "type": "stat",
        "targets": [{
          "expr": "sum(rabbitmq_queue_messages)",
          "legendFormat": "Total Messages"
        }]
      },
      {
        "title": "Message Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "rate(rabbitmq_channel_messages_published_total[5m])",
            "legendFormat": "Publish Rate"
          },
          {
            "expr": "rate(rabbitmq_channel_messages_delivered_total[5m])",
            "legendFormat": "Deliver Rate"
          }
        ]
      },
      {
        "title": "Connections",
        "type": "stat",
        "targets": [{
          "expr": "rabbitmq_connections"
        }]
      },
      {
        "title": "Memory Usage",
        "type": "gauge",
        "targets": [{
          "expr": "rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes * 100",
          "legendFormat": "Memory %"
        }]
      }
    ]
  }
}

12.4 Alerting Rules

Prometheus alerting rules

# rabbitmq_alerts.yml
groups:
  - name: rabbitmq
    rules:
      # Queue backlog
      - alert: RabbitMQQueueDepthHigh
        expr: rabbitmq_queue_messages > 10000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} backlog: {{ $value }} messages"

      # Too many unacknowledged messages
      - alert: RabbitMQUnackedMessagesHigh
        expr: rabbitmq_queue_messages_unacked > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Queue {{ $labels.queue }} has too many unacknowledged messages: {{ $value }}"

      # Memory usage
      - alert: RabbitMQMemoryHigh
        expr: rabbitmq_process_resident_memory_bytes / rabbitmq_resident_memory_limit_bytes > 0.8
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ memory usage is above 80% of the limit"

      # Disk space
      - alert: RabbitMQDiskSpaceLow
        expr: rabbitmq_disk_space_available_bytes < 2147483648
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ free disk space is below 2 GiB"

      # Connection count
      - alert: RabbitMQConnectionsHigh
        expr: rabbitmq_connections > 5000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "RabbitMQ connection count is too high: {{ $value }}"

      # Node unreachable
      - alert: RabbitMQNodeDown
        expr: up{job="rabbitmq"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "RabbitMQ node is down"
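
The for: clause in each rule means the expression has to stay true for the whole duration before the alert fires, which filters out short spikes. The sketch below simulates that behavior for a single series evaluated at a fixed interval. It is a simplified model under stated assumptions (one series, evenly spaced evaluations, a hypothetical firing_points helper), not a reimplementation of the Prometheus rule engine.

```python
def firing_points(samples, threshold, for_seconds, interval_seconds):
    """Return, for each evaluation, whether the alert would be firing."""
    result = []
    pending_since = None  # index of the first evaluation in the current breach
    for i, value in enumerate(samples):
        if value > threshold:
            if pending_since is None:
                pending_since = i
            held = (i - pending_since) * interval_seconds
            result.append(held >= for_seconds)
        else:
            # One evaluation below the threshold resets the pending timer.
            pending_since = None
            result.append(False)
    return result

# Queue depth sampled once a minute: a 3-minute spike, a dip, then a sustained backlog.
depth = [12000, 12000, 12000, 9000, 11000, 11000, 11000, 11000, 11000, 11000]
states = firing_points(depth, threshold=10000, for_seconds=300, interval_seconds=60)
print(states)  # only the last evaluation is True
```

The initial spike never fires because it lasts three minutes, and the dip resets the timer, so the alert fires only once the second breach has held for the full five minutes.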

12.5 Script-Based Monitoring

Shell monitoring script

#!/bin/bash
# rabbitmq_monitor.sh

API="http://localhost:15672/api"
AUTH="admin:admin123"

echo "=== RabbitMQ Monitoring Report ==="
echo "Time: $(date)"
echo

# Cluster status
echo "--- Cluster nodes ---"
curl -s -u "$AUTH" "$API/nodes" | jq -r '.[] | "\(.name): \(.running)"'

# Queue backlog
echo -e "\n--- Queue backlog (>100) ---"
curl -s -u "$AUTH" "$API/queues" | jq -r '.[] | select(.messages > 100) | "\(.name): \(.messages) (\(.messages_ready) ready, \(.messages_unacknowledged) unacked)"'

# Connection count
echo -e "\n--- Connections ---"
conn_count=$(curl -s -u "$AUTH" "$API/connections" | jq 'length')
echo "Current connections: $conn_count"

# Memory usage (rounded to whole GiB)
echo -e "\n--- Memory usage ---"
curl -s -u "$AUTH" "$API/nodes" | jq -r '.[] | "\(.name): \(.mem_used / 1073741824 | round)GiB / \(.mem_limit / 1073741824 | round)GiB"'

# Alarms
echo -e "\n--- System alarms ---"
alarms=$(curl -s -u "$AUTH" "$API/health/checks/alarms")
echo "Alarm status: $alarms"

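12.6 Caveats

⚠️ Management API performance impact

Frequent queries against the management API add overhead on the broker. Prefer the Prometheus plugin over polling the API.

⚠️ Metric cardinality

Every queue, connection, and channel produces its own series, so the number of metrics can explode on large clusters. Scrape per-object metrics only where they are needed and filter them down to the metric families actually in use (the /metrics/detailed endpoint accepts family and vhost query parameters for this purpose).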
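
A back-of-the-envelope calculation shows how quickly the series count grows. In the sketch below, the per-object metric counts (10 per queue, 5 per connection, 15 per channel) are placeholder assumptions chosen for illustration; the real numbers depend on the RabbitMQ version and enabled features, so substitute values measured from an actual scrape.

```python
def estimate_series(queues, connections, channels,
                    per_queue=10, per_connection=5, per_channel=15):
    """Estimate the number of time series produced by per-object metrics.

    The per_* defaults are illustrative assumptions, not documented values.
    """
    return (queues * per_queue
            + connections * per_connection
            + channels * per_channel)

# Unfiltered: every object reports every metric.
unfiltered = estimate_series(queues=5000, connections=2000, channels=4000)

# Filtered: keep only three queue-level metrics and drop the rest.
filtered = estimate_series(queues=5000, connections=0, channels=0, per_queue=3)

print(unfiltered)  # 120000
print(filtered)    # 15000
```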

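⚠️ Tuning alert thresholds

Thresholds have to be tuned to the actual workload; otherwise normal traffic peaks will trigger alert storms.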
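
One possible way to ground a threshold in real traffic is to derive it from historical samples instead of picking a round number. The sketch below takes a high percentile of observed queue depths and adds headroom. The nearest-rank percentile method, the 95th percentile, the 1.5x headroom factor, and the sample history are all assumptions to adapt, not recommendations from the RabbitMQ or Prometheus documentation.

```python
import math

def nearest_rank_percentile(values, pct):
    """Return the pct-th percentile of values using the nearest-rank method."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def suggest_threshold(history, pct=95, headroom=1.5):
    """Suggest an alert threshold: a high percentile of history plus headroom."""
    return math.ceil(nearest_rank_percentile(history, pct) * headroom)

# Hypothetical queue depths observed during normal operation: 100, 200, ..., 2000.
history = list(range(100, 2100, 100))
print(suggest_threshold(history))            # 2850
print(suggest_threshold(history, pct=99))    # 3000
```

A threshold derived this way stays above normal peaks while still catching a backlog that is clearly out of the ordinary, and it can be recomputed as traffic grows.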

🔥 Best practice: Prometheus + Grafana + Alertmanager together form a complete monitoring and alerting stack.


12.7 Further Reading


Next chapter: Chapter 13: Messaging Patterns. Learn the common message design patterns.