Chapter 12: Monitoring and Observability

This chapter covers end-to-end monitoring of an rqlite cluster using the status API, Prometheus, and Grafana.


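12.1 Monitoring Stack Overview

┌──────────────────────────────────────────────────────┐
│                   Monitoring stack                   │
│                                                      │
│  ┌────────────┐   ┌──────────────┐   ┌────────────┐  │
│  │ rqlite     │   │ Prometheus   │   │ Grafana    │  │
│  │ status API │──►│ collection   │──►│ dashboards │  │
│  └────────────┘   └──────┬───────┘   └────────────┘  │
│                          │                           │
│                   ┌──────▼───────┐                   │
│                   │ Alertmanager │                   │
│                   │ alerting     │                   │
│                   └──────────────┘                   │
└──────────────────────────────────────────────────────┘

| Layer | Component | Responsibility |
|---|---|---|
| Data source | rqlite status API | Exposes node and cluster state |
| Collection | Prometheus | Scrapes metric data at regular intervals |
| Storage | Prometheus TSDB | Stores time-series data |
| Presentation | Grafana | Dashboard visualization |
| Alerting | Alertmanager | Threshold-based alert notifications |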

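12.2 rqlite Status API

12.2.1 Node Status Endpoint

# Fetch the full status
curl -s 'localhost:4001/status?pretty' | python3 -m json.tool

Key fields (abridged and illustrative; exact field names and nesting depend on the rqlite version, so compare against your own node's output):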

{
    "build": {
        "branch": "master",
        "commit": "abc1234",
        "version": "v8.36.5"
    },
    "store": {
        "raft_state": "Leader",
        "node_id": "node1",
        "db_conf": {
            "fk_constraints": true,
            "wal": true,
            "on_disk": false
        },
        "num_raft_peers": 2,
        "num_open_connections": 5,
        "applied_index": 12345,
        "commit_index": 12345,
        "last_log_index": 12345,
        "last_log_term": 3,
        "last_snapshot_index": 12000,
        "last_snapshot_term": 2,
        "last_contact": "0s",
        "term": 3,
        "num_snaps": 5,
        "db_size": 1048576
    },
    "runtime": {
        "GOARCH": "amd64",
        "GOOS": "linux",
        "GOMAXPROCS": 4,
        "num_cpu": 8,
        "num_goroutine": 42,
        "version": "go1.21.5"
    }
}
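
The fields above can be condensed into a few health signals. The following is a minimal sketch in Python, assuming the abridged layout shown above (a `store` object carrying `raft_state`, `commit_index`, `applied_index`, and `db_size`). `summarize_status` is a hypothetical helper name, not part of rqlite.

```python
def summarize_status(status: dict) -> dict:
    """Extract a few health signals from a decoded /status payload.

    Assumes the abridged layout shown above; adjust the key paths if
    your rqlite version nests these fields differently.
    """
    store = status.get("store", {})
    commit = int(store.get("commit_index", 0))
    applied = int(store.get("applied_index", 0))
    return {
        "node_id": store.get("node_id", "unknown"),
        "is_leader": store.get("raft_state") == "Leader",
        # Entries committed by Raft but not yet applied to the database.
        "apply_backlog": max(commit - applied, 0),
        "db_size_mib": round(int(store.get("db_size", 0)) / (1024 * 1024), 2),
    }
```

With the sample payload above, `apply_backlog` would be 0 because `commit_index` and `applied_index` are equal; a backlog that keeps growing suggests the node is struggling to apply committed entries.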

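12.2.2 Health Check Endpoints

# Readiness check (HTTP 200 = ready, any other code = not ready).
# By default /readyz also requires that the node knows of a cluster Leader.
curl -s -o /dev/null -w "%{http_code}" localhost:4001/readyz

# Same check without the Leader requirement (is the node itself up?)
curl -s -o /dev/null -w "%{http_code}" 'localhost:4001/readyz?noleader'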
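
Status-code probes like the curl commands above are easy to wrap for use in scripts. Here is a minimal sketch using only the Python standard library; `probe` and `describe` are hypothetical helper names, and the URL is whichever health endpoint you rely on.

```python
import urllib.error
import urllib.request


def probe(url: str, timeout: float = 3.0) -> int:
    """Return the HTTP status code for url, or 0 if the node is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # The server answered, but with an error code (e.g. 503 = not ready).
        return err.code
    except (urllib.error.URLError, OSError):
        # Connection refused, DNS failure, timeout, and similar.
        return 0


def describe(code: int) -> str:
    """Map a status code to the interpretation used in this section."""
    if code == 200:
        return "ready"
    if code == 0:
        return "unreachable"
    return "not ready"
```

A caller would typically combine the two, for example `describe(probe(url))`, and treat anything other than `"ready"` as a failed check.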

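12.2.3 Node List Endpoint

curl -s 'localhost:4001/nodes?pretty'

| Field | Description |
|---|---|
| id | Node ID |
| api_addr | HTTP API address |
| addr | Raft address |
| voter | Whether the node is a voting member |
| reachable | Whether the node is reachable |
| leader | Whether the node is the Leader |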
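
The per-node fields above are enough to answer whether the cluster can still make progress, since Raft needs a strict majority of voting nodes. Below is a minimal sketch, assuming the response has already been decoded into a list of node objects with the fields from the table. The outer shape of the raw payload may differ between rqlite versions, and `cluster_summary` is a hypothetical helper.

```python
def cluster_summary(nodes):
    """Summarize a list of node records using the fields from the table above."""
    voters = [n for n in nodes if n.get("voter", True)]
    reachable = [n for n in voters if n.get("reachable", False)]
    quorum = len(voters) // 2 + 1  # strict majority of voting nodes
    return {
        "leaders": [n.get("id") for n in nodes if n.get("leader", False)],
        "voters": len(voters),
        "reachable_voters": len(reachable),
        "quorum": quorum,
        "has_quorum": len(reachable) >= quorum,
    }
```

For a three-node cluster the quorum is 2, so losing one voter is tolerated while losing two is not.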

12.3 Status Collection Script

12.3.1 Collecting with a Shell Script

#!/bin/bash
# rqlite-exporter.sh: collects rqlite metrics and prints them in Prometheus text format
NODES=("localhost:4001" "localhost:4011" "localhost:4021")

echo "# HELP rqlite_raft_state Current Raft state (0=unknown, 1=follower, 2=candidate, 3=leader)"
echo "# TYPE rqlite_raft_state gauge"
echo "# HELP rqlite_applied_index Last applied Raft log index"
echo "# TYPE rqlite_applied_index gauge"
echo "# HELP rqlite_commit_index Last committed Raft log index"
echo "# TYPE rqlite_commit_index gauge"
echo "# HELP rqlite_num_peers Number of Raft peers"
echo "# TYPE rqlite_num_peers gauge"
echo "# HELP rqlite_db_size Database size in bytes"
echo "# TYPE rqlite_db_size gauge"
echo "# HELP rqlite_num_snapshots Number of snapshots"
echo "# TYPE rqlite_num_snapshots gauge"
echo "# HELP rqlite_num_open_connections Number of open connections"
echo "# TYPE rqlite_num_open_connections gauge"
echo "# HELP rqlite_up Whether the node is up (1=up, 0=down)"
echo "# TYPE rqlite_up gauge"

for node in "${NODES[@]}"; do
    # -f makes curl fail on HTTP errors, so an error page is never parsed as status JSON
    if ! status=$(curl -sf --connect-timeout 3 --max-time 5 "http://$node/status" 2>/dev/null) || [ -z "$status" ]; then
        echo "rqlite_up{node=\"$node\"} 0"
        continue
    fi

    echo "rqlite_up{node=\"$node\"} 1"

    # Parse the Raft state (JSON path follows the abridged layout in section 12.2.1)
    raft_state=$(echo "$status" | python3 -c "import json,sys; d=json.load(sys.stdin); print(d['store']['raft_state'])" 2>/dev/null)

    case "$raft_state" in
        "Leader")    state_val=3 ;;
        "Follower")  state_val=1 ;;
        "Candidate") state_val=2 ;;
        *)           state_val=0 ;;
    esac

    # Fall back to "unknown" so the state label is never empty
    echo "rqlite_raft_state{node=\"$node\",state=\"${raft_state:-unknown}\"} $state_val"

    # Emit the numeric gauges declared in the header above
    echo "$status" | python3 -c "
import json, sys
d = json.load(sys.stdin)
s = d.get('store', {})
node = '$node'
print(f'rqlite_applied_index{{node=\"{node}\"}} {s.get(\"applied_index\", 0)}')
print(f'rqlite_commit_index{{node=\"{node}\"}} {s.get(\"commit_index\", 0)}')
print(f'rqlite_num_peers{{node=\"{node}\"}} {s.get(\"num_raft_peers\", 0)}')
print(f'rqlite_db_size{{node=\"{node}\"}} {s.get(\"db_size\", 0)}')
print(f'rqlite_num_snapshots{{node=\"{node}\"}} {s.get(\"num_snaps\", 0)}')
print(f'rqlite_num_open_connections{{node=\"{node}\"}} {s.get(\"num_open_connections\", 0)}')
" 2>/dev/null
done
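
The shell script above prints metrics to stdout, whereas Prometheus pulls metrics over HTTP. One way to bridge that gap is a small HTTP exporter. What follows is a sketch under stated assumptions, not a reference implementation: the JSON layout is the abridged one from section 12.2.1, port 9101 is an arbitrary choice, and `fetch_status`, `render`, and `serve` are hypothetical names.

```python
import json
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

NODES = ["localhost:4001", "localhost:4011", "localhost:4021"]
STATE_VALUES = {"Follower": 1, "Candidate": 2, "Leader": 3}
# Metric name -> key inside the "store" object (layout assumed from 12.2.1).
GAUGES = {
    "rqlite_applied_index": "applied_index",
    "rqlite_commit_index": "commit_index",
    "rqlite_num_peers": "num_raft_peers",
    "rqlite_db_size": "db_size",
    "rqlite_num_snapshots": "num_snaps",
    "rqlite_num_open_connections": "num_open_connections",
}


def fetch_status(node):
    """Return the parsed /status JSON for node, or None if it cannot be read."""
    try:
        with urllib.request.urlopen(f"http://{node}/status", timeout=3) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        return None


def render(statuses):
    """Render {node: status-or-None} as Prometheus text, grouped per metric."""
    lines = ["# TYPE rqlite_up gauge"]
    for node, status in statuses.items():
        lines.append(f'rqlite_up{{node="{node}"}} {0 if status is None else 1}')
    up = {n: s.get("store", {}) for n, s in statuses.items() if s is not None}
    lines.append("# TYPE rqlite_raft_state gauge")
    for node, store in up.items():
        state = store.get("raft_state", "unknown")
        value = STATE_VALUES.get(state, 0)
        lines.append(f'rqlite_raft_state{{node="{node}",state="{state}"}} {value}')
    for metric, key in GAUGES.items():
        lines.append(f"# TYPE {metric} gauge")
        for node, store in up.items():
            lines.append(f'{metric}{{node="{node}"}} {store.get(key, 0)}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render({n: fetch_status(n) for n in NODES}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=9101):
    """Serve /metrics until interrupted (the port is an arbitrary choice)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the exporter, after which `curl localhost:9101/metrics` should return the same metric names as the shell script. Unlike the script, `render` groups all samples of one metric together, which is what the Prometheus text format asks for.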

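12.4 Prometheus Integration

12.4.1 Configuring Prometheus Scraping

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

# Load the alert rules from section 12.4.2 (path as mounted in section 12.4.3)
rule_files:
  - /etc/prometheus/rqlite-alerts.yml

# Send firing alerts to Alertmanager (service name from section 12.4.3)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

scrape_configs:
  # rqlite's status API returns JSON, so Prometheus scrapes a custom exporter
  # that serves Prometheus-format text at /metrics, not the rqlite API port itself.
  - job_name: 'rqlite'
    metrics_path: /metrics
    scrape_interval: 30s
    static_configs:
      - targets:
          # Assumed exporter address; replace with wherever your exporter listens
          - 'localhost:9101'
        labels:
          cluster: 'rqlite-prod'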

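12.4.2 Prometheus Alert Rules

# rqlite-alerts.yml
groups:
  - name: rqlite_alerts
    rules:
      # Node down
      - alert: RqliteNodeDown
        expr: rqlite_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "rqlite node {{ $labels.node }} is down"
          description: "Node {{ $labels.node }} has been unreachable for more than 1 minute"

      # No Leader
      # count() over an empty result returns nothing rather than 0, so fall back to vector(0)
      - alert: RqliteNoLeader
        expr: (count(rqlite_raft_state{state="Leader"}) or vector(0)) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "rqlite cluster has no Leader"
          description: "No node in the cluster currently reports itself as Leader"

      # Multiple Leaders (possible split brain)
      - alert: RqliteMultipleLeaders
        expr: count(rqlite_raft_state{state="Leader"}) > 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "rqlite cluster has multiple Leaders"
          description: "Detected {{ $value }} Leaders; the cluster may be in a split-brain state"

      # Raft log replication lag
      # max() drops all labels, so wrap it in scalar() to subtract it from each per-node series
      - alert: RqliteReplicationLag
        expr: (scalar(max(rqlite_applied_index)) - rqlite_applied_index) > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "rqlite node {{ $labels.node }} is lagging in replication"
          description: "Node is {{ $value }} log entries behind"

      # Database growing too fast
      # delta() rather than increase(): rqlite_db_size is a gauge, not a counter
      - alert: RqliteDBSizeGrowing
        expr: delta(rqlite_db_size[1h]) > 104857600  # 100 MiB per hour
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "rqlite database is growing too fast"
          description: "The database grew by {{ $value }} bytes over the past hour"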
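
Before deploying rules like these, it can help to check their conditions offline against sample values. The sketch below mirrors four of the rule names above in plain Python. It deliberately ignores the `for:` durations and the database-growth rule, and `evaluate` along with its input shape are assumptions made for illustration.

```python
def evaluate(samples, lag_threshold=1000):
    """Return the alert names that a single snapshot of samples would trigger.

    samples maps node -> {"up": 0 or 1, "raft_state": 0..3, "applied_index": int},
    using the numeric state encoding from the exporter (3 = Leader).
    """
    alerts = []
    if any(s.get("up", 0) == 0 for s in samples.values()):
        alerts.append("RqliteNodeDown")

    leaders = sum(1 for s in samples.values() if s.get("raft_state") == 3)
    if leaders == 0:
        alerts.append("RqliteNoLeader")
    elif leaders > 1:
        alerts.append("RqliteMultipleLeaders")

    # Lag of the slowest node relative to the most advanced one.
    indexes = [s["applied_index"] for s in samples.values() if "applied_index" in s]
    if indexes and max(indexes) - min(indexes) > lag_threshold:
        alerts.append("RqliteReplicationLag")
    return alerts
```

Feeding it a snapshot where one follower trails by 1500 entries, for instance, is a quick way to confirm that the 1000-entry threshold behaves as intended.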

12.4.3 Docker Compose Monitoring Stack

# docker-compose-monitoring.yml
version: "3.8"

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - ./rqlite-alerts.yml:/etc/prometheus/rqlite-alerts.yml
      - prometheus-data:/prometheus
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin  # demo value only; change it before exposing Grafana
    volumes:
      - grafana-data:/var/lib/grafana
    networks:
      - monitoring
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    ports:
      - "9093:9093"
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml
    networks:
      - monitoring

volumes:
  prometheus-data:
  grafana-data:

networks:
  monitoring:
    driver: bridge

12.5 Grafana Dashboards

12.5.1 Dashboard JSON Configuration

{
    "dashboard": {
        "title": "rqlite Cluster Monitoring",
        "panels": [
            {
                "title": "Node Status",
                "type": "stat",
                "targets": [{
                    "expr": "rqlite_up",
                    "legendFormat": "{{ node }}"
                }],
                "fieldConfig": {
                    "defaults": {
                        "mappings": [
                            {"type": "value", "options": {"0": {"text": "Offline", "color": "red"}}},
                            {"type": "value", "options": {"1": {"text": "Online", "color": "green"}}}
                        ]
                    }
                }
            },
            {
                "title": "Raft Role",
                "type": "stat",
                "targets": [{
                    "expr": "rqlite_raft_state",
                    "legendFormat": "{{ node }} - {{ state }}"
                }]
            },
            {
                "title": "Applied Index Trend",
                "type": "timeseries",
                "targets": [{
                    "expr": "rate(rqlite_applied_index[5m])",
                    "legendFormat": "{{ node }}"
                }]
            },
            {
                "title": "Database Size",
                "type": "timeseries",
                "targets": [{
                    "expr": "rqlite_db_size",
                    "legendFormat": "{{ node }}"
                }],
                "fieldConfig": {
                    "defaults": {
                        "unit": "bytes"
                    }
                }
            }
        ]
    }
}

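12.5.2 Key Monitoring Panels

| Panel | PromQL query | Notes |
|---|---|---|
| Node availability | rqlite_up | 1 = online, 0 = offline |
| Leader distribution | rqlite_raft_state{state="Leader"} | Should return exactly one series |
| Replication lag | scalar(max(rqlite_applied_index)) - rqlite_applied_index | Log entries by which each node trails the most advanced node |
| Write rate | rate(rqlite_applied_index[5m]) | Log entries applied per second |
| Database size | rqlite_db_size | Current size of the database file |
| Snapshot count | rqlite_num_snapshots | Cumulative number of snapshots |
| Connections | rqlite_num_open_connections | Number of currently open connections |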

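12.6 Log Management

12.6.1 rqlite Log Levels

rqlite uses Go's standard logging library and writes its logs to stderr:

# View logs with Docker
docker logs rqlite1 --tail 100 -f

# View logs with systemd
journalctl -u rqlited -f

# Log level: rqlite does not currently support changing the log level at runtime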

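12.6.2 Key Log Patterns

| Log pattern | Meaning | Action |
|---|---|---|
| node is ready | Node is ready | Normal |
| RAFT: entering Leader state | Node became Leader | Normal |
| RAFT: entering Follower state | Node became Follower | Normal |
| RAFT: no known peers, starting as leader | First startup | Normal |
| RAFT: election timeout reached | Election timed out | ⚠️ Watch |
| failed to connect to | Connection failed | ❌ Investigate |
| snapshot started | Snapshot started | Normal |
| snapshot complete | Snapshot finished | Normal |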
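
The table above translates directly into a simple log classifier. The following is a minimal sketch that matches on the substrings exactly as listed; real log wording can vary between rqlite versions, and `classify` and `worst` are hypothetical helper names.

```python
# (substring, action) pairs copied from the table above.
LOG_PATTERNS = [
    ("node is ready", "normal"),
    ("RAFT: entering Leader state", "normal"),
    ("RAFT: entering Follower state", "normal"),
    ("RAFT: no known peers, starting as leader", "normal"),
    ("RAFT: election timeout reached", "watch"),
    ("failed to connect to", "investigate"),
    ("snapshot started", "normal"),
    ("snapshot complete", "normal"),
]
SEVERITY_ORDER = {"normal": 0, "watch": 1, "investigate": 2}


def classify(line):
    """Return the action for the first pattern found in line."""
    for substring, action in LOG_PATTERNS:
        if substring in line:
            return action
    return "unclassified"


def worst(lines):
    """Return the most severe action seen across a batch of log lines."""
    actions = [classify(line) for line in lines]
    return max(actions, key=lambda a: SEVERITY_ORDER.get(a, -1), default="unclassified")
```

Running `worst` over the last few hundred lines of a node's log gives a one-word summary that is convenient for a cron job or a chat notification.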

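12.6.3 Centralized Log Collection

# Promtail configuration (log collection for Loki)
scrape_configs:
  - job_name: rqlite
    docker_sd_configs:
      - host: unix:///var/run/docker.sock  # Docker daemon to query for containers
        filters:
          - name: label
            values: ["app=rqlite"]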
    pipeline_stages:
      - regex:
          expression: '\[(?P<component>\w+)\] (?P<level>\w+): (?P<message>.*)'
      - labels:
          component:
          level:
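
The regex stage above uses named groups in the `(?P<name>...)` form, which Python's `re` module also accepts, so the expression can be tried out locally before it goes into the pipeline. This is only a sketch: the sample line in the usage note is hypothetical and chosen to fit the pattern, and real rqlite log lines may look different.

```python
import re
from typing import Optional

# Same expression as in the pipeline stage above.
LINE_RE = re.compile(r"\[(?P<component>\w+)\] (?P<level>\w+): (?P<message>.*)")


def parse_line(line: str) -> Optional[dict]:
    """Return the captured groups, or None when the line does not match."""
    # search() rather than match(), so a leading timestamp does not block the match.
    match = LINE_RE.search(line)
    return match.groupdict() if match else None
```

For example, `parse_line("[store] INFO: snapshot complete")` yields the `component`, `level`, and `message` values that the `labels` stage would then turn into labels. If your actual log lines return `None` here, the expression needs adjusting before Promtail can extract anything.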

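12.7 Operational Scenario: Alerting Policy

| Severity | Condition | Response time | Action |
|---|---|---|---|
| P0 Emergency | Cluster has no Leader | 5 minutes | Investigate immediately; restart the failed node |
| P0 Emergency | All nodes down | 1 minute | Respond immediately; restore service |
| P1 Severe | Single node down | 15 minutes | Investigate and restore the node |
| P1 Severe | Split brain detected | 5 minutes | Stop writes; investigate the network |
| P2 Warning | Replication lag > 1000 | 1 hour | Check network and disk |
| P2 Warning | Abnormal database growth | 1 hour | Investigate the cause of the growth |
| P3 Notice | Node restarted | Same day | Record and keep watching |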
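
A policy table like this can be encoded as data so that tooling can compute response deadlines. Below is a minimal sketch in which the condition keys (`no_leader`, `node_down`, and so on) are hypothetical identifiers, and "same day" is interpreted as the end of the calendar day on which the alert fired.

```python
from datetime import datetime, timedelta

# condition -> (severity, response window), copied from the table above.
POLICY = {
    "no_leader": ("P0", timedelta(minutes=5)),
    "all_nodes_down": ("P0", timedelta(minutes=1)),
    "node_down": ("P1", timedelta(minutes=15)),
    "split_brain": ("P1", timedelta(minutes=5)),
    "replication_lag": ("P2", timedelta(hours=1)),
    "db_growth": ("P2", timedelta(hours=1)),
}


def respond_by(condition: str, fired_at: datetime):
    """Return (severity, deadline) for a condition that fired at fired_at."""
    if condition == "node_restart":
        # P3 has no fixed window; use the end of the same calendar day.
        return "P3", fired_at.replace(hour=23, minute=59, second=59, microsecond=0)
    severity, window = POLICY[condition]
    return severity, fired_at + window
```

Keeping the windows in one dictionary makes it straightforward to review or adjust the policy without touching the logic.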

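12.8 Monitoring Checklist

| Check | Collection interval | Alert threshold |
|---|---|---|
| Node availability | 30s | 3 consecutive failures |
| Leader presence | 30s | 0 Leaders |
| Replication lag | 1min | > 1000 log entries |
| Database size | 5min | Abnormal growth rate |
| Disk usage | 1min | > 85% |
| HTTP connections | 1min | > 1000 |
| Goroutine count | 5min | > 1000 |
| Response latency | 30s | > 5s |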
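
The numeric thresholds in the checklist can be checked mechanically. The sketch below covers only the rows with a fixed numeric limit; the metric keys are hypothetical names, and rows such as "abnormal growth rate" need a judgment call that a simple comparison cannot make.

```python
# Numeric limits copied from the checklist above.
THRESHOLDS = {
    "replication_lag": 1000,        # log entries
    "disk_usage_percent": 85,
    "http_connections": 1000,
    "goroutines": 1000,
    "response_latency_seconds": 5,
}


def breaches(observed: dict) -> dict:
    """Return the observed values that exceed their checklist threshold."""
    return {
        name: value
        for name, value in observed.items()
        if name in THRESHOLDS and value > THRESHOLDS[name]
    }
```

Values that sit exactly on a threshold are not reported, matching the strict "greater than" wording in the table.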

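12.9 Chapter Summary

| Topic | Summary |
|---|---|
| Status API | /status, /nodes, /readyz |
| Prometheus | Collects metrics through a custom exporter |
| Grafana | Visualizes cluster state and trends |
| Alerting policy | Tiered alerts across four levels, P0 to P3 |
| Log management | Centralized collection; watch for the key log patterns |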
