15 - Monitoring and Operations

15.1 The INFO Command in Detail

# Fetch everything
redis-cli INFO

# Fetch a single section
redis-cli INFO server      # Server information
redis-cli INFO clients      # Client connections
redis-cli INFO memory       # Memory usage
redis-cli INFO persistence  # Persistence (RDB/AOF)
redis-cli INFO stats        # General statistics
redis-cli INFO replication  # Replication
redis-cli INFO cpu          # CPU usage
redis-cli INFO keyspace     # Per-database keyspace
redis-cli INFO modules      # Loaded modules
redis-cli INFO commandstats # Per-command statistics

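Key Server Metrics

redis-cli INFO server | grep -E "redis_version|uptime_in_seconds|hz|config_file"

| Metric | Description | Alert threshold |
| --- | --- | --- |
| redis_version | Redis version | Below 6.0: upgrade recommended |
| uptime_in_seconds | Uptime in seconds | An unexpected drop indicates an abnormal restart |
| hz | Server timer frequency (background task runs per second) | Default 10 |
| config_file | Path to the configuration file | |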
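The restart-detection idea in the table can be sketched in shell. This is a minimal sketch rather than part of the original guide: `info_field` and `detect_restart` are hypothetical helper names, and both work on plain text or numbers so they can be tried without a live server.

```shell
# Hypothetical helper: print the value of one field from INFO output on stdin.
# INFO lines end in \r\n, so the trailing \r is stripped.
info_field() {
    awk -F: -v key="$1" '$1 == key { sub(/\r$/, "", $2); print $2 }'
}

# Hypothetical helper: report a restart when uptime went backwards.
# Arguments: uptime at the previous sample, uptime now (both in seconds).
detect_restart() {
    if [ "$2" -lt "$1" ]; then
        echo "restarted"
    else
        echo "ok"
    fi
}
```

In practice the current value would come from `redis-cli INFO server | info_field uptime_in_seconds`, with the previous value kept in a state file of your choosing. Values that contain a colon themselves are cut at that colon by this simple split.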

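Key Client Metrics

redis-cli INFO clients | grep -E "connected_clients|blocked_clients|tracking_clients"

| Metric | Description | Alert threshold |
| --- | --- | --- |
| connected_clients | Current number of client connections | > 80% of maxclients |
| blocked_clients | Clients blocked on a blocking call | > 0 for a sustained period |
| tracking_clients | Clients using Tracking (client-side caching) | |
| maxclients | Maximum number of connections | Set according to workload |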
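The 80% threshold on connected_clients is a simple ratio against maxclients. A minimal sketch of that arithmetic follows; `client_usage_pct` is a hypothetical name, and the two inputs are assumed to have been read already.

```shell
# Hypothetical helper: percentage of maxclients currently in use.
# Arguments: connected_clients, maxclients.
client_usage_pct() {
    awk -v used="$1" -v max="$2" 'BEGIN {
        if (max <= 0) { print "n/a"; exit }   # guard against a missing limit
        printf "%.1f\n", used * 100 / max
    }'
}
```

The limit itself can be read with `redis-cli CONFIG GET maxclients | tail -n 1`; a result above 80.0 corresponds to the alert condition in the table.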

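Key Memory Metrics

redis-cli INFO memory | grep -E "used_memory_human|used_memory_peak_human|mem_fragmentation_ratio|mem_allocator"

| Metric | Description | Alert threshold |
| --- | --- | --- |
| used_memory | Memory allocated by Redis, in bytes | > 80% of maxmemory |
| used_memory_peak | Peak memory usage | |
| mem_fragmentation_ratio | Fragmentation ratio (used_memory_rss / used_memory) | > 1.5 or < 1.0 |
| used_memory_rss | Memory the operating system has allocated to the Redis process | |
| used_memory_dataset | Memory used by the data itself | |
| used_memory_overhead | Memory used for internal bookkeeping | |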
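The fragmentation band from the table (alert above 1.5 or below 1.0) can be checked directly on INFO output. This is a sketch under the assumption that `INFO memory` text arrives on stdin; `check_fragmentation` is a hypothetical name.

```shell
# Hypothetical helper: classify mem_fragmentation_ratio read from stdin.
check_fragmentation() {
    awk -F: '
        $1 == "mem_fragmentation_ratio" {
            ratio = $2 + 0            # numeric conversion ignores the trailing \r
            if (ratio > 1.5)      print "high " ratio
            else if (ratio < 1.0) print "low " ratio
            else                  print "ok " ratio
        }'
}
```

Usage would look like `redis-cli INFO memory | check_fragmentation`. A ratio below 1.0 usually means part of the data set has been swapped out, which tends to hurt latency more than a high ratio does.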

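Key Persistence Metrics

redis-cli INFO persistence | grep -E "rdb_last_save_time|rdb_last_bgsave_status|aof_enabled|aof_last_rewrite_status"

| Metric | Description | Alert threshold |
| --- | --- | --- |
| rdb_last_save_time | Unix timestamp of the last successful RDB save | Older than the expected save interval |
| rdb_last_bgsave_status | Status of the last BGSAVE | err |
| aof_enabled | Whether AOF is enabled | Depends on configuration |
| aof_last_rewrite_status | Status of the last AOF rewrite | err |
| aof_current_size | Current size of the AOF file | Keeps growing |
| aof_base_size | AOF size at the last rewrite or startup | |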

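Key Stats Metrics

redis-cli INFO stats | grep -E "total_commands_processed|instantaneous_ops_per_sec|keyspace_hits|keyspace_misses|rejected_connections|expired_keys"

| Metric | Description | Alert threshold |
| --- | --- | --- |
| total_commands_processed | Total number of commands processed | |
| instantaneous_ops_per_sec | Operations per second (QPS) | Depends on capacity |
| keyspace_hits | Number of successful key lookups | |
| keyspace_misses | Number of failed key lookups | Hit rate < 80% |
| rejected_connections | Number of rejected connections | > 0 |
| expired_keys | Number of keys removed by expiration | |
| evicted_keys | Number of evicted keys | > 0 (memory may be insufficient) |
| latest_fork_usec | Duration of the most recent fork, in microseconds | > 500,000 (500 ms) |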
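Counters such as rejected_connections and evicted_keys are cumulative, so a "> 0" threshold is most useful when applied to the growth between two samples. A minimal sketch of that comparison, with `counter_delta` as a hypothetical helper that takes two INFO snapshots as strings:

```shell
# Hypothetical helper: growth of one cumulative counter between two snapshots.
# Arguments: field name, earlier INFO text, later INFO text.
counter_delta() {
    prev=$(printf '%s\n' "$2" | awk -F: -v k="$1" '$1 == k { sub(/\r$/, "", $2); print $2 }')
    curr=$(printf '%s\n' "$3" | awk -F: -v k="$1" '$1 == k { sub(/\r$/, "", $2); print $2 }')
    # A field missing from a snapshot is treated as 0.
    echo $(( ${curr:-0} - ${prev:-0} ))
}
```

For example, capture `before=$(redis-cli INFO stats)`, wait a minute, capture `after=$(redis-cli INFO stats)`, then run `counter_delta evicted_keys "$before" "$after"`. A negative result suggests the server restarted or the statistics were reset in between.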

Calculating the Hit Rate

# Keyspace hit rate
HITS=$(redis-cli INFO stats | grep '^keyspace_hits:' | cut -d: -f2 | tr -d '\r')
MISSES=$(redis-cli INFO stats | grep '^keyspace_misses:' | cut -d: -f2 | tr -d '\r')
TOTAL=$((HITS + MISSES))
if [ "$TOTAL" -gt 0 ]; then
    HIT_RATE=$(echo "scale=2; $HITS * 100 / $TOTAL" | bc)
    echo "Hit rate: ${HIT_RATE}%"
fi
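The same calculation can be done from a single INFO snapshot, so hits and misses are read at the same instant. This variant is a sketch, not the original author's script: `hit_rate_from_info` is a hypothetical name, and it uses awk in place of bc.

```shell
# Hypothetical helper: hit rate (percent) from INFO stats output on stdin.
hit_rate_from_info() {
    awk -F: '
        $1 == "keyspace_hits"   { hits = $2 + 0 }
        $1 == "keyspace_misses" { misses = $2 + 0 }
        END {
            total = hits + misses
            if (total > 0) printf "%.2f\n", hits * 100 / total
            else           print "n/a"      # no lookups recorded yet
        }'
}
```

Usage would be `redis-cli INFO stats | hit_rate_from_info`. Note that printf rounds to two decimals, whereas `bc` with `scale=2` truncates, so the last digit may differ from the script above.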

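Key Replication Metrics

redis-cli INFO replication | grep -E "role|connected_slaves|master_link_status|master_last_io_seconds_ago|slave_repl_offset"

| Metric | Description | Alert threshold |
| --- | --- | --- |
| role | Role (master/slave) | |
| connected_slaves | Number of connected replicas | < expected count |
| master_link_status | Status of the link to the master | down |
| master_last_io_seconds_ago | Seconds since the last interaction with the master | > 10 |
| slave_repl_offset | Replication offset of the replica | Large gap from the master's offset |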
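The offset gap between master and replica can be computed from the master's own `INFO replication` output, which lists each replica on a `slaveN:` line with its acknowledged offset. The following is a sketch under that assumption; `replica_offset_lag` is a hypothetical name.

```shell
# Hypothetical helper: per-replica lag in bytes, from the master's
# INFO replication output on stdin.
replica_offset_lag() {
    awk '
        { sub(/\r$/, "") }                       # strip the trailing \r
        /^master_repl_offset:/ { split($0, kv, ":"); master = kv[2] }
        /^slave[0-9]+:/ {
            n++
            name[n] = substr($0, 1, index($0, ":") - 1)
            if (match($0, /offset=[0-9]+/))
                off[n] = substr($0, RSTART + 7, RLENGTH - 7)
        }
        # The slaveN lines come before master_repl_offset, so subtract at the end.
        END { for (i = 1; i <= n; i++) print name[i], master - off[i] }'
}
```

Run it as `redis-cli INFO replication | replica_offset_lag` against the master. What counts as a "large" gap depends on write volume, so any threshold here is workload-specific.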

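Commandstats

redis-cli INFO commandstats | head -20

| Metric | Description |
| --- | --- |
| cmdstat_set | Statistics for SET: calls (number of calls), usec (total time in microseconds), usec_per_call (average time per call) |
| cmdstat_get | Statistics for GET |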
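To rank commands by average cost instead of reading the raw list, the `cmdstat_*` lines can be split into fields. A minimal sketch, assuming the `calls=...,usec=...,usec_per_call=...` layout; `slowest_commands` is a hypothetical name.

```shell
# Hypothetical helper: list commands by usec_per_call, highest first.
# Optional argument: how many rows to print (default 10).
slowest_commands() {
    awk -F'[:=,]' '
        { sub(/\r$/, "") }                      # strip the trailing \r
        /^cmdstat_/ {
            sub(/^cmdstat_/, "", $1)
            # Split on ":", "=" and ",": $3 = calls, $7 = usec_per_call
            print $7, $3, $1
        }' | sort -rn | head -n "${1:-10}"
}
```

`redis-cli INFO commandstats | slowest_commands 5` prints average microseconds per call, call count, and command name. A command with a high average but very few calls may matter less than a cheap command called constantly, so the call count is kept alongside.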

15.2 Real-Time Monitoring

redis-cli monitor

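# Print every command in real time (use with caution in production: throughput can drop by about 50%)
redis-cli monitor

# Watch specific commands only (MONITOR prints the command name in double quotes, in whatever case the client sent)
redis-cli monitor | grep -iE '"(set|get)"'

# Watch a specific key only
redis-cli monitor | grep "user:1001"

# Count command frequency over a 10,000-line sample
redis-cli monitor | head -10000 | awk '{print $4}' | sort | uniq -c | sort -nr | head -10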

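⚠️ Note: MONITOR significantly degrades performance (by roughly 50%). Use it only briefly while debugging, and never leave it running for long in production.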
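Because MONITOR should only run briefly, it helps to cap the sampling window and fold the command names into one case before counting. The sketch below assumes the usual MONITOR line layout (timestamp, `[db client]`, then the quoted command); `monitor_command_counts` is a hypothetical name.

```shell
# Hypothetical helper: count command names in MONITOR output read from stdin.
monitor_command_counts() {
    awk '
        $1 ~ /^[0-9]+\.[0-9]+$/ {        # only lines that start with a timestamp
            cmd = toupper($4)            # fold "get" and "GET" together
            gsub(/"/, "", cmd)
            count[cmd]++
        }
        END { for (c in count) print count[c], c }' | sort -rn
}
```

With GNU coreutils available, `timeout 5 redis-cli monitor | monitor_command_counts` limits the capture to five seconds, which bounds the performance impact described in the note.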

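redis-cli --stat

# Rolling statistics, one line per second
redis-cli --stat
# Illustrative output (exact layout depends on the redis-cli version):
# ------- data ------ --------------------- load -------------------- - child -
# keys       mem      clients blocked requests            connections
# 1000       2.50M    5       0       1000 (+0)           10
# 1001       2.51M    5       0       2200 (+1200)        10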

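redis-cli --latency

# Latency test (round-trip time in milliseconds)
redis-cli --latency
# min: 0, max: 1, avg: 0.19 (264 samples)

# Latency history (one summary line per 15-second window by default)
redis-cli --latency-history
# min: 0, max: 1, avg: 0.19 (132 samples) -- 15.00 seconds range

# Latency distribution (color spectrum)
redis-cli --latency-dist

# Measure the intrinsic latency of the host for 5 seconds.
# Run this on the Redis server machine: it does not time any Redis command,
# it measures the stalls that the OS or hypervisor imposes on a busy loop.
redis-cli --intrinsic-latency 5
# Prints "Max latency so far: N microseconds." whenever a new maximum is seen, then a summary of the run.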

15.3 Monitoring with Prometheus + Grafana

Installing redis_exporter

# Using Docker
docker run -d --name redis-exporter \
  -p 9121:9121 \
  oliver006/redis_exporter \
  --redis.addr redis://192.168.1.100:6379 \
  --redis.password YourPassword

Complete Monitoring Stack with Docker Compose

version: '3.8'

services:
  redis:
    image: redis:7.2
    container_name: redis
    ports:
      - "6379:6379"
    command: redis-server --requirepass mypassword --appendonly yes
    networks:
      - monitoring

  redis-exporter:
    image: oliver006/redis_exporter:latest
    container_name: redis-exporter
    ports:
      - "9121:9121"
    environment:
      REDIS_ADDR: redis://redis:6379
      REDIS_PASSWORD: mypassword
    depends_on:
      - redis
    networks:
      - monitoring

  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: admin123
    volumes:
      - grafana-data:/var/lib/grafana
    depends_on:
      - prometheus
    networks:
      - monitoring

volumes:
  grafana-data:

networks:
  monitoring:
    driver: bridge

Prometheus Configuration

# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
    metrics_path: /metrics

Grafana Dashboard

Import the Redis monitoring dashboard (Dashboard ID: 11835):

1. Open Grafana → http://localhost:3000
2. Left-hand menu → Dashboards → Import
3. Enter Dashboard ID: 11835
4. Select the Prometheus data source
5. Import

Key Metrics (PromQL)

# QPS
rate(redis_commands_processed_total[5m])

# Memory usage
redis_memory_used_bytes

# Client connections
redis_connected_clients

# Hit rate
rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m]))

# Average latency per command, in seconds (derived from the exporter's per-command counters)
rate(redis_commands_duration_seconds_total[5m]) / rate(redis_commands_total[5m])

# Evictions per second
rate(redis_evicted_keys_total[5m])

# Slow log length
redis_slowlog_length

# Replication lag in seconds
redis_connected_slave_lag_seconds

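Alert Rules

# alert-rules.yml (must be mounted into the Prometheus container and listed under rule_files in prometheus.yml)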
groups:
  - name: redis_alerts
    rules:
      - alert: RedisDown
        expr: redis_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Redis instance down"
          description: "Redis {{ $labels.instance }} has stopped responding"

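      - alert: RedisHighMemory
        # redis_memory_max_bytes is 0 when maxmemory is unset; skip that case to avoid dividing by zero
        expr: redis_memory_used_bytes / redis_memory_max_bytes > 0.8 and redis_memory_max_bytes > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory usage above 80%"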

      - alert: RedisHighFragmentation
        expr: redis_mem_fragmentation_ratio > 1.5
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis memory fragmentation ratio too high"

      - alert: RedisLowHitRate
        expr: rate(redis_keyspace_hits_total[5m]) / (rate(redis_keyspace_hits_total[5m]) + rate(redis_keyspace_misses_total[5m])) < 0.8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Redis hit rate below 80%"

      - alert: RedisTooManyConnections
        # 8000 is 80% of the default maxclients (10000); adjust if maxclients differs
        expr: redis_connected_clients > 8000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Redis has too many connections"

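      - alert: RedisReplicationBroken
        # Apply to master instances only: a replica or standalone instance normally reports 0 here
        expr: redis_connected_slaves < 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Redis replication link broken"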

15.4 Log Management

# redis.conf
loglevel notice
logfile /var/log/redis/redis-server.log

# Log rotation (logrotate)
# /etc/logrotate.d/redis
/var/log/redis/redis-server.log {
    daily
    rotate 7
    compress
    delaycompress
    missingok
    notifempty
    create 640 redis redis
    # No postrotate signal is needed: Redis reopens the log file for every write.
    # Do not send SIGUSR1 to redis-server; it is not a log-reopen signal there.
}

📌 Business Scenarios

Scenario 1: Capacity Planning

# Track the memory growth trend at regular intervals
redis-cli INFO memory | grep used_memory_human
# Estimate when usage will reach maxmemory
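One simple way to estimate when maxmemory will be reached is a straight-line projection from two samples. The sketch below assumes growth stays linear, which real workloads often violate; `days_until_full` is a hypothetical name, and the byte values would come from the numeric `used_memory` field, not `used_memory_human`.

```shell
# Hypothetical helper: days until maxmemory is reached, by linear projection.
# Arguments: used bytes at the first sample, used bytes at the second sample,
#            seconds between the samples, maxmemory in bytes.
days_until_full() {
    awk -v a="$1" -v b="$2" -v dt="$3" -v max="$4" 'BEGIN {
        if (dt <= 0 || max <= 0) { print "n/a"; exit }
        growth = (b - a) / dt                     # bytes per second
        if (growth <= 0) { print "n/a"; exit }    # flat or shrinking usage
        printf "%.1f\n", (max - b) / growth / 86400
    }'
}
```

For example, growing from 1.0 GB to 1.1 GB over one day with a 2 GB limit leaves 0.9 GB of headroom at 0.1 GB per day, so `days_until_full 1000000000 1100000000 86400 2000000000` prints `9.0`.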

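Scenario 2: Performance Tuning

# Find the most expensive commands, sorted by total time (the usec value, third "="-separated field)
redis-cli INFO commandstats | sort -t= -k3 -nr | head -10

Scenario 3: Alert Notifications

# Prometheus + Alertmanager + DingTalk / WeCom
# Define alert rules and route them to a notification channel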

🔗 Further Reading