Chapter 12: Monitoring and Alerting
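12.1 Metrics Overview
Key metric categories
| Category | Metric | Alert threshold | Notes |
|---|---|---|---|
| Health | uptime | < 60s | Instance restarted recently |
| Hit rate | hit_rate, derived as get_hits / (get_hits + get_misses) | < 80% | Cache is ineffective |
| Memory | bytes / limit_maxbytes | > 90% | Memory is close to exhaustion |
| Connections | curr_connections / max_connections | > 80% | Connection count is approaching the limit |
| Evictions | evictions | > 0 and still growing | Items are being evicted because memory is short |
| QPS | cmd_get + cmd_set | Depends on baseline | Abnormal traffic |
| Latency | P99 response time | > 5ms | Performance degradation |
| Slab | slab_automove | Abnormal | Slab allocation problem |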
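The latency row is the one metric in this table that the `stats` command does not report, so P99 has to be measured from the client side. The sketch below is one way to do that, not a definitive implementation: it assumes a nearest-rank percentile, and `probe_latencies_ms` is a hypothetical helper that times round trips of the cheap `version` command.

```python
import math
import socket
import time


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def probe_latencies_ms(host="localhost", port=11211, count=200):
    """Hypothetical helper: time `count` round trips of the `version` command, in milliseconds."""
    latencies = []
    with socket.create_connection((host, port), timeout=5) as conn:
        for _ in range(count):
            start = time.perf_counter()
            conn.sendall(b"version\r\n")
            reply = b""
            while not reply.endswith(b"\r\n"):
                chunk = conn.recv(4096)
                if not chunk:
                    raise ConnectionError("server closed the connection")
                reply += chunk
            latencies.append((time.perf_counter() - start) * 1000)
    return latencies


def p99_exceeds(samples_ms, threshold_ms=5.0):
    """True when the P99 of the samples is above the alert threshold."""
    return percentile(samples_ms, 99) > threshold_ms
```

Calling `p99_exceeds(probe_latencies_ms())` against a live instance applies the 5 ms threshold from the table. Latency measured this way includes the network round trip, so the baseline depends on where the probe runs.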
12.2 The stats Command in Detail
General statistics
echo "stats" | nc localhost 11211
| Stat | Description | Importance |
|---|---|---|
| pid | Process ID | ★ |
| uptime | Seconds since the server started | ★★★ |
| time | Current Unix timestamp | ★ |
| version | Version string | ★★ |
| libevent | libevent version | ★ |
| pointer_size | Pointer size in bits | ★ |
| rusage_user | User-mode CPU time | ★★ |
| rusage_system | Kernel-mode CPU time | ★★ |
| curr_connections | Currently open connections | ★★★★★ |
| total_connections | Connections opened since startup | ★★ |
| connection_structures | Connection structures allocated | ★★ |
| rejected_connections | Connections rejected | ★★★★ |
| cmd_get | GET requests | ★★★★ |
| cmd_set | SET requests | ★★★★ |
| cmd_flush | FLUSH requests | ★★★ |
| cmd_touch | TOUCH requests | ★★ |
| get_hits | GET hits | ★★★★★ |
| get_misses | GET misses | ★★★★★ |
| get_expired | GETs that found an expired item | ★★★ |
| get_flushed | GETs that found a flushed item | ★★ |
| delete_misses | DELETE misses | ★★ |
| delete_hits | DELETE hits | ★★ |
| incr_misses | INCR misses | ★★ |
| incr_hits | INCR hits | ★★ |
| decr_misses | DECR misses | ★★ |
| decr_hits | DECR hits | ★★ |
| cas_misses | CAS misses | ★★ |
| cas_hits | CAS hits | ★★ |
| cas_badval | CAS requests whose CAS value did not match | ★★ |
| bytes | Bytes currently used to store items | ★★★★★ |
| limit_maxbytes | Memory limit in bytes | ★★★★★ |
| curr_items | Items currently stored | ★★★★ |
| total_items | Items stored since startup | ★★★ |
| evictions | Items evicted | ★★★★★ |
| bytes_read | Bytes read from the network | ★★ |
| bytes_written | Bytes written to the network | ★★ |
| threads | Worker threads | ★★ |
| hash_power_level | Hash table power level (size exponent) | ★★ |
| hash_bytes | Bytes used by the hash table | ★★ |
| hash_is_expanding | Whether the hash table is being expanded | ★★ |
| slab_reassign_running | Whether a slab page move is in progress | ★★ |
| slabs_moved | Slab pages moved | ★★ |
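Calculating the hit rate
#!/bin/bash
# Compute the Memcached hit rate from the cumulative stats counters
# Sending "quit" makes the server close the connection, so nc exits promptly
STATS=$(printf 'stats\r\nquit\r\n' | nc localhost 11211)
# Match the stat name exactly and strip the trailing \r of the protocol line ending,
# which would otherwise break the shell arithmetic below
HITS=$(echo "$STATS" | tr -d '\r' | awk '$2 == "get_hits" {print $3}')
MISSES=$(echo "$STATS" | tr -d '\r' | awk '$2 == "get_misses" {print $3}')
TOTAL=$((HITS + MISSES))
if [ "$TOTAL" -gt 0 ]; then
    HIT_RATE=$(echo "scale=2; $HITS * 100 / $TOTAL" | bc)
    echo "Hit rate: ${HIT_RATE}%"
    echo "Hits: $HITS, misses: $MISSES, total: $TOTAL"
else
    echo "No request data yet"
fi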
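Because `get_hits` and `get_misses` are cumulative since startup, the script above reports the lifetime hit rate, which can hide a recent drop. A minimal sketch of the interval variant follows, assuming two snapshots already parsed into dicts. The function name `interval_hit_rate` and the sample numbers are illustrative, not part of Memcached.

```python
def interval_hit_rate(before, after):
    """Hit rate (%) between two `stats` snapshots, or None if no GETs occurred.

    `before` and `after` map stat names to values, as parsed from two runs
    of the `stats` command taken some time apart.
    """
    hits = int(after["get_hits"]) - int(before["get_hits"])
    misses = int(after["get_misses"]) - int(before["get_misses"])
    if hits < 0 or misses < 0:
        # Counters went backwards: the server restarted between snapshots,
        # so the second snapshot already covers only the new process.
        hits, misses = int(after["get_hits"]), int(after["get_misses"])
    total = hits + misses
    return None if total == 0 else hits * 100 / total


# Made-up snapshots for illustration.
first = {"get_hits": "9000", "get_misses": "1000"}
second = {"get_hits": "9300", "get_misses": "1700"}
print(interval_hit_rate(first, second))  # 300 hits out of 1000 GETs in the interval
```

In this example the lifetime hit rate is still above 80% (9300 of 11000), while the interval rate is 30%, which is the figure an alert should look at.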
Item statistics
echo "stats items" | nc localhost 11211
# STAT items:1:number 523
# STAT items:1:number_hot 100
# STAT items:1:number_warm 150
# STAT items:1:number_cold 250
# STAT items:1:number_temp 23
# STAT items:1:age 1234
# STAT items:1:evicted 50
# STAT items:1:evicted_nonzero 40
# STAT items:1:evicted_time 300
# STAT items:1:outofmemory 5
# STAT items:1:tailrepairs 10
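Each line of this output has the form `STAT items:<class>:<field> <value>`, so it is easier to work with once grouped by slab class. The following is a sketch under that assumption. `parse_items` and `classes_with_evictions` are hypothetical helper names, and the second class in the sample is invented for illustration.

```python
from collections import defaultdict

SAMPLE = """STAT items:1:number 523
STAT items:1:age 1234
STAT items:1:evicted 50
STAT items:2:number 10
STAT items:2:evicted 0
END"""


def parse_items(text):
    """Group `STAT items:<class>:<field> <value>` lines by slab class id."""
    classes = defaultdict(dict)
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT":
            continue  # skips END and anything unexpected
        key = parts[1].split(":")
        if len(key) != 3 or key[0] != "items":
            continue
        classes[int(key[1])][key[2]] = int(parts[2])
    return dict(classes)


def classes_with_evictions(classes):
    """Slab class ids that have evicted at least one item."""
    return sorted(cid for cid, fields in classes.items() if fields.get("evicted", 0) > 0)


print(classes_with_evictions(parse_items(SAMPLE)))
```

Knowing which slab class is evicting narrows the problem down to items of a particular size range.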
Slab statistics
echo "stats slabs" | nc localhost 11211
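Per-class lines in the `stats slabs` output are prefixed with the slab class id (for example `STAT 1:chunk_size 96`), while global fields such as `active_slabs` carry no prefix. As a rough sketch, chunk utilization per class can be derived from `used_chunks` and `total_chunks`. The sample values below are invented, and `slab_utilization` is a hypothetical helper name.

```python
SAMPLE = """STAT 1:chunk_size 96
STAT 1:total_chunks 10922
STAT 1:used_chunks 8192
STAT 2:chunk_size 120
STAT 2:total_chunks 8738
STAT 2:used_chunks 0
STAT active_slabs 2
STAT total_malloced 2097152
END"""


def slab_utilization(text):
    """Return {class_id: used_chunks / total_chunks} for each slab class."""
    per_class = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3 or parts[0] != "STAT" or ":" not in parts[1]:
            continue  # skip END and global fields such as active_slabs
        class_id, field = parts[1].split(":", 1)
        per_class.setdefault(int(class_id), {})[field] = int(parts[2])
    return {
        cid: fields.get("used_chunks", 0) / fields["total_chunks"]
        for cid, fields in per_class.items()
        if fields.get("total_chunks")
    }


for cid, ratio in sorted(slab_utilization(SAMPLE).items()):
    print(f"class {cid}: {ratio:.0%} of chunks in use")
```

A class that sits near 100% while others are nearly empty is a hint that memory is unevenly distributed across slab classes.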
Settings statistics
echo "stats settings" | nc localhost 11211
# STAT maxbytes 134217728
# STAT maxconns 1024
# STAT tcpport 11211
# STAT udpport 0
# STAT inter 127.0.0.1
# STAT verbosity 0
# STAT oldest 0
# STAT evictions on
# STAT domain_socket NULL
# STAT umask 700
# STAT growth_factor 1.25
# STAT chunk_size 48
# STAT num_threads 4
# STAT num_threads_per_udp 4
# STAT stat_key_prefix :
# STAT detail_enabled no
# STAT reqs_per_event 20
# STAT cas_enabled yes
# STAT tcp_backlog 1024
# STAT binding_protocol auto-negotiate
# STAT auth_enabled_sasl no
# STAT item_size_max 1048576
# STAT maxconns_fast yes
# STAT hashpower_init 0
# STAT slab_reassign yes
# STAT slab_automove 1
# STAT lru_maintainer_thread yes
# STAT lru_crawler no
# STAT lru_crawler_sleep 100
# STAT lru_crawler_tocrawl 0
# STAT tail_repair_time 0
# STAT flush_enabled yes
# STAT dump_flawed no
# STAT hash_algorithm murmur3
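One practical use of this output is to catch configuration drift, for example an instance started with a different memory limit than the rest of the fleet. The sketch below assumes an expected baseline that you maintain yourself. `parse_settings`, `find_drift`, and the baseline values are illustrative only.

```python
SAMPLE = """STAT maxbytes 134217728
STAT maxconns 1024
STAT slab_automove 1
STAT lru_crawler no
END"""


def parse_settings(text):
    """Parse `STAT <name> <value>` lines into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) == 3 and parts[0] == "STAT":
            settings[parts[1]] = parts[2].strip()
    return settings


def find_drift(settings, expected):
    """Return {name: (expected, actual)} for every setting that differs."""
    return {
        name: (want, settings.get(name))
        for name, want in expected.items()
        if settings.get(name) != want
    }


# Hypothetical baseline: 256 MB limit and 1024 connections.
baseline = {"maxbytes": "268435456", "maxconns": "1024"}
print(find_drift(parse_settings(SAMPLE), baseline))
```

Values are compared as strings because the settings output mixes numbers with flags such as `yes` and `no`.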
12.3 Monitoring with Prometheus + Grafana
Architecture
┌──────────────┐     ┌─────────────────────┐     ┌──────────┐
│  Memcached   │────▶│      Exporter       │────▶│Prometheus│
│   :11211     │stats│ (memcached_exporter)│     │  :9090   │
└──────────────┘     │        :9150        │     └────┬─────┘
                     └─────────────────────┘          │
                                                      ▼
                                                 ┌──────────┐
                                                 │ Grafana  │
                                                 │  :3000   │
                                                 └──────────┘
Deploying the Memcached exporter
# With Docker
docker run -d --name memcached-exporter \
    -p 9150:9150 \
    prom/memcached-exporter \
    --memcached.address=memcached:11211
# Or with the release binary
wget https://github.com/prometheus/memcached_exporter/releases/download/v0.14.4/memcached_exporter-0.14.4.linux-amd64.tar.gz
tar xzf memcached_exporter-0.14.4.linux-amd64.tar.gz
# The archive unpacks into a directory named after the release
./memcached_exporter-0.14.4.linux-amd64/memcached_exporter --memcached.address=localhost:11211
Prometheus configuration
# prometheus.yml
scrape_configs:
  - job_name: 'memcached'
    static_configs:
      - targets:
          - 'mc-exporter1:9150'
          - 'mc-exporter2:9150'
          - 'mc-exporter3:9150'
    scrape_interval: 15s
    scrape_timeout: 10s
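Core exporter metrics
| Prometheus metric | Meaning | Type |
|---|---|---|
| memcached_up | Whether the instance is reachable | gauge |
| memcached_items_total | Items stored since startup | counter |
| memcached_current_bytes | Bytes currently in use | gauge |
| memcached_limit_bytes | Memory limit | gauge |
| memcached_commands_total | Commands processed, labeled by command and status | counter |
| memcached_current_connections | Currently open connections | gauge |
| memcached_max_connections | Maximum number of connections allowed | gauge |
| memcached_current_items | Items currently stored | gauge |
| memcached_items_evicted_total | Items evicted | counter |
| memcached_slab_chunk_size_bytes | Slab chunk size | gauge |
| memcached_slab_chunks_free | Free chunks per slab class | gauge |
| memcached_slab_chunks_used | Used chunks per slab class | gauge |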
Grafana dashboards
Community-provided dashboard templates are a good starting point:
# Import Grafana dashboard ID 11987 (Memcached Overview)
# or ID 2279 (Memcached Full)
Common PromQL queries
# Hit rate (%)
sum(rate(memcached_commands_total{command="get",status="hit"}[5m]))
/
sum(rate(memcached_commands_total{command="get"}[5m]))
* 100
# QPS
sum(rate(memcached_commands_total[5m]))
# Memory utilization (%)
memcached_current_bytes / memcached_limit_bytes * 100
# Connection utilization (%)
memcached_current_connections / memcached_max_connections * 100
# Eviction rate (items/s)
rate(memcached_items_evicted_total[5m])
# QPS per command
sum by (command) (rate(memcached_commands_total[5m]))
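These queries can also be run outside Grafana through the Prometheus HTTP API, whose instant-query endpoint is `/api/v1/query`. The sketch below uses only the standard library and assumes Prometheus is reachable at `http://localhost:9090` (the port shown in the architecture diagram). The helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def build_query_url(base_url, promql):
    """Build the instant-query URL for a PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?" + urllib.parse.urlencode({"query": promql})


def parse_instant_vector(payload):
    """Convert an instant-query response into a list of (labels, value) pairs."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    # Each result carries its labels and a [timestamp, "value"] pair.
    return [(item["metric"], float(item["value"][1])) for item in payload["data"]["result"]]


def instant_query(base_url, promql, timeout=10):
    """Run a PromQL instant query against a live Prometheus server."""
    with urllib.request.urlopen(build_query_url(base_url, promql), timeout=timeout) as resp:
        return parse_instant_vector(json.load(resp))
```

For example, `instant_query("http://localhost:9090", "memcached_current_bytes / memcached_limit_bytes * 100")` would return one `(labels, percent)` pair per exporter instance.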
12.4 Alerting Rules
Prometheus alerting rules (notifications are routed by Alertmanager)
# memcached_alerts.yml
groups:
  - name: memcached
    rules:
      # Instance down
      - alert: MemcachedDown
        expr: memcached_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Memcached instance down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
      # Low hit rate
      - alert: MemcachedHitRateLow
        expr: |
          sum(rate(memcached_commands_total{command="get",status="hit"}[5m]))
          / sum(rate(memcached_commands_total{command="get"}[5m]))
          * 100 < 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached hit rate below 80%"
          description: "Current hit rate: {{ $value }}%"
      # High memory utilization
      - alert: MemcachedMemoryHigh
        expr: memcached_current_bytes / memcached_limit_bytes * 100 > 90
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached memory utilization above 90%"
      # Connection count approaching the limit
      - alert: MemcachedConnectionsHigh
        expr: memcached_current_connections / memcached_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Memcached connection utilization above 80%"
      # Sustained evictions
      - alert: MemcachedEvictions
        expr: rate(memcached_items_evicted_total[5m]) > 10
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Memcached is evicting items continuously"
          description: "Eviction rate: {{ $value }}/s"
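The `for:` clause in these rules means an alert fires only after its expression has stayed true for the whole duration. Until then it is pending, and a single false evaluation resets it. The following is a simplified model of that behavior for illustration, not Prometheus's actual implementation. `alert_states` and its parameters are hypothetical.

```python
def alert_states(samples, threshold, for_seconds, interval_seconds):
    """Return 'inactive', 'pending' or 'firing' for each rule evaluation.

    `samples` are successive values of the rule expression, one per
    evaluation interval; the alert condition is `value > threshold`.
    """
    active_for = None  # seconds since the condition first became true
    states = []
    for value in samples:
        if value > threshold:
            active_for = 0 if active_for is None else active_for + interval_seconds
            states.append("firing" if active_for >= for_seconds else "pending")
        else:
            active_for = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


# Eviction rate sampled every 5 minutes against `> 10` with `for: 10m`.
print(alert_states([5, 12, 15, 20, 3, 14], threshold=10, for_seconds=600, interval_seconds=300))
```

In this run the alert is pending for two evaluations, fires on the third consecutive breach, and drops back to inactive as soon as the rate falls to 3.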
12.5 Custom Monitoring Script
Complete monitoring script
#!/usr/bin/env python3
"""Memcached monitoring script."""
import socket
import sys
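

def get_stats(host='localhost', port=11211):
    """Send `stats` and return the STAT lines as a dict of strings."""
    data = b''
    with socket.create_connection((host, port), timeout=5) as s:
        s.sendall(b'stats\r\n')
        # Check the accumulated buffer: END\r\n may be split across recv() calls
        while not data.endswith(b'END\r\n'):
            chunk = s.recv(4096)
            if not chunk:  # server closed the connection early
                break
            data += chunk
    stats = {}
    for line in data.decode().split('\r\n'):
        if line.startswith('STAT '):
            parts = line.split()
            stats[parts[1]] = parts[2]
    return stats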


def check_health(stats):
    """Return a list of alert messages for one stats snapshot."""
    alerts = []
    # Hit rate
    hits = int(stats.get('get_hits', 0))
    misses = int(stats.get('get_misses', 0))
    total = hits + misses
    if total > 0:
        hit_rate = hits / total * 100
        if hit_rate < 80:
            alerts.append(f"Hit rate too low: {hit_rate:.1f}%")
    # Memory utilization (skipped when the limit is missing or zero)
    used = int(stats.get('bytes', 0))
    limit = int(stats.get('limit_maxbytes', 0))
    if limit > 0:
        mem_pct = used / limit * 100
        if mem_pct > 90:
            alerts.append(f"Memory utilization too high: {mem_pct:.1f}%")
    # Connections (skipped when the server does not report max_connections)
    curr_conn = int(stats.get('curr_connections', 0))
    max_conn = int(stats.get('max_connections', 0))
    if max_conn > 0:
        conn_pct = curr_conn / max_conn * 100
        if conn_pct > 80:
            alerts.append(f"Connection utilization too high: {conn_pct:.1f}%")
    # Evictions (cumulative since startup, so this flags any eviction so far)
    evictions = int(stats.get('evictions', 0))
    if evictions > 0:
        alerts.append(f"Evictions occurred: {evictions}")
    # Rejected connections (also cumulative since startup)
    rejected = int(stats.get('rejected_connections', 0))
    if rejected > 0:
        alerts.append(f"Connections rejected: {rejected}")
    return alerts


def print_report(stats):
    """Print a human-readable summary of one stats snapshot."""
    hits = int(stats.get('get_hits', 0))
    misses = int(stats.get('get_misses', 0))
    total = hits + misses
    hit_rate = (hits / total * 100) if total > 0 else 0
    # stats has no cmd_delete/cmd_incr/cmd_decr counters; derive them from hits + misses
    deletes = int(stats.get('delete_hits', 0)) + int(stats.get('delete_misses', 0))
    incrs = int(stats.get('incr_hits', 0)) + int(stats.get('incr_misses', 0))
    decrs = int(stats.get('decr_hits', 0)) + int(stats.get('decr_misses', 0))
    print(f"""
Memcached Monitoring Report
═══════════════════════════════════
Version: {stats.get('version', 'N/A')}
Uptime: {int(stats.get('uptime', 0)) // 3600} hours
Threads: {stats.get('threads', 'N/A')}
━━ Hit rate ━━━━━━━━━━━━━━━━━━━━
Hit rate: {hit_rate:.2f}%
Hits: {hits}
Misses: {misses}
━━ Memory ━━━━━━━━━━━━━━━━━━━━━
Used: {int(stats.get('bytes', 0)) / 1048576:.1f} MB
Limit: {int(stats.get('limit_maxbytes', 0)) / 1048576:.1f} MB
Utilization: {int(stats.get('bytes', 0)) / max(int(stats.get('limit_maxbytes', 1)), 1) * 100:.1f}%
Items: {stats.get('curr_items', 'N/A')}
━━ Traffic ━━━━━━━━━━━━━━━━━━━━
GET: {stats.get('cmd_get', 'N/A')}
SET: {stats.get('cmd_set', 'N/A')}
DELETE: {deletes}
INCR: {incrs}
DECR: {decrs}
━━ Connections ━━━━━━━━━━━━━━━━
Current: {stats.get('curr_connections', 'N/A')}
Maximum: {stats.get('max_connections', 'N/A')}
Rejected: {stats.get('rejected_connections', 'N/A')}
━━ Evictions ━━━━━━━━━━━━━━━━━━
Evictions: {stats.get('evictions', 'N/A')}
""")


if __name__ == '__main__':
    host = sys.argv[1] if len(sys.argv) > 1 else 'localhost'
    port = int(sys.argv[2]) if len(sys.argv) > 2 else 11211
    stats = get_stats(host, port)
    print_report(stats)
    alerts = check_health(stats)
    if alerts:
        print("⚠️ Alerts:")
        for a in alerts:
            print(f"  - {a}")
    else:
        print("✅ Status OK")
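The script above prints a report meant for people. When it runs from cron or feeds another tool, a machine-readable result is usually easier to consume. The following sketch shows one way to add that. The payload layout, the `build_payload` and `exit_code` names, and the choice of metrics are assumptions for illustration.

```python
import json
import time


def build_payload(host, port, stats, alerts, now=None):
    """Bundle one check result into a JSON-serializable dict."""
    return {
        "timestamp": int(time.time() if now is None else now),
        "target": f"{host}:{port}",
        "status": "warning" if alerts else "ok",
        "alerts": list(alerts),
        # Keep only a few numeric stats; missing ones are simply omitted.
        "metrics": {
            key: int(stats[key])
            for key in ("curr_connections", "bytes", "limit_maxbytes", "evictions")
            if key in stats
        },
    }


def exit_code(alerts):
    """0 when healthy, 1 when any alert is present (a common check-script convention)."""
    return 1 if alerts else 0


# Example with a hand-written snapshot instead of a live server.
snapshot = {"bytes": "100", "evictions": "0", "version": "1.6.21"}
print(json.dumps(build_payload("localhost", 11211, snapshot, []), ensure_ascii=False))
```

In the main block this would amount to printing the JSON and calling `sys.exit(exit_code(alerts))`, so that a scheduler can react to the exit status without parsing text.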
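12.6 Log Analysis
Enabling verbose logging
# Set the log level at startup
memcached -vv   # verbose: logs every client command and response
memcached -vvv  # very verbose: also logs internal state transitions (for debugging)
# Adjust at runtime
echo "verbosity 2" | nc localhost 11211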
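Log levels
| Level | Flag | Content |
|---|---|---|
| 0 | (none) | No verbose output (default) |
| 1 | -v | Errors and warnings |
| 2 | -vv | Adds client commands and responses |
| 3 | -vvv | Adds internal state transitions |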
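Summary
| Key point | Details |
|---|---|
| Core metrics | Hit rate, memory utilization, connection count, evictions |
| Hit rate | get_hits / (get_hits + get_misses); keep it above 80% |
| Recommended stack | Prometheus + memcached_exporter + Grafana |
| Alert thresholds | Memory > 90%, connections > 80%, hit rate < 80%, evictions > 0 |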