

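Chapter 13 · Monitoring and Observability

13.1 The Three Pillars of Observability

| Pillar | Tools | Description |
| --- | --- | --- |
| Metrics | Prometheus + Grafana | Numeric time-series data |
| Logs | Loki + Grafana (see Chapter 12) | Discrete event records |
| Traces | Jaeger / Tempo | Request path tracing across services |

This chapter focuses on metrics: Prometheus collects them and Grafana displays them on dashboards.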


13.2 Monitoring Architecture

┌────────────────────────────────────────────────────────────┐
│                        Docker host                         │
│                                                            │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐                     │
│  │  app-1  │  │  app-2  │  │  app-3  │                     │
│  │ /metrics│  │ /metrics│  │ /metrics│  ← exposed by apps  │
│  └────┬────┘  └────┬────┘  └────┬────┘                     │
│       │            │            │                          │
│       └────────────┼────────────┘                          │
│                    │                                       │
│            ┌───────▼────────┐                              │
│            │   Prometheus   │  ← periodic scrape           │
│            │   (tsdb)       │                              │
│            └───────┬────────┘                              │
│                    │                                       │
│       ┌────────────┼────────────┐                          │
│       │            │            │                          │
│  ┌────▼────┐ ┌─────▼────┐ ┌─────▼────┐                     │
│  │ cAdvisor│ │  Node    │ │  Redis   │                     │
│  │container│ │ Exporter │ │ Exporter │                     │
│  │ metrics │ │          │ │          │                     │
│  └─────────┘ └──────────┘ └──────────┘                     │
│                    │                                       │
│            ┌───────▼────────┐                              │
│            │    Grafana     │  ← dashboards                │
│            │   (Dashboard)  │                              │
│            └────────────────┘                              │
└────────────────────────────────────────────────────────────┘

The arrows show layout rather than strict data flow: Prometheus scrapes the application /metrics endpoints as well as the exporters, and Grafana queries Prometheus.

13.3 cAdvisor

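cAdvisor (Container Advisor) is Google's open-source container resource monitor. It automatically collects per-container CPU, memory, network, and disk metrics.

Basic configuration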

services:
  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
      - /dev/disk/:/dev/disk:ro
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped

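⚠️ Security note: this configuration runs cAdvisor with privileged: true so that it can read host information, which gives the container broad access to the host. Assess that risk before using it in production.

Metrics exposed by cAdvisor

| Category | Example metrics |
| --- | --- |
| CPU | container_cpu_usage_seconds_total |
| Memory | container_memory_usage_bytes, container_memory_working_set_bytes |
| Network | container_network_receive_bytes_total, container_network_transmit_bytes_total |
| Disk | container_fs_usage_bytes, container_fs_reads_total |
| Tasks | container_tasks_state |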
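
Most of these metrics are counters (note the _total suffix) that only ever increase, so the useful quantity is their rate of change. The following is a minimal sketch of the arithmetic behind a query such as rate(container_cpu_usage_seconds_total[5m]). The function name counter_rate and the sample values are invented for illustration, and the real PromQL rate() also extrapolates to the window edges, which this sketch leaves out.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (unix timestamp in seconds, counter value)


def counter_rate(samples: List[Sample]) -> float:
    """Average per-second increase of a counter over the given samples.

    A drop in value is treated as a counter reset (for example after a
    container restart); the post-reset value then counts as the increase.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        increase += curr - prev if curr >= prev else curr
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        raise ValueError("samples must span a positive time range")
    return increase / elapsed


# Hypothetical scrapes of container_cpu_usage_seconds_total, 15 s apart.
cpu_samples = [(0.0, 100.0), (15.0, 106.0), (30.0, 112.0), (45.0, 118.0)]
cpu_cores_used = counter_rate(cpu_samples)  # 18 CPU-seconds over 45 s
```

Here the container consumed 18 CPU-seconds in 45 seconds, so the rate is 0.4, meaning 0.4 CPU cores on average. This is why thresholds on this metric are expressed in cores unless the rate is divided by a CPU limit.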

13.4 Prometheus

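Prometheus is a CNCF graduated project that collects metrics using a pull model: it scrapes HTTP endpoints on a schedule rather than waiting for targets to push data.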
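
To make the pull model concrete, here is a rough stdlib-only sketch of a single scrape: fetch a target's plain-text metrics page and parse it into series. The helper names parse_exposition and scrape are hypothetical, and the parser is deliberately simplified (no escaped characters in label values, no timestamps), so it is not a substitute for a real client.

```python
import re
import urllib.request
from typing import Dict, List, Tuple

# metric_name{label="value",...} 12.5   (the label block is optional)
_LINE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
_LABEL = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')

Series = Tuple[str, Dict[str, str], float]


def parse_exposition(text: str) -> List[Series]:
    """Parse simple Prometheus text-format lines into (name, labels, value)."""
    series = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE comments
            continue
        match = _LINE.match(line)
        if match is None:
            continue  # ignore anything this simplified parser cannot read
        name, raw_labels, value = match.groups()
        labels = dict(_LABEL.findall(raw_labels or ""))
        series.append((name, labels, float(value)))
    return series


def scrape(url: str, timeout: float = 5.0) -> List[Series]:
    """Fetch a /metrics endpoint once, as Prometheus does every scrape interval."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_exposition(resp.read().decode("utf-8"))


# Illustrative payload; a real target would return many more lines.
sample = '''# HELP http_requests_total Total HTTP requests
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 42
process_open_fds 17
'''
parsed = parse_exposition(sample)
```

Because the target only has to serve this text over HTTP, any service can be monitored without knowing where Prometheus runs; the scrape schedule lives entirely in the Prometheus configuration.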

Basic configuration

services:
  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus
      - ./prometheus.yml:/etc/prometheus/prometheus.yml:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    restart: unless-stopped

The prometheus.yml configuration file

# prometheus.yml
global:
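  scrape_interval: 15s       # global scrape interval
  evaluation_interval: 15s   # rule evaluation interval

# Where to send firing alerts (the alertmanager service from section 13.7)
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alerting rule files
rule_files:
  - "rules/*.yml"

# Scrape configuration
scrape_configs:
  # Prometheus itself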
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

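  # cAdvisor: container metrics
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']
    # Keep only the metric families that are needed. container_spec_* carries
    # the memory limit used by the alerts in 13.9, and
    # container_start_time_seconds changes when a container restarts.
    metric_relabel_configs:
      - source_labels: [__name__]
        regex: 'container_(cpu|memory|network|fs|spec|start_time)_.*'
        action: keep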

  # Node Exporter: host metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # Custom application metrics
  - job_name: 'app'
    static_configs:
      - targets: ['app:3000']
    metrics_path: '/metrics'
    scrape_interval: 10s     # per-job override of the global interval

  # Docker service discovery
  # (requires the Docker socket to be mounted into the Prometheus container)
  - job_name: 'docker'
    docker_sd_configs:
      - host: unix:///var/run/docker.sock
        refresh_interval: 10s
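    relabel_configs:
      # Only scrape containers labelled prometheus.scrape=true
      - source_labels: [__meta_docker_container_label_prometheus_scrape]
        regex: 'true'
        action: keep
      # Replace the discovered port with the one from the prometheus.port label
      - source_labels: [__address__, __meta_docker_container_label_prometheus_port]
        regex: '([^:]+)(?::\d+)?;(\d+)'
        target_label: __address__
        replacement: '${1}:${2}'
      # Use the prometheus.path label as the metrics path when present
      - source_labels: [__meta_docker_container_label_prometheus_path]
        regex: (.+)
        target_label: __metrics_path__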
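
Relabeling is easier to reason about when simulated. The sketch below emulates two relabel actions, keep and replace, in plain Python: source label values are joined with ';' and the regex must match the whole joined string. The keep rule mirrors the one in the Docker service-discovery job; the address-rewriting rule is an illustrative variant that combines the discovered host with the port label. The helper names and the IP addresses are hypothetical, and only this subset of Prometheus behaviour is modelled.

```python
import re
from typing import Dict, List, Optional

Labels = Dict[str, str]


def apply_rule(labels: Labels, rule: dict) -> Optional[Labels]:
    """Apply one relabel rule; return None when the target is dropped."""
    # Source label values are joined with ';' and the regex is fully anchored.
    joined = ";".join(labels.get(name, "") for name in rule["source_labels"])
    match = re.fullmatch(rule.get("regex", "(.*)"), joined)
    action = rule.get("action", "replace")
    if action == "keep":
        return labels if match else None
    if action == "replace":
        if match is None:
            return labels  # no match: the rule leaves the target untouched
        # Translate Prometheus-style ${1} references into Python's \g<1>.
        template = re.sub(r"\$\{(\d+)\}", r"\\g<\1>", rule.get("replacement", "${1}"))
        return {**labels, rule["target_label"]: match.expand(template)}
    raise ValueError(f"unsupported action: {action}")


def relabel(labels: Labels, rules: List[dict]) -> Optional[Labels]:
    """Run the rules in order, stopping as soon as the target is dropped."""
    for rule in rules:
        labels = apply_rule(labels, rule)
        if labels is None:
            return None
    return labels


rules = [
    {"source_labels": ["__meta_docker_container_label_prometheus_scrape"],
     "regex": "true", "action": "keep"},
    {"source_labels": ["__address__",
                       "__meta_docker_container_label_prometheus_port"],
     "regex": r"([^:]+)(?::\d+)?;(\d+)",
     "target_label": "__address__", "replacement": "${1}:${2}"},
]

# Hypothetical discovered targets.
labelled = {
    "__address__": "172.18.0.5:80",
    "__meta_docker_container_label_prometheus_scrape": "true",
    "__meta_docker_container_label_prometheus_port": "3000",
}
unlabelled = {"__address__": "172.18.0.6:80"}
```

With these inputs, relabel(labelled, rules) keeps the target and rewrites its address to 172.18.0.5:3000, while relabel(unlabelled, rules) returns None because the keep rule finds no prometheus.scrape label.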

13.5 Node Exporter

Node Exporter collects host-level metrics (CPU, memory, disk, network).

services:
  node-exporter:
    image: prom/node-exporter:v1.8.0
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    restart: unless-stopped

13.6 Grafana

Grafana is a widely used open-source visualization tool that supports many data sources.

Basic configuration

services:
  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_USER: admin
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
      GF_USERS_ALLOW_SIGN_UP: "false"
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
    restart: unless-stopped

Provisioning data sources automatically

# grafana/provisioning/datasources/datasources.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: false

  - name: Loki
    type: loki
    access: proxy
    url: http://loki:3100
    editable: false

Provisioning dashboards automatically

# grafana/provisioning/dashboards/dashboards.yml
apiVersion: 1

providers:
  - name: 'default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    editable: true
    options:
      path: /etc/grafana/provisioning/dashboards
      foldersFromFilesStructure: true

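Commonly used community dashboard IDs

| Dashboard | ID | Description |
| --- | --- | --- |
| Docker container monitoring | 893 | cAdvisor overview |
| Node Exporter Full | 1860 | Host overview |
| Prometheus 2.0 Overview | 3662 | Prometheus itself |
| Redis dashboard | 763 | Redis monitoring |
| PostgreSQL dashboard | 9628 | Database monitoring |
| Nginx dashboard | 12708 | Web server monitoring |

# To import a dashboard in Grafana:
# 1. Open http://localhost:3000
# 2. Left-hand menu → Dashboards → Import
# 3. Enter the dashboard ID → Load
# 4. Select the Prometheus data source → Import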

13.7 The Complete Monitoring Stack

# compose.monitoring.yaml
services:
  # ===== Monitoring components =====
  prometheus:
    image: prom/prometheus:v2.52.0
    ports:
      - "9090:9090"
    volumes:
      - prometheus-data:/prometheus
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./monitoring/rules:/etc/prometheus/rules:ro
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
    restart: unless-stopped
    networks:
      - monitoring

  grafana:
    image: grafana/grafana:11.0.0
    ports:
      - "3000:3000"
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_PASSWORD:-admin}
    volumes:
      - grafana-data:/var/lib/grafana
      - ./monitoring/grafana/provisioning:/etc/grafana/provisioning:ro
    depends_on:
      - prometheus
    restart: unless-stopped
    networks:
      - monitoring

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.1
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro
    privileged: true
    devices:
      - /dev/kmsg
    restart: unless-stopped
    networks:
      - monitoring

  node-exporter:
    image: prom/node-exporter:v1.8.0
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
    restart: unless-stopped
    networks:
      - monitoring

  alertmanager:
    image: prom/alertmanager:v0.27.0
    ports:
      - "9093:9093"
    volumes:
      - ./monitoring/alertmanager.yml:/etc/alertmanager/alertmanager.yml:ro
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
    restart: unless-stopped
    networks:
      - monitoring

  # ===== Application service (with monitoring labels) =====
  app:
    image: myapp:latest
    labels:
      prometheus.scrape: "true"
      prometheus.port: "3000"
      prometheus.path: "/metrics"
    networks:
      - monitoring
      - app-net

networks:
  monitoring:
  app-net:

volumes:
  prometheus-data:
  grafana-data:

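13.8 Exposing Application Metrics

Prometheus client libraries by language

| Language | Client library |
| --- | --- |
| Go | prometheus/client_golang |
| Python | prometheus_client |
| Node.js | prom-client |
| Java | micrometer-registry-prometheus |

Python example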

from flask import Flask, Response, g, request
from prometheus_client import CONTENT_TYPE_LATEST, Counter, Histogram, generate_latest
import time

app = Flask(__name__)

# Metric definitions
REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_DURATION = Histogram(
    'http_request_duration_seconds',
    'HTTP request duration',
    ['method', 'endpoint']
)

@app.before_request
def before_request():
    # Record the start time on Flask's per-request context object
    g.start_time = time.perf_counter()

@app.after_request
def after_request(response):
    duration = time.perf_counter() - g.start_time
    # Label by the matched route rule rather than the raw path, so that
    # paths containing IDs do not create one time series per ID
    endpoint = request.url_rule.rule if request.url_rule else 'unmatched'
    REQUEST_COUNT.labels(
        method=request.method,
        endpoint=endpoint,
        status=response.status_code
    ).inc()
    REQUEST_DURATION.labels(
        method=request.method,
        endpoint=endpoint
    ).observe(duration)
    return response

@app.route('/metrics')
def metrics():
    # Serve all registered metrics in the Prometheus text format
    return Response(generate_latest(), content_type=CONTENT_TYPE_LATEST)

@app.route('/health')
def health():
    return {'status': 'healthy'}
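
The Histogram used for request durations can look opaque. In the Prometheus data model a histogram is a set of cumulative counters, one per upper bound (the le label), plus a running sum and count. The following stdlib-only sketch illustrates that idea; the class name SimpleHistogram and the bucket bounds are illustrative, and this is not how prometheus_client is implemented internally.

```python
import math
from typing import List, Sequence


class SimpleHistogram:
    """Cumulative-bucket histogram in the style of the Prometheus data model."""

    def __init__(self, name: str, bounds: Sequence[float]) -> None:
        self.name = name
        # The +Inf bucket catches every observation.
        self.bounds = sorted(bounds) + [math.inf]
        self.bucket_counts = [0] * len(self.bounds)
        self.total = 0.0
        self.count = 0

    def observe(self, value: float) -> None:
        """Record one observation in every bucket whose bound is >= value."""
        for i, bound in enumerate(self.bounds):
            if value <= bound:
                self.bucket_counts[i] += 1
        self.total += value
        self.count += 1

    def render(self) -> List[str]:
        """Render the series as text-exposition lines."""
        lines = []
        for bound, n in zip(self.bounds, self.bucket_counts):
            le = "+Inf" if math.isinf(bound) else repr(bound)
            lines.append(f'{self.name}_bucket{{le="{le}"}} {n}')
        lines.append(f"{self.name}_sum {self.total}")
        lines.append(f"{self.name}_count {self.count}")
        return lines


# Four hypothetical request durations, in seconds.
hist = SimpleHistogram("http_request_duration_seconds", [0.1, 0.5, 1.0])
for seconds in (0.05, 0.3, 0.3, 2.0):
    hist.observe(seconds)
```

Because the buckets are cumulative, the 0.5 bucket counts the 0.05 s request as well as the two 0.3 s requests. Quantile estimates in PromQL are computed from these bucket counters, so bucket bounds should bracket the latencies that matter for the service.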

13.9 Alerting Configuration

Prometheus alerting rules

# monitoring/rules/alerts.yml
groups:
  - name: container_alerts
    rules:
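      # Container CPU usage too high (the threshold is in CPU cores, not percent)
      - alert: ContainerHighCPU
        expr: sum by (name) (rate(container_cpu_usage_seconds_total{name=~".+"}[5m])) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} has used more than 0.8 CPU cores for 5 minutes"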

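      # Container memory usage too high (containers without a limit report 0 and are skipped)
      - alert: ContainerHighMemory
        expr: container_memory_working_set_bytes{name=~".+"} / (container_spec_memory_limit_bytes{name=~".+"} > 0) > 0.85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Container {{ $labels.name }} is using more than 85% of its memory limit"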

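      # Container restarting repeatedly. cAdvisor exports no restart counter, so
      # this counts changes of the start time instead; make sure
      # container_start_time_seconds is kept if the cadvisor job filters metrics.
      - alert: ContainerRestarting
        expr: changes(container_start_time_seconds{name=~".+"}[15m]) > 3
        labels:
          severity: critical
        annotations:
          summary: "Container {{ $labels.name }} restarted more than 3 times in 15 minutes"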

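      # No container metrics at all. absent() with a regex matcher fires only when
      # no series match and carries no name label; to watch a single container,
      # use an equality matcher such as name="app".
      - alert: ContainerDown
        expr: absent(container_memory_working_set_bytes{name=~".+"})
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "No container metrics are being reported by cAdvisor"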
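
The for: clause is what separates a brief spike from a sustained problem. As a rough sketch of its semantics (not Prometheus's actual implementation): an alert becomes pending when its expression first holds, and fires only once the expression has held on every evaluation for at least the for: duration. The function name alert_states and the evaluation timeline are invented for illustration.

```python
from typing import Iterable, List, Optional, Tuple


def alert_states(
    evaluations: Iterable[Tuple[float, bool]], for_seconds: float
) -> List[str]:
    """Return the alert state after each (timestamp, condition_true) evaluation.

    The alert is 'pending' once the condition holds and becomes 'firing' only
    after it has held without interruption for at least for_seconds.
    """
    states = []
    active_since: Optional[float] = None
    for timestamp, condition_true in evaluations:
        if not condition_true:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = timestamp
        held = timestamp - active_since
        states.append("firing" if held >= for_seconds else "pending")
    return states


# Hypothetical evaluations every 60 s for a rule with `for: 5m`.
evals = [(0, True), (60, True), (120, False), (180, True),
         (240, True), (300, True), (360, True), (420, True), (480, True)]
states = alert_states(evals, for_seconds=300)
```

In this timeline the dip at 120 s resets the timer, so the alert fires only at 480 s, five minutes after the condition became true again at 180 s. A rule without for: corresponds to for_seconds=0 and fires on the first true evaluation.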

Alertmanager configuration

# monitoring/alertmanager.yml
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'pager'

receivers:
  - name: 'default'
    webhook_configs:
      - url: 'http://notification-service:8080/webhook'

  - name: 'pager'
    webhook_configs:
      - url: 'http://notification-service:8080/webhook?priority=high'
    # Or email, Slack, DingTalk, etc.
    # slack_configs:
    #   - api_url: 'https://hooks.slack.com/...'
    #     channel: '#alerts'
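
The route tree decides which receiver handles an alert, and group_by decides which alerts share a notification. The sketch below models only that much: the first child route whose match labels all equal the alert's labels wins, otherwise the parent's receiver is used. It is a simplification of Alertmanager's behaviour (no continue, regex matchers, or timing), and pick_receiver and group_key are hypothetical helper names.

```python
from typing import Dict, Tuple

Labels = Dict[str, str]

# Mirrors the route tree above: one child route plus the default receiver.
ROUTE = {
    "receiver": "default",
    "group_by": ["alertname", "severity"],
    "routes": [{"match": {"severity": "critical"}, "receiver": "pager"}],
}


def pick_receiver(labels: Labels, route: dict = ROUTE) -> str:
    """Return the receiver of the first child route whose matchers all match."""
    for child in route.get("routes", []):
        if all(labels.get(k) == v for k, v in child["match"].items()):
            # Descend so that nested routes get a chance to match as well;
            # a child without its own receiver inherits the parent's.
            merged = {"receiver": route["receiver"], **child}
            return pick_receiver(labels, merged)
    return route["receiver"]


def group_key(labels: Labels, route: dict = ROUTE) -> Tuple[str, ...]:
    """Alerts sharing this key are batched into one notification."""
    return tuple(labels.get(name, "") for name in route["group_by"])


# Hypothetical alerts using the label sets from the rules in this section.
critical = {"alertname": "ContainerDown", "severity": "critical"}
warning_a = {"alertname": "ContainerHighCPU", "severity": "warning", "name": "app"}
warning_b = {"alertname": "ContainerHighCPU", "severity": "warning", "name": "db"}
```

Here the critical alert is routed to pager and both warnings fall through to default. The two warnings also share a group key, so they would arrive in a single notification after group_wait rather than as two separate messages.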

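13.10 Monitoring Best Practices

| Practice | Details |
| --- | --- |
| Infrastructure first, then applications | cAdvisor + Node Exporter → application metrics |
| Set sensible scrape intervals | Infrastructure 15-30s, critical applications 5-10s |
| Control label cardinality | Avoid high-cardinality labels (such as user_id) |
| Data retention policy | Development 7 days, production 30-90 days |
| Layer dashboards | Overview → service → container → instance |
| Tier alerts | critical (act immediately), warning (within 1 hour), info (for awareness) |
| Capacity planning | Monitor Prometheus's own storage and performance |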
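
Two of these practices, label cardinality and capacity planning, come down to simple arithmetic. The sketch below uses the sizing rule of thumb from the Prometheus storage documentation (retention time × ingested samples per second × bytes per sample, with roughly 1 to 2 bytes per sample); the series counts and label cardinalities are made-up inputs, and the function names are hypothetical.

```python
from typing import Dict


def estimate_disk_bytes(
    active_series: int,
    scrape_interval_s: float,
    retention_days: float,
    bytes_per_sample: float = 2.0,  # assumed upper end of the 1-2 byte range
) -> float:
    """retention_seconds * samples_per_second * bytes_per_sample."""
    samples_per_second = active_series / scrape_interval_s
    retention_seconds = retention_days * 24 * 60 * 60
    return retention_seconds * samples_per_second * bytes_per_sample


def series_count(label_value_counts: Dict[str, int]) -> int:
    """Worst-case series for one metric: the product of label cardinalities."""
    total = 1
    for distinct_values in label_value_counts.values():
        total *= distinct_values
    return total


# Hypothetical setup: 10,000 active series, 15 s interval, 30 days retention.
disk_bytes = estimate_disk_bytes(10_000, scrape_interval_s=15, retention_days=30)

# Hypothetical label cardinalities for http_requests_total.
bounded = series_count({"method": 5, "endpoint": 20, "status": 6})
with_user_id = series_count(
    {"method": 5, "endpoint": 20, "status": 6, "user_id": 10_000}
)
```

Under these assumptions the estimate is about 3.5 GB of disk. The cardinality example shows why user_id is a poor label: the same metric grows from at most 600 series to at most 6,000,000.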

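13.11 Summary

| Concept | Description |
| --- | --- |
| cAdvisor | Container metrics collection (CPU, memory, network, disk) |
| Prometheus | Metrics storage and querying, pull model |
| Grafana | Visualization dashboards with support for multiple data sources |
| Node Exporter | Host metrics collection |
| Alertmanager | Alert management and notification routing |
| Docker service discovery | Automatically discovers labelled containers |
| Application metrics | Prometheus client libraries in each language expose /metrics |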
