Docker Compose: The Complete Guide / Chapter 7 · Dependencies and Health Checks: depends_on, healthcheck, restart

Chapter 7 · Dependencies and Health Checks

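7.1 The Startup-Order Problem

In a multi-container application, services depend on one another. A typical case: the web application can only start working once the database is ready.

❌ Default behavior: startup order ≠ readiness order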

services:
  web:
    image: myapp:latest
    depends_on:
      - db     # only guarantees that db is started first, not that db is ready

  db:
    image: postgres:16-alpine

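Timeline:
─────┬─────────────┬────────────────┬──────────
     │             │                │
 db starts     web starts       db ready
 (container    (container       (accepts
  created)      created)         connections)
     │             │                │
     │             ▼                │
     │   ❌ web tries to connect    │
     │      to db and fails         │
     └─────────────┴────────────────┘

⚠️ Key point: by default, depends_on only controls startup order (when containers are created and started). It does not wait for the service to be ready, for example for a database to accept connections. This is the most common pitfall.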

7.2 depends_on in Detail

Short syntax (startup order only)

services:
  web:
    image: myapp:latest
    depends_on:
      - db
      - redis

  db:
    image: postgres:16-alpine

  redis:
    image: redis:7-alpine

Long syntax (supports waiting on a condition)

services:
  web:
    image: myapp:latest
    depends_on:
      db:
        condition: service_healthy      # wait until db passes its health check
        restart: true                   # restart web when Compose restarts or recreates db (Compose 2.17.0+)
      redis:
        condition: service_healthy

  db:
    image: postgres:16-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 5

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 5s
      timeout: 3s
      retries: 5

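condition options

Value	Description
`service_started`	The dependency's container has been started (the default; same as the short syntax)
`service_healthy`	The dependency's health check has passed ✅
`service_completed_successfully`	The dependency ran to completion and exited with code 0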
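
To make the three conditions in the table concrete, here is a minimal sketch of how each one could be evaluated against a dependency's state. It is a simplified model for illustration, not Compose's actual implementation; the function name `dependency_satisfied` and its arguments (`state`, `health`, `exit_code`) are hypothetical.

```shell
#!/bin/sh
# Simplified model (an assumption, not Compose source code).
# dependency_satisfied <condition> <state> <health> <exit_code>
#   state:  created | running | exited
#   health: starting | healthy | unhealthy | none
dependency_satisfied() {
  condition=$1 state=$2 health=$3 exit_code=$4
  case $condition in
    service_started)
      # Having been started is enough; readiness is not considered
      [ "$state" != created ] ;;
    service_healthy)
      [ "$state" = running ] && [ "$health" = healthy ] ;;
    service_completed_successfully)
      [ "$state" = exited ] && [ "$exit_code" -eq 0 ] ;;
    *)
      echo "unknown condition: $condition" >&2
      return 2 ;;
  esac
}

# A database that is running but still initializing
if dependency_satisfied service_started running starting 0; then
  echo "service_started: satisfied"
fi
if ! dependency_satisfied service_healthy running starting 0; then
  echo "service_healthy: still waiting"
fi
```

The two calls at the end show why the short syntax is not enough for a database: the same container state satisfies `service_started` but not `service_healthy`.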

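Usage scenarios for each condition

services:
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret         # the postgres image will not start without a password
      POSTGRES_DB: myapp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 5s
      timeout: 5s
      retries: 10

  # Scenario 2 helper: a one-off init job that runs once db is healthy
  db-init:
    image: postgres:16
    depends_on:
      db:
        condition: service_healthy
    environment:
      PGPASSWORD: secret                # read by psql
    command: psql -h db -U postgres -d myapp -f /scripts/init.sql
    volumes:
      - ./init.sql:/scripts/init.sql:ro

  web:
    image: myapp:latest
    depends_on:
      # Scenario 1: wait for the database to be ready
      db:
        condition: service_healthy
      # Scenario 2: wait for the init job to finish
      db-init:
        condition: service_completed_successfully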

7.3 healthcheck

A health check lets Docker periodically verify that a container is still working correctly.

Syntax

services:
  web:
    image: nginx:alpine
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:80/health"]
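      interval: 30s       # time between checks
      timeout: 10s         # time limit for a single check
      retries: 3           # consecutive failures before the container is marked unhealthy
      start_period: 40s    # startup grace period (failures in this window do not count toward retries)
      start_interval: 5s   # check interval during the start period (Compose 2.20.2+, Docker Engine 25.0+)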

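Parameters

Parameter	Default	Description
`test`	Inherited from the image	The check command
`interval`	30s	Time between two checks
`timeout`	30s	Time limit for a single check
`retries`	3	Number of consecutive failures before the container is marked unhealthy
`start_period`	0s	Grace period after the container starts, during which failures do not count
`start_interval`	5s	Interval between checks during the start period only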
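
As a rough worked example of how these parameters combine, the sketch below estimates an upper bound on how long a container that has stopped responding can still be reported as healthy. The helper name `worst_case_unhealthy_seconds` is hypothetical, and the formula assumes that the start period is over, that every probe runs until its timeout, and that the next probe starts `interval` seconds after the previous one finishes.

```shell
#!/bin/sh
# Hypothetical helper: upper bound (in seconds) between a service failing
# and Docker marking its container unhealthy.
# worst_case_unhealthy_seconds <interval> <timeout> <retries>
worst_case_unhealthy_seconds() {
  interval=$1 timeout=$2 retries=$3
  # Each failed attempt costs one wait (interval) plus one hung probe (timeout)
  echo $(( retries * (interval + timeout) ))
}

# Defaults from the table above: interval=30s, timeout=30s, retries=3
worst_case_unhealthy_seconds 30 30 3   # prints 180
```

With the defaults, a hung service can take up to three minutes to be flagged, which is one reason the examples in this chapter use shorter `interval` and `timeout` values.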

Syntax variants

services:
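  # Form 1: CMD array (recommended)
  app:
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]

  # Form 2: CMD-SHELL (runs through the container's shell)
  db:
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres || exit 1"]

  # Form 3: inherit the image's health check
  nginx:
    image: nginx:alpine
    healthcheck: {}        # empty object = use the image's HEALTHCHECK, if it defines one

  # Form 4: disable the health check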
  batch:
    image: mybatch:latest
    healthcheck:
      disable: true

Health checks for common services

services:
  # PostgreSQL
  postgres:
    image: postgres:16-alpine
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U ${POSTGRES_USER:-postgres}"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  # MySQL
  mysql:
    image: mysql:8.0
    healthcheck:
      test: ["CMD", "mysqladmin", "ping", "-h", "localhost"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s

  # Redis
  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5

  # MongoDB
  mongo:
    image: mongo:7
    healthcheck:
      test: ["CMD", "mongosh", "--eval", "db.adminCommand('ping')"]
      interval: 10s
      timeout: 5s
      retries: 5

  # Elasticsearch (assumes security is disabled; 8.x enables TLS and authentication by default)
  elasticsearch:
    image: elasticsearch:8.13.0
    healthcheck:
      test: ["CMD-SHELL", "curl -f http://localhost:9200/_cluster/health || exit 1"]
      interval: 30s
      timeout: 10s
      retries: 5
      start_period: 60s

  # HTTP service
  web:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      retries: 3

  # TCP port check (when curl is not available)
  tcp-service:
    image: myapp:latest
    healthcheck:
      test: ["CMD-SHELL", "nc -z localhost 3000 || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 3

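Checking health status

# Show container health status
docker compose ps
# NAME              STATUS            PORTS
# myproject-db-1    Up (healthy)      5432/tcp
# myproject-web-1   Up (unhealthy)    0.0.0.0:8080->3000/tcp

# Show the health check log (recent probe results and their output)
docker inspect --format='{{json .State.Health}}' myproject-db-1 | jq

# Watch health status changes in real time
# (probe output is not written to the container logs, so `docker compose logs` will not show it)
docker events --filter event=health_status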
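
For scripts such as CI jobs, the same `docker inspect` query can be polled until a container reports healthy. The following is a minimal sketch; the function name `wait_healthy` and its default values (30 polls, 2 seconds apart) are assumptions chosen for illustration.

```shell
#!/bin/sh
# Hypothetical helper: poll a container's health status until it is
# "healthy", giving up after max_tries polls.
# Usage: wait_healthy <container> [max_tries] [sleep_seconds]
wait_healthy() {
  container=$1 max_tries=${2:-30} pause=${3:-2}
  i=0
  status=
  while [ "$i" -lt "$max_tries" ]; do
    # .State.Health.Status is one of: starting, healthy, unhealthy
    status=$(docker inspect --format '{{.State.Health.Status}}' "$container" 2>/dev/null)
    if [ "$status" = "healthy" ]; then
      echo "$container is healthy"
      return 0
    fi
    i=$((i + 1))
    sleep "$pause"
  done
  echo "$container did not become healthy (last status: ${status:-unknown})" >&2
  return 1
}
```

A call such as `wait_healthy myproject-db-1` blocks until the database container from the output above is healthy. Compose also offers `docker compose up --wait`, which waits for services to be running or healthy and is usually simpler when the whole project is started at once.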

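7.4 restart Policies

The restart policy controls what happens after a container exits.

Policy options

Policy	Description
`no`	Default; never restart automatically
`always`	Always restart; a manually stopped container is started again when the Docker daemon restarts
`unless-stopped`	Always restart, except that a manually stopped container stays stopped, even after the daemon restarts
`on-failure`	Restart only on a non-zero exit code
`on-failure:N`	Restart on a non-zero exit code, at most N times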
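
The table can be read as a small decision function. The sketch below models only the case where a container exits on its own (manual stops and daemon restarts are deliberately left out); `should_restart` is a hypothetical name, not a Docker command.

```shell
#!/bin/sh
# Simplified model of the policies above for a container that exited by itself.
# should_restart <policy> <exit_code> <restarts_so_far>
# Returns 0 if the container would be restarted, 1 if not.
should_restart() {
  policy=$1 exit_code=$2 restarts=$3
  case $policy in
    no)
      return 1 ;;
    always|unless-stopped)
      # Exit code is irrelevant: even a clean exit triggers a restart
      return 0 ;;
    on-failure)
      [ "$exit_code" -ne 0 ] ;;
    on-failure:*)
      max=${policy#on-failure:}
      [ "$exit_code" -ne 0 ] && [ "$restarts" -lt "$max" ] ;;
    *)
      echo "unknown policy: $policy" >&2
      return 2 ;;
  esac
}

# A worker that crashed with exit code 1 after 4 earlier restarts
if should_restart on-failure:5 1 4; then
  echo "restart"
fi
```

Note that `always` and `unless-stopped` behave identically in this model; they differ only in how a manually stopped container is treated after the daemon restarts.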

Examples

services:
  # Production web service: always restart
  web:
    image: myapp:latest
    restart: unless-stopped

  # Database: always restart
  db:
    image: postgres:16
    restart: unless-stopped

  # One-off task: never restart
  migration:
    image: myapp:latest
    command: python manage.py migrate
    restart: "no"

  # Worker: restart on failure, at most 5 times
  worker:
    image: myapp:latest
    command: celery -A myapp worker
    restart: on-failure:5

restart vs deploy.restart_policy

services:
  app:
    image: myapp:latest

    # Standalone mode (docker compose up)
    restart: unless-stopped

    # Swarm mode (docker stack deploy): ignores `restart` above and uses deploy.restart_policy
    deploy:
      restart_policy:
        condition: on-failure
        delay: 5s
        max_attempts: 3
        window: 120s

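Choosing a restart policy

Should the container restart after it exits?
├── No
│   ├── One-off task → restart: "no" (or leave unset)
│   └── Temporary debugging container → leave unset
└── Yes
    ├── Production service → restart: unless-stopped ✅
    ├── Task worker → restart: on-failure:5
    └── Must come back after a Docker daemon restart even if it was stopped manually → restart: always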

7.5 Putting It Together: Dependencies + Health Checks + Restart

Production-grade web application

services:
  web:
    image: myapp:latest
    ports:
      - "8080:3000"
    environment:
      DATABASE_URL: postgresql://postgres:secret@db:5432/myapp
      REDIS_URL: redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_healthy
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:3000/health"]
      interval: 15s
      timeout: 5s
      retries: 3
      start_period: 30s
    restart: unless-stopped

  db:
    image: postgres:16-alpine
    volumes:
      - pgdata:/var/lib/postgresql/data
    environment:
      POSTGRES_PASSWORD: secret
      POSTGRES_DB: myapp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U postgres"]
      interval: 10s
      timeout: 5s
      retries: 5
      start_period: 30s
    restart: unless-stopped

  redis:
    image: redis:7-alpine
    healthcheck:
      test: ["CMD", "redis-cli", "ping"]
      interval: 10s
      timeout: 3s
      retries: 5
    restart: unless-stopped

  nginx:
    image: nginx:alpine
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf:ro
    depends_on:
      web:
        condition: service_healthy
    restart: unless-stopped

volumes:
  pgdata:

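Startup sequence (illustrative timings)

t=0s    db and redis start in parallel (neither depends on the other); health checks begin
t=5s    db healthcheck (start_interval, default 5s during start_period) → still initializing
t=10s   db healthcheck → pg_isready succeeds → healthy ✅
t=10s   redis healthcheck (first probe after interval=10s) → healthy ✅
t=10s   web starts (db and redis are both healthy)
t=10s   web healthcheck begins (failures within start_period=30s do not count toward retries)
t≈25s   web healthcheck → curl succeeds → healthy ✅ (as soon as a probe passes, without waiting for start_period to end)
t≈25s   nginx starts
t≈26s   nginx is ready and begins accepting traffic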

7.6 Startup Grace Period (start_period)

start_period solves the problem of a service being marked unhealthy while it is still initializing.

services:
  # A Spring Boot application may take 60 seconds to start
  spring-app:
    image: myapp:latest
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/actuator/health"]
      interval: 10s
      timeout: 5s
      retries: 3
      start_period: 60s    # failed checks during the first 60 seconds do not count toward retries
      start_interval: 10s  # check every 10s during the start period

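start_period behavior

Timeline:
0s                           60s                 90s
├────────────────────────────┼───────────────────┤
│ start_period (grace)       │ normal checking   │
│                            │                   │
│ failures do not count      │ failures count    │
│ toward retries; a success  │ toward retries    │
│ marks the container        │                   │
│ healthy immediately        │                   │
└────────────────────────────┴───────────────────┘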
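
The rules in the timeline can be replayed with a small simulation. This is a simplified model under stated assumptions, not Docker's implementation: `simulate_health` is a hypothetical function that takes probe results as `<seconds_since_start>:<ok|fail>` pairs in time order and prints the resulting status.

```shell
#!/bin/sh
# Simplified model of health status transitions with a start period.
# simulate_health <start_period_s> <retries> <probe>...
simulate_health() {
  start_period=$1 retries=$2
  shift 2
  status=starting streak=0 started=0
  for probe in "$@"; do
    t=${probe%%:*} result=${probe#*:}
    if [ "$result" = ok ]; then
      # Any success marks the container healthy and ends the grace period
      status=healthy streak=0 started=1
    elif [ "$started" -eq 0 ] && [ "$t" -lt "$start_period" ]; then
      : # failure inside the grace period: ignored
    else
      streak=$((streak + 1))
      if [ "$streak" -ge "$retries" ]; then
        status=unhealthy
      fi
    fi
  done
  echo "$status"
}

# Slow start: three failures inside the 60s grace period, then a success
simulate_health 60 3 10:fail 20:fail 30:fail 40:ok      # prints healthy
# Same failures without a grace period
simulate_health 0 3 10:fail 20:fail 30:fail 40:ok       # prints healthy
simulate_health 0 3 10:fail 20:fail 30:fail             # prints unhealthy
```

The last two calls show the problem start_period addresses: without a grace period the container is already unhealthy at 30s, and anything reacting to that status (for example a dependent service waiting on `service_healthy`) sees a failure before the application has had time to start.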

7.7 Wait Scripts

When a healthcheck is not flexible enough, a wait script can be used instead.

The wait-for-it pattern

services:
  web:
    image: myapp:latest
    depends_on:
      db:
        condition: service_started    # wait for start only
    command: >
      sh -c "
        echo 'Waiting for database...'
        until nc -z db 5432; do
          sleep 1
        done
        echo 'Database is ready!'
        exec python app.py
      "

The official Docker approach

services:
  web:
    image: myapp:latest
    depends_on:
      db:
        condition: service_healthy    # recommended with Compose V2
    command: ["python", "app.py"]

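💡 V2 best practice: prefer depends_on.condition: service_healthy together with a healthcheck, and avoid custom wait scripts. Wait scripts are a workaround from the Compose V1 era.

Custom wait script (used as the entrypoint)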

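#!/bin/sh
# wait-for-deps.sh: wait for host:port to accept TCP connections, then run the given command

set -e

host="$1"
port="$2"
shift 2   # the remaining arguments are the command to run

echo "Waiting for $host:$port..."
while ! nc -z "$host" "$port" 2>/dev/null; do
  sleep 1
done
echo "$host:$port is available. Starting application..."

# exec "$@" preserves argument quoting and replaces the shell, so the application receives signals directly
exec "$@"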

services:
  web:
    image: myapp:latest
    volumes:
      - ./wait-for-deps.sh:/usr/local/bin/wait-for-deps.sh:ro
    entrypoint: ["wait-for-deps.sh", "db", "5432"]
    command: ["python", "app.py"]
    depends_on:
      - db
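
The wait loops in this section retry forever, so a misconfigured host name hangs the container silently. A minimal sketch of the same loop with a timeout follows; `wait_for_port` and its 30-second default are assumptions for illustration, and it relies on `nc` being present in the image just as the scripts above do.

```shell
#!/bin/sh
# Hypothetical variant of the wait loop with an upper bound on waiting time.
# wait_for_port <host> <port> [timeout_seconds]
wait_for_port() {
  host=$1 port=$2 timeout=${3:-30}
  waited=0
  until nc -z "$host" "$port" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "Timed out after ${timeout}s waiting for $host:$port" >&2
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$host:$port is available"
}
```

In an entrypoint script this could be used as `wait_for_port db 5432 60 && exec "$@"`, so the container exits with an error (and its restart policy applies) instead of waiting indefinitely.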

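7.8 The init Process and Signal Handling

Why is an init process needed?

By default, your application process runs as PID 1 in the container, and PID 1 behaves differently from ordinary processes:

No default SIGTERM handling: the kernel does not apply default signal actions to PID 1, so unless the application installs a SIGTERM handler, the SIGTERM sent by docker stop is ignored and the container is killed with SIGKILL after 10s.
Zombie processes are not reaped: orphaned child processes are re-parented to PID 1, and if it never waits on them they linger as zombies after they exit.

Using tini as init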

services:
  app:
    image: myapp:latest
    init: true    # run an init process (tini) as PID 1

  # Or in the Dockerfile (tini must be installed in the image)
  # FROM myapp:latest
  # ENTRYPOINT ["tini", "--"]
  # CMD ["python", "app.py"]

Graceful shutdown

services:
  nginx:
    image: nginx:alpine
    stop_grace_period: 30s    # wait up to 30s before sending SIGKILL

  app:
    image: myapp:latest
    stop_grace_period: 15s
    stop_signal: SIGQUIT      # custom stop signal

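Signal	When it is sent	Purpose
`SIGTERM`	Sent by `docker stop` by default	Graceful shutdown
`SIGKILL`	Sent after the grace period expires	Forced termination
`SIGINT`	Sent by `Ctrl+C`	User interrupt
`SIGQUIT`	Configurable via `stop_signal`	Used by Nginx and others for graceful shutdown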

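7.9 Visualizing the Dependency Graph

Inspecting dependencies

# List each service's dependencies from the resolved configuration
# (`docker compose ls` only lists running projects; it does not show dependencies)
docker compose config --format json | jq '.services | map_values(.depends_on // {} | keys)'

# Or draw the graph by hand
# from the depends_on entries in compose.yaml

Example dependency graph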

┌──────────────────────────────────────────────────┐
│                 Dependency graph                 │
│                                                  │
│   ┌──────┐                                      │
│   │nginx │                                      │
│   └──┬───┘                                      │
│      │ depends_on (service_healthy)              │
│      ▼                                          │
│   ┌──────┐                                      │
│   │ web  │                                      │
│   └──┬───┘                                      │
│      │                                          │
│      ├───── depends_on (service_healthy) ───┐    │
│      ▼                                      ▼    │
│   ┌──────┐                            ┌───────┐ │
│   │  db  │                            │ redis │ │
│   └──────┘                            └───────┘ │
│                                                  │
│   Startup order: db, redis → web → nginx         │
└──────────────────────────────────────────────────┘
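
The startup order shown in the diagram is a topological sort of the dependency graph. As a sketch, the standard `tsort` utility can derive one such order from dependency pairs. The pairs below are transcribed by hand from the diagram, and `startup_order` is a hypothetical wrapper name.

```shell
#!/bin/sh
# Reads "<dependency> <dependent>" pairs on stdin, one per line
# (the left service must start before the right one),
# and prints one valid startup order.
startup_order() {
  tsort
}

# Edges from the diagram: web needs db and redis; nginx needs web
printf '%s\n' 'db web' 'redis web' 'web nginx' | startup_order
```

The relative position of db and redis in the output carries no meaning, since neither depends on the other and Compose starts them in parallel. What the sort guarantees is that web comes after both and nginx comes last.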

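7.10 Common Problems

Problem	Cause	Solution
The application cannot reach the database at startup	`depends_on` does not wait for readiness	Use `condition: service_healthy`
Health check stays unhealthy	Wrong check command, or the service starts slowly	Verify the check command; increase `start_period`
Container restarts in a loop	`restart: always` combined with a crashing application	Fix the application, or use `on-failure:N`
Zombie processes pile up	PID 1 does not reap child processes	Use `init: true`
Graceful shutdown fails	The application does not handle SIGTERM	Handle the signal in the application, or configure `stop_signal` and `stop_grace_period`

7.11 Summary

Concept	Description
`depends_on`	Controls service startup order; the long syntax supports waiting on a condition
`condition`	`service_healthy` is the most commonly used wait condition in production
`healthcheck`	Periodically checks service health; each kind of service needs its own check
`restart`	`unless-stopped` suits production services; `on-failure:N` suits task workers
`start_period`	Startup grace period that keeps slow-starting services from being misjudged
`init`	Runs tini as PID 1 to handle signals and reap zombie processes correctly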
