13 - Docker Containerized Deployment
Build standardized, reproducible vLLM deployments with Docker and Docker Compose.
13.1 Basic Docker Deployment
13.1.1 Using the Official Image
# Pull the official image
docker pull vllm/vllm-openai:latest
# Basic startup
docker run -d \
--name vllm \
--gpus all \
--ipc=host \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
vllm/vllm-openai:latest \
--model Qwen/Qwen2.5-7B-Instruct \
--served-model-name qwen-7b
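After the container starts (the first launch downloads the model, which can take several minutes), a quick smoke test against the OpenAI-compatible API confirms the server is up:

# Liveness and model listing
curl -s http://localhost:8000/health
curl -s http://localhost:8000/v1/models

# Minimal chat completion against the served model name
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello"}]}'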
13.1.2 Key Docker Parameters
| Parameter | Required | Description |
|---|---|---|
| `--gpus all` | ✅ | Expose all GPUs to the container |
| `--ipc=host` | ✅ | Share the host IPC namespace (shared memory) |
| `--shm-size=16g` | Recommended | Set the shared-memory size |
| `-p 8000:8000` | ✅ | Port mapping |
| `-v ...:/root/.cache/huggingface` | Recommended | Persist the model cache |
| `--ulimit memlock=-1:-1` | Recommended | Lift the locked-memory limit |
`--ipc=host` is effectively required: without it (or a sufficiently large `--shm-size`), vLLM's multiprocess workers crash when they run out of shared memory.
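Putting the recommended flags together (note that with `--ipc=host` the container uses the host's /dev/shm directly, so `--shm-size` only matters when you cannot share host IPC):

docker run -d \
  --name vllm \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1:-1 \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen2.5-7B-Instruct \
  --served-model-name qwen-7b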
13.2 Custom Dockerfile
13.2.1 Base Dockerfile
# Dockerfile
FROM vllm/vllm-openai:latest
# Install extra dependencies
RUN pip install --no-cache-dir \
    prometheus-client \
    structlog
# Copy custom configuration
COPY entrypoint.sh /entrypoint.sh
COPY chat_template.jinja /templates/chat_template.jinja
RUN chmod +x /entrypoint.sh
# Health check
HEALTHCHECK --interval=30s --timeout=10s --retries=3 \
    CMD curl -f http://localhost:8000/health || exit 1
EXPOSE 8000
ENTRYPOINT ["/entrypoint.sh"]
13.2.2 Entrypoint Script
#!/bin/bash
# entrypoint.sh
set -e
# Environment variable defaults
MODEL_NAME=${MODEL_NAME:-"Qwen/Qwen2.5-7B-Instruct"}
SERVED_MODEL_NAME=${SERVED_MODEL_NAME:-"qwen-7b"}
MAX_MODEL_LEN=${MAX_MODEL_LEN:-"4096"}
GPU_MEMORY_UTILIZATION=${GPU_MEMORY_UTILIZATION:-"0.9"}
PORT=${PORT:-"8000"}
echo "Starting vLLM server..."
echo " Model: $MODEL_NAME"
echo " Served Name: $SERVED_MODEL_NAME"
echo " Max Length: $MAX_MODEL_LEN"
echo " GPU Memory: $GPU_MEMORY_UTILIZATION"
exec python -m vllm.entrypoints.openai.api_server \
--model "$MODEL_NAME" \
--served-model-name "$SERVED_MODEL_NAME" \
--max-model-len "$MAX_MODEL_LEN" \
--gpu-memory-utilization "$GPU_MEMORY_UTILIZATION" \
--host 0.0.0.0 \
--port "$PORT" \
--trust-remote-code \
--dtype auto \
"$@"
13.2.3 Multi-Stage Build (Smaller Image)
# Dockerfile.multistage
# Build stage: install vLLM with pip
FROM python:3.11-slim AS builder
RUN pip install --no-cache-dir vllm
# Final image: CUDA runtime only, no build toolchain
FROM nvidia/cuda:12.4.0-runtime-ubuntu22.04
RUN apt-get update && apt-get install -y curl \
    && rm -rf /var/lib/apt/lists/*
# Copy the complete Python 3.11 installation (interpreter, site-packages, and
# console scripts) so the interpreter version matches the installed packages;
# copying site-packages alone breaks on Ubuntu 22.04, whose default python3 is 3.10
COPY --from=builder /usr/local /usr/local
RUN ldconfig
COPY entrypoint.sh /entrypoint.sh
RUN chmod +x /entrypoint.sh
EXPOSE 8000
ENTRYPOINT ["/entrypoint.sh"]
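A build-and-compare sketch (the tag vllm-slim is just an example name):

docker build -f Dockerfile.multistage -t vllm-slim .
docker images vllm-slim   # compare the size against vllm/vllm-openai:latest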
13.3 Docker Compose Deployment
13.3.1 Single-Model Deployment
# docker-compose.yml
version: '3.8'
services:
vllm:
image: vllm/vllm-openai:latest
container_name: vllm-server
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
ports:
- "8000:8000"
ipc: host
shm_size: '16gb'
ulimits:
memlock:
soft: -1
hard: -1
volumes:
- model-cache:/root/.cache/huggingface
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
command: >
--model Qwen/Qwen2.5-7B-Instruct
--served-model-name qwen-7b
--max-model-len 4096
--gpu-memory-utilization 0.9
--trust-remote-code
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 5
start_period: 300s
volumes:
model-cache:
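Bring the stack up and wait for the health check to pass (start_period is generous because the first start downloads the model):

# HF_TOKEN can live in an .env file next to docker-compose.yml
docker compose up -d
docker compose logs -f vllm
docker compose ps   # STATUS should eventually read "healthy"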
13.3.2 Multi-Model Deployment
# docker-compose.multi-model.yml
version: '3.8'
services:
  # General chat model
vllm-chat:
image: vllm/vllm-openai:latest
container_name: vllm-chat
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['0']
capabilities: [gpu]
ports:
- "8000:8000"
ipc: host
shm_size: '16gb'
volumes:
- model-cache:/root/.cache/huggingface
environment:
- CUDA_VISIBLE_DEVICES=0
command: >
--model Qwen/Qwen2.5-7B-Instruct
--served-model-name qwen-7b
--max-model-len 4096
--gpu-memory-utilization 0.9
restart: unless-stopped
  # Code model
vllm-code:
image: vllm/vllm-openai:latest
container_name: vllm-code
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['1']
capabilities: [gpu]
ports:
- "8001:8000"
ipc: host
shm_size: '16gb'
volumes:
- model-cache:/root/.cache/huggingface
environment:
- CUDA_VISIBLE_DEVICES=1
command: >
--model Qwen/Qwen2.5-Coder-7B-Instruct
--served-model-name qwen-coder
--max-model-len 4096
--gpu-memory-utilization 0.9
restart: unless-stopped
  # Load balancer (Nginx)
nginx:
image: nginx:alpine
container_name: vllm-nginx
ports:
- "80:80"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
depends_on:
- vllm-chat
- vllm-code
restart: unless-stopped
volumes:
model-cache:
13.3.3 Nginx Load Balancing Configuration
# nginx.conf
events {
worker_connections 1024;
}
http {
upstream vllm_chat {
server vllm-chat:8000;
}
upstream vllm_code {
server vllm-code:8000;
}
server {
listen 80;
        # General chat model route
location /v1/chat/completions {
proxy_pass http://vllm_chat;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_buffering off;
proxy_cache off;
proxy_read_timeout 300s;
}
        # Code model: match any /v1/code/* path; the rewrite strips the /v1/code prefix
        location /v1/code/ {
            rewrite ^/v1/code/(.*)$ /v1/$1 break;
            proxy_pass http://vllm_code;
proxy_set_header Host $host;
proxy_buffering off;
proxy_read_timeout 300s;
}
        # Health checks
location /health/chat {
proxy_pass http://vllm_chat/health;
}
location /health/code {
proxy_pass http://vllm_code/health;
}
}
}
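A quick check that the routing and the rewrite rule behave as intended, run from the Docker host:

# Chat model through the default route
curl -s http://localhost/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "hi"}]}'

# Code model through the rewritten /v1/code/ prefix
curl -s http://localhost/v1/code/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "qwen-coder", "messages": [{"role": "user", "content": "hi"}]}'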
13.4 Multi-GPU Deployment
13.4.1 Single Container, Multiple GPUs (Tensor Parallelism)
# docker-compose.multi-gpu.yml
version: '3.8'
services:
vllm-70b:
image: vllm/vllm-openai:latest
container_name: vllm-70b
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['0', '1', '2', '3']
capabilities: [gpu]
ports:
- "8000:8000"
ipc: host
shm_size: '32gb'
volumes:
- model-cache:/root/.cache/huggingface
command: >
--model Qwen/Qwen2.5-72B-Instruct
--tensor-parallel-size 4
--served-model-name qwen-72b
--max-model-len 8192
--gpu-memory-utilization 0.9
restart: unless-stopped
volumes:
model-cache:
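After startup, it is worth confirming that tensor parallelism actually engaged all four GPUs:

# Each of the four GPUs should show a vLLM worker process and high memory usage
docker exec vllm-70b nvidia-smi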
13.4.2 GPU Device Selection
# Use specific GPUs
deploy:
resources:
reservations:
devices:
- driver: nvidia
          device_ids: ['0', '2'] # only GPU 0 and GPU 2 (renumbered 0 and 1 inside the container)
capabilities: [gpu]
# Use all GPUs
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
# Use a fixed number of GPUs
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 2
capabilities: [gpu]
13.5 Full Deployment with Monitoring
13.5.1 Complete Compose File
# docker-compose.full.yml
version: '3.8'
services:
  # vLLM inference service
vllm:
image: vllm/vllm-openai:latest
container_name: vllm-server
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
ports:
- "8000:8000"
ipc: host
shm_size: '16gb'
volumes:
- model-cache:/root/.cache/huggingface
environment:
- HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
command: >
--model Qwen/Qwen2.5-7B-Instruct
--served-model-name qwen-7b
--max-model-len 4096
--gpu-memory-utilization 0.9
restart: unless-stopped
networks:
- monitoring
# Prometheus
prometheus:
image: prom/prometheus:latest
container_name: prometheus
ports:
- "9090:9090"
volumes:
- ./prometheus.yml:/etc/prometheus/prometheus.yml
- prometheus-data:/prometheus
command:
- '--config.file=/etc/prometheus/prometheus.yml'
- '--storage.tsdb.retention.time=30d'
restart: unless-stopped
networks:
- monitoring
# Grafana
grafana:
image: grafana/grafana:latest
container_name: grafana
ports:
- "3000:3000"
environment:
- GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD:-admin}
volumes:
- grafana-data:/var/lib/grafana
- ./grafana/provisioning:/etc/grafana/provisioning
restart: unless-stopped
networks:
- monitoring
  # DCGM Exporter (GPU monitoring)
dcgm-exporter:
image: nvcr.io/nvidia/k8s/dcgm-exporter:3.3.8-3.6.0-ubuntu22.04
container_name: dcgm-exporter
ports:
- "9400:9400"
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
networks:
- monitoring
  # Nginx reverse proxy
nginx:
image: nginx:alpine
container_name: nginx
ports:
- "80:80"
- "443:443"
volumes:
- ./nginx.conf:/etc/nginx/nginx.conf:ro
- ./certs:/etc/nginx/certs:ro
depends_on:
- vllm
restart: unless-stopped
networks:
- monitoring
networks:
monitoring:
driver: bridge
volumes:
model-cache:
prometheus-data:
grafana-data:
13.5.2 Prometheus Configuration
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'vllm'
static_configs:
- targets: ['vllm:8000']
scrape_interval: 5s
- job_name: 'dcgm'
static_configs:
- targets: ['dcgm-exporter:9400']
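vLLM serves Prometheus metrics on its main port under /metrics, so the job above needs no custom metrics_path. A quick sanity check (metric names carry the vllm: prefix; verify the exact set against your version's /metrics output):

# Should print gauges such as vllm:num_requests_running and vllm:gpu_cache_usage_perc
curl -s http://localhost:8000/metrics | grep "^vllm:" | head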
13.6 Model Caching Strategies
13.6.1 Docker Volume Cache
# Persist the model cache with a named volume
volumes:
model-cache:
driver: local
driver_opts:
type: none
o: bind
      device: /data/huggingface-cache # host path
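Note that with a bind-backed named volume like this, Docker will not create the host directory for you; create it before the first docker compose up:

sudo mkdir -p /data/huggingface-cache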
13.6.2 Mounting a Local Directory
volumes:
- /data/models/Qwen2.5-7B:/models/qwen-7b:ro
command: --model /models/qwen-7b
13.6.3 Pre-Downloading the Model into the Image
# Dockerfile.with-model
FROM vllm/vllm-openai:latest
# Pre-download the model at build time so it is cached in an image layer
RUN python -c "from huggingface_hub import snapshot_download; \
    snapshot_download('Qwen/Qwen2.5-7B-Instruct', local_dir='/models/qwen-7b')"
CMD ["--model", "/models/qwen-7b"]
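Baking the weights in makes the image fully self-contained (useful for air-gapped environments) at the cost of a much larger image, since the full weights live in a layer. Build it with an example tag:

docker build -f Dockerfile.with-model -t vllm-qwen7b .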
13.7 Log Management
# Logging configuration in docker-compose.yml
services:
vllm:
logging:
driver: json-file
options:
max-size: "100m"
max-file: "3"
tag: "vllm"
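Rotation is then handled by the json-file driver itself (3 files × 100 MB per container); tail the live stream with:

docker compose logs -f vllm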
13.8 Production Deployment Checklist
□ Use --ipc=host (required)
□ Set an adequate shm_size (≥ 16 GB)
□ Mount a model cache volume (avoid re-downloading)
□ Configure health checks
□ Set a restart policy (unless-stopped)
□ Pin GPU device selection
□ Cap log file size
□ Configure the NVIDIA Container Toolkit
□ Keep secrets in an .env file
□ Set appropriate timeouts
□ Set up monitoring (Prometheus + Grafana)
□ Test streaming output (make sure Nginx is not buffering)
□ Configure TLS/HTTPS
□ Test failure recovery
13.9 Common Docker Issues
| Problem | Cause | Fix |
|---|---|---|
| `bus error` | Insufficient shared memory | Add `--ipc=host` or `--shm-size=16g` |
| `CUDA out of memory` | Not enough GPU memory | Lower `gpu-memory-utilization` or `max-model-len` |
| Container runs but is unreachable | Port not mapped | Check `-p 8000:8000` |
| Model re-downloads on every start | Cache not mounted | Mount `/root/.cache/huggingface` |
| GPUs not visible | NVIDIA runtime not configured | Install `nvidia-container-toolkit` |
13.10 Caveats
Security: in production, avoid letting --ipc=host share host IPC with untrusted containers; consider dedicating nodes to vLLM.
Resource isolation: use deploy.resources.limits to cap CPU and memory so a single container cannot starve the host; see the sketch below.
Image size: the official vLLM image is large (roughly 10 GB); a multi-stage build can shrink it.
Startup time: large models can take minutes to load, so give your orchestrator's (Docker Swarm or K8s) health checks a generous grace period.
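A minimal limits sketch for the vllm service (the values are illustrative, not recommendations):

# Fragment: merge into the vllm service definition in docker-compose.yml
services:
  vllm:
    deploy:
      resources:
        limits:
          cpus: '8'
          memory: 64gb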
13.11 Further Reading
Previous chapter: 12 - Kubernetes Deployment | Next chapter: 14 - Troubleshooting