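03 - Quick Start

Go from zero to your first inference request: the basics of vLLM in about 15 minutes.

3.1 Overview

vLLM offers two core usage modes:

Mode	Typical use cases	Core class / tool
Offline batch inference	Data processing, batch generation, evaluation	`LLM` class
Online serving	API services, real-time interaction	`vllm serve` command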

                  ┌───────────────────────────┐
                  │      Ways to use vLLM     │
                  └─────────┬─────────────────┘
                   ┌────────┴────────┐
                   ▼                 ▼
           ┌──────────────┐  ┌──────────────┐
           │   Offline    │  │    Online    │
           │  inference   │  │   serving    │
           │              │  │              │
           │  from vllm   │  │  vllm serve  │
           │  import LLM  │  │  MODEL       │
           │              │  │              │
           │  Batch jobs  │  │  HTTP API    │
           │  Pipelines   │  │  Real-time   │
           └──────────────┘  └──────────────┘

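3.2 Offline Batch Inference

3.2.1 Basic Example

Offline inference suits workloads that process a large amount of text in one pass, such as data augmentation, evaluation, and batch generation.

# offline_basic.py
"""Basic vLLM offline inference example."""

from vllm import LLM, SamplingParams

# 1. Initialize the LLM engine.
#    The first run downloads the model automatically (about 15 GB); later runs load it from the cache.
llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,          # Maximum sequence length (lower it to save GPU memory)
    gpu_memory_utilization=0.9,  # Fraction of GPU memory vLLM may use
)

# 2. Configure the sampling parameters.
sampling_params = SamplingParams(
    temperature=0.7,         # Sampling temperature (0 = greedy, >1 = more random)
    top_p=0.9,               # Nucleus sampling
    max_tokens=512,          # Maximum number of generated tokens
    repetition_penalty=1.1,  # Repetition penalty
)

# 3. Run batch inference.
prompts = [
    "Describe the core advantages of vLLM.",
    "What is PagedAttention?",
    "How can LLM inference performance be optimized?",
    "How do you write asynchronous code in Python?",
]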

outputs = llm.generate(prompts, sampling_params)

# 4. Process the results.
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt[:30]}...")
    print(f"Generated: {generated_text[:100]}...")
    print(f"Tokens: {len(output.outputs[0].token_ids)}")
    print("-" * 60)

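3.2.2 Chat-Format Inference

For instruction-tuned (chat) models, the chat format is a better fit because the model's chat template is applied to every conversation:

# offline_chat.py
"""Offline inference with chat-format input."""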

from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",
    max_model_len=4096,
)

sampling_params = SamplingParams(
    temperature=0.7,
    max_tokens=1024,
)

# Build the conversations as lists of role/content messages.
conversations = [
    [
        {"role": "system", "content": "You are a professional technical assistant."},
        {"role": "user", "content": "Explain the self-attention mechanism in Transformers."},
    ],
    [
        {"role": "system", "content": "You are a professional technical assistant."},
        {"role": "user", "content": "How does vLLM improve inference throughput?"},
    ],
]

# chat() applies the model's chat template automatically.
outputs = llm.chat(conversations, sampling_params=sampling_params)

for output in outputs:
    print(output.outputs[0].text)
    print("---")

3.2.3 Large-Scale Batch Processing

# offline_batch.py
"""Large-scale batch processing example."""

import json
from vllm import LLM, SamplingParams

def load_prompts(file_path: str) -> list[str]:
    """Load prompts from a JSONL file (one JSON object with a "prompt" field per line)."""
    prompts = []
    with open(file_path, "r", encoding="utf-8") as f:
        for line in f:
            data = json.loads(line)
            prompts.append(data["prompt"])
    return prompts

def process_batch(prompts: list[str], output_file: str):
    """Run batch inference and save the results."""
    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",
        max_model_len=2048,
        gpu_memory_utilization=0.92,
        max_num_batched_tokens=8192,  # Maximum tokens per batch
        max_num_seqs=256,             # Maximum concurrent sequences
    )

    sampling_params = SamplingParams(
        temperature=0.1,   # Low temperature for more deterministic output
        max_tokens=512,
    )

    # vLLM applies continuous batching internally
    outputs = llm.generate(prompts, sampling_params)

    # Save the results
    with open(output_file, "w", encoding="utf-8") as f:
        for output in outputs:
            result = {
                "prompt": output.prompt,
                "generated": output.outputs[0].text,
                "tokens": len(output.outputs[0].token_ids),
                "finish_reason": output.outputs[0].finish_reason,
            }
            f.write(json.dumps(result, ensure_ascii=False) + "\n")
    
    print(f"Done: processed {len(outputs)} prompts; results saved to {output_file}")

if __name__ == "__main__":
    prompts = load_prompts("prompts.jsonl")
    process_batch(prompts, "results.jsonl")
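
The batch script above expects `prompts.jsonl` to hold one JSON object per line with a `prompt` field. As a minimal sketch of that file format, the following writes such a file and reads it back while tolerating blank lines. The helper names `write_prompts` and `read_prompts` and the temporary path are hypothetical and not part of vLLM.

```python
import json
import os
import tempfile


def write_prompts(path: str, prompts: list[str]) -> None:
    """Write one {"prompt": ...} JSON object per line (JSONL)."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt in prompts:
            f.write(json.dumps({"prompt": prompt}, ensure_ascii=False) + "\n")


def read_prompts(path: str) -> list[str]:
    """Read prompts back, skipping blank lines that would break json.loads."""
    prompts = []
    with open(path, "r", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                prompts.append(json.loads(line)["prompt"])
    return prompts


# Round-trip through a temporary file to show the expected format.
demo_path = os.path.join(tempfile.mkdtemp(), "prompts.jsonl")
write_prompts(demo_path, ["What is PagedAttention?", "Summarize vLLM in one sentence."])
loaded = read_prompts(demo_path)
print(loaded)
```

Skipping blank lines is a defensive choice: a trailing empty line in a hand-edited JSONL file would otherwise raise a decoding error before any inference starts.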

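3.2.4 Custom Model Paths

# Load a model from a local directory
llm = LLM(
    model="/data/models/Qwen2.5-7B-Instruct",      # Local path
    tokenizer="/data/models/Qwen2.5-7B-Instruct",  # Optional: tokenizer path
    trust_remote_code=True,  # For models that ship custom code
)

# Load a model from ModelScope.
# Set the environment variable before importing vllm so that it takes effect.
import os
os.environ["VLLM_USE_MODELSCOPE"] = "True"

from vllm import LLM

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",  # Downloaded from ModelScope
)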

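3.3 Online Serving

3.3.1 Starting the API Server

# Simplest way to start
vllm serve Qwen/Qwen2.5-7B-Instruct

# Start with a fuller set of options.
# Optional flags: --enable-prefix-caching (prefix caching) and
# --chat-template (custom chat template file, which must exist).
# Add --quantization awq only when serving an AWQ-quantized checkpoint.
# Comments cannot follow a trailing backslash, so they are collected here.
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen-7b \
    --max-model-len 4096 \
    --gpu-memory-utilization 0.9 \
    --trust-remote-code \
    --dtype auto \
    --enable-prefix-caching \
    --chat-template chat_template.jinja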

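3.3.2 Startup Parameters

Parameter	Default	Description
`--host`	unset (binds all interfaces)	Listen address
`--port`	`8000`	Listen port
`--served-model-name`	model name	Model name exposed through the API
`--model`	-	Hugging Face model name or local path (with `vllm serve`, normally given as the positional argument)
`--max-model-len`	model maximum	Maximum sequence length
`--gpu-memory-utilization`	`0.9`	Fraction of GPU memory to use
`--tensor-parallel-size`	`1`	Tensor-parallel degree (number of GPUs to shard across)
`--dtype`	`auto`	Data type: auto/half/float16/bfloat16/float32
`--quantization`	none	Quantization method: awq/gptq/fp8
`--enforce-eager`	`False`	Disable CUDA Graphs (for debugging)
`--enable-prefix-caching`	`False`	Enable prefix caching
`--disable-log-requests`	`False`	Disable request logging
`--trust-remote-code`	`False`	Trust remote code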
3.3.3 Testing the API Server

# Health check
curl http://localhost:8000/health

# List the served models
curl http://localhost:8000/v1/models | python -m json.tool

# Text completion (the "model" value must match --served-model-name)
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "prompt": "The future of AI is",
        "max_tokens": 100,
        "temperature": 0.7
    }'

# Chat endpoint
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "messages": [
            {"role": "user", "content": "Hello, please introduce yourself."}
        ],
        "max_tokens": 200,
        "temperature": 0.7
    }'
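
Loading a model can take a while, so requests sent right after `vllm serve` starts may be refused. The sketch below polls the `/health` endpoint used above until it answers with HTTP 200. The function name `wait_until_ready` and its `opener` and `sleep` parameters are hypothetical; the two parameters exist only so the loop can be exercised without a running server.

```python
import time
import urllib.request


def wait_until_ready(url, timeout_s=300.0, interval_s=2.0,
                     opener=urllib.request.urlopen, sleep=time.sleep):
    """Poll `url` until it answers HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts and HTTP errors are all OSError
            # subclasses; treat them as "not ready yet" and retry.
            pass
        sleep(interval_s)
    return False


# Example (requires a running server):
# wait_until_ready("http://localhost:8000/health")
```

A monotonic clock is used for the deadline so that system clock adjustments cannot shorten or extend the wait.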

3.3.4 Using the Python Client

# client_test.py
"""Connect to a vLLM server with the OpenAI client."""

from openai import OpenAI

# Create a client that points at the vLLM server
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",  # vLLM needs no API key unless the server was started with --api-key
)

# Chat Completion
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is PagedAttention?"},
    ],
    max_tokens=300,
    temperature=0.7,
)

print(response.choices[0].message.content)

# Text Completion
response = client.completions.create(
    model="qwen-7b",
    prompt="Python is a",
    max_tokens=100,
    temperature=0.7,
)

print(response.choices[0].text)

3.3.5 Streaming Output

# stream_client.py
"""Streaming output example."""

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed",
)

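# Streaming chat completion
stream = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "user", "content": "Write a short poem about artificial intelligence."},
    ],
    max_tokens=200,
    temperature=0.8,
    stream=True,  # Enable streaming
)

for chunk in stream:
    if not chunk.choices:
        continue  # Skip chunks without choices (e.g. usage-only chunks)
    content = chunk.choices[0].delta.content
    if content:
        print(content, end="", flush=True)

print()  # Final newline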
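
If the full text is needed after streaming finishes, the deltas can be accumulated instead of printed. The sketch below shows one way to do that. `collect_stream` and `_fake_chunk` are hypothetical names, and the fake chunks only imitate the attribute layout (`choices[0].delta.content`) used in the loop above so that the example runs without a server.

```python
from types import SimpleNamespace


def collect_stream(stream) -> str:
    """Concatenate the text deltas from an iterable of chat-completion chunks."""
    parts = []
    for chunk in stream:
        if not chunk.choices:
            continue  # some chunks carry no choices at all
        content = chunk.choices[0].delta.content
        if content:   # role-only or final chunks may have content=None
            parts.append(content)
    return "".join(parts)


def _fake_chunk(text):
    """Hypothetical stand-in that mimics the chunk attribute layout."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk(None),
    _fake_chunk("Hello"),
    _fake_chunk(", world"),
    SimpleNamespace(choices=[]),
]
full_text = collect_stream(fake_stream)
print(full_text)
```

With a real server, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would be passed to `collect_stream` in place of `fake_stream`.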

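3.4 Model Loading Configuration

3.4.1 Choosing a Data Type

from vllm import LLM

# Pick one of the following. "model" is a placeholder for a model name or path.

# Automatic selection (recommended)
llm = LLM(model="model", dtype="auto")

# Force FP16
llm = LLM(model="model", dtype="float16")

# Force BF16 (recommended on A100/H100)
llm = LLM(model="model", dtype="bfloat16")

# FP32 (not recommended; high memory use)
llm = LLM(model="model", dtype="float32")

dtype	Weight memory	Precision	Recommended for
float32	4 bytes/param	Highest	Rarely used
float16	2 bytes/param	High	General-purpose GPUs
bfloat16	2 bytes/param	High	A100/H100
auto	Chosen automatically	-	Recommended default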
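
The effect of the data type on weight memory is simple arithmetic: parameter count times bytes per parameter. The sketch below works this through for a model of roughly 7.6 billion parameters. That count is an assumption for a "7B"-class model, and the helper `weight_memory_gib` is hypothetical. The estimate covers weights only, not the KV cache or activations.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "bfloat16": 2}


def weight_memory_gib(num_params: float, dtype: str) -> float:
    """Approximate memory for the model weights alone, in GiB."""
    return num_params * BYTES_PER_PARAM[dtype] / 1024**3


# Assumption: roughly 7.6e9 parameters for a "7B"-class model.
for dtype in BYTES_PER_PARAM:
    print(f"{dtype:>8}: {weight_memory_gib(7.6e9, dtype):.1f} GiB")
```

At 2 bytes per parameter this comes to about 14 GiB (about 15 GB), which is consistent with the download size mentioned in section 3.2.1.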

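3.4.2 GPU Memory Management

llm = LLM(
    model="Qwen/Qwen2.5-7B-Instruct",

    # Fraction of GPU memory vLLM may use (default 0.9).
    # Lower it to leave room for other processes.
    gpu_memory_utilization=0.85,

    # Maximum sequence length; lowering it leaves room for more concurrent sequences.
    max_model_len=4096,  # Defaults to the model's maximum length when unset

    # Batching parameters
    max_num_batched_tokens=8192,  # Maximum tokens per batch
    max_num_seqs=256,             # Maximum sequences per batch

    # Swap space in CPU memory, used when KV cache blocks are swapped out
    swap_space=4,  # GiB, default 4
)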
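
To see why a smaller `max_model_len` allows more concurrent sequences, it helps to estimate how much KV cache a single token occupies. The sketch below uses the usual formula of one key and one value vector per layer. The model shape (28 layers, 4 KV heads, head dimension 128) and the 6 GiB cache budget are assumptions chosen for illustration, and both helper functions are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    """Bytes of KV cache one token occupies: a K and a V vector per layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def max_cached_tokens(kv_budget_gib: float, per_token_bytes: int) -> int:
    """How many tokens fit in a given KV cache budget."""
    return int(kv_budget_gib * 1024**3 // per_token_bytes)


# Assumed model shape: 28 layers, 4 KV heads (GQA), head_dim 128, 16-bit cache.
per_token = kv_bytes_per_token(num_layers=28, num_kv_heads=4, head_dim=128)
# Assumed budget: 6 GiB left for the KV cache after weights and activations.
tokens = max_cached_tokens(6.0, per_token)

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{tokens} tokens fit, i.e. {tokens // 4096} full sequences at max_model_len=4096")
```

Halving `max_model_len` doubles the number of worst-case sequences that fit in the same budget, which is the trade-off the comment in the configuration above refers to.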

3.4.3 Tokenizer Configuration

llm = LLM(
    model="model",

    # Use a custom tokenizer
    tokenizer="custom-tokenizer-path",

    # Tokenizer mode
    tokenizer_mode="auto",     # auto / slow / mistral

    # Whether to trust remote code
    trust_remote_code=True,
)

# A custom chat template is not an engine argument in the offline API;
# pass the template text to llm.chat(..., chat_template=...) instead.

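3.5 Sampling Parameters in Detail

`SamplingParams` controls how text is generated:

from vllm import SamplingParams

params = SamplingParams(
    # === Core parameters ===
    n=1,                    # Number of completions returned per prompt
    best_of=1,              # Generate best_of candidates, return the top n (not supported by every vLLM version)
    max_tokens=512,         # Maximum number of generated tokens
    min_tokens=0,           # Minimum number of generated tokens

    # === Sampling strategy ===
    temperature=0.7,        # Temperature (0 = greedy, >1 = more diverse)
    top_p=0.9,              # Nucleus sampling
    top_k=50,               # Top-K sampling (-1 = disabled)

    # === Penalties ===
    repetition_penalty=1.1,     # Repetition penalty (>1 penalizes repeats)
    frequency_penalty=0.0,      # Frequency penalty
    presence_penalty=0.0,       # Presence penalty

    # === Stop conditions ===
    stop=["\n\n", "END"],     # Stop strings
    stop_token_ids=[151643],  # Stop token IDs (model-specific)

    # === Output control ===
    logprobs=5,             # Number of log probabilities returned per generated token
    prompt_logprobs=0,      # Log probabilities for prompt tokens

    # === Special options ===
    ignore_eos=False,       # Whether to ignore the EOS token
    detokenize=True,        # Whether to detokenize the output
    skip_special_tokens=True,  # Whether to skip special tokens in the output
)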

Sampling Parameter Reference

Scenario	temperature	top_p	top_k	repetition_penalty
Code generation	0.0-0.2	0.95	-1	1.0
Technical Q&A	0.3-0.5	0.9	-1	1.1
Creative writing	0.7-1.0	0.95	50	1.1
Greedy decoding	0	1.0	-1	1.0
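
To make the interaction of `temperature`, `top_k`, and `top_p` concrete, the toy function below applies them to a four-token logit vector in pure Python. It is a simplified sketch for building intuition, not vLLM's sampler. The function name `filter_distribution` is hypothetical, and the filtering order (temperature, then top-k, then top-p) is an assumption.

```python
import math


def filter_distribution(logits, temperature=1.0, top_k=-1, top_p=1.0):
    """Return {token_index: probability} after temperature, top-k and top-p."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the arg-max token.
        best = max(range(len(logits)), key=lambda i: logits[i])
        return {best: 1.0}

    # Temperature scaling followed by a numerically stable softmax.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    ranked = sorted(((e / total, i) for i, e in enumerate(exps)), reverse=True)

    # Top-k: keep the k most likely tokens (-1 disables), then renormalize.
    if top_k > 0:
        ranked = ranked[:top_k]
        mass = sum(p for p, _ in ranked)
        ranked = [(p / mass, i) for p, i in ranked]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for prob, idx in ranked:
        kept.append((prob, idx))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize the survivors so they sum to 1 again.
    norm = sum(p for p, _ in kept)
    return {idx: p / norm for p, idx in kept}


logits = [2.0, 1.0, 0.5, -1.0]
print(filter_distribution(logits, temperature=0))  # greedy
print(filter_distribution(logits, temperature=0.7, top_k=3, top_p=0.9))
```

Lower temperatures concentrate probability on the top token, which is why the code-generation row in the table pairs a low temperature with top-k disabled, while creative writing uses a higher temperature bounded by `top_k=50`.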

3.6 Using Chat Templates

3.6.1 Automatic Template

vLLM automatically uses the chat template that ships with the model:

from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
sampling_params = SamplingParams(max_tokens=512)

# Qwen's chat template is applied automatically
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]

outputs = llm.chat([messages], sampling_params=sampling_params)
print(outputs[0].outputs[0].text)

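3.6.2 Custom Template

# Offline: read the template file and pass its text to chat()
with open("custom_template.jinja", "r", encoding="utf-8") as f:
    custom_template = f.read()

llm = LLM(model="model")
outputs = llm.chat(messages, chat_template=custom_template)  # messages as in 3.6.1

# Online: specify the template file on the command line
# vllm serve model --chat-template custom_template.jinja

Example custom template (custom_template.jinja):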

{% for message in messages %}
{% if message.role == 'system' %}
[System]: {{ message.content }}
{% elif message.role == 'user' %}
[User]: {{ message.content }}
{% elif message.role == 'assistant' %}
[Assistant]: {{ message.content }}
{% endif %}
{% endfor %}
[Assistant]:
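
To see what prompt string such a template produces, the sketch below re-implements the same logic in plain Python. It is an illustration only: `render_prompt` is a hypothetical helper, and the exact whitespace of a real render depends on the Jinja settings. The output shown here assumes the newline after each block tag is trimmed.

```python
def render_prompt(messages: list[dict]) -> str:
    """Plain-Python equivalent of the template above (illustration only)."""
    labels = {"system": "[System]", "user": "[User]", "assistant": "[Assistant]"}
    lines = [
        f"{labels[m['role']]}: {m['content']}"
        for m in messages
        if m["role"] in labels  # the template silently skips unknown roles
    ]
    lines.append("[Assistant]:")  # generation prompt left open for the model
    return "\n".join(lines)


prompt = render_prompt(
    [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ]
)
print(prompt)
```

The trailing `[Assistant]:` line plays the role of the generation prompt: the model continues from there, so a template that omits it may cause the model to keep writing the user's turn.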

3.7 Complete Workflow Example

# complete_workflow.py
"""Complete vLLM workflow: from engine startup to inference."""

import time
from vllm import LLM, SamplingParams

def main():
    # 1. Initialize the engine
    print("Loading model...")
    start_time = time.time()
    
    llm = LLM(
        model="Qwen/Qwen2.5-7B-Instruct",
        max_model_len=4096,
        gpu_memory_utilization=0.9,
    )
    
    load_time = time.time() - start_time
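    print(f"Model loaded in {load_time:.1f}s")
    
    # 2. Prepare the data
    prompts = [
        "Explain quantum computing in one sentence.",
        "What are the main differences between Python and Java?",
        "Recommend three introductory books on machine learning.",
        "Write a quicksort implementation in Python.",
    ]
    
    # 3. Configure sampling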
    sampling_params = SamplingParams(
        temperature=0.7,
        top_p=0.9,
        max_tokens=256,
    )
    
    # 4. Run inference
    print(f"\nStarting inference on {len(prompts)} prompts...")
    start_time = time.time()
    
    outputs = llm.generate(prompts, sampling_params)
    
    inference_time = time.time() - start_time
    
    # 5. Print the results
    total_tokens = 0
    for i, output in enumerate(outputs):
        text = output.outputs[0].text
        num_tokens = len(output.outputs[0].token_ids)
        total_tokens += num_tokens
        print(f"\n[{i+1}] Prompt: {output.prompt[:50]}...")
        print(f"    Response: {text[:200]}...")
        print(f"    Tokens: {num_tokens}")
    
    # 6. Statistics
    print("\n=== Performance statistics ===")
    print(f"Total time: {inference_time:.2f}s")
    print(f"Total generated tokens: {total_tokens}")
    print(f"Throughput: {total_tokens / inference_time:.1f} tokens/s")

if __name__ == "__main__":
    main()

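3.8 Notes

First run: the first model load downloads the weights from Hugging Face, which can take a long time. Downloading them ahead of time is recommended (see Chapter 2).

Out of GPU memory: if the GPU does not have enough memory, reduce max_model_len, lower gpu_memory_utilization (useful when other processes share the GPU), or use a quantized model.

Warmup: vLLM captures and warms up CUDA Graphs while the engine starts, so startup takes longer than a plain model load, and the first requests may be slower than later ones. Pass --enforce-eager to skip CUDA Graph capture while debugging.

Multi-process startup: vLLM uses a multi-process architecture, and the worker start method is controlled by the VLLM_WORKER_MULTIPROC_METHOD environment variable. The default may differ between versions. If you run into problems, set VLLM_WORKER_MULTIPROC_METHOD=spawn and keep the script's entry point under an if __name__ == "__main__": guard.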
