04 - OpenAI-Compatible API Server

A complete tour of vLLM's OpenAI-compatible API, so you can plug vLLM into the OpenAI ecosystem with minimal changes.


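4.1 API Service Overview

vLLM ships with an HTTP server that implements the core OpenAI API endpoints, so it can act as a drop-in replacement for the OpenAI API in most applications. In practice, existing OpenAI client code usually only needs a different base_url (and a model name that matches the one vLLM serves) to switch over.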
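
To make the base_url point concrete, here is a minimal standard-library sketch that builds the same chat request for two backends. The helper name build_chat_request, the OpenAI model name, and the placeholder key are illustrative assumptions; only the base URL and the model name differ between the two requests.

```python
import json
import urllib.request

OPENAI_BASE_URL = "https://api.openai.com/v1"
VLLM_BASE_URL = "http://localhost:8000/v1"  # assumed address of a local vLLM server


def build_chat_request(base_url: str, api_key: str, model: str, user_text: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": user_text}]}
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {api_key}"},
        method="POST",
    )


# Same code path, different backend: only base_url and model change.
openai_req = build_chat_request(OPENAI_BASE_URL, "sk-placeholder", "gpt-4o-mini", "Hello")
vllm_req = build_chat_request(VLLM_BASE_URL, "none", "qwen-7b", "Hello")
print(openai_req.full_url)
print(vllm_req.full_url)
```

Passing either request to urllib.request.urlopen would send it; the sketch stops short of that so it runs without a server.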

4.1.1 Architecture Overview

Client code
    │
    ▼
OpenAI Python SDK / cURL / any HTTP client
    │
    ▼ (HTTP Request)
┌───────────────────────────────────┐
│  vLLM OpenAI-Compatible Server    │
│                                   │
│  /v1/chat/completions  ──┐       │
│  /v1/completions       ──┤       │
│  /v1/embeddings        ──┼──→ vLLM Engine
│  /v1/models            ──┤       │
│  /health               ──┘       │
└───────────────────────────────────┘

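4.1.2 Supported API Endpoints

| Endpoint | Method | Purpose | OpenAI-compatible |
|---|---|---|---|
| /v1/chat/completions | POST | Chat-style conversation | Yes |
| /v1/completions | POST | Text completion | Yes |
| /v1/embeddings | POST | Text embedding | Yes |
| /v1/models | GET | List models | Yes |
| /health | GET | Health check | - |
| /tokenize | POST | Tokenization | vLLM-specific |
| /detokenize | POST | Detokenization | vLLM-specific |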
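
The /health endpoint is handy for readiness checks, since loading a model can take a while after the process starts. The following is a small sketch of a polling loop, assuming the endpoint answers HTTP 200 once the server is ready; the function names, attempt count, and delay are illustrative choices rather than anything vLLM prescribes.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False


def wait_until_healthy(url: str, probe: Callable[[str], bool] = http_probe,
                       attempts: int = 30, delay_s: float = 2.0) -> bool:
    """Poll `url` until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # wait before the next probe
    return False
```

A typical call would be wait_until_healthy("http://localhost:8000/health") before sending the first request. The probe is injectable so the loop can be exercised without a running server.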

4.2 Chat Completions API

4.2.1 Basic Request

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ],
        "max_tokens": 300,
        "temperature": 0.7
    }'

4.2.2 Response Format

{
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "qwen-7b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Deep learning is a subfield of machine learning..."
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 25,
        "completion_tokens": 156,
        "total_tokens": 181
    }
}
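
As an illustration of how a client might consume this structure, the sketch below pulls out the reply text, the finish reason, and the token usage from a raw response body. The helper name summarize_chat_response and the shape of its return value are assumptions made for this example; a finish_reason of "length" indicates the reply was cut off by max_tokens.

```python
import json


def summarize_chat_response(raw: str) -> dict:
    """Extract the reply text, finish reason, and token usage from a response body."""
    body = json.loads(raw)
    choice = body["choices"][0]
    usage = body.get("usage") or {}
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        # "length" means generation stopped because max_tokens was reached
        "truncated": choice["finish_reason"] == "length",
        "total_tokens": usage.get("total_tokens"),
    }


sample = json.dumps({
    "choices": [{
        "index": 0,
        "message": {"role": "assistant", "content": "Deep learning is..."},
        "finish_reason": "stop",
    }],
    "usage": {"prompt_tokens": 25, "completion_tokens": 156, "total_tokens": 181},
})
print(summarize_chat_response(sample))
```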

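4.2.3 Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | - | Model name (required) |
| messages | array | - | Message list (required) |
| max_tokens | integer | null | Maximum number of tokens to generate |
| temperature | float | 1.0 | Sampling temperature (0 to 2) |
| top_p | float | 1.0 | Nucleus sampling threshold |
| n | integer | 1 | Number of candidate completions |
| stream | boolean | false | Whether to stream the output |
| stop | string/array | null | Stop sequence(s) |
| presence_penalty | float | 0 | Presence penalty (-2 to 2) |
| frequency_penalty | float | 0 | Frequency penalty (-2 to 2) |
| logprobs | boolean | false | Whether to return log probabilities |
| top_logprobs | integer | null | Number of top log probabilities to return per token |
| tools | array | null | Tool definitions (Function Calling) |
| tool_choice | string/object | auto | Tool selection strategy |
| response_format | object | null | Response format (JSON mode) |
| seed | integer | null | Random seed (for reproducibility) |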
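
Catching out-of-range values on the client side avoids a round trip that would only end in an error response. The sketch below checks the ranges listed in the table before building a payload; build_chat_payload is a hypothetical helper, and the server remains the authority on what it actually accepts.

```python
def build_chat_payload(model: str, messages: list, **params) -> dict:
    """Validate a few sampling parameters and return a request payload dict."""
    # Ranges taken from the parameter table above.
    ranges = {
        "temperature": (0.0, 2.0),
        "presence_penalty": (-2.0, 2.0),
        "frequency_penalty": (-2.0, 2.0),
    }
    for name, (low, high) in ranges.items():
        value = params.get(name)
        if value is not None and not (low <= value <= high):
            raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    # top_logprobs only makes sense when logprobs is enabled.
    if params.get("top_logprobs") is not None and not params.get("logprobs"):
        raise ValueError("top_logprobs requires logprobs=True")
    payload = {"model": model, "messages": messages}
    # Omit unset parameters so the server applies its own defaults.
    payload.update({k: v for k, v in params.items() if v is not None})
    return payload


payload = build_chat_payload(
    "qwen-7b",
    [{"role": "user", "content": "What is deep learning?"}],
    temperature=0.7,
    max_tokens=300,
    seed=None,
)
print(payload)
```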

4.2.4 Multi-Turn Conversation

# multi_turn.py
"""Multi-turn conversation example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Conversation history; the full list is resent with every request
conversation = [
    {"role": "system", "content": "You are a Python programming expert."}
]

def chat(user_input: str) -> str:
    conversation.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="qwen-7b",
        messages=conversation,
        max_tokens=500,
        temperature=0.7,
    )

    assistant_msg = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_msg})

    return assistant_msg

# Several turns in a row
print(chat("How do I read a CSV file in Python?"))
print(chat("What optimizations help if the file is very large?"))
print(chat("Can you give a complete code example?"))
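
Because the whole history is resent on every turn, a long conversation eventually grows past the model's context window. One simple mitigation, sketched here under the assumption that dropping the oldest turns is acceptable, is to keep the system message plus only the most recent messages. The helper trim_history and its max_messages parameter are hypothetical and not part of the example above.

```python
def trim_history(conversation: list, max_messages: int) -> list:
    """Keep system messages plus at most `max_messages` of the latest other messages."""
    system = [m for m in conversation if m["role"] == "system"]
    dialogue = [m for m in conversation if m["role"] != "system"]
    recent = dialogue[-max_messages:] if max_messages > 0 else []
    # Drop leading assistant replies so the kept history starts with a user turn.
    while recent and recent[0]["role"] != "user":
        recent = recent[1:]
    return system + recent


history = [
    {"role": "system", "content": "You are a Python programming expert."},
    {"role": "user", "content": "q1"},
    {"role": "assistant", "content": "a1"},
    {"role": "user", "content": "q2"},
    {"role": "assistant", "content": "a2"},
    {"role": "user", "content": "q3"},
]
print([m["content"] for m in trim_history(history, max_messages=3)])
```

Counting messages is a rough proxy; a production version would more likely budget by token count.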

4.2.5 Function Calling

# function_calling.py
"""Function Calling example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Beijing'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["city"],
            },
        },
    }
]

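# tool_choice="auto" needs server-side support: start vLLM with
# --enable-auto-tool-choice and a matching --tool-call-parser
# (hermes is the parser commonly used with Qwen2.5 models).
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "user", "content": "What's the weather like in Beijing today?"}
    ],
    tools=tools,
    tool_choice="auto",
)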

print(response.choices[0].message)
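
The example above stops at printing the model's message. When that message contains tool calls, the client is expected to run the tool and send the result back as a role="tool" message. The sketch below shows that step on a plain dict in the OpenAI tool-call shape; the get_weather body returns canned data, and TOOL_REGISTRY and run_tool_call are names invented for this illustration.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in tool: returns canned data instead of calling a weather service."""
    return {"city": city, "temperature": 22, "unit": unit}


TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the matching role="tool" message."""
    function = tool_call["function"]
    handler = TOOL_REGISTRY[function["name"]]
    arguments = json.loads(function["arguments"])  # arguments arrive as a JSON string
    result = handler(**arguments)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],
        "content": json.dumps(result, ensure_ascii=False),
    }


example_call = {
    "id": "call_0",
    "type": "function",
    "function": {"name": "get_weather", "arguments": '{"city": "Beijing"}'},
}
print(run_tool_call(example_call))
```

In a full round trip, the assistant message and the resulting tool message would both be appended to messages before calling the API again so the model can phrase the final answer.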

4.2.6 JSON Mode

# json_mode.py
"""Structured JSON output"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "system", "content": "You are a JSON output assistant. Always reply in JSON."},
        {"role": "user", "content": "List 3 common sorting algorithms and their time complexity"},
    ],
    response_format={"type": "json_object"},
    max_tokens=500,
    temperature=0.3,
)

print(response.choices[0].message.content)
# The output should be a valid JSON string (it can be cut short if max_tokens is reached)
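
Even with JSON mode enabled, a reply that stops at max_tokens can end mid-object, so it is worth parsing defensively. A minimal sketch follows; the helper name parse_json_reply and its (data, error) return convention are assumptions made for illustration.

```python
import json


def parse_json_reply(text: str):
    """Return (data, error): parsed JSON on success, otherwise None plus a message."""
    try:
        return json.loads(text), None
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON at char {exc.pos}: {exc.msg}"


data, error = parse_json_reply('{"algorithms": [{"name": "quicksort"}]}')
print(data, error)

# A reply cut off by max_tokens fails to parse and yields an error message instead.
data_cut, error_cut = parse_json_reply('{"algorithms": [{"name": "quick')
print(data_cut, error_cut)
```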

4.3 Completions API

4.3.1 Basic Request

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "prompt": "The meaning of life is",
        "max_tokens": 100,
        "temperature": 0.7
    }'

4.3.2 Python Client

# completions_example.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Single request
response = client.completions.create(
    model="qwen-7b",
    prompt="def quicksort(arr):\n",
    max_tokens=200,
    temperature=0.2,
    stop=["\n\ndef"],  # stop when the next function definition begins
)
print(response.choices[0].text)

# Batch request: one choice is returned per prompt
response = client.completions.create(
    model="qwen-7b",
    prompt=["How to learn Python?", "How to learn Rust?"],
    max_tokens=100,
    temperature=0.5,
)
for choice in response.choices:
    print(f"[{choice.index}]: {choice.text[:100]}")

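4.3.3 Choosing Between Chat and Completions

| Dimension | Chat Completions | Completions |
|---|---|---|
| Suitable models | Chat/Instruct models | Base models |
| Input format | messages array | Plain text string |
| Chat template | Applied automatically | Not applied |
| System prompt | ✅ Native support | Must be concatenated manually |
| Recommendation | Recommended for most use cases | Completion/continuation use cases |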
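
To show what "concatenated manually" means for the Completions endpoint, here is a sketch that renders a message list into a single prompt string. It assumes a ChatML-style layout like the one Qwen chat models use; the chat template stored with the model's tokenizer is the authoritative format, and to_chatml_prompt is a hypothetical helper.

```python
def to_chatml_prompt(messages: list) -> str:
    """Render messages as a ChatML-style prompt ending with an open assistant turn."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
        for m in messages
    ]
    parts.append("<|im_start|>assistant\n")  # cue the model to answer next
    return "".join(parts)


prompt = to_chatml_prompt([
    {"role": "system", "content": "You are helpful."},
    {"role": "user", "content": "Hi"},
])
print(prompt)
```

The resulting string would be sent as the prompt field of /v1/completions. Chat Completions performs this step on the server, which is one reason it is the recommended default.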

4.4 Streaming Output

4.4.1 Basic Streaming

# Streaming request with cURL (-N turns off curl's output buffering)
curl -N http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "messages": [{"role": "user", "content": "Write a poem"}],
        "max_tokens": 200,
        "stream": true
    }'

4.4.2 Streaming Response Format

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":"Spring"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":" breeze"},"logprobs":null,"finish_reason":null}]}

...

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop"}]}

data: [DONE]
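
The stream is a sequence of server-sent events: each data: line carries one JSON chunk, and the literal [DONE] marks the end. For clients that do not use the OpenAI SDK, the parsing can be sketched as below. The function name collect_stream_text is an illustrative assumption, and the input is any iterable of already-decoded text lines.

```python
import json


def collect_stream_text(lines) -> str:
    """Concatenate delta.content from SSE `data:` lines until [DONE]."""
    pieces = []
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and non-data fields
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        chunk = json.loads(data)
        for choice in chunk.get("choices", []):
            content = choice.get("delta", {}).get("content")
            if content:
                pieces.append(content)
    return "".join(pieces)


sample_lines = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Spring"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" breeze"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":""},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_stream_text(sample_lines))
```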

4.4.3 Python Streaming Client

# stream_example.py
"""Complete streaming example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def stream_chat(prompt: str):
    """Stream a chat completion to stdout"""
    stream = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0.7,
        stream=True,
        stream_options={"include_usage": True},  # include token usage statistics
    )

    for chunk in stream:
        if chunk.choices:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)

            # Report the finish reason once generation ends
            if chunk.choices[0].finish_reason:
                print(f"\n[finish reason: {chunk.choices[0].finish_reason}]")

        # With include_usage, the final chunk has empty choices and carries the usage
        if chunk.usage:
            print(f"[usage: prompt={chunk.usage.prompt_tokens}, "
                  f"completion={chunk.usage.completion_tokens}]")

stream_chat("Explain the concept of quantum entanglement")

4.4.4 Async Streaming Client

# async_stream.py
"""Async streaming output"""

import asyncio
from openai import AsyncOpenAI

async def stream_chat_async(prompt: str):
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )
    
    stream = await client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        stream=True,
    )
    
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

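# Run several streaming requests concurrently (their printed output will interleave)
async def main():
    tasks = [
        stream_chat_async("What is machine learning?"),
        stream_chat_async("What is deep learning?"),
        stream_chat_async("What is reinforcement learning?"),
    ]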
    await asyncio.gather(*tasks)

asyncio.run(main())

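4.5 Embeddings API

4.5.1 Generating Embeddings

# embeddings.py
"""Text embedding example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# The server must be started with an embedding model, for example:
#   vllm serve BAAI/bge-base-zh-v1.5 --task embedding --served-model-name bge-base-zh
# The served model name has to match the model argument below. The accepted
# --task value depends on the vLLM version (later releases rename it to "embed").

response = client.embeddings.create(
    model="bge-base-zh",
    input=["What is artificial intelligence?", "Basic concepts of machine learning"],
)

for i, embedding in enumerate(response.data):
    print(f"text {i}: dim={len(embedding.embedding)}, "
          f"first 5 values={embedding.embedding[:5]}")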
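
Embeddings are usually compared rather than inspected directly. As a hedged illustration of a typical next step, the sketch below computes the cosine similarity between two vectors with the standard library only; in the example above the inputs would be response.data[0].embedding and response.data[1].embedding, while the short vectors here are made-up stand-ins.

```python
import math


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy vectors standing in for real embeddings
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))
print(cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```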

4.6 Handling Concurrent Requests

4.6.1 Multithreaded Concurrency

# concurrent_requests.py
"""Concurrent requests example"""

import time
import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def send_request(prompt: str) -> dict:
    """Send a single request and record its latency"""
    start = time.time()
    response = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7,
    )
    elapsed = time.time() - start
    return {
        "prompt": prompt[:30],
        "response": response.choices[0].message.content[:50],
        "tokens": response.usage.completion_tokens,
        "time": elapsed,
    }

# Send 20 requests with up to 10 in flight at a time
prompts = [f"Explain concept {i} in one sentence" for i in range(20)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(send_request, prompts))
total_time = time.time() - start

# Summary statistics
total_tokens = sum(r["tokens"] for r in results)
print(f"Total time: {total_time:.2f}s")
print(f"Total tokens: {total_tokens}")
print(f"Throughput: {total_tokens / total_time:.1f} tokens/s")

4.6.2 Async Concurrency

# async_concurrent.py
"""Async concurrent requests"""

import asyncio
import time
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )
    
    async def send_request(prompt: str):
        response = await client.chat.completions.create(
            model="qwen-7b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        return response.usage.completion_tokens
    
    prompts = [f"Explain concept {i} in one sentence" for i in range(50)]
    
    start = time.time()
    tasks = [send_request(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    total_time = time.time() - start
    
    print(f"50 concurrent requests finished, total time: {total_time:.2f}s")
    print(f"Total tokens: {sum(results)}")
    print(f"Throughput: {sum(results) / total_time:.1f} tokens/s")

asyncio.run(main())
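
Launching every request at once, as above, leaves all queuing to the server. A client-side cap on in-flight requests is a common complement. The sketch below bounds concurrency with asyncio.Semaphore; fake_request stands in for the real client.chat.completions.create call, and run_bounded and the limit of 5 are illustrative assumptions.

```python
import asyncio


async def run_bounded(jobs, limit: int):
    """Run coroutine factories concurrently, at most `limit` at a time.

    Results are returned in the same order as `jobs`.
    """
    semaphore = asyncio.Semaphore(limit)

    async def guarded(job):
        async with semaphore:
            return await job()

    return await asyncio.gather(*(guarded(job) for job in jobs))


async def demo():
    state = {"active": 0, "peak": 0}

    async def fake_request(i: int) -> int:
        # Stand-in for an API call; tracks how many run at the same time.
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)
        state["active"] -= 1
        return i

    jobs = [lambda i=i: fake_request(i) for i in range(20)]
    results = await run_bounded(jobs, limit=5)
    return results, state["peak"]


results, peak = asyncio.run(demo())
print(f"completed={len(results)}, peak concurrency={peak}")
```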

4.7 API Server Configuration

4.7.1 Launch Command in Detail

vllm serve Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen-7b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --seed 42 \
    --enable-prefix-caching \
    --disable-log-requests \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192

What the flags control:

--host / --port: address and port the HTTP server binds to
--served-model-name: model name that clients pass in the model field
--max-model-len: maximum context length (prompt plus completion) in tokens
--gpu-memory-utilization: fraction of GPU memory vLLM may use
--tensor-parallel-size: number of GPUs the model is sharded across
--dtype: data type for weights and activations (auto follows the model config)
--seed: random seed
--enable-prefix-caching: reuse KV cache across requests that share a prompt prefix
--disable-log-requests: turn off per-request logging
--max-num-seqs: maximum number of sequences scheduled per iteration
--max-num-batched-tokens: maximum number of tokens processed per iteration

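4.7.2 Serving Multiple Models

A single vLLM instance loads only one base model (optionally with multiple LoRA adapters). To serve several models, start one instance per model, each on its own port. Every instance reserves GPU memory according to --gpu-memory-utilization, so instances sharing a GPU need that value lowered accordingly; otherwise give each instance its own GPU.

# Instance 1: general-purpose model
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --served-model-name qwen-7b

# Instance 2: code model
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
    --port 8001 \
    --served-model-name qwen-coder

# Instance 3: math model
vllm serve Qwen/Qwen2.5-Math-7B-Instruct \
    --port 8002 \
    --served-model-name qwen-math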

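4.7.3 Reverse Proxy with Nginx

# /etc/nginx/conf.d/vllm.conf
# Round-robin balancing assumes every upstream instance serves the SAME model
# under the same --served-model-name. Instances that serve different models
# (as in 4.7.2) need per-model routing instead, or mismatched requests get 404s.
upstream vllm_cluster {
    server 127.0.0.1:8000 weight=1;
    server 127.0.0.1:8001 weight=1;
    server 127.0.0.1:8002 weight=1;
}

server {
    listen 80;
    server_name llm.example.com;

    location /v1/ {
        proxy_pass http://vllm_cluster/v1/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;

        # Required for streaming output
        proxy_buffering off;
        proxy_cache off;

        # Timeouts (LLM generation can be slow)
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}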

4.8 Security Configuration

4.8.1 API Key Authentication

# Set the API key at startup
# vllm serve model --api-key YOUR_SECRET_KEY

# Request with the API key
curl http://localhost:8000/v1/chat/completions \
    -H "Authorization: Bearer YOUR_SECRET_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello"}]}'

# Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="YOUR_SECRET_KEY",
)

4.8.2 CORS Configuration

# Allow cross-origin access from any origin (development environments only)
vllm serve model --allowed-origins '["*"]'

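4.9 Error Handling

4.9.1 Common HTTP Status Codes

| Status code | Meaning | Common cause |
|---|---|---|
| 200 | Success | Normal response |
| 400 | Bad request | Malformed parameters |
| 401 | Unauthorized | Wrong API key |
| 404 | Not found | Model name does not match |
| 422 | Validation error | Invalid parameter value |
| 500 | Server error | Internal exception |
| 503 | Service unavailable | Queue is full |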
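
Not every status code is worth retrying: a malformed request fails the same way every time, while an overloaded server may recover. The sketch below encodes one possible policy. The set of retryable codes (including 429, the usual rate-limit status, which the table does not list) and the backoff constants are assumptions chosen for illustration, not vLLM-defined behavior.

```python
# Assumed policy: retry on rate limiting and server-side failures only.
RETRYABLE_STATUS = {429, 500, 503}


def should_retry(status: int) -> bool:
    """Return True if a request that got `status` may succeed when retried."""
    return status in RETRYABLE_STATUS


def backoff_delay(attempt: int, base_s: float = 1.0, cap_s: float = 30.0) -> float:
    """Exponential backoff: base_s * 2**attempt seconds, capped at cap_s."""
    return min(cap_s, base_s * (2 ** attempt))


for status in (400, 404, 503):
    print(status, should_retry(status))
print([backoff_delay(n) for n in range(6)])
```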

4.9.2 Error Handling Best Practices

# error_handling.py
"""API error handling example"""

import time

from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",
    timeout=60.0,
    max_retries=0,  # the SDK's built-in retries are off; call_with_retry handles them
)

def call_with_retry(prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model="qwen-7b",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            return response.choices[0].message.content

        except RateLimitError:
            # Exponential backoff: 1s, 2s, 4s, ...
            wait = 2 ** attempt
            print(f"Rate limited, retrying in {wait}s...")
            time.sleep(wait)

        except APITimeoutError:
            print(f"Request timed out (attempt {attempt + 1} of {max_attempts})")

        except APIError as e:
            print(f"API error: {e}")
            if attempt == max_attempts - 1:
                raise

    raise RuntimeError("Exceeded the maximum number of attempts")

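4.10 Business Scenarios

Scenario 1: API Gateway Integration

Frontend app → API gateway → vLLM service cluster
                    ↓
      auth / rate limiting / logging / routing

Scenario 2: Multi-Backend Routing

# Route requests to different models by request type
from openai import OpenAI

# (base_url, served model name) per type; the model names are assumed
# to follow the --served-model-name values used in 4.7.2
BACKENDS = {
    "code": ("http://coder:8000/v1", "qwen-coder"),
    "chat": ("http://chat:8000/v1", "qwen-7b"),
}
DEFAULT_BACKEND = ("http://general:8000/v1", "qwen-7b")

def call_vllm(base_url: str, model: str, prompt: str) -> str:
    client = OpenAI(base_url=base_url, api_key="none")
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

def route_request(prompt: str, model_type: str) -> str:
    base_url, model = BACKENDS.get(model_type, DEFAULT_BACKEND)
    return call_vllm(base_url, model, prompt)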

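4.11 Notes

Model name consistency: the model parameter in a request must match --served-model-name (or the model path passed to vllm serve, if that flag is not set); otherwise the server returns 404.

Streaming timeouts: set generous timeouts for streaming requests, since generation can run for tens of seconds.

Concurrency limit: vLLM's concurrency is governed by --max-num-seqs. Requests beyond that limit wait in a queue.

Context length: the total token count of a request (prompt + completion) cannot exceed --max-model-len.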
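
The context-length rule is simple arithmetic, so it can be checked before a request is sent. The following sketch assumes the prompt token count is already known (for example from a tokenizer or from the usage field of an earlier response); both helper names are illustrative.

```python
def fits_context(prompt_tokens: int, max_tokens: int, max_model_len: int) -> bool:
    """True if prompt plus requested completion fits within the context window."""
    return prompt_tokens + max_tokens <= max_model_len


def max_completion_budget(prompt_tokens: int, max_model_len: int) -> int:
    """Largest max_tokens value that still fits, never below zero."""
    return max(0, max_model_len - prompt_tokens)


# With --max-model-len 8192 as in the launch command in 4.7.1
print(fits_context(25, 300, 8192))
print(fits_context(8000, 300, 8192))
print(max_completion_budget(8000, 8192))
```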

