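04 - OpenAI-Compatible API Server

Learn how to use vLLM's OpenAI-compatible API and plug it into the existing OpenAI ecosystem.

4.1 API Service Overview

vLLM ships with an HTTP server that implements the core OpenAI API endpoints, so it can stand in for the OpenAI API in most applications. Existing OpenAI client code usually only needs its base_url changed to point at vLLM.

4.1.1 Architecture Overview

Client code
    │
    ▼
OpenAI Python SDK / cURL / any HTTP client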
    │
    ▼ (HTTP Request)
┌───────────────────────────────────┐
│  vLLM OpenAI-Compatible Server    │
│                                   │
│  /v1/chat/completions  ──┐       │
│  /v1/completions       ──┤       │
│  /v1/embeddings        ──┼──→ vLLM Engine
│  /v1/models            ──┤       │
│  /health               ──┘       │
└───────────────────────────────────┘

4.1.2 Supported API Endpoints

Endpoint	Method	Purpose	OpenAI-compatible
`/v1/chat/completions`	POST	Chat conversation	✅
`/v1/completions`	POST	Text completion	✅
`/v1/embeddings`	POST	Text embedding	✅
`/v1/models`	GET	List models	✅
`/health`	GET	Health check	-
`/tokenize`	POST	Tokenize text	vLLM-specific
`/detokenize`	POST	Detokenize tokens	vLLM-specific

4.2 Chat Completions API

4.2.1 Basic Request

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is deep learning?"}
        ],
        "max_tokens": 300,
        "temperature": 0.7
    }'

4.2.2 Response Format

{
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "qwen-7b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "Deep learning is a subfield of machine learning..."
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 25,
        "completion_tokens": 156,
        "total_tokens": 181
    }
}

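4.2.3 Request Parameters

Parameter	Type	Default	Description
`model`	string	-	Model name (required)
`messages`	array	-	List of messages (required)
`max_tokens`	integer	null	Maximum number of tokens to generate
`temperature`	float	1.0	Sampling temperature (0 to 2)
`top_p`	float	1.0	Nucleus sampling threshold
`n`	integer	1	Number of candidate completions to generate
`stream`	boolean	false	Whether to stream the output
`stop`	string/array	null	Stop sequence(s)
`presence_penalty`	float	0	Presence penalty (-2 to 2)
`frequency_penalty`	float	0	Frequency penalty (-2 to 2)
`logprobs`	boolean	false	Whether to return log probabilities
`top_logprobs`	integer	null	Number of top log probabilities to return per token
`tools`	array	null	Tool definitions (function calling)
`tool_choice`	string/object	auto	Tool selection strategy
`response_format`	object	null	Response format (JSON mode)
`seed`	integer	null	Random seed (for reproducibility)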

4.2.4 Multi-Turn Conversation

# multi_turn.py
"""Multi-turn conversation example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Maintain the conversation history
conversation = [
    {"role": "system", "content": "You are a Python programming expert."}
]

def chat(user_input: str) -> str:
    conversation.append({"role": "user", "content": user_input})
    
    response = client.chat.completions.create(
        model="qwen-7b",
        messages=conversation,
        max_tokens=500,
        temperature=0.7,
    )
    
    assistant_msg = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_msg})
    
    return assistant_msg

# Multi-turn conversation
print(chat("How do I read a CSV file in Python?"))
print(chat("If the file is very large, how can I optimize this?"))
print(chat("Can you give a complete code example?"))
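
Because the history list grows with every turn, a long session will eventually exceed the model's context window. The following is a minimal sketch of one way to bound it, assuming a simple cap on the number of messages. The `trim_history` helper and its `max_messages` parameter are illustrative names, not part of vLLM or the OpenAI SDK, and a token-based budget would be more precise.

```python
def trim_history(messages: list[dict], max_messages: int = 10) -> list[dict]:
    """Keep all system messages plus the most recent `max_messages` other messages."""
    system = [m for m in messages if m["role"] == "system"]
    others = [m for m in messages if m["role"] != "system"]
    # A non-positive cap keeps only the system messages
    recent = others[-max_messages:] if max_messages > 0 else []
    return system + recent


# Build a toy history: one system prompt followed by six question/answer pairs
history = [{"role": "system", "content": "You are a Python programming expert."}]
for i in range(6):
    history.append({"role": "user", "content": f"question {i}"})
    history.append({"role": "assistant", "content": f"answer {i}"})

trimmed = trim_history(history, max_messages=4)
print([m["content"] for m in trimmed])
```

Passing `trim_history(conversation)` as `messages` instead of the full list would keep the request size roughly constant, at the cost of the model forgetting older turns.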

4.2.5 Function Calling

# function_calling.py
"""Function calling example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Beijing'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["city"],
            },
        },
    }
]

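# tool_choice="auto" only works if the server was started with
# --enable-auto-tool-choice and a --tool-call-parser suited to the model
# (for example `hermes` for Qwen2.5).
response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "user", "content": "What's the weather like in Beijing today?"}
    ],
    tools=tools,
    tool_choice="auto",
)

print(response.choices[0].message)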
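
When the model decides to call a tool, the returned message carries `tool_calls` rather than text, and the client is expected to run the function and send the result back as a `tool` message. Below is a rough sketch of that dispatch step. The `get_weather` implementation, the `TOOL_REGISTRY` mapping, and `run_tool_call` are hypothetical helpers for illustration, and the weather data is a hard-coded stand-in.

```python
import json


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in implementation; a real one would query a weather service."""
    return {"city": city, "temperature": 22, "unit": unit}


# Maps tool names from the `tools` schema to local Python callables
TOOL_REGISTRY = {"get_weather": get_weather}


def run_tool_call(call_id: str, name: str, arguments_json: str) -> dict:
    """Execute one tool call and build the `tool` message to send back."""
    try:
        func = TOOL_REGISTRY[name]
        result = func(**json.loads(arguments_json))
    except (KeyError, TypeError, json.JSONDecodeError) as exc:
        # Report unknown tools or bad arguments to the model instead of crashing
        result = {"error": f"{type(exc).__name__}: {exc}"}
    return {
        "role": "tool",
        "tool_call_id": call_id,
        "content": json.dumps(result, ensure_ascii=False),
    }


# With a real response you would loop over response.choices[0].message.tool_calls
# and pass call.id, call.function.name, call.function.arguments.
tool_msg = run_tool_call("call_1", "get_weather", '{"city": "Beijing"}')
print(tool_msg["content"])
```

The resulting message is appended to the conversation after the assistant message that requested the call, and a second `chat.completions.create` request lets the model phrase the final answer.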

4.2.6 JSON Mode

# json_mode.py
"""Structured JSON output"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "system", "content": "You are a JSON output assistant. Always reply in JSON."},
        {"role": "user", "content": "List 3 common sorting algorithms and their time complexities"},
    ],
    response_format={"type": "json_object"},
    max_tokens=500,
    temperature=0.3,
)

print(response.choices[0].message.content)
# Expected: a valid JSON string (it may be truncated if max_tokens is reached)
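
JSON mode constrains the output format, but a reply cut off by `max_tokens` can still be incomplete, so it is worth parsing defensively. A small sketch, where `parse_json_reply` is an illustrative helper name and the sample strings are made up:

```python
import json


def parse_json_reply(content):
    """Return the parsed JSON value, or None if `content` is missing or invalid."""
    try:
        return json.loads(content)
    except (json.JSONDecodeError, TypeError):
        # Typically a reply truncated by max_tokens, or an empty message
        return None


complete = '{"algorithms": [{"name": "quicksort", "average": "O(n log n)"}]}'
truncated = '{"algorithms": [{"name": "quick'

print(parse_json_reply(complete))
print(parse_json_reply(truncated))
```

Checking `finish_reason == "length"` on the choice before parsing is another way to detect the truncated case.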

4.3 Completions API

4.3.1 Basic Request

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "prompt": "The meaning of life is",
        "max_tokens": 100,
        "temperature": 0.7
    }'

4.3.2 Python Client

# completions_example.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Single request
response = client.completions.create(
    model="qwen-7b",
    prompt="def quicksort(arr):\n",
    max_tokens=200,
    temperature=0.2,
    stop=["\n\ndef"],  # stop when the next function definition begins
)
print(response.choices[0].text)

# Batch request
response = client.completions.create(
    model="qwen-7b",
    prompt=["How to learn Python?", "How to learn Rust?"],
    max_tokens=100,
    temperature=0.5,
)
for choice in response.choices:
    print(f"[{choice.index}]: {choice.text[:100]}")

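4.3.3 Choosing Between Chat and Completions

Dimension	Chat Completions	Completions
Suitable models	Chat/Instruct models	Base models
Input format	messages array	Plain text string (or a list of strings)
Chat template	Applied automatically	Not applied
System prompt	✅ Native support	Must be concatenated manually
Recommendation	Recommended for most scenarios	Completion/continuation scenarios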

4.4 Streaming Output

4.4.1 Basic Streaming

# Streaming request with cURL
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "messages": [{"role": "user", "content": "Write a poem"}],
        "max_tokens": 200,
        "stream": true
    }'

4.4.2 Streaming Response Format

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":"Spring"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":" breeze"},"logprobs":null,"finish_reason":null}]}

...

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop"}]}

data: [DONE]
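
For clients that cannot use the OpenAI SDK, the stream can be decoded by hand: each event is a `data:` line holding one JSON chunk, and `[DONE]` marks the end. The sketch below shows the idea on a hard-coded list of lines. `iter_content_deltas` is an illustrative helper, and in practice the lines would come from an HTTP client reading the response body incrementally.

```python
import json
from typing import Iterable, Iterator


def iter_content_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield the text fragments carried by the `data:` lines of a chat stream."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank separator lines and anything that is not data
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Abbreviated sample events; real chunks carry more fields (id, model, ...)
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant","content":""}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"}}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":" world"}}]}',
    "",
    "data: [DONE]",
]
print("".join(iter_content_deltas(sample)))
```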

4.4.3 Python Streaming Client

# stream_example.py
"""Complete streaming output example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def stream_chat(prompt: str):
    """Stream a chat completion to stdout."""
    stream = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0.7,
        stream=True,
        stream_options={"include_usage": True},  # include usage statistics
    )
    
    for chunk in stream:
        if chunk.choices:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)
            
            # Report the finish reason once generation ends
            if chunk.choices[0].finish_reason:
                print(f"\n[finish reason: {chunk.choices[0].finish_reason}]")
        
        # With include_usage, the final chunk has empty choices and carries usage
        if chunk.usage:
            print(f"[usage: prompt={chunk.usage.prompt_tokens}, "
                  f"completion={chunk.usage.completion_tokens}]")

stream_chat("Explain the concept of quantum entanglement")

4.4.4 Async Streaming Client

# async_stream.py
"""Asynchronous streaming output"""

import asyncio
from openai import AsyncOpenAI

async def stream_chat_async(prompt: str):
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )
    
    stream = await client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        stream=True,
    )
    
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

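# Run several streaming requests concurrently.
# Note: tokens from the three streams interleave on stdout.
async def main():
    tasks = [
        stream_chat_async("What is machine learning?"),
        stream_chat_async("What is deep learning?"),
        stream_chat_async("What is reinforcement learning?"),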
    ]
    await asyncio.gather(*tasks)

asyncio.run(main())

4.5 Embeddings API

4.5.1 Generating Embeddings

# embeddings.py
"""Text embedding example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

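# Start the server with an embedding model first. --served-model-name must match
# the `model` used below; the --task value differs across vLLM versions
# (newer releases use `embed`), so check `vllm serve --help`.
# vllm serve BAAI/bge-base-zh-v1.5 --task embedding --served-model-name bge-base-zh

response = client.embeddings.create(
    model="bge-base-zh",
    input=["What is artificial intelligence?", "Basic concepts of machine learning"],
)

for i, embedding in enumerate(response.data):
    print(f"text {i}: dim={len(embedding.embedding)}, "
          f"first 5 values={embedding.embedding[:5]}")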
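
Embeddings are usually compared with cosine similarity. A minimal standard-library sketch follows, using toy 3-dimensional vectors in place of real model output. `cosine_similarity` is an illustrative helper, and production code would more likely use NumPy.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # similarity is undefined for a zero vector; report 0
    return dot / (norm_a * norm_b)


# Toy vectors; real ones would come from response.data[i].embedding
v1 = [1.0, 0.0, 1.0]
v2 = [1.0, 0.0, 0.0]
print(round(cosine_similarity(v1, v2), 4))
```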

4.6 Handling Concurrent Requests

4.6.1 Multithreaded Concurrency

# concurrent_requests.py
"""Concurrent request example"""

import time
import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def send_request(prompt: str) -> dict:
    """Send a single request and record its latency."""
    start = time.time()
    response = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7,
    )
    elapsed = time.time() - start
    return {
        "prompt": prompt[:30],
        "response": response.choices[0].message.content[:50],
        "tokens": response.usage.completion_tokens,
        "time": elapsed,
    }

# Send 20 requests through a pool of 10 worker threads
prompts = [f"Explain concept {i} in one sentence" for i in range(20)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(send_request, prompts))
total_time = time.time() - start

# Summary statistics
total_tokens = sum(r["tokens"] for r in results)
print(f"Total time: {total_time:.2f}s")
print(f"Total tokens: {total_tokens}")
print(f"Throughput: {total_tokens / total_time:.1f} tokens/s")
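
Aggregate throughput hides how individual requests fared, so it can help to summarize the per-request `time` values as well. The sketch below assumes result dictionaries with a `time` field like the ones above. `summarize_latencies` is an illustrative helper and the sample timings are invented.

```python
import statistics


def summarize_latencies(results: list[dict]) -> dict:
    """Compute mean, median, and nearest-rank p95 of the `time` field."""
    times = sorted(r["time"] for r in results)
    # Nearest-rank p95 with integer arithmetic: rank = ceil(0.95 * n)
    rank = max(1, (95 * len(times) + 99) // 100)
    return {
        "mean": statistics.mean(times),
        "median": statistics.median(times),
        "p95": times[rank - 1],
    }


# Invented latencies in seconds, with one slow outlier
sample_results = [{"time": t} for t in (0.8, 1.0, 1.2, 1.1, 3.0)]
summary = summarize_latencies(sample_results)
print(summary)
```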

4.6.2 Async Concurrency

# async_concurrent.py
"""Asynchronous concurrent requests"""

import asyncio
import time
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )
    
    async def send_request(prompt: str):
        response = await client.chat.completions.create(
            model="qwen-7b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        return response.usage.completion_tokens
    
    prompts = [f"Explain concept {i} in one sentence" for i in range(50)]
    
    start = time.time()
    tasks = [send_request(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    total_time = time.time() - start
    
    print(f"50 concurrent requests finished in {total_time:.2f}s")
    print(f"Total tokens: {sum(results)}")
    print(f"Throughput: {sum(results) / total_time:.1f} tokens/s")

asyncio.run(main())
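
Launching every request at once is fine for 50 prompts, but for larger batches it is common to cap the number of requests in flight. One way to sketch that is with `asyncio.Semaphore`. Here `run_bounded` is an illustrative helper and `fake_request` is a stand-in that sleeps instead of calling the API.

```python
import asyncio

in_flight = 0  # requests currently running
peak = 0       # highest value in_flight reached


async def fake_request(i: int) -> int:
    """Stand-in for a real API call; sleeps briefly and tracks concurrency."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0.01)
    in_flight -= 1
    return i * 2


async def run_bounded(coros, limit: int):
    """Run coroutines concurrently with at most `limit` in flight at once."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(coro):
        async with semaphore:
            return await coro

    # gather() returns results in the same order as its inputs
    return await asyncio.gather(*(guarded(c) for c in coros))


results = asyncio.run(run_bounded([fake_request(i) for i in range(8)], limit=3))
print(results, "peak concurrency:", peak)
```

A client-side cap like this mainly protects the client and the network path; the server still applies its own batching limits.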

4.7 API Server Configuration

4.7.1 Launch Command in Detail

vllm serve Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen-7b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --seed 42 \
    --enable-prefix-caching \
    --disable-log-requests \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192
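
The same command can be assembled from a table that documents each flag, which is handy when launching the server from a script. The sketch below is illustrative: `SERVE_FLAGS` and `build_command` are hypothetical names, and the one-line descriptions are summaries rather than the official help text.

```python
import shlex

# Flag -> (value, what it controls). A value of None marks a boolean switch.
SERVE_FLAGS = {
    "--host": ("0.0.0.0", "listen on all network interfaces"),
    "--port": (8000, "HTTP port"),
    "--served-model-name": ("qwen-7b", "name clients pass as `model`"),
    "--max-model-len": (8192, "maximum context length (prompt + output) in tokens"),
    "--gpu-memory-utilization": (0.9, "fraction of GPU memory vLLM may use"),
    "--tensor-parallel-size": (1, "number of GPUs to shard the model across"),
    "--dtype": ("auto", "model precision; auto derives it from the model config"),
    "--seed": (42, "random seed"),
    "--enable-prefix-caching": (None, "reuse KV cache for shared prompt prefixes"),
    "--disable-log-requests": (None, "do not log individual requests"),
    "--max-num-seqs": (256, "maximum sequences scheduled per iteration"),
    "--max-num-batched-tokens": (8192, "maximum tokens scheduled per iteration"),
}


def build_command(model: str, flags: dict) -> list[str]:
    """Turn the flag table into an argv list for `vllm serve`."""
    argv = ["vllm", "serve", model]
    for flag, (value, _description) in flags.items():
        argv.append(flag)
        if value is not None:
            argv.append(str(value))
    return argv


command = build_command("Qwen/Qwen2.5-7B-Instruct", SERVE_FLAGS)
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.Popen` unchanged. Flag names change between vLLM releases, so `vllm serve --help` remains the authoritative reference.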

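4.7.2 Serving Multiple Models

A single vLLM instance loads only one base model (optionally with multiple LoRA adapters). To serve several models, start one instance per model. If the instances share a GPU, lower --gpu-memory-utilization for each so that they fit together, or pin each instance to its own GPU with CUDA_VISIBLE_DEVICES.

# Instance 1: general-purpose model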
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --served-model-name qwen-7b

# Instance 2: code model
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
    --port 8001 \
    --served-model-name qwen-coder

# Instance 3: math model
vllm serve Qwen/Qwen2.5-Math-7B-Instruct \
    --port 8002 \
    --served-model-name qwen-math

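4.7.3 Nginx Reverse Proxy

# /etc/nginx/conf.d/vllm.conf
# Round-robin only makes sense when every upstream serves the same model;
# the three ports here are assumed to be replicas, not the distinct models from 4.7.2.
upstream vllm_cluster {
    server 127.0.0.1:8000 weight=1;
    server 127.0.0.1:8001 weight=1;
    server 127.0.0.1:8002 weight=1;
}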

server {
    listen 80;
    server_name llm.example.com;

    location /v1/ {
        proxy_pass http://vllm_cluster/v1/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        
        # Required for streaming output
        proxy_buffering off;
        proxy_cache off;
        
        # Timeouts (LLM generation can be slow)
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

4.8 Security Configuration

4.8.1 API Key Authentication

# Set the API key at startup
# vllm serve model --api-key YOUR_SECRET_KEY

# Request with the API key
curl http://localhost:8000/v1/chat/completions \
    -H "Authorization: Bearer YOUR_SECRET_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello"}]}'

# Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="YOUR_SECRET_KEY",
)

4.8.2 CORS Configuration

# Allow cross-origin access from any origin (development only)
vllm serve model --allowed-origins '["*"]'

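4.9 Error Handling

4.9.1 Common HTTP Status Codes

Status code	Meaning	Common cause
200	Success	Normal response
400	Bad request	Malformed parameters
401	Unauthorized	Wrong API key
404	Not found	Model name mismatch
422	Validation error	Invalid parameter value
500	Server error	Internal exception
503	Service unavailable	Queue is full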

4.9.2 Error Handling Best Practices

# error_handling.py
"""API error handling example"""

from openai import OpenAI, APIError, RateLimitError, APITimeoutError
import time

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",
    timeout=60.0,
    max_retries=3,  # SDK-level retries; these compound with the manual loop below
)

def call_with_retry(prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model="qwen-7b",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            return response.choices[0].message.content
        
        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited; retrying in {wait}s...")
            time.sleep(wait)
        
        except APITimeoutError:
            print(f"Request timed out, retry {attempt + 1}...")
        
        except APIError as e:
            print(f"API error: {e}")
            if attempt == max_attempts - 1:
                raise
    
    raise RuntimeError("Exceeded maximum retry attempts")
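
When many clients retry on the same schedule they tend to hit the server in synchronized waves, so the fixed `2 ** attempt` wait is often randomized. A sketch of the "full jitter" variant is below. `backoff_delay` and its `base` and `cap` parameters are illustrative choices, not SDK settings.

```python
import random


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 30.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    upper = min(cap, base * (2 ** attempt))
    return random.uniform(0, upper)


for attempt in range(5):
    limit = min(30.0, 2 ** attempt)
    print(f"attempt {attempt}: wait up to {limit:.0f}s, chose {backoff_delay(attempt):.2f}s")
```

In a retry loop, `time.sleep(backoff_delay(attempt))` would replace the fixed wait.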

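4.10 Business Scenarios

Scenario 1: API Gateway Integration

Frontend app → API gateway → vLLM service cluster
                    ↓
  auth / rate limiting / logging / routing

Scenario 2: Multi-Backend Routing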

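# Route a request to a different backend based on the requested model type.
# The backend URLs are placeholders; call_vllm is a minimal helper for illustration.
from openai import OpenAI

BACKENDS = {"code": "http://coder:8000/v1", "chat": "http://chat:8000/v1"}
DEFAULT_BACKEND = "http://general:8000/v1"

def call_vllm(base_url: str, prompt: str) -> str:
    client = OpenAI(base_url=base_url, api_key="none")
    model = client.models.list().data[0].id  # use whichever model the backend serves
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}])
    return response.choices[0].message.content

def route_request(prompt: str, model_type: str) -> str:
    return call_vllm(BACKENDS.get(model_type, DEFAULT_BACKEND), prompt)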

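4.11 Notes

Model name consistency: the model parameter in a request must match --served-model-name (or the model path passed to vllm serve if that flag is not set); otherwise the server returns 404.

Streaming timeouts: set a generous timeout for streaming requests, since generation can run for tens of seconds.

Concurrency limit: vLLM's concurrency is governed by --max-num-seqs. Requests beyond that limit wait in a queue.

Context length: the total token count of a request (prompt + completion) must not exceed --max-model-len.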
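
The context-length rule can be checked on the client before sending a request. The sketch below assumes the prompt's token count is already known, for example from the `/tokenize` endpoint or from `usage.prompt_tokens` of an earlier response. `clamp_max_tokens` is an illustrative helper.

```python
def clamp_max_tokens(prompt_tokens: int, requested: int, max_model_len: int) -> int:
    """Return the largest max_tokens <= requested that fits in the context window."""
    available = max_model_len - prompt_tokens
    if available <= 0:
        raise ValueError(
            f"prompt uses {prompt_tokens} tokens, which leaves no room "
            f"within max_model_len={max_model_len}"
        )
    return min(requested, available)


# A long prompt leaves only 292 tokens of headroom; a short one is unaffected
print(clamp_max_tokens(prompt_tokens=7900, requested=500, max_model_len=8192))
print(clamp_max_tokens(prompt_tokens=1000, requested=500, max_model_len=8192))
```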
