04 - OpenAI-Compatible API Service

Master vLLM's OpenAI-compatible API and integrate seamlessly with the OpenAI ecosystem.


4.1 API Service Overview

vLLM ships with a built-in HTTP server that is compatible with the OpenAI API and can act as a drop-in replacement for it. Existing OpenAI client code only needs its base_url changed to switch over to vLLM.
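
For example, a minimal sketch of the switch with the official OpenAI Python SDK (the api_key value is a placeholder; any non-empty string works when the server is started without --api-key):

# switch_to_vllm.py
from openai import OpenAI

# Point the standard OpenAI client at the local vLLM server.
client = OpenAI(
    base_url="http://localhost:8000/v1",  # was https://api.openai.com/v1
    api_key="none",                       # placeholder; see 4.8.1 for real keys
)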

4.1.1 Architecture Overview

Client code
    │
    ▼
OpenAI Python SDK / cURL / any HTTP client
    │
    ▼ (HTTP Request)
┌───────────────────────────────────┐
│  vLLM OpenAI-Compatible Server    │
│                                   │
│  /v1/chat/completions  ──┐        │
│  /v1/completions       ──┤        │
│  /v1/embeddings        ──┼──────────→ vLLM Engine
│  /v1/models            ──┤        │
│  /health               ──┘        │
└───────────────────────────────────┘

4.1.2 Supported API Endpoints

| Endpoint | Method | Function | OpenAI-Compatible |
|---|---|---|---|
| /v1/chat/completions | POST | Chat-style conversation | ✅ |
| /v1/completions | POST | Text completion | ✅ |
| /v1/embeddings | POST | Text embeddings | ✅ |
| /v1/models | GET | List models | ✅ |
| /health | GET | Health check | - |
| /tokenize | POST | Tokenize | vLLM-specific |
| /detokenize | POST | Detokenize | vLLM-specific |
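
The two vLLM-specific endpoints are useful for token accounting. A hedged sketch of calling /tokenize with the requests library (the response fields shown match recent vLLM versions and may differ in older ones):

# tokenize_example.py
import requests

resp = requests.post(
    "http://localhost:8000/tokenize",
    json={"model": "qwen-7b", "prompt": "What is deep learning?"},
)
# Recent vLLM versions return fields such as "count", "max_model_len" and "tokens".
print(resp.json())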

4.2 Chat Completions API

4.2.1 Basic Request

curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "messages": [
            {"role": "system", "content": "你是一个有用的助手。"},
            {"role": "user", "content": "什么是深度学习?"}
        ],
        "max_tokens": 300,
        "temperature": 0.7
    }'

4.2.2 Response Format

{
    "id": "chatcmpl-abc123",
    "object": "chat.completion",
    "created": 1700000000,
    "model": "qwen-7b",
    "choices": [
        {
            "index": 0,
            "message": {
                "role": "assistant",
                "content": "深度学习是机器学习的一个子领域..."
            },
            "logprobs": null,
            "finish_reason": "stop"
        }
    ],
    "usage": {
        "prompt_tokens": 25,
        "completion_tokens": 156,
        "total_tokens": 181
    }
}

4.2.3 Request Parameters

| Parameter | Type | Default | Description |
|---|---|---|---|
| model | string | - | Model name (required) |
| messages | array | - | Message list (required) |
| max_tokens | integer | null | Maximum number of tokens to generate |
| temperature | float | 1.0 | Sampling temperature (0-2) |
| top_p | float | 1.0 | Nucleus sampling |
| n | integer | 1 | Number of candidate completions |
| stream | boolean | false | Stream the output |
| stop | string/array | null | Stop sequences |
| presence_penalty | float | 0 | Presence penalty (-2 to 2) |
| frequency_penalty | float | 0 | Frequency penalty (-2 to 2) |
| logprobs | boolean | false | Return log probabilities |
| top_logprobs | integer | null | Return top-N log probabilities |
| tools | array | null | Tool definitions (function calling) |
| tool_choice | string/object | auto | Tool selection strategy |
| response_format | object | null | Response format (JSON mode) |
| seed | integer | null | Random seed (reproducibility) |
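
Most of these parameters map straight onto the client call. A brief sketch combining several of them (n, stop, and seed) in one request:

# sampling_params.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[{"role": "user", "content": "Suggest a slogan for an AI product."}],
    n=2,              # generate two candidate completions
    max_tokens=50,
    temperature=0.9,
    stop=["\n"],      # cut each candidate at the first line break
    seed=42,          # fixed seed for reproducible sampling
)
for choice in response.choices:
    print(choice.index, choice.message.content)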

4.2.4 Multi-Turn Conversation

# multi_turn.py
"""Multi-turn conversation example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Maintain the conversation history
conversation = [
    {"role": "system", "content": "You are a Python programming expert."}
]

def chat(user_input: str) -> str:
    conversation.append({"role": "user", "content": user_input})

    response = client.chat.completions.create(
        model="qwen-7b",
        messages=conversation,
        max_tokens=500,
        temperature=0.7,
    )

    assistant_msg = response.choices[0].message.content
    conversation.append({"role": "assistant", "content": assistant_msg})

    return assistant_msg

# Multi-turn dialogue
print(chat("How do I read a CSV file in Python?"))
print(chat("What optimizations help when the file is large?"))
print(chat("Can you give a complete code example?"))

4.2.5 Function Calling

# function_calling.py
"""Function calling example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get weather information for a given city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {
                        "type": "string",
                        "description": "City name, e.g. 'Beijing'",
                    },
                    "unit": {
                        "type": "string",
                        "enum": ["celsius", "fahrenheit"],
                        "description": "Temperature unit",
                    },
                },
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "user", "content": "What's the weather like in Beijing today?"}
    ],
    tools=tools,
    tool_choice="auto",
)

print(response.choices[0].message)
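
When the model decides to call the tool, the message above carries tool_calls instead of final text; the caller executes the tool and sends the result back for a second round. A minimal sketch continuing the example (get_weather is stubbed out here, and depending on the vLLM version the server may need tool-parsing flags such as --enable-auto-tool-choice to emit structured tool calls):

# Continues function_calling.py
import json

msg = response.choices[0].message
if msg.tool_calls:
    call = msg.tool_calls[0]
    args = json.loads(call.function.arguments)
    result = {"city": args["city"], "weather": "sunny", "temp_c": 25}  # stubbed tool output

    followup = client.chat.completions.create(
        model="qwen-7b",
        messages=[
            {"role": "user", "content": "What's the weather like in Beijing today?"},
            msg,  # the assistant message carrying the tool call
            {
                "role": "tool",
                "tool_call_id": call.id,
                "content": json.dumps(result),
            },
        ],
        tools=tools,
    )
    print(followup.choices[0].message.content)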

4.2.6 JSON Mode

# json_mode.py
"""Structured JSON output"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

response = client.chat.completions.create(
    model="qwen-7b",
    messages=[
        {"role": "system", "content": "You are a JSON output assistant. Always reply in JSON format."},
        {"role": "user", "content": "List 3 common sorting algorithms and their time complexities"},
    ],
    response_format={"type": "json_object"},
    max_tokens=500,
    temperature=0.3,
)

print(response.choices[0].message.content)
# Prints a valid JSON string
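
JSON mode guarantees well-formed JSON, not a fixed schema, so parse the result and inspect it before relying on specific keys; continuing the example:

# Continues json_mode.py; the key layout still depends on the model.
import json

data = json.loads(response.choices[0].message.content)
print(json.dumps(data, indent=2, ensure_ascii=False))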

4.3 Completions API

4.3.1 Basic Request

curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "prompt": "The meaning of life is",
        "max_tokens": 100,
        "temperature": 0.7
    }'

4.3.2 Python Client

# completions_example.py
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Single request
response = client.completions.create(
    model="qwen-7b",
    prompt="def quicksort(arr):\n",
    max_tokens=200,
    temperature=0.2,
    stop=["\n\ndef"],  # 遇到下一个函数定义时停止
)
print(response.choices[0].text)

# Batch request
response = client.completions.create(
    model="qwen-7b",
    prompt=["How to learn Python?", "How to learn Rust?"],
    max_tokens=100,
    temperature=0.5,
)
for choice in response.choices:
    print(f"[{choice.index}]: {choice.text[:100]}")

4.3.3 Choosing Between Chat and Completions

| Dimension | Chat Completions | Completions |
|---|---|---|
| Target models | Chat/Instruct models | Base models |
| Input format | messages array | Plain text string |
| Chat template | Applied automatically | Not applied |
| System prompt | ✅ native support | Must be concatenated manually |
| Recommended for | Most scenarios | Completion/continuation tasks |
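
For the Completions API, "concatenated manually" means rendering the chat template yourself. A sketch assuming the Hugging Face transformers library is installed:

# manual_template.py
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Render messages into the raw prompt string that /v1/completions expects.
prompt = tokenizer.apply_chat_template(
    [
        {"role": "system", "content": "You are a Python programming expert."},
        {"role": "user", "content": "How do I read a CSV file in Python?"},
    ],
    tokenize=False,
    add_generation_prompt=True,
)
print(prompt)  # send this string as the "prompt" of a completions request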

4.4 Streaming Output

4.4.1 Basic Streaming

# Streaming request with cURL
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "qwen-7b",
        "messages": [{"role": "user", "content": "Write a poem"}],
        "max_tokens": 200,
        "stream": true
    }'

4.4.2 Streaming Response Format

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"role":"assistant","content":""},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":"Spring"},"logprobs":null,"finish_reason":null}]}

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":" breeze"},"logprobs":null,"finish_reason":null}]}

...

data: {"id":"chatcmpl-1","object":"chat.completion.chunk","created":1700000000,"model":"qwen-7b","choices":[{"index":0,"delta":{"content":""},"logprobs":null,"finish_reason":"stop"}]}

data: [DONE]
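
Clients that do not use the OpenAI SDK can parse the SSE stream by hand. A minimal sketch with the requests library (str.removeprefix needs Python 3.9+):

# raw_sse.py
import json
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "qwen-7b",
        "messages": [{"role": "user", "content": "Write a poem"}],
        "max_tokens": 200,
        "stream": True,
    },
    stream=True,
)
for line in resp.iter_lines():
    if not line:
        continue  # skip blank SSE separator lines
    payload = line.decode("utf-8").removeprefix("data: ")
    if payload == "[DONE]":
        break
    chunk = json.loads(payload)
    delta = chunk["choices"][0]["delta"]
    print(delta.get("content") or "", end="", flush=True)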

4.4.3 Python Streaming Client

# stream_example.py
"""Complete streaming example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def stream_chat(prompt: str):
    """Stream a chat completion"""
    stream = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=500,
        temperature=0.7,
        stream=True,
        stream_options={"include_usage": True},  # include usage stats
    )

    for chunk in stream:
        if chunk.choices and len(chunk.choices) > 0:
            delta = chunk.choices[0].delta
            if delta.content:
                print(delta.content, end="", flush=True)

            # Check the finish reason
            if chunk.choices[0].finish_reason:
                print(f"\n[finish_reason: {chunk.choices[0].finish_reason}]")

        # The final chunk carries the usage info
        if chunk.usage:
            print(f"[usage: prompt={chunk.usage.prompt_tokens}, "
                  f"completion={chunk.usage.completion_tokens}]")

stream_chat("Explain the concept of quantum entanglement")

4.4.4 Async Streaming Client

# async_stream.py
"""Async streaming output"""

import asyncio
from openai import AsyncOpenAI

async def stream_chat_async(prompt: str):
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )

    stream = await client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=300,
        stream=True,
    )

    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

# Run several streaming requests concurrently
async def main():
    tasks = [
        stream_chat_async("What is machine learning?"),
        stream_chat_async("What is deep learning?"),
        stream_chat_async("What is reinforcement learning?"),
    ]
    await asyncio.gather(*tasks)

asyncio.run(main())

4.5 Embeddings API

4.5.1 Generating Embeddings

# embeddings.py
"""Text embedding example"""

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

# Launch the server with an embedding model, e.g.:
# vllm serve BAAI/bge-base-zh-v1.5 --task embedding --served-model-name bge-base-zh

response = client.embeddings.create(
    model="bge-base-zh",
    input=["What is artificial intelligence?", "Basic concepts of machine learning"],
)

for i, embedding in enumerate(response.data):
    print(f"Text {i}: dim={len(embedding.embedding)}, "
          f"first 5 values={embedding.embedding[:5]}")

4.6 Concurrent Request Handling

4.6.1 Multi-Threaded Concurrency

# concurrent_requests.py
"""Concurrent request example"""

import time
import concurrent.futures
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="none")

def send_request(prompt: str) -> dict:
    """Send a single request"""
    start = time.time()
    response = client.chat.completions.create(
        model="qwen-7b",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=100,
        temperature=0.7,
    )
    elapsed = time.time() - start
    return {
        "prompt": prompt[:30],
        "response": response.choices[0].message.content[:50],
        "tokens": response.usage.completion_tokens,
        "time": elapsed,
    }

# Fire 20 concurrent requests
prompts = [f"Explain concept {i} in one sentence" for i in range(20)]

start = time.time()
with concurrent.futures.ThreadPoolExecutor(max_workers=10) as executor:
    results = list(executor.map(send_request, prompts))
total_time = time.time() - start

# Stats
total_tokens = sum(r["tokens"] for r in results)
print(f"Total time: {total_time:.2f}s")
print(f"Total tokens: {total_tokens}")
print(f"Throughput: {total_tokens / total_time:.1f} tokens/s")

4.6.2 Async Concurrency

# async_concurrent.py
"""Async concurrent requests"""

import asyncio
import time
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(
        base_url="http://localhost:8000/v1",
        api_key="none",
    )

    async def send_request(prompt: str):
        response = await client.chat.completions.create(
            model="qwen-7b",
            messages=[{"role": "user", "content": prompt}],
            max_tokens=100,
        )
        return response.usage.completion_tokens

    prompts = [f"Explain concept {i} in one sentence" for i in range(50)]

    start = time.time()
    tasks = [send_request(p) for p in prompts]
    results = await asyncio.gather(*tasks)
    total_time = time.time() - start

    print(f"50 concurrent requests finished in {total_time:.2f}s")
    print(f"Total tokens: {sum(results)}")
    print(f"Throughput: {sum(results) / total_time:.1f} tokens/s")

asyncio.run(main())
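
asyncio.gather above opens all 50 requests at once. For larger bursts it is worth capping in-flight requests on the client side; the server queues excess work anyway (see 4.11), but a cap keeps connections and timeouts predictable. A sketch using asyncio.Semaphore:

# bounded_concurrency.py
import asyncio
from openai import AsyncOpenAI

async def main():
    client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="none")
    sem = asyncio.Semaphore(16)  # at most 16 requests in flight

    async def send(prompt: str) -> str:
        async with sem:  # acquired before sending, released when done
            response = await client.chat.completions.create(
                model="qwen-7b",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=50,
            )
            return response.choices[0].message.content

    results = await asyncio.gather(
        *(send(f"Explain concept {i} in one sentence") for i in range(100))
    )
    print(f"{len(results)} responses received")

asyncio.run(main())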

4.7 API Server Configuration

4.7.1 Launch Command Explained

vllm serve Qwen/Qwen2.5-7B-Instruct \
    --host 0.0.0.0 \
    --port 8000 \
    --served-model-name qwen-7b \
    --max-model-len 8192 \
    --gpu-memory-utilization 0.9 \
    --tensor-parallel-size 1 \
    --dtype auto \
    --seed 42 \
    --enable-prefix-caching \
    --disable-log-requests \
    --max-num-seqs 256 \
    --max-num-batched-tokens 8192
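
Once the server is up, /health and /v1/models make a quick readiness probe; a brief sketch with the requests library:

# readiness_check.py
import requests

BASE = "http://localhost:8000"

# /health returns 200 once the server is ready to accept requests.
assert requests.get(f"{BASE}/health").status_code == 200

# /v1/models lists the served model names (the --served-model-name values).
models = requests.get(f"{BASE}/v1/models").json()
print([m["id"] for m in models["data"]])  # e.g. ['qwen-7b']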

4.7.2 Serving Multiple Models

A single vLLM instance loads one base model (optionally alongside multiple LoRA adapters). To serve several models, start one instance per model:

# Instance 1: general-purpose model
vllm serve Qwen/Qwen2.5-7B-Instruct \
    --port 8000 \
    --served-model-name qwen-7b

# Instance 2: code model
vllm serve Qwen/Qwen2.5-Coder-7B-Instruct \
    --port 8001 \
    --served-model-name qwen-coder

# Instance 3: math model
vllm serve Qwen/Qwen2.5-Math-7B-Instruct \
    --port 8002 \
    --served-model-name qwen-math

4.7.3 Nginx Reverse Proxy

Note that a round-robin upstream like the one below is appropriate for identical replicas of the same model. For the heterogeneous instances from 4.7.2, route by model name (or by path) at the gateway instead; otherwise requests will land on backends that do not serve the requested model and fail with 404.

# /etc/nginx/conf.d/vllm.conf
upstream vllm_cluster {
    server 127.0.0.1:8000 weight=1;
    server 127.0.0.1:8001 weight=1;
    server 127.0.0.1:8002 weight=1;
}

server {
    listen 80;
    server_name llm.example.com;

    location /v1/ {
        proxy_pass http://vllm_cluster/v1/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        
        # Required for streaming output
        proxy_buffering off;
        proxy_cache off;
        
        # Timeouts (LLM generation can be slow)
        proxy_read_timeout 300s;
        proxy_send_timeout 300s;
    }
}

4.8 Security Configuration

4.8.1 API Key Authentication

# Set the API key when launching the server
# vllm serve model --api-key YOUR_SECRET_KEY

# Request carrying the API key
curl http://localhost:8000/v1/chat/completions \
    -H "Authorization: Bearer YOUR_SECRET_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen-7b", "messages": [{"role": "user", "content": "Hello"}]}'

# Python client
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="YOUR_SECRET_KEY",
)

4.8.2 CORS Configuration

# Allow cross-origin access (development environments)
vllm serve model --allowed-origins '["*"]'

4.9 Error Handling

4.9.1 Common HTTP Status Codes

| Status Code | Meaning | Common Cause |
|---|---|---|
| 200 | Success | Normal response |
| 400 | Bad request | Malformed parameters |
| 401 | Unauthorized | Wrong or missing API key |
| 404 | Not found | Model name mismatch |
| 422 | Validation error | Invalid parameter values |
| 500 | Server error | Internal exception |
| 503 | Service unavailable | Request queue full |

4.9.2 Error-Handling Best Practices

# error_handling.py
"""API error-handling example"""

import time
from openai import OpenAI, APIError, RateLimitError, APITimeoutError

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="none",
    timeout=60.0,
    max_retries=3,  # SDK-level retries, applied in addition to the manual loop below
)

def call_with_retry(prompt: str, max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model="qwen-7b",
                messages=[{"role": "user", "content": prompt}],
                max_tokens=200,
            )
            return response.choices[0].message.content

        except RateLimitError:
            wait = 2 ** attempt
            print(f"Rate limited, retrying in {wait}s...")
            time.sleep(wait)

        except APITimeoutError:
            print(f"Request timed out, retry {attempt + 1}...")

        except APIError as e:
            print(f"API error: {e}")
            if attempt == max_attempts - 1:
                raise

    raise RuntimeError("Exceeded maximum retry attempts")

4.10 Application Scenarios

Scenario 1: API Gateway Integration

Frontend app → API gateway → vLLM service cluster
                   ↓
        auth / rate limiting / logging / routing

Scenario 2: Multi-Backend Routing

# Route to a different model backend depending on the request type
def route_request(prompt: str, model_type: str):
    if model_type == "code":
        return call_vllm("http://coder:8000/v1", prompt)
    elif model_type == "chat":
        return call_vllm("http://chat:8000/v1", prompt)
    else:
        return call_vllm("http://general:8000/v1", prompt)

4.11 Notes and Caveats

Model name consistency: the model parameter in a request must match --served-model-name, otherwise the server returns 404.

Streaming timeouts: give streaming requests a generous timeout, since generation can run for tens of seconds.

Concurrency limits: vLLM's concurrency is governed by the max_num_seqs parameter; requests beyond that limit wait in the queue.

Context length: the total token count of a request (prompt + completion) must not exceed max-model-len; see the pre-check sketch below.
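
A hedged sketch of that pre-check using the vLLM-specific /tokenize endpoint (the count and max_model_len fields match recent vLLM versions but may differ in older ones):

# context_check.py
import requests

def fits_context(prompt: str, max_completion: int,
                 base: str = "http://localhost:8000") -> bool:
    """Return True if prompt tokens plus the completion budget fit in max-model-len."""
    data = requests.post(
        f"{base}/tokenize",
        json={"model": "qwen-7b", "prompt": prompt},
    ).json()
    return data["count"] + max_completion <= data["max_model_len"]

print(fits_context("What is deep learning?", max_completion=300))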


4.12 Further Reading

