Chapter 05 · Visual Understanding API (Vision)

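The GPT-4o and GPT-4.1 model families accept image input, which gives them a multimodal "describe what you see" capability. This chapter covers the ways to pass images to the API and walks through scenarios such as OCR and chart analysis.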


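5.1 Supported Models

| Model | Vision support | Notes |
|---|---|---|
| GPT-4o | Yes | Multimodal flagship |
| GPT-4o mini | Yes | Cost-effective |
| GPT-4.1 | Yes | Very long context |
| GPT-4.1 mini | Yes | Lightweight version |
| o3 | Yes | Reasoning plus vision |
| o4-mini | Yes | Efficient reasoning plus vision |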

5.2 Image Input Methods

5.2.1 By URL

from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What does this image show?"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/PNG_transparency_demonstration_1.png/300px-PNG_transparency_demonstration_1.png"
                    }
                }
            ]
        }
    ]
)
print(response.choices[0].message.content)

5.2.2 By Base64

import base64
from pathlib import Path

def encode_image(image_path: str) -> str:
    """Encode a local image file as a base64 string."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def get_mime_type(image_path: str) -> str:
    """Look up the MIME type from the file extension (falls back to image/jpeg)."""
    ext = Path(image_path).suffix.lower()
    mime_map = {
        ".jpg": "image/jpeg", ".jpeg": "image/jpeg",
        ".png": "image/png", ".gif": "image/gif",
        ".webp": "image/webp",
    }
    return mime_map.get(ext, "image/jpeg")

# Send the image as a base64 data URL
image_path = "/path/to/image.jpg"
base64_image = encode_image(image_path)
mime_type = get_mime_type(image_path)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": f"data:{mime_type};base64,{base64_image}"
                    }
                }
            ]
        }
    ]
)
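
The two helpers above can also be folded into a single step that returns a ready-to-send content part. The following is a minimal sketch, not part of the original chapter: `to_data_url` and `image_part` are hypothetical names, and it assumes the standard-library `mimetypes` table knows the file extension (coverage of newer formats such as WebP depends on the Python version).

```python
import base64
import mimetypes
from pathlib import Path


def to_data_url(image_path: str, default_mime: str = "image/jpeg") -> str:
    """Build a base64 data URL for a local image (hypothetical helper)."""
    mime, _ = mimetypes.guess_type(image_path)
    # Fall back to the default when the extension is unknown or not an image.
    if mime is None or not mime.startswith("image/"):
        mime = default_mime
    encoded = base64.b64encode(Path(image_path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_part(image_path: str, detail: str = "auto") -> dict:
    """Wrap a local image as an image_url content part for a chat message."""
    return {
        "type": "image_url",
        "image_url": {"url": to_data_url(image_path), "detail": detail},
    }
```

The returned dictionary can be placed directly in the `content` list of a user message, next to a `{"type": "text", ...}` part.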

5.2.3 Multiple Images

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Compare these two images and describe the differences"},
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image1.jpg"}
                },
                {
                    "type": "image_url",
                    "image_url": {"url": "https://example.com/image2.jpg"}
                }
            ]
        }
    ]
)
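
When the number of images varies, the content list can be assembled in a loop instead of being written out by hand. This is a small sketch under that assumption; `build_multi_image_content` is a hypothetical helper name, and the URLs are the same placeholders used above.

```python
def build_multi_image_content(prompt: str, image_urls: list[str], detail: str = "auto") -> list[dict]:
    """Return a content list: one text part followed by one image part per URL."""
    if not image_urls:
        raise ValueError("at least one image URL is required")
    content = [{"type": "text", "text": prompt}]
    for url in image_urls:
        content.append({"type": "image_url", "image_url": {"url": url, "detail": detail}})
    return content


# Example: the same two-image comparison as above, built programmatically.
messages = [
    {
        "role": "user",
        "content": build_multi_image_content(
            "Compare these two images and describe the differences",
            ["https://example.com/image1.jpg", "https://example.com/image2.jpg"],
        ),
    }
]
```

The resulting `messages` list can be passed to `client.chat.completions.create` exactly as in the hand-written version.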

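5.3 The Image detail Parameter

| Value | Description | Token cost | Typical use |
|---|---|---|---|
| auto | Chosen automatically (default) | Varies | General purpose |
| low | Low resolution; the model receives a 512×512 version of the image | ~85 tokens | Simple classification, overall description |
| high | High resolution for detailed analysis | Hundreds to thousands of tokens | OCR, fine-detail analysis |
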
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Recognize all the text in the image"},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://example.com/document.jpg",
                        "detail": "high"  # OCR needs high resolution
                    }
                }
            ]
        }
    ]
)
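
To get a feel for what `detail` costs before sending a request, the token count can be estimated from the image dimensions. The sketch below assumes the tile-based accounting that OpenAI has documented for GPT-4o (a flat 85 tokens for `low`; for `high`, 85 base tokens plus 170 per 512-pixel tile after downscaling). Other models may be billed differently, and `estimate_image_tokens` is a hypothetical helper, so treat the result as a rough guide only.

```python
import math


def estimate_image_tokens(width: int, height: int, detail: str = "high") -> int:
    """Rough token estimate for one image, assuming GPT-4o style tiling rules."""
    if detail == "low":
        return 85  # flat cost, independent of image size
    # Any other value ("high" or "auto") is treated as high detail here.
    # Step 1: shrink to fit inside a 2048x2048 square (never upscale).
    scale = min(1.0, 2048 / max(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 2: shrink so the shorter side is at most 768 px (assumption: no upscaling).
    scale = min(1.0, 768 / min(width, height))
    width, height = round(width * scale), round(height * scale)
    # Step 3: count the 512x512 tiles needed to cover the image.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles


print(estimate_image_tokens(1024, 1024))         # 765
print(estimate_image_tokens(2048, 4096))         # 1105
print(estimate_image_tokens(4000, 3000, "low"))  # 85
```

Under these assumptions a 1024×1024 image is resized to 768×768 and covered by four tiles, which is where the 765 comes from.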

5.4 Practical Scenarios

Scenario 1: OCR Text Recognition

def ocr_from_image(image_path: str) -> str:
    """Extract text from an image."""
    base64_image = encode_image(image_path)
    mime_type = get_mime_type(image_path)  # helper defined in section 5.2.2

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Recognize all the text in the image and follow these rules:
1. Preserve the paragraph structure of the original
2. Render table content as Markdown tables
3. Mark any character you are unsure of with [?]"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:{mime_type};base64,{base64_image}",
                            "detail": "high"
                        }
                    }
                ]
            }
        ],
        max_tokens=4096,
    )
    return response.choices[0].message.content

Scenario 2: Chart Data Analysis

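def analyze_chart(image_path: str) -> str:
    """Analyze a chart and extract its data."""
    base64_image = encode_image(image_path)
    mime_type = get_mime_type(image_path)  # helper defined in section 5.2.2

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Analyze this chart and output:
1. Chart type (bar chart, line chart, pie chart, etc.)
2. Chart title
3. Meaning of the X axis and the Y axis
4. Key data points
5. Trend analysis
6. The extracted data, returned in JSON format"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{base64_image}"}
                    }
                ]
            }
        ],
        # JSON mode: the prompt must mention JSON, which it does above
        response_format={"type": "json_object"},
    )
    return response.choices[0].message.content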
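
Because `analyze_chart` returns the raw string, callers still have to parse it. The sketch below shows one defensive way to do that. It assumes the reply may be empty or cut off (for example when the output hits the token limit), and `parse_json_reply` and the sample keys are hypothetical, not fields the API guarantees.

```python
import json
from typing import Optional


def parse_json_reply(raw: Optional[str]) -> dict:
    """Parse a JSON-mode reply into a dict, raising ValueError on anything else."""
    if not raw:
        raise ValueError("empty reply from the model")
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Typically a truncated reply; retrying with a larger token budget may help.
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    return data


# Example with a hand-written reply (the keys are illustrative only).
sample = '{"chart_type": "bar", "data": [{"x": "Q1", "y": 120}]}'
chart = parse_json_reply(sample)
print(chart["chart_type"])  # bar
```

Raising a single exception type keeps the calling code simple: it can catch `ValueError` and decide whether to retry or surface the failure.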

Scenario 3: Generating Product Descriptions from Images

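def generate_product_description(image_path: str) -> str:
    """Generate marketing copy from a product image."""
    base64_image = encode_image(image_path)
    mime_type = get_mime_type(image_path)  # helper defined in section 5.2.2

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": "You are a professional e-commerce copywriter. Write appealing descriptions based on product images."
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": "Based on this product image, write an e-commerce product description with a title, 3-5 selling points, and a detailed description."
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{base64_image}"}
                    }
                ]
            }
        ],
        max_tokens=500,
        temperature=0.7,
    )
    return response.choices[0].message.content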

Scenario 4: Image Content Moderation

import json

def moderate_image(image_path: str) -> dict:
    """Check whether an image's content is compliant."""
    base64_image = encode_image(image_path)
    mime_type = get_mime_type(image_path)  # helper defined in section 5.2.2

    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": """Review this image and return the result in JSON format:
{
  "safe": true/false,
  "categories": ["violence", "sexual", "political", "normal"],
  "confidence": 0.0-1.0,
  "description": "short summary of the image content"
}"""
                    },
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:{mime_type};base64,{base64_image}"}
                    }
                ]
            }
        ],
        response_format={"type": "json_object"},
        temperature=0.0,
    )
    return json.loads(response.choices[0].message.content)
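
The dictionary returned by `moderate_image` comes from the model, so it is worth checking its shape before acting on it. The following sketch validates the fields requested in the prompt and applies a confidence threshold. `is_publishable` and the 0.8 threshold are assumptions made for illustration, not values from the API.

```python
def is_publishable(result: dict, min_confidence: float = 0.8) -> bool:
    """Return True only if the result is well-formed, marked safe, and confident enough."""
    safe = result.get("safe")
    confidence = result.get("confidence")
    categories = result.get("categories")

    if not isinstance(safe, bool):
        raise ValueError("'safe' must be a boolean")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(confidence, bool) or not isinstance(confidence, (int, float)):
        raise ValueError("'confidence' must be a number")
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("'confidence' must be between 0.0 and 1.0")
    if not isinstance(categories, list):
        raise ValueError("'categories' must be a list")

    return safe and confidence >= min_confidence


# Example with a hand-written result in the shape requested by the prompt.
example = {"safe": True, "categories": ["normal"], "confidence": 0.95, "description": "a red mug"}
print(is_publishable(example))  # True
```

Treating low-confidence "safe" verdicts as not publishable errs on the side of sending borderline images to human review.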

5.5 A Multimodal Chat Class

from openai import OpenAI
import base64

class VisionChat:
    """Multimodal vision chat that keeps the conversation history."""

    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model
        self.messages: list = []

    def add_text(self, text: str):
        """Add a text-only user message."""
        self.messages.append({"role": "user", "content": text})

    def add_image(self, image_source: str, text: str = "Describe this image", detail: str = "auto"):
        """Add a user message containing an image (URL or local path) plus text."""
        if image_source.startswith("http"):
            url = image_source
        else:
            with open(image_source, "rb") as f:
                b64 = base64.b64encode(f.read()).decode()
            url = f"data:image/jpeg;base64,{b64}"

        self.messages.append({
            "role": "user",
            "content": [
                {"type": "text", "text": text},
                {"type": "image_url", "image_url": {"url": url, "detail": detail}}
            ]
        })

    def chat(self) -> str:
        """Send the conversation and return the reply."""
        response = self.client.chat.completions.create(
            model=self.model,
            messages=self.messages,
            max_tokens=1000,
        )
        reply = response.choices[0].message.content
        self.messages.append({"role": "assistant", "content": reply})
        return reply

# Usage
vision = VisionChat()
vision.add_image("product.jpg", "What product is this?")
print(vision.chat())

vision.add_text("What are its main selling points?")
print(vision.chat())
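
Since `VisionChat` resends the full history on every call, each earlier image is sent again on every turn. One possible mitigation, sketched below, is to replace older images with a short text placeholder before sending. `strip_old_images` is a hypothetical helper, and the trade-off is real: the model can no longer see a stripped image and has to rely on its own earlier replies about it.

```python
import copy


def strip_old_images(messages: list, keep_last: int = 1, placeholder: str = "[image omitted]") -> list:
    """Return a copy of messages in which only the newest keep_last image messages keep their images."""
    trimmed = copy.deepcopy(messages)  # leave the caller's history untouched
    # Indices of messages whose content list contains at least one image part.
    image_idx = [
        i for i, msg in enumerate(trimmed)
        if isinstance(msg.get("content"), list)
        and any(part.get("type") == "image_url" for part in msg["content"])
    ]
    to_strip = image_idx[:-keep_last] if keep_last > 0 else image_idx
    for i in to_strip:
        trimmed[i]["content"] = [
            {"type": "text", "text": placeholder} if part.get("type") == "image_url" else part
            for part in trimmed[i]["content"]
        ]
    return trimmed


history = [
    {"role": "user", "content": [
        {"type": "text", "text": "What is this?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image1.jpg"}},
    ]},
    {"role": "assistant", "content": "A red mug."},
    {"role": "user", "content": [
        {"type": "text", "text": "And this one?"},
        {"type": "image_url", "image_url": {"url": "https://example.com/image2.jpg"}},
    ]},
]
slim = strip_old_images(history, keep_last=1)
print(slim[0]["content"][1])  # {'type': 'text', 'text': '[image omitted]'}
```

In `VisionChat.chat`, the trimmed list could be passed as `messages` while `self.messages` keeps the full history.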

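5.6 Supported Image Formats and Limits

| Format | Supported | Notes |
|---|---|---|
| JPEG | Yes | Most common |
| PNG | Yes | Supports an alpha (transparency) channel |
| GIF | Yes | Only the first frame is read |
| WebP | Yes | Small file size |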

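Limits

| Limit | Value |
|---|---|
| Images per request | No hard cap, but bounded by the token limit |
| Maximum Base64 size | ~20 MB |
| URL response timeout | The image host must respond quickly |
| Maximum resolution | ~2048×2048 (larger images are downscaled in high detail mode) |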
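
A size check before encoding avoids uploading an image that will be rejected anyway. The sketch below assumes the roughly 20 MB limit listed above. Whether that limit applies to the raw file or to the encoded payload is not stated here, so the function checks the raw size and also reports the base64 length, which is about a third larger. `check_image_size` and `MAX_IMAGE_BYTES` are hypothetical names.

```python
import os

MAX_IMAGE_BYTES = 20 * 1024 * 1024  # assumption: the ~20 MB limit from the table above


def base64_length(num_bytes: int) -> int:
    """Length in characters of the padded base64 encoding of num_bytes bytes."""
    return 4 * ((num_bytes + 2) // 3)


def check_image_size(image_path: str, limit: int = MAX_IMAGE_BYTES) -> int:
    """Raise ValueError if the file exceeds the limit; otherwise return its base64 length."""
    size = os.path.getsize(image_path)
    if size > limit:
        raise ValueError(f"{image_path} is {size} bytes, which exceeds the limit of {limit} bytes")
    return base64_length(size)
```

Calling `check_image_size(path)` before `encode_image(path)` fails fast on oversized files; resizing or recompressing the image is the usual remedy.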

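5.7 Notes

  1. Choosing detail: use high for OCR and fine-detail analysis; use low for simple classification to save tokens.
  2. Image sharpness: blurry images sharply reduce recognition accuracy.
  3. Token cost: a high-resolution image can consume thousands of tokens, so watch the cost.
  4. Privacy: redact sensitive images (ID cards, bank cards) before sending them.
  5. Multiple images: keep the total token count within the context window.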

Next chapter: 06 - Function Calling, covering tool calls, parallel execution, and structured output.