HTTP/2 与 RPC 精讲教程 / 04 - 头部压缩 HPACK

第 04 章：头部压缩 HPACK

用算法对抗冗余——HTTP/2 的带宽优化利器

4.1 为什么需要头部压缩

在 HTTP/1.1 中，每个请求和响应都携带完整的文本头部。对于典型的 Web 应用，大量请求携带几乎相同的头部信息，造成严重的带宽浪费。

4.1.1 头部冗余的量化分析

典型 HTTP/1.1 请求头部（约 500-800 字节）：

GET /api/users HTTP/1.1\r\n
Host: api.example.com\r\n                           # 24 字节
User-Agent: Mozilla/5.0 (Windows NT 10.0; ...)      # ~120 字节
Accept: application/json, text/plain, */*            # ~40 字节
Accept-Language: zh-CN,zh;q=0.9,en;q=0.8            # ~35 字节
Accept-Encoding: gzip, deflate, br                   # ~30 字节
Authorization: Bearer eyJhbGciOiJIUzI1NiIs...       # ~200 字节（JWT）
Cookie: session_id=abc123; theme=dark; lang=zh-CN    # ~60 字节
Connection: keep-alive                                # ~24 字节

总计：约 533 字节 × 每个请求

页面加载 100 个资源 = 约 53KB 仅用于头部
其中 80% 的头部内容在请求之间完全相同！

4.1.2 gzip 为什么不适合头部压缩

问题	说明
安全漏洞	CRIME 攻击：通过观察压缩后大小变化推断明文内容
上下文隔离	gzip 需要连续数据流，而 HTTP 请求是独立的
CPU 开销	全量压缩/解压的 CPU 消耗较高
内存占用	需要维护较大的压缩上下文

⚠️ CRIME 攻击原理简述：

攻击者可以猜测 Authorization 头部中的 Token：
1. 受害者发送携带 Token 的请求
2. 攻击者控制请求中的部分数据（如 URL 参数）
3. 通过暴力枚举猜测 Token 前缀
4. 观察压缩后请求大小变化
5. 如果猜测正确，重复的模式会使请求变小
6. 逐字符恢复完整 Token

4.2 HPACK 设计原理

HPACK（HTTP/2 Header Compression）是专为 HTTP/2 设计的头部压缩方案。它通过三种机制实现高效压缩：

机制	说明	效果
静态表（Static Table）	预定义的 61 个常用头部键值对	消除最常见头部的传输
动态表（Dynamic Table）	连接级别新增的头部条目	消除重复头部的传输
Huffman 编码	对字符串进行变长编码	压缩字符串长度

4.3 静态表（Static Table）

静态表是一张全局共享的、只读的查找表，包含 61 个最常用的头部键值对。

4.3.1 完整静态表

索引	头部名称	头部值
1	:authority	（空）
2	:method	GET
3	:method	POST
4	:path	/
5	:path	/index.html
6	:scheme	http
7	:scheme	https
8	:status	200
9	:status	204
10	:status	206
11	:status	304
12	:status	400
13	:status	404
14	:status	500
15	accept-charset	（空）
16	accept-encoding	gzip, deflate
17	accept-language	（空）
18-21	accept-ranges, access-control-*, age, allow	（空）
22-28	authorization, cache-control, content-disposition, …	（空）
29	content-type	（空）
30-61	cookie, date, etag, expect, expires, from, …	（空）

💡 设计特点：

索引 1-15 包含键和值（最常见的键值对）
索引 16-61 仅包含键（值变化太大，不值得预定义）
所有头部名称都是小写的

4.3.2 静态表查询示例

# 静态表实现
STATIC_TABLE = {
    1: (":authority", ""),
    2: (":method", "GET"),
    3: (":method", "POST"),
    4: (":path", "/"),
    5: (":path", "/index.html"),
    6: (":scheme", "http"),
    7: (":scheme", "https"),
    8: (":status", "200"),
    9: (":status", "204"),
    10: (":status", "206"),
    11: (":status", "304"),
    12: (":status", "400"),
    13: (":status", "404"),
    14: (":status", "500"),
    15: ("accept-charset", ""),
    16: ("accept-encoding", "gzip, deflate"),
    # ... 完整 61 个条目
}

def find_in_static_table(name: str, value: str = "") -> int:
    """在静态表中查找匹配的条目"""
    for idx, (n, v) in STATIC_TABLE.items():
        if n == name and (v == value or value == ""):
            return idx
    return -1

# 示例查询
idx = find_in_static_table(":method", "GET")
print(f":method:GET 在静态表索引 {idx}")  # 输出: 2

idx = find_in_static_table(":status", "200")
print(f":status:200 在静态表索引 {idx}")  # 输出: 8

4.4 动态表（Dynamic Table）

动态表是一个连接级别的先进先出（FIFO）队列，存储在通信过程中出现的新头部条目。

4.4.1 动态表工作原理

初始状态：动态表为空
最大大小：由 SETTINGS_HEADER_TABLE_SIZE 决定（默认 4096 字节）

请求 1 发送：
  HEADERS: :method=GET, :path=/api/users, authorization=Bearer abc123
  
  编码结果：
  - :method=GET → 静态表索引 2（直接引用）
  - :path=/api/users → 动态表新增索引 62
  - authorization=Bearer abc123 → 动态表新增索引 63
  
  动态表状态：
  索引 62: :path = /api/users
  索引 63: authorization = Bearer abc123

请求 2 发送：
  HEADERS: :method=GET, :path=/api/orders, authorization=Bearer abc123
  
  编码结果：
  - :method=GET → 静态表索引 2
  - :path=/api/orders → 动态表新增索引 64
  - authorization=Bearer abc123 → 动态表索引 63（已有！不传输）
  
  节省了 authorization 头部的传输！

4.4.2 动态表实现

from collections import OrderedDict
from typing import Optional, Tuple

class DynamicTable:
    def __init__(self, max_size: int = 4096):
        self.max_size = max_size
        self.current_size = 0
        self.entries: OrderedDict[int, Tuple[str, str]] = OrderedDict()
        self.next_index = 62  # 静态表最大索引 + 1
    
    def entry_size(self, name: str, value: str) -> int:
        """计算条目占用的字节数"""
        return len(name.encode()) + len(value.encode()) + 32  # 32 字节开销
    
    def add(self, name: str, value: str):
        """添加新条目"""
        entry_size = self.entry_size(name, value)
        
        # 如果单个条目超过最大大小，清空表
        if entry_size > self.max_size:
            self.entries.clear()
            self.current_size = 0
            return
        
        # 驱逐旧条目以腾出空间
        while self.current_size + entry_size > self.max_size:
            self._evict()
        
        # 插入新条目（头部插入，最近使用的在前）
        self.entries[self.next_index] = (name, value)
        self.next_index += 1
        self.current_size += entry_size
    
    def _evict(self):
        """驱逐最旧的条目"""
        if self.entries:
            _, (name, value) = self.entries.popitem(last=False)
            self.current_size -= self.entry_size(name, value)
    
    def find(self, name: str, value: str = "") -> Optional[int]:
        """查找匹配的条目"""
        for idx, (n, v) in self.entries.items():
            if n == name and (v == value or value == ""):
                return idx
        return None
    
    def resize(self, new_max_size: int):
        """调整表大小"""
        while self.current_size > new_max_size:
            self._evict()
        self.max_size = new_max_size

# 使用示例
table = DynamicTable(max_size=256)
table.add(":path", "/api/users")
table.add("authorization", "Bearer abc123")

idx = table.find("authorization", "Bearer abc123")
print(f"找到 authorization 在索引 {idx}")  # 62 或 63

table.resize(128)  # 缩小表大小，驱逐旧条目

4.5 索引地址空间

HPACK 的索引地址空间由静态表和动态表组成：

索引地址空间：

  1          62         62+动态表大小
  │          │          │
  ▼          ▼          ▼
┌───┬────────┬──────────┐
│静态表(61) │  动态表    │
└───┴────────┴──────────┘
  1..61      62..N

查找顺序：先查静态表，再查动态表

4.6 Huffman 编码

HPACK 使用 Huffman 编码对头部值中的字符串进行压缩。

4.6.1 Huffman 编码原理

原理：高频字符使用短编码，低频字符使用长编码

HTTP 头部中字符频率统计：
字符    频率      Huffman 编码（示例）
'a'    高频      000
'e'    高频      001
'o'    中频      010
空格   高频      011
'a'-'z'         10xxxx
'A'-'Z'         11xxxxx

4.6.2 Huffman 编码实现

# HPACK Huffman 编码简化示例
# 注意：实际 HPACK 使用 RFC 7541 定义的 Huffman 表

class HuffmanEncoder:
    """简化的 Huffman 编码器（演示用）"""
    
    # 简化的 Huffman 表（仅演示部分字符）
    HUFFMAN_TABLE = {
        'a': '000',    # 3 bits
        'e': '001',    # 3 bits
        'i': '010',    # 3 bits
        'o': '011',    # 3 bits
        't': '100',    # 3 bits
        'n': '101',    # 3 bits
        's': '1100',   # 4 bits
        'r': '1101',   # 4 bits
        ' ': '1110',   # 4 bits (空格)
        'h': '11110',  # 5 bits
        'l': '111110', # 6 bits
        'd': '111111', # 6 bits
    }
    
    # 反向表
    REVERSE_TABLE = {v: k for k, v in HUFFMAN_TABLE.items()}
    
    @classmethod
    def encode(cls, text: str) -> str:
        result = []
        for ch in text:
            if ch in cls.HUFFMAN_TABLE:
                result.append(cls.HUFFMAN_TABLE[ch])
            else:
                # 不在表中的字符使用 8 位原始编码
                result.append(format(ord(ch), '08b'))
        return ''.join(result)
    
    @classmethod
    def decode(cls, bits: str) -> str:
        result = []
        i = 0
        while i < len(bits):
            for length in range(3, 9):  # 尝试不同长度
                chunk = bits[i:i+length]
                if chunk in cls.REVERSE_TABLE:
                    result.append(cls.REVERSE_TABLE[chunk])
                    i += length
                    break
            else:
                # 8 位原始字符
                result.append(chr(int(bits[i:i+8], 2)))
                i += 8
        return ''.join(result)

# 示例
text = "the cat"
encoded = HuffmanEncoder.encode(text)
decoded = HuffmanEncoder.decode(encoded)

print(f"原文: {text}")
print(f"编码: {encoded}")
print(f"原文比特数: {len(text) * 8}")
print(f"压缩后比特数: {len(encoded)}")
print(f"压缩率: {len(encoded) / (len(text) * 8) * 100:.1f}%")
print(f"解码: {decoded}")

4.7 编码类型

HPACK 定义了三种头部字段表示方式：

4.7.1 索引头部字段（Indexed Header Field）

完整匹配静态表或动态表中的条目，仅需传输索引号。

格式：
  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
| 1 |        Index (7+)         |
+---+---------------------------+

最高位为 1 表示这是索引头部字段
7 位可表示索引 1-127，超出则用扩展编码

示例：:method=GET → 索引 2 → 编码为 0x82（10000010）

4.7.2 字面量头部字段（Literal Header Field）

不匹配表中的条目，需传输名称和值。

格式（带索引名）：
  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
| 0 | 1 |      Index (6+)       |
+---+---+-----------------------+
| H |     Value Length (7+)      |
+---+---------------------------+
| Value String (Length octets)   |
+-------------------------------+

H=0: 不使用 Huffman 编码
H=1: 使用 Huffman 编码

4.7.3 带索引更新的字面量头部字段

新增到动态表的头部字段。

格式（不带索引名，新增到表）：
  0   1   2   3   4   5   6   7
+---+---+---+---+---+---+---+---+
| 0 | 0 | 0 | 0 |       0       |
+---+---+---+---+---+-----------+
| H |     Name Length (7+)       |
+---+---------------------------+
| Name String (Length octets)    |
+---+---------------------------+
| H |     Value Length (7+)      |
+---+---------------------------+
| Value String (Length octets)   |
+-------------------------------+

4.8 完整编码示例

from typing import List

class HPACKEncoder:
    def __init__(self, dynamic_table_size: int = 4096):
        self.dynamic_table = DynamicTable(dynamic_table_size)
    
    def encode_header(self, name: str, value: str) -> bytes:
        """编码单个头部字段"""
        # 1. 检查静态表
        static_idx = find_in_static_table(name, value)
        if static_idx > 0:
            # 完整匹配静态表 → 索引头部字段
            return self._encode_indexed(static_idx)
        
        # 2. 检查动态表
        dynamic_idx = self.dynamic_table.find(name, value)
        if dynamic_idx:
            return self._encode_indexed(dynamic_idx)
        
        # 3. 检查名称是否在表中（仅名称匹配）
        name_idx = self._find_name(name)
        
        # 4. 编码为字面量
        encoded = self._encode_literal(name, value, name_idx)
        
        # 5. 添加到动态表
        self.dynamic_table.add(name, value)
        
        return encoded
    
    def _encode_indexed(self, index: int) -> bytes:
        """编码索引头部字段"""
        if index < 128:
            return bytes([0x80 | index])
        else:
            # 多字节编码（简化处理）
            return bytes([0x80 | (index & 0x7F)])
    
    def _encode_literal(self, name: str, value: str, name_idx: int = -1) -> bytes:
        """编码字面量头部字段"""
        result = bytearray()
        
        if name_idx > 0:
            # 使用索引的名称
            result.append(0x40 | (name_idx & 0x3F))
        else:
            # 名称也需要编码
            result.append(0x00)
            result.extend(self._encode_string(name))
        
        result.extend(self._encode_string(value))
        return bytes(result)
    
    def _encode_string(self, s: str) -> bytes:
        """编码字符串（带长度前缀）"""
        encoded = s.encode('utf-8')
        length = len(encoded)
        
        if length < 128:
            return bytes([length]) + encoded
        else:
            # 多字节长度编码
            return bytes([0x80 | (length & 0x7F)]) + encoded
    
    def _find_name(self, name: str) -> int:
        """仅在表中查找名称（不匹配值）"""
        for idx, (n, _) in STATIC_TABLE.items():
            if n == name:
                return idx
        return self.dynamic_table.find(name) or -1

# 使用示例
encoder = HPACKEncoder()

# 编码一组请求头部
headers = [
    (":method", "GET"),
    (":path", "/api/users"),
    (":scheme", "https"),
    (":authority", "api.example.com"),
    ("accept", "application/json"),
    ("authorization", "Bearer mytoken123"),
]

total_bytes = 0
for name, value in headers:
    encoded = encoder.encode_header(name, value)
    total_bytes += len(encoded)
    print(f"{name}: {value}")
    print(f"  编码: {encoded.hex()}")
    print(f"  大小: {len(encoded)} 字节")

print(f"\n总计: {total_bytes} 字节")
print(f"文本格式估算: {sum(len(n)+len(v) for n,v in headers)} 字节")

4.9 安全问题与对策

4.9.1 时序攻击（Timing Attack）

⚠️ 动态表的信息泄露风险：

攻击场景：
1. 攻击者观察到请求 A 的 HEADERS 帧大小为 100 字节
2. 攻击者猜测 authorization 头部为 "Bearer abc123"
3. 攻击者诱导受害者发送相同头部
4. 如果请求 B 的 HEADERS 帧更小（动态表命中），
   则说明攻击者的猜测正确

对策：
- 定期清空动态表
- 使用 SETTINGS_HEADER_TABLE_SIZE = 0 禁用动态表
- 敏感头部设置为不入表（never indexed）

4.9.2 敏感头部处理

# 敏感头部标记
SENSITIVE_HEADERS = {
    "authorization",
    "cookie",
    "set-cookie",
    "proxy-authorization",
}

class SecureHPACKEncoder(HPACKEncoder):
    def encode_header(self, name: str, value: str) -> bytes:
        if name.lower() in SENSITIVE_HEADERS:
            # never-indexed：不添加到动态表
            return self._encode_never_indexed(name, value)
        return super().encode_header(name, value)
    
    def _encode_never_indexed(self, name: str, value: str) -> bytes:
        """never-indexed 编码"""
        result = bytearray()
        result.append(0x10)  # never-indexed 标志
        result.extend(self._encode_string(name))
        result.extend(self._encode_string(value))
        return bytes(result)

4.9.3 禁用动态表

// 禁用 HPACK 动态表（最高安全级别）
package main

import (
	"net/http"
)

func main() {
	server := &http.Server{
		Addr: ":8443",
		Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
			w.Write([]byte("Hello"))
		}),
		// 通过自定义 HTTP/2 Transport 设置动态表大小为 0
	}
	
	// 方法 1：客户端禁用动态表
	// transport := &http2.Transport{
	//     MaxDecoderHeaderTableSize: 0,
	//     MaxEncoderHeaderTableSize: 0,
	// }
	
	_ = server
}

4.10 业务场景：大规模 API 网关

场景：API 网关处理每秒 10 万个请求

无头部压缩（HTTP/1.1）：
- 每请求头部大小：~600 字节
- 每秒带宽：600 × 100,000 = 60 MB/s = 480 Mbps（仅头部）

HPACK 压缩后（HTTP/2）：
- 首次请求：~200 字节
- 后续请求（动态表命中）：~30 字节
- 平均每秒带宽：30 × 100,000 = 3 MB/s = 24 Mbps

节省带宽：95%！

// API 网关 HPACK 调优
package main

import (
	"golang.org/x/net/http2"
	"net/http"
)

func main() {
	server := &http.Server{
		Addr: ":8443",
	}
	
	// 自定义 HTTP/2 配置
	h2Server := &http2.Server{
		// 动态表大小（根据内存预算调整）
		MaxDecoderHeaderTableSize: 8192,
		MaxEncoderHeaderTableSize: 8192,
		
		// 最大帧大小
		MaxReadFrameSize: 1 << 20, // 1MB
		
		// 并发流数
		MaxConcurrentStreams: 1000,
	}
	
	_ = h2Server
	_ = server
}

4.11 注意事项

⚠️ 动态表大小设置：

过小：压缩效果差，重复头部仍需传输
过大：消耗服务器内存（每个连接独立一份）
建议：根据可用内存和并发连接数计算

⚠️ Huffman 编码的开销：

短字符串的 Huffman 编码可能比原文更长
编码器应比较编码前后大小，选择更小的方案
大多数实现已自动处理此问题

💡 最佳实践：

使用小写的头部名称（HPACK 规范要求）
避免过长的头部值（如超长 Cookie）
敏感头部使用 never-indexed 标记
定期审计动态表的内存使用

4.12 扩展阅读

📖 RFC 7541 - HPACK: Header Compression for HTTP/2
📖 RFC 7541 Appendix B - Huffman Code Table
📖 CRIME Attack Paper
📖 QPACK: HTTP/3 头部压缩

← 第 03 章 - 多路复用 | 第 05 章 - 服务器推送 →