Data Compression and Encoding
Data compression is one of the core pillars of ClickHouse's performance. Columnar storage is naturally suited to compression: all values in a column share the same type, and adjacent values tend to be similar, so compression ratios far exceed those of row-oriented storage.
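The "adjacent values are similar" point can be illustrated outside ClickHouse with a small Python sketch, using the standard library's zlib as a stand-in for LZ4/ZSTD (neither ships with Python): the same values compress far better when similar values sit next to each other, as they do inside a sorted column.

```python
import random
import struct
import zlib

# The same 50,000 values, packed once in sorted order (column-like:
# adjacent values are similar) and once shuffled (no local similarity).
values = [1_700_000_000 + i for i in range(50_000)]
shuffled = values[:]
random.Random(0).shuffle(shuffled)

def pack(vs):
    """Serialize values as little-endian 64-bit integers."""
    return b"".join(struct.pack("<q", v) for v in vs)

sorted_size = len(zlib.compress(pack(values)))
shuffled_size = len(zlib.compress(pack(shuffled)))

# Sorted (column-like) data compresses noticeably better
print(sorted_size < shuffled_size)  # True
```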
1. Compression Algorithm Overview
ClickHouse ships with several compression algorithms that can be configured flexibly at the column level.
| Algorithm | Compression speed | Decompression speed | Ratio | Typical use |
|---|---|---|---|---|
| LZ4 (default) | Very fast | Very fast | Moderate | General purpose, write-heavy workloads |
| ZSTD | Slower | Fast | High | Storage-sensitive workloads, cold data |
| None | — | — | — | Pre-compressed data or temporary tables |
-- Inspect compression-related settings
SELECT name, value
FROM system.settings
WHERE name LIKE '%compression%';
-- LZ4 is the default codec, so this table needs no explicit CODEC clause;
-- the SETTINGS below only tune the compression block sizes
CREATE TABLE lz4_table
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY event_date
SETTINGS min_compress_block_size = 65536,
         max_compress_block_size = 1048576;
-- There is no table-level CODEC clause; codecs are declared per column
CREATE TABLE zstd_table
(
    event_date Date,
    user_id UInt64,
    data String CODEC(ZSTD(3))
)
ENGINE = MergeTree()
ORDER BY event_date;
-- Change a column's codec (existing parts are recompressed only as merges rewrite them)
ALTER TABLE zstd_table MODIFY COLUMN data CODEC(ZSTD(3));
2. Column Compression CODECs
ClickHouse also supports specialized codecs declared at the column level; they can be chained with a general-purpose algorithm (e.g. CODEC(Delta, ZSTD)) to improve the ratio further.
2.1 Supported codecs
| CODEC | Applicable types | How it works | Best for |
|---|---|---|---|
| Delta(N) | Integers / dates / times | Stores differences between adjacent values | Time-series data |
| DoubleDelta | Integers / dates / times | Stores differences of differences | Linearly increasing data |
| Gorilla | Float32 / Float64 | XOR encoding, as in Facebook's Gorilla paper | Time-series floats |
| T64 | Integer types | Packs 64 values into a bit matrix and transposes it | Integers with a small value range |
| FPC | Float64 | Predictive floating-point compression | Large volumes of floats |
| LZ4 | Any type | General-purpose LZ4 | General purpose |
| ZSTD(N) | Any type | ZSTD, level N = 1–22 | High compression ratio |
| NONE | Any type | No compression | Pre-compressed external data |
2.2 Delta + ZSTD (best for time-series data)
-- Codecs are declared per column, inside the column definitions
CREATE TABLE timeseries_data
(
    timestamp    DateTime64(3) CODEC(Delta(8), ZSTD(1)),  -- timestamp deltas + ZSTD
    host         LowCardinality(String),
    cpu_usage    Float64 CODEC(Gorilla, ZSTD(1)),         -- float XOR + ZSTD
    memory_usage Float64 CODEC(Gorilla, ZSTD(1)),
    disk_read    UInt64 CODEC(Delta(8), ZSTD(1)),         -- integer deltas + ZSTD
    disk_write   UInt64 CODEC(Delta(8), ZSTD(1)),
    network_in   UInt64 CODEC(Delta(8), ZSTD(1)),
    network_out  UInt64 CODEC(Delta(8), ZSTD(1))
)
ENGINE = MergeTree()
ORDER BY (host, timestamp);
-- Insert simulated data
INSERT INTO timeseries_data
SELECT
toDateTime('2026-05-01 00:00:00') + toIntervalSecond(number * 10) AS timestamp,
'server-01' AS host,
50 + sin(number / 100) * 30 AS cpu_usage,
60 + cos(number / 50) * 20 AS memory_usage,
toUInt64(rand() % 1000000) AS disk_read,
toUInt64(rand() % 500000) AS disk_write,
toUInt64(rand() % 10000000) AS network_in,
toUInt64(rand() % 5000000) AS network_out
FROM numbers(8640); -- one day of data, one row every 10 seconds
-- Inspect per-column compression
SELECT
name,
formatReadableSize(data_uncompressed_bytes) AS uncompressed,
formatReadableSize(data_compressed_bytes) AS compressed,
round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio,
formatReadableSize(marks_bytes) AS marks_size
FROM system.columns
WHERE table = 'timeseries_data' AND database = currentDatabase();
Expected compression ratios (illustrative):
| Column | Uncompressed | Compressed | Ratio |
|---|---|---|---|
| timestamp | 69.12 KB | 1.23 KB | 56.2x |
| cpu_usage | 69.12 KB | 8.45 KB | 8.2x |
| memory_usage | 69.12 KB | 9.12 KB | 7.6x |
| disk_read | 69.12 KB | 15.23 KB | 4.5x |
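Why Delta helps so much here can be sketched outside ClickHouse as well. In this illustrative Python snippet zlib stands in for ZSTD: once fixed-interval timestamps are delta-encoded, the stream is almost one repeated value and compresses dramatically better.

```python
import struct
import zlib

# 8640 timestamps at a fixed 10-second interval, as 64-bit integers
ts = [1_700_000_000 + 10 * i for i in range(8640)]
raw = b"".join(struct.pack("<q", t) for t in ts)

# Delta encoding: keep the first value, then store differences only
deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
delta_raw = b"".join(struct.pack("<q", d) for d in deltas)

# After Delta, nearly every stored value is the constant 10
print(len(zlib.compress(raw)), ">", len(zlib.compress(delta_raw)))
```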
2.3 T64 (integers with a small value range)
-- T64 suits integer columns whose values span a narrow range
CREATE TABLE temperature_sensors
(
    sensor_id    UInt32,
    reading_time DateTime,
    temperature  Int16 CODEC(T64, ZSTD(1)),  -- small range (roughly -50 to 50 degrees)
    humidity     UInt8 CODEC(T64, LZ4)       -- humidity 0-100
)
ENGINE = MergeTree()
ORDER BY (sensor_id, reading_time);
-- Inspect codec effectiveness (the column-name field in system.columns is `name`)
SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE table = 'temperature_sensors' AND database = currentDatabase();
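The transposition trick behind T64 can be sketched in a few lines of Python (a toy model, not the actual on-disk format): pack 64 values into a 64×64 bit matrix and transpose it, so the unused high bits of small-range integers collapse into all-zero rows that later compression stages erase almost for free.

```python
# 64 humidity-like readings in the range 0-100 (all fit in 7 bits)
values = [(v * 7) % 101 for v in range(64)]

# Transpose the 64x64 bit matrix: row b collects bit b of every value
transposed = []
for b in range(64):
    row = 0
    for i, v in enumerate(values):
        row |= ((v >> b) & 1) << i
    transposed.append(row)

# Bits 7-63 are unused by every value, so those rows are all zeros
zero_rows = sum(1 for row in transposed if row == 0)
print(zero_rows)  # 57
```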
2.4 Gorilla (floating-point data)
-- Gorilla (from Facebook's Gorilla paper): XOR adjacent values, then variable-length-encode the result
-- Best suited to floats that change slowly
CREATE TABLE stock_prices
(
    trade_date  Date,
    stock_code  LowCardinality(String),
    open_price  Float64 CODEC(Gorilla, ZSTD(1)),
    close_price Float64 CODEC(Gorilla, ZSTD(1)),
    high_price  Float64 CODEC(Gorilla, ZSTD(1)),
    low_price   Float64 CODEC(Gorilla, ZSTD(1)),
    volume      UInt64 CODEC(Delta(8), ZSTD(1))
)
ENGINE = MergeTree()
ORDER BY (stock_code, trade_date);
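The core Gorilla observation can be shown with a short Python sketch (the real codec adds variable-length encoding of the leading-zero and meaningful-bit counts, omitted here): XORing the bit patterns of adjacent, slowly changing floats leaves mostly zero bits.

```python
import struct

def xor_bits(a: float, b: float) -> int:
    """XOR the raw 64-bit IEEE-754 patterns of two doubles."""
    ai, = struct.unpack("<Q", struct.pack("<d", a))
    bi, = struct.unpack("<Q", struct.pack("<d", b))
    return ai ^ bi

# Identical neighbours XOR to exactly zero: a single bit suffices
print(xor_bits(101.25, 101.25))  # 0

# Close neighbours differ only in a narrow band of mantissa bits,
# so most of the 64 bits of the XOR result are zero
diff = xor_bits(101.25, 101.26)
print(f"{bin(diff).count('1')} of 64 bits set")
```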
3. Compression Ratio Benchmarks
3.1 Comparing algorithms
-- Test tables that differ only in codec (codecs attach to the columns)
CREATE TABLE compress_test_lz4
(
    event_date  Date CODEC(LZ4),
    value       UInt64 CODEC(LZ4),
    description String CODEC(LZ4)
) ENGINE = MergeTree() ORDER BY event_date;
CREATE TABLE compress_test_zstd
(
    event_date  Date CODEC(ZSTD(1)),
    value       UInt64 CODEC(ZSTD(1)),
    description String CODEC(ZSTD(1))
) ENGINE = MergeTree() ORDER BY event_date;
CREATE TABLE compress_test_zstd_high
(
    event_date  Date CODEC(ZSTD(9)),
    value       UInt64 CODEC(ZSTD(9)),
    description String CODEC(ZSTD(9))
) ENGINE = MergeTree() ORDER BY event_date;
-- Insert identical data into all three tables
INSERT INTO compress_test_lz4 SELECT
toDate('2026-01-01') + toIntervalDay(number % 365),
rand64(),
concat('Event description for record number ', toString(number), ' with some random text content')
FROM numbers(1000000);
INSERT INTO compress_test_zstd SELECT * FROM compress_test_lz4;
INSERT INTO compress_test_zstd_high SELECT * FROM compress_test_lz4;
-- Compare the compression results
SELECT
table,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
formatReadableSize(sum(data_compressed_bytes)) AS compressed,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio,
round(sum(data_compressed_bytes) / sum(data_uncompressed_bytes) * 100, 2) AS compression_pct
FROM system.parts
WHERE database = currentDatabase()
AND table LIKE 'compress_test_%'
AND active = 1
GROUP BY table
ORDER BY ratio DESC;
Expected output (illustrative):
┌─table──────────────────┬─uncompressed─┬─compressed─┬─ratio─┬─compression_pct─┐
│ compress_test_zstd_high│ 68.53 MiB │ 12.45 MiB │ 5.50 │ 18.17% │
│ compress_test_zstd │ 68.53 MiB │ 14.23 MiB │ 4.82 │ 20.76% │
│ compress_test_lz4 │ 68.53 MiB │ 22.67 MiB │ 3.02 │ 33.08% │
└────────────────────────┴──────────────┴────────────┴───────┴─────────────────┘
4. Compression vs. Query Performance Trade-offs
| Codec | Ratio | Write speed | Read speed | Typical use |
|---|---|---|---|---|
| LZ4 | Moderate | Fastest | Fastest | Hot data, high-frequency writes |
| ZSTD(1) | High | Fast | Fast | Warm data, balanced workloads |
| ZSTD(9) | Very high | Slower | Fairly fast | Cold data, storage-sensitive |
| Delta(8) + ZSTD | Very high | Moderate | Fast | Time-series data (recommended) |
| Gorilla + ZSTD | High | Moderate | Fast | Floating-point time series |
-- Gauge how codecs affect read performance,
-- using actual read statistics from the query log
SELECT
query,
read_rows,
formatReadableSize(read_bytes) AS read_size,
query_duration_ms,
formatReadableSize(memory_usage) AS memory
FROM system.query_log
WHERE type = 'QueryFinish'
AND query LIKE '%compress_test_%'
AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms
LIMIT 10;
💡 Tip: for frequently queried hot data, prefer LZ4 or Delta + LZ4 (fastest decompression). For archived cold data, ZSTD(9) can save roughly 30-50% of storage.
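The ratio/CPU trade-off can be felt with a rough stand-alone sketch; here zlib's levels stand in for LZ4/ZSTD (neither ships with Python's standard library), and the exact numbers will vary by machine and data.

```python
import time
import zlib

# Repetitive, log-like payload
data = ("2026-01-01,42,event description with some repeated text content\n"
        * 20_000).encode()

# Higher levels buy a better ratio at a higher CPU cost
for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: ratio {len(data) / len(out):.1f}x in {elapsed_ms:.1f} ms")
```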
5. Adaptive Compression
ClickHouse applies compression adaptively: when a data block compresses poorly (the "compressed" result would actually be larger than the input), the block is stored uncompressed instead.
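The fallback logic is easy to picture with a sketch (illustrative only, using zlib; `maybe_compress` is a hypothetical helper, not a ClickHouse API):

```python
import random
import zlib

def maybe_compress(block: bytes) -> tuple[bool, bytes]:
    """Compress a block, but fall back to the raw bytes if it would grow."""
    out = zlib.compress(block)
    return (True, out) if len(out) < len(block) else (False, block)

text_block = b"repetitive log line\n" * 1000
random_block = random.Random(0).randbytes(4096)  # high-entropy, incompressible

print(maybe_compress(text_block)[0])    # True  -> compression pays off
print(maybe_compress(random_block)[0])  # False -> stored uncompressed
```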
-- Review recent part writes (requires system.part_log to be enabled)
SELECT
    event_time,
    event_type,
    database,
    table,
    part_name,
    formatReadableSize(size_in_bytes) AS size_on_disk,
    formatReadableSize(bytes_uncompressed) AS uncompressed
FROM system.part_log
WHERE table = 'your_table'
AND event_time > now() - INTERVAL 1 DAY
ORDER BY event_time DESC
LIMIT 20;
-- Inspect compression block-size settings
SELECT name, value
FROM system.settings
WHERE name IN (
'min_compress_block_size',
'max_compress_block_size',
'max_block_size'
);
6. Compression SETTINGS
-- Configure compression block sizes when creating a table
CREATE TABLE optimized_compress
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY event_date
SETTINGS
    min_compress_block_size = 65536,   -- minimum compression block size (bytes)
    max_compress_block_size = 1048576; -- maximum compression block size (bytes)
-- Server-wide compression defaults (config.xml)
-- <compression>
-- <case>
-- <min_part_size>10000000000</min_part_size>
-- <method>zstd</method>
-- </case>
-- </compression>
| Setting | Default | Description |
|---|---|---|
| min_compress_block_size | 65536 | Minimum compression block size on write (bytes) |
| max_compress_block_size | 1048576 | Maximum compression block size on write (bytes) |
7. Impact of Compression on Write Performance
-- Measure write speed under different codecs
-- Create the tables (codecs attach to the columns)
CREATE TABLE write_test_lz4 (id UInt64, data String CODEC(LZ4)) ENGINE = MergeTree() ORDER BY id;
CREATE TABLE write_test_zstd (id UInt64, data String CODEC(ZSTD(1))) ENGINE = MergeTree() ORDER BY id;
-- Bulk insert into the LZ4 table
INSERT INTO write_test_lz4 SELECT
number,
concat('row_', toString(number), '_data_padding_content')
FROM numbers(1000000);
-- Bulk insert into the ZSTD table
INSERT INTO write_test_zstd SELECT
number,
concat('row_', toString(number), '_data_padding_content')
FROM numbers(1000000);
-- Inspect write statistics
SELECT
table,
formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE database = currentDatabase()
AND table LIKE 'write_test_%'
AND active = 1
GROUP BY table;
⚠️ Note: the higher the ZSTD level (1→22), the better the ratio but the higher the CPU cost. For write-heavy workloads, prefer ZSTD(1) or LZ4.
8. Real-World Examples
8.1 Time-series compression scheme
-- Monitoring scenario: ~10 billion metric rows per day
CREATE TABLE metrics.compressed_metrics
(
timestamp DateTime64(3) CODEC(Delta(8), ZSTD(1)),
host LowCardinality(String) CODEC(LZ4),
metric_name LowCardinality(String) CODEC(LZ4),
value Float64 CODEC(Gorilla, ZSTD(1)),
tags Map(String, String) CODEC(ZSTD(3))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (host, metric_name, timestamp);
-- Expected ratio: roughly 20:1 to 50:1
-- 10 billion rows x ~100 bytes = ~1 TB of raw data
-- Compressed: roughly 20-50 GB
8.2 Log storage compression scheme
CREATE TABLE logs.compressed_logs
(
log_time DateTime CODEC(Delta(4), LZ4),
service LowCardinality(String) CODEC(LZ4),
level LowCardinality(String) CODEC(LZ4),
host LowCardinality(String) CODEC(LZ4),
message String CODEC(ZSTD(3)),
stack_trace String CODEC(ZSTD(5)),
extra Map(String, String) CODEC(ZSTD(1))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(log_time)
ORDER BY (log_time, service, host)
TTL log_time + INTERVAL 90 DAY DELETE;
-- Inspect actual per-column compression (per-column stats live in system.parts_columns)
SELECT
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 1) AS ratio
FROM system.parts_columns
WHERE table = 'compressed_logs' AND active = 1
GROUP BY column;
8.3 User-event tracking compression scheme
CREATE TABLE analytics.compressed_events
(
event_date Date CODEC(Delta(2), LZ4),
event_time DateTime64(3) CODEC(Delta(8), ZSTD(1)),
user_id UInt64 CODEC(T64, ZSTD(1)),
session_id UUID CODEC(LZ4),
event_type LowCardinality(String) CODEC(LZ4),
page_url String CODEC(ZSTD(1)),
referrer String CODEC(ZSTD(1)),
user_agent String CODEC(ZSTD(3)),
ip IPv4 CODEC(LZ4), -- T64 is limited to plain integer/date types, so use LZ4 for IPv4
device_info Map(String, String) CODEC(ZSTD(1))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, event_time);
9. Compression Ratio Optimization Strategies
9.1 Column type optimization
-- ❌ Inefficient: String for low-cardinality data
status String -- only 5 distinct values, yet every row stores the full string
-- ✅ Efficient: LowCardinality(String) or Enum8
status LowCardinality(String) -- dictionary-encoded, stores integer indexes
status Enum8('pending'=1, 'active'=2, 'deleted'=3) -- 1-byte integer
-- ❌ Inefficient: DateTime for date-only values
event_date DateTime -- 4 bytes
-- ✅ Efficient: Date for date-only values
event_date Date -- 2 bytes
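What LowCardinality buys can be sketched conceptually in Python (a simplification of the real dictionary encoding):

```python
# 100,000 status strings drawn from only 3 distinct values
rows = ["pending", "active", "active", "deleted", "active"] * 20_000
raw_size = sum(len(s) for s in rows)

# Dictionary encoding: store each distinct string once,
# then one small integer index per row
dictionary = sorted(set(rows))
index = {s: i for i, s in enumerate(dictionary)}
encoded = bytes(index[s] for s in rows)  # one byte per row

encoded_size = len(encoded) + sum(len(s) for s in dictionary)
print(raw_size, "->", encoded_size)  # 640000 -> 100020
```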
9.2 CODEC combination best practices
| Data characteristics | Recommended CODEC | Typical ratio |
|---|---|---|
| Consecutive timestamps | Delta(8), ZSTD(1) | 30:1-50:1 |
| Slowly changing floats | Gorilla, ZSTD(1) | 8:1-20:1 |
| Small-range integers | T64, ZSTD(1) | 10:1-30:1 |
| Log text | ZSTD(3) | 5:1-10:1 |
| Low-cardinality strings | LZ4 | 3:1-5:1 |
| UUIDs / hash values | LZ4 | 1.5:1-2:1 |
| Sparse integers (mostly zeros) | Delta(8), ZSTD(3) | 50:1-100:1 |
9.3 诊断压缩效果
-- 查找压缩效果最差的列
SELECT
table,
column,
formatReadableSize(sum(data_compressed_bytes)) AS compressed,
formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
round(sum(data_uncompressed_bytes) / greatest(sum(data_compressed_bytes), 1), 2) AS ratio
FROM system.parts AS p
LEFT JOIN system.columns AS c ON p.table = c.table AND p.database = c.database
WHERE p.active = 1 AND p.database = currentDatabase()
GROUP BY table, column
HAVING ratio < 2
ORDER BY ratio ASC;
-- Find the columns occupying the most space
SELECT
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS size
FROM system.parts_columns
WHERE active = 1 AND database = currentDatabase()
GROUP BY table, column
ORDER BY sum(column_data_compressed_bytes) DESC
LIMIT 20;
10. Compression-Related System Tables
-- Per-column compression statistics
SELECT
    database,
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed_size,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS compression_ratio,
    count() AS part_count
FROM system.parts_columns
WHERE active = 1 AND database = currentDatabase()
GROUP BY database, table, column
ORDER BY sum(column_data_compressed_bytes) DESC
LIMIT 20;
-- Overall compression ratio per table
SELECT
database,
table,
formatReadableSize(sum(data_compressed_bytes)) AS total_compressed,
formatReadableSize(sum(data_uncompressed_bytes)) AS total_uncompressed,
round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio,
sum(rows) AS total_rows,
count() AS parts_count
FROM system.parts
WHERE active = 1 AND database = currentDatabase()
GROUP BY database, table
ORDER BY sum(data_compressed_bytes) DESC;