
Data Compression and Encoding

Data compression is one of the core pillars of ClickHouse's performance. Columnar storage is naturally compression-friendly: all values in a column share one type and adjacent values tend to be similar, so compression ratios far exceed those of row-oriented storage.
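A quick illustration of why (a sketch in Python, with zlib standing in for LZ4/ZSTD): the same bytes compress far better when similar values sit next to each other, as they do inside a column.

```python
import random
import zlib

random.seed(0)

# A "column" of similar adjacent values, e.g. slowly growing timestamps
column = b"".join((1_700_000_000 + i).to_bytes(8, "little") for i in range(10_000))

# The same bytes shuffled, mimicking values scattered through row-oriented storage
shuffled = bytearray(column)
random.shuffle(shuffled)

col_ratio = len(column) / len(zlib.compress(column, 6))
row_ratio = len(shuffled) / len(zlib.compress(bytes(shuffled), 6))
print(f"columnar: {col_ratio:.1f}x, shuffled: {row_ratio:.1f}x")
```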


1. Compression Algorithms Overview

ClickHouse ships with several compression algorithms, configurable flexibly at the column level.

Algorithm     | Compression speed | Decompression speed | Ratio  | Best for
LZ4 (default) | very fast         | very fast           | medium | general use, write-heavy workloads
ZSTD          | slower            | fast                | high   | storage-sensitive workloads, cold data
None          | instant           | instant             | none   | already-compressed data or temporary tables
-- View compression-related settings
SELECT name, value
FROM system.settings
WHERE name LIKE '%compression%';

-- Tune compression block sizes when creating a table
CREATE TABLE lz4_table
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY event_date
SETTINGS min_compress_block_size = 65536,
         max_compress_block_size = 1048576;

-- Specify a codec on a column at table creation time
CREATE TABLE zstd_table
(
    event_date Date,
    user_id UInt64,
    data String CODEC(ZSTD(1))
)
ENGINE = MergeTree()
ORDER BY event_date;

-- Change a column's codec afterwards
ALTER TABLE zstd_table MODIFY COLUMN data String CODEC(ZSTD(3));

2. Column Compression Codecs

ClickHouse also supports specialized per-column codecs, which can be chained with a general-purpose compression algorithm to further improve the compression ratio.

2.1 Supported Codecs

CODEC          | Applicable types        | Principle                                           | Best for
Delta(N)       | integers/dates/times    | stores differences between adjacent values          | time-series data
DoubleDelta(N) | integers/dates/times    | stores the difference of the differences            | linearly increasing data
Gorilla        | Float32/Float64/UInt32  | XOR encoding, after Facebook's Gorilla paper        | time-series floats
T64            | integer types           | packs 64 values into a bit matrix and transposes it | low-variation integers
FPC            | Float64                 | predictive floating-point compression               | large volumes of floats
LZ4            | any type                | general-purpose LZ4                                 | general use
ZSTD(N)        | any type                | ZSTD at level N = 1~22                              | high-compression needs
NONE           | any type                | no compression                                      | already-compressed external data
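The Delta idea in the table can be sketched in plain Python (zlib standing in for the general-purpose second stage): replacing timestamps by differences between neighbours turns many large, distinct integers into a run of near-identical small ones.

```python
from itertools import accumulate
import zlib

# Timestamps at 10-second intervals, as in the time-series example below
ts = [1_700_000_000 + 10 * i for i in range(8640)]
raw = b"".join(t.to_bytes(8, "little") for t in ts)

# Delta: keep the first value, then store each neighbour-to-neighbour difference
deltas = [ts[0]] + [b - a for a, b in zip(ts, ts[1:])]
delta_bytes = b"".join(d.to_bytes(8, "little") for d in deltas)

# Decoding is a running sum, so the transform is lossless
assert list(accumulate(deltas)) == ts
print(len(zlib.compress(raw)), len(zlib.compress(delta_bytes)))
```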

2.2 Delta + ZSTD (best for time-series data)

CREATE TABLE timeseries_data
(
    timestamp    DateTime64(3) CODEC(Delta(8), ZSTD(1)),  -- timestamp deltas + ZSTD
    host         LowCardinality(String),
    cpu_usage    Float64 CODEC(Gorilla, ZSTD(1)),         -- float XOR + ZSTD
    memory_usage Float64 CODEC(Gorilla, ZSTD(1)),         -- float XOR + ZSTD
    disk_read    UInt64 CODEC(Delta(8), ZSTD(1)),         -- integer deltas + ZSTD
    disk_write   UInt64 CODEC(Delta(8), ZSTD(1)),
    network_in   UInt64 CODEC(Delta(8), ZSTD(1)),
    network_out  UInt64 CODEC(Delta(8), ZSTD(1))
)
ENGINE = MergeTree()
ORDER BY (host, timestamp);

-- Insert simulated data
INSERT INTO timeseries_data
SELECT
    toDateTime('2026-05-01 00:00:00') + toIntervalSecond(number * 10) AS timestamp,
    'server-01' AS host,
    50 + sin(number / 100) * 30 AS cpu_usage,
    60 + cos(number / 50) * 20 AS memory_usage,
    toUInt64(rand() % 1000000) AS disk_read,
    toUInt64(rand() % 500000) AS disk_write,
    toUInt64(rand() % 10000000) AS network_in,
    toUInt64(rand() % 5000000) AS network_out
FROM numbers(8640);  -- one day of data, one row every 10 seconds

-- Check compression results
SELECT
    name,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio,
    formatReadableSize(marks_bytes) AS marks_size
FROM system.columns
WHERE table = 'timeseries_data' AND database = currentDatabase();

Expected compression ratios:

Column        | Uncompressed | Compressed | Ratio
timestamp     | 69.12 KB     | 1.23 KB    | 56.2x
cpu_usage     | 69.12 KB     | 8.45 KB    | 8.2x
memory_usage  | 69.12 KB     | 9.12 KB    | 7.6x
disk_read     | 69.12 KB     | 15.23 KB   | 4.5x

2.3 T64 (low-variation integers)

-- T64 suits integer columns whose values span a small range
CREATE TABLE temperature_sensors
(
    sensor_id UInt32,
    reading_time DateTime,
    temperature Int16 CODEC(T64, ZSTD(1)),   -- small range (-50 ~ 50 °C)
    humidity UInt8 CODEC(T64, LZ4)           -- humidity 0~100
)
ENGINE = MergeTree()
ORDER BY (sensor_id, reading_time);

-- Check encoding results
SELECT
    name AS column,
    formatReadableSize(data_uncompressed_bytes) AS uncompressed,
    formatReadableSize(data_compressed_bytes) AS compressed,
    round(data_uncompressed_bytes / data_compressed_bytes, 2) AS ratio
FROM system.columns
WHERE table = 'temperature_sensors' AND database = currentDatabase();
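A simplified sketch of the T64 idea in Python (at byte rather than bit granularity): values from a narrow range leave their high bytes constant, and transposing a block of 64 values groups those constant bytes into long zero runs that the follow-up compressor stores almost for free.

```python
import zlib

# 64 temperature readings in a narrow range, stored as 8-byte little-endian ints
temps = [20 + (i % 7) for i in range(64)]
raw = b"".join(t.to_bytes(8, "little") for t in temps)

# Transpose: plane k holds byte k of every value; planes 1..7 are all zeros here
planes = bytes(raw[j * 8 + k] for k in range(8) for j in range(64))
assert set(planes[64:]) == {0}
print(len(zlib.compress(raw)), len(zlib.compress(planes)))
```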

2.4 Gorilla (floating-point data)

-- Gorilla (from Facebook's paper): XOR adjacent float values, then variable-length encode the result
-- Suits slowly varying floating-point data

CREATE TABLE stock_prices
(
    trade_date Date,
    stock_code LowCardinality(String),
    open_price Float64 CODEC(Gorilla, ZSTD(1)),
    close_price Float64 CODEC(Gorilla, ZSTD(1)),
    high_price Float64 CODEC(Gorilla, ZSTD(1)),
    low_price Float64 CODEC(Gorilla, ZSTD(1)),
    volume UInt64 CODEC(Delta(8), ZSTD(1))
)
ENGINE = MergeTree()
ORDER BY (stock_code, trade_date);
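The XOR step behind Gorilla can be illustrated in Python: slowly varying doubles share their sign, exponent, and leading mantissa bits, so XOR-ing adjacent bit patterns yields integers with long runs of leading zeros that a variable-length encoder stores cheaply (a sketch of the principle, not the on-disk format).

```python
import struct

def float_bits(x: float) -> int:
    # Reinterpret a float64's IEEE-754 bit pattern as an unsigned integer
    return struct.unpack("<Q", struct.pack("<d", x))[0]

prices = [100.00, 100.01, 100.02, 100.01, 100.03]
xored = [float_bits(a) ^ float_bits(b) for a, b in zip(prices, prices[1:])]
for x in xored:
    # All prices share sign and exponent, so each XOR has many leading zero bits
    print(f"leading zeros: {64 - x.bit_length()}")
```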

3. Compression Ratio Tests

3.1 Comparing algorithms

-- Create test tables to compare codecs
CREATE TABLE compress_test_lz4
(
    event_date Date CODEC(LZ4),
    value UInt64 CODEC(LZ4),
    description String CODEC(LZ4)
) ENGINE = MergeTree() ORDER BY event_date;

CREATE TABLE compress_test_zstd
(
    event_date Date CODEC(ZSTD(1)),
    value UInt64 CODEC(ZSTD(1)),
    description String CODEC(ZSTD(1))
) ENGINE = MergeTree() ORDER BY event_date;

CREATE TABLE compress_test_zstd_high
(
    event_date Date CODEC(ZSTD(9)),
    value UInt64 CODEC(ZSTD(9)),
    description String CODEC(ZSTD(9))
) ENGINE = MergeTree() ORDER BY event_date;

-- Insert identical data into each table
INSERT INTO compress_test_lz4 SELECT
    toDate('2026-01-01') + toIntervalDay(number % 365),
    rand64(),
    concat('Event description for record number ', toString(number), ' with some random text content')
FROM numbers(1000000);

INSERT INTO compress_test_zstd SELECT * FROM compress_test_lz4;
INSERT INTO compress_test_zstd_high SELECT * FROM compress_test_lz4;

-- Compare compression results
SELECT
    table,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio,
    round(sum(data_compressed_bytes) / sum(data_uncompressed_bytes) * 100, 2) AS compression_pct
FROM system.parts
WHERE database = currentDatabase()
  AND table LIKE 'compress_test_%'
  AND active = 1
GROUP BY table
ORDER BY ratio DESC;

Expected output (example):

┌─table──────────────────┬─uncompressed─┬─compressed─┬─ratio─┬─compression_pct─┐
│ compress_test_zstd_high│ 68.53 MiB    │ 12.45 MiB  │  5.50 │  18.17%         │
│ compress_test_zstd     │ 68.53 MiB    │ 14.23 MiB  │  4.82 │  20.76%         │
│ compress_test_lz4      │ 68.53 MiB    │ 22.67 MiB  │  3.02 │  33.08%         │
└────────────────────────┴──────────────┴────────────┴───────┴─────────────────┘

4. Compression vs. Query Performance

Algorithm        | Ratio     | Write speed | Read speed | Best for
LZ4              | medium    | fastest     | fastest    | hot data, high-frequency writes
ZSTD(1)          | high      | fast        | fast       | warm data, balanced workloads
ZSTD(9)          | very high | slower      | fast       | cold data, storage-sensitive
Delta(8) + ZSTD  | very high | fast        | fast       | time-series data (recommended)
Gorilla + ZSTD   | high      | fast        | fast       | floating-point time series
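The tradeoff can be reproduced in miniature with Python's zlib, using its levels as a stand-in for the LZ4-to-ZSTD(9) spectrum (absolute numbers vary by machine and data; only the trend matters):

```python
import time
import zlib

# Repetitive text, loosely similar to log/description columns
data = b"Event description for record with some random text content " * 20_000

for level in (1, 6, 9):
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(data) / len(out):.1f}x in {elapsed:.1f} ms")
```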
-- Measure how the codec affects read performance
-- using actual read statistics from the query log
SELECT
    query,
    read_rows,
    formatReadableSize(read_bytes) AS read_size,
    query_duration_ms,
    formatReadableSize(memory_usage) AS memory
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%compress_test_%'
  AND event_time > now() - INTERVAL 1 HOUR
ORDER BY query_duration_ms
LIMIT 10;

💡 Tip: for frequently queried hot data, prefer LZ4 or Delta + LZ4 (fastest decompression). For archived cold data, ZSTD(9) can save 30%~50% of storage space.


5. Adaptive Compression

ClickHouse checks the result of compressing each data block: when compression does not pay off (the compressed block would be larger than the raw one), the block is stored uncompressed instead.
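The fallback can be sketched as: compress a block, keep the raw bytes whenever compression does not shrink them, and record which case applies in a one-byte flag (an illustration of the idea; ClickHouse's actual block format differs).

```python
import os
import zlib

def store_block(block: bytes) -> bytes:
    # Flag byte: 0x01 = compressed, 0x00 = stored raw
    compressed = zlib.compress(block)
    return b"\x01" + compressed if len(compressed) < len(block) else b"\x00" + block

def load_block(stored: bytes) -> bytes:
    return zlib.decompress(stored[1:]) if stored[0] == 1 else stored[1:]

text = b"abc" * 1000        # highly compressible: stored compressed
noise = os.urandom(1000)    # incompressible: stored raw, growing only by the flag byte
assert load_block(store_block(text)) == text
assert store_block(noise)[0] == 0
```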

-- Inspect part-level size statistics
SELECT
    event_time,
    event_type,
    database,
    table,
    part_name,
    formatReadableSize(size_in_bytes) AS size_on_disk,
    formatReadableSize(bytes_uncompressed) AS uncompressed
FROM system.part_log
WHERE table = 'your_table'
  AND event_time > now() - INTERVAL 1 DAY
ORDER BY event_time DESC
LIMIT 20;

-- Check compress-block-size settings
SELECT name, value
FROM system.settings
WHERE name IN (
    'min_compress_block_size',
    'max_compress_block_size',
    'max_block_size'
);

6. Compression SETTINGS

-- Configure compression parameters at table creation
CREATE TABLE optimized_compress
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY event_date
SETTINGS
    min_compress_block_size = 65536,       -- minimum compress block size (bytes)
    max_compress_block_size = 1048576;     -- maximum compress block size (bytes)

-- Global compression config (config.xml)
-- <compression>
--     <case>
--         <min_part_size>10000000000</min_part_size>
--         <method>zstd</method>
--     </case>
-- </compression>
Parameter               | Default | Description
min_compress_block_size | 65536   | minimum compress block size (bytes) on write
max_compress_block_size | 1048576 | maximum compress block size (bytes) on write

7. Compression and Write Performance

-- Compare write speed under different codecs
-- Create the tables
CREATE TABLE write_test_lz4 (id UInt64, data String CODEC(LZ4)) ENGINE = MergeTree() ORDER BY id;
CREATE TABLE write_test_zstd (id UInt64, data String CODEC(ZSTD(1))) ENGINE = MergeTree() ORDER BY id;

-- Bulk insert into the LZ4 table
INSERT INTO write_test_lz4 SELECT
    number,
    concat('row_', toString(number), '_data_padding_content')
FROM numbers(1000000);

-- Bulk insert into the ZSTD table
INSERT INTO write_test_zstd SELECT
    number,
    concat('row_', toString(number), '_data_padding_content')
FROM numbers(1000000);

-- Check write statistics
SELECT
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS compressed_size,
    formatReadableSize(sum(data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio
FROM system.parts
WHERE database = currentDatabase()
  AND table LIKE 'write_test_%'
  AND active = 1
GROUP BY table;

⚠️ Note: the higher the ZSTD level (1→22), the better the ratio but the higher the CPU cost. For write-heavy workloads, prefer ZSTD(1) or LZ4.


8. Real-World Examples

8.1 Time-series compression scheme

-- Monitoring system: ~10 billion metric rows per day
CREATE TABLE metrics.compressed_metrics
(
    timestamp DateTime64(3) CODEC(Delta(8), ZSTD(1)),
    host LowCardinality(String) CODEC(LZ4),
    metric_name LowCardinality(String) CODEC(LZ4),
    value Float64 CODEC(Gorilla, ZSTD(1)),
    tags Map(String, String) CODEC(ZSTD(3))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (host, metric_name, timestamp);

-- Expected compression ratio: 20:1 ~ 50:1
-- 10 billion rows × ~100 bytes ≈ 1 TB of raw data
-- Compressed: roughly 20 GB ~ 50 GB

8.2 Log storage compression scheme

CREATE TABLE logs.compressed_logs
(
    log_time DateTime CODEC(Delta(4), LZ4),
    service LowCardinality(String) CODEC(LZ4),
    level LowCardinality(String) CODEC(LZ4),
    host LowCardinality(String) CODEC(LZ4),
    message String CODEC(ZSTD(3)),
    stack_trace String CODEC(ZSTD(5)),
    extra Map(String, String) CODEC(ZSTD(1))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(log_time)
ORDER BY (log_time, service, host)
TTL log_time + INTERVAL 90 DAY DELETE;

-- Check actual compression per column
SELECT
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 1) AS ratio
FROM system.parts_columns
WHERE table = 'compressed_logs' AND active = 1
GROUP BY column;

8.3 Event-tracking compression scheme

CREATE TABLE analytics.compressed_events
(
    event_date Date CODEC(Delta(2), LZ4),
    event_time DateTime64(3) CODEC(Delta(8), ZSTD(1)),
    user_id UInt64 CODEC(T64, ZSTD(1)),
    session_id UUID CODEC(LZ4),
    event_type LowCardinality(String) CODEC(LZ4),
    page_url String CODEC(ZSTD(1)),
    referrer String CODEC(ZSTD(1)),
    user_agent String CODEC(ZSTD(3)),
    ip IPv4 CODEC(T64, LZ4),
    device_info Map(String, String) CODEC(ZSTD(1))
)
ENGINE = MergeTree()
PARTITION BY toYYYYMM(event_date)
ORDER BY (event_date, user_id, event_time);

9. Strategies for Improving Compression Ratios

9.1 Column type optimization

-- ❌ Inefficient: String for low-cardinality data
status String     -- only 5 distinct values, yet every row stores the full string

-- ✅ Efficient: LowCardinality(String) or Enum8
status LowCardinality(String)  -- dictionary-encoded, stores integer indexes
status Enum8('pending'=1, 'active'=2, 'deleted'=3)  -- 1-byte integer

-- ❌ Inefficient: DateTime for a pure date
event_date DateTime  -- 4 bytes

-- ✅ Efficient: Date for a pure date
event_date Date      -- 2 bytes
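The LowCardinality/Enum8 gain can be approximated in Python: dictionary-encode the repeated status strings into one-byte indexes (a sketch of the idea, not ClickHouse's actual layout).

```python
statuses = ["pending", "active", "deleted"] * 10_000

# Plain storage: every row repeats the full string
plain_size = sum(len(s) for s in statuses)

# Dictionary encoding: each distinct value stored once, plus a 1-byte index per row
dictionary = sorted(set(statuses))
index = {s: i for i, s in enumerate(dictionary)}
encoded = bytes(index[s] for s in statuses)
dict_size = sum(len(s) for s in dictionary) + len(encoded)

# Decoding restores the original column exactly
assert [dictionary[i] for i in encoded] == statuses
print(plain_size, dict_size)
```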

9.2 Codec combination best practices

Data characteristics       | Recommended CODEC  | Expected ratio
consecutive timestamps     | Delta(8), ZSTD(1)  | 30:1~50:1
slowly varying floats      | Gorilla, ZSTD(1)   | 8:1~20:1
low-variation integers     | T64, ZSTD(1)       | 10:1~30:1
log text                   | ZSTD(3)            | 5:1~10:1
low-cardinality strings    | LZ4                | 3:1~5:1
UUIDs / hash values        | LZ4                | 1.5:1~2:1
sparse integers (mostly 0) | Delta(8), ZSTD(3)  | 50:1~100:1

9.3 Diagnosing compression effectiveness

-- Find the worst-compressing columns
SELECT
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed,
    round(sum(column_data_uncompressed_bytes) / greatest(sum(column_data_compressed_bytes), 1), 2) AS ratio
FROM system.parts_columns
WHERE active = 1 AND database = currentDatabase()
GROUP BY table, column
HAVING ratio < 2
ORDER BY ratio ASC;

-- Find the columns taking the most space on disk
SELECT
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS size
FROM system.parts_columns
WHERE active = 1 AND database = currentDatabase()
GROUP BY table, column
ORDER BY sum(column_data_compressed_bytes) DESC
LIMIT 20;

10. Compression-Related System Tables

-- Per-column compression statistics
SELECT
    database,
    table,
    column,
    formatReadableSize(sum(column_data_compressed_bytes)) AS compressed_size,
    formatReadableSize(sum(column_data_uncompressed_bytes)) AS uncompressed_size,
    round(sum(column_data_uncompressed_bytes) / sum(column_data_compressed_bytes), 2) AS compression_ratio,
    count() AS part_count
FROM system.parts_columns
WHERE active = 1 AND database = currentDatabase()
GROUP BY database, table, column
ORDER BY sum(column_data_compressed_bytes) DESC
LIMIT 20;

-- Overall compression ratio per table
SELECT
    database,
    table,
    formatReadableSize(sum(data_compressed_bytes)) AS total_compressed,
    formatReadableSize(sum(data_uncompressed_bytes)) AS total_uncompressed,
    round(sum(data_uncompressed_bytes) / sum(data_compressed_bytes), 2) AS ratio,
    sum(rows) AS total_rows,
    count() AS parts_count
FROM system.parts
WHERE active = 1 AND database = currentDatabase()
GROUP BY database, table
ORDER BY sum(data_compressed_bytes) DESC;

Further Reading

  1. ClickHouse compression codecs
  2. The LZ4 compression algorithm
  3. The ZSTD compression algorithm
  4. The Facebook Gorilla paper
  5. A deep dive into ClickHouse data compression