
Index System (Primary Key / Data-Skipping Indexes)

ClickHouse's indexing is fundamentally different from the B+Tree indexes of traditional relational databases: it combines a sparse primary index with data-skipping indexes, a design optimized for large-scale analytical queries.


1. Primary Key Index (Sparse Index)

1.1 How the Sparse Index Works

Traditional databases (such as MySQL) index every row, whereas ClickHouse stores one index mark per index_granularity rows (8192 by default).

Data (stored in sorting-key order):
┌─ Granule 0 ──────────────────────────┐
│  Row 0: (2026-05-01, user_1, click)  │
│  Row 1: (2026-05-01, user_1, view)   │
│  ...                                  │
│  Row 8191: (2026-05-01, user_50, ...) │
└──────────────────────────────────────┘
         ↑ index mark 0: (2026-05-01, user_1)
┌─ Granule 1 ──────────────────────────┐
│  Row 8192: (2026-05-01, user_51, ...)│
│  ...                                  │
│  Row 16383: (2026-05-02, user_20, ..)│
└──────────────────────────────────────┘
         ↑ index mark 1: (2026-05-01, user_51)
┌─ Granule 2 ──────────────────────────┐
│  Row 16384: (2026-05-02, user_21, .) │
│  ...                                  │
└──────────────────────────────────────┘
         ↑ index mark 2: (2026-05-02, user_21)

For the query WHERE event_date = '2026-05-02' AND user_id = 20:

  1. Binary-search the sparse index to find the granule range that may contain the target rows
  2. Read only those granules (instead of scanning the whole table)
  3. Filter the rows linearly inside each granule
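The three steps above can be sketched in Python (a conceptual model, not ClickHouse's actual implementation; the mark values mirror the diagram):

```python
import bisect

# Sparse-index lookup sketch: one mark per granule holds the first
# sorting-key value of that granule, so a binary search narrows the
# scan down to a small range of candidate granules.
def granule_range(marks, key):
    """marks: sorted first keys of each granule; returns the inclusive
    (first, last) granule numbers that may contain `key`."""
    first = max(bisect.bisect_left(marks, key) - 1, 0)
    last = max(bisect.bisect_right(marks, key) - 1, 0)
    return first, last

marks = [("2026-05-01", "user_1"), ("2026-05-01", "user_51"), ("2026-05-02", "user_21")]
print(granule_range(marks, ("2026-05-02", "user_20")))  # (1, 1): only granule 1 is read
```

Only the granules in the returned range are decompressed and filtered linearly; everything else is never touched.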
-- check how much memory the primary index occupies
SELECT
    database,
    table,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_memory,
    formatReadableSize(primary_key_bytes_in_memory_allocated) AS pk_allocated
FROM system.tables
WHERE database = currentDatabase();

-- inspect rows, marks, and index size per data part
SELECT
    partition,
    name AS part_name,
    rows,
    marks,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_size
FROM system.parts
WHERE table = 'your_table' AND active = 1;

1.2 Index Granularity (index_granularity)

-- default granularity: 8192 rows
CREATE TABLE default_granularity
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id)
SETTINGS index_granularity = 8192;

-- custom granularity: 4096 rows (finer index, higher memory overhead)
CREATE TABLE fine_granularity
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id)
SETTINGS index_granularity = 4096;

-- adaptive granularity (recommended; granule size adapts to byte size)
CREATE TABLE adaptive_granularity
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id)
SETTINGS
    index_granularity_bytes = 10485760,   -- target granule size: 10 MB
    min_bytes_for_wide_part = 0;
| Granularity setting | Index size | Pruning precision | Suitable for |
|---|---|---|---|
| 8192 rows | smaller | average | general-purpose default |
| 4096 rows | larger | better | small rows that need a finer index |
| adaptive | automatic | automatic | rows of widely varying size (recommended) |
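The memory trade-off in the table comes down to simple arithmetic (an illustration, not exact ClickHouse accounting):

```python
# One index mark is stored per index_granularity rows, so halving the
# granularity doubles the mark count (and the index's memory footprint).
def mark_count(total_rows, index_granularity):
    return -(-total_rows // index_granularity)  # ceiling division

rows = 1_000_000_000
print(mark_count(rows, 8192))  # 122071 marks
print(mark_count(rows, 4096))  # 244141 marks: twice the index size
```

Even at a billion rows the default granularity yields only ~122k marks, which is why the whole primary index fits comfortably in memory.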

1.3 Sorting Key vs. Primary Key

-- Scenario: ORDER BY has 4 columns, but PRIMARY KEY keeps only the first 2
CREATE TABLE partial_pk
(
    event_date Date,
    user_id UInt64,
    event_type LowCardinality(String),
    url String,
    duration UInt32
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id, event_type, url)
PRIMARY KEY (event_date, user_id);  -- the index stores only the first two columns

-- queries can still filter on the full sorting key
SELECT * FROM partial_pk
WHERE event_date = '2026-05-11'
  AND user_id = 123
  AND event_type = 'click'       -- third sorting-key column
  AND url = '/home';             -- fourth sorting-key column
-- after the sparse index locates the granules, the sorting key keeps filtering inside each granule

💡 Tip: when the sorting key has many columns, keep only the leading columns as the PRIMARY KEY to reduce index memory. The trailing sorting-key columns can still be used for fast filtering inside each granule.


2. Skip Indexes (Secondary Indexes)

Skip indexes are ClickHouse's secondary indexes: at query time they skip granules that cannot match the predicate, further reducing the amount of data read.

2.1 Skip Index Types at a Glance

| Type | Applicable data | What it records | Typical use |
|---|---|---|---|
| minmax | any comparable type | per-granule min/max values | range queries |
| set(N) | any type | the set of distinct values in a granule | equality lookups (low cardinality) |
| bloom_filter | numbers / strings | bloom filter; probabilistic membership test | equality lookups (high cardinality) |
| ngrambf_v1(n, size, hashes, seed) | String | n-gram bloom filter | fuzzy / full-text search |
| tokenbf_v1(size, hashes, seed) | String | token bloom filter | whole-word search |
| vocabulary | String | dictionary index | equality lookups |

2.2 The minmax Index

-- minmax index: records the minimum and maximum of the column per index block
-- suited to roughly ordered data and range queries

CREATE TABLE logs_minmax
(
    log_time DateTime,
    host String,
    level LowCardinality(String),
    message String,
    INDEX idx_host host TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY log_time;

-- How pruning works:
-- granule 0: host min='server-01', max='server-05'
-- granule 1: host min='server-03', max='server-08'
-- a query for host='server-06' skips granule 0 and scans granule 1

SELECT * FROM logs_minmax WHERE host = 'server-06';
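The pruning logic in the comments above can be sketched as (a toy model reusing the same example values):

```python
# minmax pruning: a granule survives only if the predicate value can
# fall inside its [min, max] range; everything else is skipped.
def prune_minmax(granule_ranges, value):
    return [i for i, (lo, hi) in enumerate(granule_ranges) if lo <= value <= hi]

ranges = [("server-01", "server-05"),   # granule 0
          ("server-03", "server-08")]   # granule 1
print(prune_minmax(ranges, "server-06"))  # [1]: granule 0 is skipped
```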

2.3 The set Index

-- set index: records the set of distinct values per index block
-- suited to equality lookups on low-cardinality columns
CREATE TABLE events_set
(
    event_date Date,
    user_id UInt64,
    event_type LowCardinality(String),
    region LowCardinality(String),
    data String,
    INDEX idx_region region TYPE set(100) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id);
-- set(100): store at most 100 distinct values per index block;
-- if a block exceeds that, its set is discarded and the block can never be skipped

SELECT * FROM events_set WHERE region = '华东';
-- the set index quickly skips granules that do not contain '华东'
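How a set(N) index decides what to skip can be sketched as (a toy model; a block whose distinct-value count exceeds N stores no set and therefore can never be skipped):

```python
# set(N) pruning: keep a granule if its value set contains the target,
# or if the set overflowed N distinct values (then nothing is known
# about the granule and it must be read).
def prune_set(granule_values, value, n):
    kept = []
    for i, values in enumerate(granule_values):
        distinct = set(values)
        if len(distinct) > n or value in distinct:
            kept.append(i)
    return kept

granules = [["north", "south"], ["east", "east"], ["a", "b", "c", "d"]]
print(prune_set(granules, "east", n=3))  # [1, 2]: granule 2 overflowed
```

This is why set(N) only pays off on low-cardinality columns: on high-cardinality data every block overflows and the index prunes nothing.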

2.4 The bloom_filter Index

-- bloom_filter index: suited to equality lookups on high-cardinality columns
CREATE TABLE user_actions_bf
(
    action_time DateTime,
    user_id UInt64,
    action String,
    url String,
    ip IPv4,
    INDEX idx_user_id user_id TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY action_time;
-- bloom_filter(0.01): 1% false-positive rate

-- for a high-cardinality user_id lookup, the bloom filter quickly rules out granules that cannot contain it
SELECT count() FROM user_actions_bf WHERE user_id = 12345;
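The 0.01 argument trades space for accuracy. The classic bloom-filter sizing formulas (standard math, independent of ClickHouse internals) show roughly what that costs per index block:

```python
import math

# Classic bloom-filter sizing: m = -n*ln(p)/ln(2)^2 bits and
# k = (m/n)*ln(2) hash functions for n elements at false-positive rate p.
def bloom_params(n, p):
    bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    hashes = max(1, round(bits / n * math.log(2)))
    return bits, hashes

# one granule of 8192 distinct user_ids at a 1% false-positive rate:
bits, hashes = bloom_params(8192, 0.01)
print(f"{bits // 8} bytes, {hashes} hash functions")  # ~9.6 KiB, 7 hashes
```

Tightening the rate to 0.001 adds roughly 50% more bits, which is why very low rates are usually reserved for columns like trace IDs where a false positive is expensive.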

2.5 The ngrambf_v1 Index (Text Search)

-- n-gram bloom filter: suited to substring search
CREATE TABLE fulltext_search
(
    doc_id UInt64,
    title String,
    content String,
    created_at DateTime,
    INDEX idx_content content TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY created_at;
-- ngrambf_v1(ngram_size, bloom_filter_size_bytes, hash_functions, random_seed)
-- ngram_size=3: split text into 3-grams (trigrams)

-- find documents whose content contains a substring
SELECT doc_id, title
FROM fulltext_search
WHERE content LIKE '%ClickHouse%';

2.6 The tokenbf_v1 Index (Word Search)

-- token bloom filter: splits text on whitespace/punctuation, then indexes the tokens
CREATE TABLE articles
(
    article_id UInt64,
    title String,
    tags Array(String),
    content String,
    published_at DateTime,
    INDEX idx_content content TYPE tokenbf_v1(1024, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY published_at;
-- tokenbf_v1(bloom_filter_size_bytes, hash_functions, random_seed)

-- search by whole word
SELECT article_id, title
FROM articles
WHERE hasToken(content, 'Python');

💡 Tip: tokenbf_v1 splits on punctuation and whitespace, which suits space-delimited languages such as English. For Chinese text, combine it with ngrambf_v1.
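The difference between the two tokenizers can be illustrated as follows (simplified; ClickHouse's own tokenizers differ in detail, but the contrast holds):

```python
import re

# ngrambf_v1 indexes overlapping character n-grams, so it can match
# substrings even in unsegmented CJK text; tokenbf_v1 splits on
# non-alphanumeric characters and only matches whole tokens.
def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def tokens(s):
    return re.findall(r"[A-Za-z0-9]+", s)

print(ngrams("点击事件", n=2))                   # ['点击', '击事', '事件']
print(tokens("connection timeout at host-3"))  # ['connection', 'timeout', 'at', 'host', '3']
```

Chinese has no word separators, so token splitting yields nothing useful there, while n-grams still let any substring be probed against the bloom filter.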

2.7 The vocabulary Index

-- dictionary index: suited to exact matches on high-cardinality strings
CREATE TABLE products
(
    product_id UInt64,
    sku String,
    name String,
    price Decimal128(2),
    INDEX idx_sku sku TYPE vocabulary(10) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY product_id;

SELECT * FROM products WHERE sku = 'SKU-12345-XYZ';

3. Creating and Managing Indexes

3.1 Creating Indexes

-- define indexes when creating the table
CREATE TABLE my_table
(
    id UInt64,
    name String,
    category LowCardinality(String),
    tags Array(String),
    content String,
    INDEX idx_name name TYPE bloom_filter(0.01) GRANULARITY 4,
    INDEX idx_category category TYPE set(10) GRANULARITY 4,
    INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 2,
    INDEX idx_content content TYPE tokenbf_v1(1024, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY id;

-- add an index to an existing table (it applies only to newly inserted data)
ALTER TABLE my_table ADD INDEX idx_name name TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE my_table MATERIALIZE INDEX idx_name;  -- build the index for existing parts

-- drop an index
ALTER TABLE my_table DROP INDEX idx_name;

3.2 The GRANULARITY Parameter

-- GRANULARITY = N: one skip-index mark covers N granules
-- with GRANULARITY = 4 and index_granularity = 8192:
-- one skip-index mark per 4 * 8192 = 32768 rows

-- GRANULARITY = 1: one mark per granule (most precise, largest memory/disk footprint)
-- GRANULARITY = 8: one mark per 8 granules (more compact, but prunes less precisely)

-- Guidelines:
-- high-selectivity columns (e.g. user_id): GRANULARITY 1~4
-- low-selectivity columns (e.g. status): GRANULARITY 4~8
-- text-search columns: GRANULARITY 2~4
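The cost of a mark that fails to prune follows directly from that multiplication; a quick sanity check:

```python
# One skip-index mark covers GRANULARITY * index_granularity rows, so
# every mark that cannot be pruned forces that many rows to be read.
def rows_per_skip_mark(granularity, index_granularity=8192):
    return granularity * index_granularity

print(rows_per_skip_mark(4))  # 32768
print(rows_per_skip_mark(1))  # 8192: most precise, but the largest index
```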

4. Analyzing Index Usage

4.1 Checking Whether an Index Is Used

-- Method 1: EXPLAIN
EXPLAIN indexes = 1
SELECT count() FROM my_table WHERE user_id = 12345;

-- Method 2: inspect the query log
SELECT
    query_id,
    query,
    read_rows,
    read_bytes,
    result_rows,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%my_table%'
ORDER BY event_time DESC
LIMIT 10;

-- Method 3: per-query scan statistics by table
SELECT
    query_id,
    tables[1] AS table_name,
    read_rows,
    formatReadableSize(read_bytes) AS read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND tables[1] LIKE '%my_table%'
ORDER BY event_time DESC
LIMIT 10;

4.2 Sample EXPLAIN Output

EXPLAIN indexes = 1
SELECT count() FROM user_actions_bf WHERE user_id = 12345;

-- the output looks roughly like:
-- Expression ((Projection + Before ORDER BY))
--   Aggregating
--     Filter (WHERE)
--       ReadFromMergeTree (default.user_actions_bf)
--         Indexes:
--           PrimaryKey
--             Keys:
--               action_time
--             Condition: true
--             Parts: 5/5
--             Granules: 120/120
--           Skip
--             Name: idx_user_id
--             Description: bloom_filter GRANULARITY 4
--             Parts: 5/5
--             Granules: 30/120  ← only 30 of the 120 granules are scanned

4.3 Measuring Index Effectiveness

-- measure how much a given index type speeds up a query
-- create a test table
CREATE TABLE idx_benchmark
(
    event_date Date,
    user_id UInt64,
    category LowCardinality(String),
    ip IPv4,
    url String
)
ENGINE = MergeTree()
ORDER BY event_date;

-- baseline query, no skip index yet
SET send_logs_level = 'trace';
SELECT count() FROM idx_benchmark WHERE user_id = 12345;

-- add a bloom_filter index
ALTER TABLE idx_benchmark ADD INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE idx_benchmark MATERIALIZE INDEX idx_user;

-- run the query again and compare read_rows
SELECT count() FROM idx_benchmark WHERE user_id = 12345;

5. Skip Index Design Strategy

5.1 Choose the Index Type from the Query Pattern

| Query pattern | Recommended index | Example |
|---|---|---|
| WHERE col = value (high cardinality) | bloom_filter | user_id, ip |
| WHERE col = value (low cardinality) | set or minmax | status, region |
| WHERE col BETWEEN a AND b | minmax | timestamp, amount |
| WHERE hasToken(col, word) | tokenbf_v1 | log messages |
| WHERE col LIKE '%keyword%' | ngrambf_v1 | Chinese-text / substring search |
| WHERE has(arr_col, value) | bloom_filter | array columns |

5.2 Keep the Number of Indexes in Check

-- ✅ Reasonable: 2~5 skip indexes
CREATE TABLE reasonable_indexes
(
    id UInt64,
    user_id UInt64,
    ip IPv4,
    category LowCardinality(String),
    content String,
    INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4,
    INDEX idx_ip ip TYPE bloom_filter(0.01) GRANULARITY 4,
    INDEX idx_category category TYPE set(10) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY id;

-- ❌ Too many: 10+ skip indexes
-- every index adds memory and disk overhead, and too many of them degrade write throughput

⚠️ Note: skip indexes consume extra disk space and memory. If an index is rarely hit by queries, drop it to reclaim those resources.


6. Index Space Overhead

-- inspect index space usage per part
SELECT
    table,
    name AS part_name,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_size,
    formatReadableSize(secondary_indices_compressed_bytes) AS idx_size,
    formatReadableSize(bytes_on_disk) AS total_size,
    round(secondary_indices_compressed_bytes / bytes_on_disk * 100, 2) AS idx_ratio
FROM system.parts
WHERE active = 1
  AND database = currentDatabase()
ORDER BY secondary_indices_compressed_bytes DESC;

-- total index size across all tables
SELECT
    table,
    formatReadableSize(sum(primary_key_bytes_in_memory)) AS total_pk,
    formatReadableSize(sum(secondary_indices_compressed_bytes)) AS total_idx,
    formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE active = 1 AND database = currentDatabase()
GROUP BY table
ORDER BY sum(secondary_indices_compressed_bytes) DESC;

7. Worked Examples

7.1 Log Search System

CREATE TABLE app_logs
(
    log_time DateTime,
    service LowCardinality(String),
    level LowCardinality(String),
    host LowCardinality(String),
    trace_id String,
    user_id UInt64,
    message String,
    -- skip index design:
    INDEX idx_host host TYPE set(20) GRANULARITY 4,                       -- filter by host
    INDEX idx_level level TYPE set(5) GRANULARITY 2,                      -- filter by log level
    INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4,         -- per-user lookups
    INDEX idx_trace trace_id TYPE bloom_filter(0.001) GRANULARITY 4,      -- distributed tracing
    INDEX idx_message message TYPE tokenbf_v1(2048, 3, 0) GRANULARITY 2   -- keyword search
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(log_time)   -- daily partitions enable partition pruning
ORDER BY (log_time, service)
TTL log_time + INTERVAL 90 DAY DELETE;

-- Query: find one user's ERROR logs
SELECT log_time, service, message
FROM app_logs
WHERE log_time >= now() - INTERVAL 1 DAY
  AND user_id = 12345
  AND level = 'ERROR'
ORDER BY log_time DESC;
-- uses: partition pruning + the sorting key + bloom_filter(user_id) + set(level)

-- Query: full-text search over log messages
SELECT log_time, service, message
FROM app_logs
WHERE log_time >= now() - INTERVAL 1 DAY
  AND hasToken(message, 'timeout')
ORDER BY log_time DESC;
-- uses: the tokenbf_v1 index skips granules that do not contain 'timeout'

7.2 User Profile System

CREATE TABLE user_profiles
(
    user_id UInt64,
    age UInt8,
    gender LowCardinality(String),
    city LowCardinality(String),
    interests Array(String),
    last_active DateTime,
    -- only interests needs a skip index (minmax/set are not a fit for Array columns)
    INDEX idx_interests interests TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY user_id;

-- Query: find users interested in 'AI'
SELECT user_id, age, city
FROM user_profiles
WHERE has(interests, 'AI')
  AND city = 'Beijing';
-- interests is pruned via its bloom_filter index
-- city has no skip index here; it is filtered during the scan, which LowCardinality keeps cheap

7.3 E-commerce Product Search

CREATE TABLE product_catalog
(
    product_id UInt64,
    sku String,
    name String,
    brand LowCardinality(String),
    category_id UInt32,
    tags Array(String),
    description String,
    price Decimal128(2),
    created_at DateTime,
    INDEX idx_sku sku TYPE vocabulary(10) GRANULARITY 4,
    INDEX idx_brand brand TYPE set(100) GRANULARITY 4,
    INDEX idx_name name TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 4,
    INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY product_id;

-- exact lookup by SKU
SELECT * FROM product_catalog WHERE sku = 'SKU-12345';

-- filter by brand
SELECT count() FROM product_catalog WHERE brand = 'Apple';

-- substring search on the name
SELECT product_id, name, price
FROM product_catalog
WHERE name LIKE '%iPhone%';

-- filter by tags
SELECT product_id, name
FROM product_catalog
WHERE has(tags, 'hot_sale') AND has(tags, 'electronics');

8. Index Design Checklist

| Check | Recommendation |
|---|---|
| ✅ Primary key index | a prefix of the sorting key, ≤ 5 columns |
| ✅ Number of skip indexes | keep the total between 2 and 6 |
| ✅ Index type | choose by query pattern (bloom_filter / set / minmax / tokenbf) |
| ✅ GRANULARITY | 1~4 for high-selectivity columns, 4~8 for low-selectivity columns |
| ✅ Index space | monitor index sizes regularly; drop unused indexes |
| ✅ Query verification | confirm with EXPLAIN that indexes are actually used |
