
Index System (Primary Key / Data-Skipping Indexes)

ClickHouse's indexing is fundamentally different from the B+Tree indexes of traditional relational databases: it combines a sparse primary index with data-skipping indexes, a design optimized for large-scale analytical queries.


1. Primary Key Index (Sparse Index)

1.1 How the Sparse Index Works

Traditional databases (such as MySQL) index every row, whereas ClickHouse stores one index mark per index_granularity rows (8192 by default).

Data (stored in sorting-key order):
┌─ Granule 0 ──────────────────────────┐
│  Row 0: (2026-05-01, user_1, click)  │
│  Row 1: (2026-05-01, user_1, view)   │
│  ...                                  │
│  Row 8191: (2026-05-01, user_50, ...) │
└──────────────────────────────────────┘
         ↑ index mark 0: (2026-05-01, user_1)
┌─ Granule 1 ──────────────────────────┐
│  Row 8192: (2026-05-01, user_51, ...)│
│  ...                                  │
│  Row 16383: (2026-05-02, user_20, ..)│
└──────────────────────────────────────┘
         ↑ index mark 1: (2026-05-01, user_51)
┌─ Granule 2 ──────────────────────────┐
│  Row 16384: (2026-05-02, user_21, .) │
│  ...                                  │
└──────────────────────────────────────┘
         ↑ index mark 2: (2026-05-02, user_21)

For the query WHERE event_date = '2026-05-02' AND user_id = 20:

  1. Binary-search the sparse index to find the granule range that may contain the target rows
  2. Read only those granules (instead of scanning the whole table)
  3. Filter the rows linearly inside each granule
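The three steps above can be sketched in Python (a conceptual model, not ClickHouse's actual implementation; the mark values mirror the diagram):

```python
import bisect

# Sparse-index lookup sketch: one mark per granule holds the first
# sorting-key value of that granule, so a binary search narrows the
# scan down to a small range of candidate granules.
def granule_range(marks, key):
    """marks: sorted first keys of each granule; returns the inclusive
    (first, last) granule numbers that may contain `key`."""
    first = max(bisect.bisect_left(marks, key) - 1, 0)
    last = max(bisect.bisect_right(marks, key) - 1, 0)
    return first, last

marks = [("2026-05-01", "user_1"), ("2026-05-01", "user_51"), ("2026-05-02", "user_21")]
print(granule_range(marks, ("2026-05-02", "user_20")))  # (1, 1): only granule 1 is read
```

Only the granules in the returned range are decompressed and filtered linearly; everything else is never touched.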
-- check how much memory the primary index occupies
SELECT
    database,
    table,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_memory,
    formatReadableSize(primary_key_bytes_in_memory_allocated) AS pk_allocated
FROM system.tables
WHERE database = currentDatabase();

-- inspect rows, marks, and index size per data part
SELECT
    partition,
    name AS part_name,
    rows,
    marks,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_size
FROM system.parts
WHERE table = 'your_table' AND active = 1;

1.2 Index Granularity (index_granularity)

-- default granularity: 8192 rows
CREATE TABLE default_granularity
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id)
SETTINGS index_granularity = 8192;

-- custom granularity: 4096 rows (finer index, higher memory overhead)
CREATE TABLE fine_granularity
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id)
SETTINGS index_granularity = 4096;

-- adaptive granularity (recommended; granule size adapts to byte size)
CREATE TABLE adaptive_granularity
(
    event_date Date,
    user_id UInt64,
    data String
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id)
SETTINGS
    index_granularity_bytes = 10485760,   -- target granule size: 10 MB
    min_bytes_for_wide_part = 0;
| Granularity setting | Index size | Pruning precision | Suitable for |
|---|---|---|---|
| 8192 rows | smaller | average | general-purpose default |
| 4096 rows | larger | better | small rows that need a finer index |
| adaptive | automatic | automatic | rows of widely varying size (recommended) |
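The memory trade-off in the table comes down to simple arithmetic (an illustration, not exact ClickHouse accounting):

```python
# One index mark is stored per index_granularity rows, so halving the
# granularity doubles the mark count (and the index's memory footprint).
def mark_count(total_rows, index_granularity):
    return -(-total_rows // index_granularity)  # ceiling division

rows = 1_000_000_000
print(mark_count(rows, 8192))  # 122071 marks
print(mark_count(rows, 4096))  # 244141 marks: twice the index size
```

Even at a billion rows the default granularity yields only ~122k marks, which is why the whole primary index fits comfortably in memory.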

1.3 Sorting Key vs. Primary Key

-- Scenario: ORDER BY has 4 columns, but PRIMARY KEY keeps only the first 2
CREATE TABLE partial_pk
(
    event_date Date,
    user_id UInt64,
    event_type LowCardinality(String),
    url String,
    duration UInt32
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id, event_type, url)
PRIMARY KEY (event_date, user_id);  -- the index stores only the first two columns

-- queries can still filter on the full sorting key
SELECT * FROM partial_pk
WHERE event_date = '2026-05-11'
  AND user_id = 123
  AND event_type = 'click'       -- third sorting-key column
  AND url = '/home';             -- fourth sorting-key column
-- after the sparse index locates the granules, the sorting key keeps filtering inside each granule

💡 Tip: when the sorting key has many columns, keep only the leading columns as the PRIMARY KEY to reduce index memory. The trailing sorting-key columns can still be used for fast filtering inside each granule.


2. Skip Indexes (Secondary Indexes)

Skip indexes are ClickHouse's secondary indexes: at query time they skip granules that cannot match the predicate, further reducing the amount of data read.

2.1 Skip Index Types at a Glance

| Type | Applicable data | What it records | Typical use |
|---|---|---|---|
| minmax | any comparable type | per-granule min/max values | range queries |
| set(N) | any type | the set of distinct values in a granule | equality lookups (low cardinality) |
| bloom_filter | numbers / strings | bloom filter; probabilistic membership test | equality lookups (high cardinality) |
| ngrambf_v1(n, size, hashes, seed) | String | n-gram bloom filter | fuzzy / full-text search |
| tokenbf_v1(size, hashes, seed) | String | token bloom filter | whole-word search |
| vocabulary | String | dictionary index | equality lookups |

2.2 The minmax Index

-- minmax index: records the minimum and maximum of the column per index block
-- suited to roughly ordered data and range queries

CREATE TABLE logs_minmax
(
    log_time DateTime,
    host String,
    level LowCardinality(String),
    message String,
    INDEX idx_host host TYPE minmax GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY log_time;

-- How pruning works:
-- granule 0: host min='server-01', max='server-05'
-- granule 1: host min='server-03', max='server-08'
-- a query for host='server-06' skips granule 0 and scans granule 1

SELECT * FROM logs_minmax WHERE host = 'server-06';
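The pruning logic in the comments above can be sketched as (a toy model reusing the same example values):

```python
# minmax pruning: a granule survives only if the predicate value can
# fall inside its [min, max] range; everything else is skipped.
def prune_minmax(granule_ranges, value):
    return [i for i, (lo, hi) in enumerate(granule_ranges) if lo <= value <= hi]

ranges = [("server-01", "server-05"),   # granule 0
          ("server-03", "server-08")]   # granule 1
print(prune_minmax(ranges, "server-06"))  # [1]: granule 0 is skipped
```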

2.3 The set Index

-- set index: records the set of distinct values per index block
-- suited to equality lookups on low-cardinality columns
CREATE TABLE events_set
(
    event_date Date,
    user_id UInt64,
    event_type LowCardinality(String),
    region LowCardinality(String),
    data String,
    INDEX idx_region region TYPE set(100) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY (event_date, user_id);
-- set(100): store at most 100 distinct values per index block;
-- if a block exceeds that, its set is discarded and the block can never be skipped

SELECT * FROM events_set WHERE region = '华东';
-- the set index quickly skips granules that do not contain '华东'
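How a set(N) index decides what to skip can be sketched as (a toy model; a block whose distinct-value count exceeds N stores no set and therefore can never be skipped):

```python
# set(N) pruning: keep a granule if its value set contains the target,
# or if the set overflowed N distinct values (then nothing is known
# about the granule and it must be read).
def prune_set(granule_values, value, n):
    kept = []
    for i, values in enumerate(granule_values):
        distinct = set(values)
        if len(distinct) > n or value in distinct:
            kept.append(i)
    return kept

granules = [["north", "south"], ["east", "east"], ["a", "b", "c", "d"]]
print(prune_set(granules, "east", n=3))  # [1, 2]: granule 2 overflowed
```

This is why set(N) only pays off on low-cardinality columns: on high-cardinality data every block overflows and the index prunes nothing.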

2.4 The bloom_filter Index

-- bloom_filter index: suited to equality lookups on high-cardinality columns
CREATE TABLE user_actions_bf
(
    action_time DateTime,
    user_id UInt64,
    action String,
    url String,
    ip IPv4,
    INDEX idx_user_id user_id TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY action_time;
-- bloom_filter(0.01): 1% false-positive rate

-- for a high-cardinality user_id lookup, the bloom filter quickly rules out granules that cannot contain it
SELECT count() FROM user_actions_bf WHERE user_id = 12345;
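The 0.01 argument trades space for accuracy. The classic bloom-filter sizing formulas (standard math, independent of ClickHouse internals) show roughly what that costs per index block:

```python
import math

# Classic bloom-filter sizing: m = -n*ln(p)/ln(2)^2 bits and
# k = (m/n)*ln(2) hash functions for n elements at false-positive rate p.
def bloom_params(n, p):
    bits = math.ceil(-n * math.log(p) / math.log(2) ** 2)
    hashes = max(1, round(bits / n * math.log(2)))
    return bits, hashes

# one granule of 8192 distinct user_ids at a 1% false-positive rate:
bits, hashes = bloom_params(8192, 0.01)
print(f"{bits // 8} bytes, {hashes} hash functions")  # ~9.6 KiB, 7 hashes
```

Tightening the rate to 0.001 adds roughly 50% more bits, which is why very low rates are usually reserved for columns like trace IDs where a false positive is expensive.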

2.5 The ngrambf_v1 Index (Text Search)

-- n-gram bloom filter: suited to substring search
CREATE TABLE fulltext_search
(
    doc_id UInt64,
    title String,
    content String,
    created_at DateTime,
    INDEX idx_content content TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY created_at;
-- ngrambf_v1(ngram_size, bloom_filter_size_bytes, hash_functions, random_seed)
-- ngram_size=3: split text into 3-grams (trigrams)

-- find documents whose content contains a substring
SELECT doc_id, title
FROM fulltext_search
WHERE content LIKE '%ClickHouse%';

2.6 The tokenbf_v1 Index (Word Search)

-- token bloom filter: splits text on whitespace/punctuation, then indexes the tokens
CREATE TABLE articles
(
    article_id UInt64,
    title String,
    tags Array(String),
    content String,
    published_at DateTime,
    INDEX idx_content content TYPE tokenbf_v1(1024, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY published_at;
-- tokenbf_v1(bloom_filter_size_bytes, hash_functions, random_seed)

-- search by whole word
SELECT article_id, title
FROM articles
WHERE hasToken(content, 'Python');

💡 Tip: tokenbf_v1 splits on punctuation and whitespace, which suits space-delimited languages such as English. For Chinese text, combine it with ngrambf_v1.
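The difference between the two tokenizers can be illustrated as follows (simplified; ClickHouse's own tokenizers differ in detail, but the contrast holds):

```python
import re

# ngrambf_v1 indexes overlapping character n-grams, so it can match
# substrings even in unsegmented CJK text; tokenbf_v1 splits on
# non-alphanumeric characters and only matches whole tokens.
def ngrams(s, n=3):
    return [s[i:i + n] for i in range(len(s) - n + 1)]

def tokens(s):
    return re.findall(r"[A-Za-z0-9]+", s)

print(ngrams("点击事件", n=2))                   # ['点击', '击事', '事件']
print(tokens("connection timeout at host-3"))  # ['connection', 'timeout', 'at', 'host', '3']
```

Chinese has no word separators, so token splitting yields nothing useful there, while n-grams still let any substring be probed against the bloom filter.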

2.7 The vocabulary Index

-- dictionary index: suited to exact matches on high-cardinality strings
CREATE TABLE products
(
    product_id UInt64,
    sku String,
    name String,
    price Decimal128(2),
    INDEX idx_sku sku TYPE vocabulary(10) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY product_id;

SELECT * FROM products WHERE sku = 'SKU-12345-XYZ';

3. Creating and Managing Indexes

3.1 Creating Indexes

-- define indexes when creating the table
CREATE TABLE my_table
(
    id UInt64,
    name String,
    category LowCardinality(String),
    tags Array(String),
    content String,
    INDEX idx_name name TYPE bloom_filter(0.01) GRANULARITY 4,
    INDEX idx_category category TYPE set(10) GRANULARITY 4,
    INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 2,
    INDEX idx_content content TYPE tokenbf_v1(1024, 3, 0) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY id;

-- add an index to an existing table (it applies only to newly inserted data)
ALTER TABLE my_table ADD INDEX idx_name name TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE my_table MATERIALIZE INDEX idx_name;  -- build the index for existing parts

-- drop an index
ALTER TABLE my_table DROP INDEX idx_name;

3.2 The GRANULARITY Parameter

-- GRANULARITY = N: one skip-index mark covers N granules
-- with GRANULARITY = 4 and index_granularity = 8192:
-- one skip-index mark per 4 * 8192 = 32768 rows

-- GRANULARITY = 1: one mark per granule (most precise, largest memory/disk footprint)
-- GRANULARITY = 8: one mark per 8 granules (more compact, but prunes less precisely)

-- Guidelines:
-- high-selectivity columns (e.g. user_id): GRANULARITY 1~4
-- low-selectivity columns (e.g. status): GRANULARITY 4~8
-- text-search columns: GRANULARITY 2~4
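The cost of a mark that fails to prune follows directly from that multiplication; a quick sanity check:

```python
# One skip-index mark covers GRANULARITY * index_granularity rows, so
# every mark that cannot be pruned forces that many rows to be read.
def rows_per_skip_mark(granularity, index_granularity=8192):
    return granularity * index_granularity

print(rows_per_skip_mark(4))  # 32768
print(rows_per_skip_mark(1))  # 8192: most precise, but the largest index
```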

4. Analyzing Index Usage

4.1 Checking Whether an Index Is Used

-- Method 1: EXPLAIN
EXPLAIN indexes = 1
SELECT count() FROM my_table WHERE user_id = 12345;

-- Method 2: inspect the query log
SELECT
    query_id,
    query,
    read_rows,
    read_bytes,
    result_rows,
    query_duration_ms
FROM system.query_log
WHERE type = 'QueryFinish'
  AND query LIKE '%my_table%'
ORDER BY event_time DESC
LIMIT 10;

-- Method 3: per-query scan statistics by table
SELECT
    query_id,
    tables[1] AS table_name,
    read_rows,
    formatReadableSize(read_bytes) AS read_size
FROM system.query_log
WHERE type = 'QueryFinish'
  AND tables[1] LIKE '%my_table%'
ORDER BY event_time DESC
LIMIT 10;

4.2 Sample EXPLAIN Output

EXPLAIN indexes = 1
SELECT count() FROM user_actions_bf WHERE user_id = 12345;

-- the output looks roughly like:
-- Expression ((Projection + Before ORDER BY))
--   Aggregating
--     Filter (WHERE)
--       ReadFromMergeTree (default.user_actions_bf)
--         Indexes:
--           PrimaryKey
--             Keys:
--               action_time
--             Condition: true
--             Parts: 5/5
--             Granules: 120/120
--           Skip
--             Name: idx_user_id
--             Description: bloom_filter GRANULARITY 4
--             Parts: 5/5
--             Granules: 30/120  ← only 30 of the 120 granules are scanned

4.3 Measuring Index Effectiveness

-- measure how much a given index type speeds up a query
-- create a test table
CREATE TABLE idx_benchmark
(
    event_date Date,
    user_id UInt64,
    category LowCardinality(String),
    ip IPv4,
    url String
)
ENGINE = MergeTree()
ORDER BY event_date;

-- baseline query, no skip index yet
SET send_logs_level = 'trace';
SELECT count() FROM idx_benchmark WHERE user_id = 12345;

-- add a bloom_filter index
ALTER TABLE idx_benchmark ADD INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4;
ALTER TABLE idx_benchmark MATERIALIZE INDEX idx_user;

-- run the query again and compare read_rows
SELECT count() FROM idx_benchmark WHERE user_id = 12345;

5. Skip Index Design Strategy

5.1 Choose the Index Type from the Query Pattern

| Query pattern | Recommended index | Example |
|---|---|---|
| WHERE col = value (high cardinality) | bloom_filter | user_id, ip |
| WHERE col = value (low cardinality) | set or minmax | status, region |
| WHERE col BETWEEN a AND b | minmax | timestamp, amount |
| WHERE hasToken(col, word) | tokenbf_v1 | log messages |
| WHERE col LIKE '%keyword%' | ngrambf_v1 | Chinese-text / substring search |
| WHERE has(arr_col, value) | bloom_filter | array columns |

5.2 Keep the Number of Indexes in Check

-- ✅ Reasonable: 2~5 skip indexes
CREATE TABLE reasonable_indexes
(
    id UInt64,
    user_id UInt64,
    ip IPv4,
    category LowCardinality(String),
    content String,
    INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4,
    INDEX idx_ip ip TYPE bloom_filter(0.01) GRANULARITY 4,
    INDEX idx_category category TYPE set(10) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY id;

-- ❌ Too many: 10+ skip indexes
-- every index adds memory and disk overhead, and too many of them degrade write throughput

⚠️ Note: skip indexes consume extra disk space and memory. If an index is rarely hit by queries, drop it to reclaim those resources.


6. Index Space Overhead

-- inspect index space usage per part
SELECT
    table,
    name AS part_name,
    formatReadableSize(primary_key_bytes_in_memory) AS pk_size,
    formatReadableSize(secondary_indices_compressed_bytes) AS idx_size,
    formatReadableSize(bytes_on_disk) AS total_size,
    round(secondary_indices_compressed_bytes / bytes_on_disk * 100, 2) AS idx_ratio
FROM system.parts
WHERE active = 1
  AND database = currentDatabase()
ORDER BY secondary_indices_compressed_bytes DESC;

-- total index size across all tables
SELECT
    table,
    formatReadableSize(sum(primary_key_bytes_in_memory)) AS total_pk,
    formatReadableSize(sum(secondary_indices_compressed_bytes)) AS total_idx,
    formatReadableSize(sum(bytes_on_disk)) AS total_size
FROM system.parts
WHERE active = 1 AND database = currentDatabase()
GROUP BY table
ORDER BY sum(secondary_indices_compressed_bytes) DESC;

7. Worked Examples

7.1 Log Search System

CREATE TABLE app_logs
(
    log_time DateTime,
    service LowCardinality(String),
    level LowCardinality(String),
    host LowCardinality(String),
    trace_id String,
    user_id UInt64,
    message String,
    -- skip index design:
    INDEX idx_host host TYPE set(20) GRANULARITY 4,                       -- filter by host
    INDEX idx_level level TYPE set(5) GRANULARITY 2,                      -- filter by log level
    INDEX idx_user user_id TYPE bloom_filter(0.01) GRANULARITY 4,         -- per-user lookups
    INDEX idx_trace trace_id TYPE bloom_filter(0.001) GRANULARITY 4,      -- distributed tracing
    INDEX idx_message message TYPE tokenbf_v1(2048, 3, 0) GRANULARITY 2   -- keyword search
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(log_time)   -- daily partitions enable partition pruning
ORDER BY (log_time, service)
TTL log_time + INTERVAL 90 DAY DELETE;

-- Query: find one user's ERROR logs
SELECT log_time, service, message
FROM app_logs
WHERE log_time >= now() - INTERVAL 1 DAY
  AND user_id = 12345
  AND level = 'ERROR'
ORDER BY log_time DESC;
-- uses: partition pruning + the sorting key + bloom_filter(user_id) + set(level)

-- Query: full-text search over log messages
SELECT log_time, service, message
FROM app_logs
WHERE log_time >= now() - INTERVAL 1 DAY
  AND hasToken(message, 'timeout')
ORDER BY log_time DESC;
-- uses: the tokenbf_v1 index skips granules that do not contain 'timeout'

7.2 User Profile System

CREATE TABLE user_profiles
(
    user_id UInt64,
    age UInt8,
    gender LowCardinality(String),
    city LowCardinality(String),
    interests Array(String),
    last_active DateTime,
    -- only interests needs a skip index (minmax/set are not a fit for Array columns)
    INDEX idx_interests interests TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY user_id;

-- Query: find users interested in 'AI'
SELECT user_id, age, city
FROM user_profiles
WHERE has(interests, 'AI')
  AND city = 'Beijing';
-- interests is pruned via its bloom_filter index
-- city has no skip index here; it is filtered during the scan, which LowCardinality keeps cheap

7.3 E-commerce Product Search

CREATE TABLE product_catalog
(
    product_id UInt64,
    sku String,
    name String,
    brand LowCardinality(String),
    category_id UInt32,
    tags Array(String),
    description String,
    price Decimal128(2),
    created_at DateTime,
    INDEX idx_sku sku TYPE vocabulary(10) GRANULARITY 4,
    INDEX idx_brand brand TYPE set(100) GRANULARITY 4,
    INDEX idx_name name TYPE ngrambf_v1(3, 256, 2, 0) GRANULARITY 4,
    INDEX idx_tags tags TYPE bloom_filter(0.01) GRANULARITY 4
)
ENGINE = MergeTree()
ORDER BY product_id;

-- exact lookup by SKU
SELECT * FROM product_catalog WHERE sku = 'SKU-12345';

-- filter by brand
SELECT count() FROM product_catalog WHERE brand = 'Apple';

-- substring search on the name
SELECT product_id, name, price
FROM product_catalog
WHERE name LIKE '%iPhone%';

-- filter by tags
SELECT product_id, name
FROM product_catalog
WHERE has(tags, 'hot_sale') AND has(tags, 'electronics');

8. Index Design Checklist

| Check | Recommendation |
|---|---|
| ✅ Primary key index | a prefix of the sorting key, ≤ 5 columns |
| ✅ Number of skip indexes | keep the total between 2 and 6 |
| ✅ Index type | choose by query pattern (bloom_filter / set / minmax / tokenbf) |
| ✅ GRANULARITY | 1~4 for high-selectivity columns, 4~8 for low-selectivity columns |
| ✅ Index space | monitor index sizes regularly; drop unused indexes |
| ✅ Query verification | confirm with EXPLAIN that indexes are actually used |
