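Chapter 6: Performance Tuning

6.1 Tuning Principles

Before tuning jemalloc, identify the two core metrics:

Metric	Description	Prioritize when
Throughput	malloc/free operations completed per second	The workload is compute-intensive
Memory footprint (RSS)	The process's resident set size	Memory is constrained

Important: throughput and memory footprint usually trade off against each other. Tuning comes down to finding the best balance between the two for your workload.

Tuning decision tree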

                   Start tuning
                         │
            ┌────────────▼────────────┐
            │ What is the bottleneck? │
            └──────┬───────────┬──────┘
                   │           │
        ┌──────────▼───┐   ┌───▼──────────┐
        │ High CPU use │   │ High memory  │
        │ (slow alloc) │   │ (large RSS)  │
        └──────────┬───┘   └───┬──────────┘
                   │           │
  ┌────────────────▼───┐   ┌───▼──────────────────────┐
  │ Increase tcache    │   │ Lower dirty_decay_ms     │
  │ Adjust arena count │   │ Reduce arena count       │
  │ Disable profiling  │   │ Enable background_thread │
  └────────────────────┘   └──────────────────────────┘
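
The decision tree can be written down as a small lookup. The sketch below is illustrative only: the `bottleneck` enum and the `starting_conf` function are hypothetical names, and the option values are simply the examples used later in this chapter, not universal recommendations.

```c
#include <stddef.h>

/* Hypothetical names: this enum and function are illustrative only. */
enum bottleneck { BOTTLENECK_CPU, BOTTLENECK_MEMORY };

/* Returns a MALLOC_CONF-style starting point for the given bottleneck.
 * The values are the chapter's own examples; measure before adopting them.
 * Note: older jemalloc releases may not accept tcache_max by that name. */
static const char *starting_conf(enum bottleneck b) {
    switch (b) {
    case BOTTLENECK_CPU:
        /* Larger thread cache, moderate arena count. */
        return "tcache_max:65536,narenas:16";
    case BOTTLENECK_MEMORY:
        /* Faster decay, fewer arenas, background purging. */
        return "dirty_decay_ms:1000,narenas:8,background_thread:true";
    }
    return NULL;
}
```

The returned string is meant as a first candidate for MALLOC_CONF, to be refined by the experiments described in the following sections.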

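6.2 Tuning the Arena Count

Why the arena count matters

Arena count	Effect	Use case
Too many	Higher metadata overhead; RSS grows	-
Too few	More lock contention; throughput drops	-
Moderate	Balances performance and memory	✅ Recommended

Default and recommended values

Default: 4 × the number of CPUs (1 on a single-CPU system)
Suggested starting point: min(4 × ncpu, 16), or simply 16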
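
The starting-point formula above is easy to compute at run time. This is a minimal sketch; `recommended_narenas` and `recommended_narenas_here` are hypothetical helper names, and the cap of 16 is the chapter's suggestion rather than a jemalloc constant.

```c
#include <unistd.h>

/* Starting point from the text: min(4 * ncpu, 16). */
static unsigned recommended_narenas(long ncpu) {
    if (ncpu < 1) ncpu = 1;          /* sysconf may fail and return -1 */
    long n = 4 * ncpu;
    return (unsigned)(n < 16 ? n : 16);
}

/* Convenience wrapper that uses the number of online CPUs. */
static unsigned recommended_narenas_here(void) {
    return recommended_narenas(sysconf(_SC_NPROCESSORS_ONLN));
}
```

On a 64-core machine this yields 16 instead of the default 256; on a 2-core machine it yields 8, which matches the default.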

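# Show the number of CPUs available
nproc

# On a 64-core machine the default is 256 arenas, which may be too many
# Consider reducing to 8-16
export MALLOC_CONF="narenas:8"

Tuning experiment

// arena_test.c - measures how the arena count affects performance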
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <time.h>
#include <jemalloc/jemalloc.h>

#define N_THREADS  16
#define N_ALLOCS   100000
#define OBJ_SIZE   256

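static void *worker(void *arg) {
    (void)arg;
    // Keep the pointer table on the heap: ~800 KB is risky on a small thread stack
    void **ptrs = malloc(N_ALLOCS * sizeof(*ptrs));
    if (ptrs == NULL) return NULL;
    for (int i = 0; i < N_ALLOCS; i++) {
        ptrs[i] = malloc(OBJ_SIZE);
        if (ptrs[i]) memset(ptrs[i], 0, OBJ_SIZE);  // touch the memory
    }
    for (int i = 0; i < N_ALLOCS; i++) {
        free(ptrs[i]);
    }
    free(ptrs);
    return NULL;
}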

int main(int argc, char *argv[]) {
    pthread_t threads[N_THREADS];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int i = 0; i < N_THREADS; i++)
        pthread_create(&threads[i], NULL, worker, NULL);
    for (int i = 0; i < N_THREADS; i++)
        pthread_join(threads[i], NULL);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    printf("narenas=%-4s  elapsed=%.3fs\n",
           argc > 1 ? argv[1] : "default", elapsed);

    return 0;
}

gcc -O2 -o arena_test arena_test.c -ljemalloc -lpthread

# Compare different arena counts
for n in 1 4 8 16 32 64; do
    MALLOC_CONF="narenas:$n" ./arena_test $n
done

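Typical results (16 threads on a 64-core machine; actual numbers depend on hardware and jemalloc version):

narenas=1     elapsed=2.831s   # heavy lock contention
narenas=4     elapsed=1.204s
narenas=8     elapsed=0.856s   # recommended
narenas=16    elapsed=0.812s   # recommended
narenas=32    elapsed=0.835s
narenas=64    elapsed=0.891s   # more metadata overhead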

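6.3 Tuning Dirty-Page Purging

Dirty and muzzy pages

State	Description	madvise call
Clean	Unused and returned to the OS	-
Dirty	Freed by the application but still held by jemalloc	-
Muzzy	Marked with madvise(MADV_FREE); the OS may reclaim it under memory pressure	`MADV_FREE`

Memory freed → dirty pages (still counted in RSS)
               │
               ├─ after dirty_decay_ms → muzzy pages (MADV_FREE; RSS may drop)
               │
               └─ after muzzy_decay_ms → clean pages (MADV_DONTNEED; RSS drops)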
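
The timeline above can be expressed as a small model. This is a deliberate simplification, not jemalloc code: jemalloc purges gradually along a decay curve, whereas the sketch treats each decay time as a hard deadline, so it describes the latest state a page can still be in. The names `page_state` and `page_state_after` are hypothetical.

```c
/* Hypothetical model of the dirty -> muzzy -> clean timeline. */
enum page_state { PAGE_DIRTY, PAGE_MUZZY, PAGE_CLEAN };

/* State of a page `elapsed_ms` after it was freed, treating each decay
 * time as a hard deadline. A decay time of -1 disables that transition;
 * a decay time of 0 makes it immediate. */
static enum page_state page_state_after(long elapsed_ms,
                                        long dirty_decay_ms,
                                        long muzzy_decay_ms) {
    if (dirty_decay_ms < 0 || elapsed_ms < dirty_decay_ms)
        return PAGE_DIRTY;
    /* The muzzy clock starts when the page leaves the dirty state. */
    if (muzzy_decay_ms < 0 || elapsed_ms - dirty_decay_ms < muzzy_decay_ms)
        return PAGE_MUZZY;
    return PAGE_CLEAN;
}
```

With `dirty_decay_ms:1000,muzzy_decay_ms:3000`, a page freed 1.5 seconds ago is modeled as muzzy and one freed 4 seconds ago as clean.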

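Parameter tuning

# Scenario 1: memory-sensitive (containerized services)
export MALLOC_CONF="dirty_decay_ms:1000,muzzy_decay_ms:3000"
# Effect: memory returns to the OS quickly and RSS stays stable

# Scenario 2: performance first (batch jobs)
export MALLOC_CONF="dirty_decay_ms:30000,muzzy_decay_ms:60000"
# Effect: fewer system calls and higher throughput

# Scenario 3: never return memory (extreme performance tuning)
export MALLOC_CONF="dirty_decay_ms:-1,muzzy_decay_ms:-1"
# Effect: RSS never shrinks on its own, but allocation is fastest because no purging work is done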
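
Besides the environment variable, jemalloc also reads a global string named `malloc_conf` if the program defines one, which lets defaults ship inside the binary. A minimal sketch, assuming an unprefixed jemalloc build (prefixed builds look for a prefixed symbol such as `je_malloc_conf`); the values are scenario 1 from above.

```c
/* jemalloc reads this global during initialization if the program defines it.
 * Assumption: the library was built without a symbol prefix. */
const char *malloc_conf = "dirty_decay_ms:1000,muzzy_decay_ms:3000";
```

MALLOC_CONF is processed after this string, so an operator can still override the compiled-in defaults from the environment.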

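Triggering a purge manually

#include <stdio.h>
#include <jemalloc/jemalloc.h>

// Purge unused dirty pages in every arena.
// Uses the unprefixed mallctl() that a default jemalloc build exports;
// builds configured with --with-jemalloc-prefix=je_ export je_mallctl() instead.
void purge_all_arenas(void) {
    unsigned narenas;
    size_t len = sizeof(narenas);
    if (mallctl("arenas.narenas", &narenas, &len, NULL, 0) != 0)
        return;

    for (unsigned i = 0; i < narenas; i++) {
        char cmd[64];
        snprintf(cmd, sizeof(cmd), "arena.%u.purge", i);
        mallctl(cmd, NULL, NULL, NULL, 0);
    }
}

// Call when memory is tight
void on_memory_pressure(void) {
    purge_all_arenas();
}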
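
Deciding when memory is "tight" is left to the application. One possible trigger, sketched here for Linux only, is to compare the process RSS from `/proc/self/statm` against a limit. The helper names `read_rss_bytes` and `rss_over_limit` and the idea of a caller-chosen limit are assumptions for illustration.

```c
#include <stdio.h>
#include <unistd.h>

/* Returns the resident set size in bytes, or -1 on failure (Linux only). */
static long read_rss_bytes(void) {
    long total_pages, resident_pages;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL) return -1;
    /* The first two fields are total program size and resident pages. */
    int n = fscanf(f, "%ld %ld", &total_pages, &resident_pages);
    fclose(f);
    if (n != 2) return -1;
    return resident_pages * sysconf(_SC_PAGESIZE);
}

/* Hypothetical policy: pressure means RSS is known and above the limit. */
static int rss_over_limit(long rss_bytes, long limit_bytes) {
    return rss_bytes >= 0 && rss_bytes > limit_bytes;
}
```

A periodic task could then call `on_memory_pressure()` whenever `rss_over_limit(read_rss_bytes(), limit)` is true.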

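6.4 Tuning the Thread Cache

The tcache_max parameter

# The default tcache upper bound is 32 KiB in jemalloc 5.x (check your version)
# Raising it can speed up allocation of medium-sized objects
# Older releases may only accept lg_tcache_max (base-2 log, e.g. 16 for 64 KiB)
export MALLOC_CONF="tcache_max:65536"  # 64 KiB

# Show the current value
MALLOC_CONF="stats_print:true" ./my_program 2>&1 | grep tcache_max

How the tcache affects different size distributions

Object size distribution	Suggested tcache_max	Notes
Mostly < 4 KB	32768 (default)	The default is sufficient
Mostly 4 KB - 64 KB	65536	A larger tcache covers more size classes
Mostly > 64 KB	No effect	Objects larger than tcache_max bypass the tcache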
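
Some jemalloc releases express this limit as a base-2 logarithm through `lg_tcache_max` rather than as a byte count. The helper below converts a byte size to that form; `lg_ceil` is a hypothetical name, and whether your release needs the logarithmic form is something to confirm against its manual.

```c
#include <stddef.h>

/* Smallest n such that 2^n >= bytes (returns 0 for bytes <= 1). */
static unsigned lg_ceil(size_t bytes) {
    unsigned n = 0;
    size_t v = 1;
    /* The bound on n prevents v from overflowing for huge inputs. */
    while (v < bytes && n < sizeof(size_t) * 8 - 1) {
        v <<= 1;
        n++;
    }
    return n;
}
```

For example, a 64 KiB limit corresponds to `lg_tcache_max:16` and the 32 KiB default to 15.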

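6.5 Transparent Huge Pages (THP)

Background

Linux transparent huge pages (THP) let the kernel merge many 4 KB pages into 2 MB huge pages, which reduces TLB misses and can improve performance.

jemalloc and THP

Parameter	Description	Default
`metadata_thp`	Whether jemalloc's metadata may use THP (name used in MALLOC_CONF)	`disabled`
`opt.metadata_thp`	The same setting as read through mallctl	`disabled`

Possible values:

disabled: do not use THP
auto: use THP when needed (starts without it and switches once metadata usage grows)
always: always use THP

# Enable THP for metadata (can reduce TLB misses)
export MALLOC_CONF="metadata_thp:auto"

# Check the system-wide THP setting
cat /sys/kernel/mm/transparent_hugepage/enabled
# [always] madvise never

# Verify the effect
MALLOC_CONF="metadata_thp:auto,stats_print:true" ./my_program 2>&1 | grep thp

Note: THP can cause latency spikes in some workloads (for example from kernel memory compaction in kcompactd or khugepaged, or from direct compaction on a page fault). Test under production conditions before deciding whether to enable it.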
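
The sysfs file shown above marks the active mode with square brackets. A program that wants to log or check the mode can extract that token; the following is a minimal sketch with a hypothetical helper name, `thp_active_mode`, operating on a line already read from the file.

```c
#include <stddef.h>
#include <string.h>

/* Copies the bracketed token of `line` into `out` (capacity `n`).
 * Returns 0 on success, -1 if there is no [token] or it does not fit. */
static int thp_active_mode(const char *line, char *out, size_t n) {
    const char *open = strchr(line, '[');
    if (open == NULL) return -1;
    const char *close = strchr(open, ']');
    if (close == NULL) return -1;
    size_t len = (size_t)(close - open - 1);
    if (len + 1 > n) return -1;
    memcpy(out, open + 1, len);
    out[len] = '\0';
    return 0;
}
```

Given the line `always [madvise] never`, the helper writes `madvise` to the output buffer.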

6.6 NUMA Optimization

NUMA architecture overview

┌──────────────────────┐    ┌──────────────────────┐
│     NUMA Node 0      │    │     NUMA Node 1      │
│  ┌─────────────────┐ │    │ ┌─────────────────┐  │
│  │ CPU 0-7         │ │    │ │ CPU 8-15        │  │
│  │ Local Memory    │ │ QPI│ │ Local Memory    │  │
│  └─────────────────┘ │◄──►│ └─────────────────┘  │
└──────────────────────┘    └──────────────────────┘

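Accessing memory on a remote NUMA node can take roughly 2-3 times as long as a local access; the exact ratio depends on the hardware.

NUMA support in jemalloc

jemalloc has no explicit NUMA awareness, but its default arena assignment, combined with the kernel's first-touch page placement, already gives some NUMA locality:

A thread is usually scheduled on the CPUs of one NUMA node
Pages of the arena that thread uses are then placed on the NUMA node local to that CPU

Because one arena can be shared by threads running on different nodes, this locality is not guaranteed.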

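Further optimization

// Pin a thread to one NUMA node and bind it to a matching arena
#define _GNU_SOURCE   // for cpu_set_t, CPU_SET and pthread_setaffinity_np
#include <pthread.h>
#include <sched.h>
#include <jemalloc/jemalloc.h>

void setup_numa_thread(int node, unsigned arena_id) {
    // Pin to the node's CPUs
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    // Simplified: assumes node 0 owns CPUs 0-7 and node 1 owns CPUs 8-15
    for (int i = node * 8; i < (node + 1) * 8; i++) {
        CPU_SET(i, &cpuset);
    }
    pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);

    // Bind this thread to the arena. Uses the unprefixed mallctl() of a
    // default build; arena_id must refer to an existing arena.
    mallctl("thread.arena", NULL, NULL, &arena_id, sizeof(arena_id));
}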
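
The example above hard-codes which CPUs belong to each node. On Linux the real mapping is published in `/sys/devices/system/node/node<N>/cpulist` as a string such as `0-7,16-23`. The sketch below parses that format into a plain array; `parse_cpulist` is a hypothetical helper, and reading the file and feeding the result to `CPU_SET` are left to the caller.

```c
#include <stdlib.h>

/* Parses a Linux cpulist string such as "0-7,16-23" into `cpus`.
 * Returns the number of CPUs written, or -1 on malformed input or
 * if more than `max` CPUs would be produced. */
static int parse_cpulist(const char *s, int *cpus, int max) {
    int count = 0;
    while (*s != '\0' && *s != '\n') {
        char *end;
        long lo = strtol(s, &end, 10);
        if (end == s || lo < 0) return -1;
        long hi = lo;
        if (*end == '-') {               /* range "lo-hi" */
            s = end + 1;
            hi = strtol(s, &end, 10);
            if (end == s || hi < lo) return -1;
        }
        for (long c = lo; c <= hi; c++) {
            if (count >= max) return -1;
            cpus[count++] = (int)c;
        }
        if (*end == ',') end++;          /* skip the separator */
        else if (*end != '\0' && *end != '\n') return -1;
        s = end;
    }
    return count;
}
```

Each returned CPU number can be passed to `CPU_SET` in place of the fixed `node * 8` loop.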

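Key NUMA recommendations

Recommendation	Rationale
Arena count ≥ NUMA node count	Lets each node have its own arena
Pin threads to CPUs	Keeps threads from migrating across nodes
Use `numactl`	Restricts the process to specific NUMA nodes

# Restrict the process to NUMA node 0
numactl --cpunodebind=0 --membind=0 ./my_server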

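6.7 Background Thread

Purpose

With background threads enabled, jemalloc creates dedicated threads that purge dirty pages asynchronously, so application threads do not have to do this work on their allocation and deallocation paths.

# Enable background threads
export MALLOC_CONF="background_thread:true"

# Inspect background thread statistics
MALLOC_CONF="background_thread:true,stats_print:true" ./my_program 2>&1 | grep background

Parameter	Description	Default
`background_thread`	Enable background threads	`false`
`max_background_threads`	Maximum number of background threads	`ncpu` (may differ by version)

Effects

Less allocation latency jitter: dirty-page purging no longer blocks malloc/free
CPU cost is moved: from the allocation path to the background threads
Suited to latency-sensitive workloads: for example real-time systems and API services

# Suited to high-concurrency, low-latency services
export MALLOC_CONF="background_thread:true,dirty_decay_ms:2000"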

6.8 A Tuning Experiment Framework

// tune_bench.c - systematic tuning benchmark
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <pthread.h>
#include <time.h>
#include <jemalloc/jemalloc.h>

typedef struct {
    int id;
    int n_allocs;
    int obj_size_min;
    int obj_size_max;
} thread_arg_t;

static void *bench_thread(void *arg) {
    thread_arg_t *a = (thread_arg_t *)arg;
    // Per-thread seed: rand() uses shared global state, which adds contention
    unsigned seed = (unsigned)a->id + 1;
    void **ptrs = malloc(a->n_allocs * sizeof(void *));
    if (ptrs == NULL) return NULL;

    for (int i = 0; i < a->n_allocs; i++) {
        size_t sz = a->obj_size_min +
                    rand_r(&seed) % (a->obj_size_max - a->obj_size_min + 1);
        ptrs[i] = malloc(sz);
        if (ptrs[i]) memset(ptrs[i], 0xAB, sz);
    }
    for (int i = 0; i < a->n_allocs; i++) {
        free(ptrs[i]);
    }
    free(ptrs);
    return NULL;
}

int main(int argc, char *argv[]) {
    int n_threads = argc > 1 ? atoi(argv[1]) : 8;
    int n_allocs  = argc > 2 ? atoi(argv[2]) : 50000;

    pthread_t *threads = malloc(n_threads * sizeof(pthread_t));
    thread_arg_t *args = malloc(n_threads * sizeof(thread_arg_t));

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);

    for (int i = 0; i < n_threads; i++) {
        args[i] = (thread_arg_t){i, n_allocs, 16, 4096};
        pthread_create(&threads[i], NULL, bench_thread, &args[i]);
    }
    for (int i = 0; i < n_threads; i++) {
        pthread_join(threads[i], NULL);
    }

    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double ops = (double)n_threads * n_allocs * 2 / elapsed;

    printf("threads=%d allocs=%d elapsed=%.3fs ops=%.0f/s\n",
           n_threads, n_allocs, elapsed, ops);

    free(threads);
    free(args);
    return 0;
}

gcc -O2 -o tune_bench tune_bench.c -ljemalloc -lpthread

# Test script
cat << 'SCRIPT' > tune_test.sh
#!/bin/bash
echo "=== Arena Count Test ==="
for n in 1 4 8 16 32; do
    printf "narenas=%-3d " $n
    MALLOC_CONF="narenas:$n" ./tune_bench 16 50000
done

echo ""
echo "=== Dirty Decay Test ==="
for ms in 0 1000 5000 30000 -1; do
    printf "dirty_decay=%-6d " $ms
    MALLOC_CONF="dirty_decay_ms:$ms" ./tune_bench 16 50000
done

echo ""
echo "=== Background Thread Test ==="
for bt in true false; do
    printf "bg_thread=%-5s " $bt
    MALLOC_CONF="background_thread:$bt" ./tune_bench 16 50000
done
SCRIPT
chmod +x tune_test.sh
./tune_test.sh
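
A single run per configuration is sensitive to noise from scheduling and caching. One common remedy, sketched here as an optional extension rather than part of the framework above, is to repeat each configuration several times and compare medians. `median_seconds` is a hypothetical helper name.

```c
#include <stdlib.h>

/* Orders two doubles for qsort without risking overflow from subtraction. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts `t` in place and returns the median of its `n` values (n > 0). */
static double median_seconds(double *t, size_t n) {
    qsort(t, n, sizeof *t, cmp_double);
    return (n % 2) ? t[n / 2] : (t[n / 2 - 1] + t[n / 2]) / 2.0;
}
```

Collecting the `elapsed` values of, say, five runs into an array and reporting `median_seconds` makes the comparison between settings less dependent on one outlier.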

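6.9 Tuning Parameter Quick Reference

Parameter	Tuning direction	Suggested range
`narenas`	Lower → less memory; higher → more concurrency	4-16
`dirty_decay_ms`	Lower → less memory; higher → better performance	1000-30000
`muzzy_decay_ms`	Lower → less memory; higher → better performance	3000-60000
`tcache_max`	Higher → faster medium-sized allocations	16 KB-128 KB
`background_thread`	On → less latency jitter	true/false
`metadata_thp`	On → fewer TLB misses	auto/always/disabled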
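
All of these parameters travel in one comma-separated `key:value` string. When scripting experiments it can help to check which value a given string sets for a key. The sketch below is a hypothetical helper, `conf_lookup`, not a jemalloc API; it returns the last occurrence of a key because options are applied left to right and later ones override earlier ones.

```c
#include <stddef.h>
#include <string.h>

/* Looks up `key` in a "k1:v1,k2:v2" string and copies its value to `out`.
 * The last occurrence wins. Returns 0 on success, -1 if the key is
 * absent or a matching value does not fit in `n` bytes. */
static int conf_lookup(const char *conf, const char *key, char *out, size_t n) {
    size_t klen = strlen(key);
    const char *p = conf;
    int found = -1;
    while (*p != '\0') {
        const char *end = strchr(p, ',');
        size_t plen = end ? (size_t)(end - p) : strlen(p);
        if (plen > klen && strncmp(p, key, klen) == 0 && p[klen] == ':') {
            size_t vlen = plen - klen - 1;
            if (vlen + 1 > n) return -1;
            memcpy(out, p + klen + 1, vlen);
            out[vlen] = '\0';
            found = 0;
        }
        p += plen;
        if (*p == ',') p++;
    }
    return found;
}
```

For the string `narenas:8,dirty_decay_ms:1000`, looking up `dirty_decay_ms` yields `1000`.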

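6.10 Chapter Summary

Tuning goal	Key parameters
Lower latency	`background_thread:true`, larger `tcache_max`
Lower memory	Smaller `narenas`, smaller `dirty_decay_ms`
Higher throughput	Larger `narenas`, larger `dirty_decay_ms`, profiling disabled
NUMA optimization	Arena count ≥ NUMA node count, plus CPU pinning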
