
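15 - Performance Analysis

Learn to profile programs with gprof, perf, and flame graphs, find the hot functions, and optimize them.

15.1 Overview of Performance Analysis

Profiling measures where a program actually spends its time, and it is the key technique for locating bottlenecks.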

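Tool	Type	Intrusiveness	Accuracy	Notes
gprof	Sampling + instrumentation	Requires recompiling	Medium	GNU Profiler, the classic tool
perf	Hardware counters	No recompiling needed	High	Linux kernel performance tool
Cachegrind	Simulation	No recompiling needed	High	Valgrind cache simulation
Callgrind	Simulation	No recompiling needed	High	Valgrind call graph
Intel VTune	Hardware	No recompiling needed	Highest	Proprietary tool from Intel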

15.2 gprof

Basic usage

# Step 1: compile and link with -pg
gcc -pg -O2 -o hello main.c

# Step 2: run the program (gmon.out is written when it exits normally)
./hello

# Step 3: analyze the results
gprof hello gmon.out > profile.txt

Reading gprof output

# Flat profile (the output below is illustrative)
gprof -p hello gmon.out
# %   cumulative   self              self     total
# time   seconds   seconds    calls   s/call   s/call  name
# 45.2     0.45     0.45  1000000   0.0000   0.0000  compute_inner
# 30.1     0.75     0.30        1   0.3000   0.9000  compute_all
# 24.7     1.00     0.25  1000000   0.0000   0.0000  process_item

# Call graph
gprof -q hello gmon.out

gprof example program

// profile_test.c
#include <stdio.h>
#include <math.h>

__attribute__((noinline))
double compute_inner(int n) {
    double sum = 0;
    for (int i = 0; i < n; i++) {
        sum += sin(i * 0.001) * cos(i * 0.001);
    }
    return sum;
}

__attribute__((noinline))
double compute_outer(int iterations) {
    double total = 0;
    for (int i = 0; i < iterations; i++) {
        total += compute_inner(1000);
    }
    return total;
}

int main(void) {
    printf("Result: %f\n", compute_outer(1000));
    return 0;
}

gcc -pg -O2 -o profile_test profile_test.c -lm
./profile_test
gprof profile_test gmon.out | head -30

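Limitations of gprof

Limitation	Description
Requires recompiling	The program must be built with `-pg`
No shared-library support	Functions in dynamic libraries are not profiled by default
Limited accuracy	Timing comes from timer-based sampling, so short functions may be misreported
Blocked I/O is not timed	Time spent waiting for I/O is not attributed to any function
Inlined functions are invisible	Functions that were inlined do not appear in the results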

15.3 perf

perf is the profiling tool that ships with the Linux kernel source tree; it reads the CPU's hardware performance counters.

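Basic usage

# Install perf (Debian/Ubuntu package names)
sudo apt install linux-tools-common linux-tools-$(uname -r)

# Basic counter statistics
perf stat ./hello

# Detailed profile (record samples with call graphs)
perf record -g ./hello
perf report

# Live view of hot spots
perf top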

perf stat output

perf stat ./profile_test

# Output:
#  Performance counter stats for './profile_test':
#
#          1,234.56 msec  task-clock                #    0.998 CPUs utilized
#                12       context-switches          #    9.722 /sec
#                 2       cpu-migrations            #    1.620 /sec
#               125       page-faults               #  101.25 /sec
#     4,567,890,123       cycles                    #    3.700 GHz
#     2,345,678,901       instructions              #    0.51  insn per cycle
#       456,789,012       branches                  #  370.04 M/sec
#        12,345,678       branch-misses             #    2.70% of all branches
#       123,456,789       cache-references          #  100.00 M/sec
#         1,234,567       cache-misses              #    1.00% of all cache refs
#
#       1.236789 seconds time elapsed

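perf record/report

# Record profile data
# (if call chains look truncated, build with -fno-omit-frame-pointer
#  or record with --call-graph dwarf)
perf record -g -o perf.data ./profile_test

# Interactive report
perf report

# Plain-text report
perf report --stdio

# Show only specific functions
perf report --stdio --sort=dso,symbol | grep compute

# Visualize as a flame graph (assumes the FlameGraph scripts are on PATH)
perf script | stackcollapse-perf.pl | flamegraph.pl > flamegraph.svg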

perf event types

# List available events
perf list

# Common events
perf stat -e cycles,instructions,cache-misses,branch-misses ./hello

# Track cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./hello

# Branch prediction
perf stat -e branch-misses,branches ./hello

# Context switches
perf stat -e context-switches,cpu-migrations ./hello

15.4 Flame Graphs

A flame graph is a visualization of profile data that shows call stacks and how time is distributed across them at a glance.

Generating a flame graph

# Clone the FlameGraph tools
git clone https://github.com/brendangregg/FlameGraph.git

# Generate a flame graph from perf data
perf record -g ./profile_test
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flamegraph.svg

# Open the SVG file
xdg-open flamegraph.svg

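Reading a flame graph

How to read a flame graph:
  ┌───────────────────────┬─────────────────┐
  │  compute_inner (45%)  │  process (30%)  │  ← leaf functions at the top
  ├───────────────────────┴─────────────────┤
  │           compute_outer (75%)           │
  ├─────────────────────────────────────────┤
  │               main (100%)               │  ← root at the bottom; width = share of total time
  └─────────────────────────────────────────┘

  - X axis: not time order; frames are sorted alphabetically so that identical stacks merge
  - Y axis: call-stack depth (the root is at the bottom, leaf functions at the top)
  - Width: share of CPU time during which the function was on the stack
  - Color: usually carries no meaning (random warm hues)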

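Visualizing gprof data

# gprof records caller/callee pairs rather than full call stacks, so its data
# cannot be folded into a true flame graph, and the FlameGraph repository does
# not appear to ship a gprof converter. A call-graph picture can be drawn with
# the third-party gprof2dot tool and Graphviz instead:
gprof ./profile_test gmon.out | gprof2dot | dot -Tsvg -o gprof_callgraph.svg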

15.5 Valgrind Callgrind

# Profile with Callgrind
valgrind --tool=callgrind ./profile_test

# Print the per-function cost summary
callgrind_annotate callgrind.out.<PID>

# Visualize with KCachegrind
kcachegrind callgrind.out.<PID>

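Cachegrind (cache analysis)

# Analyze cache behavior
# (newer Valgrind releases turn the cache simulation off by default,
#  so it is requested explicitly here)
valgrind --tool=cachegrind --cache-sim=yes ./profile_test

# Output: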
# ==12345== D   refs:      123,456,789  (100,000,000 rd + 23,456,789 wr)
# ==12345== D1  misses:         12,345  (     10,000 rd +      2,345 wr)
# ==12345== LLd misses:          1,234  (      1,000 rd +        234 wr)
# ==12345== D1  miss rate:         0.0% (        0.0% +        0.0%)

# Annotate the results per function
cg_annotate cachegrind.out.<PID>

15.6 Simple Timing Measurements

Using clock()

#include <stdio.h>
#include <time.h>

int main(void) {
    // clock() reports CPU time used by the process, not wall-clock time
    clock_t start = clock();

    // Code under measurement (volatile keeps the loop from being optimized away)
    volatile long long sum = 0;
    for (long long i = 0; i < 1000000000LL; i++) {
        sum += i;
    }

    clock_t end = clock();
    double elapsed = (double)(end - start) / CLOCKS_PER_SEC;
    printf("Time: %.3f sec\n", elapsed);
    return 0;
}

Using clock_gettime() (higher resolution)

#include <stdio.h>
#include <time.h>

// Wall-clock time in seconds from a clock that never jumps backwards
static double get_time(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec * 1e-9;
}

int main(void) {
    double start = get_time();

    // Code under measurement
    volatile long long sum = 0;
    for (long long i = 0; i < 1000000000LL; i++) {
        sum += i;
    }

    double end = get_time();
    printf("Time: %.3f sec\n", end - start);
    return 0;
}

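Using RDTSC (x86 timestamp counter)

#include <inttypes.h>
#include <stdint.h>
#include <stdio.h>

// x86/x86-64 only: read the timestamp counter (TSC)
static inline uint64_t rdtsc(void) {
    uint32_t lo, hi;
    asm volatile("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void) {
    uint64_t start = rdtsc();

    // Code under measurement
    volatile long long sum = 0;
    for (long long i = 0; i < 1000000LL; i++) {
        sum += i;
    }

    uint64_t end = rdtsc();
    // On most modern CPUs the TSC ticks at a constant rate, so this counts
    // reference ticks rather than actual core clock cycles.
    // PRIu64 is the portable printf format for uint64_t.
    printf("TSC ticks: %" PRIu64 "\n", end - start);
    return 0;
}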

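15.7 Optimization Reports from GCC

# Show the optimization report
gcc -O2 -fopt-info-optimized -o hello main.c
# Lists which optimizations were applied

# Show the vectorization report
gcc -O3 -fopt-info-vec-optimized -o hello main.c
# Lists which loops were vectorized

# Show why loops were not vectorized
gcc -O3 -fopt-info-vec-missed -o hello main.c

# Show inlining decisions
gcc -O2 -fopt-info-inline-optimized -o hello main.c

# All optimization information (the report goes to stderr)
gcc -O2 -fopt-info-all -o hello main.c 2> opt_report.txt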

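15.8 Performance Analysis Workflow

Profiling best practices:

1. Establish a baseline
   └── Measure first and record the current performance figures

2. Find the bottleneck
   ├── Don't guess! Use a profiler to find the real hot spots
   ├── perf record + perf report
   └── or gprof (flat profile + call graph)

3. Analyze the cause
   ├── CPU-bound? → better algorithms, vectorization
   ├── Cache misses? → data-structure optimization
   ├── Branch mispredictions? → code-layout optimization
   └── Waiting on I/O? → asynchronous I/O, batching

4. Optimize
   └── Change one thing at a time

5. Measure again
   └── Compare before and after to confirm the effect

6. Repeat
   └── Performance optimization is an iterative process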

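Key Points

Topic	Summary
gprof	Needs a `-pg` build and writes `gmon.out`; classic but limited
perf	Uses hardware counters, needs no recompiling, very capable
Flame graphs	Show call stacks and time distribution at a glance
Callgrind	Valgrind tool for precise call graphs and cache analysis
Workflow	Measure → find the bottleneck → analyze → optimize → measure again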

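Notes

Don't guess at bottlenecks: programmers' intuition about where the time goes is frequently wrong. Always profile first.

Optimization level: profile at the same optimization level as production (usually -O2); otherwise the results do not represent real behavior.

gprof accuracy: gprof can be inaccurate for short, frequently called functions; perf is the better choice for those.

Sampling frequency: perf record samples at a fixed default rate (typically 4000 Hz in recent versions), and -F sets the number of samples per second. An odd rate such as -F 997 (a prime) or -F 99 avoids sampling in lockstep with periodic activity in the program.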

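Further Reading

Brendan Gregg's Website: an authoritative resource on performance analysis
Linux perf Tutorial: the official perf tutorial
FlameGraph: the flame graph tools
gprof Documentation: the gprof manual
Optimizing Software in C++: Agner Fog's optimization guide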

Next

→ 16 - GCC Plugin Development: understand the GCC plugin architecture and learn how to write custom compiler plugins.