Julia Data Processing (DataFrames.jl)

DataFrames.jl is the core data-manipulation library in the Julia ecosystem, analogous to pandas in Python or data.frame in R. This article covers creating DataFrames, column and row operations, aggregation, joins, and performance optimization.


1. Creating DataFrames

using DataFrames

# Method 1: construct from columns
df = DataFrame(
    name = ["Alice", "Bob", "Charlie", "Diana"],
    age = [28, 35, 42, 31],
    salary = [55000.0, 72000.0, 88000.0, 63000.0],
    department = ["IT", "Finance", "IT", "HR"]
)

# Method 2: construct from a Dict
d = Dict("x" => [1,2,3], "y" => [4,5,6])
df2 = DataFrame(d)

# Method 3: construct from a matrix, naming the columns
M = rand(5, 3)
df3 = DataFrame(M, [:feature1, :feature2, :feature3])

# Method 4: start empty and add columns incrementally
df4 = DataFrame()
df4.id = 1:10
df4.value = rand(10)

# Inspect basic information
size(df)          # (nrows, ncols)
names(df)         # column names
eltype.(eachcol(df))  # element type of each column
describe(df)      # summary statistics
first(df, 3)      # first 3 rows
last(df, 3)       # last 3 rows
show(df, allcols=true)  # display all columns
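Beyond the four methods above, any Tables.jl-compatible source can be passed to the constructor; a minimal sketch (the row values here are made up):

```julia
using DataFrames

# A vector of NamedTuples is a valid Tables.jl source
rows = [(name = "Alice", age = 28), (name = "Bob", age = 35)]
df = DataFrame(rows)

# copycols=false stores the given vector itself instead of a copy
v = [1, 2, 3]
df2 = DataFrame(x = v, copycols = false)
df2.x === v   # true: same underlying vector
```

`copycols=false` is useful for large columns you know will not be mutated elsewhere.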

2. Column Operations (select / transform / rename)

using DataFrames, Statistics

df = DataFrame(
    id = 1:5,
    first_name = ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    last_name = ["Wang", "Li", "Zhang", "Chen", "Liu"],
    score_math = [90, 85, 78, 95, 88],
    score_eng = [88, 92, 75, 89, 91]
)

# select: choose/reorder columns (does not modify the original DataFrame)
select(df, :id, :first_name)                   # pick specific columns
select(df, Not(:last_name))                    # drop a column
select(df, r"score")                           # match column names by regex
select(df, :id, :first_name => :name)          # select and rename

# transform: add new columns (keeps all original columns)
transform(df, [:score_math, :score_eng] => ByRow(+) => :total_score)
transform(df, :score_math => (x -> x ./ 100 * 150) => :scaled_math)
transform(df, [:first_name, :last_name] => ByRow(*) => :full_name)  # string concatenation

# rename: rename columns
rename(df, :first_name => :fname, :last_name => :lname)
rename(uppercase, df)   # uppercase all column names

# combine: aggregate columns (does not keep all rows)
combine(df, :score_math => mean => :avg_math)
combine(df, [:score_math, :score_eng] .=> [mean std minimum maximum])  # 2×4 matrix of pairs
| Function  | Rows returned | Keeps original columns | Purpose                   |
|-----------|---------------|------------------------|---------------------------|
| select    | same as input | only those specified   | select / reorder / rename |
| transform | same as input | all                    | add new columns           |
| combine   | aggregated    | no                     | aggregation               |
| rename    | same as input | all                    | renaming only             |
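The row-count behavior of these functions can be checked directly; a minimal sketch with made-up columns :a and :b:

```julia
using DataFrames, Statistics

df = DataFrame(a = [1, 2, 3], b = [4, 5, 6])

nrow(select(df, :a))                             # 3 rows, only column :a
nrow(transform(df, :a => (x -> x .* 2) => :a2))  # 3 rows, all columns plus :a2
nrow(combine(df, :a => mean => :avg_a))          # 1 row: aggregation collapses rows
```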

3. Row Operations (filter / subset / sort)

using DataFrames

df = DataFrame(
    id = 1:8,
    name = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
    age = [28, 35, 42, 31, 26, 55, 38, 29],
    city = ["Beijing", "Shanghai", "Beijing", "Shenzhen", "Shanghai", "Beijing", "Shenzhen", "Shanghai"],
    salary = [55000, 72000, 88000, 63000, 48000, 95000, 71000, 59000]
)

# filter: predicate-function syntax
filter(row -> row.age > 30, df)
filter(row -> row.city == "Beijing" && row.salary > 60000, df)

# subset: column-name + ByRow syntax (recommended)
subset(df, :age => ByRow(>(30)))
subset(df, :city => ByRow(==("Beijing")), :salary => ByRow(>(60000)))

# subset can also drop missing values
df_missing = DataFrame(x = [1, missing, 3, missing, 5])
subset(df_missing, :x => ByRow(!ismissing))

# sort
sort(df, :salary)                              # ascending
sort(df, :salary, rev=true)                    # descending
sort(df, [:city, :salary], rev=[false, true])  # multi-column sort

# De-duplication
unique(df, :city)          # unique by city, keeps the first occurrence
unique(df, [:city, :age])  # unique by multiple columns

# Sampling
df[rand(1:nrow(df), 3), :]    # 3 random rows (indices drawn with replacement)
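filter, sort, and unique also have in-place `!` variants that mutate the DataFrame instead of returning a modified copy; a small sketch with made-up scores:

```julia
using DataFrames

df = DataFrame(id = 1:5, score = [70, 95, 60, 88, 77])

filter!(row -> row.score >= 70, df)   # mutates df: drops the row with 60
sort!(df, :score, rev = true)         # sorts df in place, descending
unique!(df, :score)                   # de-duplicates df in place
```

The `!` variants avoid allocating a new table, which matters on large data.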

4. Grouping and Aggregation (groupby / combine)

using DataFrames, Statistics

df = DataFrame(
    department = ["IT","IT","HR","HR","Finance","Finance","IT","HR"],
    employee = ["A","B","C","D","E","F","G","H"],
    salary = [55000, 72000, 48000, 52000, 85000, 78000, 63000, 45000],
    rating = [4.2, 3.8, 4.5, 3.9, 4.0, 4.3, 3.7, 4.1]
)

# Group by department
gdf = groupby(df, :department)

# Aggregate each group
combine(gdf, :salary => mean => :avg_salary)
combine(gdf, :salary => mean => :avg_salary,
              :salary => length => :count,
              :rating => maximum => :best_rating)

# Multiple aggregations at once
combine(gdf,
    :salary => mean => :avg_salary,
    :salary => std => :std_salary,
    nrow => :count
)

# transform on groups (keeps all rows)
transform(gdf, :salary => (x -> x .- mean(x)) => :salary_centered)

# Multi-level grouping
df2 = DataFrame(
    dept = repeat(["IT", "HR"], inner=4),
    level = repeat(["Junior", "Senior"], outer=4),
    salary = rand(8) .* 50000 .+ 30000
)
gdf2 = groupby(df2, [:dept, :level])
combine(gdf2, :salary => mean => :avg)

💡 Tip: groupby returns a GroupedDataFrame, which is a lazy view and does not copy the data. combine, transform, and select can all be applied directly to a GroupedDataFrame.
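A minimal sketch of those view semantics (the tiny table and group key "a" are made up):

```julia
using DataFrames

df = DataFrame(g = ["a", "b", "a"], x = [1, 2, 3])
gdf = groupby(df, :g)

parent(gdf) === df     # true: no data was copied
length(gdf)            # 2 groups
gdf[(g = "a",)]        # look up a single group by its key
```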


5. Joins

using DataFrames

employees = DataFrame(
    id = 1:5,
    name = ["Alice", "Bob", "Charlie", "Diana", "Eve"],
    dept_id = [1, 2, 1, 3, 2]
)

departments = DataFrame(
    dept_id = [1, 2, 4],
    dept_name = ["Engineering", "Marketing", "Sales"]
)

# inner join: rows present in both tables
innerjoin(employees, departments, on=:dept_id)

# left join: keep all rows of the left table
leftjoin(employees, departments, on=:dept_id)

# right join: keep all rows of the right table
rightjoin(employees, departments, on=:dept_id)

# outer join: keep all rows from both tables
outerjoin(employees, departments, on=:dept_id)

# anti join: left-table rows with no match
antijoin(employees, departments, on=:dept_id)

# semi join: left-table rows with a match (right-table columns are not added)
semijoin(employees, departments, on=:dept_id)

# Join on differently named key columns
df_a = DataFrame(emp_id=1:3, name=["A","B","C"])
df_b = DataFrame(id=1:3, city=["X","Y","Z"])
innerjoin(df_a, df_b, on=:emp_id => :id)

# Multi-column join (illustrative: assumes both tables share :col1 and :col2)
# innerjoin(df_a, df_b, on=[:col1, :col2])
| Join type | Function  | Description                                      |
|-----------|-----------|--------------------------------------------------|
| inner     | innerjoin | keeps only matching rows                         |
| left      | leftjoin  | keeps all rows of the left table                 |
| right     | rightjoin | keeps all rows of the right table                |
| outer     | outerjoin | keeps all rows                                   |
| anti      | antijoin  | left-table rows without a match                  |
| semi      | semijoin  | left-table rows with a match (no right columns)  |
| cross     | crossjoin | Cartesian product                                |
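crossjoin appears in the table but not in the code above; it takes no `on` key and pairs every row of one table with every row of the other. A small sketch (the sizes/colors tables are made up):

```julia
using DataFrames

sizes  = DataFrame(size = ["S", "M"])
colors = DataFrame(color = ["red", "blue"])

combos = crossjoin(sizes, colors)   # 2 × 2 = 4 rows: every combination
```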

6. Handling Missing Values

using DataFrames

df = DataFrame(
    name = ["Alice", "Bob", missing, "Diana", "Eve"],
    age = [28, missing, 42, 31, missing],
    salary = [55000.0, 72000.0, missing, 63000.0, 48000.0]
)

# Detect missing values
ismissing.(df.age)
count(ismissing, df.age)

# Drop rows containing missing values
dropmissing(df)                     # drop rows with missing in any column
dropmissing(df, :age)              # consider only the age column
dropmissing(df, [:name, :salary])  # consider only the listed columns

# Replace missing values
coalesce.(df.name, "Unknown")      # substitute a default value
coalesce.(df.age, 0)

# Fill missing values on a copy
df_filled = copy(df)
df_filled.age = coalesce.(df_filled.age, 0)
df_filled.name = coalesce.(df_filled.name, "N/A")

# Missing-value counts
mapcols(x -> count(ismissing, x), df)  # number of missings per column

# Combine with subset to drop missing rows
subset(df, :age => ByRow(!ismissing))

⚠️ Note: in Julia, any arithmetic involving missing produces missing (similar to SQL NULL). Use skipmissing to skip missing values:

mean(skipmissing(df.salary))    # mean ignoring missing values
collect(skipmissing(df.age))    # collect the non-missing values
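A minimal sketch of the propagation and skipmissing behavior described in the note:

```julia
using Statistics

v = [1, 2, missing, 4]

ismissing(1 + missing)   # true: arithmetic with missing propagates
ismissing(mean(v))       # true: one missing poisons the aggregate
mean(skipmissing(v))     # 7/3 ≈ 2.33
count(!ismissing, v)     # 3 non-missing values
```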

7. Wide/Long Reshaping (stack / unstack)

using DataFrames

# Wide → long (melt)
df_wide = DataFrame(
    id = 1:3,
    math = [90, 85, 78],
    english = [88, 92, 75],
    science = [95, 80, 82]
)

df_long = stack(df_wide, [:math, :english, :science];
                variable_name = :subject,
                value_name = :score)
# 3 ids × 3 subjects = 9 rows

# Long → wide (pivot)
df_back = unstack(df_long, :subject, :score)

# unstack with an id column
df_long2 = DataFrame(
    id = repeat(1:3, inner=2),
    year = repeat([2023, 2024], outer=3),
    revenue = [100, 120, 200, 230, 150, 180]
)
unstack(df_long2, :year, :revenue)

💡 Tip: stack corresponds to pandas melt, and unstack to pivot. Both are very common when working with panel data.


8. Pipeline Chains

using DataFrames, Statistics, Chain

df = DataFrame(
    id = 1:100,
    city = rand(["Beijing", "Shanghai", "Shenzhen"], 100),
    age = rand(20:60, 100),
    salary = rand(30000:1000:120000, 100)
)

# Process the data as a pipeline
result = @chain df begin
    subset(:age => ByRow(>(25)))                         # filter
    transform(:salary => (x -> x / 1000) => :salary_k)   # new column
    groupby(:city)                                       # group
    combine(
        :salary => mean => :avg_salary,
        :salary => median => :median_salary,
        nrow => :count
    )
    sort(:avg_salary, rev=true)                          # sort
end

# Equivalent code without the pipeline
result2 = sort(
    combine(
        groupby(
            transform(
                subset(df, :age => ByRow(>(25))),
                :salary => (x -> x / 1000) => :salary_k
            ),
            :city
        ),
        :salary => mean => :avg_salary,
        :salary => median => :median_salary,
        nrow => :count
    ),
    :avg_salary, rev=true
)

⚠️ Note: the @chain macro requires loading Chain.jl:

using Chain
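By default @chain inserts the previous result as the first argument of each call; the `_` placeholder positions it explicitly. A small sketch:

```julia
using Chain, DataFrames

df = DataFrame(x = [3, 1, 2])

result = @chain df begin
    sort(:x)        # becomes sort(df, :x)
    first(_, 2)     # _ marks where the previous result goes
end
```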

9. Reading and Writing CSV

using CSV, DataFrames

# Read a CSV file
df = CSV.read("data.csv", DataFrame)
df = CSV.read("data.csv", DataFrame;
    delim = ',',              # field delimiter
    header = 1,               # row containing the header
    skipto = 3,               # start reading data at row 3
    types = Dict(:age => Int, :salary => Float64),
    missingstring = ["NA", "NULL", ""]  # strings to treat as missing
)

# Write a CSV file
CSV.write("output.csv", df)
CSV.write("output.csv", df;
    delim = '\t',
    writeheader = true
)

# Stream a large file row by row (avoids loading everything into memory)
rows = CSV.Rows("big_data.csv")
for row in rows
    # process each row
    println(row.name)
end

# Read a remote file: download it first (CSV.read does not fetch URLs itself)
using Downloads
df_remote = CSV.read(Downloads.download("https://example.com/data.csv"), DataFrame)

10. Performance Optimization (100k+ rows)

using DataFrames, BenchmarkTools

df = DataFrame(x = rand(100_000))

# Trick 1: use ByRow instead of broadcasting inside the function
# ❌ slower: sin.(x) and cos.(x) each allocate an intermediate array
transform(df, :x => (x -> sin.(x) .+ cos.(x)) => :y)
# ✅ faster: one fused computation per row
transform(df, :x => ByRow(x -> sin(x) + cos(x)) => :y)

# Trick 2: avoid vcat inside a loop
# ❌ slow: re-copies the accumulated result on every iteration
result = DataFrame()
for i in 1:1000
    global result = vcat(result, DataFrame(x=rand(100)))
end
# ✅ fast: a single reduction over all the pieces
result = reduce(vcat, [DataFrame(x=rand(100)) for i in 1:1000])

# Trick 3: store repetitive strings as a CategoricalArray
using CategoricalArrays
df.city = categorical(df.city)    # saves memory; speeds up grouping (assumes df has a :city column)

# Trick 4: pass view=true to avoid copying (returns a SubDataFrame)
subset(df, :age => ByRow(>(30)); view=true)

# Trick 5: preallocate, then assign in place
df = DataFrame(x = zeros(1_000_000), y = zeros(1_000_000))
df.x .= rand(1_000_000)   # mutate the column vector directly
| Data size     | Recommendation                                     |
|---------------|----------------------------------------------------|
| < 10k rows    | use DataFrames.jl directly                         |
| 10k – 1M rows | mind preallocation, ByRow, CategoricalArray        |
| > 1M rows     | consider InMemoryDatasets.jl or chunked processing |
| > memory      | stream with CSV.Rows, or use DuckDB.jl             |
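To verify which variant is actually faster on your data, BenchmarkTools (loaded at the top of this section) is the standard tool; a minimal sketch:

```julia
using DataFrames, BenchmarkTools

df = DataFrame(x = rand(100_000))

# Interpolate globals with $ so the timing measures the transform itself,
# not global-variable lookup
@btime transform($df, :x => ByRow(sin) => :y);
```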

Further Reading