Julia 教程 / Julia 数据处理(DataFrames.jl)
Julia 数据处理(DataFrames.jl)
DataFrames.jl 是 Julia 生态中最核心的数据处理库,类似 Python 的 pandas 或 R 的 data.frame。本文全面介绍 DataFrame 的创建、操作、聚合、连接和性能优化。
1. DataFrame 创建
using DataFrames
# 方法1:从列创建
df = DataFrame(
name = ["Alice", "Bob", "Charlie", "Diana"],
age = [28, 35, 42, 31],
salary = [55000.0, 72000.0, 88000.0, 63000.0],
department = ["IT", "Finance", "IT", "HR"]
)
# 方法2:从字典创建
d = Dict("x" => [1,2,3], "y" => [4,5,6])
df2 = DataFrame(d)
# 方法3:从矩阵创建
M = rand(5, 3)
df3 = DataFrame(M, [:feature1, :feature2, :feature3])
# 方法4:空 DataFrame 逐步添加
df4 = DataFrame()
df4.id = 1:10
df4.value = rand(10)
# 查看基本信息
size(df) # (行数, 列数)
names(df) # 列名
eltype.(eachcol(df)) # 每列类型
describe(df) # 汇总统计
first(df, 3) # 前 3 行
last(df, 3) # 后 3 行
show(df, allcols=true) # 显示所有列
2. 列操作(select / transform / rename)
using DataFrames, Statistics
df = DataFrame(
id = 1:5,
first_name = ["Alice", "Bob", "Charlie", "Diana", "Eve"],
last_name = ["Wang", "Li", "Zhang", "Chen", "Liu"],
score_math = [90, 85, 78, 95, 88],
score_eng = [88, 92, 75, 89, 91]
)
# select: 选择/重排列(不修改原 DataFrame)
select(df, :id, :first_name) # 选择指定列
select(df, Not(:last_name)) # 排除某列
select(df, r"score") # 正则匹配列名
select(df, :id, :first_name => :name) # 选择并重命名
# transform: 添加新列(保留所有原列)
transform(df, [:score_math, :score_eng] => ByRow(+) => :total_score)
transform(df, :score_math => (x -> x ./ 100 * 150) => :scaled_math)
transform(df, [:first_name, :last_name] => ByRow(* ) => :full_name)
# rename: 重命名列
rename(df, :first_name => :fname, :last_name => :lname)
rename(x -> uppercase(x), df) # 所有列名大写
# combine: 聚合列(不保留所有行)
combine(df, :score_math => mean => :avg_math)
combine(df, [:score_math, :score_eng] .=> [mean, std, minimum, maximum])
| 函数 | 返回行数 | 保留原列 | 用途 |
|---|---|---|---|
select | 同原表 | 按指定 | 列选择/重排/重命名 |
transform | 同原表 | 全部保留 | 添加新列 |
combine | 聚合行数 | 不保留 | 聚合计算 |
rename | 同原表 | 全部保留 | 仅重命名 |
3. 行操作(filter / subset / sort)
using DataFrames
df = DataFrame(
id = 1:8,
name = ["Alice", "Bob", "Charlie", "Diana", "Eve", "Frank", "Grace", "Hank"],
age = [28, 35, 42, 31, 26, 55, 38, 29],
city = ["Beijing", "Shanghai", "Beijing", "Shenzhen", "Shanghai", "Beijing", "Shenzhen", "Shanghai"],
salary = [55000, 72000, 88000, 63000, 48000, 95000, 71000, 59000]
)
# filter: 使用函数式语法
filter(row -> row.age > 30, df)
filter(row -> row.city == "Beijing" && row.salary > 60000, df)
# subset: 使用列名和 ByRow 语法(推荐)
subset(df, :age => ByRow(>(30)))
subset(df, :city => ByRow(==("Beijing")), :salary => ByRow(>(60000)))
# subset 去除缺失值
df_missing = DataFrame(x = [1, missing, 3, missing, 5])
subset(df_missing, :x => ByRow(!ismissing))
# sort: 排序
sort(df, :salary) # 升序
sort(df, :salary, rev=true) # 降序
sort(df, [:city, :salary], rev=[false, true]) # 多列排序
# 去重
unique(df, :city) # 按 city 去重,保留第一次出现
unique(df, [:city, :age]) # 按多列去重
# 取样
df[rand(1:nrow(df), 3), :] # 随机 3 行
4. 分组与聚合(groupby / combine)
using DataFrames, Statistics
df = DataFrame(
department = ["IT","IT","HR","HR","Finance","Finance","IT","HR"],
employee = ["A","B","C","D","E","F","G","H"],
salary = [55000, 72000, 48000, 52000, 85000, 78000, 63000, 45000],
rating = [4.2, 3.8, 4.5, 3.9, 4.0, 4.3, 3.7, 4.1]
)
# 分组
gdf = groupby(df, :department)
# 对每个组进行聚合
combine(gdf, :salary => mean => :avg_salary)
combine(gdf, :salary => mean => :avg_salary,
:salary => length => :count,
:rating => maximum => :best_rating)
# 多列聚合
combine(gdf,
:salary => mean => :avg_salary,
:salary => std => :std_salary,
nrow => :count
)
# 分组后 transform(保留所有行)
transform(gdf, :salary => (x -> x .- mean(x)) => :salary_centered)
# 多级分组
df2 = DataFrame(
dept = repeat(["IT", "HR"], inner=4),
level = repeat(["Junior", "Senior"], outer=4),
salary = rand(8) .* 50000 .+ 30000
)
gdf2 = groupby(df2, [:dept, :level])
combine(gdf2, :salary => mean => :avg)
💡 提示: groupby 返回的是 GroupedDataFrame,它是一个惰性视图,不会复制数据。combine、transform、select 都可以直接作用于 GroupedDataFrame。
5. 连接操作(join)
using DataFrames
employees = DataFrame(
id = 1:5,
name = ["Alice", "Bob", "Charlie", "Diana", "Eve"],
dept_id = [1, 2, 1, 3, 2]
)
departments = DataFrame(
dept_id = [1, 2, 4],
dept_name = ["Engineering", "Marketing", "Sales"]
)
# inner join: 两表都有的行
innerjoin(employees, departments, on=:dept_id)
# left join: 保留左表所有行
leftjoin(employees, departments, on=:dept_id)
# right join: 保留右表所有行
rightjoin(employees, departments, on=:dept_id)
# outer join: 保留所有行
outerjoin(employees, departments, on=:dept_id)
# anti join: 左表中没有匹配的行
antijoin(employees, departments, on=:dept_id)
# semi join: 左表中有匹配的行(不合并右表列)
semijoin(employees, departments, on=:dept_id)
# 不同列名的连接
df_a = DataFrame(emp_id=1:3, name=["A","B","C"])
df_b = DataFrame(id=1:3, city=["X","Y","Z"])
innerjoin(df_a, df_b, on=:emp_id => :id)
# 多列连接
innerjoin(df_a, df_b, on=[:col1, :col2])
| join 类型 | 函数 | 说明 |
|---|---|---|
| inner | innerjoin | 只保留匹配行 |
| left | leftjoin | 保留左表所有行 |
| right | rightjoin | 保留右表所有行 |
| outer | outerjoin | 保留所有行 |
| anti | antijoin | 左表中不匹配的行 |
| semi | semijoin | 左表中匹配的行(不添加右表列) |
| cross | crossjoin | 笛卡尔积 |
6. 缺失值处理
using DataFrames
df = DataFrame(
name = ["Alice", "Bob", missing, "Diana", "Eve"],
age = [28, missing, 42, 31, missing],
salary = [55000.0, 72000.0, missing, 63000.0, 48000.0]
)
# 检测缺失值
ismissing.(df.age)
count(ismissing, df.age)
# 删除含缺失值的行
dropmissing(df) # 删除任何列有缺失的行
dropmissing(df, :age) # 只看 age 列
dropmissing(df, [:name, :salary]) # 只看指定列
# 替换缺失值
coalesce.(df.name, "Unknown") # 用默认值替换
coalesce.(df.age, 0)
# 填充缺失值
df_filled = copy(df)
df_filled.age = coalesce.(df_filled.age, 0)
df_filled.name = coalesce.(df_filled.name, "N/A")
# 缺失值统计
mapcols(x -> count(ismissing, x), df) # 每列缺失值数量
# 配合 subset 去除缺失行
subset(df, :age => ByRow(!ismissing))
⚠️ 注意: Julia 中 missing 类型的任何运算结果都是 missing(类似 SQL 的 NULL)。使用 skipmissing 可以跳过缺失值:
mean(skipmissing(df.salary)) # 忽略缺失值计算均值
collect(skipmissing(df.age)) # 收集非缺失值
7. 长宽转换(stack / unstack)
using DataFrames
# 宽数据 → 长数据(melt)
df_wide = DataFrame(
id = 1:3,
math = [90, 85, 78],
english = [88, 92, 75],
science = [95, 80, 82]
)
df_long = stack(df_wide, [:math, :english, :science];
variable_name = :subject,
value_name = :score)
# 3*3=9 行
# 长数据 → 宽数据(pivot)
df_back = unstack(df_long, :subject, :score)
# 带 id 列的 unstack
df_long2 = DataFrame(
id = repeat(1:3, inner=2),
year = repeat([2023, 2024], outer=3),
revenue = [100, 120, 200, 230, 150, 180]
)
unstack(df_long2, :year, :revenue)
💡 提示: stack 类似 pandas 的 melt,unstack 类似 pivot。处理面板数据时非常常用。
8. 管道操作链
using DataFrames, Statistics
df = DataFrame(
id = 1:100,
city = rand(["Beijing", "Shanghai", "Shenzhen"], 100),
age = rand(20:60, 100),
salary = rand(30000:1000:120000, 100)
)
# 使用管道操作链处理数据
result = @chain df begin
subset(:age => ByRow(>(25))) # 过滤
transform(:salary => (x -> x / 1000) => :salary_k) # 新列
groupby(:city) # 分组
combine(
:salary => mean => :avg_salary,
:salary => median => :median_salary,
nrow => :count
)
sort(:avg_salary, rev=true) # 排序
end
# 不用管道的等价写法
result2 = sort(
combine(
groupby(
transform(
subset(df, :age => ByRow(>(25))),
:salary => (x -> x / 1000) => :salary_k
),
:city
),
:salary => mean => :avg_salary,
:salary => median => :median_salary,
nrow => :count
),
:avg_salary, rev=true
)
⚠️ 注意: 使用 @chain 宏需要加载 Chain.jl:
using Chain
9. CSV 读写
using CSV, DataFrames
# 读取 CSV
df = CSV.read("data.csv", DataFrame)
df = CSV.read("data.csv", DataFrame;
delim = ',', # 分隔符
header = 1, # 表头行号
skipto = 3, # 跳到第 3 行开始读
types = Dict(:age => Int, :salary => Float64),
missingstring = ["NA", "NULL", ""], # 缺失值标记
bufsize = 8192 # 缓冲区大小
)
# 写入 CSV
CSV.write("output.csv", df)
CSV.write("output.csv", df;
delim = '\t',
writeheader = true
)
# 流式读取大文件(不全部加载到内存)
rows = CSV.Rows("big_data.csv"; buffer_in_memory=true)
for row in rows
# 逐行处理
println(row.name)
end
# 读取远程文件
df_remote = CSV.read("https://example.com/data.csv", DataFrame)
10. 性能优化(10 万+ 行数据)
using DataFrames, BenchmarkTools
# 技巧1:使用 ByRow 而非广播
# ❌ 慢
transform(df, :x => (x -> sin.(x) .+ cos.(x)) => :y)
# ✅ 快
transform(df, [:x] => ByRow((x) -> sin(x) + cos(x)) => :y)
# 技巧2:避免在循环中 vcat
# ❌ 慢
result = DataFrame()
for i in 1:1000
global result = vcat(result, DataFrame(x=rand(100)))
end
# ✅ 快
result = reduce(vcat, [DataFrame(x=rand(100)) for i in 1:1000])
# 技巧3:使用 CategoricalArray 存储重复字符串
using CategoricalArrays
df.city = categorical(df.city) # 内存和性能优化
# 技巧4:使用 @view 避免复制
subset(df, :age => ByRow(>(30)); view=true)
# 技巧5:预分配并赋值
df = DataFrame(x = zeros(1_000_000), y = zeros(1_000_000))
# 直接修改列向量
df.x .= rand(1_000_000)
| 数据量 | 推荐方案 |
|---|---|
| < 1 万行 | DataFrames.jl 直接使用 |
| 1-100 万行 | 注意预分配、ByRow、CategoricalArray |
| > 100 万行 | 考虑 InMemoryDatasets.jl 或分块处理 |
| > 内存 | CSV.Rows 流式处理或 DuckDB.jl |
扩展阅读
- DataFrames.jl 官方文档
- DataFramesMeta.jl — 更简洁的宏语法
- CSV.jl 文档
- JuliaData 生态概览
- Hadley Wickham, Tidy Data
- Comparison with pandas