Chapter 03: Basic Usage

3.1 Command-Line Basics

The hunspell command-line tool is at the center of day-to-day use. Its basic syntax is:

hunspell [options] [file...]

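3.1.1 The simplest usage

# Check a single file (opens the interactive screen described in 3.2)
hunspell document.txt

# Choose the dictionary
hunspell -d en_US document.txt

# Check several files
hunspell file1.txt file2.txt file3.md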

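3.1.2 Core options at a glance

Option	Description	Example
`-d dict`	Select the dictionary (several may be given, comma-separated)	`-d en_US` or `-d /path/to/en_US`
`-D`	List the available dictionaries and the search path	`hunspell -D`
`-l`	Print only the misspelled words	`hunspell -l file.txt`
`-L`	Print the lines that contain misspelled words	`hunspell -L file.txt`
`-a`	Pipe mode (Ispell-compatible interface)	`echo "test" \| hunspell -a`
`-G`	Print only the correct words or lines	`hunspell -G file.txt`
`-w`	Print the misspelled words (= lines) of one-word-per-line input	`hunspell -w wordlist.txt`
`-1`	Check only the first field of each line (tab-delimited)	`hunspell -1 -l data.tsv`
`-m`	Morphological analysis of the input words	`hunspell -m file.txt`
`-s`	Stem the input words	`hunspell -s file.txt`
`-i enc`	Set the input encoding	`-i UTF-8`
`-p dict`	Set the personal dictionary	`-p ~/.my_dict`
`-t`	TeX/LaTeX input	`hunspell -t paper.tex`
`-H`	HTML input	`hunspell -H page.html`
`-n`	nroff/troff input	`hunspell -n manpage.1`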

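3.2 Interactive Mode

Interactive mode is the most traditional way to use Hunspell: it stops at each misspelled word and offers ways to correct it.

3.2.1 Starting interactive mode

Interactive mode is the default whenever a file is given without an output option such as -a or -l, so no extra switch is needed:

hunspell document.txt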

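3.2.2 Reading a result

The interactive screen shows the misspelled word in its context together with a numbered list of suggestions. The same information in Ispell notation (the form that hunspell -a prints) looks like this:

& errorr 3 0: error, errors, errata

What the fields mean:

Field	Value	Meaning
`&`	Marker	A misspelled word for which suggestions exist
`errorr`	Misspelled word	The word as found in the text
`3`	Suggestion count	Three suggestions follow
`0`	Offset	Position of the word in its line, counted from 0
`: error, errors, errata`	Suggestion list	Comma-separated candidate corrections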

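3.2.3 Interactive commands

The following keys are available at each prompt (as listed on Hunspell's own help screen, shown with ?):

Key	Action
0-n	Replace the word with the suggestion of that number
Space	Accept the word this time only
R	Replace the misspelled word with text you type
A	Accept the word for the rest of the session
I	Accept the word and add it to the personal dictionary
U	Accept the word and add its lowercase form to the personal dictionary
S	Ask for a stem and a model word and store them in the personal dictionary
X	Write the rest of the file, ignoring misspellings, and move on to the next file
Q	Quit immediately after confirmation, leaving the file unchanged
?	Show the help screen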

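3.2.4 Interactive session example

# Create a test file
cat > /tmp/test.txt << 'EOF'
This sentense has a few typose in it.
The quik brown fox jumps over the lazzy dog.
EOF

# Start the interactive check
hunspell /tmp/test.txt

The session, shown schematically in Ispell notation (the real screen numbers the suggestions starting at 0, and the suggestions themselves depend on the dictionary version):

& sentense 2 5: sentence, sentences
  key: 0          ← choose "sentence"

& typose 2 24: typos, types
  key: 0          ← choose "typos"

& quik 7 4: quick, quirk, Quik, quiche, quid, quit, quiz
  key: 0          ← choose "quick"

& lazzy 3 34: lazy, dazedly, laze
  key: 0          ← choose "lazy"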

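3.3 Pipe Mode (-a)

Pipe mode prints machine-readable results, which suits scripts and automation.

3.3.1 Basic usage

# Pipe mode
echo "helo wrld" | hunspell -a -d en_US

Output (the suggestions vary with the dictionary version; a blank line closes the results for each input line):

@(#) International Ispell Version 3.2.06 (but really Hunspell 1.7.2)
& helo 4 0: hello, Helo, helot, help
& wrld 3 5: world, wold, weld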

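3.3.2 Output format in detail

Apart from the version banner, every result line in pipe mode starts with a marker character:

Marker	Format	Meaning
`*`	`*`	The word is in the dictionary
`+`	`+ root`	The word was accepted by stripping affixes from the given root
`-`	`-`	The word was accepted as a compound
`&`	`& word count offset: sug1, sug2, ...`	Misspelled, with suggestions
`#`	`# word offset`	Misspelled, no suggestions
`@(#)`	`@(#) International Ispell Version ...`	Version banner, printed once as the first line
(empty)	blank line	End of the results for one input line

Input lines that begin with characters such as !, *, @ or # are read as Ispell commands rather than text (! for example switches to terse mode, which suppresses the lines for correct words). Prefix a text line with ^ to have it checked as it stands.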
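
The markers can be decoded with a few lines of Python. The sketch below assumes the Ispell pipe conventions (`*`, `+ root` and `-` mean the word was accepted; `&` and `#` mean it was not); the function name `parse_result_line` and the status labels are made up for this illustration and are not part of Hunspell.

```python
import re
from typing import Optional

# Markers that mean "accepted": dictionary hit, affix-stripped root, compound.
ACCEPTED = {"*": "correct", "+": "correct_by_affix", "-": "correct_compound"}


def parse_result_line(line: str) -> Optional[dict]:
    """Turn one line of `hunspell -a` output into a dict (None for banner or blank)."""
    if not line or line.startswith("@(#)"):
        return None
    marker = line[0]
    if marker in ACCEPTED:
        # "+ ROOT" carries the root after the marker; "*" and "-" carry nothing.
        return {"status": ACCEPTED[marker], "root": line[2:] or None}
    match = re.match(r"& (\S+) (\d+) (\d+): (.*)", line)
    if match:
        return {
            "status": "misspelled",
            "word": match.group(1),
            "offset": int(match.group(3)),
            "suggestions": match.group(4).split(", "),
        }
    match = re.match(r"# (\S+) (\d+)", line)
    if match:
        return {
            "status": "misspelled",
            "word": match.group(1),
            "offset": int(match.group(2)),
            "suggestions": [],
        }
    return None  # unknown line type


print(parse_result_line("& helo 4 0: hello, Helo, helot, help"))
```

Returning None for the banner and for blank lines lets a caller filter them out with a simple `if result:` test.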

3.3.3 Extracting the list of misspelled words

# Misspelled words only, sorted and de-duplicated
hunspell -d en_US -l document.txt | sort -u

# Count the misspellings
hunspell -d en_US -l document.txt | wc -l

# Correct words only
hunspell -d en_US -G document.txt | sort -u

3.3.4 Extracting the first suggestion

# Parse the -a output with awk and print each word with its first suggestion
echo "helo wrld" | hunspell -a -d en_US | awk '
/^&/ {
    split($0, parts, ":");
    split(parts[2], sugs, ",");
    gsub(/^ /, "", sugs[1]);  # keep the first suggestion
    print $2 " → " sugs[1];
}
'
# Output:
# helo → hello
# wrld → world
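
To go one step further and actually substitute the first suggestion into the text, the offset field of each `&` line can be used. This is a minimal sketch, assuming the line was sent to hunspell -a without a `^` prefix and that offsets count characters from 0; `apply_first_suggestions` is a hypothetical helper, and blindly taking the first suggestion is rarely safe outside a demonstration.

```python
import re


def apply_first_suggestions(line: str, pipe_output: str) -> str:
    """Replace each misspelled word in `line` with its first suggestion."""
    fixes = []
    for result in pipe_output.splitlines():
        # Capture the word, its offset, and everything up to the first comma.
        match = re.match(r"& (\S+) \d+ (\d+): ([^,]+)", result)
        if match:
            fixes.append((int(match.group(2)), match.group(1), match.group(3)))
    # Work from right to left so earlier offsets stay valid after each change.
    for offset, word, first in sorted(fixes, reverse=True):
        if line.startswith(word, offset):  # skip anything that does not line up
            line = line[:offset] + first + line[offset + len(word):]
    return line


sample_output = (
    "@(#) International Ispell Version 3.2.06 (but really Hunspell 1.7.2)\n"
    "& helo 4 0: hello, Helo, helot, help\n"
    "& wrld 3 5: world, wold, weld\n"
    "\n"
)
print(apply_first_suggestions("helo wrld", sample_output))  # hello world
```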

3.4 Batch Checking

3.4.1 Checking several files

# Check every .txt file under a directory
find /docs -name "*.txt" -exec hunspell -l -d en_US {} \;

# Check every Markdown file
find . -name "*.md" -exec hunspell -l -d en_US {} \; | sort -u

# Check in parallel (xargs runs up to 4 hunspell processes at once)
find . -name "*.md" -print0 | xargs -0 -P 4 -n 1 hunspell -l -d en_US

3.4.2 Recursive check with a report

#!/bin/bash
# spellcheck_dir.sh - recursively check every text file under a directory
# Usage: ./spellcheck_dir.sh <directory> [dictionary]

DIR="${1:-.}"
DICT="${2:-en_US}"
TOTAL_ERRORS=0

echo "=== Hunspell batch spell check ==="
echo "Directory:  $DIR"
echo "Dictionary: $DICT"
echo "─────────────────────────────"

while IFS= read -r -d '' file; do
    errors=$(hunspell -l -d "$DICT" "$file" 2>/dev/null | wc -l)
    if [ "$errors" -gt 0 ]; then
        echo "[$errors errors] $file"
        hunspell -l -d "$DICT" "$file" 2>/dev/null | sort -u | sed 's/^/  → /'
        TOTAL_ERRORS=$((TOTAL_ERRORS + errors))
    fi
done < <(find "$DIR" -type f \( -name "*.txt" -o -name "*.md" -o -name "*.html" \) -print0)

echo "─────────────────────────────"
echo "Total: $TOTAL_ERRORS misspellings"

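3.4.3 Generating a spelling report

#!/bin/bash
# spell_report.sh - generate a spell-check report
# Output format: file:line: misspelled word

REPORT_FILE="spell_report.txt"
DICT="en_US"

: > "$REPORT_FILE"  # empty the report file

while IFS= read -r -d '' file; do
    # hunspell has no option that prints line numbers, so feed it one line
    # at a time and count the lines here (slow, but simple)
    lineno=0
    while IFS= read -r line || [ -n "$line" ]; do
        lineno=$((lineno + 1))
        printf '%s\n' "$line" | hunspell -l -d "$DICT" 2>/dev/null | \
        while IFS= read -r word; do
            echo "$file:$lineno: $word" >> "$REPORT_FILE"
        done
    done < "$file"
done < <(find . -type f -name "*.md" -print0)

# Summary
echo "Spell-check report - $(date)"
echo "=========================="
cat "$REPORT_FILE"
echo ""
echo "$(wc -l < "$REPORT_FILE") suspected misspellings in total"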
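
The output of hunspell -l carries no line numbers, so a report of the form file:line: word has to recover them separately. One possible approach, sketched below, is to run hunspell -l once per file and then look the reported words up in the text. `locate_words` is a hypothetical helper, and its tokenizer (runs of letters, optionally joined by apostrophes) only approximates Hunspell's own word splitting.

```python
import re


def locate_words(text: str, misspelled: set) -> list:
    """Return (line_number, word) pairs for every occurrence of a misspelled word."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Rough tokenizer: letters, with apostrophes allowed inside a word.
        for token in re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", line):
            if token in misspelled:
                hits.append((lineno, token))
    return hits


sample = "This sentense has a few typose in it.\nThe quik brown fox.\n"
for lineno, word in locate_words(sample, {"sentense", "typose", "quik"}):
    print(f"sample.txt:{lineno}: {word}")
```

The set passed as `misspelled` would normally be the de-duplicated output of `hunspell -l` for the same file.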

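3.5 Custom Dictionaries

3.5.1 Personal dictionary

A personal dictionary stores words that Hunspell does not know but that you have confirmed are correct:

# Select the personal dictionary with -p
hunspell -p ~/.hunspell_personal -d en_US document.txt

Personal dictionary format: a plain word list with one entry per line. An entry may be written as word/model, in which case the word is also accepted with the affixes of the model word, and a leading * marks a word as forbidden. Example contents of ~/.hunspell_personal:

Hunspell
Golang
API/cat
*teh

Here Hunspell and Golang are accepted as written, API is also accepted with the affixes that the main dictionary allows for cat (for example APIs), and teh is always reported as a misspelling.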

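3.5.2 Adding words in interactive mode

# Start interactive mode
hunspell -p ~/.my_dict document.txt

# When hunspell stops at a word, press one of:
# I - add the word to the personal dictionary
# U - add its lowercase form
# S - add a stem together with a model word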

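3.5.3 Adding words from the command line

# Create the dictionary file or append to it
echo "Hunspell" >> ~/.hunspell_personal
echo "API" >> ~/.hunspell_personal

# Run without any personal words by pointing -p at an empty file
echo "testword" | hunspell -a -d en_US -p /dev/null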
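
Appending with `>>` adds a word again every time the command runs. A small helper can merge new words into the file without duplicating existing entries. This is only a sketch: `add_words` is a hypothetical name, the file is assumed to be UTF-8 text with one entry per line, and entries are compared as exact strings.

```python
from pathlib import Path


def add_words(dict_path: str, new_words: list) -> list:
    """Merge words into a personal dictionary file; return the words actually added."""
    path = Path(dict_path)
    existing = path.read_text(encoding="utf-8").splitlines() if path.exists() else []
    known = set(existing)
    added = []
    for word in new_words:
        word = word.strip()
        if word and word not in known:
            known.add(word)
            added.append(word)
    if added:
        # Rewrite the file with the old entries first, then the new ones.
        path.write_text("\n".join(existing + added) + "\n", encoding="utf-8")
    return added
```

For example, `add_words("my_words.dic", ["Hunspell", "API"])` creates the file on first use and returns an empty list when called again with the same words (the file name is illustrative).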

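3.5.4 Managing several dictionaries

# Keep one dictionary per project
PROJECT_DICT=".hunspell_project"

# Check with the system dictionary plus the project dictionary
hunspell -d en_US -p "$PROJECT_DICT" -l README.md

# Commit the project dictionary if the team should share it;
# add it to .gitignore only if it is meant to stay private
echo ".hunspell_project" >> .gitignore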

3.6 Checking Marked-Up Text

Hunspell understands several markup languages, so tags and commands are not mistaken for misspellings.

3.6.1 HTML mode

# HTML mode: skip the tags and check only the text content
hunspell -H page.html

# Sample HTML
cat > /tmp/test.html << 'EOF'
<html>
<body>
  <h1>Welcom to My Site</h1>
  <p>This is a <strong>tesst</strong> page.</p>
  <p class="header">Anothr paragraph.</p>
</body>
</html>
EOF

hunspell -H -l -d en_US /tmp/test.html
# Welcom
# tesst
# Anothr

3.6.2 LaTeX mode

# TeX/LaTeX mode: skip the commands and check only the text
hunspell -t paper.tex

# Sample LaTeX
cat > /tmp/test.tex << 'EOF'
\documentclass{article}
\begin{document}
\title{An Introdution to Hunspell}
\maketitle

This documnet explains the basic usge of Hunspell.
We use \texttt{hunspell} for spel checking.

\end{document}
EOF

hunspell -t -l -d en_US /tmp/test.tex
# Introdution
# documnet
# usge
# spel
# ("Hunspell" and "hunspell" are likely to be reported as well unless they
#  are in your personal dictionary)

3.6.3 nroff/troff mode

# man page source
hunspell -n manpage.1

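3.6.4 Mode comparison

Mode	Option	Ignored	Typical files
Plain text	(default)	Nothing	`.txt` files
HTML	`-H`	HTML tags and entities	`.html`, `.htm`
LaTeX	`-t`	LaTeX commands and environments	`.tex`, `.sty`
nroff	`-n`	troff requests and macros	man pages
XML	`-X`	XML tags	`.xml`
OpenDocument	`-O`	ODF markup	`.odt`, `.fodt`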
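
In a script that handles mixed file types, the input-format option can be chosen from the file extension. The sketch below covers the -H, -t and -n rows of the table above; the mapping, `format_flag` and `build_command` are illustrative names rather than anything Hunspell provides, and treating a one-digit extension as a man page is an assumption of this example.

```python
from pathlib import Path
from typing import Optional

# Extension -> input-format option (an assumption based on the table above).
FORMAT_FLAGS = {".html": "-H", ".htm": "-H", ".tex": "-t", ".sty": "-t"}


def format_flag(filename: str) -> Optional[str]:
    """Pick the hunspell input-format option for a file, or None for plain text."""
    suffix = Path(filename).suffix.lower()
    if suffix in FORMAT_FLAGS:
        return FORMAT_FLAGS[suffix]
    # man page sources end in a section number such as .1 or .8
    if len(suffix) == 2 and suffix[1].isdigit():
        return "-n"
    return None


def build_command(filename: str, dictionary: str = "en_US") -> list:
    """Build the argument list for `hunspell -l` on one file."""
    command = ["hunspell", "-l", "-d", dictionary]
    flag = format_flag(filename)
    if flag:
        command.append(flag)
    return command + [filename]


print(build_command("paper.tex"))
```

The resulting list can be passed directly to `subprocess.run`.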

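3.7 Stemming and Morphology

3.7.1 Stem mode

# -s prints each input word followed by its stem
echo "jumped" | hunspell -s -d en_US
# jumped jump

# A form that the dictionary lists as an entry of its own (often the case for
# irregular plurals such as "wolves") comes back unchanged
echo "wolves" | hunspell -s -d en_US

# Several words (typical output; results depend on the dictionary)
echo "cats dogs jumped" | hunspell -s -d en_US
# cats cat
# dogs dog
# jumped jump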
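
For NLP preprocessing it is handy to collect the stems in a dictionary. The sketch below assumes that every non-empty output line starts with the input word and ends with a stem, and that a line holding a single token means no stem was found. That layout is an assumption about the -s output, so check it against your Hunspell version; `parse_stems` is a hypothetical helper.

```python
from collections import defaultdict


def parse_stems(output: str) -> dict:
    """Map each word in `hunspell -s` output to the list of stems reported for it."""
    stems = defaultdict(list)
    for line in output.splitlines():
        parts = line.split()
        if not parts:
            continue  # blank separator line
        word, stem = parts[0], parts[-1]  # a lone token is its own stem
        if stem not in stems[word]:
            stems[word].append(stem)
    return dict(stems)


print(parse_stems("cats cat\n\ndogs dog\n\njumped jump\n"))
```

A word can map to several stems when the dictionary offers more than one analysis, which is why the values are lists.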

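3.7.2 Morphological analysis mode

# -m prints the morphological analysis of each word
echo "jumped" | hunspell -m -d en_US
# jumped  st:jump fl:D

# Reading the result (typical for a dictionary without morphological data):
# st:jump → the stem is "jump"
# fl:D    → the form was produced by the affix rule with flag D

# Fields such as po: (part of speech) appear only when the dictionary itself
# carries morphological data, so the exact output depends on the dictionary.
# -m is the only analysis option; hunspell has no separate switch for a finer
# decomposition.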

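3.7.3 Morphological field reference

The field names below follow the hunspell(5) manual page:

Field	Meaning
`st`	Stem
`po`	Part of speech
`ds`	Derivational suffix
`is`	Inflectional suffix
`ts`	Terminal suffix
`sp`	Surface prefix
`pa`	Parts of a compound word
`al`	Allomorph
`ph`	Alternative (phonetic) form, used for suggestions
`fl`	Affix flag, reported by -m when the dictionary has no morphological data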
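
An analysis line is a word followed by space-separated field:value pairs, so it can be split into a small structure for further processing. This is a minimal sketch; `parse_analysis` is a hypothetical helper, and it assumes that values contain no spaces.

```python
def parse_analysis(line: str) -> dict:
    """Split one `hunspell -m` line into the word and its field:value pairs."""
    parts = line.split()
    if not parts:
        return {"word": "", "fields": {}}
    fields = {}
    for item in parts[1:]:
        key, sep, value = item.partition(":")
        if sep:  # ignore tokens that are not field:value pairs
            fields.setdefault(key, []).append(value)
    return {"word": parts[0], "fields": fields}


print(parse_analysis("running st:run po:verb ts:present_participle"))
```

Each field maps to a list because the same field can occur more than once in a single analysis.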

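3.8 Encoding

3.8.1 Default encoding behavior

# A dictionary's own encoding is declared by the SET directive in its .aff file
# (most modern dictionaries use UTF-8). Input is read in the locale's encoding
# unless -i overrides it.

# Set the input encoding explicitly
hunspell -i UTF-8 -d en_US file.txt
hunspell -i ISO-8859-1 -d de_DE file.txt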

3.8.2 Encoding conversion example

# When the file's encoding does not match what hunspell expects
file -i document.txt
# document.txt: text/plain; charset=iso-8859-1

# Convert first, then check
iconv -f ISO-8859-1 -t UTF-8 document.txt | hunspell -d en_US -l

# Or declare the input encoding
hunspell -i ISO-8859-1 -d en_US document.txt

3.8.3 Common encoding problems

# Problem: a UTF-8 BOM gets in the way of checking the first line
# Fix: strip the BOM (GNU sed syntax)
sed -i '1s/^\xEF\xBB\xBF//' file.txt

# Problem: files in mixed encodings
# Fix: convert everything to UTF-8 (relies on GNU grep -P and GNU iconv -o)
find . -name "*.txt" -exec sh -c '
    charset=$(file -bi "$1" | grep -oP "charset=\K[^ ]+")
    if [ "$charset" != "utf-8" ] && [ "$charset" != "us-ascii" ]; then
        iconv -f "$charset" -t UTF-8 "$1" -o "$1.utf8" && mv "$1.utf8" "$1" &&
            echo "converted: $1 ($charset → UTF-8)"
    fi
' _ {} \;
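
Both fixes can also be done in memory before the text is handed to hunspell, which avoids rewriting files in place. A minimal sketch, assuming the source encoding is already known (for example from `file -bi`); `to_utf8` is a hypothetical helper name.

```python
import codecs


def to_utf8(raw: bytes, source_encoding: str = "utf-8") -> bytes:
    """Strip a UTF-8 BOM if present and return the content encoded as UTF-8."""
    if raw.startswith(codecs.BOM_UTF8):
        # A UTF-8 BOM settles the encoding question, so the argument is ignored.
        raw, source_encoding = raw[len(codecs.BOM_UTF8):], "utf-8"
    return raw.decode(source_encoding).encode("utf-8")


print(to_utf8(b"\xef\xbb\xbfhello"))                           # b'hello'
print(to_utf8("Grüße".encode("iso-8859-1"), "iso-8859-1"))     # UTF-8 bytes
```

The result can be passed to `subprocess.run([...], input=...)` together with `-i UTF-8`.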

3.9 Output Formatting and Script Integration

3.9.1 Parsing pipe-mode output

#!/usr/bin/env python3
"""Parse the output of hunspell -a into structured results."""
import re
import subprocess


def spell_check(text: str, dictionary: str = "en_US") -> list[dict]:
    """
    Spell-check the text and return structured results.

    Return format:
    [
        {"word": "helo", "status": "misspelled", "suggestions": ["hello", ...]},
        {"word": "hello", "status": "correct", "suggestions": []},
        ...
    ]
    """
    words = text.split()
    # Send one word per line. hunspell -a prints only "*" for a correct word,
    # so the word itself has to come from our own list. The leading "^" keeps
    # hunspell from reading a line as a pipe command.
    payload = "".join(f"^{word}\n" for word in words)
    result = subprocess.run(
        ["hunspell", "-a", "-d", dictionary],
        input=payload, capture_output=True, text=True
    )

    entries = []
    index, status, suggestions = 0, "correct", []
    for line in result.stdout.splitlines()[1:]:  # [1:] skips the version banner
        if line.startswith("&"):
            # Misspelled, with suggestions
            status = "misspelled"
            match = re.match(r"& \S+ \d+ \d+: (.+)", line)
            if match:
                suggestions += [s.strip() for s in match.group(1).split(",")]
        elif line.startswith("#"):
            # Misspelled, no suggestions
            status = "misspelled"
        elif line == "" and index < len(words):
            # A blank line closes the results for one input line (one word here)
            entries.append({"word": words[index], "status": status, "suggestions": suggestions})
            index, status, suggestions = index + 1, "correct", []

    return entries


# Usage
results = spell_check("This sentense has a typoo")
for r in results:
    if r["status"] == "misspelled":
        print(f"  ✗ '{r['word']}' → suggestions: {', '.join(r['suggestions'][:5])}")
    else:
        print(f"  ✓ '{r['word']}'")

Output (the suggestions depend on the dictionary version):

  ✓ 'This'
  ✗ 'sentense' → suggestions: sentence, sentences
  ✓ 'has'
  ✓ 'a'
  ✗ 'typoo' → suggestions: typo, types, type

3.9.2 Wrapping the output as JSON

#!/bin/bash
# spellcheck_json.sh - convert hunspell output to JSON
# Usage: ./spellcheck_json.sh <file> [dictionary]

FILE="$1"
DICT="${2:-en_US}"

echo "["
first=true

while IFS= read -r word; do
    if [ -n "$word" ]; then
        if [ "$first" = true ]; then
            first=false
        else
            echo ","
        fi
        # Fetch up to five suggestions for the word
        suggestions=$(echo "$word" | hunspell -a -d "$DICT" 2>/dev/null | \
            grep "^&" | sed 's/.*: //' | tr ',' '\n' | \
            sed 's/^ *//' | head -5 | \
            awk '{printf "\"%s\",", $0}' | sed 's/,$//')
        printf '  {"word": "%s", "suggestions": [%s]}' "$word" "$suggestions"
    fi
done < <(hunspell -l -d "$DICT" "$FILE" | sort -u)

echo ""
echo "]"
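
Assembling JSON with printf does not escape quotes or backslashes inside a word. When the results are already available in Python, the standard json module takes care of the escaping. A minimal sketch, assuming a mapping from each misspelled word to its suggestion list; `to_json` is a hypothetical helper.

```python
import json


def to_json(results: dict, limit: int = 5) -> str:
    """Serialize {word: [suggestions]} as a JSON array, keeping `limit` suggestions."""
    payload = [
        {"word": word, "suggestions": suggestions[:limit]}
        for word, suggestions in sorted(results.items())
    ]
    return json.dumps(payload, ensure_ascii=False, indent=2)


print(to_json({"helo": ["hello", "Helo", "helot", "help"], "wrld": ["world"]}))
```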

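3.10 Practical Tips

3.10.1 Case handling

# No option is needed: a lowercase dictionary entry is also accepted
# with an initial capital and in all caps
echo "hello Hello HELLO" | hunspell -d en_US -l
# no output → all three are accepted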

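3.10.2 Limiting the number of suggestions

# hunspell has no option that caps the suggestion list (-L prints the lines
# that contain misspellings), so trim the -a output instead
echo "helo" | hunspell -a -d en_US | grep '^&' | cut -d, -f1-3
# & helo 4 0: hello, Helo, helot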

3.10.3 Filtering with grep

# Show only the misspellings that match a pattern
hunspell -l document.txt | grep -E "^[A-Z]"   # capitalized words only
hunspell -l document.txt | grep -E ".{20,}"   # very long words only

3.10.4 Combining with diff

# Compare the misspellings before and after an edit
hunspell -l file.txt > /tmp/errors_before.txt
# edit the file...
hunspell -l file.txt > /tmp/errors_after.txt
diff /tmp/errors_before.txt /tmp/errors_after.txt
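
diff compares the two lists line by line, so moving a paragraph shows up as a change even when no word was fixed. Comparing the lists as sets answers the more useful question of which misspellings were fixed and which are new. A minimal sketch; `compare_runs` is a hypothetical helper that takes the two `hunspell -l` outputs as lists of words.

```python
def compare_runs(before: list, after: list) -> dict:
    """Classify misspellings as fixed, new, or remaining between two runs."""
    before_set, after_set = set(before), set(after)
    return {
        "fixed": sorted(before_set - after_set),
        "new": sorted(after_set - before_set),
        "remaining": sorted(before_set & after_set),
    }


print(compare_runs(["helo", "wrld", "teh"], ["wrld", "recieve"]))
```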

3.10.5 Spell-check statistics

#!/bin/bash
# spell_stats.sh - spell-check statistics
FILE="$1"
DICT="${2:-en_US}"

# wc -w splits on whitespace, so the word count is approximate
TOTAL_WORDS=$(wc -w < "$FILE")
if [ "$TOTAL_WORDS" -eq 0 ]; then
    echo "No words found in $FILE"
    exit 0
fi
MISSPELLED=$(hunspell -l -d "$DICT" "$FILE" | wc -l)
UNIQUE_ERRORS=$(hunspell -l -d "$DICT" "$FILE" | sort -u | wc -l)
CORRECT=$((TOTAL_WORDS - MISSPELLED))

echo "=== Spell-check statistics ==="
echo "File:               $FILE"
echo "Total words:        $TOTAL_WORDS"
echo "Misspellings:       $MISSPELLED"
echo "Unique misspelled:  $UNIQUE_ERRORS"
echo "Correct:            $(echo "scale=2; $CORRECT * 100 / $TOTAL_WORDS" | bc)%"
echo ""
echo "Most frequent misspellings:"
hunspell -l -d "$DICT" "$FILE" | sort | uniq -c | sort -rn | head -10

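3.11 Chapter Summary

Mode	Command	Use case
Interactive mode	`hunspell file`	Correcting a single file by hand
Pipe mode	`hunspell -a` / `hunspell -l`	Script automation
Stemming	`hunspell -s`	NLP preprocessing
Morphological analysis	`hunspell -m`	Linguistic research
HTML mode	`hunspell -H`	Checking web content
LaTeX mode	`hunspell -t`	Checking academic papers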
