The Complete Hunspell Spell-Checking Tutorial / Chapter 09: Morphological Analysis

Chapter 09: Morphological Analysis

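9.1 Morphology basics

Morphology is the branch of linguistics that studies the internal structure of words and the rules by which they are formed. Hunspell has built-in morphological analysis and can do the following:

Feature	Description	How to invoke
Stemming	Find the stem of a word form	`-s`
Morphological analysis	Describe the morphological structure of a word form	`-m`
Word-form generation	Produce a specific form from a stem	library `generate()` call, given a sample word or a field description
Compound analysis	Split a compound into its parts	compound rules in the `.aff` file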

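9.2 Stemming

9.2.1 Basic stemming

# -s mode: print the stem of each input word
# (the arrows below are shorthand: hunspell prints the word and its stem
#  separated by a space, and the stems returned depend on the dictionary)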
echo "running" | hunspell -s -d en_US
# running → run

echo "wolves" | hunspell -s -d en_US
# wolves → wolf

echo "unhappiness" | hunspell -s -d en_US
# unhappiness → happy

9.2.2 Stemming several words

# Several words on one input line are tokenized and stemmed one by one
echo "cats dogs running jumped wolves" | hunspell -s -d en_US
# cats → cat
# dogs → dog
# running → run
# jumped → jump
# wolves → wolf

9.2.3 Stemming in NLP

#!/usr/bin/env python3
"""Using Hunspell stems in text processing."""
import subprocess
from collections import Counter

def extract_stems(text: str, dictionary: str = "en_US") -> dict[str, list[str]]:
    """
    Stem every word in `text` and return {stem: [original words]}.
    """
    result = subprocess.run(
        ["hunspell", "-s", "-d", dictionary],
        input=text, capture_output=True, text=True
    )

    stem_map: dict[str, list[str]] = {}
    for line in result.stdout.splitlines():
        # Each result line is "word stem"; lines with fewer than two
        # fields carry no stem and are skipped.
        parts = line.split()
        if len(parts) < 2:
            continue
        word, stem = parts[0], parts[-1]
        stem_map.setdefault(stem, []).append(word)

    return stem_map

def word_frequency_by_stem(text: str, dictionary: str = "en_US") -> Counter:
    """Count word occurrences grouped by stem."""
    stem_map = extract_stems(text, dictionary)
    freq = Counter()
    for stem, words in stem_map.items():
        freq[stem] = len(words)
    return freq

# Usage
text = """
The cats are running and the dogs are running too.
The wolf jumped over the lazy dog while wolves howled.
A beautiful day makes beautiful memories.
"""

stem_map = extract_stems(text)
print("=== Stem map ===")
for stem, words in sorted(stem_map.items()):
    print(f"  {stem}: {set(words)}")

freq = word_frequency_by_stem(text)
print("\n=== Stem frequency ===")
for stem, count in freq.most_common(10):
    print(f"  {stem}: {count}")

Illustrative output (abridged). The exact stems depend on the dictionary: this listing assumes one that maps irregular forms such as "are" and "wolves" to their stems, which the stock en_US dictionary may not do.

=== Stem map ===
  be: {'are'}
  beautiful: {'beautiful'}
  cat: {'cats'}
  day: {'day'}
  dog: {'dog', 'dogs'}
  howl: {'howled'}
  jump: {'jumped'}
  lazy: {'lazy'}
  make: {'makes'}
  memory: {'memories'}
  over: {'over'}
  run: {'running'}
  the: {'The', 'the'}
  while: {'while'}
  wolf: {'wolf', 'wolves'}

=== Stem frequency ===
  the: 4
  be: 2
  run: 2
  dog: 2
  wolf: 2
  ...

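9.3 Morphological analysis

9.3.1 The -m mode in detail

# -m prints the morphological analysis of each word.
# The fields shown below are illustrative: po:, ts: and nu: appear only if the
# dictionary is annotated with them; otherwise expect little more than st:.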
echo "running" | hunspell -m -d en_US
# running st:run po:verb ts:present_participle

echo "wolves" | hunspell -m -d en_US
# wolves st:wolf po:noun nu:plural

echo "unhappiness" | hunspell -m -d en_US
# unhappiness st:happy po:noun

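9.3.2 Morphological field tags

Each field is a two-letter ID, a colon, and a free-form value chosen by the dictionary author. Of the tags below, `st`, `po`, `ts` and `is` are among the field IDs listed in hunspell(5). The others (`ps`, `nu`, `ca`, `ge`, `mo`, `dv`) are not, and should be read as conventions of a particular dictionary; hunspell(5) itself lists `ds` and `dp` for derivational suffixes and prefixes.

Tag	Stands for	Meaning	Example
`st`	stem	Stem	`st:run`
`po`	part of speech	Part of speech	`po:verb`, `po:noun`, `po:adj`
`ts`	terminal suffix	Terminal suffix; this chapter's examples use it for tense and degree	`ts:past`, `ts:present_participle`, `ts:past_participle`
`is`	inflectional suffix	Inflection	`is:plural`, `is:past`
`ps`	person	Person (dictionary-specific)	`ps:first`, `ps:second`, `ps:third`
`nu`	number	Number (dictionary-specific)	`nu:plural`, `nu:singular`
`ca`	case	Case (dictionary-specific)	`ca:nominative`, `ca:accusative`, `ca:genitive`
`ge`	gender	Gender (dictionary-specific)	`ge:masculine`, `ge:feminine`, `ge:neuter`
`mo`	mood	Mood (dictionary-specific)	`mo:indicative`, `mo:subjunctive`, `mo:imperative`
`dv`	derivation	Derivation (dictionary-specific)	`dv:un-`, `dv:-ness`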
9.3.3 Examples in several languages

# The outputs below are illustrative; real output depends on how each dictionary is annotated.

# English
echo "unhappiest" | hunspell -m -d en_US
# unhappiest st:happy po:adj ts:superlative

# German
echo "Häuser" | hunspell -m -d de_DE
# Häuser st:Haus po:noun nu:plural ca:nominative

# French
echo "parlons" | hunspell -m -d fr
# parlons st:parler po:verb ts:present ps:first nu:plural

# Spanish
echo "hablamos" | hunspell -m -d es_ES
# hablamos st:hablar po:verb ts:present ps:first nu:plural

# Russian
echo "книгами" | hunspell -m -d ru_RU
# книгами st:книга po:noun ca:instrumental nu:plural ge:feminine

9.3.4 A wrapper for morphological analysis

#!/usr/bin/env python3
"""A small wrapper around Hunspell morphological analysis."""
import subprocess
import re
from dataclasses import dataclass

@dataclass
class MorphAnalysis:
    word: str
    stem: str
    pos: str        # part of speech
    tense: str = ""
    person: str = ""
    number: str = ""
    case: str = ""
    gender: str = ""
    mood: str = ""
    derivation: str = ""
    raw: str = ""

# Map field IDs to MorphAnalysis attributes
FIELD_MAP = {
    "st": "stem", "po": "pos", "ts": "tense", "ps": "person", "nu": "number",
    "ca": "case", "ge": "gender", "mo": "mood", "dv": "derivation",
}

def parse_analysis(line: str) -> "MorphAnalysis | None":
    """Parse one line of `hunspell -m` output: a word followed by id:value fields."""
    parts = line.split()
    if not parts:
        return None
    analysis = MorphAnalysis(word=parts[0], stem="", pos="", raw=line.strip())
    for field in parts[1:]:
        tag, sep, value = field.partition(":")
        if sep and tag in FIELD_MAP:
            setattr(analysis, FIELD_MAP[tag], value)
    return analysis

def analyze_batch(words: list[str], dictionary: str = "en_US") -> list[MorphAnalysis]:
    """Analyze many words with a single hunspell process.

    A word with several analyses yields several entries.
    """
    result = subprocess.run(
        ["hunspell", "-m", "-d", dictionary],
        input="\n".join(words) + "\n", capture_output=True, text=True
    )

    analyses = []
    for line in result.stdout.splitlines():
        parsed = parse_analysis(line)
        if parsed is not None:
            analyses.append(parsed)
    return analyses

def analyze(word: str, dictionary: str = "en_US") -> MorphAnalysis:
    """Analyze one word; return its first analysis, or an empty one if none is reported."""
    results = analyze_batch([word], dictionary)
    return results[0] if results else MorphAnalysis(word=word, stem="", pos="")

# Usage
words = ["running", "wolves", "unhappiness", "happier", "ran"]
for a in analyze_batch(words):
    print(f"  {a.word}: stem={a.stem}, pos={a.pos}, tense={a.tense}, number={a.number}")

Illustrative output, assuming a dictionary annotated with these fields:

  running: stem=run, pos=verb, tense=present_participle, number=
  wolves: stem=wolf, pos=noun, tense=, number=plural
  unhappiness: stem=happy, pos=noun, tense=, number=
  happier: stem=happy, pos=adj, tense=comparative, number=
  ran: stem=run, pos=verb, tense=past, number=

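9.4 Compound analysis

9.4.1 Detecting compounds

# Compounds are very common in German.
# Illustrative output; how the parts are reported depends on the dictionary.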
echo "Hausaufgabe" | hunspell -m -d de_DE
# Hausaufgabe st:Haus+Aufgabe po:noun

echo "Handschuh" | hunspell -m -d de_DE
# Handschuh st:Hand+Schuh po:noun

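9.4.2 Compound configuration in detail

# Compound-related directives in the .aff file
COMPOUNDBEGIN B         # flag: word may be the first part of a compound
COMPOUNDMIDDLE M        # flag: word may be a middle part of a compound
COMPOUNDEND E           # flag: word may be the last part of a compound
COMPOUNDWORDMAX 5       # maximum number of words in a compound
COMPOUNDMIN 3           # minimum length of each part, in characters
COMPOUNDSYLLABLE 6 aeiou  # maximum syllable count, followed by the vowels used to count syllables
COMPOUNDROOT R          # flag marking dictionary entries that are themselves compounds
CHECKCOMPOUNDCASE       # forbid upper-case letters at part boundaries
CHECKCOMPOUNDDUP        # forbid repeating the same word inside a compound (e.g. "the-the")
CHECKCOMPOUNDREP        # reject a compound that a REP replacement would turn into a non-compound word
CHECKCOMPOUNDTRIPLE     # forbid three identical letters in a row at a part boundary
FORCEUCASE 1            # flag: a compound ending in a word with this flag must be capitalized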

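9.4.3 A German compound rule example

# Key compound settings in a simplified de_DE.aff.
# COMPOUNDRULE patterns and the positional flags are two separate mechanisms;
# both are shown here, although a dictionary normally relies on one of them.
COMPOUNDRULE 2
COMPOUNDRULE BM*E       # begin + any number of middle parts + end
COMPOUNDRULE BE         # two-part compound (already covered by BM*E)

COMPOUNDBEGIN B
COMPOUNDMIDDLE M
COMPOUNDEND E

COMPOUNDWORDMAX 3
COMPOUNDMIN 3

# In de_DE.dic
Haus/BME                # may appear in any position of a compound
Aufgabe/BME
Hand/BME
Schuh/BME
Arbeit/BME

# Compounds this is meant to accept:
# Hausaufgabe (homework)
# Handschuh (glove)
# Arbeitgeber (employer): also needs a Geber entry
# Schuhmacher (shoemaker): also needs a Macher entry
# Simplified: a real German dictionary also has to deal with the lower-case
# spelling of non-initial parts (aufgabe, schuh).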
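
The positional-flag idea above can be sketched in a few lines of Python. This is not Hunspell's algorithm, only a minimal illustration under stated assumptions: `LEXICON`, `COMPOUND_MIN`, `COMPOUND_WORD_MAX` and `split_compound` are hypothetical names, the lexicon is a toy, and matching is done in lower case to sidestep German capitalization.

```python
# Toy lexicon (hypothetical): word -> positional flags
# B = may begin a compound, M = may sit in the middle, E = may end it
LEXICON = {
    "haus": {"B", "M", "E"},
    "aufgabe": {"B", "M", "E"},
    "hand": {"B", "M", "E"},
    "schuh": {"B", "M", "E"},
    "arbeit": {"B", "M", "E"},
}
COMPOUND_MIN = 3        # minimum length of each part
COMPOUND_WORD_MAX = 3   # maximum number of parts


def split_compound(word, lexicon=LEXICON):
    """Return every split of `word` into 2..COMPOUND_WORD_MAX lexicon parts."""
    word = word.lower()
    results = []

    def walk(start, parts):
        if start == len(word):
            if len(parts) >= 2:
                results.append(list(parts))
            return
        if len(parts) == COMPOUND_WORD_MAX:
            return
        for end in range(start + COMPOUND_MIN, len(word) + 1):
            piece = word[start:end]
            flags = lexicon.get(piece)
            if flags is None:
                continue
            is_first = start == 0
            is_last = end == len(word)
            if is_first and is_last:
                continue  # the whole word is a single entry, not a compound
            needed = "B" if is_first else ("E" if is_last else "M")
            if needed in flags:
                parts.append(piece)
                walk(end, parts)
                parts.pop()

    walk(0, [])
    return results


print(split_compound("Hausaufgabe"))
print(split_compound("Handschuharbeit"))
```

A word made of four parts is rejected because of `COMPOUND_WORD_MAX`, which mirrors the role of COMPOUNDWORDMAX in the settings above.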

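9.4.4 English compounds

# Compounding is less productive in English, but it does occur.
# In the .aff file:
WORDCHARS -'            # treat hyphen and apostrophe as word characters when tokenizing

# Hyphenated compounds
well-known              # read as one word only if WORDCHARS contains "-"
self-esteem
mother-in-law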

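9.5 Inflection

9.5.1 What inflection is

Inflection is the change in a word's form that marks its grammatical role without changing its core meaning:

Category	English example	Other languages
Number (nouns)	cat → cats	German Haus → Häuser
Case (nouns)	none (pronouns only)	Russian книга → книги → книге
Gender (nouns/adjectives)	none	French grand → grande
Tense (verbs)	walk → walked	Spanish hablo → hablé
Person (verbs)	walk → walks	French parle → parlons
Voice (verbs)	none (uses auxiliaries)	Latin amo → amor
Degree (adjectives)	tall → taller	German groß → größer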
9.5.2 The English inflection system

# Outputs are illustrative and assume a dictionary annotated with these fields.

# Noun inflection
echo "cat" | hunspell -m -d en_US
# cat st:cat po:noun nu:singular

echo "cats" | hunspell -m -d en_US
# cats st:cat po:noun nu:plural

# Verb inflection
echo "walks" | hunspell -m -d en_US
# walks st:walk po:verb ts:present ps:third nu:singular

echo "walking" | hunspell -m -d en_US
# walking st:walk po:verb ts:present_participle

# "walked" is ambiguous (past tense or past participle),
# so a single call can report both analyses
echo "walked" | hunspell -m -d en_US
# walked st:walk po:verb ts:past
# walked st:walk po:verb ts:past_participle

# Adjective inflection
echo "taller" | hunspell -m -d en_US
# taller st:tall po:adj ts:comparative

echo "tallest" | hunspell -m -d en_US
# tallest st:tall po:adj ts:superlative

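9.5.3 Regular vs. irregular inflection

# Regular inflection: handled by affix rules
walk → walked, walking, walks       # SFX D, SFX G, SFX S

# Irregular inflection: the forms have to be listed in the .dic by hand
go → went, gone, going              # handled manually
be → am, is, are, was, were, been   # handled manually
have → has, had, having             # handled manually

# Ways to handle irregular verbs in the dictionary
# Method 1: list every form as its own entry
went
gone
going
goes

# Method 2: keep affix flags on the stem and add the irregular forms by hand
go/GS                   # regular forms via affix flags (no D: it would yield "goed")
went st:go              # irregular past tense, added by hand with a stem field

# Method 3: affix-rule tricks
# Some dictionaries use specially crafted affix rules to cover irregular patterns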
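
The division of labor between listed exceptions and affix rules can be sketched as follows. This is a toy model, not Hunspell's implementation: `IRREGULAR`, `KNOWN_STEMS`, `SUFFIXES` and `find_stem` are hypothetical names, and the suffix list is deliberately tiny.

```python
# Hand-listed irregular forms, playing the role of "went st:go" entries
IRREGULAR = {
    "went": "go", "gone": "go",
    "am": "be", "is": "be", "are": "be", "was": "be", "were": "be", "been": "be",
    "has": "have", "had": "have",
}
# Stems present in the toy dictionary
KNOWN_STEMS = {"walk", "go", "be", "have"}
# Regular suffixes, tried in this order
SUFFIXES = ("ing", "ed", "es", "s")


def find_stem(word):
    """Return the stem of `word`, or None if the toy dictionary cannot explain it."""
    if word in IRREGULAR:            # exceptions win over rules
        return IRREGULAR[word]
    if word in KNOWN_STEMS:          # the word is already a stem
        return word
    for suffix in SUFFIXES:          # regular case: strip a suffix and validate
        if word.endswith(suffix):
            candidate = word[: -len(suffix)]
            if candidate in KNOWN_STEMS:
                return candidate
    return None


for w in ["went", "walked", "walks", "going", "goes", "were", "xyz"]:
    print(w, "->", find_stem(w))
```

Checking the exception table first keeps a rule from producing a wrong answer for a form that only looks regular.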

9.6 Derivation

9.6.1 What derivation is

Derivation creates new words by adding affixes, usually changing the part of speech or the core meaning:

Type	Example	Part-of-speech change
Nominalization	happy → happiness	adj → noun
Verbalization	modern → modernize	adj → verb
Adjectivalization	danger → dangerous	noun → adj
Adverbialization	quick → quickly	adj → adv
Negation	happy → unhappy	adj → adj
Agent	teach → teacher	verb → noun
Instrument	print → printer	verb → noun

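9.6.2 English derivation rules

# Derivation rules in the .aff file
# Header line: SFX/PFX flag cross_product rule_count
# Rule line:   SFX/PFX flag strip add condition   (0 means "nothing")

# -ness nominalization
SFX N Y 2
SFX N   0   ness    [^y]
SFX N   y   iness   [^aeiou]y

# -ly adverbialization
SFX L Y 2
SFX L   0   ly      [^y]
SFX L   y   ily     [^aeiou]y

# -ment nominalization
SFX M Y 1
SFX M   0   ment    .

# -ion / -ation nominalization
SFX O Y 2
SFX O   e   ion     e
SFX O   0   ation   [^e]

# -ful adjectivalization
SFX F Y 1
SFX F   0   ful     .

# -less adjectivalization (negative)
SFX X Y 1
SFX X   0   less    .

# -ous adjectivalization (flag K, so that U stays free for the un- prefix)
SFX K Y 2
SFX K   0   ous     [^y]
SFX K   y   ious    [^aeiou]y

# -er agent
SFX E Y 2
SFX E   0   er      [^e]
SFX E   0   r       e

# -able adjectivalization
SFX B Y 2
SFX B   0   able    [^e]
SFX B   e   able    e

# un- negative prefix (strip nothing, add "un")
PFX U Y 1
PFX U   0   un      .

# re- repetition prefix (strip nothing, add "re")
PFX R Y 1
PFX R   0   re      .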
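
To make the strip/add/condition triple concrete, here is a minimal Python sketch that applies suffix rules of this shape to a stem. It is an illustration only: `SuffixRule`, `RULES` and `apply_suffix` are hypothetical names, the empty string stands in for the `0` of the affix file, and conditions are treated as ordinary regular-expression fragments anchored at the end of the stem.

```python
import re
from typing import NamedTuple


class SuffixRule(NamedTuple):
    strip: str      # characters removed from the end of the stem ("" for none)
    add: str        # characters appended afterwards
    condition: str  # pattern the end of the stem must match


# A few of the rules above, keyed by flag
RULES = {
    "N": [SuffixRule("", "ness", "[^y]"), SuffixRule("y", "iness", "[^aeiou]y")],
    "L": [SuffixRule("", "ly", "[^y]"), SuffixRule("y", "ily", "[^aeiou]y")],
    "M": [SuffixRule("", "ment", ".")],
}


def apply_suffix(stem, flag, rules=RULES):
    """Return every form that the rules of `flag` produce for `stem`."""
    forms = []
    for rule in rules.get(flag, []):
        # The condition is checked against the end of the unmodified stem
        if not re.search(f"(?:{rule.condition})$", stem):
            continue
        if not stem.endswith(rule.strip):
            continue
        base = stem[: len(stem) - len(rule.strip)]
        forms.append(base + rule.add)
    return forms


print(apply_suffix("happy", "N"))
print(apply_suffix("quick", "L"))
print(apply_suffix("develop", "M"))
```

A stem such as "gray" matches neither `N` rule, which shows how conditions also decide when a rule must not fire.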

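9.6.3 Derivation chains

# Derivations can be chained
happy (adj)
  → unhappy (adj, un-)
      → unhappiness (noun, -ness)
      → unhappily (adv, -ly)

# In the dictionary (flags as defined in 9.6.2)
happy/ULN       # U = un- prefix, L = -ly, N = -ness
# Because the rules allow cross products, the affixes combine to give:
# happy, happily, happiness
# unhappy, unhappily, unhappiness
# Comparative and superlative forms (happier, happiest) would need further suffix classes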
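
The way prefixes and suffixes multiply can be sketched as a small cross-product expansion. This is a simplified model under stated assumptions: `expand` is a hypothetical helper, suffixes are given as (strip, add) pairs, and affix conditions are left out.

```python
def expand(stem, prefixes, suffixes):
    """Return the stem, its suffixed forms, and each of those with every prefix.

    prefixes: list of strings, e.g. ["un"]
    suffixes: list of (strip, add) pairs, e.g. [("y", "ily")]
    """
    bases = [stem]
    for strip, add in suffixes:
        if stem.endswith(strip):
            bases.append(stem[: len(stem) - len(strip)] + add)
    forms = list(bases)
    for prefix in prefixes:
        # Cross product: every prefix combines with every base form
        forms.extend(prefix + base for base in bases)
    return forms


print(expand("happy", ["un"], [("y", "ily"), ("y", "iness")]))
```

One entry with one prefix and two suffixes already yields six forms, which is why affix compression keeps dictionaries small.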

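9.7 Chinese morphology

9.7.1 How Chinese differs

Chinese differs fundamentally from the Indo-European languages:

Feature	Indo-European (English/German/French)	Chinese
Writing unit	letters → words	characters → words/phrases
Word boundaries	separated by spaces	no explicit separator
Inflection	rich	almost none
Word formation	prefixes and suffixes	mostly compounding
Affix morphology	full prefix/suffix system	very limited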
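9.7.2 Chinese word segmentation

#!/usr/bin/env python3
"""Chinese spell checking: segmentation + Hunspell."""
import jieba
import subprocess

def chinese_spellcheck(text: str, dictionary: str = "zh_CN") -> list[dict]:
    """
    Chinese spell-check pipeline:
    1. segment the text with jieba
    2. check each word with Hunspell
    3. return the list of errors

    The dictionary name is a placeholder: a Hunspell dictionary
    with that name must be installed for this to work.
    """
    # Segment, then keep only tokens that contain a CJK ideograph
    # (this drops whitespace, punctuation, and non-Chinese text)
    words = [
        w for w in jieba.cut(text)
        if any('\u4e00' <= c <= '\u9fff' for c in w)
    ]
    if not words:
        return []

    # One hunspell process for all words; -l prints the words it does not recognize
    result = subprocess.run(
        ["hunspell", "-d", dictionary, "-l"],
        input="\n".join(words) + "\n", capture_output=True, text=True
    )
    unknown = set(result.stdout.split())

    return [
        {"word": w, "suggestions": []}  # Hunspell suggestions for Chinese are limited
        for w in words if w in unknown
    ]

# Usage (the sample sentence deliberately contains the typos 错别子 and 演试)
text = "这是一段包含错别子的文本，用于演试中文拼写检查功能。"
errors = chinese_spellcheck(text)
print(f"Found {len(errors)} possible errors:")
for err in errors:
    print(f"  → {err['word']}")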

9.7.3 Handling special Chinese vocabulary

#!/usr/bin/env python3
"""Handling special Chinese vocabulary."""
import jieba

# Chinese numerals, measure words, and particles
CHINESE_NUMBERS = set("零一二三四五六七八九十百千万亿两")
CHINESE_UNITS = set("个只条把张件套双副对组群批种类阵")
CHINESE_PARTICLES = set("的地得着了过")

# Known misspellings and their corrections
COMMON_MISTAKES = {
    "错别子": "错别字",
    "演试": "演示",
    "在见": "再见",
}

def is_chinese_special(word: str) -> bool:
    """Return True for numerals, measure words, and particles."""
    # Pure numbers
    if all(c in CHINESE_NUMBERS or c.isdigit() for c in word):
        return True
    # Measure words
    if len(word) == 1 and word in CHINESE_UNITS:
        return True
    # Particles
    if word in CHINESE_PARTICLES:
        return True
    return False

def chinese_spellcheck_enhanced(text: str) -> list[dict]:
    """Spell check with special-vocabulary filtering and a misspelling table."""
    words = list(jieba.cut(text))
    errors = []

    for word in words:
        if not word.strip() or not any('\u4e00' <= c <= '\u9fff' for c in word):
            continue

        # Skip special vocabulary
        if is_chinese_special(word):
            continue

        # Custom checks go here, for example a table of common misspellings.
        # The lookup only fires when the segmenter returns the misspelling as one token.
        if word in COMMON_MISTAKES:
            errors.append({
                "word": word,
                "correction": COMMON_MISTAKES[word],
                "type": "common misspelling"
            })

    return errors

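9.7.4 Japanese morphology

# Japanese needs special handling:
# it requires a dedicated segmenter (MeCab, for example)

# Option 1: MeCab + Hunspell
# MeCab segments the text; Hunspell handles stemming and spell checking

# Option 2: a dedicated Japanese dictionary
# Japanese dictionaries usually need a special format

9.7.5 Korean morphology

# Korean is an agglutinative language, rich in affixes
# Verb conjugation example:
# 가다 (gada, "to go") → 가, 가고, 가서, 갔다, 갈

# Hunspell's support for Korean is limited
# A dedicated tool such as KoNLPy (Python) is recommended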

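9.8 Arabic morphology

9.8.1 Characteristics of Arabic

Arabic is built on a triliteral (three-consonant) root system:

Root	Meaning	Derived words
k-t-b	write	kitāb (book), kātib (writer), maktab (office)
d-r-s	study	darasa (he studied), madrasa (school), mudarris (teacher)
ʿ-l-m	know	ʿilm (knowledge, science), ʿālim (scholar), maʿlūm (known)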
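9.8.2 Arabic in Hunspell

# Hunspell's support for Arabic is limited:
# it takes many hand-written entries or special templates

# Alternatives:
# - qutrub (Arabic verb conjugation)
# - AraMorph (Arabic morphological analyzer)
# - CAMeL Tools (Arabic NLP)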

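9.9 Support for other languages

9.9.1 Turkish

# Key feature of Turkish: vowel harmony
# A suffix takes different forms depending on the vowels of the stem

# Example: plural -ler/-lar
# ev → evler (houses): front-vowel stems take -ler
# at → atlar (horses): back-vowel stems take -lar

# In Hunspell (simplified). A suffix condition is matched against the end of the
# stem and supports only character classes and ".", so ".*" cannot be used.
# These rules cover stems that end in a vowel or in a vowel plus one consonant.
SFX L Y 4
SFX L   0   ler     [eiöü]
SFX L   0   ler     [eiöü][^aeıioöuü]
SFX L   0   lar     [aıou]
SFX L   0   lar     [aıou][^aeıioöuü]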
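9.9.2 Finnish

# Key feature of Finnish: heavy inflection, with 15 cases
# The expansion ratio from dictionary entries to word forms can reach 200:1

# Case forms of talo (house):
# talo (nominative), talon (genitive), taloa (partitive)
# talossa (inessive), talosta (elative), taloon (illative)
# talolla (adessive), talolta (ablative), talolle (allative)
# talona (essive), taloksi (translative)
# taloin (instructive), talotta (abessive)

# Handling Finnish in Hunspell takes a very large set of affix rules;
# the alternative is Voikko, a tool built specifically for Finnish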

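9.9.3 Hungarian

# Key feature of Hungarian: strongly agglutinative, with 18 case suffixes
# Example: ház (house); the list is not exhaustive
# házban (in the house), házból (out of the house)
# házba (into the house), házat (house, accusative)
# háznak (to/for the house), házzal (with the house)
# házig (as far as the house), házért (for the sake of the house)
# házként (as a house), házul (in the manner of a house)
# házról (off/about the house), házhoz (to the house)
# házon (on the house)

# The Hungarian Hunspell dictionary is among the largest:
# roughly 60,000 stems expand to about 1.5 million word forms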

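9.9.4 Icelandic

# Icelandic retains the complex inflection system of Old Norse:
# 4 cases, 3 genders, 2 numbers

# Its dictionary makes heavy use of morphological tags
# The Icelandic Hunspell dictionary is maintained by the Icelandic language institute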

9.10 Morphology in search

9.10.1 Stem-based search

#!/usr/bin/env python3
"""Stem-based search with Hunspell."""
import subprocess

def get_stems(word: str, dictionary: str = "en_US") -> list[str]:
    """Return every stem Hunspell reports for `word`."""
    result = subprocess.run(
        ["hunspell", "-s", "-d", dictionary],
        input=word, capture_output=True, text=True
    )
    stems = set()
    for line in result.stdout.splitlines():
        # Each result line is "word stem"; shorter lines carry no stem
        parts = line.split()
        if len(parts) >= 2:
            stems.add(parts[-1])
    return sorted(stems)

def stem_search(query: str, documents: list[str], dictionary: str = "en_US") -> list[dict]:
    """
    Stem-based search.
    Reduce both the query and the documents to stems, then match on the stems.
    """
    # Stems of the query words
    query_words = query.lower().split()
    query_stems = set()
    for word in query_words:
        stems = get_stems(word, dictionary)
        query_stems.update(stems)

    # Score each document
    results = []
    for i, doc in enumerate(documents):
        doc_words = doc.lower().split()
        doc_stems = set()
        for word in doc_words:
            stems = get_stems(word, dictionary)
            doc_stems.update(stems)

        # Score = share of query stems found in the document
        matches = query_stems & doc_stems
        score = len(matches) / len(query_stems) if query_stems else 0

        if score > 0:
            results.append({
                "doc_id": i,
                "score": score,
                "matches": matches,
                "text": doc[:100]
            })

    return sorted(results, key=lambda x: x["score"], reverse=True)

# Usage
documents = [
    "The cats are running in the garden",
    "Dogs run quickly through the park",
    "A beautiful sunset over the mountains",
    "Running is a great form of exercise",
]

results = stem_search("cat running", documents)
print("Search results:")
for r in results:
    print(f"  Document {r['doc_id']}: score {r['score']:.2f}")
    print(f"    Matched stems: {r['matches']}")
    print(f"    Text: {r['text']}...")

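9.10.2 Stemming and lemmatization tools compared

Tool	Method	Strengths	Weaknesses
Hunspell	Rule-based affix analysis	Many languages; customizable	Only as good as its dictionary
NLTK WordNetLemmatizer	Dictionary lookup	Accurate	English only
spaCy lemmatizer	Model-based	Context-aware	Needs a model
Stemmers (Porter/Snowball)	Rule-based suffix stripping	Simple and fast	Prone to over-stemming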
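9.11 Chapter summary

Concept	Description	Hunspell support
Stemming	Find the stem of a word	`-s` option
Morphological analysis	Describe a word's structure	`-m` option
Inflection	Form changes driven by grammar	SFX/PFX rules
Derivation	New words built with affixes	SFX/PFX rules
Compounds	Several words joined into one	COMPOUNDRULE system
Chinese	Segmentation + dictionary check	Needs a separate segmenter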
