Chapter 09: High Availability and Failover

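9.1 Why High Availability Is Needed

DNS and DHCP are core network infrastructure services, and an outage in either one is felt across the whole network: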

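Service   Impact of an outage
DNS       All name resolution fails, so internet access is effectively cut off
DHCP      New devices cannot obtain an IP address, and existing devices drop off the network when their leases expire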

SLA targets

Scenario             Required availability   Allowed downtime per year
Home network         99%                     3.65 days
Small business       99.9%                   8.76 hours
Mid-size business    99.99%                  52.56 minutes
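
The downtime figures follow directly from the availability percentage. The sketch below reproduces them in minutes per year, assuming a 365-day year; downtime_minutes is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# Allowed downtime per year, in minutes, for a given availability percentage.
# downtime_minutes is a hypothetical helper, not part of any tool.
downtime_minutes() {
    awk -v a="$1" 'BEGIN { printf "%.2f\n", (100 - a) / 100 * 365 * 24 * 60 }'
}

# Print the budget for the three SLA targets listed above.
for target in 99 99.9 99.99; do
    printf '%s%% -> %s minutes/year\n' "$target" "$(downtime_minutes "$target")"
done
```

For 99.99% this prints 52.56 minutes, and the 99% result of 5256 minutes is the same 3.65 days shown in the table.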

9.2 High-Availability Architecture Options

Option 1: Active-Standby with Keepalived

┌─────────────────┐             ┌─────────────────┐
│ Dnsmasq primary │←─heartbeat─→│ Dnsmasq standby │
│ 192.168.1.1     │             │ 192.168.1.2     │
└────────┬────────┘             └────────┬────────┘
         │                               │
         └───────────────┬───────────────┘
                         │
                VIP: 192.168.1.254
                         │
                  ┌──────┴──────┐
                  │   Clients   │
                  └─────────────┘

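Option 2: Active-Active with Load Balancing

Client DNS configuration:
  nameserver 192.168.1.1
  nameserver 192.168.1.2

Both Dnsmasq instances serve queries at the same time. The client resolver fails over on its own: it retries against the second nameserver when the first one times out. Note that the glibc resolver tries the servers in the order listed unless options rotate is set in resolv.conf, so without that option most queries go to the first server.

Option 3: DHCP Split Scope

Primary server: 192.168.1.100 - 192.168.1.150 (about half of the pool)
Standby server: 192.168.1.151 - 192.168.1.200 (about half of the pool)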

9.3 Deploying Keepalived

9.3.1 Installing Keepalived

# Debian/Ubuntu
sudo apt install keepalived

# CentOS/RHEL
sudo yum install keepalived

# Verify the installation
keepalived --version

9.3.2 Keepalived Configuration on the Master Node

# /etc/keepalived/keepalived.conf (master node)

global_defs {
    router_id DNS_MASTER
    script_user root
    enable_script_security
}

vrrp_script chk_dnsmasq {
    script "/usr/bin/killall -0 dnsmasq"
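    interval 2          # run the check every 2 seconds
    weight -20          # lower the priority by 20 while the check is failing
    fall 3              # 3 consecutive failures mark the service as down
    rise 2              # 2 consecutive successes mark it as recovered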
}

vrrp_instance VI_DNS {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 100
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass dns_ha_secret   # Keepalived uses only the first 8 characters
    }

    virtual_ipaddress {
        192.168.1.254/24 dev eth1
    }

    track_script {
        chk_dnsmasq
    }

    # Scripts to run on state transitions
    notify_master "/etc/keepalived/scripts/notify.sh MASTER"
    notify_backup "/etc/keepalived/scripts/notify.sh BACKUP"
    notify_fault  "/etc/keepalived/scripts/notify.sh FAULT"
}

9.3.3 Keepalived Configuration on the Backup Node

# /etc/keepalived/keepalived.conf (backup node)

global_defs {
    router_id DNS_BACKUP
    script_user root
    enable_script_security
}

vrrp_script chk_dnsmasq {
    script "/usr/bin/killall -0 dnsmasq"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_DNS {
    state BACKUP
    interface eth1
    virtual_router_id 51
    priority 90        # lower than the master node
    advert_int 1

    authentication {
        auth_type PASS
        auth_pass dns_ha_secret
    }

    virtual_ipaddress {
        192.168.1.254/24 dev eth1
    }

    track_script {
        chk_dnsmasq
    }

    notify_master "/etc/keepalived/scripts/notify.sh MASTER"
    notify_backup "/etc/keepalived/scripts/notify.sh BACKUP"
    notify_fault  "/etc/keepalived/scripts/notify.sh FAULT"
}
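
Why these particular numbers trigger a failover can be checked with simple arithmetic. The sketch below assumes the negative-weight behaviour described in the configuration comments, where a failing tracked script lowers the node's priority by the weight; effective_priority is a hypothetical helper written only for illustration, and it does not model positive weights.

```shell
#!/bin/sh
# effective_priority BASE WEIGHT CHECK_OK
# A negative weight is applied to the base priority only while the
# tracked script is failing (CHECK_OK = no).
effective_priority() {
    base=$1; weight=$2; check_ok=$3
    if [ "$check_ok" = "yes" ]; then
        echo "$base"
    else
        echo $((base + weight))
    fi
}

master=$(effective_priority 100 -20 no)    # dnsmasq is down on the master
backup=$(effective_priority 90 -20 yes)    # dnsmasq is healthy on the backup
if [ "$backup" -gt "$master" ]; then
    echo "failover: backup ($backup) outranks master ($master)"
fi
```

With priorities 100 and 90, the weight has to be larger than the gap of 10 for the backup to win. A weight of -5 would leave the master at 95 and the VIP would not move.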

9.3.4 State-Change Notification Script

sudo mkdir -p /etc/keepalived/scripts

sudo tee /etc/keepalived/scripts/notify.sh <<'SCRIPT'
#!/bin/bash
STATE=$1
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
LOGFILE="/var/log/keepalived-state.log"

echo "[$TIMESTAMP] State changed to: $STATE" >> "$LOGFILE"

case $STATE in
    MASTER)
        # Became master: make sure Dnsmasq is running
        systemctl start dnsmasq
        # Send an alert (email/DingTalk/Slack)
        # curl -X POST "https://hooks.slack.com/..." -d '{"text":"DNS Master activated"}'
        ;;
    BACKUP)
        # Became backup: keep Dnsmasq running as a warm standby
        systemctl start dnsmasq
        ;;
    FAULT)
        # Fault state: write a log entry
        logger -t keepalived "DNS HA entered FAULT state"
        ;;
esac
SCRIPT

sudo chmod +x /etc/keepalived/scripts/notify.sh

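9.3.5 Synchronizing Dnsmasq Configuration Between the Two Nodes

# The Dnsmasq configuration on both servers must be identical

# Method 1: manual sync (simple, but not real time)
rsync -avz /etc/dnsmasq.d/ backup-server:/etc/dnsmasq.d/

# Method 2: automatic sync with inotifywait
sudo apt install inotify-tools

#!/bin/bash
# /usr/local/bin/sync-dnsmasq.sh
REMOTE="backup-server"
inotifywait -m -r -e modify,create,delete,move /etc/dnsmasq.d/ |
while read -r path action file; do
    # --delete also removes files on the standby that were deleted locally
    rsync -avz --delete /etc/dnsmasq.d/ "$REMOTE":/etc/dnsmasq.d/
    # Dnsmasq does not re-read its configuration files on SIGHUP (reload),
    # so a restart is needed for the changes to take effect
    ssh "$REMOTE" "systemctl restart dnsmasq"
done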
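
Whichever sync method is used, it helps to be able to confirm that the two configuration directories really are identical. The following is a minimal sketch, assuming GNU find, sort, xargs and sha256sum are available; config_fingerprint is a hypothetical helper, and the two temporary directories merely stand in for /etc/dnsmasq.d on each server.

```shell
#!/bin/sh
# config_fingerprint DIR
# Prints one SHA-256 digest covering every file name and file content under DIR.
config_fingerprint() {
    ( cd "$1" && find . -type f -print0 | LC_ALL=C sort -z \
        | xargs -0 sha256sum | sha256sum | cut -d' ' -f1 )
}

# Demo on two throwaway directories standing in for the two servers.
a=$(mktemp -d); b=$(mktemp -d)
printf 'cache-size=1000\n' > "$a/dns.conf"
printf 'cache-size=1000\n' > "$b/dns.conf"
[ "$(config_fingerprint "$a")" = "$(config_fingerprint "$b")" ] && echo "in sync"

printf 'cache-size=2000\n' > "$b/dns.conf"
[ "$(config_fingerprint "$a")" = "$(config_fingerprint "$b")" ] || echo "drift detected"
rm -rf "$a" "$b"
```

In practice the same function could be run on each server and the two digests compared, for example from a cron job that raises an alert when they differ.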

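9.3.6 Synchronizing DHCP Leases

# The DHCP lease file should be kept in sync as well, to avoid address conflicts after a failover

# Method 1: shared storage (NFS)
# Both servers mount the same NFS directory and keep the lease file there.
# Two instances writing the same lease file at once can overwrite each other's
# entries, so only one of them should be serving DHCP from it at any given time.
# dhcp-leasefile=/nfs/shared/dnsmasq.leases

# Method 2: non-overlapping address ranges + static bindings
# Primary server uses 192.168.1.100-150
# Standby server uses 192.168.1.151-200
# The ranges do not overlap, so no conflicts can occur

# Method 3: lease file sync script
#!/bin/bash
# /usr/local/bin/sync-leases.sh
REMOTE="backup-server"
scp /var/lib/misc/dnsmasq.leases "$REMOTE":/var/lib/misc/dnsmasq.leases
# Dnsmasq reads the lease file only at startup, so restart it instead of reloading
ssh "$REMOTE" "systemctl restart dnsmasq"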

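9.4 Active-Active DNS Configuration

9.4.1 Simple Active-Active (Clients Configured with Two DNS Servers)

# No Keepalived needed: clients are configured with both DNS servers directly

# Configuration on the primary server, 192.168.1.1
listen-address=192.168.1.1
bind-interfaces

# Configuration on the standby server, 192.168.1.2
listen-address=192.168.1.2
bind-interfaces

# Hand out both DNS servers via DHCP
dhcp-option=option:dns-server,192.168.1.1,192.168.1.2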

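9.4.2 Data Consistency for Active-Active DNS

# Both servers must use the same:
# - /etc/dnsmasq.hosts (local records)
# - /etc/dnsmasq.conf + /etc/dnsmasq.d/ (configuration files)
# - upstream DNS settings

# Sync options
# Option A: manage the configuration in a Git repository
# /etc/dnsmasq.d/ is a Git working copy; each server pulls the latest configuration.
# Restart rather than reload: Dnsmasq does not re-read its configuration files on SIGHUP.
cd /etc/dnsmasq.d && git pull && systemctl restart dnsmasq

# Option B: share the configuration over NFS
# mount -t nfs config-server:/export/dnsmasq /etc/dnsmasq.d

# Option C: manage both servers with Ansible
# ansible-playbook deploy-dnsmasq.yml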

9.4.3 DHCP Split Scope

# Primary server - first half of the address pool
# /etc/dnsmasq.d/dhcp-primary.conf
interface=eth1
dhcp-range=set:primary,192.168.1.100,192.168.1.149,255.255.255.0,24h
dhcp-option=tag:primary,option:router,192.168.1.1
dhcp-option=tag:primary,option:dns-server,192.168.1.1,192.168.1.2
# dhcp-authoritative is left out on purpose: the dnsmasq man page says to set it
# only when dnsmasq is definitely the only DHCP server on the network
# dhcp-authoritative

# Standby server - second half of the address pool
# /etc/dnsmasq.d/dhcp-secondary.conf
interface=eth1
dhcp-range=set:secondary,192.168.1.150,192.168.1.199,255.255.255.0,24h
dhcp-option=tag:secondary,option:router,192.168.1.1
dhcp-option=tag:secondary,option:dns-server,192.168.1.1,192.168.1.2
# dhcp-authoritative
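
The two dhcp-range lines can also be derived from the pool boundaries instead of being typed by hand, which avoids accidental overlap. This is a sketch under the assumption that the whole pool sits inside a single /24 and uses the same tags, netmask and lease time as the configuration above; split_pool is a hypothetical helper name.

```shell
#!/bin/sh
# split_pool PREFIX FIRST LAST
# Splits the host range FIRST..LAST of a /24 into two non-overlapping halves
# and prints one dhcp-range line for each server.
split_pool() {
    prefix=$1; first=$2; last=$3
    mid=$(( (first + last) / 2 ))
    echo "dhcp-range=set:primary,$prefix.$first,$prefix.$mid,255.255.255.0,24h"
    echo "dhcp-range=set:secondary,$prefix.$((mid + 1)),$prefix.$last,255.255.255.0,24h"
}

split_pool 192.168.1 100 199
```

For the pool 100 to 199 this yields the ranges 100-149 and 150-199 used in this section.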

9.5 Health Check Scripts

9.5.1 DNS Health Check

#!/bin/bash
# /usr/local/bin/check-dns-health.sh

DOMAIN="www.baidu.com"
DNS_SERVER="127.0.0.1"
MAX_RETRIES=3
TIMEOUT=5

for i in $(seq 1 $MAX_RETRIES); do
    result=$(dig @$DNS_SERVER $DOMAIN +short +timeout=$TIMEOUT 2>/dev/null)
    if [ -n "$result" ]; then
        echo "DNS OK: $DOMAIN resolved to $result"
        exit 0
    fi
    sleep 1
done

echo "DNS CRITICAL: Failed to resolve $DOMAIN after $MAX_RETRIES attempts"
exit 2

9.5.2 DHCP Health Check

#!/bin/bash
# /usr/local/bin/check-dhcp-health.sh

# Check that the Dnsmasq process is running
if ! pidof dnsmasq > /dev/null; then
    echo "CRITICAL: dnsmasq process not running"
    exit 2
fi

# Check that the DHCP port is being listened on
if ! ss -uln | grep -q ":67 "; then
    echo "CRITICAL: DHCP port 67 not listening"
    exit 2
fi

# Check that the lease file exists
LEASE_FILE="/var/lib/misc/dnsmasq.leases"
if [ ! -f "$LEASE_FILE" ]; then
    echo "WARNING: Lease file missing"
    exit 1
fi

# Count the entries in the lease file
LEASE_COUNT=$(wc -l < "$LEASE_FILE")
echo "OK: dnsmasq running, $LEASE_COUNT active leases"
exit 0
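
Counting lines treats every entry in the lease file as active. A slightly stricter count can be sketched as follows, assuming the usual lease file layout in which each IPv4 line starts with the expiry time as a Unix timestamp (0 meaning an infinite lease), followed by the MAC address, IP address, hostname and client ID. count_active_leases is a hypothetical helper, and the sample file is made-up test data.

```shell
#!/bin/sh
# count_active_leases FILE [NOW]
# Counts lease lines whose expiry timestamp is 0 (infinite) or later than NOW.
# Lines that do not start with a number, such as a "duid" line, are skipped.
count_active_leases() {
    now=${2:-$(date +%s)}
    awk -v now="$now" '
        $1 ~ /^[0-9]+$/ && NF >= 3 && ($1 == 0 || $1 > now) { n++ }
        END { print n + 0 }
    ' "$1"
}

# Demo with a throwaway lease file and a fixed "current time".
leases=$(mktemp)
cat > "$leases" <<'EOF'
2000000000 aa:bb:cc:dd:ee:01 192.168.1.101 laptop 01:aa:bb:cc:dd:ee:01
1000000000 aa:bb:cc:dd:ee:02 192.168.1.102 phone *
0 aa:bb:cc:dd:ee:03 192.168.1.103 printer *
duid 00:01:00:01:2a:2b:2c:2d:aa:bb:cc:dd:ee:ff
EOF
count_active_leases "$leases" 1700000000
rm -f "$leases"
```

With the fixed timestamp 1700000000 the demo prints 2: the laptop lease is still valid, the printer lease never expires, and the phone lease has already lapsed.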

9.5.3 Combined Monitoring Script

#!/bin/bash
# /usr/local/bin/monitor-dnsmasq.sh

LOG="/var/log/dnsmasq-monitor.log"
ALERT_EMAIL="admin@example.com"   # placeholder address, replace with your own
SLACK_WEBHOOK="https://hooks.slack.com/services/xxx"

send_alert() {
    local message="$1"
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] ALERT: $message" >> "$LOG"

    # Email notification
    echo "$message" | mail -s "Dnsmasq Alert" "$ALERT_EMAIL" 2>/dev/null

    # Slack notification
    curl -s -X POST "$SLACK_WEBHOOK" \
        -d "{\"text\":\"🚨 Dnsmasq Alert: $message\"}" \
        -H 'Content-Type: application/json' > /dev/null 2>&1
}

# Check the process
if ! pidof dnsmasq > /dev/null; then
    send_alert "Dnsmasq process not running!"
    # Restart automatically
    if systemctl restart dnsmasq; then
        send_alert "Dnsmasq restarted successfully"
    else
        send_alert "Dnsmasq restart FAILED"
    fi
fi

# Check DNS resolution
result=$(dig @127.0.0.1 www.baidu.com +short +timeout=3 2>/dev/null)
if [ -z "$result" ]; then
    send_alert "DNS resolution failed!"
fi

# Check memory usage (pidof -s returns a single PID even if several instances run)
PID=$(pidof -s dnsmasq)
RSS=$(ps -o rss= -p "$PID" 2>/dev/null | tr -d ' ')
if [ -n "$RSS" ] && [ "$RSS" -gt 102400 ]; then  # above 100 MB
    send_alert "Dnsmasq memory usage high: ${RSS}KB"
fi

# Cache hit rate: SIGUSR1 makes Dnsmasq write its cache statistics to the system log
[ -n "$PID" ] && kill -USR1 "$PID" 2>/dev/null

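9.5.4 Automatic Restart with Systemd

# /etc/systemd/system/dnsmasq.service.d/restart.conf
# StartLimitIntervalSec= and StartLimitBurst= are documented as [Unit] settings
# in current systemd, so they are placed there rather than in [Service].
[Unit]
StartLimitIntervalSec=300
StartLimitBurst=5

[Service]
Restart=always
RestartSec=5

# Apply the drop-in with: sudo systemctl daemon-reload && sudo systemctl restart dnsmasq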

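9.6 Advanced Keepalived Configuration

9.6.1 Non-Preemptive Mode

# Avoid frequent switching (the master does not take the VIP back automatically after it recovers)
vrrp_instance VI_DNS {
    state BACKUP          # set both nodes to BACKUP
    nopreempt             # enable non-preemptive mode
    priority 100          # the preferred master gets the higher priority
    ...
}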

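9.6.2 Multiple VRRP Instances (Separating DNS and DHCP)

# DNS master  -> server A
# DHCP master -> server B
# Each server backs up the other's service

# Keepalived configuration on server A
# Each instance needs its own virtual_router_id; 52 is an assumed free ID
vrrp_instance VI_DNS {
    state MASTER
    interface eth1
    virtual_router_id 51
    priority 100
    virtual_ipaddress { 192.168.1.253/24 }
}

vrrp_instance VI_DHCP {
    state BACKUP
    interface eth1
    virtual_router_id 52
    priority 90
    virtual_ipaddress { 192.168.1.252/24 }
}

# Keepalived configuration on server B (roles reversed)
vrrp_instance VI_DNS {
    state BACKUP
    interface eth1
    virtual_router_id 51
    priority 90
    virtual_ipaddress { 192.168.1.253/24 }
}

vrrp_instance VI_DHCP {
    state MASTER
    interface eth1
    virtual_router_id 52
    priority 100
    virtual_ipaddress { 192.168.1.252/24 }
}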

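9.7 Testing Failover

9.7.1 Simulating Failures Manually

# Method 1: stop Dnsmasq
sudo systemctl stop dnsmasq

# Method 2: stop Keepalived
sudo systemctl stop keepalived

# Method 3: take the network interface down
sudo ip link set eth1 down

# Method 4: lower the priority
# Edit keepalived.conf, lower priority, then reload Keepalived

# Watch the failover as it happens
sudo tcpdump -i eth1 -n vrrp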

9.7.2 Measuring Failover Time

# Query DNS continuously from a client and watch how long the switchover takes
#!/bin/bash
# test-failover.sh

while true; do
    result=$(dig @192.168.1.254 www.baidu.com +short +timeout=1 2>/dev/null)
    timestamp=$(date '+%H:%M:%S')
    if [ -z "$result" ]; then
        echo "[$timestamp] FAIL"
    else
        echo "[$timestamp] OK: $result"
    fi
    sleep 1
done
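
Rather than reading the timestamps by eye, the output of test-failover.sh can be saved to a file and summarized. The sketch below assumes the exact line format printed by the script above ([HH:MM:SS] followed by FAIL or OK) and a test run that does not cross midnight; longest_outage is a hypothetical helper, and the sample lines are made-up data.

```shell
#!/bin/sh
# longest_outage
# Reads test-failover.sh output on stdin and prints the longest outage in
# seconds, measured from the first FAIL of a run to the next OK.
# An outage that is still ongoing at the end of the input is not counted.
longest_outage() {
    awk '
        { split(substr($0, 2, 8), t, ":"); now = t[1] * 3600 + t[2] * 60 + t[3] }
        / FAIL$/ { if (!down) { down = 1; start = now } next }
        down { gap = now - start; if (gap > max) max = gap; down = 0 }
        END { print max + 0 }
    '
}

longest_outage <<'EOF'
[10:00:01] OK: 192.0.2.10
[10:00:02] FAIL
[10:00:03] FAIL
[10:00:04] FAIL
[10:00:05] OK: 192.0.2.10
[10:00:06] FAIL
[10:00:07] OK: 192.0.2.10
EOF
```

The sample input contains a 3-second and a 1-second outage, so the function prints 3. The result is only as precise as the probe interval, because each probe includes the dig timeout plus the 1-second sleep.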

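9.8 Complete High-Availability Configuration Example

Master node configuration checklist

Node information:
- Hostname: dns-master
- IP: 192.168.1.1
- VIP: 192.168.1.254
- Role: MASTER

Files to configure:
1. /etc/dnsmasq.d/*.conf      (Dnsmasq configuration)
2. /etc/keepalived/keepalived.conf  (Keepalived configuration)
3. /etc/keepalived/scripts/notify.sh (notification script)
4. /usr/local/bin/monitor-dnsmasq.sh (monitoring script)

Deployment steps

# 1. Install the software
sudo apt install dnsmasq keepalived

# 2. Configure Dnsmasq (identical on both servers)
sudo cp -r /path/to/dnsmasq-config/* /etc/dnsmasq.d/

# 3. Configure Keepalived (the master and backup use different priority values)
sudo cp /path/to/keepalived-master.conf /etc/keepalived/keepalived.conf

# 4. Start the services
sudo systemctl enable --now dnsmasq
sudo systemctl enable --now keepalived

# 5. Verify the VIP
ip addr show eth1 | grep 192.168.1.254

# 6. Test DNS
dig @192.168.1.254 www.baidu.com

# 7. Test failover
sudo systemctl stop dnsmasq
# On the other server, verify that the VIP has moved over
# (with interval 2 and fall 3, allow about 6 seconds for the check to trip)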

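9.9 Summary

Option                      Pros                                      Cons                                    Best suited for
Keepalived active-standby   Automatic failover, VIP stays the same    Requires extra software                 Enterprise networks
Active-active DNS           Simple, no extra software                 Configuration must be synced manually   Small networks
DHCP split scope            No single point of failure                Each server has only half the pool      Large networks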

9.10 Further Reading