High Availability Service Orchestration in Practice
High availability (HA) is a core requirement for production environments. This chapter covers how to combine systemd with various HA tools to build highly available service architectures.
1. HA Service Design Principles
1.1 Availability Targets
| Availability | Downtime per year | Typical use |
|---|---|---|
| 99% (two nines) | 3.65 days | Internal tools |
| 99.9% (three nines) | 8.76 hours | General business systems |
| 99.99% (four nines) | 52.6 minutes | Core business systems |
| 99.999% (five nines) | 5.26 minutes | Finance/payment systems |
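The downtime figures in the table follow directly from the percentages; a small shell sketch of the arithmetic (pure calculation, no external dependencies):

```shell
#!/bin/bash
# Convert an availability percentage into allowed downtime per year.
# awk handles the floating-point math.
downtime_per_year() {
    local availability=$1   # e.g. 99.99
    awk -v a="$availability" 'BEGIN {
        minutes = (100 - a) / 100 * 365 * 24 * 60
        printf "%.1f minutes/year\n", minutes
    }'
}

downtime_per_year 99.99    # four nines  -> 52.6 minutes/year
downtime_per_year 99.999   # five nines  -> 5.3 minutes/year
```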
1.2 Design Principles
| Principle | Description | Implementation |
|---|---|---|
| Redundancy | Eliminate single points of failure | Primary/replica, clustering |
| Failure detection | Detect problems quickly | Health checks, heartbeats |
| Automatic recovery | Minimize manual intervention | Failover, automatic restart |
| Data consistency | Keep data intact | Replication, synchronization |
| Small failure domains | Limit the blast radius of a failure | Isolation, sharding |
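Failure detection in practice usually applies hysteresis, so a single flaky probe does not flap the cluster; this is the idea behind the `fall`/`rise` settings in Keepalived and HAProxy used later in this chapter. A minimal bash sketch with a stubbed probe (thresholds and probe results here are illustrative):

```shell
#!/bin/bash
# Fall/rise hysteresis sketch: mark a target DOWN only after FALL
# consecutive failed probes, and UP again only after RISE consecutive
# successful ones. The probe is stubbed; in practice it would be a
# curl/ping against the target.
FALL=3
RISE=2
state=UP
fails=0
oks=0

observe() {  # $1 = result of one probe: "ok" or "fail"
    if [ "$1" = "fail" ]; then
        oks=0
        fails=$((fails + 1))
        if [ "$state" = "UP" ] && [ "$fails" -ge "$FALL" ]; then
            state=DOWN
        fi
    else
        fails=0
        oks=$((oks + 1))
        if [ "$state" = "DOWN" ] && [ "$oks" -ge "$RISE" ]; then
            state=UP
        fi
    fi
}

# One flaky probe does not flip the state; three in a row do.
for probe in ok fail ok fail fail fail ok ok; do
    observe "$probe"
    echo "probe=$probe -> state=$state"
done
```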
2. VIP Failover (Keepalived + systemd)
2.1 Architecture Overview
            VIP: 192.168.1.100
                     │
         ┌───────────┴───────────┐
         │                       │
    ┌────┴────┐             ┌────┴─────┐
    │ Master  │ ←heartbeat→ │ Backup   │
    └─────────┘             └──────────┘
2.2 Installation
sudo dnf install keepalived
sudo systemctl enable keepalived.service
2.3 Master Node Configuration
# /etc/keepalived/keepalived.conf (Master)
global_defs {
    router_id LVS_MASTER
    script_user root
    enable_script_security
}

vrrp_script chk_nginx {
    script "/usr/bin/curl -sf http://localhost/ -o /dev/null"
    interval 2
    weight -20
    fall 3
    rise 2
}

vrrp_instance VI_1 {
    state MASTER
    interface eth0
    virtual_router_id 51
    priority 100
    advert_int 1
    authentication {
        auth_type PASS
        auth_pass mypassword
    }
    virtual_ipaddress {
        192.168.1.100/24 dev eth0
    }
    track_script {
        chk_nginx
    }
    notify_master "/etc/keepalived/scripts/notify.sh MASTER"
    notify_backup "/etc/keepalived/scripts/notify.sh BACKUP"
    notify_fault "/etc/keepalived/scripts/notify.sh FAULT"
}
2.4 Backup Node Configuration
# /etc/keepalived/keepalived.conf (Backup)
vrrp_instance VI_1 {
    state BACKUP
    interface eth0
    virtual_router_id 51
    priority 90
    advert_int 1
    # ... remaining settings identical to the Master
}
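With the values above, the failover arithmetic works like this: while chk_nginx fails, the Master's effective priority drops by 20 (the script's negative weight), from 100 to 80, which is below the Backup's 90, so the VIP moves. A small sketch of that calculation:

```shell
#!/bin/bash
# Keepalived effective-priority sketch. A vrrp_script with a negative
# weight lowers the node's priority while the script is failing; the
# node with the higher effective priority holds the VIP.
effective_priority() {
    local base=$1 weight=$2 script_ok=$3
    if [ "$script_ok" = "yes" ]; then
        echo "$base"
    else
        echo $((base + weight))  # weight is negative, so this drops
    fi
}

master=$(effective_priority 100 -20 no)   # chk_nginx failing on the Master
backup=$(effective_priority 90 -20 yes)   # Backup is healthy
echo "master=$master backup=$backup"
if [ "$backup" -gt "$master" ]; then
    echo "VIP fails over to the Backup"
fi
```

Note the implication for tuning: `|weight|` must be larger than the priority gap between the nodes, or a failed check will never trigger failover.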
2.5 Notify Script
#!/bin/bash
# /etc/keepalived/scripts/notify.sh
STATE=$1
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
HOSTNAME=$(hostname)

case ${STATE} in
    MASTER)
        logger -t keepalived "${HOSTNAME} became MASTER"
        systemctl start nginx.service
        ;;
    BACKUP)
        logger -t keepalived "${HOSTNAME} became BACKUP"
        ;;
    FAULT)
        logger -t keepalived "${HOSTNAME} entered FAULT state"
        curl -X POST https://hooks.slack.com/... \
            -d "{\"text\": \"${HOSTNAME} FAULT\"}"
        ;;
esac
2.6 systemd Integration
# /etc/systemd/system/keepalived.service.d/override.conf
[Unit]
After=network-online.target nginx.service
Requires=network-online.target
Wants=nginx.service
[Service]
Restart=on-failure
RestartSec=5
3. Database High Availability
3.1 MySQL Master-Slave Replication
Master configuration:
# /etc/my.cnf.d/master.cnf
[mysqld]
server-id=1
log-bin=mysql-bin
binlog_format=ROW
sync_binlog=1
innodb_flush_log_at_trx_commit=1
Master systemd service:
# /etc/systemd/system/mysql-master.service
[Unit]
Description=MySQL Master
After=network.target
[Service]
# mysqld as invoked below stays in the foreground, so Type=simple (not forking)
Type=simple
User=mysql
Group=mysql
ExecStart=/usr/sbin/mysqld --defaults-file=/etc/my.cnf.d/master.cnf
ExecStartPost=/bin/bash -c 'while ! mysqladmin ping -s; do sleep 1; done'
Restart=on-failure
RestartSec=10
LimitNOFILE=65535
[Install]
WantedBy=multi-user.target
Slave systemd service:
# /etc/systemd/system/mysql-slave.service
[Unit]
Description=MySQL Slave
After=network.target mysql-master.service
[Service]
# mysqld as invoked below stays in the foreground, so Type=simple (not forking)
Type=simple
User=mysql
Group=mysql
ExecStart=/usr/sbin/mysqld --defaults-file=/etc/my.cnf.d/slave.cnf
ExecStartPost=/bin/bash -c 'while ! mysqladmin ping -s; do sleep 1; done'
Restart=on-failure
[Install]
WantedBy=multi-user.target
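Replication lag on the slave is worth monitoring continuously. A hedged sketch that extracts `Seconds_Behind_Master` from `SHOW SLAVE STATUS\G` output; the status text is stubbed here so the parsing logic is self-contained, and `MAX_LAG` is an assumed threshold:

```shell
#!/bin/bash
# Parse Seconds_Behind_Master out of `SHOW SLAVE STATUS\G`-style output.
# In production the status would come from:
#   mysql -e 'SHOW SLAVE STATUS\G'
# The stub below stands in for that call.
MAX_LAG=30

slave_status() {
    cat <<'EOF'
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
        Seconds_Behind_Master: 12
EOF
}

lag=$(slave_status | awk -F': ' '/Seconds_Behind_Master/ {print $2}')
if [ "$lag" = "NULL" ] || [ -z "$lag" ]; then
    # MySQL reports NULL when the SQL thread is not running
    echo "CRITICAL: replication is not running"
elif [ "$lag" -gt "$MAX_LAG" ]; then
    echo "WARNING: lag ${lag}s exceeds ${MAX_LAG}s"
else
    echo "OK: replication lag ${lag}s"
fi
```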
3.2 PostgreSQL Streaming Replication
Primary configuration:
# /var/lib/pgsql/data/postgresql.conf
wal_level = replica
max_wal_senders = 3
wal_keep_size = 1GB
hot_standby = on
synchronous_standby_names = 'standby1'
Primary systemd service:
# /etc/systemd/system/postgresql-primary.service
[Unit]
Description=PostgreSQL Primary
After=network.target
[Service]
Type=forking
User=postgres
ExecStart=/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/data start
ExecStop=/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/data -m fast stop
ExecReload=/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/data reload
Restart=on-failure
[Install]
WantedBy=multi-user.target
Standby systemd service:
# /etc/systemd/system/postgresql-standby.service
[Unit]
Description=PostgreSQL Standby
After=network.target postgresql-primary.service
[Service]
Type=forking
User=postgres
ExecStart=/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/standby start
ExecStop=/usr/pgsql-15/bin/pg_ctl -D /var/lib/pgsql/standby -m fast stop
Restart=on-failure
[Install]
WantedBy=multi-user.target
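The standby's data directory also needs replication settings. On PostgreSQL 12+ that means a `standby.signal` file plus a `primary_conninfo` setting, typically written by `pg_basebackup -R`; the host and user below are illustrative, while `application_name=standby1` deliberately matches the `synchronous_standby_names` value shown for the primary:

```
# /var/lib/pgsql/standby/postgresql.auto.conf (written by pg_basebackup -R)
primary_conninfo = 'host=192.168.1.10 port=5432 user=replicator application_name=standby1'

# Plus an empty marker file that puts the server into standby mode:
#   touch /var/lib/pgsql/standby/standby.signal
```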
4. Load Balancing (HAProxy + systemd)
4.1 HAProxy Configuration
# /etc/haproxy/haproxy.cfg
global
    daemon
    maxconn 50000
    stats socket /run/haproxy/admin.sock mode 660 level admin

defaults
    mode http
    timeout connect 5s
    timeout client 30s
    timeout server 30s
    option httplog

frontend http_front
    bind *:80
    bind *:443 ssl crt /etc/haproxy/certs/
    default_backend http_back

backend http_back
    balance roundrobin
    option httpchk GET /health
    http-check expect status 200
    server web1 192.168.1.11:8080 check inter 3s fall 3 rise 2
    server web2 192.168.1.12:8080 check inter 3s fall 3 rise 2
    server web3 192.168.1.13:8080 check inter 3s fall 3 rise 2
4.2 HAProxy systemd Service
# /etc/systemd/system/haproxy.service
[Unit]
Description=HAProxy Load Balancer
After=network.target
Before=nginx.service
[Service]
Type=notify
Environment="CONFIG=/etc/haproxy/haproxy.cfg" "PIDFILE=/run/haproxy.pid"
ExecStartPre=/usr/sbin/haproxy -f $CONFIG -c -q
ExecStart=/usr/sbin/haproxy -Ws -f $CONFIG -p $PIDFILE
ExecReload=/bin/kill -USR2 $MAINPID
Restart=on-failure
RestartSec=5
LimitNOFILE=100000
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict
ReadWritePaths=/run /var/lib/haproxy /var/log
[Install]
WantedBy=multi-user.target
4.3 Health Check Script
#!/bin/bash
# /opt/scripts/haproxy-healthcheck.sh
if ! systemctl is-active haproxy.service > /dev/null; then
    echo "HAProxy is not running"; exit 1
fi

STATS=$(echo "show stat" | socat /run/haproxy/admin.sock stdio)
BACKENDS_DOWN=$(echo "${STATS}" | grep -c ",DOWN,")
if [ "${BACKENDS_DOWN}" -gt 0 ]; then
    echo "${BACKENDS_DOWN} backend(s) are DOWN"; exit 1
fi
echo "All backends are healthy"; exit 0
5. Service Health Check Scripts
5.1 Generic Health Check
#!/bin/bash
# /opt/scripts/service-healthcheck.sh
SERVICE_NAME=$1
HEALTH_URL=${2:-"http://localhost/health"}
TIMEOUT=${3:-5}
RETRIES=${4:-3}

# Check the systemd unit state
if ! systemctl is-active "${SERVICE_NAME}.service" > /dev/null; then
    echo "${SERVICE_NAME} is not running"; exit 1
fi

# Probe the health endpoint, with retries
for i in $(seq 1 "${RETRIES}"); do
    if curl -sf --max-time "${TIMEOUT}" "${HEALTH_URL}" > /dev/null 2>&1; then
        echo "OK"; exit 0
    fi
    sleep 1
done
echo "Health endpoint not responding"; exit 1
5.2 systemd Health Check Units
# /etc/systemd/system/[email protected]
[Unit]
Description=Health Check for %i
After=%i.service
[Service]
Type=oneshot
ExecStart=/opt/scripts/service-healthcheck.sh %i
TimeoutStartSec=30
[Install]
WantedBy=multi-user.target
# /etc/systemd/system/[email protected]
[Unit]
Description=Health Check Timer for %i
[Timer]
OnBootSec=60
OnUnitActiveSec=30
[Install]
WantedBy=timers.target
6. Automatic Failure Recovery
6.1 systemd Auto-Restart
[Unit]
# On current systemd versions the StartLimit* settings belong in [Unit]
StartLimitBurst=5
StartLimitIntervalSec=300

[Service]
Restart=on-failure
RestartSec=10
ExecStartPost=/bin/bash -c 'while ! curl -sf http://localhost/health; do sleep 1; done'
6.2 Failure Notification Script
#!/bin/bash
# /opt/scripts/service-failure-notify.sh
SERVICE_NAME=$1
ACTION=$2
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
HOSTNAME=$(hostname)

logger -t service-monitor "${SERVICE_NAME} ${ACTION} on ${HOSTNAME}"
case ${ACTION} in
    failed)
        curl -X POST https://hooks.slack.com/services/xxx \
            -H 'Content-Type: application/json' \
            -d "{\"text\": \"⚠️ ${SERVICE_NAME} failed on ${HOSTNAME}\"}"
        ;;
    recovered)
        curl -X POST https://hooks.slack.com/services/xxx \
            -d "{\"text\": \"✅ ${SERVICE_NAME} recovered on ${HOSTNAME}\"}"
        ;;
esac
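The script above still needs a trigger. One way is systemd's `OnFailure=` with a templated helper unit; the unit name here is an assumption, not a standard one:

```
# /etc/systemd/system/[email protected]
[Unit]
Description=Failure notification for %i

[Service]
Type=oneshot
ExecStart=/opt/scripts/service-failure-notify.sh %i failed
```

Then add `OnFailure=failure-notify@%N.service` to the [Unit] section of any service you want covered (`%N` expands to the unit name without its type suffix).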
7. Pacemaker/Corosync Integration
7.1 Installation and Setup
sudo dnf install pacemaker corosync pcs
sudo systemctl enable pcsd.service corosync.service pacemaker.service
sudo systemctl start pcsd.service
# Authenticate the cluster nodes
sudo pcs host auth node1 node2 -u hacluster -p password
sudo pcs cluster setup mycluster node1 node2
sudo pcs cluster start --all
7.2 集群资源配置
# 创建资源
pcs resource create WebServer systemd:nginx \
op monitor interval=30s \
op start timeout=60s
pcs resource create VIP ocf:heartbeat:IPaddr2 \
ip=192.168.1.100 cidr_netmask=24
# Create a resource group
pcs resource group add WebGroup VIP WebServer
# Set constraints
pcs constraint colocation add WebServer with VIP INFINITY
pcs constraint order VIP then WebServer
7.3 Combining systemd with Pacemaker
# /etc/systemd/system/nginx-ha.service
[Unit]
Description=Nginx HA Service
After=pacemaker.service
Requires=pacemaker.service
[Service]
Type=simple
ExecStart=/usr/sbin/nginx -g "daemon off;"
ExecReload=/bin/kill -HUP $MAINPID
# Restarts are handled by Pacemaker, so systemd must not restart this unit itself
Restart=no
8. Real-World Examples
8.1 Web Cluster Architecture
                ┌──────────────┐
                │   Clients    │
                └──────┬───────┘
                       │
                ┌──────┴───────┐
                │  Keepalived  │
                │ VIP failover │
                └──────┬───────┘
                       │
                ┌──────┴───────┐
                │   HAProxy    │
                │load balancing│
                └──────┬───────┘
                       │
      ┌────────────────┼────────────────┐
┌─────┴─────┐    ┌─────┴─────┐    ┌─────┴─────┐
│  Nginx 1  │    │  Nginx 2  │    │  Nginx 3  │
└───────────┘    └───────────┘    └───────────┘
8.2 Database Cluster Architecture
                ┌──────────────┐
                │ Application  │
                └──────┬───────┘
                       │
                ┌──────┴───────┐
                │   HAProxy    │
                │  R/W split   │
                └──────┬───────┘
                       │
      ┌────────────────┼────────────────┐
┌─────┴─────┐    ┌─────┴─────┐    ┌─────┴─────┐
│  Master   │ →  │  Slave 1  │ →  │  Slave 2  │
│   (R/W)   │    │ read-only │    │ read-only │
└───────────┘    └───────────┘    └───────────┘
HAProxy read/write split:
frontend mysql_front
    bind *:3306
    mode tcp
    default_backend mysql_write

frontend mysql_read_front
    bind *:3307
    mode tcp
    default_backend mysql_read

backend mysql_write
    mode tcp
    option mysql-check user haproxy
    server master 192.168.1.10:3306 check inter 3s

backend mysql_read
    mode tcp
    balance roundrobin
    option mysql-check user haproxy
    server slave1 192.168.1.11:3306 check inter 3s
    server slave2 192.168.1.12:3306 check inter 3s
9. systemd Service Design for a Complete HA Stack
# /etc/systemd/system/ha-webapp.service
[Unit]
Description=HA Web Application
After=network.target postgresql.service redis.service
Wants=postgresql.service redis.service
StartLimitBurst=5
StartLimitIntervalSec=300

[Service]
Type=simple
User=webapp
Group=webapp
WorkingDirectory=/opt/webapp
ExecStart=/opt/webapp/bin/app --config /etc/webapp/config.yml
ExecStartPost=/bin/bash -c 'while ! curl -sf http://localhost:8080/health; do sleep 1; done'
Restart=on-failure
RestartSec=10
CPUQuota=80%
MemoryMax=2G
NoNewPrivileges=true
PrivateTmp=true
ProtectSystem=strict

[Install]
WantedBy=multi-user.target
⚠️ Caveats
- Split-brain: Keepalived must be configured to prevent split-brain (e.g. with a quorum/arbiter node)
- Data consistency: monitor replication lag on master-slave setups
- Detection time: the health-check interval bounds how quickly failures are detected
- Resource overhead: HA components consume resources themselves
- Configuration sync: cluster nodes must keep their configuration consistent
- Regular testing: exercise the failover procedure regularly
💡 Tips
- Keepalived's `weight` parameter adjusts priority; a negative value lowers the node's priority while the tracked script fails
- HAProxy's `option httpchk` lets you customize the health-check request
- Use `pcs stonith` to configure STONITH and guard against split-brain
- PostgreSQL's `synchronous_commit` controls the durability level of commits
- Run failover drills regularly to validate that the HA design actually works