Lesson 13: Advanced High Availability and Load Balancing in NGINX
Health checks (active/passive), session persistence, sticky sessions, failover strategies, Keepalived for virtual IPs, active-active and active-passive architectures, database load balancing, geographic distribution, disaster recovery planning, and testing HA setups.
1. High Availability Concepts
1.1. HA Architecture Overview
High Availability Setup:
            ┌─────────────────┐
            │  Load Balancer  │
            │  (Virtual IP)   │
            └────────┬────────┘
                     │
                ┌────┴────┐
                │         │
            ┌───▼───┐ ┌───▼───┐
            │ LB-1  │ │ LB-2  │   (Active-Passive with Keepalived)
            │Primary│ │Backup │
            └───┬───┘ └───┬───┘
                │         │
                └────┬────┘
                     │
          ┌──────────┼──────────┐
          │          │          │
      ┌───▼───┐  ┌───▼───┐  ┌───▼───┐
      │ Web-1 │  │ Web-2 │  │ Web-3 │   (Backend servers)
      └───────┘  └───────┘  └───────┘
Key components:
- Multiple load balancers (redundancy)
- Virtual IP (VIP) with failover
- Health checks
- Session persistence
- Multiple backend servers
- Automatic failover
1.2. Availability Calculations
Availability = (Total Time - Downtime) / Total Time × 100%
Uptime Targets:
- 99% (Two nines): 3.65 days downtime/year
- 99.9% (Three nines): 8.76 hours downtime/year
- 99.99% (Four nines): 52.56 minutes downtime/year
- 99.999% (Five nines): 5.26 minutes downtime/year
Example with redundant servers in parallel (assuming independent failures):
Single server: 99% availability
Two servers: 1 - (0.01 × 0.01) = 99.99% availability
Three servers: 1 - (0.01 × 0.01 × 0.01) = 99.9999% availability
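A quick sanity check of these figures, assuming failures are independent and a single surviving server can carry the full load (real-world numbers are worse when failures are correlated):
# availability.sh - combined availability of N redundant servers (sketch)
single=0.99   # availability of one server
for n in 1 2 3; do
  awk -v a="$single" -v n="$n" 'BEGIN {
    printf "%d server(s): %.4f%% available\n", n, (1 - (1 - a)^n) * 100
  }'
done
# Expected output: 99.0000%, 99.9900%, 99.9999%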
1.3. Types of HA Setups
Active-Passive:

    ┌────────┐      ┌────────┐
    │ Active │─────▶│Passive │
    │   LB   │      │   LB   │
    └───┬────┘      └────────┘
        │
        │            (Standby)
        ▼
     Serves
     Traffic

Active-Active:

    ┌────────┐      ┌────────┐
    │ Active │      │ Active │
    │  LB-1  │      │  LB-2  │
    └───┬────┘      └───┬────┘
        │               │
        └───────┬───────┘
                │
        Both serve traffic
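The same active-passive idea also exists one level down, inside a single nginx upstream: the backup parameter keeps a server idle until all primary servers are unavailable. A minimal sketch (addresses are illustrative):
upstream app {
    server 10.0.0.10:8080;          # active: receives all traffic
    server 10.0.0.20:8080 backup;   # passive: only used when the active server is down
}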
2. Advanced Health Checks
2.1. Passive Health Checks
Passive health checks monitor actual traffic to detect failures.
upstream backend {
server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
server backend3.example.com:8080 max_fails=3 fail_timeout=30s;
# max_fails: failed attempts within fail_timeout before the server is considered unavailable
# fail_timeout: both the window for counting failures and how long the server stays marked unavailable
}
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://backend;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_connect_timeout 2s;
proxy_read_timeout 10s;
}
}
Detailed configuration:
upstream app_backend {
# Backend servers with health check params
server 10.0.0.10:8080 max_fails=3 fail_timeout=30s weight=5;
server 10.0.0.11:8080 max_fails=3 fail_timeout=30s weight=5;
server 10.0.0.12:8080 max_fails=2 fail_timeout=20s weight=3 backup;
# Keepalive connections
keepalive 32;
keepalive_timeout 60s;
keepalive_requests 100;
}
server {
listen 80;
location / {
proxy_pass http://app_backend;
# Define what constitutes a failure
proxy_next_upstream error timeout invalid_header
http_500 http_502 http_503 http_504;
# Retry settings
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 10s;
# Connection settings
proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 10s;
# Headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
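One way to observe passive health checks is to take a backend offline and keep sending requests through the load balancer: the remaining servers absorb the traffic, and nginx's error log records the failing peer. A rough walkthrough (the backend service name is hypothetical for this lab):
# Simulate a failure on one backend
ssh root@10.0.0.10 "systemctl stop myapp"

# Requests through the load balancer should still return 200 from the other servers
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost/
done

# The error log typically shows the failed peer being skipped
sudo tail -n 20 /var/log/nginx/error.log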
2.2. Active Health Checks (Nginx Plus)
# Nginx Plus only
upstream backend {
zone backend 64k;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
server {
listen 80;
location / {
proxy_pass http://backend;
# Active health check
health_check interval=5s
fails=3
passes=2
uri=/health
match=health_check;
}
}
# Define what a healthy response looks like
match health_check {
status 200;
header Content-Type = "application/json";
body ~ "\"status\":\"ok\"";
}
2.3. Custom Health Check Scripts
External health check script:
#!/bin/bash
# health_check.sh
BACKEND_SERVERS=(
"10.0.0.10:8080"
"10.0.0.11:8080"
"10.0.0.12:8080"
)
HEALTH_ENDPOINT="/health"
NGINX_UPSTREAM_CONF="/etc/nginx/conf.d/upstream.conf"
TEMP_CONF="/tmp/upstream.conf.tmp"
check_backend() {
local server=$1
local url="http://${server}${HEALTH_ENDPOINT}"
# Check with timeout
if curl -sf --max-time 3 "$url" > /dev/null; then
return 0 # Healthy
else
return 1 # Unhealthy
fi
}
update_upstream_config() {
echo "upstream backend {" > $TEMP_CONF
for server in "${BACKEND_SERVERS[@]}"; do
if check_backend "$server"; then
echo " server $server;" >> $TEMP_CONF
echo "✓ $server is healthy"
else
echo " server $server down;" >> $TEMP_CONF
echo "✗ $server is down"
fi
done
echo "}" >> $TEMP_CONF
# Compare and reload if changed
if ! cmp -s "$TEMP_CONF" "$NGINX_UPSTREAM_CONF"; then
mv $TEMP_CONF $NGINX_UPSTREAM_CONF
nginx -t && nginx -s reload
echo "Nginx configuration updated and reloaded"
fi
}
# Run health check
update_upstream_config
Systemd timer for health checks:
# /etc/systemd/system/nginx-health-check.service
[Unit]
Description=Nginx Backend Health Check
[Service]
Type=oneshot
ExecStart=/usr/local/bin/health_check.sh
# /etc/systemd/system/nginx-health-check.timer
[Unit]
Description=Run Nginx health check every 30 seconds
[Timer]
OnBootSec=30s
OnUnitActiveSec=30s
[Install]
WantedBy=timers.target
Enable timer:
sudo systemctl daemon-reload
sudo systemctl start nginx-health-check.timer
sudo systemctl enable nginx-health-check.timer
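To confirm the timer is active and see what the last run did (a quick check; output layout varies between systemd versions):
systemctl list-timers nginx-health-check.timer
journalctl -u nginx-health-check.service -n 20 --no-pager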
2.4. Application-Level Health Checks
Node.js health endpoint:
// health.js
const express = require('express');
const app = express();
// Health check endpoint
app.get('/health', async (req, res) => {
const health = {
status: 'ok',
timestamp: Date.now(),
uptime: process.uptime(),
checks: {}
};
try {
// Database check
await checkDatabase();
health.checks.database = 'ok';
// Redis check
await checkRedis();
health.checks.redis = 'ok';
// Memory check
const memUsage = process.memoryUsage();
if (memUsage.heapUsed / memUsage.heapTotal > 0.9) {
throw new Error('High memory usage');
}
health.checks.memory = 'ok';
res.status(200).json(health);
} catch (error) {
health.status = 'error';
health.error = error.message;
res.status(503).json(health);
}
});
async function checkDatabase() {
// Database connection check
// throw error if unhealthy
}
async function checkRedis() {
// Redis connection check
// throw error if unhealthy
}
app.listen(8080);
Python/Flask health endpoint:
# app.py
from flask import Flask, jsonify
import psycopg2
import redis
import time
app = Flask(__name__)
@app.route('/health')
def health_check():
health = {
'status': 'ok',
'timestamp': int(time.time()),
'checks': {}
}
try:
# Database check
check_database()
health['checks']['database'] = 'ok'
# Redis check
check_redis()
health['checks']['redis'] = 'ok'
return jsonify(health), 200
except Exception as e:
health['status'] = 'error'
health['error'] = str(e)
return jsonify(health), 503
def check_database():
conn = psycopg2.connect(
host="localhost",
database="mydb",
user="user",
password="pass"
)
conn.close()
def check_redis():
r = redis.Redis(host='localhost', port=6379)
r.ping()
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
3. Session Persistence / Sticky Sessions
3.1. IP Hash
upstream backend {
ip_hash; # Route based on client IP
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
server {
listen 80;
location / {
proxy_pass http://backend;
}
}
# Pros: Simple, works without cookies
# Cons: Issues with proxies, NAT, mobile users
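When testing stickiness it helps to see which backend served each request. As a temporary debugging aid (not something to leave enabled in production), the load balancer can echo $upstream_addr back in a response header:
location / {
    proxy_pass http://backend;
    # Expose the chosen backend for testing only
    add_header X-Upstream $upstream_addr always;
}
Repeated requests from the same client IP should then report the same X-Upstream value while ip_hash is in effect.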
3.2. Cookie-based Sticky Sessions
Nginx Plus:
upstream backend {
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
sticky cookie srv_id expires=1h domain=.example.com path=/;
}
Open source alternative with a map (the values are upstream group names, so proxy_pass can use a variable without needing a resolver):
# Route based on a "route" cookie; the application (or a split_clients block)
# is responsible for issuing it, e.g. Set-Cookie: route=server2
map $cookie_route $backend_pool {
~*server1 backend1;
~*server2 backend2;
~*server3 backend3;
default backend1;
}
upstream backend1 { server backend1.example.com:8080; }
upstream backend2 { server backend2.example.com:8080; }
upstream backend3 { server backend3.example.com:8080; }
server {
listen 80;
location / {
proxy_pass http://$backend_pool;
}
}
3.3. Hash-based Load Balancing
URI hash:
upstream backend {
hash $request_uri consistent;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
# Same URL always goes to same server
# Good for caching
Custom hash key:
map $cookie_user_id $hash_key {
default $remote_addr;
~.+ $cookie_user_id;
}
upstream backend {
hash $hash_key consistent;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
# Hash by user ID if available, otherwise by IP
3.4. Shared Session Storage (Alternative to Sticky Sessions)
Instead of sticky sessions, store session data in a shared store such as Redis so any backend can serve any request.
Redis for session storage:
upstream backend {
# No sticky sessions needed
least_conn;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
# Application stores sessions in Redis
# All backends can access same session data
Application-side (Node.js example):
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis')(session);
const redis = require('redis');
const app = express();
const redisClient = redis.createClient({
host: 'redis.example.com',
port: 6379
});
app.use(session({
store: new RedisStore({ client: redisClient }),
secret: 'your-secret',
resave: false,
saveUninitialized: false,
cookie: {
secure: true,
httpOnly: true,
maxAge: 3600000
}
}));
4. Keepalived for High Availability
4.1. Keepalived Setup
Install Keepalived:
# Ubuntu/Debian
sudo apt install keepalived
# CentOS/RHEL
sudo yum install keepalived
Network topology:
Virtual IP: 192.168.1.100
Master: 192.168.1.10 (Priority: 100)
Backup: 192.168.1.11 (Priority: 90)
Clients connect to VIP (192.168.1.100)
Master handles traffic
Backup takes over if master fails
4.2. Master Configuration
# /etc/keepalived/keepalived.conf (Master)
global_defs {
router_id nginx_master
script_user root
enable_script_security
}
vrrp_script check_nginx {
script "/usr/local/bin/check_nginx.sh"
interval 2
weight -20    # on failure, drop priority by 20 so the backup (priority 90) can take over the VIP
fall 2
rise 2
}
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass your_secret_password
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
check_nginx
}
notify_master "/usr/local/bin/notify_master.sh"
notify_backup "/usr/local/bin/notify_backup.sh"
notify_fault "/usr/local/bin/notify_fault.sh"
}
4.3. Backup Configuration
# /etc/keepalived/keepalived.conf (Backup)
global_defs {
router_id nginx_backup
script_user root
enable_script_security
}
vrrp_script check_nginx {
script "/usr/local/bin/check_nginx.sh"
interval 2
weight -20    # mirror the master's setting; on failure this node's priority drops by 20
fall 2
rise 2
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 51
priority 90
advert_int 1
authentication {
auth_type PASS
auth_pass your_secret_password
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
check_nginx
}
notify_master "/usr/local/bin/notify_master.sh"
notify_backup "/usr/local/bin/notify_backup.sh"
notify_fault "/usr/local/bin/notify_fault.sh"
}
4.4. Health Check Script
#!/bin/bash
# /usr/local/bin/check_nginx.sh
# Check if Nginx is running
if systemctl is-active --quiet nginx; then
# Check if Nginx responds
if curl -sf http://localhost/health > /dev/null 2>&1; then
exit 0 # Healthy
fi
fi
exit 1 # Unhealthy
Make executable:
sudo chmod +x /usr/local/bin/check_nginx.sh
4.5. Notification Scripts
#!/bin/bash
# /usr/local/bin/notify_master.sh
HOSTNAME=$(hostname)
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "$TIMESTAMP: $HOSTNAME became MASTER" >> /var/log/keepalived-state.log
# Send alert
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🟢 $HOSTNAME is now MASTER\"}"
#!/bin/bash
# /usr/local/bin/notify_backup.sh
HOSTNAME=$(hostname)
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "$TIMESTAMP: $HOSTNAME became BACKUP" >> /var/log/keepalived-state.log
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🟡 $HOSTNAME is now BACKUP\"}"
#!/bin/bash
# /usr/local/bin/notify_fault.sh
HOSTNAME=$(hostname)
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "$TIMESTAMP: $HOSTNAME entered FAULT state" >> /var/log/keepalived-state.log
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🔴 $HOSTNAME in FAULT state\"}"
Make executable:
sudo chmod +x /usr/local/bin/notify_*.sh
4.6. Start Keepalived
# Start on both master and backup
sudo systemctl start keepalived
sudo systemctl enable keepalived
# Check status
sudo systemctl status keepalived
# View logs
sudo journalctl -u keepalived -f
# Check VIP
ip addr show eth0
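VRRP advertisements are multicast roughly once per advert_int second, so a quick way to confirm both nodes actually see each other on the network is to capture IP protocol 112 traffic (interface name assumed to be eth0, as above):
sudo tcpdump -i eth0 -n 'ip proto 112'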
4.7. Test Failover
# On master, stop Nginx
sudo systemctl stop nginx
# VIP should move to backup within seconds
# Check on backup:
ip addr show eth0 | grep 192.168.1.100
# Restart Nginx on master
sudo systemctl start nginx
# VIP should return to master
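Running a simple request loop against the VIP while stopping nginx on the master makes the failover window visible; you should see at most a few non-200 responses before the backup takes over (a rough sketch, adjust the URL and interval to your setup):
# Run in a separate terminal before stopping nginx on the master
while true; do
  printf '%s ' "$(date +%T)"
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 1 http://192.168.1.100/
  sleep 0.5
done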
5. Complete HA Setup
5.1. Multi-tier HA Architecture
                         Internet
                             │
                             ▼
                    ┌─────────────────┐
                    │  DNS (GeoDNS)   │
                    │   Round-robin   │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
     ┌────────────────┐            ┌────────────────┐
     │ Data Center 1  │            │ Data Center 2  │
     └────────┬───────┘            └────────┬───────┘
              │                             │
         ┌────┴────┐                   ┌────┴────┐
         ▼         ▼                   ▼         ▼
     ┌───────┐ ┌───────┐           ┌───────┐ ┌───────┐
     │ LB-1  │ │ LB-2  │           │ LB-3  │ │ LB-4  │
     │Master │ │Backup │           │Master │ │Backup │
     └───┬───┘ └───┬───┘           └───┬───┘ └───┬───┘
         │         │                   │         │
         │  VIP1   │                   │  VIP2   │
         └────┬────┘                   └────┬────┘
              │                             │
     ┌────────┼────────┐           ┌────────┼────────┐
     ▼        ▼        ▼           ▼        ▼        ▼
 ┌───────┐┌───────┐┌───────┐   ┌───────┐┌───────┐┌───────┐
 │ Web-1 ││ Web-2 ││ Web-3 │   │ Web-4 ││ Web-5 ││ Web-6 │
 └───────┘└───────┘└───────┘   └───────┘└───────┘└───────┘
5.2. Load Balancer Configuration
LB-1 (Master) configuration:
# /etc/nginx/nginx.conf
user nginx;
worker_processes auto;
worker_rlimit_nofile 65535;
events {
worker_connections 4096;
use epoll;
multi_accept on;
}
http {
# Upstream definitions
upstream web_backend {
least_conn;
# Data Center 1 servers
server 10.0.1.10:80 max_fails=3 fail_timeout=30s weight=5;
server 10.0.1.11:80 max_fails=3 fail_timeout=30s weight=5;
server 10.0.1.12:80 max_fails=3 fail_timeout=30s weight=5;
# Keepalive
keepalive 64;
keepalive_timeout 60s;
keepalive_requests 1000;
}
upstream api_backend {
least_conn;
server 10.0.2.10:8080 max_fails=3 fail_timeout=30s;
server 10.0.2.11:8080 max_fails=3 fail_timeout=30s;
server 10.0.2.12:8080 max_fails=3 fail_timeout=30s;
keepalive 32;
}
# Health check endpoint
server {
listen 80;
server_name localhost;
location /health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
}
# Main site
server {
listen 80;
listen [::]:80;
server_name example.com;
# Redirect to HTTPS
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
listen [::]:443 ssl http2;
server_name example.com;
# SSL
ssl_certificate /etc/nginx/ssl/cert.pem;
ssl_certificate_key /etc/nginx/ssl/key.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_session_cache shared:SSL:10m;
# Logging
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log;
# Web traffic
location / {
proxy_pass http://web_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 10s;
proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 10s;
}
# API traffic
location /api/ {
proxy_pass http://api_backend/;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_next_upstream error timeout http_502 http_503;
}
}
}
5.3. Configuration Sync
Rsync script to sync configs:
#!/bin/bash
# sync_config.sh
PRIMARY="lb-1.example.com"
BACKUP="lb-2.example.com"
CONFIG_DIR="/etc/nginx"
if [ "$(hostname)" == "$PRIMARY" ]; then
# Sync from primary to backup
rsync -avz --delete \
--exclude 'logs/*' \
--exclude '*.log' \
$CONFIG_DIR/ \
root@$BACKUP:$CONFIG_DIR/
# Test config on backup
ssh root@$BACKUP "nginx -t && systemctl reload nginx"
echo "Configuration synced to backup"
else
echo "Run this script on primary only"
fi
Automated sync with inotify:
#!/bin/bash
# watch_and_sync.sh
CONFIG_DIR="/etc/nginx"
BACKUP="lb-2.example.com"
inotifywait -m -r -e modify,create,delete $CONFIG_DIR | while read path action file; do
echo "Change detected: $path$file ($action)"
# Sync to backup
rsync -avz --delete \
--exclude 'logs/*' \
$CONFIG_DIR/ \
root@$BACKUP:$CONFIG_DIR/
# Test and reload on backup
ssh root@$BACKUP "nginx -t && systemctl reload nginx"
done
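inotifywait is provided by the inotify-tools package. To keep the watcher running across reboots it can be wrapped in a small systemd service; the unit below is a sketch with an assumed script path:
# /etc/systemd/system/nginx-config-sync.service
[Unit]
Description=Watch /etc/nginx and sync changes to the backup load balancer
After=network-online.target

[Service]
ExecStart=/usr/local/bin/watch_and_sync.sh
Restart=always

[Install]
WantedBy=multi-user.target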
6. Database Load Balancing
6.1. MySQL/PostgreSQL Read Replicas
# Stream block for TCP proxying.
# Note: stream {} lives at the top level of nginx.conf (outside http {}),
# and the upstream groups it uses must be defined inside it.
stream {
# Master for writes
upstream db_master {
server db-master.example.com:3306 max_fails=2 fail_timeout=10s;
}
# Read replicas
upstream db_slaves {
least_conn;
server db-slave1.example.com:3306 max_fails=3 fail_timeout=30s weight=5;
server db-slave2.example.com:3306 max_fails=3 fail_timeout=30s weight=5;
server db-slave3.example.com:3306 max_fails=3 fail_timeout=30s weight=3;
# (the http-only "keepalive" directive is not valid in a stream upstream)
}
# Write traffic to master
server {
listen 3306;
proxy_pass db_master;
proxy_connect_timeout 1s;
}
# Read traffic to replicas
server {
listen 3307;
proxy_pass db_slaves;
proxy_connect_timeout 1s;
}
}
# Application connects to:
# localhost:3306 for writes
# localhost:3307 for reads
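A quick way to confirm the two listeners behave as expected is to ping MySQL through each port from the load balancer host (assumes the MySQL client tools and valid credentials; the replica answering on 3307 can differ per connection):
mysqladmin --host=127.0.0.1 --port=3306 --user=app --password ping   # write path (master)
mysqladmin --host=127.0.0.1 --port=3307 --user=app --password ping   # read path (replicas)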
6.2. MongoDB Replica Set
stream {
upstream mongodb {
# MongoDB replica set members
server mongo1.example.com:27017 max_fails=2 fail_timeout=30s;
server mongo2.example.com:27017 max_fails=2 fail_timeout=30s;
server mongo3.example.com:27017 max_fails=2 fail_timeout=30s;
}
server {
listen 27017;
proxy_pass mongodb;
proxy_connect_timeout 2s;
proxy_timeout 10m;
}
}
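A basic connectivity check through the proxy, assuming mongosh is installed on the host running the test. Note that replica-set-aware drivers normally list the members directly in their connection string and handle primary elections themselves, so this proxying is mainly useful for simple clients:
mongosh --host 127.0.0.1 --port 27017 --eval 'db.runCommand({ ping: 1 })'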
6.3. Redis Cluster
stream {
upstream redis_cluster {
# Redis nodes
# Note: a plain TCP proxy suits standalone or replicated Redis endpoints;
# Redis Cluster clients receive MOVED/ASK redirects containing node addresses
# and normally connect to the nodes directly.
server redis1.example.com:6379;
server redis2.example.com:6379;
server redis3.example.com:6379;
# Hash by client IP for consistency
hash $remote_addr consistent;
}
server {
listen 6379;
proxy_pass redis_cluster;
proxy_connect_timeout 1s;
proxy_timeout 3s;
}
}
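And the matching check for the Redis listener (requires redis-cli on the testing host):
redis-cli -h 127.0.0.1 -p 6379 ping    # expect: PONG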
7. Geographic Load Balancing
7.1. GeoDNS Setup
Multiple data centers:
US West: us-west.example.com (IP: 203.0.113.10)
US East: us-east.example.com (IP: 203.0.113.20)
EU: eu.example.com (IP: 203.0.113.30)
Asia: asia.example.com (IP: 203.0.113.40)
DNS configuration (Route53 example):
{
"Type": "A",
"Name": "example.com",
"GeoLocation": {
"ContinentCode": "NA",
"CountryCode": "US",
"SubdivisionCode": "CA"
},
"SetIdentifier": "US-West",
"ResourceRecords": [
{
"Value": "203.0.113.10"
}
],
"TTL": 60,
"HealthCheckId": "health-check-us-west"
}
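To apply a record like this with the AWS CLI, it has to be wrapped in a change batch; the hosted zone ID and file name below are placeholders:
# us-west-geo-record.json wraps the record set above as:
# { "Changes": [ { "Action": "UPSERT", "ResourceRecordSet": { ... } } ] }
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch file://us-west-geo-record.json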
7.2. Nginx Geo Module
http {
# Map client location to nearest datacenter
geo $nearest_dc {
default us-east;
# US West
include geoip/us-west.conf;
# EU
include geoip/eu.conf;
# Asia
include geoip/asia.conf;
}
# Define upstreams per region
upstream us-west {
server web1.us-west.example.com:80;
server web2.us-west.example.com:80;
}
upstream us-east {
server web1.us-east.example.com:80;
server web2.us-east.example.com:80;
}
upstream eu {
server web1.eu.example.com:80;
server web2.eu.example.com:80;
}
upstream asia {
server web1.asia.example.com:80;
server web2.asia.example.com:80;
}
server {
listen 80;
location / {
# Route to nearest datacenter
proxy_pass http://$nearest_dc;
}
}
}
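The included files are plain geo entries, one CIDR range per line mapped to a datacenter value. A sketch of what geoip/us-west.conf might contain (ranges are illustrative only):
# geoip/us-west.conf
203.0.113.0/24    us-west;
198.51.100.0/24   us-west;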
7.3. GeoIP2 Module
http {
geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
$geoip2_data_country_code country iso_code;
$geoip2_data_country_name country names en;
}
# Map country to datacenter
map $geoip2_data_country_code $datacenter {
default us-east;
# North America
US us-west;
CA us-west;
MX us-east;
# Europe
GB eu;
DE eu;
FR eu;
IT eu;
ES eu;
# Asia
CN asia;
JP asia;
KR asia;
IN asia;
}
server {
location / {
proxy_pass http://$datacenter;
# Add headers for debugging
add_header X-Country-Code $geoip2_data_country_code;
add_header X-Datacenter $datacenter;
}
}
}
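The geoip2 directive comes from the third-party ngx_http_geoip2_module rather than stock nginx. When built as a dynamic module it must be loaded at the very top of nginx.conf, and the GeoLite2 database has to exist at the configured path (module path varies by distribution):
# At the top of nginx.conf, outside the http {} block
load_module modules/ngx_http_geoip2_module.so;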
8. Disaster Recovery
8.1. Backup Strategy
#!/bin/bash
# backup_nginx_config.sh
BACKUP_DIR="/backup/nginx"
DATE=$(date +%Y%m%d-%H%M%S)
RETENTION_DAYS=30
# Create backup directory
mkdir -p $BACKUP_DIR
# Backup Nginx configuration
tar -czf $BACKUP_DIR/nginx-config-$DATE.tar.gz /etc/nginx
# Backup SSL certificates
tar -czf $BACKUP_DIR/nginx-ssl-$DATE.tar.gz /etc/letsencrypt
# Backup to remote location
rsync -avz $BACKUP_DIR/ backup-server:/backups/nginx/
# Delete old backups
find $BACKUP_DIR -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete
echo "Backup completed: $DATE"
Automated backup with cron:
# /etc/cron.d/nginx-backup
0 2 * * * root /usr/local/bin/backup_nginx_config.sh >> /var/log/nginx-backup.log 2>&1
8.2. Disaster Recovery Plan
Recovery procedure:
#!/bin/bash
# restore_nginx.sh
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup_file>"
exit 1
fi
# Stop Nginx
systemctl stop nginx
# Backup current config
mv /etc/nginx /etc/nginx.old
# Restore from backup
tar -xzf $BACKUP_FILE -C /
# Test configuration
nginx -t
if [ $? -eq 0 ]; then
# Start Nginx
systemctl start nginx
echo "Restoration successful"
else
# Rollback
rm -rf /etc/nginx
mv /etc/nginx.old /etc/nginx
systemctl start nginx
echo "Restoration failed, rolled back"
exit 1
fi
8.3. Failover Testing
#!/bin/bash
# test_failover.sh
VIP="192.168.1.100"
MASTER="192.168.1.10"
BACKUP="192.168.1.11"
echo "Starting failover test..."
# 1. Check initial state
echo "Checking VIP location..."
ssh root@$MASTER "ip addr | grep $VIP" && echo "✓ VIP on master"
# 2. Stop Nginx on master
echo "Stopping Nginx on master..."
ssh root@$MASTER "systemctl stop nginx"
# Wait for failover
sleep 5
# 3. Check VIP moved to backup
echo "Checking VIP failover..."
ssh root@$BACKUP "ip addr | grep $VIP" && echo "✓ VIP moved to backup"
# 4. Test connectivity
echo "Testing connectivity..."
curl -I http://$VIP && echo "✓ Site accessible"
# 5. Restore master
echo "Restoring master..."
ssh root@$MASTER "systemctl start nginx"
# Wait for failback
sleep 5
# 6. Check VIP returned
echo "Checking VIP failback..."
ssh root@$MASTER "ip addr | grep $VIP" && echo "✓ VIP returned to master"
echo "Failover test complete"
8.4. Recovery Time Objective (RTO)
Measure downtime:
#!/bin/bash
# measure_rto.sh
TARGET="http://192.168.1.100"
LOG_FILE="/var/log/rto-test.log"
START_TIME=$(date +%s)
# Trigger failure
echo "$(date): Starting RTO test" >> $LOG_FILE
ssh root@192.168.1.10 "systemctl stop nginx"
# Monitor until service restored
DOWNTIME=0
while true; do
if curl -sf $TARGET > /dev/null 2>&1; then
END_TIME=$(date +%s)
DOWNTIME=$((END_TIME - START_TIME))
echo "$(date): Service restored after ${DOWNTIME}s" >> $LOG_FILE
break
fi
sleep 1
done
# Restore
ssh root@192.168.1.10 "systemctl start nginx"
echo "RTO: ${DOWNTIME} seconds"
9. Testing HA Setup
9.1. Load Testing with Failover
#!/bin/bash
# load_test_ha.sh
VIP="http://192.168.1.100"
DURATION=300 # 5 minutes
# Start load test in background
echo "Starting load test..."
wrk -t4 -c100 -d${DURATION}s $VIP > /tmp/wrk-results.txt &
WRK_PID=$!
# Wait 60 seconds
sleep 60
# Trigger failover during load test
echo "Triggering failover..."
ssh root@192.168.1.10 "systemctl stop nginx"
# Wait for test to complete
wait $WRK_PID
# Analyze results
echo "Load test complete"
cat /tmp/wrk-results.txt
# Count errors
ERRORS=$(grep "Socket errors" /tmp/wrk-results.txt)
echo "Errors during failover: $ERRORS"
# Restore
ssh root@192.168.1.10 "systemctl start nginx"
9.2. Chaos Engineering
#!/bin/bash
# chaos_test.sh
SERVERS=(
"192.168.1.10"
"192.168.1.11"
"10.0.1.10"
"10.0.1.11"
"10.0.1.12"
)
# Randomly kill services
while true; do
# Random server
SERVER=${SERVERS[$RANDOM % ${#SERVERS[@]}]}
# Random action
ACTIONS=("stop nginx" "network disconnect" "high cpu" "high memory")
ACTION=${ACTIONS[$RANDOM % ${#ACTIONS[@]}]}
echo "$(date): Testing $ACTION on $SERVER"
case $ACTION in
"stop nginx")
ssh root@$SERVER "systemctl stop nginx"
sleep 30
ssh root@$SERVER "systemctl start nginx"
;;
"network disconnect")
# Block only the service ports so the SSH control connection is not cut off
ssh root@$SERVER "iptables -A INPUT -p tcp -m multiport --dports 80,443 -j DROP"
sleep 30
ssh root@$SERVER "iptables -D INPUT -p tcp -m multiport --dports 80,443 -j DROP"
;;
"high cpu")
ssh root@$SERVER "stress-ng --cpu 4 --timeout 30s" &
;;
"high memory")
ssh root@$SERVER "stress-ng --vm 2 --vm-bytes 1G --timeout 30s" &
;;
esac
# Wait before next chaos
sleep 60
done
9.3. Automated HA Tests
#!/bin/bash
# ha_test_suite.sh
run_test() {
local test_name=$1
local test_command=$2
echo "Running: $test_name"
if eval $test_command; then
echo "✓ PASS: $test_name"
return 0
else
echo "✗ FAIL: $test_name"
return 1
fi
}
# Test 1: VIP reachable
run_test "VIP Reachability" "ping -c 3 192.168.1.100"
# Test 2: Service responds
run_test "HTTP Response" "curl -sf http://192.168.1.100/health"
# Test 3: Failover time < 5s
run_test "Failover Time" "./measure_rto.sh | grep -q 'RTO: [0-4] seconds'"
# Test 4: All backends healthy
run_test "Backend Health" "curl -sf http://192.168.1.100/health | grep -q 'healthy'"
# Test 5: Session persistence
run_test "Session Persistence" "./test_session_persistence.sh"
# Test 6: Load distribution
run_test "Load Distribution" "./test_load_distribution.sh"
echo "HA test suite complete"
10. Monitoring HA Setup
10.1. HA Monitoring Dashboard
#!/usr/bin/env python3
# ha_monitor.py
import requests
import time
from datetime import datetime
SERVERS = [
{'name': 'LB-1', 'ip': '192.168.1.10', 'role': 'master'},
{'name': 'LB-2', 'ip': '192.168.1.11', 'role': 'backup'},
{'name': 'Web-1', 'ip': '10.0.1.10', 'role': 'backend'},
{'name': 'Web-2', 'ip': '10.0.1.11', 'role': 'backend'},
{'name': 'Web-3', 'ip': '10.0.1.12', 'role': 'backend'},
]
VIP = '192.168.1.100'
def check_server(server):
try:
response = requests.get(
f"http://{server['ip']}/health",
timeout=2
)
return response.status_code == 200
except requests.RequestException:
return False
def check_vip():
try:
response = requests.get(f"http://{VIP}/health", timeout=2)
return response.status_code == 200
except requests.RequestException:
return False
def display_status():
print("\033[2J\033[H") # Clear screen
print("=" * 60)
print("HA Monitoring Dashboard")
print("=" * 60)
print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print()
# Check VIP
vip_status = "✓ UP" if check_vip() else "✗ DOWN"
print(f"Virtual IP ({VIP}): {vip_status}")
print()
# Check all servers
print("Server Status:")
print("-" * 60)
for server in SERVERS:
status = "✓ UP" if check_server(server) else "✗ DOWN"
print(f"{server['name']:10} {server['ip']:15} {server['role']:10} {status}")
print()
def main():
while True:
display_status()
time.sleep(5)
if __name__ == '__main__':
main()
10.2. Alerting for HA Events
#!/bin/bash
# ha_alert.sh
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
send_alert() {
local message=$1
local severity=$2
local emoji="ℹ️"
case $severity in
critical) emoji="🔴" ;;
warning) emoji="🟡" ;;
info) emoji="🟢" ;;
esac
curl -X POST $WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"$emoji HA Alert: $message\"}"
}
# Monitor VIP
VIP="192.168.1.100"
PREVIOUS_STATE="unknown"
while true; do
if ping -c 1 -W 1 $VIP > /dev/null 2>&1; then
CURRENT_STATE="up"
if [ "$PREVIOUS_STATE" == "down" ]; then
send_alert "VIP $VIP is now UP" "info"
fi
else
CURRENT_STATE="down"
if [ "$PREVIOUS_STATE" == "up" ]; then
send_alert "VIP $VIP is DOWN!" "critical"
fi
fi
PREVIOUS_STATE=$CURRENT_STATE
sleep 5
done
Summary
In this lesson, you learned:
- ✅ HA concepts and architecture
- ✅ Advanced health checks (passive/active)
- ✅ Session persistence strategies
- ✅ Keepalived for virtual IPs and failover
- ✅ Complete HA setup with multiple tiers
- ✅ Database load balancing
- ✅ Geographic load balancing
- ✅ Disaster recovery planning
- ✅ HA testing and chaos engineering
- ✅ Monitoring HA infrastructure
Key takeaways:
- Redundancy at every layer
- Automatic failover with Keepalived
- Health checks critical for reliability
- Session persistence for stateful apps
- Geographic distribution for global apps
- Regular testing of failover scenarios
- Comprehensive monitoring and alerting
- Documented DR procedures
HA Checklist:
- ✅ Multiple load balancers with Keepalived
- ✅ Virtual IP configured
- ✅ Health checks implemented
- ✅ Session persistence configured
- ✅ Multiple backend servers
- ✅ Database replication setup
- ✅ Configuration sync automated
- ✅ Monitoring and alerting active
- ✅ DR plan documented and tested
- ✅ Regular failover testing
Production readiness:
- RTO (Recovery Time Objective): < 5 seconds
- RPO (Recovery Point Objective): Near-zero for stateless apps
- Availability target: 99.99% (four nines)
- Regular chaos engineering tests
- Automated incident response
Next lesson: Microservices and Service Mesh - service discovery, API Gateway patterns, per-service rate limiting, circuit breakers, distributed tracing, and Consul/Istio integration to manage complex microservices architectures.