Lesson 13: Advanced High Availability and Load Balancing in NGINX
Health checks (active/passive), session persistence, sticky sessions, failover strategies, Keepalived for virtual IPs, active-active and active-passive architectures, database load balancing, geographic distribution, disaster recovery planning, and testing HA setups.
1. High Availability Concepts
1.1. HA Architecture Overview
High Availability Setup:
            ┌─────────────────┐
            │  Load Balancer  │
            │  (Virtual IP)   │
            └────────┬────────┘
                     │
                ┌────┴────┐
                │         │
            ┌───▼───┐ ┌───▼───┐
            │ LB-1  │ │ LB-2  │   (Active-Passive with Keepalived)
            │Primary│ │Backup │
            └───┬───┘ └───┬───┘
                │         │
                └────┬────┘
                     │
          ┌──────────┼──────────┐
          │          │          │
      ┌───▼───┐  ┌───▼───┐  ┌───▼───┐
      │ Web-1 │  │ Web-2 │  │ Web-3 │   (Backend servers)
      └───────┘  └───────┘  └───────┘
Key components:
- Multiple load balancers (redundancy)
- Virtual IP (VIP) with failover
- Health checks
- Session persistence
- Multiple backend servers
- Automatic failover
1.2. Availability Calculations
Availability = (Total Time - Downtime) / Total Time × 100%
Uptime Targets:
- 99% (Two nines): 3.65 days downtime/year
- 99.9% (Three nines): 8.76 hours downtime/year
- 99.99% (Four nines): 52.56 minutes downtime/year
- 99.999% (Five nines): 5.26 minutes downtime/year
Example with redundant servers in parallel (assuming independent failures):
Single server: 99% availability
Two servers: 1 - (0.01 × 0.01) = 99.99% availability
Three servers: 1 - (0.01 × 0.01 × 0.01) = 99.9999% availability
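A quick sanity check of these figures, assuming failures are independent and a single surviving server can carry the full load (real-world numbers are worse when failures are correlated):
# availability.sh - combined availability of N redundant servers (sketch)
single=0.99   # availability of one server
for n in 1 2 3; do
  awk -v a="$single" -v n="$n" 'BEGIN {
    printf "%d server(s): %.4f%% available\n", n, (1 - (1 - a)^n) * 100
  }'
done
# Expected output: 99.0000%, 99.9900%, 99.9999%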
1.3. Types of HA Setups
Active-Passive:

    ┌────────┐      ┌────────┐
    │ Active │─────▶│Passive │
    │   LB   │      │   LB   │
    └───┬────┘      └────────┘
        │
        │            (Standby)
        ▼
     Serves
     Traffic

Active-Active:

    ┌────────┐      ┌────────┐
    │ Active │      │ Active │
    │  LB-1  │      │  LB-2  │
    └───┬────┘      └───┬────┘
        │               │
        └───────┬───────┘
                │
        Both serve traffic
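The same active-passive idea also exists one level down, inside a single nginx upstream: the backup parameter keeps a server idle until all primary servers are unavailable. A minimal sketch (addresses are illustrative):
upstream app {
    server 10.0.0.10:8080;          # active: receives all traffic
    server 10.0.0.20:8080 backup;   # passive: only used when the active server is down
}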
2. Advanced Health Checks
2.1. Passive Health Checks
Passive health checks monitor actual traffic to detect failures.
upstream backend {
server backend1.example.com:8080 max_fails=3 fail_timeout=30s;
server backend2.example.com:8080 max_fails=3 fail_timeout=30s;
server backend3.example.com:8080 max_fails=3 fail_timeout=30s;
# max_fails: failed attempts within fail_timeout before the server is considered unavailable
# fail_timeout: both the window for counting failures and how long the server stays marked unavailable
}
server {
listen 80;
server_name example.com;
location / {
proxy_pass http://backend;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_connect_timeout 2s;
proxy_read_timeout 10s;
}
}
Detailed configuration:
upstream app_backend {
# Backend servers with health check params
server 10.0.0.10:8080 max_fails=3 fail_timeout=30s weight=5;
server 10.0.0.11:8080 max_fails=3 fail_timeout=30s weight=5;
server 10.0.0.12:8080 max_fails=2 fail_timeout=20s weight=3 backup;
# Keepalive connections
keepalive 32;
keepalive_timeout 60s;
keepalive_requests 100;
}
server {
listen 80;
location / {
proxy_pass http://app_backend;
# Define what constitutes a failure
proxy_next_upstream error timeout invalid_header
http_500 http_502 http_503 http_504;
# Retry settings
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 10s;
# Connection settings
proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 10s;
# Headers
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}
}
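One way to observe passive health checks is to take a backend offline and keep sending requests through the load balancer: the remaining servers absorb the traffic, and nginx's error log records the failing peer. A rough walkthrough (the backend service name is hypothetical for this lab):
# Simulate a failure on one backend
ssh root@10.0.0.10 "systemctl stop myapp"

# Requests through the load balancer should still return 200 from the other servers
for i in $(seq 1 10); do
  curl -s -o /dev/null -w "%{http_code}\n" http://localhost/
done

# The error log typically shows the failed peer being skipped
sudo tail -n 20 /var/log/nginx/error.log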
2.2. Active Health Checks (Nginx Plus)
# Nginx Plus only
upstream backend {
zone backend 64k;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
server {
listen 80;
location / {
proxy_pass http://backend;
# Active health check
health_check interval=5s
fails=3
passes=2
uri=/health
match=health_check;
}
}
# Define what a healthy response looks like
match health_check {
status 200;
header Content-Type = "application/json";
body ~ "\"status\":\"ok\"";
}
2.3. Custom Health Check Scripts
External health check script:
#!/bin/bash
# health_check.sh
BACKEND_SERVERS=(
"10.0.0.10:8080"
"10.0.0.11:8080"
"10.0.0.12:8080"
)
HEALTH_ENDPOINT="/health"
NGINX_UPSTREAM_CONF="/etc/nginx/conf.d/upstream.conf"
TEMP_CONF="/tmp/upstream.conf.tmp"
check_backend() {
local server=$1
local url="http://${server}${HEALTH_ENDPOINT}"
# Check with timeout
if curl -sf --max-time 3 "$url" > /dev/null; then
return 0 # Healthy
else
return 1 # Unhealthy
fi
}
update_upstream_config() {
echo "upstream backend {" > $TEMP_CONF
for server in "${BACKEND_SERVERS[@]}"; do
if check_backend "$server"; then
echo " server $server;" >> $TEMP_CONF
echo "✓ $server is healthy"
else
echo " server $server down;" >> $TEMP_CONF
echo "✗ $server is down"
fi
done
echo "}" >> $TEMP_CONF
# Compare and reload if changed
if ! cmp -s "$TEMP_CONF" "$NGINX_UPSTREAM_CONF"; then
mv $TEMP_CONF $NGINX_UPSTREAM_CONF
nginx -t && nginx -s reload
echo "Nginx configuration updated and reloaded"
fi
}
# Run health check
update_upstream_config
Systemd timer for health checks:
# /etc/systemd/system/nginx-health-check.service
[Unit]
Description=Nginx Backend Health Check
[Service]
Type=oneshot
ExecStart=/usr/local/bin/health_check.sh
# /etc/systemd/system/nginx-health-check.timer
[Unit]
Description=Run Nginx health check every 30 seconds
[Timer]
OnBootSec=30s
OnUnitActiveSec=30s
[Install]
WantedBy=timers.target
Enable timer:
sudo systemctl daemon-reload
sudo systemctl start nginx-health-check.timer
sudo systemctl enable nginx-health-check.timer
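To confirm the timer is active and see what the last run did (a quick check; output layout varies between systemd versions):
systemctl list-timers nginx-health-check.timer
journalctl -u nginx-health-check.service -n 20 --no-pager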
2.4. Application-Level Health Checks
Node.js health endpoint:
// health.js
const express = require('express');
const app = express();
// Health check endpoint
app.get('/health', async (req, res) => {
const health = {
status: 'ok',
timestamp: Date.now(),
uptime: process.uptime(),
checks: {}
};
try {
// Database check
await checkDatabase();
health.checks.database = 'ok';
// Redis check
await checkRedis();
health.checks.redis = 'ok';
// Memory check
const memUsage = process.memoryUsage();
if (memUsage.heapUsed / memUsage.heapTotal > 0.9) {
throw new Error('High memory usage');
}
health.checks.memory = 'ok';
res.status(200).json(health);
} catch (error) {
health.status = 'error';
health.error = error.message;
res.status(503).json(health);
}
});
async function checkDatabase() {
// Database connection check
// throw error if unhealthy
}
async function checkRedis() {
// Redis connection check
// throw error if unhealthy
}
app.listen(8080);
Python/Flask health endpoint:
# app.py
from flask import Flask, jsonify
import psycopg2
import redis
import time
app = Flask(__name__)
@app.route('/health')
def health_check():
health = {
'status': 'ok',
'timestamp': int(time.time()),
'checks': {}
}
try:
# Database check
check_database()
health['checks']['database'] = 'ok'
# Redis check
check_redis()
health['checks']['redis'] = 'ok'
return jsonify(health), 200
except Exception as e:
health['status'] = 'error'
health['error'] = str(e)
return jsonify(health), 503
def check_database():
conn = psycopg2.connect(
host="localhost",
database="mydb",
user="user",
password="pass"
)
conn.close()
def check_redis():
r = redis.Redis(host='localhost', port=6379)
r.ping()
if __name__ == '__main__':
app.run(host='0.0.0.0', port=8080)
3. Session Persistence / Sticky Sessions
3.1. IP Hash
upstream backend {
ip_hash; # Route based on client IP
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
server {
listen 80;
location / {
proxy_pass http://backend;
}
}
# Pros: Simple, works without cookies
# Cons: Issues with proxies, NAT, mobile users
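When testing stickiness it helps to see which backend served each request. As a temporary debugging aid (not something to leave enabled in production), the load balancer can echo $upstream_addr back in a response header:
location / {
    proxy_pass http://backend;
    # Expose the chosen backend for testing only
    add_header X-Upstream $upstream_addr always;
}
Repeated requests from the same client IP should then report the same X-Upstream value while ip_hash is in effect.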
3.2. Cookie-based Sticky Sessions
Nginx Plus:
upstream backend {
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
sticky cookie srv_id expires=1h domain=.example.com path=/;
}
Open source alternative with a map (the values are upstream group names, so proxy_pass can use a variable without needing a resolver):
# Route based on a "route" cookie; the application (or a split_clients block)
# is responsible for issuing it, e.g. Set-Cookie: route=server2
map $cookie_route $backend_pool {
~*server1 backend1;
~*server2 backend2;
~*server3 backend3;
default backend1;
}
upstream backend1 { server backend1.example.com:8080; }
upstream backend2 { server backend2.example.com:8080; }
upstream backend3 { server backend3.example.com:8080; }
server {
listen 80;
location / {
proxy_pass http://$backend_pool;
}
}
3.3. Hash-based Load Balancing
URI hash:
upstream backend {
hash $request_uri consistent;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
# Same URL always goes to same server
# Good for caching
Custom hash key:
map $cookie_user_id $hash_key {
default $remote_addr;
~.+ $cookie_user_id;
}
upstream backend {
hash $hash_key consistent;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
# Hash by user ID if available, otherwise by IP
3.4. Shared Session Storage (Alternative to Sticky Sessions)
Instead of sticky sessions, store session data in a shared store such as Redis so any backend can serve any request.
Redis for session storage:
upstream backend {
# No sticky sessions needed
least_conn;
server backend1.example.com:8080;
server backend2.example.com:8080;
server backend3.example.com:8080;
}
# Application stores sessions in Redis
# All backends can access same session data
Application-side (Node.js example):
const express = require('express');
const session = require('express-session');
const RedisStore = require('connect-redis')(session);
const redis = require('redis');
const app = express();
const redisClient = redis.createClient({
host: 'redis.example.com',
port: 6379
});
app.use(session({
store: new RedisStore({ client: redisClient }),
secret: 'your-secret',
resave: false,
saveUninitialized: false,
cookie: {
secure: true,
httpOnly: true,
maxAge: 3600000
}
}));
4. Keepalived for High Availability
4.1. Keepalived Setup
Install Keepalived:
# Ubuntu/Debian
sudo apt install keepalived
# CentOS/RHEL
sudo yum install keepalived
Network topology:
Virtual IP: 192.168.1.100
Master: 192.168.1.10 (Priority: 100)
Backup: 192.168.1.11 (Priority: 90)
Clients connect to VIP (192.168.1.100)
Master handles traffic
Backup takes over if master fails
4.2. Master Configuration
# /etc/keepalived/keepalived.conf (Master)
global_defs {
router_id nginx_master
script_user root
enable_script_security
}
vrrp_script check_nginx {
script "/usr/local/bin/check_nginx.sh"
interval 2
weight -20    # on failure, drop priority by 20 so the backup (priority 90) can take over the VIP
fall 2
rise 2
}
vrrp_instance VI_1 {
state MASTER
interface eth0
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass your_secret_password
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
check_nginx
}
notify_master "/usr/local/bin/notify_master.sh"
notify_backup "/usr/local/bin/notify_backup.sh"
notify_fault "/usr/local/bin/notify_fault.sh"
}
4.3. Backup Configuration
# /etc/keepalived/keepalived.conf (Backup)
global_defs {
router_id nginx_backup
script_user root
enable_script_security
}
vrrp_script check_nginx {
script "/usr/local/bin/check_nginx.sh"
interval 2
weight -20    # mirror the master's setting; on failure this node's priority drops by 20
fall 2
rise 2
}
vrrp_instance VI_1 {
state BACKUP
interface eth0
virtual_router_id 51
priority 90
advert_int 1
authentication {
auth_type PASS
auth_pass your_secret_password
}
virtual_ipaddress {
192.168.1.100/24
}
track_script {
check_nginx
}
notify_master "/usr/local/bin/notify_master.sh"
notify_backup "/usr/local/bin/notify_backup.sh"
notify_fault "/usr/local/bin/notify_fault.sh"
}
4.4. Health Check Script
#!/bin/bash
# /usr/local/bin/check_nginx.sh
# Check if Nginx is running
if systemctl is-active --quiet nginx; then
# Check if Nginx responds
if curl -sf http://localhost/health > /dev/null 2>&1; then
exit 0 # Healthy
fi
fi
exit 1 # Unhealthy
Make executable:
sudo chmod +x /usr/local/bin/check_nginx.sh
4.5. Notification Scripts
#!/bin/bash
# /usr/local/bin/notify_master.sh
HOSTNAME=$(hostname)
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "$TIMESTAMP: $HOSTNAME became MASTER" >> /var/log/keepalived-state.log
# Send alert
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🟢 $HOSTNAME is now MASTER\"}"
#!/bin/bash
# /usr/local/bin/notify_backup.sh
HOSTNAME=$(hostname)
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "$TIMESTAMP: $HOSTNAME became BACKUP" >> /var/log/keepalived-state.log
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🟡 $HOSTNAME is now BACKUP\"}"
#!/bin/bash
# /usr/local/bin/notify_fault.sh
HOSTNAME=$(hostname)
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "$TIMESTAMP: $HOSTNAME entered FAULT state" >> /var/log/keepalived-state.log
curl -X POST https://hooks.slack.com/services/YOUR/WEBHOOK/URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"🔴 $HOSTNAME in FAULT state\"}"
Make executable:
sudo chmod +x /usr/local/bin/notify_*.sh
4.6. Start Keepalived
# Start on both master and backup
sudo systemctl start keepalived
sudo systemctl enable keepalived
# Check status
sudo systemctl status keepalived
# View logs
sudo journalctl -u keepalived -f
# Check VIP
ip addr show eth0
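VRRP advertisements are multicast roughly once per advert_int second, so a quick way to confirm both nodes actually see each other on the network is to capture IP protocol 112 traffic (interface name assumed to be eth0, as above):
sudo tcpdump -i eth0 -n 'ip proto 112'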
4.7. Test Failover
# On master, stop Nginx
sudo systemctl stop nginx
# VIP should move to backup within seconds
# Check on backup:
ip addr show eth0 | grep 192.168.1.100
# Restart Nginx on master
sudo systemctl start nginx
# VIP should return to master
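Running a simple request loop against the VIP while stopping nginx on the master makes the failover window visible; you should see at most a few non-200 responses before the backup takes over (a rough sketch, adjust the URL and interval to your setup):
# Run in a separate terminal before stopping nginx on the master
while true; do
  printf '%s ' "$(date +%T)"
  curl -s -o /dev/null -w "%{http_code}\n" --max-time 1 http://192.168.1.100/
  sleep 0.5
done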
5. Complete HA Setup
5.1. Multi-tier HA Architecture
                         Internet
                             │
                             ▼
                    ┌─────────────────┐
                    │  DNS (GeoDNS)   │
                    │   Round-robin   │
                    └────────┬────────┘
                             │
              ┌──────────────┴──────────────┐
              │                             │
              ▼                             ▼
     ┌────────────────┐            ┌────────────────┐
     │ Data Center 1  │            │ Data Center 2  │
     └────────┬───────┘            └────────┬───────┘
              │                             │
         ┌────┴────┐                   ┌────┴────┐
         ▼         ▼                   ▼         ▼
     ┌───────┐ ┌───────┐           ┌───────┐ ┌───────┐
     │ LB-1  │ │ LB-2  │           │ LB-3  │ │ LB-4  │
     │Master │ │Backup │           │Master │ │Backup │
     └───┬───┘ └───┬───┘           └───┬───┘ └───┬───┘
         │         │                   │         │
         │  VIP1   │                   │  VIP2   │
         └────┬────┘                   └────┬────┘
              │                             │
     ┌────────┼────────┐           ┌────────┼────────┐
     ▼        ▼        ▼           ▼        ▼        ▼
 ┌───────┐┌───────┐┌───────┐   ┌───────┐┌───────┐┌───────┐
 │ Web-1 ││ Web-2 ││ Web-3 │   │ Web-4 ││ Web-5 ││ Web-6 │
 └───────┘└───────┘└───────┘   └───────┘└───────┘└───────┘
5.2. Load Balancer Configuration
LB-1 (Master) configuration:
# /etc/nginx/nginx.conf
user nginx;
worker_processes auto;
worker_rlimit_nofile 65535;
events {
worker_connections 4096;
use epoll;
multi_accept on;
}
http {
# Upstream definitions
upstream web_backend {
least_conn;
# Data Center 1 servers
server 10.0.1.10:80 max_fails=3 fail_timeout=30s weight=5;
server 10.0.1.11:80 max_fails=3 fail_timeout=30s weight=5;
server 10.0.1.12:80 max_fails=3 fail_timeout=30s weight=5;
# Keepalive
keepalive 64;
keepalive_timeout 60s;
keepalive_requests 1000;
}
upstream api_backend {
least_conn;
server 10.0.2.10:8080 max_fails=3 fail_timeout=30s;
server 10.0.2.11:8080 max_fails=3 fail_timeout=30s;
server 10.0.2.12:8080 max_fails=3 fail_timeout=30s;
keepalive 32;
}
# Health check endpoint
server {
listen 80;
server_name localhost;
location /health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
}
# Main site
server {
listen 80;
listen [::]:80;
server_name example.com;
# Redirect to HTTPS
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
listen [::]:443 ssl http2;
server_name example.com;
# SSL
ssl_certificate /etc/nginx/ssl/cert.pem;
ssl_certificate_key /etc/nginx/ssl/key.pem;
ssl_protocols TLSv1.2 TLSv1.3;
ssl_session_cache shared:SSL:10m;
# Logging
access_log /var/log/nginx/access.log;
error_log /var/log/nginx/error.log;
# Web traffic
location / {
proxy_pass http://web_backend;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_next_upstream error timeout invalid_header http_500 http_502 http_503;
proxy_next_upstream_tries 3;
proxy_next_upstream_timeout 10s;
proxy_connect_timeout 5s;
proxy_send_timeout 10s;
proxy_read_timeout 10s;
}
# API traffic
location /api/ {
proxy_pass http://api_backend/;
proxy_http_version 1.1;
proxy_set_header Connection "";
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_next_upstream error timeout http_502 http_503;
}
}
}
5.3. Configuration Sync
Rsync script to sync configs:
#!/bin/bash
# sync_config.sh
PRIMARY="lb-1.example.com"
BACKUP="lb-2.example.com"
CONFIG_DIR="/etc/nginx"
if [ "$(hostname)" == "$PRIMARY" ]; then
# Sync from primary to backup
rsync -avz --delete \
--exclude 'logs/*' \
--exclude '*.log' \
$CONFIG_DIR/ \
root@$BACKUP:$CONFIG_DIR/
# Test config on backup
ssh root@$BACKUP "nginx -t && systemctl reload nginx"
echo "Configuration synced to backup"
else
echo "Run this script on primary only"
fi
Automated sync with inotify:
#!/bin/bash
# watch_and_sync.sh
CONFIG_DIR="/etc/nginx"
BACKUP="lb-2.example.com"
inotifywait -m -r -e modify,create,delete $CONFIG_DIR | while read path action file; do
echo "Change detected: $path$file ($action)"
# Sync to backup
rsync -avz --delete \
--exclude 'logs/*' \
$CONFIG_DIR/ \
root@$BACKUP:$CONFIG_DIR/
# Test and reload on backup
ssh root@$BACKUP "nginx -t && systemctl reload nginx"
done
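inotifywait is provided by the inotify-tools package. To keep the watcher running across reboots it can be wrapped in a small systemd service; the unit below is a sketch with an assumed script path:
# /etc/systemd/system/nginx-config-sync.service
[Unit]
Description=Watch /etc/nginx and sync changes to the backup load balancer
After=network-online.target

[Service]
ExecStart=/usr/local/bin/watch_and_sync.sh
Restart=always

[Install]
WantedBy=multi-user.target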
6. Database Load Balancing
6.1. MySQL/PostgreSQL Read Replicas
# Stream block for TCP proxying.
# Note: stream {} lives at the top level of nginx.conf (outside http {}),
# and the upstream groups it uses must be defined inside it.
stream {
# Master for writes
upstream db_master {
server db-master.example.com:3306 max_fails=2 fail_timeout=10s;
}
# Read replicas
upstream db_slaves {
least_conn;
server db-slave1.example.com:3306 max_fails=3 fail_timeout=30s weight=5;
server db-slave2.example.com:3306 max_fails=3 fail_timeout=30s weight=5;
server db-slave3.example.com:3306 max_fails=3 fail_timeout=30s weight=3;
# (the http-only "keepalive" directive is not valid in a stream upstream)
}
# Write traffic to master
server {
listen 3306;
proxy_pass db_master;
proxy_connect_timeout 1s;
}
# Read traffic to replicas
server {
listen 3307;
proxy_pass db_slaves;
proxy_connect_timeout 1s;
}
}
# Application connects to:
# localhost:3306 for writes
# localhost:3307 for reads
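A quick way to confirm the two listeners behave as expected is to ping MySQL through each port from the load balancer host (assumes the MySQL client tools and valid credentials; the replica answering on 3307 can differ per connection):
mysqladmin --host=127.0.0.1 --port=3306 --user=app --password ping   # write path (master)
mysqladmin --host=127.0.0.1 --port=3307 --user=app --password ping   # read path (replicas)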
6.2. MongoDB Replica Set
stream {
upstream mongodb {
# MongoDB replica set members
server mongo1.example.com:27017 max_fails=2 fail_timeout=30s;
server mongo2.example.com:27017 max_fails=2 fail_timeout=30s;
server mongo3.example.com:27017 max_fails=2 fail_timeout=30s;
}
server {
listen 27017;
proxy_pass mongodb;
proxy_connect_timeout 2s;
proxy_timeout 10m;
}
}
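A basic connectivity check through the proxy, assuming mongosh is installed on the host running the test. Note that replica-set-aware drivers normally list the members directly in their connection string and handle primary elections themselves, so this proxying is mainly useful for simple clients:
mongosh --host 127.0.0.1 --port 27017 --eval 'db.runCommand({ ping: 1 })'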
6.3. Redis Cluster
stream {
upstream redis_cluster {
# Redis nodes
# Note: a plain TCP proxy suits standalone or replicated Redis endpoints;
# Redis Cluster clients receive MOVED/ASK redirects containing node addresses
# and normally connect to the nodes directly.
server redis1.example.com:6379;
server redis2.example.com:6379;
server redis3.example.com:6379;
# Hash by client IP for consistency
hash $remote_addr consistent;
}
server {
listen 6379;
proxy_pass redis_cluster;
proxy_connect_timeout 1s;
proxy_timeout 3s;
}
}
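And the matching check for the Redis listener (requires redis-cli on the testing host):
redis-cli -h 127.0.0.1 -p 6379 ping    # expect: PONG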
7. Geographic Load Balancing
7.1. GeoDNS Setup
Multiple data centers:
US West: us-west.example.com (IP: 203.0.113.10)
US East: us-east.example.com (IP: 203.0.113.20)
EU: eu.example.com (IP: 203.0.113.30)
Asia: asia.example.com (IP: 203.0.113.40)
DNS configuration (Route53 example):
{
"Type": "A",
"Name": "example.com",
"GeoLocation": {
"ContinentCode": "NA",
"CountryCode": "US",
"SubdivisionCode": "CA"
},
"SetIdentifier": "US-West",
"ResourceRecords": [
{
"Value": "203.0.113.10"
}
],
"TTL": 60,
"HealthCheckId": "health-check-us-west"
}
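To apply a record like this with the AWS CLI, it has to be wrapped in a change batch; the hosted zone ID and file name below are placeholders:
# us-west-geo-record.json wraps the record set above as:
# { "Changes": [ { "Action": "UPSERT", "ResourceRecordSet": { ... } } ] }
aws route53 change-resource-record-sets \
  --hosted-zone-id Z0123456789EXAMPLE \
  --change-batch file://us-west-geo-record.json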
7.2. Nginx Geo Module
http {
# Map client location to nearest datacenter
geo $nearest_dc {
default us-east;
# US West
include geoip/us-west.conf;
# EU
include geoip/eu.conf;
# Asia
include geoip/asia.conf;
}
# Define upstreams per region
upstream us-west {
server web1.us-west.example.com:80;
server web2.us-west.example.com:80;
}
upstream us-east {
server web1.us-east.example.com:80;
server web2.us-east.example.com:80;
}
upstream eu {
server web1.eu.example.com:80;
server web2.eu.example.com:80;
}
upstream asia {
server web1.asia.example.com:80;
server web2.asia.example.com:80;
}
server {
listen 80;
location / {
# Route to nearest datacenter
proxy_pass http://$nearest_dc;
}
}
}
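The included files are plain geo entries, one CIDR range per line mapped to a datacenter value. A sketch of what geoip/us-west.conf might contain (ranges are illustrative only):
# geoip/us-west.conf
203.0.113.0/24    us-west;
198.51.100.0/24   us-west;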
7.3. GeoIP2 Module
http {
geoip2 /usr/share/GeoIP/GeoLite2-Country.mmdb {
$geoip2_data_country_code country iso_code;
$geoip2_data_country_name country names en;
}
# Map country to datacenter
map $geoip2_data_country_code $datacenter {
default us-east;
# North America
US us-west;
CA us-west;
MX us-east;
# Europe
GB eu;
DE eu;
FR eu;
IT eu;
ES eu;
# Asia
CN asia;
JP asia;
KR asia;
IN asia;
}
server {
location / {
proxy_pass http://$datacenter;
# Add headers for debugging
add_header X-Country-Code $geoip2_data_country_code;
add_header X-Datacenter $datacenter;
}
}
}
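The geoip2 directive comes from the third-party ngx_http_geoip2_module rather than stock nginx. When built as a dynamic module it must be loaded at the very top of nginx.conf, and the GeoLite2 database has to exist at the configured path (module path varies by distribution):
# At the top of nginx.conf, outside the http {} block
load_module modules/ngx_http_geoip2_module.so;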
8. Disaster Recovery
8.1. Backup Strategy
#!/bin/bash
# backup_nginx_config.sh
BACKUP_DIR="/backup/nginx"
DATE=$(date +%Y%m%d-%H%M%S)
RETENTION_DAYS=30
# Create backup directory
mkdir -p $BACKUP_DIR
# Backup Nginx configuration
tar -czf $BACKUP_DIR/nginx-config-$DATE.tar.gz /etc/nginx
# Backup SSL certificates
tar -czf $BACKUP_DIR/nginx-ssl-$DATE.tar.gz /etc/letsencrypt
# Backup to remote location
rsync -avz $BACKUP_DIR/ backup-server:/backups/nginx/
# Delete old backups
find $BACKUP_DIR -name "*.tar.gz" -mtime +$RETENTION_DAYS -delete
echo "Backup completed: $DATE"
Automated backup with cron:
# /etc/cron.d/nginx-backup
0 2 * * * root /usr/local/bin/backup_nginx_config.sh >> /var/log/nginx-backup.log 2>&1
8.2. Disaster Recovery Plan
Recovery procedure:
#!/bin/bash
# restore_nginx.sh
BACKUP_FILE=$1
if [ -z "$BACKUP_FILE" ]; then
echo "Usage: $0 <backup_file>"
exit 1
fi
# Stop Nginx
systemctl stop nginx
# Backup current config
mv /etc/nginx /etc/nginx.old
# Restore from backup
tar -xzf $BACKUP_FILE -C /
# Test configuration
nginx -t
if [ $? -eq 0 ]; then
# Start Nginx
systemctl start nginx
echo "Restoration successful"
else
# Rollback
rm -rf /etc/nginx
mv /etc/nginx.old /etc/nginx
systemctl start nginx
echo "Restoration failed, rolled back"
exit 1
fi
8.3. Failover Testing
#!/bin/bash
# test_failover.sh
VIP="192.168.1.100"
MASTER="192.168.1.10"
BACKUP="192.168.1.11"
echo "Starting failover test..."
# 1. Check initial state
echo "Checking VIP location..."
ssh root@$MASTER "ip addr | grep $VIP" && echo "✓ VIP on master"
# 2. Stop Nginx on master
echo "Stopping Nginx on master..."
ssh root@$MASTER "systemctl stop nginx"
# Wait for failover
sleep 5
# 3. Check VIP moved to backup
echo "Checking VIP failover..."
ssh root@$BACKUP "ip addr | grep $VIP" && echo "✓ VIP moved to backup"
# 4. Test connectivity
echo "Testing connectivity..."
curl -I http://$VIP && echo "✓ Site accessible"
# 5. Restore master
echo "Restoring master..."
ssh root@$MASTER "systemctl start nginx"
# Wait for failback
sleep 5
# 6. Check VIP returned
echo "Checking VIP failback..."
ssh root@$MASTER "ip addr | grep $VIP" && echo "✓ VIP returned to master"
echo "Failover test complete"
8.4. Recovery Time Objective (RTO)
Measure downtime:
#!/bin/bash
# measure_rto.sh
TARGET="http://192.168.1.100"
LOG_FILE="/var/log/rto-test.log"
START_TIME=$(date +%s)
# Trigger failure
echo "$(date): Starting RTO test" >> $LOG_FILE
ssh root@192.168.1.10 "systemctl stop nginx"
# Monitor until service restored
DOWNTIME=0
while true; do
if curl -sf $TARGET > /dev/null 2>&1; then
END_TIME=$(date +%s)
DOWNTIME=$((END_TIME - START_TIME))
echo "$(date): Service restored after ${DOWNTIME}s" >> $LOG_FILE
break
fi
sleep 1
done
# Restore
ssh root@192.168.1.10 "systemctl start nginx"
echo "RTO: ${DOWNTIME} seconds"
9. Testing HA Setup
9.1. Load Testing with Failover
#!/bin/bash
# load_test_ha.sh
VIP="http://192.168.1.100"
DURATION=300 # 5 minutes
# Start load test in background
echo "Starting load test..."
wrk -t4 -c100 -d${DURATION}s $VIP > /tmp/wrk-results.txt &
WRK_PID=$!
# Wait 60 seconds
sleep 60
# Trigger failover during load test
echo "Triggering failover..."
ssh root@192.168.1.10 "systemctl stop nginx"
# Wait for test to complete
wait $WRK_PID
# Analyze results
echo "Load test complete"
cat /tmp/wrk-results.txt
# Count errors
ERRORS=$(grep "Socket errors" /tmp/wrk-results.txt)
echo "Errors during failover: $ERRORS"
# Restore
ssh root@192.168.1.10 "systemctl start nginx"
9.2. Chaos Engineering
#!/bin/bash
# chaos_test.sh
SERVERS=(
"192.168.1.10"
"192.168.1.11"
"10.0.1.10"
"10.0.1.11"
"10.0.1.12"
)
# Randomly kill services
while true; do
# Random server
SERVER=${SERVERS[$RANDOM % ${#SERVERS[@]}]}
# Random action
ACTIONS=("stop nginx" "network disconnect" "high cpu" "high memory")
ACTION=${ACTIONS[$RANDOM % ${#ACTIONS[@]}]}
echo "$(date): Testing $ACTION on $SERVER"
case $ACTION in
"stop nginx")
ssh root@$SERVER "systemctl stop nginx"
sleep 30
ssh root@$SERVER "systemctl start nginx"
;;
"network disconnect")
# Block only the service ports so the SSH control connection is not cut off
ssh root@$SERVER "iptables -A INPUT -p tcp -m multiport --dports 80,443 -j DROP"
sleep 30
ssh root@$SERVER "iptables -D INPUT -p tcp -m multiport --dports 80,443 -j DROP"
;;
"high cpu")
ssh root@$SERVER "stress-ng --cpu 4 --timeout 30s" &
;;
"high memory")
ssh root@$SERVER "stress-ng --vm 2 --vm-bytes 1G --timeout 30s" &
;;
esac
# Wait before next chaos
sleep 60
done
9.3. Automated HA Tests
#!/bin/bash
# ha_test_suite.sh
run_test() {
local test_name=$1
local test_command=$2
echo "Running: $test_name"
if eval $test_command; then
echo "✓ PASS: $test_name"
return 0
else
echo "✗ FAIL: $test_name"
return 1
fi
}
# Test 1: VIP reachable
run_test "VIP Reachability" "ping -c 3 192.168.1.100"
# Test 2: Service responds
run_test "HTTP Response" "curl -sf http://192.168.1.100/health"
# Test 3: Failover time < 5s
run_test "Failover Time" "./measure_rto.sh | grep -q 'RTO: [0-4] seconds'"
# Test 4: All backends healthy
run_test "Backend Health" "curl -sf http://192.168.1.100/health | grep -q 'healthy'"
# Test 5: Session persistence
run_test "Session Persistence" "./test_session_persistence.sh"
# Test 6: Load distribution
run_test "Load Distribution" "./test_load_distribution.sh"
echo "HA test suite complete"
10. Monitoring HA Setup
10.1. HA Monitoring Dashboard
#!/usr/bin/env python3
# ha_monitor.py
import requests
import time
from datetime import datetime
SERVERS = [
{'name': 'LB-1', 'ip': '192.168.1.10', 'role': 'master'},
{'name': 'LB-2', 'ip': '192.168.1.11', 'role': 'backup'},
{'name': 'Web-1', 'ip': '10.0.1.10', 'role': 'backend'},
{'name': 'Web-2', 'ip': '10.0.1.11', 'role': 'backend'},
{'name': 'Web-3', 'ip': '10.0.1.12', 'role': 'backend'},
]
VIP = '192.168.1.100'
def check_server(server):
try:
response = requests.get(
f"http://{server['ip']}/health",
timeout=2
)
return response.status_code == 200
except requests.RequestException:
return False
def check_vip():
try:
response = requests.get(f"http://{VIP}/health", timeout=2)
return response.status_code == 200
except requests.RequestException:
return False
def display_status():
print("\033[2J\033[H") # Clear screen
print("=" * 60)
print("HA Monitoring Dashboard")
print("=" * 60)
print(f"Time: {datetime.now().strftime('%Y-%m-%d %H:%M:%S')}")
print()
# Check VIP
vip_status = "✓ UP" if check_vip() else "✗ DOWN"
print(f"Virtual IP ({VIP}): {vip_status}")
print()
# Check all servers
print("Server Status:")
print("-" * 60)
for server in SERVERS:
status = "✓ UP" if check_server(server) else "✗ DOWN"
print(f"{server['name']:10} {server['ip']:15} {server['role']:10} {status}")
print()
def main():
while True:
display_status()
time.sleep(5)
if __name__ == '__main__':
main()
10.2. Alerting for HA Events
#!/bin/bash
# ha_alert.sh
WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
send_alert() {
local message=$1
local severity=$2
local emoji="ℹ️"
case $severity in
critical) emoji="🔴" ;;
warning) emoji="🟡" ;;
info) emoji="🟢" ;;
esac
curl -X POST $WEBHOOK_URL \
-H 'Content-Type: application/json' \
-d "{\"text\":\"$emoji HA Alert: $message\"}"
}
# Monitor VIP
VIP="192.168.1.100"
PREVIOUS_STATE="unknown"
while true; do
if ping -c 1 -W 1 $VIP > /dev/null 2>&1; then
CURRENT_STATE="up"
if [ "$PREVIOUS_STATE" == "down" ]; then
send_alert "VIP $VIP is now UP" "info"
fi
else
CURRENT_STATE="down"
if [ "$PREVIOUS_STATE" == "up" ]; then
send_alert "VIP $VIP is DOWN!" "critical"
fi
fi
PREVIOUS_STATE=$CURRENT_STATE
sleep 5
done
Summary
In this lesson, you learned:
- ✅ HA concepts and architecture
- ✅ Advanced health checks (passive/active)
- ✅ Session persistence strategies
- ✅ Keepalived for virtual IPs and failover
- ✅ Complete HA setup with multiple tiers
- ✅ Database load balancing
- ✅ Geographic load balancing
- ✅ Disaster recovery planning
- ✅ HA testing and chaos engineering
- ✅ Monitoring HA infrastructure
Key takeaways:
- Redundancy at every layer
- Automatic failover with Keepalived
- Health checks critical for reliability
- Session persistence for stateful apps
- Geographic distribution for global apps
- Regular testing of failover scenarios
- Comprehensive monitoring and alerting
- Documented DR procedures
HA Checklist:
- ✅ Multiple load balancers with Keepalived
- ✅ Virtual IP configured
- ✅ Health checks implemented
- ✅ Session persistence configured
- ✅ Multiple backend servers
- ✅ Database replication setup
- ✅ Configuration sync automated
- ✅ Monitoring and alerting active
- ✅ DR plan documented and tested
- ✅ Regular failover testing
Production readiness:
- RTO (Recovery Time Objective): < 5 seconds
- RPO (Recovery Point Objective): Near-zero for stateless apps
- Availability target: 99.99% (four nines)
- Regular chaos engineering tests
- Automated incident response
Next lesson: Microservices and Service Mesh - service discovery, API Gateway patterns, per-service rate limiting, circuit breakers, distributed tracing, and Consul/Istio integration to manage complex microservices architectures.