Lesson 21: Multi-datacenter Setup
Cross-DC replication strategies, asynchronous cascading replication, disaster recovery planning, and geographic load balancing.
Objectives
After this lesson, you will be able to:
- Design cross-datacenter replication architecture
- Implement cascading replication topology
- Handle network latency and failures
- Configure disaster recovery for multiple sites
- Load balance across geographic locations
1. Multi-DC Architecture Patterns
1.1. Active-Passive (DR standby)
Primary DC (Active):
├─ node1 (Leader)
├─ node2 (Replica)
└─ node3 (Replica)
↓ Async replication
DR DC (Passive):
├─ node4 (Standby)
└─ node5 (Standby)
Use case: Disaster recovery
RPO: Minutes to hours
RTO: Minutes to hours
Cost: Lower (minimal resources in DR)
1.2. Active-Active (Multi-master)
DC1 (Active):
├─ node1 (Leader)
└─ node2 (Replica)
↕ Bi-directional logical replication
DC2 (Active):
├─ node3 (Leader)
└─ node4 (Replica)
Use case: Global applications with regional traffic
RPO: Near-zero
RTO: Near-zero
Cost: Higher (full resources in both DCs)
Note: Requires conflict resolution
1.3. Hub-and-Spoke (Cascading)
Primary DC (Hub):
└─ node1 (Leader)
   ├─ node2 (Replica, DC1 local)
   ├─ node3 (Cascade → DC2)
   └─ node4 (Cascade → DC3)
DC2 (Spoke):
└─ node5 (Replica ← streams from node3)
DC3 (Spoke):
└─ node6 (Replica ← streams from node4)
Use case: Multiple regional read replicas
RPO: Seconds to minutes
Cost: Medium
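When the spokes are members of the same Patroni cluster as the hub, the cascade hop can be expressed with Patroni's replicatefrom tag. A minimal sketch, reusing the node names from the diagram above:
# patroni.yml on node5 (DC2 spoke): stream from the cascade node, not the leader
tags:
  replicatefrom: node3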
2. Cascading Replication Setup
2.1. Architecture
DC1 (us-east):
├─ pg-us-east-1 (Leader)  - 10.1.1.11
├─ pg-us-east-2 (Replica) - 10.1.1.12
└─ pg-us-east-3 (Cascade) - 10.1.1.13
      ↓ WAN replication
DC2 (us-west):
├─ pg-us-west-1 (Replica) - 10.2.1.11  ← receives from pg-us-east-3
└─ pg-us-west-2 (Replica) - 10.2.1.12  ← replicates locally from pg-us-west-1
2.2. Configure cascading node (DC1)
# /etc/patroni/patroni.yml on pg-us-east-3 (cascade node)
scope: postgres-cluster
name: pg-us-east-3

restapi:
  listen: 10.1.1.13:8008
  connect_address: 10.1.1.13:8008

etcd:
  hosts: 10.1.1.11:2379,10.1.1.12:2379,10.1.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.1.1.13:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin
  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password
  parameters:
    # Enable cascading replication
    hot_standby: on
    wal_level: replica
    max_wal_senders: 10          # Extra slots for downstream replicas
    max_replication_slots: 10
    hot_standby_feedback: on
    # Performance tuning for WAN
    wal_sender_timeout: 60s
    wal_receiver_timeout: 60s
  # Allow downstream replicas to connect
  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256  # DC2 subnet

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true   # Can be used as a clone source
  nosync: false
2.3. Configure downstream replica (DC2)
# /etc/patroni/patroni.yml on pg-us-west-1
scope: postgres-cluster-dc2   # Different scope!
name: pg-us-west-1

restapi:
  listen: 10.2.1.11:8008
  connect_address: 10.2.1.11:8008

etcd:
  # Separate etcd cluster for DC2
  hosts: 10.2.1.11:2379,10.2.1.12:2379,10.2.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    standby_cluster:
      # Point to the DC1 cascade node
      host: 10.1.1.13           # pg-us-east-3
      port: 5432
      primary_slot_name: pg_us_west_1_slot
      create_replica_methods:
        - basebackup
  method: basebackup
  basebackup:
    max-rate: '100M'
    checkpoint: 'fast'
    waldir: /var/lib/postgresql/18/data/pg_wal

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.2.1.11:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin
  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password
  parameters:
    hot_standby: on
    wal_level: replica
    max_wal_senders: 5
    max_replication_slots: 5
    hot_standby_feedback: on
    # WAN-optimized settings
    wal_sender_timeout: 120s    # Higher for WAN
    wal_receiver_timeout: 120s
    wal_retrieve_retry_interval: 10s
  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256
    - host all all 10.2.1.0/24 scram-sha-256

tags:
  nofailover: false   # Can become leader within DC2
  noloadbalance: false
  clonefrom: false
2.4. Create replication slot on cascade node
# On pg-us-east-3 (cascade node)
sudo -u postgres psql -c "
SELECT pg_create_physical_replication_slot('pg_us_west_1_slot');
"
2.5. Start DC2 replica
# On pg-us-west-1
sudo systemctl start patroni
sudo systemctl status patroni
# Check replication status
patronictl -c /etc/patroni/patroni.yml list
# Verify it's receiving from the cascade node. On a standby, check
# pg_stat_wal_receiver (pg_stat_replication only shows downstream peers,
# and pg_current_wal_lsn() errors out during recovery)
sudo -u postgres psql -c "
SELECT sender_host, sender_port, status,
       flushed_lsn, latest_end_lsn
FROM pg_stat_wal_receiver;
"
3. Network Latency Handling
3.1. Measure inter-DC latency
# Ping test
ping -c 10 10.2.1.11
# TCP connectivity check (nc reports reachability, not latency)
nc -vz 10.2.1.11 5432
# PostgreSQL connection latency
psql "host=10.2.1.11 user=postgres" -c "SELECT now();"
# iPerf bandwidth test
# On DC2:
iperf3 -s
# On DC1:
iperf3 -c 10.2.1.11 -t 30
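For a round-trip number that includes the PostgreSQL protocol itself, pgbench can hammer a trivial query; the "latency average" line in its output approximates RTT plus query time. A sketch (assumes pgbench is installed; -n skips vacuuming since no pgbench tables exist):
# Measure SQL round-trip latency across the WAN
echo "SELECT 1;" > /tmp/rtt.sql
pgbench -h 10.2.1.11 -U postgres -n -f /tmp/rtt.sql -c 1 -T 10 postgres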
3.2. Optimize for high latency
-- Increase timeouts for WAN
ALTER SYSTEM SET wal_sender_timeout = '120s'; -- Default 60s
ALTER SYSTEM SET wal_receiver_timeout = '120s';
ALTER SYSTEM SET wal_retrieve_retry_interval = '10s';
-- TCP keepalive settings
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 6;
-- Reload
SELECT pg_reload_conf();
3.3. Use WAL compression
-- Compress full-page images in WAL (boolean form available since 9.5;
-- PostgreSQL 15+ also accepts lz4/zstd as values)
ALTER SYSTEM SET wal_compression = on;
-- Can substantially reduce WAL volume, and hence WAN traffic,
-- on write-heavy workloads
SELECT pg_reload_conf();
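To gauge how much of your WAL consists of full-page images (the part wal_compression shrinks), check pg_stat_wal (PostgreSQL 14+):
-- Compare wal_fpi and wal_bytes before/after enabling compression
SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;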
3.4. Limit replication bandwidth
# In patroni.yml
bootstrap:
  method: basebackup
  basebackup:
    max-rate: '50M'   # Limit to 50 MB/s to avoid saturating the WAN
    checkpoint: 'fast'
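The same rate cap applies when seeding a node manually. A sketch with pg_basebackup (run as postgres on the new DC2 node, into an empty data directory):
# Seed a replica over the WAN at a capped 50 MB/s
pg_basebackup -h 10.1.1.13 -U replicator \
  -D /var/lib/postgresql/18/data \
  --max-rate=50M --checkpoint=fast --wal-method=stream --progress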
4. Disaster Recovery Scenarios
4.1. DC1 total failure
# Promote DC2 to primary
# On pg-us-west-1: remove the standby_cluster section from the DCS config
patronictl -c /etc/patroni/patroni.yml edit-config postgres-cluster-dc2 \
  -s standby_cluster=null --force
# Once standby_cluster is removed, Patroni promotes the standby leader
# (pg-us-west-1) automatically; no separate failover command is needed
# Verify
patronictl -c /etc/patroni/patroni.yml list
# + Cluster: postgres-cluster-dc2 +---------+-----------+-----------+
# | Member        | Host      | Role    | State     | Lag in MB |
# +---------------+-----------+---------+-----------+-----------+
# | pg-us-west-1  | 10.2.1.11 | Leader  | running   |         0 |
# | pg-us-west-2  | 10.2.1.12 | Replica | streaming |         0 |
# +---------------+-----------+---------+-----------+-----------+
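For the DR record, capture the standby leader's last received/replayed position just before promoting; these functions only return values while the node is still in recovery:
# Snapshot replication position before promotion (run on pg-us-west-1)
sudo -u postgres psql -c "
SELECT pg_last_wal_receive_lsn()       AS received,
       pg_last_wal_replay_lsn()        AS replayed,
       pg_last_xact_replay_timestamp() AS last_commit_replayed;
"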
4.2. DC1 recovery after failure
# When DC1 comes back online, reintegrate it
# Option 1: Make DC1 follow DC2 (temporary)
# Edit patroni.yml on DC1 nodes to add standby_cluster pointing to DC2
# Option 2: Failback to DC1
# 1. Reconfigure DC1 as a standby cluster following DC2 and wait until
#    it is fully caught up
# 2. Demote DC2: add standby_cluster (pointing to DC1) back into its
#    DCS config
# 3. Promote DC1: remove standby_cluster from the DC1 config
# Note: patronictl switchover only operates within a single scope, so a
# cross-DC failback is performed by flipping the standby_cluster settings
# on each side, not with a switchover command.
4.3. Split-brain prevention
# Recommended: each DC runs its own etcd cluster (as configured in 2.2/2.3)
# A single etcd cluster stretched across two DCs is fragile: a 2/2 member
# split cannot keep quorum when the WAN link fails.
# If you must stretch etcd, use an odd member count with a tiebreaker in a
# third site, e.g.:
etcd:
  hosts:
    - 10.1.1.11:2379   # DC1
    - 10.1.1.12:2379   # DC1
    - 10.2.1.11:2379   # DC2
    - 10.2.1.12:2379   # DC2
    - 10.3.1.11:2379   # DC3 tiebreaker (illustrative address)
Note: for true split-brain prevention, consider:
- An odd number of sites (3+ DCs) or a witness node
- Fencing mechanisms (STONITH)
- Quorum-based decisions
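To verify each DC's DCS quorum is healthy, etcd's v3 CLI can report member status. A sketch for the DC1 cluster:
# Check quorum health of DC1's etcd members
etcdctl --endpoints=10.1.1.11:2379,10.1.1.12:2379,10.1.1.13:2379 \
  endpoint status --write-out=table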
5. Geographic Load Balancing
5.1. HAProxy with geo-awareness
Architecture:
Users (us-east) → HAProxy-DC1 → PG-DC1 (primary)
Users (us-west) → HAProxy-DC2 → PG-DC2 (replicas) - reads
                              ↳ PG-DC1 (primary)  - writes

# /etc/haproxy/haproxy.cfg on HAProxy-DC1 (us-east)
frontend postgres_front
    bind *:5432
    mode tcp
    default_backend postgres_master

backend postgres_master
    mode tcp
    option tcp-check
    tcp-check connect
    tcp-check send-binary 0000000804d2162f   # full 8-byte SSLRequest message
    tcp-check expect binary 4e               # 'N' (no SSL)
    server pg-us-east-1 10.1.1.11:5432 check inter 3000
    server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
# /etc/haproxy/haproxy.cfg on HAProxy-DC2 (us-west)
frontend postgres_front_read
    bind *:5432
    mode tcp
    default_backend postgres_replicas

frontend postgres_front_write
    bind *:5433
    mode tcp
    default_backend postgres_master_remote

backend postgres_replicas
    # Local read replicas
    mode tcp
    balance roundrobin
    option tcp-check
    server pg-us-west-1 10.2.1.11:5432 check inter 3000
    server pg-us-west-2 10.2.1.12:5432 check inter 3000

backend postgres_master_remote
    # Writes go to the primary in DC1
    mode tcp
    option tcp-check
    server pg-us-east-1 10.1.1.11:5432 check inter 3000
    server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
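The raw TCP checks above cannot tell a primary from a replica after a failover. A more role-aware variant (a sketch, assuming Patroni's REST API on port 8008, where /primary returns 200 only on the current leader):
backend postgres_master
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-us-east-1 10.1.1.11:5432 check port 8008
    server pg-us-east-2 10.1.1.12:5432 check port 8008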
5.2. DNS-based routing
# Use DNS with geo-location
# GeoDNS service (Route53, Cloudflare, etc.)
# US-East users resolve to:
postgres.example.com → 10.1.1.100 (HAProxy-DC1)
# US-West users resolve to:
postgres.example.com → 10.2.1.100 (HAProxy-DC2)
# Configure health checks to failover on DC failure
5.3. Application-level routing
# Python example with psycopg2
import time

import psycopg2

DC_HOSTS = {'dc1': '10.1.1.11', 'dc2': '10.2.1.11'}

def get_postgres_endpoint():
    """Pick the PostgreSQL endpoint with the lowest connection latency."""
    latencies = {}
    for dc, host in DC_HOSTS.items():
        try:
            # Time a full connection handshake to each DC
            start = time.time()
            conn = psycopg2.connect(
                host=host, user='app', password='pass',
                dbname='mydb', connect_timeout=3
            )
            conn.close()
            latencies[dc] = time.time() - start
        except psycopg2.OperationalError:
            latencies[dc] = float('inf')
    # Return the DC with the lowest latency
    best_dc = min(latencies, key=latencies.get)
    return DC_HOSTS[best_dc]

# Use it
conn = psycopg2.connect(
    host=get_postgres_endpoint(),
    user='app', password='pass', dbname='mydb'
)
6. Cross-DC Monitoring
6.1. Monitor replication lag
-- On the cascade node (DC1). The cascade node is itself a standby, so use
-- pg_last_wal_receive_lsn() rather than pg_current_wal_lsn(), which errors
-- out during recovery
SELECT client_addr, application_name,
       state, sync_state,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), sent_lsn) AS sending_lag,
       pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag,
       pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag
FROM pg_stat_replication
WHERE application_name LIKE '%west%';
# Typical cross-DC replication lag (rough, workload-dependent figures):
# - Low-latency WAN (< 10 ms): roughly 0-10 MB
# - Medium-latency WAN (10-50 ms): roughly 10-50 MB
# - High-latency WAN (> 50 ms): roughly 50-200 MB
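An inactive downstream (for example, a severed WAN link) makes its slot retain WAL indefinitely, so also watch retained WAL per slot. On the cascade node (a standby), use pg_last_wal_receive_lsn():
-- WAL pinned by each replication slot on the cascade node
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                      restart_lsn)) AS retained_wal
FROM pg_replication_slots;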
6.2. Prometheus exporters
# prometheus.yml
scrape_configs:
  - job_name: 'postgres-dc1'
    static_configs:
      - targets:
          - '10.1.1.11:9187'
          - '10.1.1.12:9187'
          - '10.1.1.13:9187'
        labels:
          datacenter: 'us-east'
  - job_name: 'postgres-dc2'
    static_configs:
      - targets:
          - '10.2.1.11:9187'
          - '10.2.1.12:9187'
        labels:
          datacenter: 'us-west'
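The CrossDCLatency alert in 6.3 below expects a blackbox-exporter probe job. A minimal sketch (the exporter address blackbox-exporter:9115 and the tcp_connect module name are assumptions for illustration):
  - job_name: 'blackbox-dc2'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['10.2.1.11:5432']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115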
6.3. Alert rules for cross-DC
# /etc/prometheus/alerts/multi-dc.yml
groups:
  - name: multi-dc
    rules:
      - alert: CrossDCReplicationLag
        # Assumes the exporter reports this metric in bytes
        expr: |
          pg_replication_lag{datacenter="us-west"} > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag to DC2"
          description: "Replication lag to {{ $labels.instance }} is {{ $value | humanize }}B"
      - alert: CrossDCReplicationBroken
        expr: |
          pg_replication_status{datacenter="us-west"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication to DC2 is broken"
      - alert: CrossDCLatency
        expr: |
          probe_duration_seconds{job="blackbox-dc2"} > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network latency to DC2"
7. Backup Strategy for Multi-DC
7.1. Per-DC backups
# DC1 backup
pgbackrest --stanza=main --type=full backup
# DC2 backup (can use DC1's backup repo over WAN)
pgbackrest --stanza=main --type=diff --repo1-host=10.1.1.11 backup
7.2. Geo-replicated backup storage
# pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=my-postgres-backups-us-east
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com
# S3 cross-region replication enabled:
# us-east-1 → us-west-2
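As an alternative to S3 cross-region replication, pgBackRest itself can write to multiple repositories (multi-repo support, pgBackRest 2.33+). A sketch; bucket names are illustrative and credentials/endpoints are omitted:
[global]
repo1-type=s3
repo1-s3-bucket=my-postgres-backups-us-east
repo1-s3-region=us-east-1
repo2-type=s3
repo2-s3-bucket=my-postgres-backups-us-west
repo2-s3-region=us-west-2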
7.3. Backup verification
# Restore test in the DR site (pgbackrest expects the command last)
pgbackrest --stanza=main \
  --type=time \
  --target="2024-11-25 10:00:00" \
  --repo1-host=backup-server \
  --pg1-path=/var/lib/postgresql/18/restore_test \
  restore
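Between full restore tests, repository integrity can be checked more cheaply with the verify command available in newer pgBackRest releases:
# Verify checksums of backups and archived WAL in the repository
pgbackrest --stanza=main verify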
8. Best Practices
✅ DO
- Use cascading replication - Reduces load on primary
- Separate etcd clusters - Per-DC for independence
- Monitor replication lag - Alert on high lag
- Test failover regularly - Quarterly DR drills
- Use replication slots - Prevent WAL deletion
- Compress WAL - Reduce WAN bandwidth
- Limit base backup rate - Avoid WAN saturation
- Implement geo-routing - Reduce latency for users
- Document topology - Clear architecture diagrams
- Automate failover - But with human approval for DR
❌ DON'T
- Don't default to synchronous replication cross-DC - WAN latency hits every commit (see 10.3 for the quorum-based exception)
- Don't share etcd across WAN - Split-brain risk
- Don't ignore network latency - Tune timeouts
- Don't forget about WAL retention - Use slots
- Don't skip DR testing - Must validate regularly
- Don't use single DC for backups - Geo-replicate
- Don't over-complicate - Start simple, add complexity as needed
9. Lab Exercises
Lab 1: Setup cascading replication
Tasks:
- Configure cascade node in DC1
- Setup downstream replica in DC2
- Create replication slot
- Verify replication lag
- Monitor with Prometheus
Lab 2: Test DR failover
Tasks:
- Simulate DC1 failure (stop all nodes)
- Promote DC2 to primary
- Verify application connectivity
- Document RTO/RPO
- Plan failback procedure
Lab 3: Geo-aware load balancing
Tasks:
- Setup HAProxy in each DC
- Configure geo-based routing
- Test read/write routing
- Measure latency improvement
- Implement health checks
Lab 4: Cross-DC monitoring
Tasks:
- Configure Prometheus multi-DC scraping
- Create Grafana dashboard with DC labels
- Setup alert rules for cross-DC lag
- Test alerting on simulated failure
- Document runbook for alerts
10. Advanced Topics
10.1. Three-datacenter setup
DC1 (us-east):
└─ pg1 (Leader)
   ├─ pg2 (Replica)
   └─ pg3 (Cascade → DC2)
DC2 (us-west):
└─ pg4 (Replica ← pg3)
   ├─ pg5 (Replica)
   └─ pg6 (Cascade → DC3)
DC3 (eu-central):
└─ pg7 (Replica ← pg6)
   └─ pg8 (Replica)
Use case: Global application with regional reads
10.2. Active-active with logical replication
-- DC1 publication
CREATE PUBLICATION dc1_pub FOR ALL TABLES;
-- DC2 subscription
CREATE SUBSCRIPTION dc2_sub
CONNECTION 'host=10.1.1.11 dbname=mydb user=replicator'
PUBLICATION dc1_pub
WITH (copy_data = true);
-- DC2 publication (for bi-directional)
CREATE PUBLICATION dc2_pub FOR ALL TABLES;
-- DC1 subscription
CREATE SUBSCRIPTION dc1_sub
CONNECTION 'host=10.2.1.11 dbname=mydb user=replicator'
PUBLICATION dc2_pub
WITH (copy_data = false); -- Already have data
-- Conflict resolution required!
-- See: https://www.postgresql.org/docs/current/logical-replication-conflicts.html
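On PostgreSQL 16+, the origin option also prevents changes from looping endlessly between the two subscriptions. A sketch of how dc1_sub above would be created with origin filtering (dc2_sub would get the same option):
-- Only replicate rows that originated locally on DC2
CREATE SUBSCRIPTION dc1_sub
  CONNECTION 'host=10.2.1.11 dbname=mydb user=replicator'
  PUBLICATION dc2_pub
  WITH (copy_data = false, origin = none);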
10.3. Quorum-based commit
# For strong consistency across DCs
postgresql:
  parameters:
    synchronous_standby_names: 'ANY 2 (pg-us-east-2, pg-us-west-1, pg-eu-central-1)'
    synchronous_commit: 'remote_apply'
# Requires 2 of 3 DCs to acknowledge commit
# Provides strong durability but higher latency
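To confirm the quorum set is in effect, check sync_state on the primary; members of an ANY set report 'quorum':
-- Verify quorum-based synchronous standbys
SELECT application_name, sync_state, sync_priority
FROM pg_stat_replication;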
11. Summary
Multi-DC Strategies
| Pattern | RPO | RTO | Complexity | Cost |
|---|---|---|---|---|
| Active-Passive (DR) | Minutes | Minutes | Low | Low |
| Cascading Replicas | Seconds | Seconds | Medium | Medium |
| Active-Active | Near-zero | Near-zero | High | High |
| Hub-and-Spoke | Seconds | Minutes | Medium | Medium |
Key Metrics (typical targets)
Replication Lag: < 50 MB over WAN
Network Latency: < 100 ms is generally workable
Throughput: 50-100 MB/s typical for WAN links
RPO Target: < 5 minutes
RTO Target: < 15 minutes
Checklist
☐ Cascading replication configured
☐ Separate etcd per DC
☐ Replication slots created
☐ WAL compression enabled
☐ Timeouts tuned for WAN
☐ Geo-aware load balancing
☐ Cross-DC monitoring
☐ DR failover tested
☐ Backup geo-replication
☐ Documentation complete
Next Steps
Lesson 22 will cover Patroni on Kubernetes:
- StatefulSets configuration
- Patroni Kubernetes operator
- PersistentVolumes setup
- Helm charts usage
- K8s-specific considerations