Lesson 21: Multi-datacenter Setup
Cross-DC replication strategies, asynchronous cascading replication, disaster recovery planning, and geographic load balancing.
Objectives
After this lesson, you will be able to:
- Design cross-datacenter replication architecture
- Implement cascading replication topology
- Handle network latency and failures
- Configure disaster recovery for multiple sites
- Load balance across geographic locations
1. Multi-DC Architecture Patterns
1.1. Active-Passive (DR standby)
Primary DC (Active):
├─ node1 (Leader)
├─ node2 (Replica)
└─ node3 (Replica)
↓ Async replication
DR DC (Passive):
├─ node4 (Standby)
└─ node5 (Standby)
Use case: Disaster recovery
RPO: Minutes to hours
RTO: Minutes to hours
Cost: Lower (minimal resources in DR)
1.2. Active-Active (Multi-master)
DC1 (Active):
├─ node1 (Leader)
└─ node2 (Replica)
↕ Bi-directional logical replication
DC2 (Active):
├─ node3 (Leader)
└─ node4 (Replica)
Use case: Global applications with regional traffic
RPO: Near-zero
RTO: Near-zero
Cost: Higher (full resources in both DCs)
Note: Requires conflict resolution
1.3. Hub-and-Spoke (Cascading)
Primary DC (Hub):
└─ node1 (Leader)
   ├─ node2 (Replica, DC1 local)
   ├─ node3 (Cascade → DC2)
   └─ node4 (Cascade → DC3)
DC2 (Spoke):
└─ node5 (Replica ← streams from node3)
DC3 (Spoke):
└─ node6 (Replica ← streams from node4)
Use case: Multiple regional read replicas
RPO: Seconds to minutes
Cost: Medium
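When the spokes are members of the same Patroni cluster as the hub, the cascade hop can be expressed with Patroni's replicatefrom tag. A minimal sketch, reusing the node names from the diagram above:
# patroni.yml on node5 (DC2 spoke): stream from the cascade node, not the leader
tags:
  replicatefrom: node3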
2. Cascading Replication Setup
2.1. Architecture
DC1 (us-east):
├─ pg-us-east-1 (Leader)  - 10.1.1.11
├─ pg-us-east-2 (Replica) - 10.1.1.12
└─ pg-us-east-3 (Cascade) - 10.1.1.13
      ↓ WAN replication
DC2 (us-west):
├─ pg-us-west-1 (Replica) - 10.2.1.11  ← receives from pg-us-east-3
└─ pg-us-west-2 (Replica) - 10.2.1.12  ← replicates locally from pg-us-west-1
2.2. Configure cascading node (DC1)
# /etc/patroni/patroni.yml on pg-us-east-3 (cascade node)
scope: postgres-cluster
name: pg-us-east-3

restapi:
  listen: 10.1.1.13:8008
  connect_address: 10.1.1.13:8008

etcd:
  hosts: 10.1.1.11:2379,10.1.1.12:2379,10.1.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.1.1.13:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin
  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password
  parameters:
    # Enable cascading replication
    hot_standby: on
    wal_level: replica
    max_wal_senders: 10          # Extra slots for downstream replicas
    max_replication_slots: 10
    hot_standby_feedback: on
    # Performance tuning for WAN
    wal_sender_timeout: 60s
    wal_receiver_timeout: 60s
  # Allow downstream replicas to connect
  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256  # DC2 subnet

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: true   # Can be used as a clone source
  nosync: false
2.3. Configure downstream replica (DC2)
# /etc/patroni/patroni.yml on pg-us-west-1
scope: postgres-cluster-dc2   # Different scope!
name: pg-us-west-1

restapi:
  listen: 10.2.1.11:8008
  connect_address: 10.2.1.11:8008

etcd:
  # Separate etcd cluster for DC2
  hosts: 10.2.1.11:2379,10.2.1.12:2379,10.2.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    standby_cluster:
      # Point to the DC1 cascade node
      host: 10.1.1.13           # pg-us-east-3
      port: 5432
      primary_slot_name: pg_us_west_1_slot
      create_replica_methods:
        - basebackup
  method: basebackup
  basebackup:
    max-rate: '100M'
    checkpoint: 'fast'
    waldir: /var/lib/postgresql/18/data/pg_wal

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.2.1.11:5432
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin
  authentication:
    replication:
      username: replicator
      password: rep_password
    superuser:
      username: postgres
      password: postgres_password
  parameters:
    hot_standby: on
    wal_level: replica
    max_wal_senders: 5
    max_replication_slots: 5
    hot_standby_feedback: on
    # WAN-optimized settings
    wal_sender_timeout: 120s    # Higher for WAN
    wal_receiver_timeout: 120s
    wal_retrieve_retry_interval: 10s
  pg_hba:
    - host replication replicator 10.2.1.0/24 scram-sha-256
    - host all all 10.2.1.0/24 scram-sha-256

tags:
  nofailover: false   # Can become leader within DC2
  noloadbalance: false
  clonefrom: false
2.4. Create replication slot on cascade node
# On pg-us-east-3 (cascade node)
sudo -u postgres psql -c "
SELECT pg_create_physical_replication_slot('pg_us_west_1_slot');
"
2.5. Start DC2 replica
# On pg-us-west-1
sudo systemctl start patroni
sudo systemctl status patroni
# Check replication status
patronictl -c /etc/patroni/patroni.yml list
# Verify it's receiving from the cascade node. On a standby, check
# pg_stat_wal_receiver (pg_stat_replication only shows downstream peers,
# and pg_current_wal_lsn() errors out during recovery)
sudo -u postgres psql -c "
SELECT sender_host, sender_port, status,
       flushed_lsn, latest_end_lsn
FROM pg_stat_wal_receiver;
"
3. Network Latency Handling
3.1. Measure inter-DC latency
# Ping test
ping -c 10 10.2.1.11
# TCP connectivity check (nc reports reachability, not latency)
nc -vz 10.2.1.11 5432
# PostgreSQL connection latency
psql "host=10.2.1.11 user=postgres" -c "SELECT now();"
# iPerf bandwidth test
# On DC2:
iperf3 -s
# On DC1:
iperf3 -c 10.2.1.11 -t 30
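For a round-trip number that includes the PostgreSQL protocol itself, pgbench can hammer a trivial query; the "latency average" line in its output approximates RTT plus query time. A sketch (assumes pgbench is installed; -n skips vacuuming since no pgbench tables exist):
# Measure SQL round-trip latency across the WAN
echo "SELECT 1;" > /tmp/rtt.sql
pgbench -h 10.2.1.11 -U postgres -n -f /tmp/rtt.sql -c 1 -T 10 postgres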
3.2. Optimize for high latency
-- Increase timeouts for WAN
ALTER SYSTEM SET wal_sender_timeout = '120s'; -- Default 60s
ALTER SYSTEM SET wal_receiver_timeout = '120s';
ALTER SYSTEM SET wal_retrieve_retry_interval = '10s';
-- TCP keepalive settings
ALTER SYSTEM SET tcp_keepalives_idle = 60;
ALTER SYSTEM SET tcp_keepalives_interval = 10;
ALTER SYSTEM SET tcp_keepalives_count = 6;
-- Reload
SELECT pg_reload_conf();
3.3. Use WAL compression
-- Compress full-page images in WAL (boolean form available since 9.5;
-- PostgreSQL 15+ also accepts lz4/zstd as values)
ALTER SYSTEM SET wal_compression = on;
-- Can substantially reduce WAL volume, and hence WAN traffic,
-- on write-heavy workloads
SELECT pg_reload_conf();
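To gauge how much of your WAL consists of full-page images (the part wal_compression shrinks), check pg_stat_wal (PostgreSQL 14+):
-- Compare wal_fpi and wal_bytes before/after enabling compression
SELECT wal_records, wal_fpi, wal_bytes FROM pg_stat_wal;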
3.4. Limit replication bandwidth
# In patroni.yml
bootstrap:
  method: basebackup
  basebackup:
    max-rate: '50M'   # Limit to 50 MB/s to avoid saturating the WAN
    checkpoint: 'fast'
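The same rate cap applies when seeding a node manually. A sketch with pg_basebackup (run as postgres on the new DC2 node, into an empty data directory):
# Seed a replica over the WAN at a capped 50 MB/s
pg_basebackup -h 10.1.1.13 -U replicator \
  -D /var/lib/postgresql/18/data \
  --max-rate=50M --checkpoint=fast --wal-method=stream --progress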
4. Disaster Recovery Scenarios
4.1. DC1 total failure
# Promote DC2 to primary
# On pg-us-west-1: remove the standby_cluster section from the DCS config
patronictl -c /etc/patroni/patroni.yml edit-config postgres-cluster-dc2 \
  -s standby_cluster=null --force
# Once standby_cluster is removed, Patroni promotes the standby leader
# (pg-us-west-1) automatically; no separate failover command is needed
# Verify
patronictl -c /etc/patroni/patroni.yml list
# + Cluster: postgres-cluster-dc2 +---------+-----------+-----------+
# | Member        | Host      | Role    | State     | Lag in MB |
# +---------------+-----------+---------+-----------+-----------+
# | pg-us-west-1  | 10.2.1.11 | Leader  | running   |         0 |
# | pg-us-west-2  | 10.2.1.12 | Replica | streaming |         0 |
# +---------------+-----------+---------+-----------+-----------+
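For the DR record, capture the standby leader's last received/replayed position just before promoting; these functions only return values while the node is still in recovery:
# Snapshot replication position before promotion (run on pg-us-west-1)
sudo -u postgres psql -c "
SELECT pg_last_wal_receive_lsn()       AS received,
       pg_last_wal_replay_lsn()        AS replayed,
       pg_last_xact_replay_timestamp() AS last_commit_replayed;
"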
4.2. DC1 recovery after failure
# When DC1 comes back online, reintegrate it
# Option 1: Make DC1 follow DC2 (temporary)
# Edit patroni.yml on DC1 nodes to add standby_cluster pointing to DC2
# Option 2: Failback to DC1
# 1. Reconfigure DC1 as a standby cluster following DC2 and wait until
#    it is fully caught up
# 2. Demote DC2: add standby_cluster (pointing to DC1) back into its
#    DCS config
# 3. Promote DC1: remove standby_cluster from the DC1 config
# Note: patronictl switchover only operates within a single scope, so a
# cross-DC failback is performed by flipping the standby_cluster settings
# on each side, not with a switchover command.
4.3. Split-brain prevention
# Recommended: each DC runs its own etcd cluster (as configured in 2.2/2.3)
# A single etcd cluster stretched across two DCs is fragile: a 2/2 member
# split cannot keep quorum when the WAN link fails.
# If you must stretch etcd, use an odd member count with a tiebreaker in a
# third site, e.g.:
etcd:
  hosts:
    - 10.1.1.11:2379   # DC1
    - 10.1.1.12:2379   # DC1
    - 10.2.1.11:2379   # DC2
    - 10.2.1.12:2379   # DC2
    - 10.3.1.11:2379   # DC3 tiebreaker (illustrative address)
Note: for true split-brain prevention, consider:
- An odd number of sites (3+ DCs) or a witness node
- Fencing mechanisms (STONITH)
- Quorum-based decisions
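To verify each DC's DCS quorum is healthy, etcd's v3 CLI can report member status. A sketch for the DC1 cluster:
# Check quorum health of DC1's etcd members
etcdctl --endpoints=10.1.1.11:2379,10.1.1.12:2379,10.1.1.13:2379 \
  endpoint status --write-out=table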
5. Geographic Load Balancing
5.1. HAProxy with geo-awareness
Architecture:
Users (us-east) → HAProxy-DC1 → PG-DC1 (primary)
Users (us-west) → HAProxy-DC2 → PG-DC2 (replicas) - reads
                              ↳ PG-DC1 (primary)  - writes

# /etc/haproxy/haproxy.cfg on HAProxy-DC1 (us-east)
frontend postgres_front
    bind *:5432
    mode tcp
    default_backend postgres_master

backend postgres_master
    mode tcp
    option tcp-check
    tcp-check connect
    tcp-check send-binary 0000000804d2162f   # full 8-byte SSLRequest message
    tcp-check expect binary 4e               # 'N' (no SSL)
    server pg-us-east-1 10.1.1.11:5432 check inter 3000
    server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
# /etc/haproxy/haproxy.cfg on HAProxy-DC2 (us-west)
frontend postgres_front_read
    bind *:5432
    mode tcp
    default_backend postgres_replicas

frontend postgres_front_write
    bind *:5433
    mode tcp
    default_backend postgres_master_remote

backend postgres_replicas
    # Local read replicas
    mode tcp
    balance roundrobin
    option tcp-check
    server pg-us-west-1 10.2.1.11:5432 check inter 3000
    server pg-us-west-2 10.2.1.12:5432 check inter 3000

backend postgres_master_remote
    # Writes go to the primary in DC1
    mode tcp
    option tcp-check
    server pg-us-east-1 10.1.1.11:5432 check inter 3000
    server pg-us-east-2 10.1.1.12:5432 check inter 3000 backup
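The raw TCP checks above cannot tell a primary from a replica after a failover. A more role-aware variant (a sketch, assuming Patroni's REST API on port 8008, where /primary returns 200 only on the current leader):
backend postgres_master
    mode tcp
    option httpchk GET /primary
    http-check expect status 200
    default-server inter 3s fall 3 rise 2 on-marked-down shutdown-sessions
    server pg-us-east-1 10.1.1.11:5432 check port 8008
    server pg-us-east-2 10.1.1.12:5432 check port 8008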
5.2. DNS-based routing
# Use DNS with geo-location
# GeoDNS service (Route53, Cloudflare, etc.)
# US-East users resolve to:
postgres.example.com → 10.1.1.100 (HAProxy-DC1)
# US-West users resolve to:
postgres.example.com → 10.2.1.100 (HAProxy-DC2)
# Configure health checks to failover on DC failure
5.3. Application-level routing
# Python example with psycopg2
import time

import psycopg2

DC_HOSTS = {'dc1': '10.1.1.11', 'dc2': '10.2.1.11'}

def get_postgres_endpoint():
    """Pick the PostgreSQL endpoint with the lowest connection latency."""
    latencies = {}
    for dc, host in DC_HOSTS.items():
        try:
            # Time a full connection handshake to each DC
            start = time.time()
            conn = psycopg2.connect(
                host=host, user='app', password='pass',
                dbname='mydb', connect_timeout=3
            )
            conn.close()
            latencies[dc] = time.time() - start
        except psycopg2.OperationalError:
            latencies[dc] = float('inf')
    # Return the DC with the lowest latency
    best_dc = min(latencies, key=latencies.get)
    return DC_HOSTS[best_dc]

# Use it
conn = psycopg2.connect(
    host=get_postgres_endpoint(),
    user='app', password='pass', dbname='mydb'
)
6. Cross-DC Monitoring
6.1. Monitor replication lag
-- On the cascade node (DC1). The cascade node is itself a standby, so use
-- pg_last_wal_receive_lsn() rather than pg_current_wal_lsn(), which errors
-- out during recovery
SELECT client_addr, application_name,
       state, sync_state,
       pg_wal_lsn_diff(pg_last_wal_receive_lsn(), sent_lsn) AS sending_lag,
       pg_wal_lsn_diff(sent_lsn, write_lsn) AS write_lag,
       pg_wal_lsn_diff(write_lsn, flush_lsn) AS flush_lag,
       pg_wal_lsn_diff(flush_lsn, replay_lsn) AS replay_lag
FROM pg_stat_replication
WHERE application_name LIKE '%west%';
# Typical cross-DC replication lag (rough, workload-dependent figures):
# - Low-latency WAN (< 10 ms): roughly 0-10 MB
# - Medium-latency WAN (10-50 ms): roughly 10-50 MB
# - High-latency WAN (> 50 ms): roughly 50-200 MB
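An inactive downstream (for example, a severed WAN link) makes its slot retain WAL indefinitely, so also watch retained WAL per slot. On the cascade node (a standby), use pg_last_wal_receive_lsn():
-- WAL pinned by each replication slot on the cascade node
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_last_wal_receive_lsn(),
                                      restart_lsn)) AS retained_wal
FROM pg_replication_slots;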
6.2. Prometheus exporters
# prometheus.yml
scrape_configs:
  - job_name: 'postgres-dc1'
    static_configs:
      - targets:
          - '10.1.1.11:9187'
          - '10.1.1.12:9187'
          - '10.1.1.13:9187'
        labels:
          datacenter: 'us-east'
  - job_name: 'postgres-dc2'
    static_configs:
      - targets:
          - '10.2.1.11:9187'
          - '10.2.1.12:9187'
        labels:
          datacenter: 'us-west'
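The CrossDCLatency alert in 6.3 below expects a blackbox-exporter probe job. A minimal sketch (the exporter address blackbox-exporter:9115 and the tcp_connect module name are assumptions for illustration):
  - job_name: 'blackbox-dc2'
    metrics_path: /probe
    params:
      module: [tcp_connect]
    static_configs:
      - targets: ['10.2.1.11:5432']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115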
6.3. Alert rules for cross-DC
# /etc/prometheus/alerts/multi-dc.yml
groups:
  - name: multi-dc
    rules:
      - alert: CrossDCReplicationLag
        # Assumes the exporter reports this metric in bytes
        expr: |
          pg_replication_lag{datacenter="us-west"} > 100 * 1024 * 1024
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag to DC2"
          description: "Replication lag to {{ $labels.instance }} is {{ $value | humanize }}B"
      - alert: CrossDCReplicationBroken
        expr: |
          pg_replication_status{datacenter="us-west"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Replication to DC2 is broken"
      - alert: CrossDCLatency
        expr: |
          probe_duration_seconds{job="blackbox-dc2"} > 0.1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High network latency to DC2"
7. Backup Strategy for Multi-DC
7.1. Per-DC backups
# DC1 backup
pgbackrest --stanza=main --type=full backup
# DC2 backup (can use DC1's backup repo over WAN)
pgbackrest --stanza=main --type=diff --repo1-host=10.1.1.11 backup
7.2. Geo-replicated backup storage
# pgbackrest.conf
[global]
repo1-type=s3
repo1-s3-bucket=my-postgres-backups-us-east
repo1-s3-region=us-east-1
repo1-s3-endpoint=s3.amazonaws.com
# S3 cross-region replication enabled:
# us-east-1 → us-west-2
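As an alternative to S3 cross-region replication, pgBackRest itself can write to multiple repositories (multi-repo support, pgBackRest 2.33+). A sketch; bucket names are illustrative and credentials/endpoints are omitted:
[global]
repo1-type=s3
repo1-s3-bucket=my-postgres-backups-us-east
repo1-s3-region=us-east-1
repo2-type=s3
repo2-s3-bucket=my-postgres-backups-us-west
repo2-s3-region=us-west-2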
7.3. Backup verification
# Restore test in the DR site (pgbackrest expects the command last)
pgbackrest --stanza=main \
  --type=time \
  --target="2024-11-25 10:00:00" \
  --repo1-host=backup-server \
  --pg1-path=/var/lib/postgresql/18/restore_test \
  restore
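Between full restore tests, repository integrity can be checked more cheaply with the verify command available in newer pgBackRest releases:
# Verify checksums of backups and archived WAL in the repository
pgbackrest --stanza=main verify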
8. Best Practices
✅ DO
- Use cascading replication - Reduces load on primary
- Separate etcd clusters - Per-DC for independence
- Monitor replication lag - Alert on high lag
- Test failover regularly - Quarterly DR drills
- Use replication slots - Prevent WAL deletion
- Compress WAL - Reduce WAN bandwidth
- Limit base backup rate - Avoid WAN saturation
- Implement geo-routing - Reduce latency for users
- Document topology - Clear architecture diagrams
- Automate failover - But with human approval for DR
❌ DON'T
- Don't default to synchronous replication cross-DC - WAN latency hits every commit (see 10.3 for the quorum-based exception)
- Don't share etcd across WAN - Split-brain risk
- Don't ignore network latency - Tune timeouts
- Don't forget about WAL retention - Use slots
- Don't skip DR testing - Must validate regularly
- Don't use single DC for backups - Geo-replicate
- Don't over-complicate - Start simple, add complexity as needed
9. Lab Exercises
Lab 1: Setup cascading replication
Tasks:
- Configure cascade node in DC1
- Setup downstream replica in DC2
- Create replication slot
- Verify replication lag
- Monitor with Prometheus
Lab 2: Test DR failover
Tasks:
- Simulate DC1 failure (stop all nodes)
- Promote DC2 to primary
- Verify application connectivity
- Document RTO/RPO
- Plan failback procedure
Lab 3: Geo-aware load balancing
Tasks:
- Setup HAProxy in each DC
- Configure geo-based routing
- Test read/write routing
- Measure latency improvement
- Implement health checks
Lab 4: Cross-DC monitoring
Tasks:
- Configure Prometheus multi-DC scraping
- Create Grafana dashboard with DC labels
- Setup alert rules for cross-DC lag
- Test alerting on simulated failure
- Document runbook for alerts
10. Advanced Topics
10.1. Three-datacenter setup
DC1 (us-east):
└─ pg1 (Leader)
   ├─ pg2 (Replica)
   └─ pg3 (Cascade → DC2)
DC2 (us-west):
└─ pg4 (Replica ← pg3)
   ├─ pg5 (Replica)
   └─ pg6 (Cascade → DC3)
DC3 (eu-central):
└─ pg7 (Replica ← pg6)
   └─ pg8 (Replica)
Use case: Global application with regional reads
10.2. Active-active with logical replication
-- DC1 publication
CREATE PUBLICATION dc1_pub FOR ALL TABLES;
-- DC2 subscription
CREATE SUBSCRIPTION dc2_sub
CONNECTION 'host=10.1.1.11 dbname=mydb user=replicator'
PUBLICATION dc1_pub
WITH (copy_data = true);
-- DC2 publication (for bi-directional)
CREATE PUBLICATION dc2_pub FOR ALL TABLES;
-- DC1 subscription
CREATE SUBSCRIPTION dc1_sub
CONNECTION 'host=10.2.1.11 dbname=mydb user=replicator'
PUBLICATION dc2_pub
WITH (copy_data = false); -- Already have data
-- Conflict resolution required!
-- See: https://www.postgresql.org/docs/current/logical-replication-conflicts.html
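On PostgreSQL 16+, the origin option also prevents changes from looping endlessly between the two subscriptions. A sketch of how dc1_sub above would be created with origin filtering (dc2_sub would get the same option):
-- Only replicate rows that originated locally on DC2
CREATE SUBSCRIPTION dc1_sub
  CONNECTION 'host=10.2.1.11 dbname=mydb user=replicator'
  PUBLICATION dc2_pub
  WITH (copy_data = false, origin = none);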
10.3. Quorum-based commit
# For strong consistency across DCs
postgresql:
  parameters:
    synchronous_standby_names: 'ANY 2 (pg-us-east-2, pg-us-west-1, pg-eu-central-1)'
    synchronous_commit: 'remote_apply'
# Requires 2 of 3 DCs to acknowledge commit
# Provides strong durability but higher latency
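To confirm the quorum set is in effect, check sync_state on the primary; members of an ANY set report 'quorum':
-- Verify quorum-based synchronous standbys
SELECT application_name, sync_state, sync_priority
FROM pg_stat_replication;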
11. Summary
Multi-DC Strategies
| Pattern | RPO | RTO | Complexity | Cost |
|---|---|---|---|---|
| Active-Passive (DR) | Minutes | Minutes | Low | Low |
| Cascading Replicas | Seconds | Seconds | Medium | Medium |
| Active-Active | Near-zero | Near-zero | High | High |
| Hub-and-Spoke | Seconds | Minutes | Medium | Medium |
Key Metrics (typical targets)
Replication Lag: < 50 MB over WAN
Network Latency: < 100 ms is generally workable
Throughput: 50-100 MB/s typical for WAN links
RPO Target: < 5 minutes
RTO Target: < 15 minutes
Checklist
☐ Cascading replication configured
☐ Separate etcd per DC
☐ Replication slots created
☐ WAL compression enabled
☐ Timeouts tuned for WAN
☐ Geo-aware load balancing
☐ Cross-DC monitoring
☐ DR failover tested
☐ Backup geo-replication
☐ Documentation complete
Next Steps
Lesson 22 will cover Patroni on Kubernetes:
- StatefulSets configuration
- Patroni Kubernetes operator
- PersistentVolumes setup
- Helm charts usage
- K8s-specific considerations