Lesson 13: Automatic Failover
Learn the failure detection mechanism, the leader election process, the failover timeline, and practice simulating a primary node failure.
Objectives
After this lesson, you will be able to:
- Understand Patroni's failure detection mechanism
- Understand the leader election process
- Follow the failover timeline in detail
- Test automatic failover in multiple scenarios
- Troubleshoot failover issues
- Optimize failover speed
1. Automatic Failover Overview
1.1. What Is Failover?
Automatic Failover = automatically promoting a replica to primary when the current primary fails.
Characteristics:
- ⚡ Automatic: no manual intervention required
- 🚨 Unplanned: triggered by an incident
- ⏱️ Fast: 30-60 seconds (configurable)
- 🎯 Goal: minimize downtime
When does failover occur?
- Primary server crashes
- PostgreSQL process dies
- Network partition
- Hardware failure
- DCS connection lost
- Disk full
1.2. Manual vs Automatic Failover
WITHOUT Patroni (Manual Failover):
1. Primary fails
2. DBA gets paged
3. DBA investigates (10-30 mins)
4. DBA manually promotes replica
5. DBA updates application config
6. Service restored
Total downtime: 30+ minutes ❌
WITH Patroni (Automatic Failover):
1. Primary fails
2. Patroni detects (10 seconds)
3. Patroni promotes best replica (20 seconds)
4. Service restored automatically
Total downtime: 30-60 seconds ✅
2. Failure Detection Mechanism
2.1. Health Check Loop
Patroni health check components:
# Pseudo-code of Patroni's main loop (simplified)
while True:
    # 1. Check PostgreSQL health
    if not check_postgresql_running():
        log.error("PostgreSQL is down!")
        handle_postgres_failure()

    # 2. Check DCS connectivity
    if not can_connect_to_dcs():
        log.error("Lost DCS connection!")
        demote_if_leader()

    # 3. Update this member's status in DCS
    update_member_status_in_dcs()

    # 4. Renew the leader lock (if this node is the leader)
    if is_leader:
        renew_leader_lock()

    # 5. Sleep until the next check
    sleep(loop_wait)  # Default: 10 seconds
2.2. PostgreSQL Health Checks
Patroni performs multiple checks:
A. Process check
# Check if postgres process exists
ps aux | grep postgres
# Check if accepting connections
pg_isready -h localhost -p 5432
B. Connection check
# Try to connect to PostgreSQL
import psycopg2

try:
    conn = psycopg2.connect("host=localhost port=5432 dbname=postgres")
    conn.close()
except psycopg2.OperationalError:
    # Connection failed!
    mark_unhealthy()
C. Replication check (on replicas)
-- Check if streaming replication is active (PostgreSQL 13+ column names)
SELECT status, written_lsn, flushed_lsn
FROM pg_stat_wal_receiver;
-- If no row is returned or status != 'streaming' → Problem!
D. Timeline check
-- Ensure timeline matches cluster
SELECT timeline_id FROM pg_control_checkpoint();
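The individual checks above can be combined into a small standalone probe for manual verification. This is a minimal sketch, not Patroni's internal code; the host, port, and postgres user are assumptions from the lab setup.
#!/bin/bash
# health_probe.sh - manual health probe mirroring the checks above (sketch, not Patroni internals)
HOST=${1:-localhost}
PORT=${2:-5432}

# A. Process / connection acceptance
pg_isready -h "$HOST" -p "$PORT" || { echo "not accepting connections"; exit 1; }

# B. Simple query check
psql -h "$HOST" -p "$PORT" -U postgres -d postgres -Atc "SELECT 1;" >/dev/null || { echo "query failed"; exit 1; }

# C. Replication state (only returns a row on a replica)
psql -h "$HOST" -p "$PORT" -U postgres -d postgres -Atc "SELECT status FROM pg_stat_wal_receiver;"

echo "healthy"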
2.3. DCS Connectivity Check
Why DCS connectivity matters:
If node loses DCS connection:
- Cannot renew leader lock
- Cannot read cluster state
- MUST demote to avoid split-brain
Even if PostgreSQL is healthy!
DCS check example:
# Check etcd health
etcdctl endpoint health
# Try to read/write
etcdctl get /service/postgres/leader
etcdctl put /service/postgres/members/node1 "healthy"
2.4. Leader Lock TTL
TTL (Time-To-Live) mechanism:
# In patroni.yml
bootstrap:
  dcs:
    ttl: 30        # Leader lock expires 30 seconds after the last renewal
    loop_wait: 10  # Main loop runs every 10 seconds
Timeline:
T+0s:  Leader acquires lock (expires at T+30s)
T+10s: Leader renews lock (expires at T+40s)
T+20s: Leader renews lock (expires at T+50s)
T+21s: Leader crashes
T+30s: Renewal missed (leader is down)
T+50s: Lock expires in DCS
T+51s: Replicas detect the missing leader key
T+52s: Leader election begins
T+55s: New leader elected
Total detection time: roughly TTL plus up to one loop_wait after the crash (~30-40 seconds)
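You can observe the TTL mechanism directly in etcd: the leader key is attached to a lease, and the remaining lease time shows how long until a crashed leader's lock would expire. A sketch assuming etcd v3, the /service/postgres namespace used in this course, and jq installed; the lease ID conversion step is an assumption about etcdctl's JSON output.
# Read the leader key together with its lease ID
etcdctl get /service/postgres/leader --write-out=json | jq '.kvs[0].lease'

# Convert the decimal lease ID to hex and check its remaining time-to-live
printf '%x\n' <lease_id_decimal>
etcdctl lease timetolive <lease_id_hex>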
3. Leader Election Process
3.1. Election Trigger
Leader election starts when:
Condition 1: Leader lock expired in DCS
/service/postgres/leader → key not found
Condition 2: No active leader for > loop_wait
All replicas see: no leader heartbeat
Condition 3: Explicit failover
patronictl failover command
3.2. Candidate Selection Criteria
Patroni selects best replica based on:
Priority 1: Replication State
-- Prefer streaming over archive recovery
SELECT status FROM pg_stat_wal_receiver;
streaming > in archive recovery > stopped
Priority 2: Replication Lag
-- Replica with the lowest lag (closest to the last known primary LSN) wins.
-- On a replica, the receive/replay gap can be checked with:
SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_gap_bytes;
-- Example:
-- node2: lag = 0 bytes ← BEST
-- node3: lag = 1048576 bytes (1MB)
Priority 3: Timeline
-- Higher timeline = more recent
SELECT timeline_id FROM pg_control_checkpoint();
-- node2: timeline = 3 ← BEST
-- node3: timeline = 2
Priority 4: Tags
# In patroni.yml (tags are local, per-node configuration)
tags:
  nofailover: false       # true = never promote this node
  noloadbalance: false
  failover_priority: 100  # Higher = preferred (Patroni 3.2+); 0 behaves like nofailover: true
Example:
# node2 - Preferred candidate
tags:
  nofailover: false
  failover_priority: 200
# node3 - Lower priority
tags:
  nofailover: false
  failover_priority: 100
# node4 - Never promote
tags:
  nofailover: true
Priority 5: Synchronous State
-- Synchronous replica preferred over async
SELECT sync_state FROM pg_stat_replication;
sync > potential > async
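Before (or after) a failover you can compare candidates yourself through the Patroni REST API: each node reports its role, timeline, and WAL positions. A hedged sketch using the lab IPs and jq; the exact JSON field names may vary slightly between Patroni versions.
# Compare failover candidates via the Patroni REST API (sketch; field names may vary by version)
for node in 10.0.1.12 10.0.1.13; do
  echo "== $node =="
  curl -s "http://$node:8008/patroni" | jq '{role: .role, timeline: .timeline, xlog: .xlog, tags: .tags}'
done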
3.3. Race Condition and Lock Acquisition
Multiple replicas compete:
Scenario: Primary fails, 2 replicas compete
T+0s: node2 and node3 both detect no leader
T+0.1s: Both try to acquire lock simultaneously
In etcd (atomic operation):
node2 tries: PUT /service/postgres/leader "node2" if_not_exists
node3 tries: PUT /service/postgres/leader "node3" if_not_exists
Result: Only ONE succeeds (etcd atomic guarantee)
node2: SUCCESS → becomes leader
node3: FAILED → remains replica
DCS guarantees:
- Atomicity: Only one node gets the lock
- Consistency: All nodes see same leader
- Isolation: No split-brain possible
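The "only one succeeds" behaviour maps to an atomic create in etcd: a transaction that writes the leader key only if it does not exist yet. A minimal sketch with etcdctl's transaction syntax (an assumption about the exact compare grammar); Patroni itself does this through the etcd API and attaches a lease so the key expires after the TTL.
# Atomic "create if not exists" in etcd v3 (txn reads compare / success / failure blocks from stdin)
etcdctl txn <<'EOF'
create("/service/postgres/leader") = "0"

put /service/postgres/leader "node2"


EOF
# If the key already exists, the compare fails and nothing is written - the loser stays a replica.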
3.4. Promotion Process
Winner node executes:
Step 1: Acquire leader lock in DCS
etcdctl put /service/postgres/leader '{"node": "node2", ...}'
Step 2: Run pre_promote callback (if configured)
/var/lib/postgresql/callbacks/pre_promote.sh
Step 3: Promote PostgreSQL
Method A: pg_ctl promote -D /var/lib/postgresql/18/data
Method B: SELECT pg_promote();
Method C: Create trigger file (old method)
Step 4: Wait for promotion complete
Check: SELECT pg_is_in_recovery();
Should return: false (not in recovery = primary)
Step 5: Update timeline
Timeline increments: 1 → 2
Step 6: Run post_promote callback
Update DNS, load balancer, send notifications
Step 7: Run on_role_change callback
/var/lib/postgresql/callbacks/on_role_change.sh master
Step 8: Update DCS with new primary info
xlog_location, timeline, conn_url
Step 9: Start accepting writes
PostgreSQL now in read-write mode
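Steps 2, 6, and 7 run user-provided callback scripts. Patroni passes the action, the new role, and the cluster name as arguments. Below is a minimal sketch of an on_role_change callback; the log file path and the "update DNS / load balancer" placeholder are assumptions to be replaced with your own tooling.
#!/bin/bash
# /var/lib/postgresql/callbacks/on_role_change.sh
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
ACTION="$1"   # e.g. on_role_change
ROLE="$2"     # e.g. master/primary or replica
CLUSTER="$3"  # e.g. postgres

logger -t patroni-callback "action=$ACTION role=$ROLE cluster=$CLUSTER on $(hostname)"

# Only react when this node becomes the primary
if [ "$ROLE" = "master" ] || [ "$ROLE" = "primary" ]; then
    # Placeholder: update DNS / load balancer / send a notification here
    echo "$(date -Is) promoted to $ROLE" >> /var/log/patroni_role_changes.log
fi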
4. Failover Timeline Detailed
4.1. Complete Failover Flow
Timeline of Automatic Failover
T+0s: NORMAL OPERATION
Primary (node1): Healthy, serving requests
Replica (node2): Streaming from node1, lag=0
Replica (node3): Streaming from node1, lag=0
T+1s: PRIMARY FAILS
node1: PostgreSQL crashes / server dies
node2: Still streaming (buffered data)
node3: Still streaming (buffered data)
T+5s: REPLICATION BROKEN
node2: WAL receiver error "connection lost"
node3: WAL receiver error "connection lost"
node1: Still holds leader lock (TTL not expired yet)
T+10s: HEALTH CHECK CYCLE 1
node2: Check replication → FAILED, wait...
node3: Check replication → FAILED, wait...
node1: Cannot renew lock (crashed)
T+20s: HEALTH CHECK CYCLE 2
node2: Still cannot connect to node1
node3: Still cannot connect to node1
T+30s: LEADER LOCK EXPIRES
DCS: /service/postgres/leader TTL expired → key deleted
node2: Detects no leader key
node3: Detects no leader key
T+31s: CANDIDATE ELECTION BEGINS
node2: Check eligibility → YES (lag=0, priority=100)
node3: Check eligibility → YES (lag=1MB, priority=100)
T+32s: RACE FOR LOCK
node2: PUT /service/postgres/leader "node2" → SUCCESS
node3: PUT /service/postgres/leader "node3" → FAILED
T+33s: NODE2 PROMOTES
node2: Run pre_promote callback
node2: pg_promote() executed
node2: Timeline: 1 → 2
T+35s: PROMOTION COMPLETE
node2: pg_is_in_recovery() → false
node2: Now accepting writes
node2: Run post_promote & on_role_change callbacks
T+36s: NODE3 RECONFIGURES
node3: Detects new leader = node2
node3: Update primary_conninfo → node2:5432
node3: Restart WAL receiver
T+38s: REPLICATION RESTORED
node3: Connected to node2
node3: Streaming at timeline 2
T+40s: CLUSTER OPERATIONAL
Primary: node2 (was replica)
Replica: node3 (following node2)
Failed: node1 (needs manual intervention)
Total Downtime: ~35-40 seconds ✅
4.2. Factors Affecting Failover Speed
Configuration parameters:
# Fast failover configuration
bootstrap:
  dcs:
    ttl: 20            # Faster detection (default: 30)
    loop_wait: 5       # More frequent checks (default: 10)
    retry_timeout: 5   # Quicker retries (default: 10)
Trade-offs:
| Parameter | Lower value | Higher value |
|---|---|---|
| ttl | Faster failover, but more false positives | Slower failover, but more stable |
| loop_wait | Faster detection, but more CPU/DCS traffic | Slower reaction, but less DCS traffic |
Typical configurations:
# Conservative (stable, slower)
ttl: 30
loop_wait: 10
→ Failover: ~40-50s
# Balanced (recommended)
ttl: 20
loop_wait: 10
→ Failover: ~30-40s
# Aggressive (fast, sensitive)
ttl: 15
loop_wait: 5
→ Failover: ~20-30s
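These values live in the DCS dynamic configuration, so they can be changed at runtime without editing patroni.yml on every node. A sketch using patronictl; the -s/--set option is assumed to be available in your Patroni version, otherwise use the interactive editor.
# Apply the "balanced" profile to the running cluster (stored as dynamic config in the DCS)
patronictl edit-config postgres -s ttl=20 -s loop_wait=10 -s retry_timeout=10

# Verify
patronictl show-config postgres | grep -E "ttl|loop_wait|retry_timeout"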
5. Testing Automatic Failover
5.1. Test Scenario 1: PostgreSQL Process Kill
Simulate PostgreSQL crash:
# On the current primary (node1), find the postmaster PID
sudo head -1 /var/lib/postgresql/18/data/postmaster.pid
# Returns e.g.: 12345
sudo kill -9 12345   # Kill the postmaster
# Or kill all postgres processes
sudo pkill -9 postgres
# Note: Patroni may simply restart PostgreSQL if it comes back within the TTL;
# to force a failover reliably, stop Patroni itself (sudo systemctl stop patroni) or keep the process down.
Monitor failover:
# Terminal 1: Watch cluster status
watch -n 1 "patronictl list postgres"
# Terminal 2: Monitor logs
sudo journalctl -u patroni -f
# Terminal 3: Test connectivity (ideally against the endpoint your applications use, e.g. HAProxy/VIP;
# probing node1 directly will stay DOWN after failover because node1 itself is dead)
while true; do
    if psql -h 10.0.1.11 -U app_user -d myapp -Atc "SELECT 1;" >/dev/null 2>&1; then
        echo "$(date): UP"
    else
        echo "$(date): DOWN"
    fi
    sleep 1
done
Expected timeline:
00:00 - Cluster healthy
00:01 - Kill postgres on node1
00:02-00:30 - Patroni detecting failure
00:31 - node2 elected as new primary
00:35 - Cluster operational (node2 = primary)
00:36+ - Connections working again
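To turn "connections working again" into a number, record the first failed write and the first successful write around the failover and compute the gap. A small measurement sketch: the client endpoint is an assumption (pass your HAProxy address or VIP), and it reuses the testdb/test_table objects from the verification section later in this lesson.
#!/bin/bash
# measure_downtime.sh - rough write-downtime measurement around a failover (sketch)
HOST=${1:?usage: $0 <client-endpoint-host, e.g. HAProxy or VIP>}
DOWN_AT=""; UP_AT=""

while true; do
    if psql -h "$HOST" -U postgres -d testdb -Atc "INSERT INTO test_table (data) VALUES ('probe');" >/dev/null 2>&1; then
        if [ -n "$DOWN_AT" ] && [ -z "$UP_AT" ]; then
            UP_AT=$(date +%s)
            echo "Recovered after $((UP_AT - DOWN_AT)) seconds of write downtime"
            break
        fi
    else
        [ -z "$DOWN_AT" ] && DOWN_AT=$(date +%s) && echo "First write failure at $(date -Is)"
    fi
    sleep 1
done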
5.2. Test Scenario 2: Network Partition
Simulate network partition:
# On primary node, block traffic to other nodes
sudo iptables -A INPUT -s 10.0.1.12 -j DROP
sudo iptables -A INPUT -s 10.0.1.13 -j DROP
sudo iptables -A OUTPUT -d 10.0.1.12 -j DROP
sudo iptables -A OUTPUT -d 10.0.1.13 -j DROP
# Or block etcd access specifically
sudo iptables -A OUTPUT -p tcp --dport 2379 -j DROP
Observe:
# On node1 (isolated)
patronictl list postgres
# Will show errors / cannot connect to cluster
# On node2/node3
patronictl list postgres
# Will show node1 as unavailable
# After TTL: node2 or node3 becomes leader
Recovery:
# Restore network on node1
sudo iptables -F
# node1 should automatically rejoin as replica
patronictl list postgres
5.3. Test Scenario 3: Server Reboot
Simulate server crash:
# On primary node
sudo reboot
# Or immediate crash
echo c | sudo tee /proc/sysrq-trigger
Expected behavior: Same as Scenario 1, but node completely unavailable.
5.4. Test Scenario 4: Disk Full
Simulate disk full:
# Fill up the disk on the primary (adjust count to the free space available)
dd if=/dev/zero of=/var/lib/postgresql/bigfile bs=1M count=10000
# PostgreSQL will fail once it can no longer write WAL
Patroni will detect PostgreSQL unhealthy → trigger failover.
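After the test, free the disk so the failed node can come back. A short cleanup sketch, assuming the filler file from the example above is the only thing consuming the space:
# Remove the filler file and let Patroni bring the node back as a replica
sudo rm /var/lib/postgresql/bigfile
sudo systemctl restart patroni
patronictl list postgres   # old primary should rejoin as a replica (patronictl reinit may be needed if WAL diverged)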
5.5. Test Scenario 5: DCS Failure
Stop etcd on all nodes:
# On all 3 etcd nodes
sudo systemctl stop etcd
Expected behavior:
- All Patroni nodes lose DCS connection
- Current primary DEMOTES (safety mechanism)
- Cluster enters "read-only" state
- NO failover possible (no DCS consensus)
Recovery:
- Restart etcd cluster
- Patroni auto-recovers
- Leader election happens
6. Verify Failover Success
6.1. Check cluster status
# List cluster members
patronictl list postgres
# Expected after failover:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1 | 10.0.1.11:5432| Replica | stopped | 1 | | ← Old primary
# | node2 | 10.0.1.12:5432| Leader | running | 2 | | ← NEW primary
# | node3 | 10.0.1.13:5432| Replica | running | 2 | 0 |
# +--------+---------------+---------+---------+----+-----------+
# Note timeline changed: 1 → 2
6.2. Verify new primary
# Check primary role
sudo -u postgres psql -h 10.0.1.12 -c "SELECT pg_is_in_recovery();"
# pg_is_in_recovery
# ------------------
# f ← false = PRIMARY
# Check timeline
sudo -u postgres psql -h 10.0.1.12 -c "SELECT timeline_id FROM pg_control_checkpoint();"
# timeline_id
# ------------
# 2
# Check replication from new primary
sudo -u postgres psql -h 10.0.1.12 -c "SELECT * FROM pg_stat_replication;"
# Should show node3 replicating from node2
6.3. Test write operations
# Insert data on new primary
sudo -u postgres psql -h 10.0.1.12 -d testdb -c "
INSERT INTO test_table (data) VALUES ('After failover at ' || NOW());
"
# Verify on replica
sudo -u postgres psql -h 10.0.1.13 -d testdb -c "
SELECT * FROM test_table ORDER BY id DESC LIMIT 5;
"
# Should see new data replicated
6.4. Check failover history
# View history via REST API
curl -s http://10.0.1.12:8008/history | jq
# Output:
# [
# [1, 67108864, "no recovery target specified", "2024-11-25T10:00:00+00:00"],
# [2, 134217728, "no recovery target specified", "2024-11-25T11:30:15+00:00"]
# ]
# ↑ Timeline 2 = Failover event
# Check Patroni logs
sudo journalctl -u patroni --since "30 minutes ago" | grep -i "promote\|failover\|leader"
7. Troubleshooting Failover Issues
7.1. Issue: Failover not happening
Symptoms: Primary down but no promotion.
Possible causes:
A. All replicas tagged nofailover
# Tags are local (per-node) configuration in patroni.yml, not part of the dynamic config,
# so check each node, e.g. via the REST API:
curl -s http://10.0.1.12:8008/patroni | jq '.tags'
# If all replicas have nofailover: true, none of them is eligible for promotion.
# Solution: set nofailover: false in patroni.yml on at least one replica, then reload Patroni:
sudo systemctl reload patroni    # or: patronictl reload postgres <member-name>
B. Replication lag too high
# Check maximum_lag_on_failover
patronictl show-config postgres | grep maximum_lag_on_failover
# If replica lag > threshold, won't promote
# Solution: Increase threshold or wait for lag to decrease
patronictl edit-config postgres
# Set: maximum_lag_on_failover: 10485760 # 10MB
C. No quorum in DCS
# Check etcd health
etcdctl endpoint health --cluster
# If etcd cluster has no quorum (< 2 of 3 healthy)
# Solution: Fix etcd cluster first
sudo systemctl restart etcd
D. synchronous_mode_strict enabled
# If enabled and no sync replica available
synchronous_mode: true
synchronous_mode_strict: true # ← Problem!
# Only the synchronous standby may be promoted; with no sync standby available, no failover happens
# Solution: Disable strict mode
patronictl edit-config postgres
# Set: synchronous_mode_strict: false
7.2. Issue: Multiple failovers (flapping)
Symptoms: Cluster keeps failing over repeatedly.
Possible causes:
A. Network instability
# Check network between nodes
ping -c 100 10.0.1.12
# High packet loss → false failovers
# Solution: Fix network or increase TTL
patronictl edit-config postgres
# Set: ttl: 40 # More tolerant
B. TTL too aggressive
# ttl: 10 ← Too low!
# Every small network blip causes failover
# Solution: Increase TTL
ttl: 30 # More stable
C. Resource exhaustion
# Check CPU/Memory on nodes
top
free -h
# If resources exhausted, health checks timeout
# Solution: Scale up resources or reduce load
7.3. Issue: Slow failover
Symptoms: Takes >60 seconds to failover.
Diagnosis:
# Check TTL and loop_wait
patronictl show-config postgres | grep -E "ttl|loop_wait"
# Estimate the worst-case failover time:
# Estimate ≈ TTL + (loop_wait × 2) + promotion_time
# Example: 30 + (10 × 2) + 5 = 55 seconds
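The same estimate can be computed from the live dynamic configuration instead of the YAML files. A sketch against the Patroni REST /config endpoint; jq, the lab IP, and the ~5-second promotion allowance are assumptions.
# Pull ttl and loop_wait from the running cluster and estimate worst-case failover time
curl -s http://10.0.1.12:8008/config | \
  jq -r '"ttl=\(.ttl) loop_wait=\(.loop_wait) => estimate ~\(.ttl + 2 * .loop_wait + 5)s (incl. ~5s promotion)"'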
Optimization:
# Reduce TTL and loop_wait
bootstrap:
  dcs:
    ttl: 20        # Was 30
    loop_wait: 5   # Was 10
# Expected failover: ~30-35 seconds
7.4. Issue: Data loss after failover
Symptoms: Some recent transactions missing.
Cause: Asynchronous replication + replica lag.
Verification:
# Check replication mode
patronictl show-config postgres | grep synchronous_mode
# Check lag before failover
# (check logs for lag_in_mb at failover time)
sudo journalctl -u patroni | grep "lag_in_mb"
Prevention:
# Enable synchronous replication
bootstrap:
  dcs:
    synchronous_mode: true
    synchronous_mode_strict: false  # Allow degradation to async if no sync standby is available
    postgresql:
      parameters:
        synchronous_commit: 'on'
8. Metrics and Monitoring
8.1. Key failover metrics
-- Current timeline (increments on each failover) and time since the postmaster started
SELECT timeline_id,
       pg_postmaster_start_time(),
       now() - pg_postmaster_start_time() AS uptime
FROM pg_control_checkpoint();
-- Replication lag (pre-failover indicator)
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
replay_lag
FROM pg_stat_replication;
-- Connection and transaction activity per database (gaps or rollback spikes can indicate downtime)
SELECT datname, numbackends, xact_commit, xact_rollback
FROM pg_stat_database;
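Besides SQL, Patroni exposes Prometheus-style metrics on its REST port, which is what the alerting rules below are meant to scrape. A quick check, with the caveat that the exact metric names depend on the Patroni version and/or exporter in use:
# Inspect the metrics exposed by Patroni's REST API on the current primary
curl -s http://10.0.1.12:8008/metrics | grep patroni_ | head -20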
8.2. Alerting rules
Prometheus alert examples (metric names depend on the exporter you use):
groups:
  - name: patroni_failover
    rules:
      - alert: PatroniFailoverDetected
        expr: increase(patroni_timeline[5m]) > 0
        labels:
          severity: warning
        annotations:
          summary: "Patroni failover detected"
          description: "Timeline changed, indicating a failover"

      - alert: PatroniNoLeader
        expr: count(patroni_patroni_info{role="master"}) == 0
        for: 30s
        labels:
          severity: critical
        annotations:
          summary: "No Patroni leader"
          description: "Cluster has no primary"

      - alert: PatroniHighReplicationLag
        expr: patroni_replication_lag_bytes > 10485760  # 10MB
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "High replication lag"
          description: "Replica lag > 10MB, risk of data loss on failover"
9. Best Practices
✅ DO
- Test failover regularly - Monthly in staging, quarterly in production
- Monitor replication lag - Alert if lag > 1MB
- Use synchronous replication for zero data loss
- Set synchronous_mode_strict: false - Allow degradation
- Configure proper TTL - Balance speed vs stability (20-30s)
- Have >= 2 replicas - Allow failover even if one replica down
- Monitor DCS health - etcd cluster must be healthy
- Document runbooks - Procedures for manual intervention
- Log failover events - Track patterns and issues
- Capacity planning - Replicas should handle primary load
❌ DON'T
- Don't use single replica - No failover option
- Don't ignore lag - High lag = data loss risk
- Don't set TTL too low (<15s) - False positives
- Don't skip testing - Untested failover = downtime risk
- Don't manually promote during automatic failover - Let Patroni handle it
- Don't forget about old primary - Needs rejoin/rebuild
- Don't run without monitoring - Must know when failover happens
- Don't overload DCS - Separate etcd cluster recommended
10. Lab Exercises
Lab 1: Basic failover test
Tasks:
1. Record the baseline: patronictl list
2. Stop the primary: sudo systemctl stop patroni
3. Time the failover with watch -n 1 patronictl list
4. Document the downtime duration
5. Verify the new primary accepts writes
6. Restart the old primary and verify it rejoins
Lab 2: Network partition test
Tasks:
1. Use iptables to partition the primary from the cluster
2. Observe DCS behaviour
3. Verify only one primary exists after the partition
4. Restore the network and verify automatic recovery
Lab 3: Optimize failover speed
Tasks:
1. Baseline test with default settings (ttl: 30)
2. Reduce ttl to 20, test again
3. Reduce ttl to 15, test again
4. Compare failover times
5. Evaluate the trade-offs (speed vs false positives)
Lab 4: Failover under load
Tasks:
1. Generate load with pgbench: pgbench -c 10 -T 300
2. During the load, stop the primary
3. Count connection errors in the pgbench output
4. Calculate the availability percentage (see the sketch after this list)
5. Document the user impact
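For step 4 of Lab 4, availability can be approximated from the probe output produced by the monitoring loop in section 5.1. A sketch assuming that output was redirected to a file (here hypothetically named probe.log) with one UP/DOWN line per second:
# Compute availability % from a probe log containing one UP/DOWN line per second
TOTAL=$(wc -l < probe.log)
UP=$(grep -c ": UP" probe.log)
awk -v up="$UP" -v total="$TOTAL" 'BEGIN { printf "Availability: %.2f%% (%d/%d seconds up)\n", 100*up/total, up, total }'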
11. Summary
Key Concepts
✅ Automatic Failover = Self-healing without manual intervention
✅ Detection = Health checks + DCS connectivity + TTL expiration
✅ Election = Best replica based on lag, timeline, tags
✅ Promotion = pg_promote() + timeline increment + role change
✅ Timeline = Failover counter, prevents divergence
✅ TTL = Trade-off between speed and stability
Failover Checklist
- Primary failure detected
- Leader lock expired in DCS
- Best replica identified
- Leader lock acquired
- PostgreSQL promoted successfully
- Timeline incremented
- Callbacks executed
- Other replicas reconfigured
- Replication restored
- Cluster operational
Next Steps
Lesson 14 will cover planned switchover:
- Planned maintenance scenarios
- Zero-downtime switchover process
- Graceful vs immediate switchover
- Best practices for planned failover