Lesson 15: Recovering failed nodes
Rejoin a failed primary to the cluster, use the pg_rewind mechanism, and rebuild replicas from backup when needed.
Objectives
After this lesson, you will be able to:
- Rejoin an old primary after failover
- Use pg_rewind to resynchronize data
- Rebuild a replica with pg_basebackup
- Handle timeline divergence
- Recover from split-brain scenarios
- Automate recovery with Patroni
1. Node Recovery Overview
1.1. Recovery Scenarios
When do you need to recover a node?
Scenario 1: Old primary sau failover
Before:
node1 (primary) → FAILS
node2 (replica) → promoted to primary
After:
node1: Needs rejoin as replica
node2: Current primary
Scenario 2: Replica disconnected
Before:
node3 (replica) → Network partition / Crash
After:
node3: Needs to catch up with primary
Scenario 3: Hardware replacement
Before:
node2: Disk failure
After:
node2: New disk, needs full rebuild
Scenario 4: Timeline divergence
Before:
node1 accepted writes AFTER losing leader lock
After:
node1: Diverged timeline, conflicts with cluster
1.2. Recovery Methods
| Method | When to use | Time | Data loss |
|---|---|---|---|
| Auto-rejoin | Node was shut down cleanly | ~10s | None |
| pg_rewind | Timeline divergence | ~1-5min | None |
| pg_basebackup | Major corruption / Full rebuild | ~30min+ | None |
| Manual recovery | Complex split-brain scenarios | Varies | Possible |
2. Auto-Rejoin (Patroni Default)
2.1. How auto-rejoin works
When node comes back online:
1. Patroni starts
2. Checks DCS for cluster state
3. Finds current leader (e.g., node2)
4. Compares local timeline with cluster timeline
5. If compatible → auto-rejoin as replica
6. If diverged → need pg_rewind or reinit
2.2. Example: Clean rejoin
Setup:
# Current cluster state
patronictl list postgres
# + Cluster: postgres ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +--------+-------------+---------+---------+----+-----------+
# | node1 | 10.0.1.11 | Leader | running | 2 | |
# | node2 | 10.0.1.12 | Replica | running | 2 | 0 |
# | node3 | 10.0.1.13 | Replica | running | 2 | 0 |
# +--------+-------------+---------+---------+----+-----------+
Simulate node3 failure:
# On node3: Stop Patroni cleanly
sudo systemctl stop patroni
# Cluster now:
# | node1 | 10.0.1.11 | Leader | running | 2 | |
# | node2 | 10.0.1.12 | Replica | running | 2 | 0 |
# | node3 | 10.0.1.13 | - | stopped | - | | ← Down
Recovery:
# On node3: Start Patroni
sudo systemctl start patroni
# Watch logs
sudo journalctl -u patroni -f
Log output:
2024-11-25 10:00:00 INFO: Starting Patroni...
2024-11-25 10:00:01 INFO: Connected to DCS (etcd)
2024-11-25 10:00:02 INFO: Cluster timeline: 2, local timeline: 2 ✅
2024-11-25 10:00:03 INFO: Current leader: node1
2024-11-25 10:00:04 INFO: Rejoining as replica
2024-11-25 10:00:05 INFO: Starting PostgreSQL in recovery mode
2024-11-25 10:00:08 INFO: Replication started, streaming from node1
2024-11-25 10:00:10 INFO: Successfully rejoined cluster ✅
Verify:
patronictl list postgres
# + Cluster: postgres ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +--------+-------------+---------+---------+----+-----------+
# | node1 | 10.0.1.11 | Leader | running | 2 | |
# | node2 | 10.0.1.12 | Replica | running | 2 | 0 |
# | node3 | 10.0.1.13 | Replica | running | 2 | 0 | ← Rejoined!
# +--------+-------------+---------+---------+----+-----------+
Time: ~10 seconds ✅
2.3. Configuration for auto-rejoin
# In patroni.yml
postgresql:
use_pg_rewind: true # Enable automatic pg_rewind if needed
remove_data_directory_on_rewind_failure: false # Safety
remove_data_directory_on_diverged_timelines: false # Safety
# Patroni will attempt:
# 1. Auto-rejoin (if timelines match)
# 2. pg_rewind (if timeline diverged but recoverable)
# 3. Full reinit (if pg_rewind fails and auto-reinit enabled)
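If you script recovery, a small polling loop against the rejoining member's REST API tells you when the auto-rejoin has finished. A minimal sketch, assuming node3's API at 10.0.1.13:8008 and a 60-second budget:
# Sketch: wait until a rejoining member reports state=running via its REST API
NODE_API=http://10.0.1.13:8008
for i in $(seq 1 60); do
  state=$(curl -s --max-time 2 "$NODE_API/patroni" | jq -r '.state' 2>/dev/null)
  if [ "$state" = "running" ]; then echo "node rejoined (state=running)"; break; fi
  echo "waiting... (state=${state:-unreachable})"; sleep 1
done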
3. Using pg_rewind
3.1. What is pg_rewind?
pg_rewind = Tool to resync a PostgreSQL instance that diverged from the current timeline.
When needed:
Scenario: Old primary received writes AFTER failover
Timeline:
T+0: node1 (primary), node2 (replica)
T+1: Network partition
T+2: node2 promoted (timeline: 1 → 2)
T+3: node1 still thinks it's primary, accepts writes (timeline: 1)
T+4: Network restored
T+5: Conflict! node1 timeline=1, cluster timeline=2
Solution: pg_rewind node1 to match node2's timeline
How it works:
1. Find common ancestor (last shared WAL position)
2. Replay WAL from new primary
3. Overwrite conflicting blocks
4. Node rejoins as replica on new timeline
3.2. Prerequisites for pg_rewind
Requirements:
# In patroni.yml → postgresql.parameters
wal_log_hints: 'on' # Required! (alternative: data checksums, see below)
# Or use data checksums (set during initdb):
# initdb --data-checksums
# Also ensure:
max_wal_senders: 10 # For replication
wal_level: replica # For replication
Why wal_log_hints?
Without wal_log_hints:
pg_rewind cannot determine which blocks changed
→ Cannot resync
→ Must use full rebuild (pg_basebackup)
With wal_log_hints:
PostgreSQL tracks all block changes
→ pg_rewind can identify divergence
→ Fast resync ✅
Trade-off: ~1-2% write performance overhead
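A quick way to confirm the prerequisites on a node is a handful of SHOW queries (run locally as postgres); either wal_log_hints or data checksums being on satisfies pg_rewind:
# Sketch: verify pg_rewind prerequisites on a node
sudo -u postgres psql -Atc "SHOW wal_log_hints;"    # want: on
sudo -u postgres psql -Atc "SHOW data_checksums;"   # 'on' here also satisfies pg_rewind
sudo -u postgres psql -Atc "SHOW wal_level;"        # want: replica (or logical)
sudo -u postgres psql -Atc "SHOW max_wal_senders;"  # want: > 0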
3.3. Manual pg_rewind
Scenario: node1 (old primary) needs resync after failover.
Step 1: Stop PostgreSQL on node1
# On node1
sudo systemctl stop patroni
sudo systemctl stop postgresql
Step 2: Run pg_rewind
# On node1: Rewind to match node2 (current primary)
sudo -u postgres pg_rewind \
--target-pgdata=/var/lib/postgresql/18/data \
--source-server="host=10.0.1.12 port=5432 user=replicator dbname=postgres" \
--progress \
--debug
# Output:
# connected to server
# servers diverged at WAL location 0/3000000 on timeline 1
# rewinding from last common checkpoint at 0/2000000 on timeline 1
# reading source file list
# reading target file list
# reading WAL in target
# need to copy 124 MB (total source directory size is 2048 MB)
# creating backup label and updating control file
# syncing target data directory
# Done!
Step 3: Create standby.signal
# On node1: Mark as standby
sudo -u postgres touch /var/lib/postgresql/18/data/standby.signal
Step 4: Update primary_conninfo
# On node1: Point to new primary (node2)
sudo -u postgres tee /var/lib/postgresql/18/data/postgresql.auto.conf <<EOF
primary_conninfo = 'host=10.0.1.12 port=5432 user=replicator password=replica_password'
EOF
Step 5: Start PostgreSQL
# On node1
sudo systemctl start patroni
# Patroni will start PostgreSQL in recovery mode
Step 6: Verify
patronictl list postgres
# node1 should now be a Replica following node2 ✅
Time: ~1-5 minutes (depends on divergence size)
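After the node is back, it is worth confirming that it is actually in recovery and on the cluster's timeline. A short check, assuming 10.0.1.12 is still the current primary as in the example above:
# Sketch: confirm node1 is in recovery and on the same timeline as the cluster
sudo -u postgres psql -Atc "SELECT pg_is_in_recovery();"                       # expect: t
sudo -u postgres psql -Atc "SELECT timeline_id FROM pg_control_checkpoint();"
curl -s http://10.0.1.12:8008/patroni | jq '.timeline'                         # should match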
3.4. Automatic pg_rewind (Patroni)
Enable in patroni.yml:
# Patroni will automatically run pg_rewind if needed
postgresql:
use_pg_rewind: true
parameters:
wal_log_hints: 'on' # Required!
Behavior:
When node rejoins after failover:
1. Patroni detects timeline divergence
2. Automatically runs pg_rewind
3. Restarts PostgreSQL as replica
4. Node rejoins cluster
No manual intervention needed! ✅
Example log:
2024-11-25 10:05:00 INFO: Local timeline 1, cluster timeline 2
2024-11-25 10:05:01 WARNING: Timeline divergence detected
2024-11-25 10:05:02 INFO: use_pg_rewind enabled, attempting rewind...
2024-11-25 10:05:03 INFO: Running pg_rewind...
2024-11-25 10:05:45 INFO: pg_rewind completed successfully
2024-11-25 10:05:46 INFO: Starting PostgreSQL as replica
2024-11-25 10:05:50 INFO: Rejoined cluster ✅
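To double-check that the automation is actually in place on a node, grep the local Patroni configuration and confirm the running server has the parameter enabled. The /etc/patroni/patroni.yml path below is an assumption; use wherever your patroni.yml lives:
# Sketch: check that rewind automation and its prerequisite are configured
grep -nE 'use_pg_rewind|wal_log_hints' /etc/patroni/patroni.yml
sudo -u postgres psql -Atc "SHOW wal_log_hints;"   # must be on for pg_rewind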
4. Full Rebuild with pg_basebackup
4.1. When to use pg_basebackup
Use cases:
- pg_rewind failed - Data too diverged
- Corruption detected - Data integrity issues
- Major version upgrade - Different PostgreSQL versions
- New node - Adding fresh replica to cluster
- Disk replaced - Empty data directory
- Paranoid safety - Want guaranteed clean state
Trade-off: Slower (~30min-2hrs for large DB) but guaranteed clean.
4.2. Manual pg_basebackup
Step 1: Stop and clean node
# On node to rebuild (e.g., node3)
sudo systemctl stop patroni
sudo systemctl stop postgresql
# Remove old data directory
sudo rm -rf /var/lib/postgresql/18/data/*
Step 2: Take base backup from primary
# On node3: Backup from current primary (node2)
sudo -u postgres pg_basebackup \
-h 10.0.1.12 \
-p 5432 \
-U replicator \
-D /var/lib/postgresql/18/data \
-Fp \
-Xs \
-P \
-R
# Flags:
# -h: Host (primary)
# -U: Replication user
# -D: Target data directory
# -Fp: Plain format (not tar)
# -Xs: Stream WAL during backup
# -P: Show progress
# -R: Create standby.signal and replication config
Output:
Password: [enter replicator password]
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/4000000 on timeline 2
pg_basebackup: starting background WAL receiver
24567/24567 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/4000168
pg_basebackup: syncing data to disk ...
pg_basebackup: base backup completed
Step 3: Verify configuration
# On node3: Check standby.signal created
ls /var/lib/postgresql/18/data/standby.signal
# Check primary_conninfo
cat /var/lib/postgresql/18/data/postgresql.auto.conf | grep primary_conninfo
Step 4: Start node
# On node3
sudo systemctl start patroni
# Node will rejoin as replica
Step 5: Verify
patronictl list postgres
# node3 should be streaming from primary ✅
Time: ~30min-2hrs (depends on database size)
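To set expectations before you start, check the total size pg_basebackup will have to copy (run on the current primary; actual transfer time depends on network and disk throughput):
# Sketch: estimate rebuild size on the current primary
sudo -u postgres psql -Atc "
SELECT pg_size_pretty(sum(pg_database_size(datname))) FROM pg_database;"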
4.3. Patroni automatic reinit
Enable auto-reinit:
# In patroni.yml
postgresql:
use_pg_rewind: true
# If pg_rewind fails, auto-reinit
remove_data_directory_on_rewind_failure: true
remove_data_directory_on_diverged_timelines: true
# WARNING: Data directory will be DELETED and recreated
# Only enable if you trust automation!
Behavior:
When node rejoins:
1. Try auto-rejoin → FAILED (diverged)
2. Try pg_rewind → FAILED (corruption)
3. Automatically remove data directory
4. Run pg_basebackup from current primary
5. Rejoin as replica
Fully automated! But destructive! ⚠️
4.4. Patroni reinit command
Manual trigger:
# Force reinit on node3
patronictl reinit postgres node3
# Patroni will:
# 1. Stop PostgreSQL on node3
# 2. Remove data directory
# 3. Run pg_basebackup from leader
# 4. Start as replica
# Prompt:
# Are you sure you want to reinitialize members node3? [y/N]: y
Monitor progress:
# On node3: Watch logs
sudo journalctl -u patroni -f
# Expected:
# INFO: Removing data directory...
# INFO: Running pg_basebackup...
# INFO: Backup completed (24 GB in 15 minutes)
# INFO: Starting PostgreSQL...
# INFO: Rejoined cluster ✅
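While the rebuilt member catches up, you can follow its lag from the leader. Patroni normally sets each member's application_name to its node name, so node3 shows up by name in pg_stat_replication:
# Sketch: follow node3's catch-up (run on the leader)
sudo -u postgres psql -Atc "
SELECT application_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag
FROM pg_stat_replication;"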
5. Timeline Divergence Resolution
5.1. Understanding timelines
Timeline = History branch counter
Initial:
Timeline 1 (all nodes)
After first failover:
Old primary: Timeline 1
New primary: Timeline 2 ← Incremented
After second failover:
Timeline 3 ← Incremented again
Why timelines exist:
Make divergence detectable:
If two nodes both think they're primary,
their writes end up on different timelines.
→ The divergence is detected when the old primary tries to rejoin
→ Manual intervention (pg_rewind or rebuild) instead of silent corruption
5.2. Detecting timeline divergence
Check local timeline:
# On any node
sudo -u postgres psql -c "
SELECT timeline_id
FROM pg_control_checkpoint();
"
# Example:
# timeline_id
# ------------
# 2
Check cluster timeline:
# Via Patroni: the TL column in the member table shows each node's timeline
patronictl list postgres
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# (the number in parentheses is the cluster's system identifier, not the timeline)
# Or via REST API
curl -s http://10.0.1.12:8008/patroni | jq '.timeline'
# Output: 2
Compare:
# If node timeline ≠ cluster timeline
# → Node needs pg_rewind or reinit
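The comparison is easy to script. A minimal sketch, assuming PostgreSQL is running locally (on a stopped node use pg_controldata instead) and the leader's REST API is at 10.0.1.12:8008:
# Sketch: compare this node's timeline with the cluster timeline
local_tl=$(sudo -u postgres psql -Atc "SELECT timeline_id FROM pg_control_checkpoint();")
cluster_tl=$(curl -s http://10.0.1.12:8008/patroni | jq -r '.timeline')
if [ "$local_tl" -eq "$cluster_tl" ]; then
  echo "Timelines match ($local_tl): clean rejoin expected"
elif [ "$local_tl" -lt "$cluster_tl" ]; then
  echo "Local TL $local_tl < cluster TL $cluster_tl: pg_rewind (or reinit) needed"
else
  echo "Local TL $local_tl > cluster TL $cluster_tl: investigate, possible split-brain"
fi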
5.3. Scenario: Timeline divergence after split-brain
Setup:
T+0: 3-node cluster, node1 = primary (timeline 2)
T+1: Network partition splits node1 from node2/node3
T+2: node1 thinks it's still primary (timeline 2)
T+3: node2/node3 elect node2 as primary (timeline 3)
T+4: Both node1 and node2 accept writes!
- node1: timeline 2, accepting writes ❌
- node2: timeline 3, accepting writes ✅
- Split-brain! ⚠️
T+5: Network restored
T+6: Conflict detected
Resolution:
# Step 1: Verify which timeline is "correct"
patronictl list postgres
# + Cluster: postgres ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +--------+-------------+---------+---------+----+-----------+
# | node1 | 10.0.1.11 | - | stopped | 2 | | ← WRONG timeline
# | node2 | 10.0.1.12 | Leader | running | 3 | | ← CORRECT
# | node3 | 10.0.1.13 | Replica | running | 3 | 0 |
# +--------+-------------+---------+---------+----+-----------+
# Step 2: Save diverged data from node1 (if needed)
sudo -u postgres pg_dumpall -h 10.0.1.11 > /backup/node1-diverged-data.sql
# Step 3: Bring node1 back onto timeline 3
# With use_pg_rewind enabled, simply restart Patroni on node1 and it will
# attempt pg_rewind automatically:
sudo systemctl start patroni    # On node1
# If pg_rewind fails (likely with significant divergence), force a full
# rebuild via Patroni:
patronictl reinit postgres node1
# ...or rebuild manually with pg_basebackup:
sudo systemctl stop patroni     # On node1
sudo rm -rf /var/lib/postgresql/18/data/*
sudo -u postgres pg_basebackup -h 10.0.1.12 -D /var/lib/postgresql/18/data -U replicator -R -P
sudo systemctl start patroni
# Step 4: Manually reconcile diverged data (if important)
# Review /backup/node1-diverged-data.sql
# Manually merge important transactions into node2
Prevention:
# Configure Patroni to prevent split-brain
bootstrap:
dcs:
# Primary loses leader lock → immediately demote
ttl: 30
retry_timeout: 10
postgresql:
parameters:
# Commits must be confirmed by a synchronous standby, so an isolated
# primary cannot durably commit writes on its own
synchronous_commit: 'remote_apply' # Requires a configured synchronous standby
6. Split-Brain Prevention and Recovery
6.1. How Patroni prevents split-brain
Mechanism: DCS Leader Lock
Primary MUST hold leader lock in DCS:
If primary loses DCS connection:
1. Cannot renew leader lock
2. TTL expires (e.g., 30 seconds)
3. Primary DEMOTES itself (becomes read-only)
4. Replicas detect no leader
5. Election begins
Key: Primary NEVER operates without DCS lock ✅
Code flow (pseudo):
while True:
if is_leader:
if can_renew_leader_lock():
# Still leader, continue
accept_writes()
else:
# Lost DCS connection!
log.error("Lost leader lock, DEMOTING!")
demote_to_replica()
reject_writes()
sleep(loop_wait)
6.2. Fencing mechanisms
PostgreSQL-level fencing:
-- When demoted, set read-only
ALTER SYSTEM SET default_transaction_read_only = 'on';
SELECT pg_reload_conf();
-- All new transactions will fail:
-- ERROR: cannot execute INSERT in a read-only transaction
OS-level fencing (advanced):
# STONITH (Shoot The Other Node In The Head)
# Via callbacks in patroni.yml
callbacks:
on_start: /var/lib/postgresql/callbacks/on_start.sh
on_stop: /var/lib/postgresql/callbacks/on_stop.sh
on_role_change: /var/lib/postgresql/callbacks/on_role_change.sh
# on_role_change.sh example:
#!/bin/bash
# Patroni calls callbacks with three arguments: <action> <role> <cluster_name>
ACTION=$1   # here: "on_role_change"
ROLE=$2     # "master"/"primary" or "replica"
if [ "$ROLE" == "replica" ]; then
# Lost leadership, ensure NO writes possible
sudo iptables -A INPUT -p tcp --dport 5432 -j REJECT
# Block incoming connections to PostgreSQL
fi
if [ "$ROLE" == "master" ] || [ "$ROLE" == "primary" ]; then
# Gained leadership, allow writes
sudo iptables -D INPUT -p tcp --dport 5432 -j REJECT
fi
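To put the callback in place, make it executable and dry-run it once with the same arguments Patroni passes (action, role, cluster name). This assumes the postgres user is allowed to run iptables via sudo, which the script above already relies on:
# Sketch: install the callback and dry-run it with Patroni-style arguments
sudo chmod +x /var/lib/postgresql/callbacks/on_role_change.sh
sudo -u postgres /var/lib/postgresql/callbacks/on_role_change.sh on_role_change replica postgres
sudo iptables -L INPUT -n | grep 5432   # the REJECT rule should now be present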
6.3. Scenario: Recover from split-brain
Detection:
# Symptoms:
# - Multiple nodes claim to be primary
# - Patroni shows errors
# - Applications seeing inconsistent data
# Check cluster state
patronictl list postgres
# If you see multiple "Leader" or conflicts:
# SPLIT-BRAIN DETECTED! ⚠️
Recovery steps:
# Step 1: STOP ALL NODES immediately
for node in node1 node2 node3; do
ssh $node "sudo systemctl stop patroni"
done
# Step 2: Determine "source of truth"
# Usually: the node with the most recent data / highest timeline.
# PostgreSQL is stopped at this point, so inspect the data directory with
# pg_controldata (ships with the server binaries) instead of psql:
for node in node1 node2 node3; do
echo "=== $node ==="
ssh $node "sudo -u postgres pg_controldata /var/lib/postgresql/18/data | grep -E 'TimeLineID|REDO location'"
done
# Step 3: Choose winner (e.g., node2 has highest timeline)
WINNER="node2"
# Step 4: Back up diverged data from the losers
# (pg_dumpall needs a running server: start PostgreSQL manually on each loser,
#  without Patroni, take the dump, then stop it again)
ssh node1 "sudo -u postgres pg_dumpall > /backup/node1-diverged.sql"
ssh node3 "sudo -u postgres pg_dumpall > /backup/node3-diverged.sql"
# Step 5: Wipe losers and rebuild from winner
for node in node1 node3; do
ssh $node "sudo rm -rf /var/lib/postgresql/18/data/*"
ssh $node "sudo -u postgres pg_basebackup \
-h $WINNER \
-D /var/lib/postgresql/18/data \
-U replicator -R -P"
done
# Step 6: Clear DCS state (fresh start)
etcdctl del --prefix /service/postgres/
# Step 7: Start winner first
ssh $WINNER "sudo systemctl start patroni"
# Wait for winner to become leader
sleep 10
# Step 8: Start other nodes
ssh node1 "sudo systemctl start patroni"
ssh node3 "sudo systemctl start patroni"
# Step 9: Verify cluster
patronictl list postgres
# Should show:
# node2: Leader
# node1: Replica (following node2)
# node3: Replica (following node2)
# All same timeline ✅
# Step 10: Reconcile diverged data manually
# Review /backup/*-diverged.sql files
# Merge critical transactions if needed
7. Monitoring Node Recovery
7.1. Key metrics
-- Replication status
SELECT application_name,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
replay_lag,
sync_state
FROM pg_stat_replication;
-- Timeline check
SELECT timeline_id FROM pg_control_checkpoint();
-- Recovery status (on replica)
SELECT pg_is_in_recovery(),
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn(),
pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()) AS replay_lag_bytes;
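These queries are easy to wrap in a small check that can run from cron on each replica. A sketch with an assumed 100 MB threshold and a local connection as postgres:
# Sketch: warn when replay lag on this replica exceeds a threshold
THRESHOLD=$((100 * 1024 * 1024))   # 100 MB
lag=$(sudo -u postgres psql -Atc "
SELECT COALESCE(pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn()), 0);")
if [ "$lag" -gt "$THRESHOLD" ]; then
  echo "WARNING: replay lag is $lag bytes"
fi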
7.2. Patroni REST API monitoring
# Check node status (querying the leader, 10.0.1.12, also shows its replication connections)
curl -s http://10.0.1.12:8008/patroni | jq
# Key fields:
# {
# "state": "running",
# "role": "primary",
# "timeline": 3,
# "replication": [
# {
# "usename": "replicator",
# "application_name": "node1",
# "state": "streaming",
# "sync_state": "async",
# "replay_lsn": "0/5000000"
# }
# ]
# }
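For a one-line summary per member, loop over every node's REST API (the addresses are the ones used throughout this lesson):
# Sketch: role, timeline, and state of every member via the REST API
for host in 10.0.1.11 10.0.1.12 10.0.1.13; do
  curl -s --max-time 2 "http://$host:8008/patroni" \
    | jq -r --arg h "$host" '[$h, .role, (.timeline|tostring), .state] | join("  ")'
done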
7.3. Alerting on recovery issues
# Prometheus alert
groups:
- name: node_recovery
rules:
- alert: PatroniNodeDown
expr: up{job="patroni"} == 0
for: 1m
labels:
severity: warning
annotations:
summary: "Patroni node {{ $labels.instance }} is down"
- alert: PatroniTimelineMismatch
expr: |
# fires when members of one cluster report more than one distinct timeline
count by (cluster) (count by (cluster, timeline) (patroni_timeline)) > 1
labels:
severity: critical
annotations:
summary: "Timeline mismatch detected - possible split-brain"
- alert: PatroniReplicationLagHigh
expr: patroni_replication_lag_bytes > 104857600 # 100MB
for: 5m
labels:
severity: warning
annotations:
summary: "Replication lag > 100MB on {{ $labels.instance }}"
8. Best Practices
✅ DO
- Enable wal_log_hints - Required for pg_rewind
- Test recovery regularly - Monthly drills
- Monitor timelines - Alert on divergence
- Have backups - Before risky operations
- Document procedures - Recovery runbooks
- Use Patroni auto-recovery - Less manual intervention
- Verify after recovery - Test replication, queries
- Keep DCS healthy - etcd cluster critical
- Log everything - Audit trail for incidents
- Practice split-brain recovery - Hope never needed, but be ready
❌ DON'T
- Don't skip wal_log_hints - pg_rewind will fail
- Don't assume auto-recovery works - Test it!
- Don't ignore timeline mismatches - Critical issue
- Don't manually promote during recovery - Let Patroni handle
- Don't delete data without backup - Diverged data may be important
- Don't run split-brain clusters - Fix immediately
- Don't forget callbacks - Fencing prevents split-brain
- Don't over-automate reinit - Risk data loss
9. Lab Exercises
Lab 1: Auto-rejoin after clean shutdown
Tasks:
- Stop one replica: sudo systemctl stop patroni
- Make changes on the primary
- Start the replica: sudo systemctl start patroni
- Verify auto-rejoin and lag catch-up
- Time the recovery
Lab 2: pg_rewind after simulated failover
Tasks:
- Record current primary
- Manually stop primary: sudo systemctl stop patroni
- Wait for failover to complete
- Start old primary (should auto-rewind)
- Verify old primary rejoined as replica
- Check timeline increment
Lab 3: Full rebuild with pg_basebackup
Tasks:
- Stop a replica
- Delete the data directory: sudo rm -rf /var/lib/postgresql/18/data/*
- Manually run pg_basebackup from the primary
- Start replica
- Verify replication restored
- Measure rebuild time
Lab 4: Patroni reinit command
Tasks:
- Use patronictl reinit postgres node3
- Monitor logs during the process
- Verify automated rebuild
- Compare time vs manual pg_basebackup
Lab 5: Timeline divergence simulation
Tasks:
- Create network partition (iptables)
- Wait for failover
- Manually promote old primary (force split-brain)
- Write different data to both "primaries"
- Restore network
- Observe conflict detection
- Practice recovery procedure
10. Troubleshooting
Issue: pg_rewind fails
Error: pg_rewind: fatal: could not find common ancestor
Cause: wal_log_hints not enabled or data too diverged.
Solution:
# Check wal_log_hints
sudo -u postgres psql -c "SHOW wal_log_hints;"
# If off, enable it via Patroni (ALTER SYSTEM plus a systemd restart is not
# the right path when Patroni manages PostgreSQL), then restart:
patronictl edit-config postgres -p wal_log_hints=on
patronictl restart postgres
# Note: enabling wal_log_hints now only protects against future divergences;
# it does not make the already-diverged node rewindable. Fall back to a full rebuild:
patronictl reinit postgres node1
Issue: Replica stuck in recovery
Symptoms: Replica shows "running" but high lag.
Diagnosis:
# Check replication status
sudo -u postgres psql -h 10.0.1.11 -c "
SELECT * FROM pg_stat_replication;
"
# Check replica logs
sudo journalctl -u patroni -n 100   # PostgreSQL output goes through Patroni, so check the patroni unit
Common causes:
- WAL receiver crashed
- Network issues
- Disk full on replica
- Archive restore errors
Solution:
# Restart replication
sudo systemctl restart patroni
# If persists, reinit
patronictl reinit postgres node3
Issue: Cannot connect after recovery
Error: FATAL: the database system is starting up
Cause: PostgreSQL still replaying WAL.
Solution: Wait for recovery to complete, or check logs for errors.
# Check recovery progress
sudo -u postgres psql -h 10.0.1.13 -c "
SELECT pg_is_in_recovery(),
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn();
"
11. Summary
Recovery Methods Summary
| Method | Speed | Data Loss | Use Case |
|---|---|---|---|
| Auto-rejoin | Fastest | None | Clean shutdown/restart |
| pg_rewind | Fast | None | Timeline divergence |
| pg_basebackup | Slow | None | Corruption, major divergence |
| Manual recovery | Varies | Possible | Split-brain, complex issues |
Key Concepts
✅ Auto-rejoin - Patroni handles clean recovery automatically
✅ pg_rewind - Resync after timeline divergence (requires wal_log_hints)
✅ pg_basebackup - Full rebuild from primary (slow but safe)
✅ Timeline - History branch, increments on failover
✅ Split-brain - Multiple primaries (prevented by DCS leader lock)
Recovery Checklist
- Node failure detected
- Determine recovery method needed
- Backup diverged data (if any)
- Execute recovery (auto or manual)
- Verify timeline matches cluster
- Verify replication streaming
- Test read/write operations
- Check replication lag
- Update monitoring/documentation
Next Steps
Lesson 16 will cover Backup and Point-in-Time Recovery:
- pg_basebackup strategies
- WAL archiving configuration
- Point-in-Time Recovery (PITR) procedures
- Backup automation and scheduling
- Disaster recovery planning