Lesson 27: Disaster Recovery Drills
DR planning, testing procedures, incident response workflow, post-mortem analysis, and a full DR scenario simulation.
Objectives
After this lesson, you will be able to:
- Plan comprehensive disaster recovery procedures
- Execute DR drills systematically
- Measure and optimize RTO/RPO
- Conduct incident response exercises
- Document and improve DR processes
1. DR Planning Foundation
1.1. Key DR metrics
RTO (Recovery Time Objective):
- Maximum acceptable downtime
- Example: 15 minutes
RPO (Recovery Point Objective):
- Maximum acceptable data loss
- Example: 5 minutes
RTA (Recovery Time Actual):
- Actual time taken in drill
- Goal: RTA < RTO
RPA (Recovery Point Actual):
- Actual data loss measured in the drill
- Goal: RPA < RPO (see the timing sketch below)
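To make RTA comparable across drills, timestamp the moment the failure is injected and the moment service is confirmed restored, then subtract. A minimal sketch, assuming recovery is confirmed by the HAProxy VIP (haproxy-vip, used in the drills below) routing to a writable primary again; the script and log file names are hypothetical:
#!/bin/bash
# measure_rta.sh - rough RTA measurement for a drill (sketch; adjust to your setup)
FAIL_TS=$(date +%s)
echo "$(date -u) failure injected" >> drill_rta.log
# ...inject the failure here, e.g. stop Patroni on the leader...
# Poll until the VIP routes to a writable primary again, then record the recovery time
until psql -h haproxy-vip -U postgres -Atc "SELECT NOT pg_is_in_recovery();" 2>/dev/null | grep -q t; do
  sleep 1
done
RECOVER_TS=$(date +%s)
echo "$(date -u) service restored" >> drill_rta.log
echo "RTA: $((RECOVER_TS - FAIL_TS)) seconds (compare against the RTO target)"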
1.2. DR scenarios to test
1. Single node failure
- Impact: Low (automatic failover)
- RTO: < 1 minute
- RPO: 0 (synchronous replication)
2. Leader node failure
- Impact: Medium (brief disruption)
- RTO: < 2 minutes
- RPO: 0
3. Complete datacenter failure
- Impact: High (manual intervention)
- RTO: < 15 minutes
- RPO: < 5 minutes
4. Data corruption
- Impact: High (PITR required)
- RTO: 1-4 hours
- RPO: Last valid backup
5. Human error (DROP TABLE)
- Impact: Medium-High
- RTO: 30 minutes - 2 hours
- RPO: Point-in-time before error
2. DR Drill Preparation
2.1. Pre-drill checklist
☐ Review DR documentation
☐ Verify all backups are current
☐ Test backup restoration (dry run; see the check sketch after this list)
☐ Confirm monitoring/alerting works
☐ Notify stakeholders of drill
☐ Schedule during low-traffic period
☐ Prepare rollback procedure
☐ Assemble response team
☐ Set up communication channels (Slack, Zoom)
☐ Document drill objectives
☐ Prepare stopwatch for timing
☐ Set up screen recording (for post-mortem)
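Two of the checklist items above (current backups and a restore dry run) are easy to script. A minimal sketch, assuming the WAL archive path used in the PITR drill later in this lesson; for the restore dry run itself, reuse the pg_basebackup procedure from Scenario 4:
#!/bin/bash
# pre_drill_backup_check.sh - sanity-check backup freshness before a drill (sketch)
WAL_ARCHIVE=/var/lib/postgresql/wal_archive   # same path as the PITR drill below
# Newest archived WAL segment should only be minutes old
NEWEST_WAL=$(ls -t "$WAL_ARCHIVE" | head -1)
WAL_AGE=$(( $(date +%s) - $(stat -c %Y "$WAL_ARCHIVE/$NEWEST_WAL") ))
echo "Newest WAL segment: $NEWEST_WAL (${WAL_AGE}s old)"
[ "$WAL_AGE" -lt 600 ] || echo "WARNING: newest archived WAL is older than 10 minutes"
# All members should be healthy and streaming before the drill starts
patronictl -c /etc/patroni/patroni.yml list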
2.2. DR team roles
Incident Commander:
- Owns overall response
- Makes final decisions
- Coordinates teams
Database Admin:
- Executes PostgreSQL recovery
- Manages Patroni cluster
- Validates data integrity
System Admin:
- Manages infrastructure
- Network connectivity
- Firewall rules
Application Owner:
- Tests application functionality
- Validates business logic
- User acceptance testing
Communications Lead:
- Updates stakeholders
- Documents timeline
- Post-mortem facilitator
Observer (optional):
- Takes notes
- Times each step
- Identifies improvements
3. Scenario 1: Single Replica Failure
3.1. Drill procedure
# Step 1: Simulate replica failure (10:00:00)
ssh node2 "sudo systemctl stop patroni"
# Step 2: Monitor automatic recovery (10:00:15)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'
# Expected output after 30 seconds:
# + Cluster: postgres-cluster --------------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Leader  | running |  5 |           |
# | node2  | 10.0.1.12  | Replica | stopped |    |           | ← Down
# | node3  | 10.0.1.13  | Replica | running |  5 |         0 |
# +--------+------------+---------+---------+----+-----------+
# Step 3: Verify read traffic routes to remaining replica (10:01:00)
psql -h haproxy-vip -U postgres -c "SELECT inet_server_addr();"
# Should return node1 or node3, NOT node2
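# (Optional sketch, not part of the original runbook: sample the VIP a few times
# to confirm node2 never serves reads while it is down. Adjust host/port to your
# HAProxy read endpoint.)
for i in $(seq 1 10); do
  psql -h haproxy-vip -U postgres -Atc "SELECT inet_server_addr();"
done | sort | uniq -c
# Expect only 10.0.1.11 / 10.0.1.13 in the output, not 10.0.1.12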
# Step 4: Restore failed replica (10:05:00)
ssh node2 "sudo systemctl start patroni"
# Step 5: Wait for replication catchup (10:05:30)
patronictl -c /etc/patroni/patroni.yml list
# node2 should show "streaming" state
# Step 6: Verify replication lag is minimal (10:06:00)
psql -h node2 -U postgres -c "
SELECT pg_wal_lsn_diff(
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn()
) AS lag_bytes;
"
# lag_bytes should be < 1MB
3.2. Expected results
Timeline:
- 10:00:00: Failure injected
- 10:00:30: Failure detected by Patroni
- 10:01:00: Traffic automatically rerouted
- 10:05:00: Recovery initiated
- 10:06:00: Full recovery complete
RTO: 1 minute (time until traffic rerouted)
RPO: 0 bytes (no data loss)
Impact:
- No application downtime
- Slightly increased load on remaining replica
- Monitoring alerts triggered (expected)
4. Scenario 2: Leader Failover
4.1. Drill procedure
# Step 1: Record current leader (10:00:00)
CURRENT_LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
echo "Current leader: $CURRENT_LEADER"
# Step 2: Simulate leader failure (10:00:05)
ssh $CURRENT_LEADER "sudo systemctl stop patroni"
# Step 3: Monitor automatic failover (10:00:10)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'
# Expected: New leader elected in 15-30 seconds
# + Cluster: postgres-cluster --------------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Replica | stopped |    |           | ← Old leader
# | node2  | 10.0.1.12  | Leader  | running |  6 |           | ← NEW leader
# | node3  | 10.0.1.13  | Replica | running |  6 |         0 |
# +--------+------------+---------+---------+----+-----------+
# Step 4: Test write operations (10:00:45)
TS=$(date +%s)   # capture once so every statement targets the same table
psql -h haproxy-vip -U postgres <<EOF
CREATE TABLE drill_test_$TS (id serial primary key, data text);
INSERT INTO drill_test_$TS (data) VALUES ('DR drill success');
SELECT * FROM drill_test_$TS;
EOF
# Step 5: Verify application connectivity (10:01:00)
# Run application health checks
curl -f http://app-server/health || echo "Application DOWN"
# Step 6: Restore old leader as replica (10:03:00)
ssh $CURRENT_LEADER "sudo systemctl start patroni"
# Step 7: Wait for reintegration (10:03:30)
patronictl -c /etc/patroni/patroni.yml list
# node1 should rejoin as replica
# Step 8: Validate replication (10:04:00)
psql -h $CURRENT_LEADER -U postgres -c "SELECT pg_is_in_recovery();"
# Should return 't' (true = replica)
4.2. Expected results
Timeline:
- 10:00:05: Leader failure injected
- 10:00:20: Failure detected (TTL expired)
- 10:00:35: New leader elected
- 10:00:45: Write operations succeed
- 10:01:00: Application fully functional
- 10:04:00: Old leader rejoins as replica
RTO: 30 seconds (leader election time)
RPO: 0 bytes (with synchronous replication)
Impact:
- 30 seconds of write unavailability
- Read operations continue on replicas
- ~10-20 failed write requests (depending on traffic)
- Monitoring alerts triggered
5. Scenario 3: Complete Datacenter Failure
5.1. Drill procedure
# Setup: Assume 2 datacenters
# DC1: node1 (leader), node2 (replica)
# DC2: node3 (replica)
# Step 1: Simulate DC1 total failure (10:00:00)
for node in node1 node2; do
ssh $node "sudo systemctl stop patroni"
ssh $node "sudo systemctl stop etcd" # Simulate network partition
done
# Step 2: Monitor DC2 status (10:00:15)
ssh node3 "patronictl -c /etc/patroni/patroni.yml list"
# Expected: No leader (quorum lost)
# Step 3: Manual intervention - promote DC2 replica (10:02:00)
# First, verify DC1 is truly down (not network glitch)
ping -c 3 node1 && echo "WARNING: DC1 still reachable!"
# Remove DC1 from etcd cluster
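# NOTE: with two of the three etcd members down, the cluster has lost quorum and
# plain member removal may be rejected; in practice you may first need to restart
# the surviving etcd member with --force-new-cluster (or restore from a snapshot)
# before the commands below succeed.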
ssh node3 "etcdctl member list"
ssh node3 "etcdctl member remove <node1_member_id>"
ssh node3 "etcdctl member remove <node2_member_id>"
# Step 4: Promote node3 to leader (10:03:00)
ssh node3 "patronictl -c /etc/patroni/patroni.yml failover postgres-cluster --candidate node3 --force"
# Step 5: Update application connection strings (10:04:00)
# Point to DC2: node3 (now leader)
# This may require DNS update or load balancer reconfiguration
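# (Optional sketch: discover the current leader programmatically via Patroni's
# REST API and feed the result into your DNS/load-balancer update. Assumes the
# default REST port 8008; GET /leader returns 200 only on the leader.)
for host in 10.0.1.11 10.0.1.12 10.0.1.13; do
  if [ "$(curl -s -o /dev/null -w '%{http_code}' http://$host:8008/leader)" = "200" ]; then
    echo "Current leader: $host"
  fi
done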
# Step 6: Verify write operations (10:05:00)
psql -h node3 -U postgres <<EOF
CREATE TABLE dc_failover_test (id serial primary key, recovered_at timestamp default now());
INSERT INTO dc_failover_test VALUES (DEFAULT);
SELECT * FROM dc_failover_test;
EOF
# Step 7: When DC1 recovers, reintegrate (later, during maintenance)
# Bring up DC1 nodes as replicas of DC2
ssh node1 "sudo systemctl start etcd"
ssh node1 "sudo systemctl start patroni"
# Wait for replication catchup
patronictl -c /etc/patroni/patroni.yml list
5.2. Expected results
Timeline:
- 10:00:00: DC1 failure
- 10:02:00: Decision to failover to DC2
- 10:03:00: Manual promotion of DC2 leader
- 10:04:00: Application reconfiguration
- 10:05:00: Service fully restored
RTO: 5 minutes (includes decision time)
RPO: 0-5 minutes (depends on replication lag at failure time)
Impact:
- 5 minutes of complete outage
- Possible data loss if async replication
- Manual intervention required
- Requires application update
6. Scenario 4: Point-in-Time Recovery (Data Corruption)
6.1. Drill procedure
# Setup: Simulate accidental table drop at 10:30:00
psql -h leader -U postgres <<EOF
CREATE TABLE important_data (id serial, data text);
INSERT INTO important_data (data) SELECT 'Record ' || generate_series(1, 1000);
SELECT count(*) FROM important_data; -- 1000 rows
EOF
# Record current time before corruption
BEFORE_CORRUPTION=$(date -u +"%Y-%m-%d %H:%M:%S+00")   # include the UTC offset so recovery_target_time is unambiguous
echo "Before corruption: $BEFORE_CORRUPTION"
# Simulate data corruption at 10:30:00
psql -h leader -U postgres -c "DROP TABLE important_data;"
echo "Table dropped (simulating accident) at $(date)"
# Step 1: Detect data loss (10:30:30)
psql -h leader -U postgres -c "SELECT * FROM important_data;"
# ERROR: relation "important_data" does not exist
# Step 2: Identify PITR target time (10:31:00)
PITR_TARGET=$BEFORE_CORRUPTION
echo "Will recover to: $PITR_TARGET"
# Step 3: Setup recovery environment (10:32:00)
# Create separate recovery instance (don't disturb production!)
sudo mkdir -p /var/lib/postgresql/18/pitr_recovery
sudo chown postgres:postgres /var/lib/postgresql/18/pitr_recovery
# Step 4: Restore base backup (10:33:00)
sudo -u postgres pg_basebackup \
-h leader \
-D /var/lib/postgresql/18/pitr_recovery \
-X stream -P
# Step 5: Configure recovery (10:35:00)
sudo -u postgres touch /var/lib/postgresql/18/pitr_recovery/recovery.signal
sudo -u postgres tee /var/lib/postgresql/18/pitr_recovery/postgresql.auto.conf <<EOF
port = 5433                      # keep the recovery instance off the production port
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '$PITR_TARGET'
recovery_target_action = 'promote'
EOF
# Step 6: Start recovery instance (10:36:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
-D /var/lib/postgresql/18/pitr_recovery \
-l /tmp/pitr_recovery.log \
start
# Step 7: Wait for recovery completion (10:40:00)
tail -f /tmp/pitr_recovery.log
# Look for: "database system is ready to accept connections"
# Step 8: Verify recovered data (10:41:00)
psql -h localhost -p 5433 -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows
# Step 9: Export recovered data (10:42:00)
pg_dump -h localhost -p 5433 -U postgres -t important_data > recovered_data.sql
# Step 10: Import to production (10:43:00)
psql -h leader -U postgres < recovered_data.sql
# Step 11: Verify production (10:44:00)
psql -h leader -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows ✅
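# (Optional sketch: beyond row counts, compare a content checksum between the
# recovery instance and production before declaring success.)
psql -h localhost -p 5433 -U postgres -Atc \
  "SELECT md5(string_agg(data, ',' ORDER BY id)) FROM important_data;"
psql -h leader -U postgres -Atc \
  "SELECT md5(string_agg(data, ',' ORDER BY id)) FROM important_data;"
# The two checksums should match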
# Step 12: Cleanup recovery instance (10:45:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
-D /var/lib/postgresql/18/pitr_recovery stop
sudo rm -rf /var/lib/postgresql/18/pitr_recovery
6.2. Expected results
Timeline:
- 10:30:00: Data corruption detected
- 10:31:00: PITR target time identified
- 10:33:00: Base backup restoration started
- 10:36:00: PITR recovery initiated
- 10:41:00: Data recovery complete
- 10:44:00: Data restored to production
- 10:45:00: Cleanup complete
RTO: 15 minutes (data restoration)
RPO: 0 (recovered to exact point before corruption)
Impact:
- Temporary read-only mode during restoration
- Requires manual data export/import
- No service downtime (recovery on separate instance)
7. DR Drill Metrics and Reporting
7.1. Drill scorecard
Scenario: Leader Failover Drill
Date: 2024-11-25
Duration: 30 minutes
Participants: 5 team members
Metrics:
☑ RTO Target: 2 minutes
RTO Actual: 35 seconds ✅ (Better than target)
☑ RPO Target: 0 bytes
RPO Actual: 0 bytes ✅
☑ Detection Time: 15 seconds ✅
☑ Failover Time: 20 seconds ✅
☑ Validation Time: 5 minutes ⚠️ (Could be faster)
Issues Found:
1. Monitoring alert delayed by 10 seconds (configuration issue)
2. Runbook step 3 outdated (missing new command)
3. Team member unfamiliar with patronictl commands
Action Items:
☐ Fix monitoring alert configuration
☐ Update runbook documentation
☐ Schedule training session for new commands
☐ Re-test in 2 weeks
7.2. Post-drill analysis
# DR Drill Post-Mortem: Leader Failover
## Summary
Successfully executed the planned leader failover drill. RTO and RPO came in well under target. Three areas for improvement were identified.
## Timeline
| Time | Event | Owner |
|------|-------|-------|
| 10:00:00 | Drill initiated | DBA |
| 10:00:15 | Leader stopped | DBA |
| 10:00:30 | Failure detected | Monitoring |
| 10:00:35 | New leader elected | Patroni |
| 10:00:50 | Write operations tested | DBA |
| 10:01:00 | Application health check | App Owner |
| 10:05:00 | Old leader rejoined | DBA |
## What Went Well
✅ Automatic failover worked flawlessly
✅ Zero data loss confirmed
✅ Team communication effective
✅ Documentation mostly accurate
## What Could Be Improved
⚠️ Monitoring alert configuration needs tuning
⚠️ Runbook has outdated commands
⚠️ One team member needs additional training
## Action Items
1. [ ] Update Prometheus alert rules (@sre-team, due: 2024-11-30)
2. [ ] Revise DR runbook (@dba-team, due: 2024-11-28)
3. [ ] Conduct patronictl training (@dba-lead, due: 2024-12-05)
4. [ ] Schedule next drill (@incident-commander, due: 2025-01-15)
## Recommendations
- Continue quarterly DR drills
- Rotate incident commander role
- Add chaos engineering (random failures)
8. Chaos Engineering for HA
8.1. Chaos Monkey for PostgreSQL
#!/bin/bash
# chaos-monkey.sh - Randomly break PostgreSQL cluster components
NODES=("node1" "node2" "node3")
INTERVAL=3600   # 1 hour between failures
while true; do
    # Pick a random node
    NODE=${NODES[$RANDOM % ${#NODES[@]}]}
    # Pick a random failure type
    FAILURE_TYPE=$((RANDOM % 3))
    case $FAILURE_TYPE in
        0)
            echo "$(date): Stopping Patroni on $NODE"
            ssh $NODE "sudo systemctl stop patroni"
            ;;
        1)
            echo "$(date): Simulating network partition on $NODE"
            ssh $NODE "sudo iptables -A INPUT -p tcp --dport 5432 -j DROP"
            sleep 300
            ssh $NODE "sudo iptables -D INPUT -p tcp --dport 5432 -j DROP"
            ;;
        2)
            echo "$(date): Stopping etcd on $NODE"
            ssh $NODE "sudo systemctl stop etcd"
            ;;
    esac
    # Give the cluster time to react and recover
    sleep 300
    # Restore services if they did not come back on their own
    # (etcd first so Patroni can reach the DCS when it starts)
    ssh $NODE "sudo systemctl start etcd"
    ssh $NODE "sudo systemctl start patroni"
    # Wait before the next chaos event
    sleep $INTERVAL
done
8.2. Automated DR testing
# automated_dr_test.yml
---
- name: Automated DR Drill
  hosts: postgres_cluster
  vars:
    drill_start_time: "{{ ansible_date_time.iso8601 }}"
  tasks:
    - name: Record baseline metrics
      shell: patronictl -c /etc/patroni/patroni.yml list
      register: baseline
    - name: Inject failure on leader
      shell: |
        LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
        ssh $LEADER "sudo systemctl stop patroni"
      delegate_to: localhost
      run_once: true
    - name: Wait for failover
      wait_for:
        timeout: 60
    - name: Verify new leader elected
      shell: patronictl -c /etc/patroni/patroni.yml list | grep Leader | wc -l
      register: leader_count
      failed_when: leader_count.stdout != "1"
    - name: Measure RTO
      shell: |
        echo "RTO: $(( $(date +%s) - $(date -d '{{ drill_start_time }}' +%s) )) seconds"
      register: rto_result
    - name: Generate drill report
      template:
        src: drill_report.j2
        dest: /tmp/drill_report_{{ drill_start_time }}.txt
    - name: Send report to Slack
      uri:
        url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        method: POST
        body_format: json
        body:
          text: "DR Drill completed. RTO: {{ rto_result.stdout }}"
      run_once: true
9. Best Practices
✅ DO
- Schedule regular drills - Quarterly minimum
- Test all scenarios - Not just easy ones
- Rotate roles - Everyone should be IC once
- Document everything - Timestamped notes
- Measure RTO/RPO - Track improvements
- Post-mortem every drill - Learn and improve
- Update runbooks - Keep documentation current
- Involve all teams - Cross-functional practice
- Test backups - Restore verification essential
- Automate where possible - Reduce human error
❌ DON'T
- Don't skip drills - "Too busy" is not an excuse
- Don't test only easy scenarios - Hard ones matter most
- Don't ignore action items - Follow up on improvements
- Don't reuse same scenario - Vary the drills
- Don't rely on one person - Bus factor = 1 is dangerous
- Don't rush - Proper testing takes time
- Don't skip post-mortems - Learning opportunity
10. Lab Exercises
Lab 1: Execute failover drill
Tasks:
- Plan and schedule drill
- Assign team roles
- Execute leader failover
- Document timeline
- Calculate RTO/RPO
- Write post-mortem
Lab 2: PITR recovery drill
Tasks:
- Create test data
- Simulate data corruption
- Identify PITR target time
- Restore to separate instance
- Verify recovered data
- Document procedure
Lab 3: Multi-DC failover
Tasks:
- Setup 2-DC cluster
- Simulate DC1 total failure
- Manually promote DC2
- Update application config
- Measure downtime
- Document lessons learned
Lab 4: Chaos engineering
Tasks:
- Implement chaos monkey script
- Run for 24 hours
- Monitor cluster behavior
- Document failures and recoveries
- Identify weak points
- Improve HA configuration
11. Summary
DR Drill Frequency
Scenario Frequency:
- Single node failure: Monthly (automated)
- Leader failover: Quarterly
- DC failure: Semi-annually
- PITR recovery: Quarterly
- Full DR: Annually
Success Criteria
A successful DR drill has:
✅ Met RTO/RPO targets
✅ Zero data loss (or within RPO)
✅ All team members participated
✅ Documentation updated
✅ Action items identified
✅ Post-mortem completed
✅ Next drill scheduled
Key Metrics to Track
- Detection time (how fast we notice)
- Response time (how fast we act)
- Recovery time (how fast we restore)
- Data loss (how much data was lost)
- Team coordination (how well we work together; a tracking-table sketch follows)
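One way to keep these numbers comparable across drills is to record them after every exercise. The table below is a hypothetical sketch (not part of the course environment), populated with the figures from the scorecard in section 7.1:
psql -h leader -U postgres <<'EOF'
CREATE TABLE IF NOT EXISTS dr_drill_log (
    drill_date      date    NOT NULL,
    scenario        text    NOT NULL,
    detection_secs  integer,          -- how fast we noticed
    response_secs   integer,          -- how fast we acted
    recovery_secs   integer,          -- how fast we restored (RTA)
    data_loss_bytes bigint,           -- RPA
    notes           text
);
INSERT INTO dr_drill_log
VALUES ('2024-11-25', 'leader failover', 15, 20, 35, 0, 'see post-mortem');
EOF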
Next Steps
Lesson 28 will cover HA Architecture Design:
- Requirements gathering
- Architecture design documents
- Capacity planning
- Cost estimation
- Design review process