Lesson 27: Disaster Recovery Drills
DR planning, testing procedures, incident response workflow, post-mortem analysis, and a full DR scenario simulation.
Objectives
After this lesson, you will be able to:
- Plan comprehensive disaster recovery procedures
- Execute DR drills systematically
- Measure and optimize RTO/RPO
- Conduct incident response exercises
- Document and improve DR processes
1. DR Planning Foundation
1.1. Key DR metrics
RTO (Recovery Time Objective):
- Maximum acceptable downtime
- Example: 15 minutes
RPO (Recovery Point Objective):
- Maximum acceptable data loss
- Example: 5 minutes
RTA (Recovery Time Actual):
- Actual time taken in drill
- Goal: RTA < RTO
RPA (Recovery Point Actual):
- Actual data loss measured in the drill
- Goal: RPA < RPO (see the timing sketch below)
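To make RTA comparable across drills, timestamp the moment the failure is injected and the moment service is confirmed restored, then subtract. A minimal sketch, assuming recovery is confirmed by the HAProxy VIP (haproxy-vip, used in the drills below) routing to a writable primary again; the script and log file names are hypothetical:
#!/bin/bash
# measure_rta.sh - rough RTA measurement for a drill (sketch; adjust to your setup)
FAIL_TS=$(date +%s)
echo "$(date -u) failure injected" >> drill_rta.log
# ...inject the failure here, e.g. stop Patroni on the leader...
# Poll until the VIP routes to a writable primary again, then record the recovery time
until psql -h haproxy-vip -U postgres -Atc "SELECT NOT pg_is_in_recovery();" 2>/dev/null | grep -q t; do
  sleep 1
done
RECOVER_TS=$(date +%s)
echo "$(date -u) service restored" >> drill_rta.log
echo "RTA: $((RECOVER_TS - FAIL_TS)) seconds (compare against the RTO target)"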
1.2. DR scenarios to test
1. Single node failure
- Impact: Low (automatic failover)
- RTO: < 1 minute
- RPO: 0 (synchronous replication)
2. Leader node failure
- Impact: Medium (brief disruption)
- RTO: < 2 minutes
- RPO: 0
3. Complete datacenter failure
- Impact: High (manual intervention)
- RTO: < 15 minutes
- RPO: < 5 minutes
4. Data corruption
- Impact: High (PITR required)
- RTO: 1-4 hours
- RPO: Last valid backup
5. Human error (DROP TABLE)
- Impact: Medium-High
- RTO: 30 minutes - 2 hours
- RPO: Point-in-time before error
2. DR Drill Preparation
2.1. Pre-drill checklist
☐ Review DR documentation
☐ Verify all backups are current
☐ Test backup restoration (dry run; see the check sketch after this list)
☐ Confirm monitoring/alerting works
☐ Notify stakeholders of drill
☐ Schedule during low-traffic period
☐ Prepare rollback procedure
☐ Assemble response team
☐ Set up communication channels (Slack, Zoom)
☐ Document drill objectives
☐ Prepare stopwatch for timing
☐ Set up screen recording (for post-mortem)
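Two of the checklist items above (current backups and a restore dry run) are easy to script. A minimal sketch, assuming the WAL archive path used in the PITR drill later in this lesson; for the restore dry run itself, reuse the pg_basebackup procedure from Scenario 4:
#!/bin/bash
# pre_drill_backup_check.sh - sanity-check backup freshness before a drill (sketch)
WAL_ARCHIVE=/var/lib/postgresql/wal_archive   # same path as the PITR drill below
# Newest archived WAL segment should only be minutes old
NEWEST_WAL=$(ls -t "$WAL_ARCHIVE" | head -1)
WAL_AGE=$(( $(date +%s) - $(stat -c %Y "$WAL_ARCHIVE/$NEWEST_WAL") ))
echo "Newest WAL segment: $NEWEST_WAL (${WAL_AGE}s old)"
[ "$WAL_AGE" -lt 600 ] || echo "WARNING: newest archived WAL is older than 10 minutes"
# All members should be healthy and streaming before the drill starts
patronictl -c /etc/patroni/patroni.yml list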
2.2. DR team roles
Incident Commander:
- Owns overall response
- Makes final decisions
- Coordinates teams
Database Admin:
- Executes PostgreSQL recovery
- Manages Patroni cluster
- Validates data integrity
System Admin:
- Manages infrastructure
- Network connectivity
- Firewall rules
Application Owner:
- Tests application functionality
- Validates business logic
- User acceptance testing
Communications Lead:
- Updates stakeholders
- Documents timeline
- Post-mortem facilitator
Observer (optional):
- Takes notes
- Times each step
- Identifies improvements
3. Scenario 1: Single Replica Failure
3.1. Drill procedure
# Step 1: Simulate replica failure (10:00:00)
ssh node2 "sudo systemctl stop patroni"
# Step 2: Monitor automatic recovery (10:00:15)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'
# Expected output after 30 seconds:
# + Cluster: postgres-cluster --------------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Leader  | running |  5 |           |
# | node2  | 10.0.1.12  | Replica | stopped |    |           | ← Down
# | node3  | 10.0.1.13  | Replica | running |  5 |         0 |
# +--------+------------+---------+---------+----+-----------+
# Step 3: Verify read traffic routes to remaining replica (10:01:00)
psql -h haproxy-vip -U postgres -c "SELECT inet_server_addr();"
# Should return node1 or node3, NOT node2
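# (Optional sketch, not part of the original runbook: sample the VIP a few times
# to confirm node2 never serves reads while it is down. Adjust host/port to your
# HAProxy read endpoint.)
for i in $(seq 1 10); do
  psql -h haproxy-vip -U postgres -Atc "SELECT inet_server_addr();"
done | sort | uniq -c
# Expect only 10.0.1.11 / 10.0.1.13 in the output, not 10.0.1.12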
# Step 4: Restore failed replica (10:05:00)
ssh node2 "sudo systemctl start patroni"
# Step 5: Wait for replication catchup (10:05:30)
patronictl -c /etc/patroni/patroni.yml list
# node2 should show "streaming" state
# Step 6: Verify replication lag is minimal (10:06:00)
psql -h node2 -U postgres -c "
SELECT pg_wal_lsn_diff(
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn()
) AS lag_bytes;
"
# lag_bytes should be < 1MB
3.2. Expected results
Timeline:
- 10:00:00: Failure injected
- 10:00:30: Failure detected by Patroni
- 10:01:00: Traffic automatically rerouted
- 10:05:00: Recovery initiated
- 10:06:00: Full recovery complete
RTO: 1 minute (time until traffic rerouted)
RPO: 0 bytes (no data loss)
Impact:
- No application downtime
- Slightly increased load on remaining replica
- Monitoring alerts triggered (expected)
4. Scenario 2: Leader Failover
4.1. Drill procedure
# Step 1: Record current leader (10:00:00)
CURRENT_LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
echo "Current leader: $CURRENT_LEADER"
# Step 2: Simulate leader failure (10:00:05)
ssh $CURRENT_LEADER "sudo systemctl stop patroni"
# Step 3: Monitor automatic failover (10:00:10)
watch -n 1 'patronictl -c /etc/patroni/patroni.yml list'
# Expected: New leader elected in 15-30 seconds
# + Cluster: postgres-cluster --------------+----+-----------+
# | Member | Host       | Role    | State   | TL | Lag in MB |
# +--------+------------+---------+---------+----+-----------+
# | node1  | 10.0.1.11  | Replica | stopped |    |           | ← Old leader
# | node2  | 10.0.1.12  | Leader  | running |  6 |           | ← NEW leader
# | node3  | 10.0.1.13  | Replica | running |  6 |         0 |
# +--------+------------+---------+---------+----+-----------+
# Step 4: Test write operations (10:00:45)
TS=$(date +%s)   # capture once so every statement targets the same table
psql -h haproxy-vip -U postgres <<EOF
CREATE TABLE drill_test_$TS (id serial primary key, data text);
INSERT INTO drill_test_$TS (data) VALUES ('DR drill success');
SELECT * FROM drill_test_$TS;
EOF
# Step 5: Verify application connectivity (10:01:00)
# Run application health checks
curl -f http://app-server/health || echo "Application DOWN"
# Step 6: Restore old leader as replica (10:03:00)
ssh $CURRENT_LEADER "sudo systemctl start patroni"
# Step 7: Wait for reintegration (10:03:30)
patronictl -c /etc/patroni/patroni.yml list
# node1 should rejoin as replica
# Step 8: Validate replication (10:04:00)
psql -h $CURRENT_LEADER -U postgres -c "SELECT pg_is_in_recovery();"
# Should return 't' (true = replica)
4.2. Expected results
Timeline:
- 10:00:05: Leader failure injected
- 10:00:20: Failure detected (TTL expired)
- 10:00:35: New leader elected
- 10:00:45: Write operations succeed
- 10:01:00: Application fully functional
- 10:04:00: Old leader rejoins as replica
RTO: 30 seconds (leader election time)
RPO: 0 bytes (with synchronous replication)
Impact:
- 30 seconds of write unavailability
- Read operations continue on replicas
- ~10-20 failed write requests (depending on traffic)
- Monitoring alerts triggered
5. Scenario 3: Complete Datacenter Failure
5.1. Drill procedure
# Setup: Assume 2 datacenters
# DC1: node1 (leader), node2 (replica)
# DC2: node3 (replica)
# Step 1: Simulate DC1 total failure (10:00:00)
for node in node1 node2; do
ssh $node "sudo systemctl stop patroni"
ssh $node "sudo systemctl stop etcd" # Simulate network partition
done
# Step 2: Monitor DC2 status (10:00:15)
ssh node3 "patronictl -c /etc/patroni/patroni.yml list"
# Expected: No leader (quorum lost)
# Step 3: Manual intervention - promote DC2 replica (10:02:00)
# First, verify DC1 is truly down (not network glitch)
ping -c 3 node1 && echo "WARNING: DC1 still reachable!"
# Remove DC1 from etcd cluster
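# NOTE: with two of the three etcd members down, the cluster has lost quorum and
# plain member removal may be rejected; in practice you may first need to restart
# the surviving etcd member with --force-new-cluster (or restore from a snapshot)
# before the commands below succeed.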
ssh node3 "etcdctl member list"
ssh node3 "etcdctl member remove <node1_member_id>"
ssh node3 "etcdctl member remove <node2_member_id>"
# Step 4: Promote node3 to leader (10:03:00)
ssh node3 "patronictl -c /etc/patroni/patroni.yml failover postgres-cluster --candidate node3 --force"
# Step 5: Update application connection strings (10:04:00)
# Point to DC2: node3 (now leader)
# This may require DNS update or load balancer reconfiguration
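# (Optional sketch: discover the current leader programmatically via Patroni's
# REST API and feed the result into your DNS/load-balancer update. Assumes the
# default REST port 8008; GET /leader returns 200 only on the leader.)
for host in 10.0.1.11 10.0.1.12 10.0.1.13; do
  if [ "$(curl -s -o /dev/null -w '%{http_code}' http://$host:8008/leader)" = "200" ]; then
    echo "Current leader: $host"
  fi
done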
# Step 6: Verify write operations (10:05:00)
psql -h node3 -U postgres <<EOF
CREATE TABLE dc_failover_test (id serial primary key, recovered_at timestamp default now());
INSERT INTO dc_failover_test VALUES (DEFAULT);
SELECT * FROM dc_failover_test;
EOF
# Step 7: When DC1 recovers, reintegrate (later, during maintenance)
# Bring up DC1 nodes as replicas of DC2
ssh node1 "sudo systemctl start etcd"
ssh node1 "sudo systemctl start patroni"
# Wait for replication catchup
patronictl -c /etc/patroni/patroni.yml list
5.2. Expected results
Timeline:
- 10:00:00: DC1 failure
- 10:02:00: Decision to failover to DC2
- 10:03:00: Manual promotion of DC2 leader
- 10:04:00: Application reconfiguration
- 10:05:00: Service fully restored
RTO: 5 minutes (includes decision time)
RPO: 0-5 minutes (depends on replication lag at failure time)
Impact:
- 5 minutes of complete outage
- Possible data loss if async replication
- Manual intervention required
- Requires application update
6. Scenario 4: Point-in-Time Recovery (Data Corruption)
6.1. Drill procedure
# Setup: Simulate accidental table drop at 10:30:00
psql -h leader -U postgres <<EOF
CREATE TABLE important_data (id serial, data text);
INSERT INTO important_data (data) SELECT 'Record ' || generate_series(1, 1000);
SELECT count(*) FROM important_data; -- 1000 rows
EOF
# Record current time before corruption
BEFORE_CORRUPTION=$(date -u +"%Y-%m-%d %H:%M:%S+00")   # include the UTC offset so recovery_target_time is unambiguous
echo "Before corruption: $BEFORE_CORRUPTION"
# Simulate data corruption at 10:30:00
psql -h leader -U postgres -c "DROP TABLE important_data;"
echo "Table dropped (simulating accident) at $(date)"
# Step 1: Detect data loss (10:30:30)
psql -h leader -U postgres -c "SELECT * FROM important_data;"
# ERROR: relation "important_data" does not exist
# Step 2: Identify PITR target time (10:31:00)
PITR_TARGET=$BEFORE_CORRUPTION
echo "Will recover to: $PITR_TARGET"
# Step 3: Setup recovery environment (10:32:00)
# Create separate recovery instance (don't disturb production!)
sudo mkdir -p /var/lib/postgresql/18/pitr_recovery
sudo chown postgres:postgres /var/lib/postgresql/18/pitr_recovery
# Step 4: Restore base backup (10:33:00)
sudo -u postgres pg_basebackup \
-h leader \
-D /var/lib/postgresql/18/pitr_recovery \
-X stream -P
# Step 5: Configure recovery (10:35:00)
sudo -u postgres touch /var/lib/postgresql/18/pitr_recovery/recovery.signal
sudo -u postgres tee /var/lib/postgresql/18/pitr_recovery/postgresql.auto.conf <<EOF
port = 5433                      # keep the recovery instance off the production port
restore_command = 'cp /var/lib/postgresql/wal_archive/%f %p'
recovery_target_time = '$PITR_TARGET'
recovery_target_action = 'promote'
EOF
# Step 6: Start recovery instance (10:36:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
-D /var/lib/postgresql/18/pitr_recovery \
-l /tmp/pitr_recovery.log \
start
# Step 7: Wait for recovery completion (10:40:00)
tail -f /tmp/pitr_recovery.log
# Look for: "database system is ready to accept connections"
# Step 8: Verify recovered data (10:41:00)
psql -h localhost -p 5433 -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows
# Step 9: Export recovered data (10:42:00)
pg_dump -h localhost -p 5433 -U postgres -t important_data > recovered_data.sql
# Step 10: Import to production (10:43:00)
psql -h leader -U postgres < recovered_data.sql
# Step 11: Verify production (10:44:00)
psql -h leader -U postgres -c "SELECT count(*) FROM important_data;"
# Should return: 1000 rows ✅
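# (Optional sketch: beyond row counts, compare a content checksum between the
# recovery instance and production before declaring success.)
psql -h localhost -p 5433 -U postgres -Atc \
  "SELECT md5(string_agg(data, ',' ORDER BY id)) FROM important_data;"
psql -h leader -U postgres -Atc \
  "SELECT md5(string_agg(data, ',' ORDER BY id)) FROM important_data;"
# The two checksums should match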
# Step 12: Cleanup recovery instance (10:45:00)
sudo -u postgres /usr/lib/postgresql/18/bin/pg_ctl \
-D /var/lib/postgresql/18/pitr_recovery stop
sudo rm -rf /var/lib/postgresql/18/pitr_recovery
6.2. Expected results
Timeline:
- 10:30:00: Data corruption detected
- 10:31:00: PITR target time identified
- 10:33:00: Base backup restoration started
- 10:36:00: PITR recovery initiated
- 10:41:00: Data recovery complete
- 10:44:00: Data restored to production
- 10:45:00: Cleanup complete
RTO: 15 minutes (data restoration)
RPO: 0 (recovered to exact point before corruption)
Impact:
- Temporary read-only mode during restoration
- Requires manual data export/import
- No service downtime (recovery on separate instance)
7. DR Drill Metrics and Reporting
7.1. Drill scorecard
Scenario: Leader Failover Drill
Date: 2024-11-25
Duration: 30 minutes
Participants: 5 team members
Metrics:
☑ RTO Target: 2 minutes
RTO Actual: 35 seconds ✅ (Better than target)
☑ RPO Target: 0 bytes
RPO Actual: 0 bytes ✅
☑ Detection Time: 15 seconds ✅
☑ Failover Time: 20 seconds ✅
☑ Validation Time: 5 minutes ⚠️ (Could be faster)
Issues Found:
1. Monitoring alert delayed by 10 seconds (configuration issue)
2. Runbook step 3 outdated (missing new command)
3. Team member unfamiliar with patronictl commands
Action Items:
☐ Fix monitoring alert configuration
☐ Update runbook documentation
☐ Schedule training session for new commands
☐ Re-test in 2 weeks
7.2. Post-drill analysis
# DR Drill Post-Mortem: Leader Failover
## Summary
Successfully executed the planned leader failover drill. RTO and RPO came in well under target. Three areas for improvement were identified.
## Timeline
| Time | Event | Owner |
|------|-------|-------|
| 10:00:00 | Drill initiated | DBA |
| 10:00:15 | Leader stopped | DBA |
| 10:00:30 | Failure detected | Monitoring |
| 10:00:35 | New leader elected | Patroni |
| 10:00:50 | Write operations tested | DBA |
| 10:01:00 | Application health check | App Owner |
| 10:05:00 | Old leader rejoined | DBA |
## What Went Well
✅ Automatic failover worked flawlessly
✅ Zero data loss confirmed
✅ Team communication effective
✅ Documentation mostly accurate
## What Could Be Improved
⚠️ Monitoring alert configuration needs tuning
⚠️ Runbook has outdated commands
⚠️ One team member needs additional training
## Action Items
1. [ ] Update Prometheus alert rules (@sre-team, due: 2024-11-30)
2. [ ] Revise DR runbook (@dba-team, due: 2024-11-28)
3. [ ] Conduct patronictl training (@dba-lead, due: 2024-12-05)
4. [ ] Schedule next drill (@incident-commander, due: 2025-01-15)
## Recommendations
- Continue quarterly DR drills
- Rotate incident commander role
- Add chaos engineering (random failures)
8. Chaos Engineering for HA
8.1. Chaos Monkey for PostgreSQL
#!/bin/bash
# chaos-monkey.sh - Randomly break PostgreSQL cluster components
NODES=("node1" "node2" "node3")
INTERVAL=3600   # 1 hour between failures
while true; do
    # Pick a random node
    NODE=${NODES[$RANDOM % ${#NODES[@]}]}
    # Pick a random failure type
    FAILURE_TYPE=$((RANDOM % 3))
    case $FAILURE_TYPE in
        0)
            echo "$(date): Stopping Patroni on $NODE"
            ssh $NODE "sudo systemctl stop patroni"
            ;;
        1)
            echo "$(date): Simulating network partition on $NODE"
            ssh $NODE "sudo iptables -A INPUT -p tcp --dport 5432 -j DROP"
            sleep 300
            ssh $NODE "sudo iptables -D INPUT -p tcp --dport 5432 -j DROP"
            ;;
        2)
            echo "$(date): Stopping etcd on $NODE"
            ssh $NODE "sudo systemctl stop etcd"
            ;;
    esac
    # Give the cluster time to react and recover
    sleep 300
    # Restore services if they did not come back on their own
    # (etcd first so Patroni can reach the DCS when it starts)
    ssh $NODE "sudo systemctl start etcd"
    ssh $NODE "sudo systemctl start patroni"
    # Wait before the next chaos event
    sleep $INTERVAL
done
8.2. Automated DR testing
# automated_dr_test.yml
---
- name: Automated DR Drill
  hosts: postgres_cluster
  vars:
    drill_start_time: "{{ ansible_date_time.iso8601 }}"
  tasks:
    - name: Record baseline metrics
      shell: patronictl -c /etc/patroni/patroni.yml list
      register: baseline
    - name: Inject failure on leader
      shell: |
        LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
        ssh $LEADER "sudo systemctl stop patroni"
      delegate_to: localhost
      run_once: true
    - name: Wait for failover
      wait_for:
        timeout: 60
    - name: Verify new leader elected
      shell: patronictl -c /etc/patroni/patroni.yml list | grep Leader | wc -l
      register: leader_count
      failed_when: leader_count.stdout != "1"
    - name: Measure RTO
      shell: |
        echo "RTO: $(( $(date +%s) - $(date -d '{{ drill_start_time }}' +%s) )) seconds"
      register: rto_result
    - name: Generate drill report
      template:
        src: drill_report.j2
        dest: /tmp/drill_report_{{ drill_start_time }}.txt
    - name: Send report to Slack
      uri:
        url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL
        method: POST
        body_format: json
        body:
          text: "DR Drill completed. RTO: {{ rto_result.stdout }}"
      run_once: true
9. Best Practices
✅ DO
- Schedule regular drills - Quarterly minimum
- Test all scenarios - Not just easy ones
- Rotate roles - Everyone should be IC once
- Document everything - Timestamped notes
- Measure RTO/RPO - Track improvements
- Post-mortem every drill - Learn and improve
- Update runbooks - Keep documentation current
- Involve all teams - Cross-functional practice
- Test backups - Restore verification essential
- Automate where possible - Reduce human error
❌ DON'T
- Don't skip drills - "Too busy" is not an excuse
- Don't test only easy scenarios - Hard ones matter most
- Don't ignore action items - Follow up on improvements
- Don't reuse same scenario - Vary the drills
- Don't rely on one person - Bus factor = 1 is dangerous
- Don't rush - Proper testing takes time
- Don't skip post-mortems - Learning opportunity
10. Lab Exercises
Lab 1: Execute failover drill
Tasks:
- Plan and schedule drill
- Assign team roles
- Execute leader failover
- Document timeline
- Calculate RTO/RPO
- Write post-mortem
Lab 2: PITR recovery drill
Tasks:
- Create test data
- Simulate data corruption
- Identify PITR target time
- Restore to separate instance
- Verify recovered data
- Document procedure
Lab 3: Multi-DC failover
Tasks:
- Setup 2-DC cluster
- Simulate DC1 total failure
- Manually promote DC2
- Update application config
- Measure downtime
- Document lessons learned
Lab 4: Chaos engineering
Tasks:
- Implement chaos monkey script
- Run for 24 hours
- Monitor cluster behavior
- Document failures and recoveries
- Identify weak points
- Improve HA configuration
11. Summary
DR Drill Frequency
Scenario Frequency:
- Single node failure: Monthly (automated)
- Leader failover: Quarterly
- DC failure: Semi-annually
- PITR recovery: Quarterly
- Full DR: Annually
Success Criteria
A successful DR drill has:
✅ Met RTO/RPO targets
✅ Zero data loss (or within RPO)
✅ All team members participated
✅ Documentation updated
✅ Action items identified
✅ Post-mortem completed
✅ Next drill scheduled
Key Metrics to Track
- Detection time (how fast we notice)
- Response time (how fast we act)
- Recovery time (how fast we restore)
- Data loss (how much data was lost)
- Team coordination (how well we work together; a tracking-table sketch follows)
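One way to keep these numbers comparable across drills is to record them after every exercise. The table below is a hypothetical sketch (not part of the course environment), populated with the figures from the scorecard in section 7.1:
psql -h leader -U postgres <<'EOF'
CREATE TABLE IF NOT EXISTS dr_drill_log (
    drill_date      date    NOT NULL,
    scenario        text    NOT NULL,
    detection_secs  integer,          -- how fast we noticed
    response_secs   integer,          -- how fast we acted
    recovery_secs   integer,          -- how fast we restored (RTA)
    data_loss_bytes bigint,           -- RPA
    notes           text
);
INSERT INTO dr_drill_log
VALUES ('2024-11-25', 'leader failover', 15, 20, 35, 0, 'see post-mortem');
EOF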
Next Steps
Lesson 28 will cover HA Architecture Design:
- Requirements gathering
- Architecture design documents
- Capacity planning
- Cost estimation
- Design review process