Lesson 14: Planned Switchover
Distinguishing planned switchover from failover, when a switchover is needed, zero-downtime maintenance, and performing a switchover safely.
Objectives
After this lesson, you will be able to:
- Distinguish switchover from failover
- Perform a planned switchover safely
- Understand graceful vs immediate switchover
- Minimize downtime during maintenance
- Automate switchover for rolling updates
- Handle switchover in production
1. Switchover Overview
1.1. What is a switchover?
Switchover = a planned, controlled promotion of a replica to primary.
So sánh với Failover:
| Aspect | Failover | Switchover |
|---|---|---|
| Trigger | Primary failure (unplanned) | Manual/scheduled (planned) |
| Downtime | 30-60 seconds | 0-10 seconds |
| Data loss | Possible (if async) | Zero (controlled) |
| Control | Automatic | Manual/scripted |
| Timing | Unpredictable | Scheduled |
1.2. When do you need a switchover?
Common scenarios:
A. Hardware maintenance
Scenario: Need to replace failing disk on primary server
→ Switchover to replica
→ Perform maintenance on old primary
→ Keep as replica or switchover back
B. Software upgrades
Scenario: OS kernel update requires reboot
→ Switchover to replica
→ Update & reboot old primary
→ Verify, then switchover back (optional)
C. Database migration
Scenario: Move database to larger server
→ Add new server as replica
→ Switchover to new server
→ Remove old server
D. Datacenter migration
Scenario: Move from DC1 to DC2
→ Setup replicas in DC2
→ Switchover primary to DC2
→ Decommission DC1 nodes
E. Testing
Scenario: Test HA readiness before production
→ Perform switchover in staging
→ Validate application behavior
→ Measure downtime
1.3. Switchover Benefits
✅ Zero data loss - All transactions committed before switch
✅ Controlled timing - During maintenance window
✅ Lower risk - Coordinated, tested process
✅ Minimal downtime - 0-10 seconds vs 30-60 for failover
✅ Reversible - Can switchover back if issues
2. Types of Switchover
2.1. Graceful Switchover (Default)
Process:
1. Verify cluster healthy
2. Wait for replication lag = 0
3. Stop new connections to old primary
4. Wait for active transactions to complete
5. Promote new primary
6. Reconfigure old primary as replica
Downtime: ~5-10 seconds ✅
Data loss: None ✅
Command:
patronictl switchover postgres
2.2. Immediate Switchover
Process:
1. Terminate active connections on the old primary (fast shutdown)
2. Demote old primary (force if needed)
3. Immediately promote the replica
Downtime: ~2-5 seconds ✅
Data loss: None for committed data; in-flight (uncommitted) transactions are rolled back ⚠️
Command:
patronictl switchover postgres --force
(--force itself only skips patronictl's confirmation prompt; the abrupt connection termination comes from the fast shutdown used during demotion.)
2.3. Scheduled Switchover
Process:
1. Schedule switchover at specific time
2. Patroni waits until scheduled time
3. Performs graceful switchover automatically
Downtime: ~5-10 seconds ✅
Automation: Full ✅
Command:
patronictl switchover postgres --scheduled 2024-11-25T02:00:00
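Rather than hardcoding the timestamp, you can compute it at run time. A minimal sketch, assuming GNU date (the %:z format emits the UTC offset that Patroni's ISO 8601 parser accepts):
# Schedule a switchover 30 minutes from now
WHEN=$(date -d '+30 minutes' +%Y-%m-%dT%H:%M:%S%:z)
patronictl switchover postgres \
  --master node1 \
  --candidate node2 \
  --scheduled "$WHEN" \
  --force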
3. Switchover Prerequisites
3.1. Cluster health check
# 1. Verify all nodes running
patronictl list postgres
# Expected:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1 | 10.0.1.11:5432| Leader | running | 2 | |
# | node2 | 10.0.1.12:5432| Replica | running | 2 | 0 | ✅
# | node3 | 10.0.1.13:5432| Replica | running | 2 | 0 | ✅
# +--------+---------------+---------+---------+----+-----------+
# All nodes must be:
# - State: running ✅
# - Lag: 0 or very low ✅
# - Same timeline ✅
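To script the same-timeline check instead of eyeballing the table, a small sketch against the REST API; it assumes the /patroni status endpoint exposes a top-level timeline field and that jq is installed:
# Collect the timeline reported by each node and assert they match
timelines=$(for h in 10.0.1.11 10.0.1.12 10.0.1.13; do
  curl -s "http://$h:8008/patroni" | jq -r '.timeline'
done | sort -u)
if [ "$(echo "$timelines" | wc -l)" -eq 1 ]; then
  echo "✅ All nodes on timeline $timelines"
else
  echo "❌ Timeline mismatch: $timelines"
fi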
3.2. Replication lag check
# Check lag on all replicas
sudo -u postgres psql -h 10.0.1.11 -c "
SELECT application_name,
client_addr,
state,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
replay_lag
FROM pg_stat_replication
ORDER BY lag_bytes DESC;
"
# Desired:
# application_name | client_addr | state | lag_bytes | replay_lag
# -----------------+-------------+-----------+-----------+------------
# node2 | 10.0.1.12 | streaming | 0 | 00:00:00 ✅
# node3 | 10.0.1.13 | streaming | 0 | 00:00:00 ✅
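If lag is not yet zero, you can poll until it drains instead of re-running the query by hand. A sketch, reusing the primary address from above:
# Block until every replica has replayed all WAL (Ctrl+C to abort)
while :; do
  lag=$(sudo -u postgres psql -h 10.0.1.11 -At -c \
    "SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0) FROM pg_stat_replication;")
  [ "$lag" = "0" ] && { echo "✅ Lag is zero"; break; }
  echo "Waiting, max lag: $lag bytes"
  sleep 2
done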
3.3. Target candidate check
# Check whether the target carries the nofailover tag. Tags are set
# per node in patroni.yml, not in the dynamic configuration, so check
# the candidate's local config (adjust the path to your installation):
ssh node2 'grep -A5 "^tags:" /etc/patroni/patroni.yml'
# Target node should have:
tags:
  nofailover: false        # ✅ Can be promoted
  failover_priority: 100   # Higher = preferred (Patroni 3.0+)
# NOT:
tags:
  nofailover: true   # ❌ Cannot be promoted
3.4. Connection availability
# Test connection to target
psql -h 10.0.1.12 -U postgres -c "SELECT 1;"
# Test application user
psql -h 10.0.1.12 -U app_user -d myapp -c "SELECT 1;"
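The same check, looped over every member so that a node with a broken pg_hba.conf or firewall rule surfaces before the switchover rather than after (hosts and credentials reuse this lesson's layout):
# Verify client connectivity to every member before proceeding
for h in 10.0.1.11 10.0.1.12 10.0.1.13; do
  if psql -h "$h" -U app_user -d myapp -Atc "SELECT 1;" >/dev/null 2>&1; then
    echo "✅ $h reachable"
  else
    echo "❌ $h unreachable"
  fi
done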
4. Performing Switchover
4.1. Interactive Switchover (Recommended)
Step-by-step:
# 1. Initiate switchover
patronictl switchover postgres
# Patroni prompts:
Master [node1]: ← Current primary (press Enter to accept)
Candidate ['node2', 'node3'] []: ← Type target, e.g., "node2"
When should the switchover take place (e.g. 2024-11-25T10:00 ) [now]: ← Press Enter for immediate
Are you sure you want to switchover cluster postgres, demoting current master node1? [y/N]: y
Output:
2024-11-25 10:30:00.123 UTC [INFO]: Switching over from node1 to node2
2024-11-25 10:30:02.456 UTC [INFO]: Waiting for replica node2 to catch up...
2024-11-25 10:30:02.789 UTC [INFO]: Replica node2 lag: 0 bytes ✅
2024-11-25 10:30:03.012 UTC [INFO]: Promoting node2...
2024-11-25 10:30:05.234 UTC [INFO]: node2 promoted successfully
2024-11-25 10:30:06.567 UTC [INFO]: Demoting node1...
2024-11-25 10:30:08.890 UTC [INFO]: node1 reconfigured as replica
2024-11-25 10:30:10.123 UTC [INFO]: Switchover completed ✅
Total time: 10 seconds
4.2. Non-interactive Switchover
Direct command:
# Specify master and candidate explicitly
patronictl switchover postgres \
--master node1 \
--candidate node2 \
--force
# --force: Skip confirmation prompt
# Note: newer Patroni releases deprecate --master in favor of --leader
4.3. Scheduled Switchover
Schedule for maintenance window:
# Schedule switchover at 2 AM
patronictl switchover postgres \
--master node1 \
--candidate node2 \
--scheduled "2024-11-25T02:00:00"
# Patroni will automatically execute at scheduled time
Verify scheduled switchover:
# Check pending actions
curl -s http://10.0.1.11:8008/patroni | jq '.scheduled_switchover'
# Output:
# {
# "at": "2024-11-25T02:00:00+00:00",
# "from": "node1",
# "to": "node2"
# }
Cancel scheduled switchover:
# If plans change
patronictl flush postgres switchover
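The REST API offers an equivalent: Patroni documents DELETE /switchover for removing a scheduled switchover.
# Cancel via the REST API (send to the leader)
curl -s -X DELETE http://10.0.1.11:8008/switchover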
4.4. Switchover with REST API
Trigger via API:
# POST to current leader
curl -X POST http://10.0.1.11:8008/switchover \
-H "Content-Type: application/json" \
-d '{
"leader": "node1",
"candidate": "node2"
}'
# Response:
# {
# "status": "ok",
# "message": "Switchover scheduled"
# }
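The same endpoint accepts a scheduled_at field (per the Patroni REST API docs), so scheduling also works over HTTP; the timestamp below is only an example:
curl -X POST http://10.0.1.11:8008/switchover \
  -H "Content-Type: application/json" \
  -d '{
    "leader": "node1",
    "candidate": "node2",
    "scheduled_at": "2024-11-25T02:00:00+00:00"
  }'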
5. Switchover Timeline
5.1. Detailed flow
T+0s: INITIATE SWITCHOVER
Command: patronictl switchover postgres --master node1 --candidate node2
T+0.5s: PRE-CHECKS
✓ node1 is current leader
✓ node2 is healthy replica
✓ node2 replication lag: 0 bytes
✓ node2 timeline matches: 2
T+1s: PREPARE OLD PRIMARY (node1)
- Issue CHECKPOINT to shorten the upcoming shutdown
T+2s: WAIT FOR LAG = 0
- Monitor: pg_stat_replication.replay_lag
- node2 lag: 0 bytes ✅
- All WAL replayed
T+3s: DEMOTE OLD PRIMARY (node1)
- PostgreSQL stopped with a fast shutdown: new connections refused,
  in-flight transactions rolled back, all committed WAL flushed
  and streamed to the replicas (shutdown checkpoint record)
- Leader lock released in the DCS
T+5s: PROMOTE NEW PRIMARY (node2)
- Confirm node2 has received the old primary's final WAL position
- Acquire leader lock in DCS
- Promote PostgreSQL
- Timeline: 2 → 3
- Run callback: on_role_change
T+7s: VERIFY NEW PRIMARY
- pg_is_in_recovery() → false ✅
- Accepting connections
- Timeline = 3
T+8s: RECONFIGURE OLD PRIMARY (node1)
- Update primary_conninfo → node2:5432
- Create standby.signal
- Start PostgreSQL in recovery mode
- Follows timeline 3
T+10s: REPLICATION RESTORED
- node1 now streaming from node2
- node3 updated to stream from node2
- All replicas timeline = 3
T+10s: SWITCHOVER COMPLETE ✅
Primary: node2 (was replica)
Replica: node1 (was primary)
Replica: node3
Total downtime: ~5-10 seconds
Data loss: None ✅
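To observe this downtime from a client's point of view, run a probe loop during the switchover. A sketch; point it at the address your application actually uses (a VIP or pooler if you have one, otherwise the old primary as here):
# Probe once per second; RW = accepts writes, RO = replica, DOWN = no connection
while :; do
  state=$(psql -h 10.0.1.11 -U postgres -Atc \
    "SELECT CASE WHEN pg_is_in_recovery() THEN 'RO' ELSE 'RW' END;" 2>/dev/null || echo "DOWN")
  echo "$(date '+%H:%M:%S') $state"
  sleep 1
done
Against the old primary you should see RW flip briefly to DOWN, then settle at RO once it rejoins as a replica.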
5.2. What happens to active connections?
During switchover:
Client connections to the old primary (node1):
In both modes, demotion stops PostgreSQL with a fast shutdown:
- New connections: REFUSED once shutdown begins
- Active queries: CANCELED; in-flight transactions ROLLED BACK
- Committed data: flushed and streamed to the replicas before promotion ✅
To let active work drain first, quiesce the application before initiating the switchover (for example, PAUSE the pool in PgBouncer, or wait until pg_stat_activity shows no active writes).
Application behavior:
# Well-written application with retry logic
import time
import psycopg2

def execute_query():
    retries = 3
    for i in range(retries):
        try:
            conn = psycopg2.connect("host=10.0.1.11 ...")
            try:
                cursor = conn.cursor()
                cursor.execute("SELECT * FROM users;")
                return cursor.fetchall()
            finally:
                conn.close()
        except psycopg2.OperationalError:
            if i < retries - 1:
                time.sleep(1)  # Wait and retry
                continue
            raise
6. Verification After Switchover
6.1. Cluster status
patronictl list postgres
# Expected:
# + Cluster: postgres (7001234567890123456) ----+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +--------+---------------+---------+---------+----+-----------+
# | node1 | 10.0.1.11:5432| Replica | running | 3 | 0 | ← Was Leader
# | node2 | 10.0.1.12:5432| Leader | running | 3 | | ← Was Replica
# | node3 | 10.0.1.13:5432| Replica | running | 3 | 0 |
# +--------+---------------+---------+---------+----+-----------+
# Check:
# ✅ node2 is now Leader
# ✅ Timeline changed: 2 → 3
# ✅ All nodes running
# ✅ Replication lag = 0
6.2. Replication status
# On new primary (node2)
sudo -u postgres psql -h 10.0.1.12 -c "
SELECT application_name, client_addr, state, sync_state
FROM pg_stat_replication;
"
# Expected:
# application_name | client_addr | state | sync_state
# -----------------+-------------+-----------+------------
# node1 | 10.0.1.11 | streaming | async
# node3 | 10.0.1.13 | streaming | async
# Both replicas should be streaming from node2 ✅
6.3. Write test
# Insert on new primary
sudo -u postgres psql -h 10.0.1.12 -d testdb -c "
INSERT INTO test_table (data, created_at)
VALUES ('After switchover', NOW())
RETURNING *;
"
# Verify on replicas
sudo -u postgres psql -h 10.0.1.11 -d testdb -c "
SELECT * FROM test_table ORDER BY created_at DESC LIMIT 1;
"
sudo -u postgres psql -h 10.0.1.13 -d testdb -c "
SELECT * FROM test_table ORDER BY created_at DESC LIMIT 1;
"
# Should see the new row on both replicas ✅
6.4. Timeline verification
# Check timeline on all nodes
for node in 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo "=== $node ==="
  sudo -u postgres psql -h $node -c "
  SELECT timeline_id, pg_is_in_recovery() AS is_replica
  FROM pg_control_checkpoint();
  "
done
# All nodes should report timeline_id = 3;
# is_replica is f on the primary and t on the replicas:
# timeline_id | is_replica
# ------------+------------
#           3 | f   ← primary
#           3 | t   ← replicas
7. Switchover Best Practices
7.1. Pre-switchover checklist
#!/bin/bash
# pre-switchover-check.sh
echo "=== Pre-Switchover Checks ==="
# 1. Cluster health
echo "1. Checking cluster health..."
# Every member must be healthy (key names per `patronictl list -f json`)
not_ok=$(patronictl list postgres -f json | jq '[.[] | select(.State != "running" and .State != "streaming")] | length')
[ "$not_ok" -eq 0 ] || { echo "❌ $not_ok node(s) not healthy"; exit 1; }
echo "✅ All nodes running"
# 2. Replication lag
echo "2. Checking replication lag..."
lag=$(sudo -u postgres psql -h 10.0.1.11 -At -c "
SELECT COALESCE(MAX(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)), 0)
FROM pg_stat_replication;
" | tr -d '[:space:]')
if [ "$lag" -gt 1048576 ]; then # 1 MB
  echo "❌ Lag too high: $lag bytes"
  exit 1
fi
echo "✅ Lag acceptable: $lag bytes"
# 3. Target candidate available
echo "3. Checking target candidate..."
patronictl list postgres | grep node2 | grep -qE "running|streaming" || { echo "❌ node2 not available"; exit 1; }
echo "✅ Target candidate available"
# 4. No scheduled maintenance
echo "4. Checking scheduled actions..."
curl -s http://10.0.1.11:8008/patroni | jq -e '.scheduled_switchover == null' > /dev/null || {
  echo "⚠️ Another switchover already scheduled"
}
echo ""
echo "✅ All pre-checks passed. Safe to proceed."
7.2. Minimize downtime strategies
A. Connection pooler
Use PgBouncer/HAProxy between app and database:
App → PgBouncer → Primary
↓
Replicas
During switchover:
1. PgBouncer detects primary change
2. Reconnects to new primary automatically
3. Application sees minimal disruption
B. Read-replica routing
Route read queries to replicas during switchover:
- Write queries: Wait for new primary
- Read queries: Continue on replicas (may be slightly stale)
Result: Partial availability during switchover
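Both strategies lean on Patroni's REST health checks: GET /primary returns HTTP 200 only on the leader, and GET /replica only on healthy replicas, which is what HAProxy-style httpchk probes typically query. A quick sketch to classify nodes by hand:
# Classify each node by its Patroni health endpoints
for h in 10.0.1.11 10.0.1.12 10.0.1.13; do
  if [ "$(curl -s -o /dev/null -w '%{http_code}' http://$h:8008/primary)" = "200" ]; then
    echo "$h → primary (route writes here)"
  elif [ "$(curl -s -o /dev/null -w '%{http_code}' http://$h:8008/replica)" = "200" ]; then
    echo "$h → replica (route reads here)"
  fi
done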
C. Application-level retry
# Implement exponential backoff
import time
from psycopg2 import OperationalError

def execute_with_retry(query, max_retries=3):
    for i in range(max_retries):
        try:
            return execute_query(query)
        except OperationalError:
            if i == max_retries - 1:
                raise
            time.sleep(2 ** i)  # 1s, 2s, 4s
7.3. Communication plan
Before switchover:
T-24h: Announce maintenance window
- Email: ops@, dev@, stakeholders
- Slack: #incidents, #ops
- Status page: Update with scheduled maintenance
T-1h: Reminder notification
- Final checks
- Confirm go/no-go
T-5min: Begin maintenance
- Start switchover
- Monitor dashboards
During switchover:
- Real-time updates in ops channel
- Monitor metrics (latency, error rate)
- Have rollback plan ready
After switchover:
- Verify all systems operational
- Post-switchover validation
- Update documentation
- Send completion notification
8. Troubleshooting Switchover
8.1. Issue: Switchover command hangs
Symptoms: patronictl switchover never completes.
Diagnosis:
# Check what Patroni is waiting for
sudo journalctl -u patroni -f
# Common causes:
# A. High replication lag
sudo -u postgres psql -h 10.0.1.11 -c "
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"
# If lag > 0, Patroni waits for lag = 0
# B. Active long-running queries
sudo -u postgres psql -h 10.0.1.11 -c "
SELECT pid, usename, state, query_start, query
FROM pg_stat_activity
WHERE state = 'active' AND query_start < now() - interval '5 minutes';
"
# Kill blocking queries:
# SELECT pg_terminate_backend(pid);
Solution:
# Option 1: Wait for lag to catch up (recommended)
# Option 2: Use --force to skip wait (risk data loss)
# Option 3: Cancel and reschedule
Ctrl+C # Cancel current switchover attempt
8.2. Issue: Candidate not eligible
Symptoms: Error "candidate is not eligible".
Diagnosis:
# Tags live in each node's patroni.yml; check the candidate directly
# (adjust the config path to your installation):
ssh node2 'grep -A5 "^tags:" /etc/patroni/patroni.yml'
# If output shows:
# tags:
#   nofailover: true ← Problem!
Solution:
# Edit the tags section in node2's patroni.yml (per-node tags are local
# settings; patronictl edit-config only manages the dynamic configuration):
tags:
  nofailover: false # Change to false
# Restart Patroni on node2
sudo systemctl restart patroni
8.3. Issue: Old primary won't demote
Symptoms: Switchover fails, old primary still leader.
Diagnosis:
# Check Patroni logs on old primary
sudo journalctl -u patroni -n 100 | grep -i "demote\|error"
# Possible causes:
# - PostgreSQL won't stop
# - Active transactions won't terminate
# - File permission issues
Solution:
# Restart PostgreSQL on the old primary via the REST API; when it comes
# back up, Patroni re-reads the leader key and demotes the node
curl -X POST http://10.0.1.11:8008/restart
# Or manually:
sudo -u postgres psql -h 10.0.1.11 -c "
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE pid != pg_backend_pid();
"
sudo systemctl restart patroni
8.4. Issue: Replication broken after switchover
Symptoms: Old primary not replicating from new primary.
Diagnosis:
# Check replication status
patronictl list postgres
# e.g. node1 shows "stopped", "crashed", or "start failed"
# Check logs
sudo journalctl -u patroni -u postgresql -n 100
Solution:
# A. Restart Patroni (usually auto-fixes)
sudo systemctl restart patroni
# B. Manual reinit if needed
patronictl reinit postgres node1
# Patroni will:
# 1. Stop PostgreSQL on node1
# 2. Remove data directory
# 3. pg_basebackup from node2
# 4. Start as replica
9. Switchover Automation
9.1. Scripted switchover
#!/bin/bash
# automated-switchover.sh
set -e
CLUSTER="postgres"
OLD_PRIMARY="node1"
NEW_PRIMARY="node2"
echo "=== Starting Automated Switchover ==="
echo "From: $OLD_PRIMARY → To: $NEW_PRIMARY"
# Pre-checks
echo "Running pre-checks..."
./pre-switchover-check.sh || exit 1
# Perform switchover
echo "Executing switchover..."
patronictl switchover $CLUSTER \
--master $OLD_PRIMARY \
--candidate $NEW_PRIMARY \
--force
# Wait for completion
echo "Waiting for switchover to complete..."
sleep 15
# Post-checks
echo "Running post-checks..."
new_leader=$(patronictl list $CLUSTER | grep Leader | awk '{print $2}')
if [ "$new_leader" == "$NEW_PRIMARY" ]; then
echo "✅ Switchover successful!"
echo "New leader: $new_leader"
else
echo "❌ Switchover failed!"
echo "Current leader: $new_leader"
exit 1
fi
# Verify replication
echo "Verifying replication..."
patronictl list $CLUSTER
echo "=== Switchover Complete ==="
9.2. Ansible playbook
# switchover.yml
---
- name: Perform Patroni switchover
  hosts: localhost
  gather_facts: no
  vars:
    cluster_name: postgres
    old_primary: node1
    new_primary: node2
  tasks:
    - name: Pre-check cluster health
      command: patronictl list {{ cluster_name }}
      register: cluster_status
      changed_when: false
    - name: Verify all nodes running
      assert:
        that:
          - "'running' in cluster_status.stdout"
        fail_msg: "Not all nodes are running"
    - name: Execute switchover
      command: >
        patronictl switchover {{ cluster_name }}
        --master {{ old_primary }}
        --candidate {{ new_primary }}
        --force
      register: switchover_result
    - name: Wait for switchover completion
      pause:
        seconds: 15
    - name: Verify new leader
      command: patronictl list {{ cluster_name }}
      register: final_status
      changed_when: false
    - name: Display result
      debug:
        msg: "{{ final_status.stdout_lines }}"
    - name: Verify leadership
      assert:
        that:
          - "'{{ new_primary }}' in final_status.stdout"
          - "'Leader' in final_status.stdout"
        fail_msg: "Switchover failed"
        success_msg: "Switchover successful"
Run:
ansible-playbook switchover.yml
9.3. CI/CD integration
# .github/workflows/db-maintenance.yml
name: Database Maintenance Switchover
on:
  schedule:
    - cron: '0 2 * * 0' # Every Sunday at 2 AM
  workflow_dispatch: # Manual trigger
jobs:
  switchover:
    runs-on: self-hosted
    steps:
      - name: Notify start
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text": "Starting scheduled database switchover"}'
      - name: Pre-checks
        run: ./scripts/pre-switchover-check.sh
      - name: Execute switchover
        run: |
          patronictl switchover postgres \
            --master node1 \
            --candidate node2 \
            --force
      - name: Verify
        run: ./scripts/post-switchover-verify.sh
      - name: Notify completion
        if: always()
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} \
            -d '{"text": "Switchover completed: ${{ job.status }}"}'
10. Rolling Updates with Switchover
10.1. Update strategy
Scenario: apply a minor-version update, e.g. PostgreSQL 17.0 → 17.2. (A major-version jump such as 17 → 18 cannot be rolled this way: streaming replication requires the same major version on all nodes, so major upgrades need pg_upgrade or logical replication.)
Steps:
1. Update replica node3 (least critical)
- Stop Patroni
- Upgrade PostgreSQL
- Start Patroni
- Verify replication
2. Update replica node2
- Stop Patroni
- Upgrade PostgreSQL
- Start Patroni
- Verify replication
3. Switchover to node2 (now updated)
- patronictl switchover postgres --master node1 --candidate node2
4. Update old primary node1
- Stop Patroni
- Upgrade PostgreSQL
- Start Patroni (now replica)
- Verify replication
5. Optionally switchover back to node1
- patronictl switchover postgres --master node2 --candidate node1
Result: Zero-downtime upgrade ✅
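A sketch of step 1 for a single replica, assuming yum-based packaging and postgresql17 package names (adjust to your distribution and installation):
# Upgrade one replica in place (minor version only)
ssh node3 'sudo systemctl stop patroni'
ssh node3 'sudo yum update -y postgresql17 postgresql17-server'
ssh node3 'sudo systemctl start patroni'
# Wait for the node to rejoin and stream again
until patronictl list postgres | grep node3 | grep -qE "running|streaming"; do
  echo "Waiting for node3..."
  sleep 5
done
echo "✅ node3 upgraded and replicating"
Repeat for each replica, switch over, then upgrade the old primary the same way.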
10.2. Kernel update example
#!/bin/bash
# rolling-kernel-update.sh
NODES=("node1" "node2" "node3")
PRIMARY=$(patronictl list postgres | grep Leader | awk '{print $2}')
echo "Current primary: $PRIMARY"
# Update replicas first
for node in "${NODES[@]}"; do
if [ "$node" == "$PRIMARY" ]; then
continue # Skip primary for now
fi
echo "=== Updating $node ==="
ssh $node 'sudo yum update -y kernel && sudo reboot'
echo "Waiting for $node to come back..."
sleep 60
# Wait for node to rejoin
until patronictl list postgres | grep $node | grep -q "running"; do
echo "Waiting for $node..."
sleep 10
done
echo "✅ $node updated and rejoined"
done
# Now switchover from primary
NEW_PRIMARY=${NODES[1]} # Pick a replica
if [ "$NEW_PRIMARY" == "$PRIMARY" ]; then
NEW_PRIMARY=${NODES[2]}
fi
echo "=== Switching over from $PRIMARY to $NEW_PRIMARY ==="
patronictl switchover postgres \
--master $PRIMARY \
--candidate $NEW_PRIMARY \
--force
sleep 15
# Update old primary
echo "=== Updating $PRIMARY ==="
ssh $PRIMARY 'sudo yum update -y kernel && sudo reboot'
echo "Waiting for $PRIMARY to rejoin as replica..."
sleep 60
until patronictl list postgres | grep $PRIMARY | grep -qE "running|streaming"; do
  echo "Waiting for $PRIMARY..."
  sleep 10
done
echo "✅ All nodes updated!"
patronictl list postgres
11. Lab Exercises
Lab 1: Basic switchover
Tasks:
- Check current primary: patronictl list postgres
- Perform switchover: patronictl switchover postgres
- Measure downtime with a continuous query loop (see the probe sketch at the end of section 5.1)
- Verify new topology
- Document observations
Lab 2: Scheduled switchover
Tasks:
- Schedule switchover for 2 minutes from now
- Monitor logs during wait period
- Observe automatic execution
- Cancel a scheduled switchover (repeat and test cancel)
Lab 3: Forced vs graceful
Tasks:
- Create a long-running query: SELECT pg_sleep(300);
- Attempt a graceful switchover (observe the wait)
- Cancel and retry with --force
- Compare behavior and downtime
Lab 4: Rolling update simulation
Tasks:
- Start with 3-node cluster
- "Update" node3 (simulate by restarting)
- "Update" node2
- Switchover to node2
- "Update" node1
- Verify all nodes operational
Lab 5: Switchover under load
Tasks:
- Start pgbench: pgbench -c 10 -T 300
- During load, perform switchover
- Analyze pgbench output for errors
- Calculate success rate
- Test with connection pooler (PgBouncer)
12. Summary
Key Concepts
✅ Switchover = Planned, controlled role change
✅ Graceful = Wait for transactions (slower, safer)
✅ Immediate = Force termination (faster, riskier)
✅ Scheduled = Automated at specific time
✅ Zero downtime = Achievable with proper architecture
Switchover vs Failover
| Aspect | Switchover | Failover |
|---|---|---|
| Planning | Scheduled | Unplanned |
| Control | Manual | Automatic |
| Downtime | 0-10s | 30-60s |
| Data loss | None | Possible |
| Reversible | Yes | No |
Best Practices
- ✅ Test in staging first
- ✅ Schedule during low-traffic windows
- ✅ Use graceful mode (default)
- ✅ Verify lag = 0 before switchover
- ✅ Monitor during process
- ✅ Have rollback plan
- ✅ Communicate with stakeholders
- ✅ Document procedure
Next Steps
Lesson 15 will cover Recovering Failed Nodes:
- Rejoin old primary after failover
- pg_rewind usage and scenarios
- Full rebuild with pg_basebackup
- Timeline divergence resolution
- Split-brain recovery