Lesson 16: Backup and Point-in-Time Recovery (PITR)
Use pg_basebackup, configure WAL archiving and continuous archiving, and perform Point-in-Time Recovery (PITR).
Objectives
After this lesson, you will be able to:
- Set up WAL archiving
- Take backups with pg_basebackup
- Configure continuous archiving
- Restore a database to a specific point in time
- Automate backup strategies
- Implement a disaster recovery plan
1. Backup Strategies Overview
1.1. Types of backups
A. Logical Backup
# pg_dump / pg_dumpall
pg_dump -h localhost -U postgres mydb > mydb.sql
pg_dumpall -h localhost -U postgres > cluster.sql
# Pros:
# ✅ Easy to restore specific tables
# ✅ Portable across PostgreSQL versions
# ✅ Human-readable (text)
# Cons:
# ❌ Slow for large databases
# ❌ Not suitable for PITR
# ❌ Restore is slow (data reload plus index rebuilds)
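For selective restores, pg_dump's custom format is usually a better fit than plain SQL; a minimal sketch, assuming a database named mydb and a table named users:
pg_dump -h localhost -U postgres -Fc mydb > mydb.dump
# Restore a single table from the dump (e.g., into a scratch database)
pg_restore -h localhost -U postgres -d mydb_restore -t users mydb.dump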
B. Physical Backup
# pg_basebackup / File system snapshot
pg_basebackup -D /backup/base -Ft -z -P
# Pros:
# ✅ Fast backup and restore
# ✅ Enables PITR with WAL archiving
# ✅ Consistent snapshot
# Cons:
# ❌ Cannot restore individual tables
# ❌ Must match PostgreSQL version
# ❌ Larger backup size
C. Continuous Archiving (WAL Archiving)
WAL files archived continuously
+ Base backup
= Point-in-Time Recovery capability
# Pros:
# ✅ Can restore to ANY point in time
# ✅ Minimal data loss (RPO: seconds)
# ✅ Online backup (no downtime)
# Cons:
# ❌ More complex setup
# ❌ Requires storage for WAL archives
# ❌ More moving parts
1.2. RTO and RPO
RTO (Recovery Time Objective) = How long to restore service?
pg_basebackup only: ~30min - 2hrs (depends on size)
PITR: Same + WAL replay time (~10-30min)
Streaming replication: ~30-60s (failover)
RPO (Recovery Point Objective) = How much data loss acceptable?
Daily backup: Up to 24 hours data loss ❌
WAL archiving: Up to last archived segment (~16MB)
Synchronous replication: Zero data loss ✅
1.3. Backup strategy decision matrix
| Requirement | Solution |
|---|---|
| Zero data loss | Synchronous replication + PITR |
| Fast recovery (<1hr) | Streaming replication |
| PITR capability | WAL archiving + pg_basebackup |
| Long-term retention | Periodic pg_basebackup |
| Disaster recovery | Off-site backups + PITR |
2. WAL Archiving Setup
2.1. Understanding WAL archiving
WAL (Write-Ahead Log) = Transaction log files
Normal operation:
Transaction → WAL file → Data files
WAL archiving:
Transaction → WAL file → Data files
↓
Archive location (safe storage)
WAL segments:
# Default: 16MB per segment
# Location: $PGDATA/pg_wal/
ls -lh /var/lib/postgresql/18/data/pg_wal/
# 000000010000000000000001 (16MB)
# 000000010000000000000002 (16MB)
# 000000010000000000000003 (16MB)
# ...
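To see which segment the server is currently writing, the current LSN can be mapped to a WAL file name with built-in functions (output will differ on your cluster):
sudo -u postgres psql -c "SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());"
# Returns the current LSN and the name of the WAL segment that contains it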
2.2. Configure WAL archiving
PostgreSQL configuration
# Edit postgresql.conf or use ALTER SYSTEM
# (note: several ALTER SYSTEM statements cannot be sent as a single psql -c string,
#  because they would run in one implicit transaction block; feed them via stdin instead)
sudo -u postgres psql <<'SQL'
ALTER SYSTEM SET wal_level = 'replica';        -- or 'logical'
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';
ALTER SYSTEM SET archive_timeout = 300;        -- force a WAL switch every 5 min
SQL
# Restart PostgreSQL
sudo systemctl restart postgresql
Parameters explained:
wal_level: 'replica'
# 'minimal': No archiving possible
# 'replica': Required for archiving and replication
# 'logical': For logical replication
archive_mode: 'on'
# Enable archiving
archive_command: 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
# %f = WAL filename (e.g., 000000010000000000000001)
# %p = WAL full path (e.g., /var/lib/postgresql/18/data/pg_wal/000000010000000000000001)
# test ! -f = Don't overwrite existing files
# cp = Copy to archive location
archive_timeout: 300
# Force WAL switch every 5 minutes (even if not full)
# Ensures RPO <= 5 minutes
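A quick sanity check after the restart, to confirm the settings above took effect (a sketch; parameter names as configured above):
sudo -u postgres psql -c "SELECT name, setting FROM pg_settings
WHERE name IN ('wal_level', 'archive_mode', 'archive_command', 'archive_timeout');"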
Create archive directory
# On primary server
sudo mkdir -p /mnt/wal_archive
sudo chown postgres:postgres /mnt/wal_archive
sudo chmod 700 /mnt/wal_archive
# Verify archiving working
sudo -u postgres psql -c "SELECT pg_switch_wal();"
# Forces current WAL file to be archived
# Check archive
ls -lh /mnt/wal_archive/
# Should see WAL files appearing
2.3. Advanced archive commands
A. Archive to remote server (rsync)
# archive_command using rsync
archive_command = 'rsync -a %p backup-server:/mnt/wal_archive/%f'
B. Archive to S3 (wal-g)
# Install wal-g
wget https://github.com/wal-g/wal-g/releases/download/v2.0.1/wal-g-pg-ubuntu-20.04-amd64.tar.gz
tar -xzf wal-g-pg-ubuntu-20.04-amd64.tar.gz
sudo mv wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g
sudo chmod +x /usr/local/bin/wal-g
# Configure (wal-g reads these as environment variables, so they must be present
# in the environment of the postgres process, e.g. exported via the service unit)
sudo -u postgres tee /var/lib/postgresql/.walrc <<EOF
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
WALG_S3_PREFIX=s3://my-bucket/postgres-wal
EOF
# Set archive_command
archive_command = '/usr/local/bin/wal-g wal-push %p'
C. Archive with compression
# Compress before archiving
archive_command = 'gzip < %p > /mnt/wal_archive/%f.gz'
# Or with pigz (parallel gzip)
archive_command = 'pigz < %p > /mnt/wal_archive/%f.gz'
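If WAL is compressed on its way into the archive, the restore_command used later for recovery (section 4.3) must decompress it again; a matching sketch for the gzip variant above:
restore_command = 'gunzip < /mnt/wal_archive/%f.gz > %p'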
2.4. Monitor archiving
-- Check archiving status
SELECT archived_count,
failed_count,
last_archived_wal,
last_archived_time,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
-- Example output:
-- archived_count | failed_count | last_archived_wal | last_archived_time
-- ----------------+--------------+--------------------------+-----------------------------
-- 1234 | 0 | 000000010000000000000056 | 2024-11-25 10:30:15.123456
-- If failed_count > 0, check logs!
# Check PostgreSQL logs for archive errors
sudo journalctl -u postgresql | grep -i archive
# Common errors:
# - Permission denied on archive directory
# - Archive directory full
# - Network timeout (for remote archiving)
3. Base Backup with pg_basebackup
3.1. Basic pg_basebackup
# Full backup to directory
sudo -u postgres pg_basebackup \
-D /backup/base/$(date +%Y%m%d_%H%M%S) \
-Fp \
-Xs \
-P \
-v
# Flags:
# -D: Destination directory
# -Fp: Plain format (directory)
# -Xs: Stream WAL during backup (ensures consistency)
# -P: Show progress
# -v: Verbose
Output:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/6000000 on timeline 3
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_12345"
245678/245678 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/6000168
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
3.2. Compressed tar backup
# Backup as compressed tar
sudo -u postgres pg_basebackup \
-D /backup/tar \
-Ft \
-z \
-P \
-v
# Flags:
# -Ft: Tar format
# -z: Gzip compression
# Result:
ls -lh /backup/tar/
# base.tar.gz (main data)
# pg_wal.tar.gz (WAL files)
# backup_manifest (verification)
3.3. Backup to remote server
# Stream directly to remote server
sudo -u postgres pg_basebackup \
-D - \
-Ft \
-X fetch \
-z \
| ssh backup-server "cat > /backup/postgres-$(date +%Y%m%d).tar.gz"
# Note: when the tar is written to stdout, WAL cannot be streamed (-Xs); use -X fetch
3.4. Backup with replication slot
# Create replication slot first
sudo -u postgres psql -c "
SELECT pg_create_physical_replication_slot('backup_slot');
"
# Backup using slot
sudo -u postgres pg_basebackup \
-D /backup/base/$(date +%Y%m%d) \
-Fp \
-Xs \
-P \
-S backup_slot
# Slot ensures WAL files aren't removed during backup
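A leftover physical slot keeps WAL around indefinitely, so check it and drop it once no more backups will use it; a minimal sketch using built-in functions:
# List slots and how much WAL they are holding back
sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
# Drop the slot when it is no longer needed
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('backup_slot');"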
3.5. Verify backup
# Check backup_manifest
cat /backup/base/20241125/backup_manifest | jq
# Verify checksum
sudo -u postgres pg_verifybackup /backup/base/20241125
# Output:
# backup successfully verified
# ✅
4. Point-in-Time Recovery (PITR)
4.1. PITR concepts
PITR = Restore database to any point in time (not just backup time)
Timeline:
T0: Base backup taken
↓
T1: Transaction A committed
↓
T2: Transaction B committed
↓
T3: Transaction C committed (ERROR! Want to undo)
↓
T4: Now
With PITR, can restore to T2 (before Transaction C)
Requirements:
- Base backup (pg_basebackup)
- WAL archive from backup until target time
- Recovery target specification
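Before starting, it is worth confirming the archive actually covers the backup; a minimal sketch, assuming the plain backup from section 3.1 and the archive directory used above:
# backup_label records the first WAL segment the restore will need
grep 'START WAL LOCATION' /backup/base/20241125/backup_label
# Every segment from that file up to the recovery target must be present here
ls /mnt/wal_archive/ | wc -l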
4.2. Prepare for PITR
Create recovery directory:
# Stop PostgreSQL on target server
sudo systemctl stop postgresql
# Backup current data (safety)
sudo mv /var/lib/postgresql/18/data /var/lib/postgresql/18/data.old
# Create new data directory
sudo mkdir -p /var/lib/postgresql/18/data
sudo chown postgres:postgres /var/lib/postgresql/18/data
Restore base backup:
# From plain directory backup
sudo cp -a /backup/base/20241125/* /var/lib/postgresql/18/data/
# Or from tar backup
cd /var/lib/postgresql/18/data
sudo -u postgres tar -xzf /backup/tar/base.tar.gz
sudo -u postgres tar -xzf /backup/tar/pg_wal.tar.gz
4.3. Configure recovery
Create recovery configuration:
# PostgreSQL 12+: Use recovery.signal + postgresql.conf
# Step 1: Create recovery.signal
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal
# Step 2: Add recovery settings to postgresql.auto.conf (postgresql.conf works too)
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf <<EOF
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-11-25 10:30:00'
recovery_target_action = 'promote'
EOF
Recovery parameters:
restore_command: 'cp /mnt/wal_archive/%f %p'
# How to fetch archived WAL files
# %f = WAL filename
# %p = Destination path
recovery_target_time: '2024-11-25 10:30:00'
# Restore to this timestamp
recovery_target_action: 'promote'
# After reaching target: promote to normal operation
# Options: 'pause', 'promote', 'shutdown'
4.4. Recovery target options
A. Recover to specific time
-- In postgresql.auto.conf
recovery_target_time = '2024-11-25 10:30:00'
B. Recover to specific transaction
-- Find transaction ID
SELECT txid_current(); -- Before bad transaction
-- In postgresql.auto.conf
recovery_target_xid = '12345678'
C. Recover to specific LSN
-- In postgresql.auto.conf
recovery_target_lsn = '0/6000000'
D. Recover to latest
-- In postgresql.auto.conf
# No recovery_target_* parameter
# Will replay all available WAL
E. Recovery target inclusive/exclusive
-- Default: inclusive (stop just AFTER the target is reached)
recovery_target_inclusive = 'on'
-- Exclusive: stop just BEFORE the target
recovery_target_inclusive = 'off'
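Besides time/xid/LSN, a named restore point created before a risky change also works as a target; a sketch (the label 'before_migration' is just an example):
# Before the risky operation, on the running primary:
sudo -u postgres psql -c "SELECT pg_create_restore_point('before_migration');"
# Later, during recovery, in postgresql.auto.conf:
# recovery_target_name = 'before_migration'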
4.5. Perform recovery
# Start PostgreSQL
sudo systemctl start postgresql
# Monitor logs
sudo journalctl -u postgresql -f
Log output:
2024-11-25 11:00:00 LOG: starting PostgreSQL 18.0
2024-11-25 11:00:01 LOG: starting point-in-time recovery to 2024-11-25 10:30:00+00
2024-11-25 11:00:02 LOG: redo starts at 0/6000000
2024-11-25 11:00:03 LOG: restored log file "000000010000000000000006" from archive
2024-11-25 11:00:05 LOG: restored log file "000000010000000000000007" from archive
2024-11-25 11:00:08 LOG: recovery stopping before commit of transaction 12345678, time 2024-11-25 10:30:00
2024-11-25 11:00:09 LOG: pausing at the end of recovery
2024-11-25 11:00:09 HINT: Execute pg_wal_replay_resume() to promote.
If recovery pauses (it does with recovery_target_action = 'pause', the default; with 'promote' as configured above the server promotes by itself), resume or promote manually:
-- Check status
SELECT pg_is_in_recovery();
-- true
-- Resume (will promote if action = promote)
SELECT pg_wal_replay_resume();
-- Or promote manually
SELECT pg_promote();
Verify recovery:
-- Check database state
SELECT pg_is_in_recovery();
-- false (if promoted)
-- Verify data
SELECT * FROM critical_table WHERE created_at >= '2024-11-25 10:25:00';
-- Should see data up to recovery target time
4.6. Timeline after PITR
After PITR, timeline increments:
Original timeline: 3
After PITR: 4
The new timeline keeps WAL written after the recovery separate from WAL the old timeline produced after the recovery point, so the two histories cannot be mixed up
# Check new timeline
sudo -u postgres psql -c "
SELECT timeline_id FROM pg_control_checkpoint();
"
# timeline_id
# ------------
# 4
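The timeline switch also writes a small history file that is archived together with the WAL; seeing it in the archive confirms the new timeline is being archived (assuming the archive path used above):
ls /mnt/wal_archive/*.history
# e.g. 00000004.history after the switch to timeline 4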
5. Automation with Patroni
5.1. Patroni WAL archiving
Configure in patroni.yml:
postgresql:
  parameters:
    wal_level: replica
    archive_mode: 'on'
    archive_command: 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
    archive_timeout: 300
  # Restore command for replicas
  recovery_conf:
    restore_command: 'cp /mnt/wal_archive/%f %p'
Patroni automatically:
- Configures archiving on primary
- Configures restore on replicas
- Handles timeline changes
5.2. Backup script
#!/bin/bash
# backup.sh - Automated PostgreSQL backup
set -e
BACKUP_DIR="/backup/postgres"
RETENTION_DAYS=7
DATE=$(date +%Y%m%d_%H%M%S)
# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"
# Run pg_basebackup
sudo -u postgres pg_basebackup \
-D "$BACKUP_DIR/$DATE" \
-Fp \
-Xs \
-P \
-v \
-c fast
# Verify backup
sudo -u postgres pg_verifybackup "$BACKUP_DIR/$DATE"
# Create metadata
cat > "$BACKUP_DIR/$DATE/backup_info.txt" <<EOF
Backup Date: $(date)
Hostname: $(hostname)
PostgreSQL Version: $(sudo -u postgres psql -t -c "SELECT version();")
Database Size: $(du -sh "$BACKUP_DIR/$DATE" | awk '{print $1}')
EOF
# Remove old backups (-mindepth 1 keeps $BACKUP_DIR itself from matching)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;
# Log
echo "$(date): Backup completed successfully: $BACKUP_DIR/$DATE" | tee -a /var/log/postgres-backup.log
# Optional: Upload to S3
# aws s3 sync "$BACKUP_DIR/$DATE" "s3://my-bucket/postgres-backups/$DATE/"
# Send notification
# curl -X POST https://hooks.slack.com/... -d '{"text":"Backup completed"}'
Schedule with cron:
# Run daily at 2 AM from root's crontab (the script itself calls sudo -u postgres)
sudo crontab -e
# Add:
0 2 * * * /usr/local/bin/backup.sh >> /var/log/postgres-backup.log 2>&1
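A systemd timer (mentioned in the best practices below) is an alternative to cron; a minimal sketch, assuming the script lives at /usr/local/bin/backup.sh — the unit names are placeholders:
sudo tee /etc/systemd/system/pg-backup.service <<'EOF'
[Unit]
Description=PostgreSQL base backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
EOF
sudo tee /etc/systemd/system/pg-backup.timer <<'EOF'
[Unit]
Description=Daily PostgreSQL base backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now pg-backup.timer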
5.3. WAL-G integration
Install WAL-G:
# Download
wget https://github.com/wal-g/wal-g/releases/download/v2.0.1/wal-g-pg-ubuntu-20.04-amd64.tar.gz
tar -xzf wal-g-pg-ubuntu-20.04-amd64.tar.gz
sudo mv wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g
sudo chmod +x /usr/local/bin/wal-g
Configure:
# Create config
sudo -u postgres tee /var/lib/postgresql/.walrc <<EOF
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_REGION=us-east-1
WALG_S3_PREFIX=s3://my-bucket/postgres
WALG_COMPRESSION_METHOD=lz4
WALG_DELTA_MAX_STEPS=6
EOF
# Update postgresql.conf
archive_command = '/usr/local/bin/wal-g wal-push %p'
restore_command = '/usr/local/bin/wal-g wal-fetch %f %p'
Take backup with WAL-G:
# Full backup
sudo -u postgres wal-g backup-push /var/lib/postgresql/18/data
# List backups
sudo -u postgres wal-g backup-list
# name modified wal_segment_backup_start
# base_000000010000000000000004 2024-11-25T10:00:00Z 000000010000000000000004
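WAL-G can also enforce retention in object storage; a hedged sketch — it deletes data, so check wal-g delete --help for your version before running it:
# Keep the last 7 full backups, delete older backups and the WAL they need
sudo -u postgres wal-g delete retain FULL 7 --confirm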
Restore with WAL-G:
# Stop PostgreSQL
sudo systemctl stop postgresql
# Clear data directory
sudo rm -rf /var/lib/postgresql/18/data/*
# Restore latest backup
sudo -u postgres wal-g backup-fetch /var/lib/postgresql/18/data LATEST
# Or restore specific backup
sudo -u postgres wal-g backup-fetch /var/lib/postgresql/18/data base_000000010000000000000004
# Configure PITR (if needed)
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal
echo "recovery_target_time = '2024-11-25 10:30:00'" | \
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf
# Start PostgreSQL
sudo systemctl start postgresql
6. Disaster Recovery Planning
6.1. DR strategy
3-2-1 Rule:
3: Keep 3 copies of data
2: Store on 2 different media types
1: Keep 1 copy off-site
Example:
- Production database (live)
- Local backup (same datacenter)
- S3 backup (cloud, different region)
6.2. DR checklist
Preparation:
✅ WAL archiving enabled and tested
✅ Regular base backups (daily/weekly)
✅ Off-site backup storage (S3, remote server)
✅ Backup verification automated
✅ Restore procedures documented
✅ DR drills scheduled (quarterly)
✅ Monitoring and alerting configured
✅ Backup retention policy defined
✅ Encryption for backups (at rest and in transit)
✅ Access controls (who can restore)
6.3. DR scenarios and procedures
Scenario 1: Database corruption
# 1. Identify corruption
sudo -u postgres psql -c "SELECT * FROM corrupt_table;"   # fails with a read/checksum error
# 2. Stop PostgreSQL
sudo systemctl stop postgresql
# 3. Restore from last good backup
sudo rm -rf /var/lib/postgresql/18/data
sudo cp -a /backup/base/20241125 /var/lib/postgresql/18/data
# 4. Configure PITR to just before corruption (restore_command is needed to pull archived WAL)
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf <<EOF
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-11-25 09:55:00'
EOF
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal
# 5. Start and verify
sudo systemctl start postgresql
Scenario 2: Datacenter failure
# 1. Provision new servers in different region
# 2. Install PostgreSQL + Patroni
# 3. Restore from off-site backup (S3)
aws s3 sync s3://my-bucket/postgres-backups/20241125 /backup/restore/
sudo cp -a /backup/restore /var/lib/postgresql/18/data
sudo chown -R postgres:postgres /var/lib/postgresql/18/data
sudo chmod 700 /var/lib/postgresql/18/data
# 4. Restore WAL from S3
# (configure restore_command with wal-g or S3)
# 5. Start cluster
sudo systemctl start patroni
# 6. Update DNS/Load balancer to new region
# RTO: ~1-2 hours (depends on backup size and network)
Scenario 3: Accidental data deletion
# Oops: DELETE FROM users WHERE ...; (without WHERE clause)
# Option A: PITR to before deletion
# (See section 4.4)
# Option B: Restore to separate instance and copy data
sudo -u postgres pg_basebackup -D /tmp/restore ...
# Start on different port
# Copy missing data to production
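A sketch of Option B, assuming the copy was restored and fully recovered under /tmp/restore, that port 5433 is free, and that the lost rows were in table users of database mydb (all placeholder names):
# Start the restored copy next to production, without archiving its WAL
sudo -u postgres pg_ctl -D /tmp/restore -o "-p 5433 -c archive_mode=off" start
# Dump only the affected table's data from the restored copy
sudo -u postgres pg_dump -p 5433 --data-only -t users mydb > /tmp/users_data.sql
# Review the file, then load it back into production (handle duplicate keys first)
sudo -u postgres psql -p 5432 -d mydb -f /tmp/users_data.sql
# Stop the temporary instance when done
sudo -u postgres pg_ctl -D /tmp/restore stop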
6.4. Recovery metrics
RTO (Recovery Time Objective):
Target: < 2 hours for full DR
Breakdown:
- Detection: 5-10 min
- Decision: 10-15 min
- Restore base backup: 30-60 min
- Replay WAL: 10-30 min
- Verification: 10-20 min
- DNS/Traffic switch: 5-10 min
Total: ~70-145 min
RPO (Recovery Point Objective):
Target: < 5 minutes data loss
With synchronous replication: 0 (zero data loss)
With WAL archiving: < archive_timeout (e.g., 5 min)
With daily backup only: Up to 24 hours ❌
7. Monitoring and Alerting
7.1. Key backup metrics
-- Archive status
SELECT archived_count,
failed_count,
EXTRACT(EPOCH FROM (now() - last_archived_time)) AS seconds_since_last_archive
FROM pg_stat_archiver;
-- Alert if seconds_since_last_archive > 600 (10 min)
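A minimal cron-able version of that rule (600-second threshold assumed); it exits non-zero so a scheduler or monitoring wrapper can alert on it:
STALE=$(sudo -u postgres psql -Atc "
SELECT (failed_count > 0)
OR (COALESCE(EXTRACT(EPOCH FROM (now() - last_archived_time)), 1e9) > 600)
FROM pg_stat_archiver;")
[ "$STALE" = "f" ] || { echo "WAL archiving unhealthy"; exit 1; }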
# Backup age
find /backup/base -maxdepth 1 -type d -name "202*" -mtime -1 | wc -l
# Should be >= 1 (at least one backup in last 24h)
7.2. Prometheus metrics
# Alert rules
groups:
  - name: backup_alerts
    rules:
      - alert: PostgreSQLArchivingFailed
        expr: pg_stat_archiver_failed_count > 0
        labels:
          severity: critical
        annotations:
          summary: "WAL archiving failures detected"
      - alert: PostgreSQLNoRecentBackup
        expr: time() - pg_backup_last_success_timestamp > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No backup in last 24 hours"
      - alert: PostgreSQLArchiveDelayHigh
        expr: |
          time() - pg_stat_archiver_last_archived_time > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WAL archiving delayed > 10 min"
7.3. Backup verification
#!/bin/bash
# verify-backup.sh
BACKUP_DIR="/backup/base"
LATEST_BACKUP=$(ls -td $BACKUP_DIR/*/ | head -1)
echo "Verifying: $LATEST_BACKUP"
# 1. Check backup_manifest exists
if [ ! -f "$LATEST_BACKUP/backup_manifest" ]; then
echo "❌ backup_manifest missing"
exit 1
fi
# 2. Run pg_verifybackup
if sudo -u postgres pg_verifybackup "$LATEST_BACKUP" > /dev/null 2>&1; then
echo "✅ Backup verified successfully"
else
echo "❌ Backup verification failed"
exit 1
fi
# 3. Check size (should be reasonable)
SIZE=$(du -sb "$LATEST_BACKUP" | awk '{print $1}')
MIN_SIZE=1000000000 # 1GB minimum
if [ "$SIZE" -lt "$MIN_SIZE" ]; then
echo "⚠️ Backup size suspicious: $(du -sh "$LATEST_BACKUP" | awk '{print $1}')"
exit 1
fi
# 4. Test restore to temp location (optional, resource-intensive)
# ...
echo "✅ All backup checks passed"
8. Best Practices
✅ DO
- Enable WAL archiving - Required for PITR
- Automate backups - Daily pg_basebackup via cron/systemd timer
- Test restores regularly - Monthly DR drills
- Monitor archiving - Alert on failures
- Keep multiple backup generations - 7 daily + 4 weekly + 12 monthly
- Off-site backups - S3, different region/datacenter
- Encrypt backups - At rest and in transit
- Document procedures - Runbooks for restoration
- Verify backups - pg_verifybackup after each backup
- Calculate RTO/RPO - Know your limits
❌ DON'T
- Don't skip testing - Untested backup = no backup
- Don't store only locally - Datacenter failure = data loss
- Don't ignore archive failures - Silent data loss risk
- Don't delete WAL too early - Need for PITR
- Don't forget retention - Storage costs vs recovery needs
- Don't backup to same disk - Disk failure = everything lost
- Don't skip encryption - Security/compliance risk
- Don't assume it works - Verify, verify, verify
9. Lab Exercises
Lab 1: Setup WAL archiving
Tasks:
- Configure archive_mode and archive_command
- Create archive directory
- Force WAL switch: SELECT pg_switch_wal();
- Verify WAL files in archive directory
- Monitor pg_stat_archiver
Lab 2: Take base backup
Tasks:
- Use pg_basebackup to create backup
- Verify with pg_verifybackup
- Calculate backup size and time
- Compress backup and compare size
- Document backup metadata
Lab 3: Perform PITR
Tasks:
- Create test table with timestamps
- Take base backup
- Insert more data
- Note specific timestamp
- Insert "bad" data after timestamp
- Restore to timestamp (before bad data)
- Verify recovery point is correct
Lab 4: Automate backup
Tasks:
- Write backup script with retention
- Add error handling and notifications
- Schedule with cron
- Test script execution
- Monitor backup logs
Lab 5: DR drill
Tasks:
- Simulate total database loss (remove data directory)
- Restore from backup
- Replay WAL to latest point
- Measure RTO (time to restore)
- Verify data integrity
- Document lessons learned
10. Troubleshooting
Issue: Archiving not working
Symptoms: failed_count > 0 in pg_stat_archiver
Check:
# Check archive_command manually
sudo -u postgres bash -c 'f=000000010000000000000001; p=/var/lib/postgresql/18/data/pg_wal/$f; test ! -f /mnt/wal_archive/$f && cp $p /mnt/wal_archive/$f'
echo $? # Should be 0
Common causes:
- Permission denied on archive directory
- Archive directory doesn't exist
- Disk full
- Network issue (if remote archiving)
Issue: pg_basebackup slow
Optimize:
# Use compression
pg_basebackup -z ...
# Start the checkpoint immediately instead of waiting for a scheduled one
pg_basebackup -c fast ...
# pg_basebackup itself is single-threaded; for parallel backups consider
# a dedicated tool such as pgBackRest or WAL-G
# Use faster storage for backup destination
Issue: PITR fails - WAL not found
Error: could not open file "000000010000000000000007": No such file or directory
Cause: WAL file not in archive or restore_command wrong
Fix:
# Check WAL exists in archive
ls -lh /mnt/wal_archive/000000010000000000000007
# Test restore_command manually
sudo -u postgres cp /mnt/wal_archive/000000010000000000000007 /tmp/test_wal
# If missing, cannot recover to that point
# Restore to earlier point or use latest available WAL
11. Summary
Backup Strategy Summary
| Method | RPO | RTO | Complexity | Use Case |
|---|---|---|---|---|
| pg_dump | Hours/Days | Hours | Low | Small DBs, migration |
| pg_basebackup | Last backup | 30-120 min | Medium | Regular backups |
| WAL archiving | Minutes | 30-120 min | Medium | PITR capability |
| Replication | Seconds/0 | 30-60 sec | High | HA, zero data loss |
Key Concepts
✅ WAL Archiving - Continuous backup of transaction logs
✅ pg_basebackup - Physical backup of entire cluster
✅ PITR - Restore to any point in time using base backup + WAL
✅ RTO - How fast can you recover
✅ RPO - How much data loss is acceptable
✅ 3-2-1 Rule - 3 copies, 2 media types, 1 off-site
Recovery Checklist
- Stop PostgreSQL
- Restore base backup
- Create recovery.signal
- Configure restore_command
- Set recovery target (time/xid/lsn)
- Start PostgreSQL
- Monitor recovery logs
- Verify recovery point
- Promote if satisfied
- Update replication if needed
Next Steps
Lesson 17 will cover Monitoring and Observability:
- Prometheus + Grafana setup
- Key PostgreSQL metrics
- Patroni monitoring
- Log aggregation
- Alerting strategies
- Performance dashboards