Lesson 16: Backup and Point-in-Time Recovery (PITR)
Use pg_basebackup, configure WAL archiving and continuous archiving, and perform Point-in-Time Recovery (PITR).
Objectives
After this lesson, you will be able to:
- Set up WAL archiving
- Take backups with pg_basebackup
- Configure continuous archiving
- Restore a database to a specific point in time
- Automate backup strategies
- Implement a disaster recovery plan
1. Backup Strategies Overview
1.1. Types of backups
A. Logical Backup
# pg_dump / pg_dumpall
pg_dump -h localhost -U postgres mydb > mydb.sql
pg_dumpall -h localhost -U postgres > cluster.sql
# Pros:
# ✅ Easy to restore specific tables
# ✅ Portable across PostgreSQL versions
# ✅ Human-readable (text)
# Cons:
# ❌ Slow for large databases
# ❌ Not suitable for PITR
# ❌ Restore is slow (data reload plus index rebuilds)
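For selective restores, pg_dump's custom format is usually a better fit than plain SQL; a minimal sketch, assuming a database named mydb and a table named users:
pg_dump -h localhost -U postgres -Fc mydb > mydb.dump
# Restore a single table from the dump (e.g., into a scratch database)
pg_restore -h localhost -U postgres -d mydb_restore -t users mydb.dump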
B. Physical Backup
# pg_basebackup / File system snapshot
pg_basebackup -D /backup/base -Ft -z -P
# Pros:
# ✅ Fast backup and restore
# ✅ Enables PITR with WAL archiving
# ✅ Consistent snapshot
# Cons:
# ❌ Cannot restore individual tables
# ❌ Must match PostgreSQL version
# ❌ Larger backup size
C. Continuous Archiving (WAL Archiving)
WAL files archived continuously
+ Base backup
= Point-in-Time Recovery capability
# Pros:
# ✅ Can restore to ANY point in time
# ✅ Minimal data loss (RPO: seconds)
# ✅ Online backup (no downtime)
# Cons:
# ❌ More complex setup
# ❌ Requires storage for WAL archives
# ❌ More moving parts
1.2. RTO and RPO
RTO (Recovery Time Objective) = How long to restore service?
pg_basebackup only: ~30min - 2hrs (depends on size)
PITR: Same + WAL replay time (~10-30min)
Streaming replication: ~30-60s (failover)
RPO (Recovery Point Objective) = How much data loss acceptable?
Daily backup: Up to 24 hours data loss ❌
WAL archiving: Up to last archived segment (~16MB)
Synchronous replication: Zero data loss ✅
1.3. Backup strategy decision matrix
| Requirement | Solution |
|---|---|
| Zero data loss | Synchronous replication + PITR |
| Fast recovery (<1hr) | Streaming replication |
| PITR capability | WAL archiving + pg_basebackup |
| Long-term retention | Periodic pg_basebackup |
| Disaster recovery | Off-site backups + PITR |
2. WAL Archiving Setup
2.1. Understanding WAL archiving
WAL (Write-Ahead Log) = Transaction log files
Normal operation:
Transaction → WAL file → Data files
WAL archiving:
Transaction → WAL file → Data files
↓
Archive location (safe storage)
WAL segments:
# Default: 16MB per segment
# Location: $PGDATA/pg_wal/
ls -lh /var/lib/postgresql/18/data/pg_wal/
# 000000010000000000000001 (16MB)
# 000000010000000000000002 (16MB)
# 000000010000000000000003 (16MB)
# ...
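To see which segment the server is currently writing, the current LSN can be mapped to a WAL file name with built-in functions (output will differ on your cluster):
sudo -u postgres psql -c "SELECT pg_current_wal_lsn(), pg_walfile_name(pg_current_wal_lsn());"
# Returns the current LSN and the name of the WAL segment that contains it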
2.2. Configure WAL archiving
PostgreSQL configuration
# Edit postgresql.conf or use ALTER SYSTEM
# (note: several ALTER SYSTEM statements cannot be sent as a single psql -c string,
#  because they would run in one implicit transaction block; feed them via stdin instead)
sudo -u postgres psql <<'SQL'
ALTER SYSTEM SET wal_level = 'replica';        -- or 'logical'
ALTER SYSTEM SET archive_mode = 'on';
ALTER SYSTEM SET archive_command = 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f';
ALTER SYSTEM SET archive_timeout = 300;        -- force a WAL switch every 5 min
SQL
# Restart PostgreSQL
sudo systemctl restart postgresql
Parameters explained:
wal_level: 'replica'
# 'minimal': No archiving possible
# 'replica': Required for archiving and replication
# 'logical': For logical replication
archive_mode: 'on'
# Enable archiving
archive_command: 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
# %f = WAL filename (e.g., 000000010000000000000001)
# %p = WAL full path (e.g., /var/lib/postgresql/18/data/pg_wal/000000010000000000000001)
# test ! -f = Don't overwrite existing files
# cp = Copy to archive location
archive_timeout: 300
# Force WAL switch every 5 minutes (even if not full)
# Ensures RPO <= 5 minutes
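A quick sanity check after the restart, to confirm the settings above took effect (a sketch; parameter names as configured above):
sudo -u postgres psql -c "SELECT name, setting FROM pg_settings
WHERE name IN ('wal_level', 'archive_mode', 'archive_command', 'archive_timeout');"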
Create archive directory
# On primary server
sudo mkdir -p /mnt/wal_archive
sudo chown postgres:postgres /mnt/wal_archive
sudo chmod 700 /mnt/wal_archive
# Verify archiving working
sudo -u postgres psql -c "SELECT pg_switch_wal();"
# Forces current WAL file to be archived
# Check archive
ls -lh /mnt/wal_archive/
# Should see WAL files appearing
2.3. Advanced archive commands
A. Archive to remote server (rsync)
# archive_command using rsync
archive_command = 'rsync -a %p backup-server:/mnt/wal_archive/%f'
B. Archive to S3 (wal-g)
# Install wal-g
wget https://github.com/wal-g/wal-g/releases/download/v2.0.1/wal-g-pg-ubuntu-20.04-amd64.tar.gz
tar -xzf wal-g-pg-ubuntu-20.04-amd64.tar.gz
sudo mv wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g
sudo chmod +x /usr/local/bin/wal-g
# Configure (wal-g reads these as environment variables, so they must be present
# in the environment of the postgres process, e.g. exported via the service unit)
sudo -u postgres tee /var/lib/postgresql/.walrc <<EOF
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key
AWS_REGION=us-east-1
WALG_S3_PREFIX=s3://my-bucket/postgres-wal
EOF
# Set archive_command
archive_command = '/usr/local/bin/wal-g wal-push %p'
C. Archive with compression
# Compress before archiving
archive_command = 'gzip < %p > /mnt/wal_archive/%f.gz'
# Or with pigz (parallel gzip)
archive_command = 'pigz < %p > /mnt/wal_archive/%f.gz'
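If WAL is compressed on its way into the archive, the restore_command used later for recovery (section 4.3) must decompress it again; a matching sketch for the gzip variant above:
restore_command = 'gunzip < /mnt/wal_archive/%f.gz > %p'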
2.4. Monitor archiving
-- Check archiving status
SELECT archived_count,
failed_count,
last_archived_wal,
last_archived_time,
last_failed_wal,
last_failed_time
FROM pg_stat_archiver;
-- Example output:
-- archived_count | failed_count | last_archived_wal | last_archived_time
-- ----------------+--------------+--------------------------+-----------------------------
-- 1234 | 0 | 000000010000000000000056 | 2024-11-25 10:30:15.123456
-- If failed_count > 0, check logs!
# Check PostgreSQL logs for archive errors
sudo journalctl -u postgresql | grep -i archive
# Common errors:
# - Permission denied on archive directory
# - Archive directory full
# - Network timeout (for remote archiving)
3. Base Backup with pg_basebackup
3.1. Basic pg_basebackup
# Full backup to directory
sudo -u postgres pg_basebackup \
-D /backup/base/$(date +%Y%m%d_%H%M%S) \
-Fp \
-Xs \
-P \
-v
# Flags:
# -D: Destination directory
# -Fp: Plain format (directory)
# -Xs: Stream WAL during backup (ensures consistency)
# -P: Show progress
# -v: Verbose
Output:
pg_basebackup: initiating base backup, waiting for checkpoint to complete
pg_basebackup: checkpoint completed
pg_basebackup: write-ahead log start point: 0/6000000 on timeline 3
pg_basebackup: starting background WAL receiver
pg_basebackup: created temporary replication slot "pg_basebackup_12345"
245678/245678 kB (100%), 1/1 tablespace
pg_basebackup: write-ahead log end point: 0/6000168
pg_basebackup: syncing data to disk ...
pg_basebackup: renaming backup_manifest.tmp to backup_manifest
pg_basebackup: base backup completed
3.2. Compressed tar backup
# Backup as compressed tar
sudo -u postgres pg_basebackup \
-D /backup/tar \
-Ft \
-z \
-P \
-v
# Flags:
# -Ft: Tar format
# -z: Gzip compression
# Result:
ls -lh /backup/tar/
# base.tar.gz (main data)
# pg_wal.tar.gz (WAL files)
# backup_manifest (verification)
3.3. Backup to remote server
# Stream directly to remote server
sudo -u postgres pg_basebackup \
-D - \
-Ft \
-X fetch \
-z \
| ssh backup-server "cat > /backup/postgres-$(date +%Y%m%d).tar.gz"
# Note: when the tar is written to stdout, WAL cannot be streamed (-Xs); use -X fetch
3.4. Backup with replication slot
# Create replication slot first
sudo -u postgres psql -c "
SELECT pg_create_physical_replication_slot('backup_slot');
"
# Backup using slot
sudo -u postgres pg_basebackup \
-D /backup/base/$(date +%Y%m%d) \
-Fp \
-Xs \
-P \
-S backup_slot
# Slot ensures WAL files aren't removed during backup
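A leftover physical slot keeps WAL around indefinitely, so check it and drop it once no more backups will use it; a minimal sketch using built-in functions:
# List slots and how much WAL they are holding back
sudo -u postgres psql -c "SELECT slot_name, active, restart_lsn FROM pg_replication_slots;"
# Drop the slot when it is no longer needed
sudo -u postgres psql -c "SELECT pg_drop_replication_slot('backup_slot');"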
3.5. Verify backup
# Check backup_manifest
cat /backup/base/20241125/backup_manifest | jq
# Verify checksum
sudo -u postgres pg_verifybackup /backup/base/20241125
# Output:
# backup successfully verified
# ✅
4. Point-in-Time Recovery (PITR)
4.1. PITR concepts
PITR = Restore database to any point in time (not just backup time)
Timeline:
T0: Base backup taken
↓
T1: Transaction A committed
↓
T2: Transaction B committed
↓
T3: Transaction C committed (ERROR! Want to undo)
↓
T4: Now
With PITR, can restore to T2 (before Transaction C)
Requirements:
- Base backup (pg_basebackup)
- WAL archive from backup until target time
- Recovery target specification
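Before starting, it is worth confirming the archive actually covers the backup; a minimal sketch, assuming the plain backup from section 3.1 and the archive directory used above:
# backup_label records the first WAL segment the restore will need
grep 'START WAL LOCATION' /backup/base/20241125/backup_label
# Every segment from that file up to the recovery target must be present here
ls /mnt/wal_archive/ | wc -l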
4.2. Prepare for PITR
Create recovery directory:
# Stop PostgreSQL on target server
sudo systemctl stop postgresql
# Backup current data (safety)
sudo mv /var/lib/postgresql/18/data /var/lib/postgresql/18/data.old
# Create new data directory
sudo mkdir -p /var/lib/postgresql/18/data
sudo chown postgres:postgres /var/lib/postgresql/18/data
Restore base backup:
# From plain directory backup
sudo cp -a /backup/base/20241125/* /var/lib/postgresql/18/data/
# Or from tar backup
cd /var/lib/postgresql/18/data
sudo -u postgres tar -xzf /backup/tar/base.tar.gz
sudo -u postgres tar -xzf /backup/tar/pg_wal.tar.gz
4.3. Configure recovery
Create recovery configuration:
# PostgreSQL 12+: Use recovery.signal + postgresql.conf
# Step 1: Create recovery.signal
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal
# Step 2: Add recovery settings to postgresql.auto.conf (postgresql.conf works too)
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf <<EOF
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-11-25 10:30:00'
recovery_target_action = 'promote'
EOF
Recovery parameters:
restore_command: 'cp /mnt/wal_archive/%f %p'
# How to fetch archived WAL files
# %f = WAL filename
# %p = Destination path
recovery_target_time: '2024-11-25 10:30:00'
# Restore to this timestamp
recovery_target_action: 'promote'
# After reaching target: promote to normal operation
# Options: 'pause', 'promote', 'shutdown'
4.4. Recovery target options
A. Recover to specific time
-- In postgresql.auto.conf
recovery_target_time = '2024-11-25 10:30:00'
B. Recover to specific transaction
-- Find transaction ID
SELECT txid_current(); -- Before bad transaction
-- In postgresql.auto.conf
recovery_target_xid = '12345678'
C. Recover to specific LSN
-- In postgresql.auto.conf
recovery_target_lsn = '0/6000000'
D. Recover to latest
-- In postgresql.auto.conf
# No recovery_target_* parameter
# Will replay all available WAL
E. Recovery target inclusive/exclusive
-- Default: inclusive (stop just AFTER the target is reached)
recovery_target_inclusive = 'on'
-- Exclusive: stop just BEFORE the target
recovery_target_inclusive = 'off'
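Besides time/xid/LSN, a named restore point created before a risky change also works as a target; a sketch (the label 'before_migration' is just an example):
# Before the risky operation, on the running primary:
sudo -u postgres psql -c "SELECT pg_create_restore_point('before_migration');"
# Later, during recovery, in postgresql.auto.conf:
# recovery_target_name = 'before_migration'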
4.5. Perform recovery
# Start PostgreSQL
sudo systemctl start postgresql
# Monitor logs
sudo journalctl -u postgresql -f
Log output:
2024-11-25 11:00:00 LOG: starting PostgreSQL 18.0
2024-11-25 11:00:01 LOG: starting point-in-time recovery to 2024-11-25 10:30:00+00
2024-11-25 11:00:02 LOG: redo starts at 0/6000000
2024-11-25 11:00:03 LOG: restored log file "000000010000000000000006" from archive
2024-11-25 11:00:05 LOG: restored log file "000000010000000000000007" from archive
2024-11-25 11:00:08 LOG: recovery stopping before commit of transaction 12345678, time 2024-11-25 10:30:00
2024-11-25 11:00:09 LOG: pausing at the end of recovery
2024-11-25 11:00:09 HINT: Execute pg_wal_replay_resume() to promote.
If recovery pauses (it does with recovery_target_action = 'pause', the default; with 'promote' as configured above the server promotes by itself), resume or promote manually:
-- Check status
SELECT pg_is_in_recovery();
-- true
-- Resume (will promote if action = promote)
SELECT pg_wal_replay_resume();
-- Or promote manually
SELECT pg_promote();
Verify recovery:
-- Check database state
SELECT pg_is_in_recovery();
-- false (if promoted)
-- Verify data
SELECT * FROM critical_table WHERE created_at >= '2024-11-25 10:25:00';
-- Should see data up to recovery target time
4.6. Timeline after PITR
After PITR, timeline increments:
Original timeline: 3
After PITR: 4
The new timeline keeps WAL written after the recovery separate from WAL the old timeline produced after the recovery point, so the two histories cannot be mixed up
# Check new timeline
sudo -u postgres psql -c "
SELECT timeline_id FROM pg_control_checkpoint();
"
# timeline_id
# ------------
# 4
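The timeline switch also writes a small history file that is archived together with the WAL; seeing it in the archive confirms the new timeline is being archived (assuming the archive path used above):
ls /mnt/wal_archive/*.history
# e.g. 00000004.history after the switch to timeline 4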
5. Automation with Patroni
5.1. Patroni WAL archiving
Configure in patroni.yml:
postgresql:
  parameters:
    wal_level: replica
    archive_mode: 'on'
    archive_command: 'test ! -f /mnt/wal_archive/%f && cp %p /mnt/wal_archive/%f'
    archive_timeout: 300
  # Restore command for replicas
  recovery_conf:
    restore_command: 'cp /mnt/wal_archive/%f %p'
Patroni automatically:
- Configures archiving on primary
- Configures restore on replicas
- Handles timeline changes
5.2. Backup script
#!/bin/bash
# backup.sh - Automated PostgreSQL backup
set -e
BACKUP_DIR="/backup/postgres"
RETENTION_DAYS=7
DATE=$(date +%Y%m%d_%H%M%S)
# Create backup directory
mkdir -p "$BACKUP_DIR/$DATE"
# Run pg_basebackup
sudo -u postgres pg_basebackup \
-D "$BACKUP_DIR/$DATE" \
-Fp \
-Xs \
-P \
-v \
-c fast
# Verify backup
sudo -u postgres pg_verifybackup "$BACKUP_DIR/$DATE"
# Create metadata
cat > "$BACKUP_DIR/$DATE/backup_info.txt" <<EOF
Backup Date: $(date)
Hostname: $(hostname)
PostgreSQL Version: $(sudo -u postgres psql -t -c "SELECT version();")
Database Size: $(du -sh "$BACKUP_DIR/$DATE" | awk '{print $1}')
EOF
# Remove old backups (-mindepth 1 keeps $BACKUP_DIR itself from matching)
find "$BACKUP_DIR" -mindepth 1 -maxdepth 1 -type d -mtime +$RETENTION_DAYS -exec rm -rf {} \;
# Log
echo "$(date): Backup completed successfully: $BACKUP_DIR/$DATE" | tee -a /var/log/postgres-backup.log
# Optional: Upload to S3
# aws s3 sync "$BACKUP_DIR/$DATE" "s3://my-bucket/postgres-backups/$DATE/"
# Send notification
# curl -X POST https://hooks.slack.com/... -d '{"text":"Backup completed"}'
Schedule with cron:
# Run daily at 2 AM from root's crontab (the script itself calls sudo -u postgres)
sudo crontab -e
# Add:
0 2 * * * /usr/local/bin/backup.sh >> /var/log/postgres-backup.log 2>&1
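A systemd timer (mentioned in the best practices below) is an alternative to cron; a minimal sketch, assuming the script lives at /usr/local/bin/backup.sh — the unit names are placeholders:
sudo tee /etc/systemd/system/pg-backup.service <<'EOF'
[Unit]
Description=PostgreSQL base backup

[Service]
Type=oneshot
ExecStart=/usr/local/bin/backup.sh
EOF
sudo tee /etc/systemd/system/pg-backup.timer <<'EOF'
[Unit]
Description=Daily PostgreSQL base backup

[Timer]
OnCalendar=*-*-* 02:00:00
Persistent=true

[Install]
WantedBy=timers.target
EOF
sudo systemctl daemon-reload
sudo systemctl enable --now pg-backup.timer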
5.3. WAL-G integration
Install WAL-G:
# Download
wget https://github.com/wal-g/wal-g/releases/download/v2.0.1/wal-g-pg-ubuntu-20.04-amd64.tar.gz
tar -xzf wal-g-pg-ubuntu-20.04-amd64.tar.gz
sudo mv wal-g-pg-ubuntu-20.04-amd64 /usr/local/bin/wal-g
sudo chmod +x /usr/local/bin/wal-g
Configure:
# Create config
sudo -u postgres tee /var/lib/postgresql/.walrc <<EOF
AWS_ACCESS_KEY_ID=your_key
AWS_SECRET_ACCESS_KEY=your_secret
AWS_REGION=us-east-1
WALG_S3_PREFIX=s3://my-bucket/postgres
WALG_COMPRESSION_METHOD=lz4
WALG_DELTA_MAX_STEPS=6
EOF
# Update postgresql.conf
archive_command = '/usr/local/bin/wal-g wal-push %p'
restore_command = '/usr/local/bin/wal-g wal-fetch %f %p'
Take backup with WAL-G:
# Full backup
sudo -u postgres wal-g backup-push /var/lib/postgresql/18/data
# List backups
sudo -u postgres wal-g backup-list
# name modified wal_segment_backup_start
# base_000000010000000000000004 2024-11-25T10:00:00Z 000000010000000000000004
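WAL-G can also enforce retention in object storage; a hedged sketch — it deletes data, so check wal-g delete --help for your version before running it:
# Keep the last 7 full backups, delete older backups and the WAL they need
sudo -u postgres wal-g delete retain FULL 7 --confirm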
Restore with WAL-G:
# Stop PostgreSQL
sudo systemctl stop postgresql
# Clear data directory
sudo rm -rf /var/lib/postgresql/18/data/*
# Restore latest backup
sudo -u postgres wal-g backup-fetch /var/lib/postgresql/18/data LATEST
# Or restore specific backup
sudo -u postgres wal-g backup-fetch /var/lib/postgresql/18/data base_000000010000000000000004
# Configure PITR (if needed)
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal
echo "recovery_target_time = '2024-11-25 10:30:00'" | \
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf
# Start PostgreSQL
sudo systemctl start postgresql
6. Disaster Recovery Planning
6.1. DR strategy
3-2-1 Rule:
3: Keep 3 copies of data
2: Store on 2 different media types
1: Keep 1 copy off-site
Example:
- Production database (live)
- Local backup (same datacenter)
- S3 backup (cloud, different region)
6.2. DR checklist
Preparation:
✅ WAL archiving enabled and tested
✅ Regular base backups (daily/weekly)
✅ Off-site backup storage (S3, remote server)
✅ Backup verification automated
✅ Restore procedures documented
✅ DR drills scheduled (quarterly)
✅ Monitoring and alerting configured
✅ Backup retention policy defined
✅ Encryption for backups (at rest and in transit)
✅ Access controls (who can restore)
6.3. DR scenarios and procedures
Scenario 1: Database corruption
# 1. Identify corruption
sudo -u postgres psql -c "SELECT * FROM corrupt_table;"   # fails with a read/checksum error
# 2. Stop PostgreSQL
sudo systemctl stop postgresql
# 3. Restore from last good backup
sudo rm -rf /var/lib/postgresql/18/data
sudo cp -a /backup/base/20241125 /var/lib/postgresql/18/data
# 4. Configure PITR to just before corruption (restore_command is needed to pull archived WAL)
sudo -u postgres tee -a /var/lib/postgresql/18/data/postgresql.auto.conf <<EOF
restore_command = 'cp /mnt/wal_archive/%f %p'
recovery_target_time = '2024-11-25 09:55:00'
EOF
sudo -u postgres touch /var/lib/postgresql/18/data/recovery.signal
# 5. Start and verify
sudo systemctl start postgresql
Scenario 2: Datacenter failure
# 1. Provision new servers in different region
# 2. Install PostgreSQL + Patroni
# 3. Restore from off-site backup (S3)
aws s3 sync s3://my-bucket/postgres-backups/20241125 /backup/restore/
sudo cp -a /backup/restore /var/lib/postgresql/18/data
sudo chown -R postgres:postgres /var/lib/postgresql/18/data
sudo chmod 700 /var/lib/postgresql/18/data
# 4. Restore WAL from S3
# (configure restore_command with wal-g or S3)
# 5. Start cluster
sudo systemctl start patroni
# 6. Update DNS/Load balancer to new region
# RTO: ~1-2 hours (depends on backup size and network)
Scenario 3: Accidental data deletion
# Oops: DELETE FROM users WHERE ...; (without WHERE clause)
# Option A: PITR to before deletion
# (See section 4.4)
# Option B: Restore to separate instance and copy data
sudo -u postgres pg_basebackup -D /tmp/restore ...
# Start on different port
# Copy missing data to production
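A sketch of Option B, assuming the copy was restored and fully recovered under /tmp/restore, that port 5433 is free, and that the lost rows were in table users of database mydb (all placeholder names):
# Start the restored copy next to production, without archiving its WAL
sudo -u postgres pg_ctl -D /tmp/restore -o "-p 5433 -c archive_mode=off" start
# Dump only the affected table's data from the restored copy
sudo -u postgres pg_dump -p 5433 --data-only -t users mydb > /tmp/users_data.sql
# Review the file, then load it back into production (handle duplicate keys first)
sudo -u postgres psql -p 5432 -d mydb -f /tmp/users_data.sql
# Stop the temporary instance when done
sudo -u postgres pg_ctl -D /tmp/restore stop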
6.4. Recovery metrics
RTO (Recovery Time Objective):
Target: < 2 hours for full DR
Breakdown:
- Detection: 5-10 min
- Decision: 10-15 min
- Restore base backup: 30-60 min
- Replay WAL: 10-30 min
- Verification: 10-20 min
- DNS/Traffic switch: 5-10 min
Total: ~70-145 min
RPO (Recovery Point Objective):
Target: < 5 minutes data loss
With synchronous replication: 0 (zero data loss)
With WAL archiving: < archive_timeout (e.g., 5 min)
With daily backup only: Up to 24 hours ❌
7. Monitoring and Alerting
7.1. Key backup metrics
-- Archive status
SELECT archived_count,
failed_count,
EXTRACT(EPOCH FROM (now() - last_archived_time)) AS seconds_since_last_archive
FROM pg_stat_archiver;
-- Alert if seconds_since_last_archive > 600 (10 min)
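A minimal cron-able version of that rule (600-second threshold assumed); it exits non-zero so a scheduler or monitoring wrapper can alert on it:
STALE=$(sudo -u postgres psql -Atc "
SELECT (failed_count > 0)
OR (COALESCE(EXTRACT(EPOCH FROM (now() - last_archived_time)), 1e9) > 600)
FROM pg_stat_archiver;")
[ "$STALE" = "f" ] || { echo "WAL archiving unhealthy"; exit 1; }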
# Backup age
find /backup/base -maxdepth 1 -type d -name "202*" -mtime -1 | wc -l
# Should be >= 1 (at least one backup in last 24h)
7.2. Prometheus metrics
# Alert rules
groups:
  - name: backup_alerts
    rules:
      - alert: PostgreSQLArchivingFailed
        expr: pg_stat_archiver_failed_count > 0
        labels:
          severity: critical
        annotations:
          summary: "WAL archiving failures detected"
      - alert: PostgreSQLNoRecentBackup
        expr: time() - pg_backup_last_success_timestamp > 86400
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "No backup in last 24 hours"
      - alert: PostgreSQLArchiveDelayHigh
        expr: |
          time() - pg_stat_archiver_last_archived_time > 600
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "WAL archiving delayed > 10 min"
7.3. Backup verification
#!/bin/bash
# verify-backup.sh
BACKUP_DIR="/backup/base"
LATEST_BACKUP=$(ls -td $BACKUP_DIR/*/ | head -1)
echo "Verifying: $LATEST_BACKUP"
# 1. Check backup_manifest exists
if [ ! -f "$LATEST_BACKUP/backup_manifest" ]; then
echo "❌ backup_manifest missing"
exit 1
fi
# 2. Run pg_verifybackup
if sudo -u postgres pg_verifybackup "$LATEST_BACKUP" > /dev/null 2>&1; then
echo "✅ Backup verified successfully"
else
echo "❌ Backup verification failed"
exit 1
fi
# 3. Check size (should be reasonable)
SIZE=$(du -sb "$LATEST_BACKUP" | awk '{print $1}')
MIN_SIZE=1000000000 # 1GB minimum
if [ "$SIZE" -lt "$MIN_SIZE" ]; then
echo "⚠️ Backup size suspicious: $(du -sh "$LATEST_BACKUP" | awk '{print $1}')"
exit 1
fi
# 4. Test restore to temp location (optional, resource-intensive)
# ...
echo "✅ All backup checks passed"
8. Best Practices
✅ DO
- Enable WAL archiving - Required for PITR
- Automate backups - Daily pg_basebackup via cron/systemd timer
- Test restores regularly - Monthly DR drills
- Monitor archiving - Alert on failures
- Keep multiple backup generations - 7 daily + 4 weekly + 12 monthly
- Off-site backups - S3, different region/datacenter
- Encrypt backups - At rest and in transit
- Document procedures - Runbooks for restoration
- Verify backups - pg_verifybackup after each backup
- Calculate RTO/RPO - Know your limits
❌ DON'T
- Don't skip testing - Untested backup = no backup
- Don't store only locally - Datacenter failure = data loss
- Don't ignore archive failures - Silent data loss risk
- Don't delete WAL too early - Need for PITR
- Don't forget retention - Storage costs vs recovery needs
- Don't backup to same disk - Disk failure = everything lost
- Don't skip encryption - Security/compliance risk
- Don't assume it works - Verify, verify, verify
9. Lab Exercises
Lab 1: Setup WAL archiving
Tasks:
- Configure archive_mode and archive_command
- Create archive directory
- Force WAL switch: SELECT pg_switch_wal();
- Verify WAL files in archive directory
- Monitor pg_stat_archiver
Lab 2: Take base backup
Tasks:
- Use pg_basebackup to create backup
- Verify with pg_verifybackup
- Calculate backup size and time
- Compress backup and compare size
- Document backup metadata
Lab 3: Perform PITR
Tasks:
- Create test table with timestamps
- Take base backup
- Insert more data
- Note specific timestamp
- Insert "bad" data after timestamp
- Restore to timestamp (before bad data)
- Verify recovery point is correct
Lab 4: Automate backup
Tasks:
- Write backup script with retention
- Add error handling and notifications
- Schedule with cron
- Test script execution
- Monitor backup logs
Lab 5: DR drill
Tasks:
- Simulate total database loss (remove data directory)
- Restore from backup
- Replay WAL to latest point
- Measure RTO (time to restore)
- Verify data integrity
- Document lessons learned
10. Troubleshooting
Issue: Archiving not working
Symptoms: failed_count > 0 in pg_stat_archiver
Check:
# Check archive_command manually
sudo -u postgres bash -c 'f=000000010000000000000001; p=/var/lib/postgresql/18/data/pg_wal/$f; test ! -f /mnt/wal_archive/$f && cp $p /mnt/wal_archive/$f'
echo $? # Should be 0
Common causes:
- Permission denied on archive directory
- Archive directory doesn't exist
- Disk full
- Network issue (if remote archiving)
Issue: pg_basebackup slow
Optimize:
# Use compression
pg_basebackup -z ...
# Start the checkpoint immediately instead of waiting for a scheduled one
pg_basebackup -c fast ...
# pg_basebackup itself is single-threaded; for parallel backups consider
# a dedicated tool such as pgBackRest or WAL-G
# Use faster storage for backup destination
Issue: PITR fails - WAL not found
Error: could not open file "000000010000000000000007": No such file or directory
Cause: WAL file not in archive or restore_command wrong
Fix:
# Check WAL exists in archive
ls -lh /mnt/wal_archive/000000010000000000000007
# Test restore_command manually
sudo -u postgres cp /mnt/wal_archive/000000010000000000000007 /tmp/test_wal
# If missing, cannot recover to that point
# Restore to earlier point or use latest available WAL
11. Summary
Backup Strategy Summary
| Method | RPO | RTO | Complexity | Use Case |
|---|---|---|---|---|
| pg_dump | Hours/Days | Hours | Low | Small DBs, migration |
| pg_basebackup | Last backup | 30-120 min | Medium | Regular backups |
| WAL archiving | Minutes | 30-120 min | Medium | PITR capability |
| Replication | Seconds/0 | 30-60 sec | High | HA, zero data loss |
Key Concepts
✅ WAL Archiving - Continuous backup of transaction logs
✅ pg_basebackup - Physical backup of entire cluster
✅ PITR - Restore to any point in time using base backup + WAL
✅ RTO - How fast can you recover
✅ RPO - How much data loss is acceptable
✅ 3-2-1 Rule - 3 copies, 2 media types, 1 off-site
Recovery Checklist
- Stop PostgreSQL
- Restore base backup
- Create recovery.signal
- Configure restore_command
- Set recovery target (time/xid/lsn)
- Start PostgreSQL
- Monitor recovery logs
- Verify recovery point
- Promote if satisfied
- Update replication if needed
Next Steps
Lesson 17 will cover Monitoring and Observability:
- Prometheus + Grafana setup
- Key PostgreSQL metrics
- Patroni monitoring
- Log aggregation
- Alerting strategies
- Performance dashboards