Lesson 24: Upgrade Strategies
Upgrading PostgreSQL across major versions, upgrading Patroni, zero-downtime upgrade techniques, rollback procedures, and a hands-on lab upgrading PG 17 to 18.
Objectives
After this lesson, you will be able to:
- Plan and execute PostgreSQL major version upgrades
- Upgrade Patroni with zero downtime
- Use pg_upgrade for in-place upgrades
- Implement logical replication for upgrades
- Rollback failed upgrades safely
1. Upgrade Planning
1.1. Pre-upgrade checklist
☐ Review PostgreSQL release notes
☐ Check extension compatibility
☐ Test upgrade in staging environment
☐ Backup all data (full + WAL archive)
☐ Document current versions
☐ Schedule maintenance window
☐ Notify stakeholders
☐ Prepare rollback plan
☐ Verify disk space (copy-mode pg_upgrade needs roughly 2x the current data size; --link needs far less - see the check after this list)
☐ Check for deprecated features in new version
☐ Update monitoring/alerting
☐ Prepare downtime communication
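A quick sanity check for the disk-space item above - a minimal sketch, assuming the v17 data directory lives at /var/lib/postgresql/17/data and the default copy mode of pg_upgrade (adjust paths to your layout):
# Compare current data size against free space on the same filesystem
PGDATA=/var/lib/postgresql/17/data
data_kb=$(sudo du -sk "$PGDATA" | awk '{print $1}')
free_kb=$(df -k --output=avail "$PGDATA" | tail -1)
echo "Data: $((data_kb / 1024)) MB, free: $((free_kb / 1024)) MB"
if [ "$free_kb" -lt "$((data_kb * 2))" ]; then
  echo "WARNING: less than 2x data size free - consider pg_upgrade --link"
fi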
1.2. Version compatibility matrix
| From → To | Method | Downtime | Risk |
|---|---|---|---|
| 17 → 18 | pg_upgrade | Minutes | Low |
| 15 → 18 | pg_upgrade | Minutes | Medium |
| 12 → 18 | Logical replication | None | Medium |
| 9.6 → 18 | Dump/restore | Hours | High |
1.3. Document current state
# PostgreSQL version
psql -c "SELECT version();"
# Installed extensions
psql -c "\dx"
# Database sizes
psql -c "SELECT datname, pg_size_pretty(pg_database_size(datname)) FROM pg_database;"
# Patroni version
patronictl version
# etcd version
etcdctl version
2. PostgreSQL Minor Version Upgrade
2.1. Minor upgrade process (e.g., 18.0 → 18.1)
# Minor upgrades are easy - just update packages
# On each node (one at a time):
# 1. Update packages
sudo apt-get update
sudo apt-get install --only-upgrade postgresql-18
# 2. Restart Patroni (will restart PostgreSQL)
sudo systemctl restart patroni
# 3. Verify new version
psql -c "SELECT version();"
# Patroni handles failover automatically during restart
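Restarting the Patroni service restarts the daemon itself as well. If only PostgreSQL needs restarting to pick up the new binaries, Patroni can do it in a controlled way - a sketch, assuming the cluster is named postgres-cluster:
# Restart just PostgreSQL on one member, leaving the Patroni daemon running
patronictl restart postgres-cluster node2
# Or schedule the restart for a quiet window
patronictl restart postgres-cluster node2 --scheduled "2025-01-15T02:00"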
2.2. Rolling minor upgrade
# Upgrade replicas first
for node in node2 node3; do
  echo "Upgrading $node..."
  ssh $node "sudo apt-get update && sudo apt-get install -y --only-upgrade postgresql-18"
  ssh $node "sudo systemctl restart patroni"
  sleep 30  # crude wait for catch-up - better: poll replication lag (see the sketch below)
done
# Switchover to upgraded replica
patronictl switchover postgres-cluster --leader node1 --candidate node2
# Upgrade old leader (now replica)
ssh node1 "sudo apt-get update && sudo apt-get install -y --only-upgrade postgresql-18"
ssh node1 "sudo systemctl restart patroni"
3. PostgreSQL Major Version Upgrade with pg_upgrade
3.1. Architecture
Before (PostgreSQL 17):
node1 (17, Leader)
node2 (17, Replica)
node3 (17, Replica)
During upgrade (short full-cluster outage):
node1 (17 → 18) ← pg_upgrade runs on the leader
node2 (stopped) ← will be rebuilt from node1
node3 (stopped) ← will be rebuilt from node1
After upgrade:
node1 (18, Leader)
node2 (18, Replica) ← reinitialized via pg_basebackup
node3 (18, Replica) ← reinitialized via pg_basebackup
Note: a v18 node cannot stream from a v17 primary, so pg_upgrade cannot be run as a rolling, node-by-node upgrade. Plan a short maintenance window; for true zero downtime use logical replication (section 4).
3.2. Install new PostgreSQL version
# Install PostgreSQL 18 alongside 17
sudo apt-get install -y postgresql-18 postgresql-18-contrib
# Both versions now installed:
# /usr/lib/postgresql/17/
# /usr/lib/postgresql/18/
3.3. Prepare for upgrade (on the leader, with the whole cluster stopped)
# 1. Stop Patroni on ALL nodes - replicas first, leader (node1) last
ssh node2 "sudo systemctl stop patroni"
ssh node3 "sudo systemctl stop patroni"
sudo systemctl stop patroni   # on node1
# 2. Create new data directory for v18
sudo mkdir -p /var/lib/postgresql/18/data
sudo chown postgres:postgres /var/lib/postgresql/18/data
sudo chmod 700 /var/lib/postgresql/18/data
# 3. Initialize new cluster
#    (settings must match the old cluster: encoding, locale, checksums)
sudo -u postgres /usr/lib/postgresql/18/bin/initdb \
  -D /var/lib/postgresql/18/data \
  --encoding=UTF8 \
  --data-checksums
# 4. Run pg_upgrade
sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
  --old-datadir=/var/lib/postgresql/17/data \
  --new-datadir=/var/lib/postgresql/18/data \
  --old-bindir=/usr/lib/postgresql/17/bin \
  --new-bindir=/usr/lib/postgresql/18/bin \
  --check   # Dry run first!
# If the check passes, run the actual upgrade:
sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade \
  --old-datadir=/var/lib/postgresql/17/data \
  --new-datadir=/var/lib/postgresql/18/data \
  --old-bindir=/usr/lib/postgresql/17/bin \
  --new-bindir=/usr/lib/postgresql/18/bin \
  --link   # Hard links (faster, but the old cluster becomes unusable once v18 starts)
# Expected output:
# Performing Consistency Checks
# -----------------------------
# ...
# Upgrade Complete
# ----------------
3.4. Update Patroni configuration for v18
# The upgraded data directory has a new system identifier, so the old
# cluster state must be wiped from the DCS before Patroni will manage it
patronictl remove postgres-cluster
# /etc/patroni/patroni.yml (update on ALL nodes)
postgresql:
  bin_dir: /usr/lib/postgresql/18/bin    # Changed from 17
  data_dir: /var/lib/postgresql/18/data  # Changed from 17
  # ... rest of config
# Start Patroni on node1 - it bootstraps a new cluster around the upgraded data
sudo systemctl start patroni
# Verify node1 is now running v18
psql -h node1 -U postgres -c "SELECT version();"
3.5. Reinitialize replicas
# Replicas are rebuilt from the upgraded leader (pg_upgrade's rsync
# method for standbys exists but is fiddly to combine with Patroni).
# On node2 and node3: point patroni.yml at the v18 binaries and an empty
# v18 data directory, then start Patroni - it runs pg_basebackup from node1
ssh node2 "sudo systemctl start patroni"
ssh node3 "sudo systemctl start patroni"
# If a node still holds stale data, force a rebuild:
patronictl reinit postgres-cluster node2
# All nodes now on v18!
patronictl list
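A quick verification pass once all members are back - a sketch assuming passwordless psql access to each node:
# Confirm every member runs the new major version
for n in node1 node2 node3; do
  echo -n "$n: "
  psql -h "$n" -U postgres -Atc "SHOW server_version;"
done
# Replication status as seen from the leader
psql -h node1 -U postgres -c "SELECT application_name, state, sync_state, pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn)) AS lag FROM pg_stat_replication;"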
3.6. Post-upgrade tasks
# Rebuild optimizer statistics - pg_upgrade does not carry them over
# (since PostgreSQL 14 it no longer generates analyze_new_cluster.sh)
sudo -u postgres vacuumdb --all --analyze-in-stages
# Update extensions (pg_upgrade may emit update_extensions.sql listing them)
psql -c "ALTER EXTENSION pg_stat_statements UPDATE;"
# Remove old cluster (only after verifying everything works!)
# sudo -u postgres ./delete_old_cluster.sh
4. Zero-Downtime Upgrade with Logical Replication
4.1. Architecture
Production (v17):
node1 (v17, Leader) ← Serving traffic
↓ Logical replication
New cluster (v18):
node4 (v18, Leader) ← Receiving changes
node5 (v18, Replica)
node6 (v18, Replica)
After cutover:
Application → node4 (v18) ← New primary
4.2. Setup new v18 cluster
# Install PostgreSQL 18 on new servers
# Setup Patroni cluster (node4, node5, node6)
# See previous lessons for installation
# Verify new cluster
patronictl -c /etc/patroni/patroni-v18.yml list
4.3. Create publication on v17 (source)
-- On node1 (v17 leader)
CREATE PUBLICATION pg17_to_pg18 FOR ALL TABLES;
-- Or specific tables:
-- CREATE PUBLICATION pg17_to_pg18 FOR TABLE users, orders, products;
-- Verify
SELECT * FROM pg_publication;
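Logical replication carries row changes only - it does not copy DDL - so the schema must already exist on the target before the subscription in the next step is created. A minimal sketch, assuming the database is named myapp:
# Copy schema (tables, indexes, constraints) from v17 to the empty v18 cluster
pg_dump -h node1 -U postgres --schema-only myapp | psql -h node4 -U postgres -d myapp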
4.4. Create subscription on v18 (target)
-- On node4 (v18 leader)
CREATE SUBSCRIPTION pg18_from_pg17
CONNECTION 'host=node1 port=5432 dbname=myapp user=replicator password=rep_pass'
PUBLICATION pg17_to_pg18
WITH (copy_data = true, create_slot = true);
-- Monitor initial sync
SELECT * FROM pg_stat_subscription;
-- Wait for initial data copy to complete
-- subname | pg18_from_pg17
-- received_lsn | 0/3000000
-- ...
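The initial copy can also be tracked per table via the pg_subscription_rel catalog (states: i = initializing, d = copying data, f = finished copy, s = synchronized, r = ready/streaming):
# On v18 (target): all tables should eventually reach state 'r'
psql -h node4 -U postgres -d myapp -c "SELECT srrelid::regclass AS table_name, srsubstate AS state FROM pg_subscription_rel ORDER BY srsubstate;"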
4.5. Monitor replication lag
-- On v17 (source): WAL bytes the subscriber has not yet confirmed
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots
WHERE slot_name = 'pg18_from_pg17';
-- On v18 (target): worker status and last-message timestamps
SELECT subname, received_lsn, latest_end_lsn, latest_end_time
FROM pg_stat_subscription;
4.6. Cutover procedure
# 1. Stop writes to v17 (put app in maintenance mode)
# Or make new transactions read-only by default:
psql -h node1 -U postgres -c "ALTER SYSTEM SET default_transaction_read_only = on;"
psql -h node1 -U postgres -c "SELECT pg_reload_conf();"
# Note: transactions already in flight can still commit - terminate
# lingering write sessions if a strict cutoff matters
# 2. Wait for replication to catch up - check on the SOURCE:
psql -h node1 -U postgres -c "SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) FROM pg_replication_slots WHERE slot_name = 'pg18_from_pg17';"
# Should be 0 bytes
# 3. Disable subscription on v18
psql -h node4 -U postgres -c "ALTER SUBSCRIPTION pg18_from_pg17 DISABLE;"
# 4. Drop subscription (optional, after confirming everything works)
# psql -h node4 -U postgres -c "DROP SUBSCRIPTION pg18_from_pg17;"
# 5. Update application connection strings to point to node4
# 6. Enable writes on v18
psql -h node4 -U postgres -c "ALTER SYSTEM SET default_transaction_read_only = off;"
psql -h node4 -U postgres -c "SELECT pg_reload_conf();"
# 7. Verify application works on v18
# 8. Keep v17 cluster running for rollback (1-2 weeks)
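One cutover step worth calling out: logical replication does not replicate sequence values. After writes stop (step 1) and lag reaches zero (step 2), copy the sequence positions across - a minimal sketch, assuming database myapp; it generates setval() calls on v17 and replays them on v18:
# Carry sequence positions from v17 over to v18
psql -h node1 -U postgres -d myapp -At -c "SELECT format('SELECT setval(%L, %s);', schemaname || '.' || sequencename, last_value) FROM pg_sequences WHERE last_value IS NOT NULL" | psql -h node4 -U postgres -d myapp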
5. Patroni Version Upgrade
5.1. Check compatibility
# Check current Patroni version
patronictl version
# Check PostgreSQL compatibility - each Patroni release documents the
# PostgreSQL versions it supports; confirm your target version is listed
# See: https://github.com/zalando/patroni/releases
5.2. Upgrade Patroni (Python package)
# On each node:
# 1. Upgrade via pip (quote the extras so the shell doesn't expand the brackets)
sudo pip3 install --upgrade 'patroni[etcd]'
# Or pin a specific version:
# sudo pip3 install 'patroni[etcd]==3.2.2'
# 2. Verify new version
patronictl version
# 3. Restart Patroni service
sudo systemctl restart patroni
# No downtime - Patroni handles failover automatically
5.3. Rolling Patroni upgrade
# Upgrade replicas first
for node in node2 node3; do
  echo "Upgrading Patroni on $node..."
  ssh $node "sudo pip3 install --upgrade 'patroni[etcd]'"
  ssh $node "sudo systemctl restart patroni"
  sleep 10
done
# Switchover to upgraded replica
patronictl switchover postgres-cluster --leader node1 --candidate node2
# Upgrade old leader
ssh node1 "sudo pip3 install --upgrade patroni[etcd]"
ssh node1 "sudo systemctl restart patroni"
6. etcd Upgrade
6.1. etcd minor upgrade
# On each etcd node:
sudo systemctl stop etcd
sudo apt-get update
sudo apt-get install --only-upgrade etcd
sudo systemctl start etcd
# Verify cluster health
etcdctl endpoint health
6.2. etcd major upgrade (e.g., 3.4 → 3.5)
# Follow official etcd upgrade guide
# https://etcd.io/docs/latest/upgrades/
# Key steps:
# 1. Backup etcd data
etcdctl snapshot save backup.db
# 2. Upgrade one member at a time
# 3. Verify cluster health after each upgrade
# 4. Update Patroni to use new etcd API version
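Two checks that help during the member-by-member upgrade - a sketch; on etcd 3.5+ the snapshot subcommand is deprecated in etcdctl in favor of etcdutl:
# Verify the snapshot is readable before touching any member
etcdctl snapshot status backup.db -w table
# Watch per-member server versions while upgrading one member at a time
etcdctl endpoint status --cluster -w table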
7. Rollback Strategies
7.1. Rollback pg_upgrade
# Rollback is only possible while the old cluster is still intact. If you
# used --link AND have already started the v18 cluster, the old data
# directory is no longer safe to start - restore from backup instead.
# 1. Stop Patroni
sudo systemctl stop patroni
# 2. Restore old configuration
# /etc/patroni/patroni.yml
postgresql:
  bin_dir: /usr/lib/postgresql/17/bin    # Back to 17
  data_dir: /var/lib/postgresql/17/data
# 3. Wipe the v18 cluster state from the DCS, then start Patroni
patronictl remove postgres-cluster
sudo systemctl start patroni
# Old cluster resumes operation
7.2. Rollback logical replication
# If cutover to v18 fails, rollback to v17
# 1. Stop application
# 2. Set v17 to read-write
psql -h node1 -U postgres -c "ALTER SYSTEM SET default_transaction_read_only = off;"
psql -h node1 -U postgres -c "SELECT pg_reload_conf();"
# 3. Update application connection strings to v17
# 4. Resume normal operations
# Note: Any writes to v18 during cutover will be LOST!
# Consider setting up reverse replication v18 → v17 if needed
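The reverse-replication idea above, sketched out. Set it up during cutover, before writes are enabled on v18, so any v18 writes flow back to v17 and a rollback loses nothing (names follow the earlier examples):
# On v18 (new primary): publish all tables
psql -h node4 -U postgres -d myapp -c "CREATE PUBLICATION pg18_to_pg17 FOR ALL TABLES;"
# On v17 (old primary): subscribe WITHOUT initial copy - data already matches
psql -h node1 -U postgres -d myapp -c "CREATE SUBSCRIPTION pg17_from_pg18 CONNECTION 'host=node4 port=5432 dbname=myapp user=replicator password=rep_pass' PUBLICATION pg18_to_pg17 WITH (copy_data = false);"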
7.3. Rollback Patroni upgrade
# Downgrade Patroni if upgrade causes issues
sudo pip3 install 'patroni[etcd]==3.1.2'  # Previous version
sudo systemctl restart patroni
8. Testing Upgrades
8.1. Staging environment test
# 1. Clone production to staging
pg_basebackup -h prod-leader -D /var/lib/postgresql/staging -X stream
# 2. Perform upgrade in staging
# ... follow upgrade procedures ...
# 3. Run application tests
# ... smoke tests, integration tests ...
# 4. Benchmark performance
pgbench -i -s 100 myapp
pgbench -c 10 -j 2 -t 1000 myapp
# 5. Document issues and timings
8.2. Upgrade rehearsal
# Practice upgrade multiple times
# Time each step
# Identify bottlenecks
# Refine procedures
# Example timing log:
# Step 1: Stop Patroni - 5 seconds
# Step 2: pg_upgrade --check - 30 seconds
# Step 3: pg_upgrade - 10 minutes
# Step 4: Start Patroni - 15 seconds
# Step 5: Replication catchup - 2 minutes
# Total: ~13 minutes
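A tiny helper for capturing those timings during rehearsal - a sketch, nothing more:
# Log the wall-clock duration of each rehearsal step
time_step() {
  local label=$1; shift
  local start=$(date +%s)
  "$@"
  echo "$label: $(( $(date +%s) - start ))s" | tee -a upgrade_timings.log
}
time_step "Stop Patroni" sudo systemctl stop patroni
time_step "pg_upgrade check" sudo -u postgres /usr/lib/postgresql/18/bin/pg_upgrade --old-datadir=/var/lib/postgresql/17/data --new-datadir=/var/lib/postgresql/18/data --old-bindir=/usr/lib/postgresql/17/bin --new-bindir=/usr/lib/postgresql/18/bin --check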
9. Best Practices
✅ DO
- Test in staging first - Multiple times
- Backup everything - Full backup + WAL archive
- Use pg_upgrade --check - Catch issues early
- Document procedures - Step-by-step runbook
- Schedule maintenance window - Off-peak hours
- Monitor closely - During and after upgrade
- Keep old version - Don't delete for 1-2 weeks
- Use logical replication - For zero-downtime
- Upgrade extensions - After PostgreSQL upgrade
- Vacuum analyze - After major upgrade
❌ DON'T
- Don't skip backups - Critical safety net
- Don't upgrade all at once - Rolling upgrades
- Don't delete old cluster immediately - Keep for rollback
- Don't ignore release notes - Breaking changes
- Don't skip testing - Staging is essential
- Don't upgrade during peak hours - Plan maintenance window
- Don't forget about extensions - May need updates
10. Lab Exercises
Lab 1: Minor version upgrade
Tasks:
- Check current PostgreSQL version
- Update packages on replica
- Restart Patroni on replica
- Switchover to upgraded replica
- Upgrade old leader
Lab 2: Major version upgrade with pg_upgrade
Tasks:
- Install PostgreSQL 18 on all nodes
- Stop Patroni on all nodes
- Run pg_upgrade --check, then pg_upgrade, on the leader
- Wipe the DCS state (patronictl remove) and update the Patroni configuration
- Start the leader, reinitialize replicas, verify with patronictl list
Lab 3: Zero-downtime upgrade with logical replication
Tasks:
- Setup new v18 cluster
- Create publication on v17
- Create subscription on v18
- Monitor replication lag
- Perform cutover
- Verify application functionality
Lab 4: Rollback procedure
Tasks:
- Simulate failed upgrade
- Stop Patroni on all nodes
- Restore old configuration
- Restart old cluster
- Verify rollback successful
11. Summary
Upgrade Methods Comparison
| Method | Downtime | Complexity | Risk | Use Case |
|---|---|---|---|---|
| pg_upgrade (copy) | Minutes | Low | Low | Standard major upgrades |
| pg_upgrade --link | Seconds to minutes | Medium | Medium | Large data, same server |
| Logical replication | None | High | Medium | Zero-downtime required |
| Dump/restore | Hours | Low | Low | Very old versions |
Typical Timeline
Week 1-2: Planning and preparation
- Review release notes
- Test in staging
- Document procedures
Week 3: Dress rehearsal
- Full upgrade test in staging
- Time each step
- Identify issues
Week 4: Production upgrade
- Execute during maintenance window
- Monitor closely
- Be ready to rollback
Week 5-6: Stabilization
- Monitor for issues
- Performance tuning
- Keep old cluster for safety
Week 7+: Cleanup
- Delete old cluster
- Update documentation
- Post-mortem review
Upgrade Checklist
Pre-upgrade:
☐ Full backup completed
☐ Staging test successful
☐ Release notes reviewed
☐ Maintenance window scheduled
☐ Rollback plan documented
☐ Stakeholders notified
During upgrade:
☐ Backups verified
☐ pg_upgrade --check passed
☐ Upgrade completed
☐ Replication working
☐ Application connectivity verified
Post-upgrade:
☐ Extensions updated
☐ Vacuum analyze completed
☐ Performance validated
☐ Monitoring updated
☐ Documentation updated
Next Steps
Lesson 25 will cover Real-world Case Studies:
- Production architecture examples
- Scaling to 1000+ queries/second
- Cost optimization techniques
- Lessons learned from failures
- Industry-specific implementations