Lesson 29: Deploy a Production-Ready Cluster
Deploy a complete cluster from scratch, create documentation and runbooks, carry out knowledge transfer, and complete the end-of-course assessment.
Objectives
After this lesson, you will be able to:
- Deploy complete production-ready PostgreSQL HA cluster
- Implement all best practices learned in this course
- Create comprehensive operational documentation
- Perform final validation and handoff
- Complete capstone assessment
1. Pre-Deployment Checklist
1.1. Infrastructure readiness
☐ Hardware/VMs provisioned
  ☐ 3+ PostgreSQL nodes (Leader + 2 Replicas minimum)
  ☐ 3 etcd nodes (can co-locate with PostgreSQL)
  ☐ 2 HAProxy/Load balancer nodes
  ☐ 1 Monitoring server (Prometheus + Grafana)
  ☐ 1 Bastion host (for secure access)
☐ Network configuration
  ☐ VPC/VLAN created with appropriate CIDR
  ☐ Subnets configured (public + private)
  ☐ Security groups/firewall rules defined
  ☐ NAT gateway for internet access
  ☐ VPN for remote access (optional)
☐ Storage provisioned
  ☐ Data volumes (SSD, appropriate IOPS)
  ☐ WAL archive storage (S3/NFS)
  ☐ Backup storage (S3/GCS/tape)
  ☐ Log storage (centralized logging)
☐ DNS configuration
  ☐ postgres-master.example.com → HAProxy master
  ☐ postgres-replica.example.com → HAProxy replicas
  ☐ postgres-admin.example.com → Direct access (VPN only)
☐ Security
  ☐ SSL certificates generated and installed
  ☐ SSH keys distributed
  ☐ Secrets management (Vault/AWS Secrets Manager)
  ☐ Audit logging configured
☐ Monitoring
  ☐ Prometheus installed and configured
  ☐ Grafana dashboards imported
  ☐ Alert rules configured
  ☐ PagerDuty/Slack integration tested
☐ Documentation
  ☐ Architecture diagrams updated
  ☐ Runbooks created
  ☐ Contact list (on-call rotation)
  ☐ Escalation procedures
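All of the deployment phases below drive Ansible against a static inventory (`inventory.ini`). A minimal sketch of what that file might contain, assuming the node names and 10.0.1.x addresses used throughout this lesson (the load balancer and monitoring addresses, and the SSH user, are placeholders):

```ini
; inventory.ini - illustrative layout only; adjust hosts, IPs, and the SSH user to your environment
[postgresql]
pg-node1 ansible_host=10.0.1.11
pg-node2 ansible_host=10.0.1.12
pg-node3 ansible_host=10.0.1.13

; etcd is co-located with the PostgreSQL nodes in this design
[etcd:children]
postgresql

[haproxy]
haproxy1 ansible_host=10.0.1.21
haproxy2 ansible_host=10.0.1.22

[monitoring]
monitor1 ansible_host=10.0.1.31

[all:vars]
ansible_user=deploy
ansible_become=true
```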
2. Step-by-Step Deployment
2.1. Phase 1: Base system setup (Day 1)
#!/bin/bash
# deploy_phase1.sh - Base system setup
set -e
NODES=("pg-node1" "pg-node2" "pg-node3")
echo "=== Phase 1: Base System Setup ==="
for node in "${NODES[@]}"; do
echo "Configuring $node..."
# Update system
ssh $node "sudo apt-get update && sudo apt-get upgrade -y"
# Install required packages
ssh $node "sudo apt-get install -y \
curl wget vim git htop \
net-tools python3 python3-pip \
postgresql-common"
# Configure system limits
ssh $node "sudo tee /etc/security/limits.d/postgres.conf" <<EOF
postgres soft nofile 65536
postgres hard nofile 65536
postgres soft nproc 8192
postgres hard nproc 8192
EOF
# Configure sysctl
ssh $node "sudo tee /etc/sysctl.d/99-postgres.conf" <<EOF
vm.swappiness = 1
vm.overcommit_memory = 2
vm.dirty_ratio = 10
vm.dirty_background_ratio = 3
net.ipv4.tcp_keepalive_time = 60
net.ipv4.tcp_keepalive_intvl = 10
net.ipv4.tcp_keepalive_probes = 6
EOF
ssh $node "sudo sysctl -p /etc/sysctl.d/99-postgres.conf"
# Create directories
ssh $node "sudo mkdir -p /var/lib/postgresql/wal_archive"
ssh $node "sudo mkdir -p /var/lib/postgresql/backups"
echo "$node configured ✓"
done
echo "Phase 1 complete! ✅"
2.2. Phase 2: etcd cluster (Day 1)
#!/bin/bash
# deploy_phase2.sh - Deploy etcd cluster
echo "=== Phase 2: etcd Cluster Setup ==="
# Using Ansible for etcd deployment
ansible-playbook -i inventory.ini etcd-playbook.yml
# Verify etcd cluster
echo "Verifying etcd cluster health..."
ssh pg-node1 "etcdctl endpoint health --cluster"
# Expected output:
# http://10.0.1.11:2379 is healthy: successfully committed proposal: took = 1.234ms
# http://10.0.1.12:2379 is healthy: successfully committed proposal: took = 1.456ms
# http://10.0.1.13:2379 is healthy: successfully committed proposal: took = 1.678ms
echo "Phase 2 complete! ✅"
2.3. Phase 3: PostgreSQL + Patroni (Day 2)
#!/bin/bash
# deploy_phase3.sh - Deploy PostgreSQL + Patroni
echo "=== Phase 3: PostgreSQL + Patroni Setup ==="
# Deploy with Ansible
ansible-playbook -i inventory.ini postgresql-patroni-playbook.yml
# Wait for cluster initialization
echo "Waiting for Patroni cluster to initialize..."
sleep 60
# Verify cluster
ssh pg-node1 "patronictl -c /etc/patroni/patroni.yml list"
# Expected output:
# + Cluster: postgres-cluster -------+----+-----------+
# | Member | Host | Role | State | TL | Lag in MB |
# +----------+------------+---------+-----------+----+-----------+
# | pg-node1 | 10.0.1.11 | Leader | running | 1 | |
# | pg-node2 | 10.0.1.12 | Replica | streaming | 1 | 0 |
# | pg-node3 | 10.0.1.13 | Replica | streaming | 1 | 0 |
# +----------+------------+---------+-----------+----+-----------+
echo "Phase 3 complete! ✅"
2.4. Phase 4: Connection pooling (Day 2)
#!/bin/bash
# deploy_phase4.sh - Deploy PgBouncer
echo "=== Phase 4: PgBouncer Setup ==="
ansible-playbook -i inventory.ini pgbouncer-playbook.yml
# Test connection through PgBouncer (SHOW POOLS is only available on the pgbouncer admin database)
psql -h pg-node1 -p 6432 -U postgres -d pgbouncer -c "SHOW POOLS;"
echo "Phase 4 complete! ✅"
2.5. Phase 5: Load balancing (Day 3)
#!/bin/bash
# deploy_phase5.sh - Deploy HAProxy
echo "=== Phase 5: HAProxy Setup ==="
ansible-playbook -i inventory.ini haproxy-playbook.yml
# Test connections
echo "Testing master connection..."
psql -h postgres-master.example.com -U postgres -c "SELECT inet_server_addr();"
echo "Testing replica connection..."
psql -h postgres-replica.example.com -U postgres -c "SELECT inet_server_addr();"
echo "Phase 5 complete! ✅"
2.6. Phase 6: Monitoring (Day 3)
#!/bin/bash
# deploy_phase6.sh - Deploy monitoring stack
echo "=== Phase 6: Monitoring Setup ==="
ansible-playbook -i inventory.ini monitoring-playbook.yml
# Verify Prometheus targets
curl http://prometheus.example.com:9090/api/v1/targets | jq '.data.activeTargets[] | select(.labels.job=="postgres") | {instance: .labels.instance, health: .health}'
# Import Grafana dashboards
for dashboard in dashboards/*.json; do
  # The /api/dashboards/db endpoint expects the payload wrapped as {"dashboard": {...}, "overwrite": true}
  jq '{dashboard: ., overwrite: true}' "$dashboard" | curl -s -X POST \
    http://admin:admin@grafana.example.com:3000/api/dashboards/db \
    -H "Content-Type: application/json" \
    -d @-
done
echo "Phase 6 complete! ✅"
echo "Grafana: http://grafana.example.com:3000"
2.7. Phase 7: Backup configuration (Day 4)
#!/bin/bash
# deploy_phase7.sh - Configure backups
echo "=== Phase 7: Backup Configuration ==="
# Deploy pgBackRest or WAL-G
ansible-playbook -i inventory.ini backup-playbook.yml
# Install backup cron jobs (crontab - replaces the postgres user's crontab with the heredoc below)
ssh pg-node1 "sudo -u postgres crontab -" <<EOF
# Daily full backup at 2 AM
0 2 * * * /usr/local/bin/pg_backup.sh full
# Hourly incremental backup
0 * * * * /usr/local/bin/pg_backup.sh incremental
# Continuous WAL archiving (handled by PostgreSQL archive_command)
EOF
# Test backup
echo "Testing backup..."
ssh pg-node1 "sudo -u postgres /usr/local/bin/pg_backup.sh full --test"
# Test restore (to separate directory)
echo "Testing restore..."
ssh pg-node1 "sudo -u postgres /usr/local/bin/pg_restore.sh /var/lib/postgresql/restore_test"
echo "Phase 7 complete! ✅"
3. Post-Deployment Validation
3.1. Functional testing
#!/bin/bash
# validate_deployment.sh
echo "=== Deployment Validation ==="
# Test 1: Cluster health
echo "Test 1: Cluster health"
patronictl -c /etc/patroni/patroni.yml list
if [ $? -eq 0 ]; then
echo "✅ Cluster is healthy"
else
echo "❌ Cluster health check failed"
exit 1
fi
# Test 2: Replication lag
echo "Test 2: Replication lag"
LAG=$(psql -h pg-node2 -U postgres -Atc "
SELECT pg_wal_lsn_diff(
pg_last_wal_receive_lsn(),
pg_last_wal_replay_lsn()
);
")
if [ $LAG -lt 1048576 ]; then # < 1MB
echo "✅ Replication lag acceptable: $LAG bytes"
else
echo "⚠️ High replication lag: $LAG bytes"
fi
# Test 3: Write operations
echo "Test 3: Write operations"
psql -v ON_ERROR_STOP=1 -h postgres-master.example.com -U postgres <<EOF
CREATE TABLE validation_test (id serial primary key, data text, created_at timestamp default now());
INSERT INTO validation_test (data) VALUES ('Test data 1'), ('Test data 2'), ('Test data 3');
SELECT * FROM validation_test;
EOF
if [ $? -eq 0 ]; then
echo "✅ Write operations successful"
else
echo "❌ Write operations failed"
exit 1
fi
# Test 4: Read from replica
echo "Test 4: Read from replica"
psql -h postgres-replica.example.com -U postgres -c "SELECT * FROM validation_test;"
if [ $? -eq 0 ]; then
echo "✅ Read from replica successful"
else
echo "❌ Read from replica failed"
exit 1
fi
# Test 5: Automatic failover
echo "Test 5: Automatic failover (simulation)"
read -p "Press Enter to simulate leader failure..."
CURRENT_LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
ssh $CURRENT_LEADER "sudo systemctl stop patroni"
echo "Waiting 30 seconds for failover..."
sleep 30
patronictl -c /etc/patroni/patroni.yml list
NEW_LEADER=$(patronictl -c /etc/patroni/patroni.yml list | grep Leader | awk '{print $2}')
if [ "$CURRENT_LEADER" != "$NEW_LEADER" ]; then
echo "✅ Automatic failover successful: $CURRENT_LEADER → $NEW_LEADER"
# Restore old leader
ssh $CURRENT_LEADER "sudo systemctl start patroni"
else
echo "❌ Failover did not occur"
exit 1
fi
# Test 6: Backup and restore
echo "Test 6: Backup and restore"
sudo -u postgres /usr/local/bin/pg_backup.sh full
if [ $? -eq 0 ]; then
echo "✅ Backup successful"
else
echo "❌ Backup failed"
exit 1
fi
# Test 7: Monitoring
echo "Test 7: Monitoring"
curl -s 'http://prometheus.example.com:9090/api/v1/query?query=up' | jq -e '.data.result[] | select(.metric.job=="postgres")'
if [ $? -eq 0 ]; then
echo "✅ Monitoring operational"
else
echo "❌ Monitoring check failed"
exit 1
fi
echo ""
echo "🎉 All validation tests passed!"
echo "Production cluster is ready! ✅"
3.2. Performance testing
#!/bin/bash
# performance_test.sh
echo "=== Performance Testing ==="
# Test 1: Single connection throughput
echo "Test 1: Single connection throughput"
createdb testdb 2>/dev/null || true  # ensure the benchmark database exists
pgbench -i -s 100 testdb
pgbench -c 1 -j 1 -t 10000 testdb
# Expected: > 500 TPS
# Test 2: Multi-connection throughput
echo "Test 2: Multi-connection throughput (10 connections)"
pgbench -c 10 -j 2 -t 10000 testdb
# Expected: > 3,000 TPS
# Test 3: Read-only workload
echo "Test 3: Read-only workload on replica"
pgbench -c 10 -j 2 -S -t 10000 -h postgres-replica.example.com testdb
# Expected: > 5,000 TPS
# Test 4: Connection pooling efficiency
echo "Test 4: Connection pooling (100 connections)"
pgbench -c 100 -j 4 -t 1000 -h pg-node1 -p 6432 testdb
# Should handle without errors
# Test 5: Replication lag under load
echo "Test 5: Replication lag under load"
pgbench -c 50 -j 4 -T 60 testdb &
PGBENCH_PID=$!
while kill -0 $PGBENCH_PID 2>/dev/null; do
LAG=$(psql -h pg-node2 -U postgres -Atc "SELECT pg_wal_lsn_diff(pg_last_wal_receive_lsn(), pg_last_wal_replay_lsn());")
echo "Current replication lag: $LAG bytes"
sleep 5
done
# Expected: Lag < 10MB throughout test
echo "Performance testing complete!"
4. Operational Documentation
4.1. Runbook structure
# PostgreSQL HA Cluster Runbook
## 1. Cluster Overview
- Architecture: 3-node Patroni cluster with HAProxy
- Leader: pg-node1 (10.0.1.11)
- Replicas: pg-node2 (10.0.1.12), pg-node3 (10.0.1.13)
- Load balancers: haproxy1, haproxy2
- Monitoring: Prometheus + Grafana
- Backup: Daily full to S3, continuous WAL archiving
## 2. Common Tasks
### 2.1. Check cluster status
patronictl -c /etc/patroni/patroni.yml list
### 2.2. Check replication lag
Run against the current leader (pg_stat_replication is populated only on the primary):
psql -h postgres-master.example.com -U postgres -c "
SELECT client_addr,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes
FROM pg_stat_replication;
"
### 2.3. Perform planned switchover
patronictl -c /etc/patroni/patroni.yml switchover postgres-cluster \
  --leader pg-node1 \
  --candidate pg-node2 \
  --scheduled 'now'
### 2.4. Add new replica
See detailed procedure in Section 8.
### 2.5. Manual backup
sudo -u postgres /usr/local/bin/pg_backup.sh full
## 3. Troubleshooting
### 3.1. Cluster split-brain
Symptoms: Multiple leaders reported
Resolution: See Section 9.1
### 3.2. High replication lag
Symptoms: Lag > 100MB for > 5 minutes
Resolution: See Section 9.2
### 3.3. Disk space exhaustion
Symptoms: Disk usage > 90%
Resolution: See Section 9.3
## 4. Emergency Procedures
### 4.1. Complete cluster failure
- Check etcd cluster health
- If etcd is down, restore it from backup
- Reinitialize the Patroni cluster
- Restore data from backup if needed
### 4.2. Data corruption
- Stop writes (set the cluster read-only)
- Identify the extent of the corruption
- Perform PITR to a point before the corruption (see the sketch after this runbook)
- Validate the restored data
- Resume normal operations
## 5. Escalation
- L1 Support: DevOps on-call (PagerDuty)
- L2 Support: DBA team (Slack: #dba-oncall)
- L3 Support: Senior DBA (Phone: xxx-xxx-xxxx)
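The PITR step referenced in procedure 4.2 deserves its own sketch. Assuming the pgBackRest setup from Phase 7 (stanza `main`), a greatly simplified recovery on a single node could look like the following; a real procedure in a Patroni-managed cluster also has to reset the DCS state and re-initialize the other members, so treat this only as an outline:

```bash
# Illustrative PITR outline with pgBackRest - target time and stanza name are examples only
sudo systemctl stop patroni        # stop Patroni so it does not restart PostgreSQL mid-restore

sudo -u postgres pgbackrest --stanza=main --delta \
    --type=time --target="2025-06-01 09:40:00" \
    --target-action=promote restore

sudo systemctl start patroni       # PostgreSQL replays WAL up to the target, then promotes
```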
4.2. Monitoring dashboard guide
# Grafana Dashboard Guide
## Primary Dashboard: PostgreSQL Cluster Overview
### Panels:
1. **Cluster Health**
- Shows current leader
- Replica count
- Failed/stopped nodes
- Alert: Any node down for > 1 minute
2. **Query Performance**
- Queries per second (QPS)
- Average query duration
- 95th percentile latency
- Alert: p95 latency > 100ms
3. **Replication Lag**
- Lag in bytes for each replica
- Lag in seconds
- Alert: Lag > 10MB or > 10 seconds
4. **Resource Usage**
- CPU usage per node
- Memory usage
- Disk I/O
- Alert: CPU > 80%, Memory > 90%, Disk > 85%
5. **Connections**
- Active connections
- Idle connections
- PgBouncer pool usage
- Alert: Connections > 90% of max_connections
6. **Disk Space**
- Data directory usage
- WAL directory usage
- Backup storage usage
- Alert: Any filesystem > 85%
7. **Backup Status**
- Last backup time
- Backup size
- WAL archiving status
- Alert: No backup in 25 hours
## How to Use:
- Access: http://grafana.example.com:3000
- Username: admin (password stored in 1Password)
- Time range: Last 1 hour (default), adjustable
- Refresh: 10 seconds auto-refresh
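The alert thresholds listed above translate directly into Prometheus alerting rules. A minimal sketch of two of them (the `pg_up` metric comes from postgres_exporter; the lag metric is assumed to come from a custom exporter query, so adjust the name to whatever your exporter actually exposes):

```yaml
# alert_rules.yml (excerpt) - illustrative rules for two of the thresholds above
groups:
  - name: postgresql
    rules:
      - alert: PostgresInstanceDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL on {{ $labels.instance }} is down"

      - alert: PostgresHighReplicationLag
        expr: pg_replication_lag_bytes > 10485760   # 10 MB; metric name assumes a custom query
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag above 10MB on {{ $labels.instance }}"
```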
5. Knowledge Transfer
5.1. Training checklist
☐ PostgreSQL fundamentals
  ☐ Architecture (processes, memory, storage)
  ☐ Replication (streaming, logical)
  ☐ Backup and recovery (PITR)
☐ Patroni operations
  ☐ Cluster management (patronictl commands)
  ☐ Configuration management (edit-config; see the sketch after this checklist)
  ☐ Failover and switchover
  ☐ Troubleshooting common issues
☐ Monitoring and alerting
  ☐ Grafana dashboard interpretation
  ☐ Prometheus queries
  ☐ Alert handling procedures
  ☐ PagerDuty escalation
☐ Backup and restore
  ☐ Manual backup execution
  ☐ Restore procedures (full and PITR)
  ☐ Backup validation
☐ Incident response
  ☐ Runbook navigation
  ☐ Communication protocols
  ☐ Post-mortem process
☐ Maintenance tasks
  ☐ Vacuum and analyze
  ☐ Index maintenance
  ☐ Configuration changes
  ☐ Version upgrades
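Two of the items above come up constantly in day-to-day operations: dynamic configuration changes through Patroni and routine vacuum. A minimal sketch (parameter values and the table name are illustrative):

```bash
# Change a PostgreSQL parameter cluster-wide via Patroni's dynamic configuration
patronictl -c /etc/patroni/patroni.yml edit-config -p 'max_connections=300' --force

# Parameters that require a restart leave members flagged as "pending restart";
# roll them through one at a time
patronictl -c /etc/patroni/patroni.yml restart postgres-cluster --pending --force

# Routine maintenance on a hot table, run in a low-traffic window
psql -h postgres-master.example.com -U postgres -c "VACUUM (ANALYZE, VERBOSE) validation_test;"
```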
5.2. Handoff meeting agenda
# Production Cluster Handoff Meeting
Date: [Date]
Duration: 2 hours
Attendees: Project team, Operations team, Management
## Agenda:
1. **Introduction** (10 min)
- Project overview
- Architecture summary
2. **Live Demo** (30 min)
- Cluster status check
- Query execution
- Monitoring dashboards
- Simulate failover
- Restore from backup
3. **Documentation Review** (20 min)
- Architecture diagrams
- Runbooks
- Monitoring guide
- Backup procedures
4. **Handoff Materials** (15 min)
- Access credentials (1Password)
- Git repository access
- Monitoring URL and credentials
- PagerDuty integration
- Contact list
5. **Q&A** (30 min)
- Open questions from operations team
- Clarifications
6. **Action Items** (10 min)
- Shadow period: 2 weeks
- First on-call rotation
- Knowledge assessment date
7. **Sign-off** (5 min)
- Formal handoff acceptance
- Support plan for first 30 days
## Deliverables:
- [ ] Architecture documentation (Confluence)
- [ ] Runbooks (GitHub)
- [ ] Monitoring dashboards (Grafana)
- [ ] Access credentials (1Password)
- [ ] Contact list (PagerDuty)
- [ ] Training materials (Google Drive)
6. Production Go-Live Checklist
D-7 (One week before):
☐ All validation tests passed
☐ Performance benchmarks met
☐ Monitoring and alerting verified
☐ Backup and restore tested
☐ Runbooks reviewed and approved
☐ Operations team trained
☐ Stakeholders notified of go-live date
☐ Rollback plan documented
D-1 (Day before):
☐ Final smoke tests passed
☐ All data migrated (if applicable)
☐ DNS records prepared (not yet applied)
☐ Load balancer configured
☐ On-call rotation confirmed
☐ War room scheduled (Zoom/Slack)
☐ Communication plan ready
D-Day (Go-live):
☐ 08:00: Final system check
☐ 09:00: Enable monitoring alerts
☐ 10:00: Update DNS to point to new cluster
☐ 10:15: Verify application connectivity (see the smoke-check sketch after this checklist)
☐ 10:30: Monitor for errors (30 min)
☐ 11:00: Declare success or rollback
☐ 12:00: Post go-live review meeting
☐ EOD: Document any issues and resolutions
D+1 (Day after):
☐ Review monitoring data (full 24 hours)
☐ Check backup completed successfully
☐ Verify replication lag within targets
☐ Confirm no alerts or incidents
☐ Operations team debrief
D+7 (One week after):
☐ Performance review against baselines
☐ Cost analysis (actual vs estimated)
☐ Lessons learned session
☐ Update documentation with findings
☐ Formal project closure
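The 10:00-10:30 window on go-live day is much calmer when the checks are scripted beforehand. A minimal sketch of a post-cutover smoke check, reusing the DNS names from this lesson (the script and table names are hypothetical):

```bash
#!/bin/bash
# golive_smoke_check.sh - illustrative post-cutover check, run right after the DNS update
set -e

echo "DNS resolution:"
dig +short postgres-master.example.com
dig +short postgres-replica.example.com

echo "Write through the primary endpoint:"
psql -h postgres-master.example.com -U postgres -v ON_ERROR_STOP=1 -c \
  "CREATE TABLE IF NOT EXISTS golive_check (checked_at timestamptz DEFAULT now()); INSERT INTO golive_check DEFAULT VALUES;"

echo "Read back through the replica endpoint:"
psql -h postgres-replica.example.com -U postgres -Atc "SELECT count(*) FROM golive_check;"

echo "Go-live smoke check passed ✅"
```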
7. Final Assessment
7.1. Capstone project requirements
# Capstone Project: Deploy Production-Ready PostgreSQL HA Cluster
## Objective:
Deploy a fully functional, production-ready PostgreSQL High Availability cluster that meets all requirements specified in Lesson 28.
## Requirements:
1. **Architecture** (20 points)
- [ ] 3-node Patroni cluster deployed
- [ ] etcd cluster configured
- [ ] HAProxy load balancing implemented
- [ ] Network properly segmented
2. **High Availability** (20 points)
- [ ] Automatic failover functional
- [ ] RTO < 30 seconds demonstrated
- [ ] RPO = 0 (synchronous replication)
- [ ] No single point of failure
3. **Backup & Recovery** (15 points)
- [ ] Automated daily backups configured
- [ ] WAL archiving functional
- [ ] PITR tested successfully
- [ ] Backup retention policy implemented
4. **Monitoring & Alerting** (15 points)
- [ ] Prometheus monitoring deployed
- [ ] Grafana dashboards configured
- [ ] Alert rules defined
- [ ] PagerDuty/Slack integration working
5. **Security** (10 points)
- [ ] SSL/TLS encryption enabled
- [ ] Network firewall rules configured
- [ ] Audit logging enabled
- [ ] Secrets properly managed
6. **Documentation** (10 points)
- [ ] Architecture diagram created
- [ ] Runbooks written
- [ ] Monitoring guide documented
- [ ] Handoff materials prepared
7. **Testing** (10 points)
- [ ] All functional tests passed
- [ ] Performance benchmarks met
- [ ] Failover drill successful
- [ ] PITR restore validated
## Deliverables:
1. Working PostgreSQL HA cluster (accessible for validation)
2. Architecture documentation (Markdown/Confluence)
3. Runbooks (GitHub repository)
4. Monitoring dashboards (Grafana export)
5. Test results and evidence (screenshots/logs)
6. Video presentation (15 minutes)
## Grading:
- Total: 100 points
- Pass: 70+ points
- Excellence: 90+ points
## Submission:
- Due: [Date]
- Format: GitHub repository + video link
- Presentation: Live demo + Q&A (30 minutes)
7.2. Assessment rubric
| Criteria | Excellent (9-10) | Good (7-8) | Satisfactory (5-6) | Needs Improvement (0-4) |
|---|---|---|---|---|
| Architecture | All components deployed, well-designed, scalable | Most components present, minor issues | Basic setup, some components missing | Incomplete or non-functional |
| HA | RTO < 30s, RPO = 0, no downtime | RTO < 60s, minimal RPO, brief downtime | RTO > 60s, some data loss possible | Frequent failures, unacceptable RTO/RPO |
| Backup | Automated, tested, documented | Automated, tested | Manual process, untested | Not implemented |
| Monitoring | Comprehensive, automated alerts | Basic monitoring, some alerts | Manual checks only | No monitoring |
| Security | All best practices implemented | Most security measures in place | Basic security | Insecure configuration |
| Documentation | Comprehensive, clear, actionable | Good documentation, minor gaps | Basic docs, some missing info | Poor or missing docs |
| Testing | All tests passed, thorough | Most tests passed | Some tests passed | Testing incomplete |
8. Summary
Key Achievements
Congratulations! You have completed the PostgreSQL High Availability course.
You have learned:
✅ PostgreSQL replication and HA concepts
✅ Patroni cluster deployment and management
✅ etcd distributed configuration store
✅ Monitoring with Prometheus and Grafana
✅ Backup and recovery (PITR)
✅ Failover and switchover procedures
✅ Security best practices
✅ Multi-datacenter setups
✅ Kubernetes deployment
✅ Configuration management
✅ Upgrade strategies
✅ Real-world case studies
✅ Automation with Ansible
✅ Disaster recovery drills
✅ Architecture design
✅ Production deployment
You are now ready to:
- Deploy and manage production PostgreSQL HA clusters
- Design high-availability database architectures
- Troubleshoot and resolve HA issues
- Implement best practices for database reliability
- Train and mentor others on PostgreSQL HA
Next Steps
Continue your learning:
1. Advanced Topics:
- PostgreSQL internals and performance tuning
- Logical replication and multi-master setups
- Sharding and horizontal scaling (Citus)
- PostgreSQL on Kubernetes at scale
2. Certifications:
- PostgreSQL Certified Professional (PGCP)
- AWS Database Specialty
- Kubernetes Administrator (CKA)
3. Community:
- Join PostgreSQL mailing lists
- Contribute to Patroni/PostgreSQL projects
- Attend PostgreSQL conferences (PGConf)
- Share knowledge through blog posts/talks
4. Practice:
- Build personal projects with HA
- Contribute to open-source databases
- Participate in chaos engineering experiments
- Mentor junior DBAs
Resources
Documentation:
- PostgreSQL Official Docs: https://www.postgresql.org/docs/
- Patroni GitHub: https://github.com/zalando/patroni
- Patroni Docs: https://patroni.readthedocs.io/
Community:
- PostgreSQL Slack: https://postgres-slack.herokuapp.com/
- r/PostgreSQL: https://reddit.com/r/PostgreSQL
- PostgreSQL Discord: https://discord.gg/postgresql
Training:
- Percona PostgreSQL Training
- 2ndQuadrant PostgreSQL Courses
- Crunchy Data PostgreSQL Training
Conferences:
- PGConf.US (annual)
- PostgreSQL Conference Europe
- FOSDEM PostgreSQL DevRoom
Final Words
Thank you for completing this course!
Remember:
- High availability is a journey, not a destination
- Always test your failover procedures
- Document everything
- Automate where possible
- Monitor relentlessly
- Learn from failures
- Share your knowledge
Good luck with your PostgreSQL HA deployments!
Feel free to reach out with questions or feedback.
Happy hacking! 🚀🐘