Lesson 25: Real-world Case Studies
Analysis of real production architectures, scaling strategies, cost optimization, and lessons learned from real-world projects.
Objectives
After this lesson, you will:
- Learn from production PostgreSQL HA deployments
- Understand scaling strategies for high traffic
- Analyze cost optimization techniques
- Study incident post-mortems
- Apply best practices from real scenarios
1. Case Study 1: E-commerce Platform (High Transaction Volume)
1.1. Company profile
Company: Online Retail Platform
Scale: 50M users, 500K daily transactions
Traffic: 10K queries/second peak
Data size: 5TB
Industry: E-commerce
1.2. Architecture
Production Setup:
├─ PostgreSQL 18 + Patroni
├─ 5-node cluster (3 DC1 + 2 DC2)
│ ├─ Leader: AWS r6g.4xlarge (16 vCPU, 128GB RAM)
│ ├─ Replicas: AWS r6g.2xlarge (8 vCPU, 64GB RAM)
│ └─ Storage: io2 SSD, 20K IOPS
├─ PgBouncer connection pooling (transaction mode)
├─ HAProxy load balancing
├─ Redis caching layer
└─ Monitoring: Prometheus + Grafana
Database separation:
├─ orders_db (heavy writes)
├─ products_db (mostly reads)
├─ users_db (mixed workload)
└─ analytics_db (read replica for reports)
1.3. Configuration highlights
# Patroni configuration
postgresql:
  parameters:
    # Memory
    shared_buffers: 32GB
    effective_cache_size: 96GB
    work_mem: 32MB
    maintenance_work_mem: 2GB
    # Connections
    max_connections: 500
    # Write performance
    wal_buffers: 64MB
    checkpoint_completion_target: 0.9
    max_wal_size: 16GB
    min_wal_size: 4GB
    # Query performance
    random_page_cost: 1.1  # SSD
    effective_io_concurrency: 200
    # Parallelism
    max_parallel_workers: 8
    max_parallel_workers_per_gather: 4
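Once Patroni has applied these parameters, the running values can be read back from pg_settings as a quick sanity check (a small verification query, not part of the original runbook):
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem',
               'max_connections', 'max_wal_size', 'random_page_cost');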
1.4. Challenges and solutions
Challenge 1: Connection exhaustion
Problem:
- Peak traffic caused max_connections limit to be hit
- Application errors: "FATAL: too many connections"
Solution:
- Implemented PgBouncer with transaction pooling
- Reduced max_connections from 1000 to 500
- PgBouncer pool_size=100 per database
- Result: Handled 10K app connections with 500 DB connections
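On the database side, connection pressure can be watched by comparing pg_stat_activity with max_connections; a minimal monitoring sketch (not taken from the case study itself):
-- Current sessions vs. the configured limit
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;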
Challenge 2: Replication lag during flash sales
Problem:
- Flash sales caused 50K writes/second
- Replicas lagged by 5-10 seconds
- Read queries returned stale data
Solution:
- Increased wal_sender_timeout and wal_receiver_timeout
- Tuned checkpoint_completion_target to 0.9
- Required applied-on-standby commits for critical order writes; synchronous_commit is not a table-level option, so it is set per transaction:
BEGIN;
SET LOCAL synchronous_commit = 'remote_apply';
-- critical order INSERT/UPDATE statements here
COMMIT;
- Separated analytics queries to dedicated read replica
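During the flash sales the lag itself was the symptom to watch; a monitoring sketch run on the leader (assumes standard streaming replication):
-- Byte and time lag per standby
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;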
Challenge 3: Disk I/O bottleneck
Problem:
- Disk I/O saturated at 95%+ during peak
- Query latency increased from 5ms to 500ms
Solution:
- Upgraded storage from gp3 to io2 (20K IOPS)
- Implemented table partitioning for orders table
- Range partitioning by order_date (daily)
- Automated partition management
- Added btree indexes on frequently queried columns
- Result: I/O dropped to 40%, latency back to 5-10ms
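A simplified sketch of the range-partitioning scheme described above (column names and the identity key are illustrative, not taken from the case study):
-- Daily range partitions on order_date
CREATE TABLE orders (
    order_id     BIGINT GENERATED ALWAYS AS IDENTITY,
    customer_id  BIGINT NOT NULL,
    order_date   DATE NOT NULL,
    total_amount NUMERIC(12,2),
    PRIMARY KEY (order_id, order_date)   -- partition key must be part of the PK
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2025_01_01 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-01-02');
-- Indexes on the partitioned parent cascade to every partition (PostgreSQL 11+)
CREATE INDEX ON orders (customer_id);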
1.5. Key metrics
Performance:
- Query throughput: 10K qps peak
- Average query latency: 8ms
- 99th percentile latency: 50ms
- Replication lag: < 100ms
- Failover time: 15 seconds
Availability:
- Uptime: 99.99% (52 minutes downtime/year)
- Unplanned downtime: 2 incidents, 15 minutes total
- Planned maintenance: 0 downtime (rolling updates)
Cost:
- Infrastructure: $15K/month (AWS)
- Staffing: 2 DBAs + 1 SRE
- Total: ~$50K/month
2. Case Study 2: SaaS Application (Multi-tenant)
2.1. Company profile
Company: Project Management SaaS
Scale: 100K tenants, 5M users
Traffic: 2K queries/second average
Data size: 2TB
Industry: SaaS
2.2. Architecture
Multi-tenant Strategy:
├─ Shared database, separate schemas per tenant
├─ Row-level security (RLS) for data isolation
├─ Connection pooling per tenant
└─ Automated backup per tenant
Cluster Setup:
├─ 3-node Patroni cluster
├─ PostgreSQL 18, self-managed on GCP (Cloud SQL not used)
├─ Compute Engine n2-highmem-8 (8 vCPU, 64GB RAM)
├─ Persistent SSD, 10K IOPS
└─ Automated daily backups to GCS
2.3. Multi-tenancy implementation
-- Schema per tenant
CREATE SCHEMA tenant_12345;
CREATE SCHEMA tenant_12346;
-- Row-level security
CREATE POLICY tenant_isolation ON users
USING (tenant_id = current_setting('app.current_tenant')::bigint);
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
-- Application sets tenant context
SET app.current_tenant = '12345';
-- Query automatically filtered by RLS
SELECT * FROM users; -- Only sees tenant 12345's data
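Two hardening details worth noting for this pattern (a sketch, not part of the original setup): table owners bypass RLS unless it is forced, and current_setting() raises an error when the tenant GUC is unset.
-- Apply the policy to the table owner as well
ALTER TABLE users FORCE ROW LEVEL SECURITY;
-- Variant of the policy that returns no rows instead of erroring when the GUC is unset
DROP POLICY IF EXISTS tenant_isolation ON users;
CREATE POLICY tenant_isolation ON users
    USING (tenant_id = current_setting('app.current_tenant', true)::bigint);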
2.4. Challenges and solutions
Challenge 1: Large tenant impact
Problem:
- One tenant (10% of data) caused high CPU usage
- Impacted all other tenants
- "Noisy neighbor" problem
Solution:
- Implemented query timeout per tenant
ALTER ROLE tenant_12345 SET statement_timeout = '30s';
- Added work_mem limit per tenant
ALTER ROLE tenant_12345 SET work_mem = '8MB';
- Moved largest tenants to dedicated instances
- Implemented fair queuing with pg_cron
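When each tenant connects as its own role (as in the ALTER ROLE commands above), the heaviest tenants can be identified from pg_stat_statements; a sketch assuming the extension is installed (column names are the PostgreSQL 13+ ones):
-- Top roles by total execution time
SELECT r.rolname,
       round(sum(s.total_exec_time)) AS total_ms,
       sum(s.calls) AS calls
FROM pg_stat_statements s
JOIN pg_roles r ON r.oid = s.userid
GROUP BY r.rolname
ORDER BY total_ms DESC
LIMIT 10;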
Challenge 2: Backup/restore for specific tenant
Problem:
- Needed to restore one tenant's data
- Full restore would impact all tenants
Solution:
- Implemented per-schema backup script:
#!/bin/bash
set -euo pipefail
TENANT_ID=$1
pg_dump -n "tenant_${TENANT_ID}" myapp > "tenant_${TENANT_ID}_backup.sql"
- Logical backup to S3 per tenant, daily
- PITR covers the full database; the per-tenant logical dumps provide granular restore
Challenge 3: Schema migration across 100K tenants
Problem:
- Need to add column to table
- 100K schemas = 100K migrations
- Can't hold lock that long
Solution:
- Multi-phase migration:
1. Add column as nullable (fast, no rewrite)
2. Backfill data in batches (chunked updates)
3. Add default value (after backfill)
4. Add NOT NULL constraint (after validation)
-- Phase 1: Add column as nullable (instant, no table rewrite).
-- In the schema-per-tenant model this DDL runs once per tenant schema.
ALTER TABLE users ADD COLUMN last_login_at TIMESTAMP;
-- Phase 2: Backfill, one tenant at a time
-- (COMMIT inside DO requires PostgreSQL 11+ and must not run inside an outer transaction)
DO $$
DECLARE
    tenant RECORD;
BEGIN
    FOR tenant IN
        SELECT schema_name FROM information_schema.schemata
        WHERE schema_name LIKE 'tenant_%'
    LOOP
        EXECUTE format(
            'UPDATE %I.users SET last_login_at = created_at WHERE last_login_at IS NULL',
            tenant.schema_name);
        COMMIT;  -- Commit per tenant to keep transactions short
    END LOOP;
END $$;
-- Phase 3: Add default (after backfill)
ALTER TABLE users ALTER COLUMN last_login_at SET DEFAULT now();
-- Phase 4: Add NOT NULL (after validation)
ALTER TABLE users ALTER COLUMN last_login_at SET NOT NULL;
2.5. Key metrics
Performance:
- Query throughput: 2K qps average
- Average query latency: 15ms
- Replication lag: < 50ms
- Largest tenant: 50GB (isolated)
Availability:
- Uptime: 99.95%
- Failover time: 20 seconds
Cost:
- Infrastructure: $5K/month (GCP)
- Staffing: 1 DBA
- Cost per tenant: $0.05/month
3. Case Study 3: Financial Services (Compliance-heavy)
3.1. Company profile
Company: Online Banking Platform
Scale: 1M users, $100M transactions/day
Traffic: 500 queries/second
Data size: 10TB
Industry: Financial Services (heavily regulated)
3.2. Architecture
Compliance-focused Setup:
├─ PostgreSQL 18 + Patroni (on-premises)
├─ 5-node cluster + 2 DR site nodes
├─ HPE servers (bare metal, 32-core, 256GB RAM)
├─ Enterprise SSD RAID 10
├─ Full encryption at rest (LUKS)
├─ SSL/TLS for all connections
├─ pgAudit enabled (log all queries)
├─ Backup retention: 7 years (compliance)
└─ Disaster recovery tested quarterly
Security measures:
├─ Network: Air-gapped from internet
├─ Authentication: Client certificates + SCRAM-SHA-256
├─ Authorization: Row-level security for sensitive data
├─ Auditing: All queries logged to SIEM
└─ Monitoring: 24/7 SOC
3.3. Compliance configuration
-- Enable pgAudit (requires pgaudit in shared_preload_libraries and a restart)
CREATE EXTENSION pgaudit;
ALTER SYSTEM SET pgaudit.log = 'all';
ALTER SYSTEM SET pgaudit.log_catalog = off;
ALTER SYSTEM SET pgaudit.log_parameter = on;
ALTER SYSTEM SET pgaudit.log_relation = on;
-- Immutable audit table
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
user_name TEXT NOT NULL,
query TEXT NOT NULL,
client_ip INET NOT NULL
) WITH (fillfactor=100); -- No updates, append-only
-- Prevent deletion (compliance)
CREATE RULE no_delete AS ON DELETE TO audit_log DO INSTEAD NOTHING;
CREATE RULE no_update AS ON UPDATE TO audit_log DO INSTEAD NOTHING;
-- Separate tablespace on WORM (Write Once Read Many) storage
CREATE TABLESPACE audit_ts LOCATION '/mnt/worm_storage/audit';
ALTER TABLE audit_log SET TABLESPACE audit_ts;
3.4. Challenges and solutions
Challenge 1: 7-year backup retention
Problem:
- Compliance requires 7 years of backups
- 10TB database = 365 x 7 = 2,555 daily backups
- Storage costs astronomical
Solution:
- Implemented tiered backup strategy:
- Daily full backups: 30 days (hot storage)
- Weekly full backups: 1 year (warm storage)
- Monthly full backups: 7 years (cold storage - tape)
- Compression with pgBackRest
- Result: Reduced storage from 25PB to 5PB
Challenge 2: Zero-tolerance for data loss (RPO = 0)
Problem:
- Banking regulations require no data loss
- Async replication has lag window
Solution:
- Synchronous replication to 2 replicas
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (node2, node3, node4)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
- Trade-off: 10ms additional latency
- Acceptable for financial transactions
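Whether the ANY 2 quorum is actually in effect can be verified on the leader (a quick check, not from the case study):
-- sync_state should report 'quorum' for the candidate standbys
SELECT application_name, state, sync_state, sync_priority
FROM pg_stat_replication;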
Challenge 3: Disaster recovery drills
Problem:
- Quarterly DR drills required by auditors
- Can't disrupt production
Solution:
- Automated DR failover testing:
1. Clone production to DR site (logical replication)
2. Promote DR site to primary
3. Run smoke tests (read-only queries)
4. Measure RTO (target: < 1 hour)
5. Restore production primary
6. Document results for audit
- Implemented with Ansible playbooks
- Full drill takes 2 hours (outside business hours)
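The smoke tests in step 3 can be a short read-only script run against the promoted DR node; a minimal sketch (the transactions and accounts tables are illustrative):
-- Confirm the DR node was promoted (expected: false)
SELECT pg_is_in_recovery();
-- Confirm recent data arrived before the failover point
SELECT now() - max(created_at) AS data_age FROM transactions;
-- Basic spot check
SELECT count(*) FROM accounts;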
3.5. Key metrics
Performance:
- Query throughput: 500 qps
- Average query latency: 20ms (with sync replication)
- Replication lag: 0ms (synchronous)
- Failover time: 30 seconds
Availability:
- Uptime: 99.999% (5 minutes downtime/year)
- Unplanned downtime: 0 (in last 2 years)
Compliance:
- Audit log retention: 7 years
- Backup retention: 7 years
- DR drills: Quarterly (100% success rate)
Cost:
- Infrastructure: $30K/month (on-prem)
- Staffing: 3 DBAs + 2 security engineers
- Total: ~$100K/month
4. Case Study 4: Social Media Platform (Read-heavy)
4.1. Company profile
Company: Social Media App
Scale: 500M users, 10B posts
Traffic: 50K queries/second (95% reads)
Data size: 50TB
Industry: Social Media
4.2. Architecture
Read-heavy Optimization:
├─ 1 Leader (writes only)
├─ 20 Read replicas (geographically distributed)
├─ CDN for static content
├─ Redis for session/cache
├─ Elasticsearch for search
└─ S3 for media files
Database sharding:
├─ Shard by user_id (hash-based)
├─ 50 shards (1TB each)
├─ Each shard: 1 leader + 5 replicas
└─ Vitess for shard management
4.3. Read scaling strategy
# Application logic (pseudocode): read queries routed to replicas, writes to the leader
if query_type == 'SELECT':
    conn = connect_to_replica()
else:
    conn = connect_to_leader()
# Geographic routing: pick the replica closest to the user
if user_location == 'us-west':
    replica = 'pg-us-west-replica-1'
elif user_location == 'eu-central':
    replica = 'pg-eu-central-replica-1'
else:
    replica = 'pg-us-east-replica-1'
4.4. Challenges and solutions
Challenge 1: Replication lag visible to users
Problem:
- User posts content, immediately refreshes page
- Content not visible (read from lagging replica)
- User thinks post failed
Solution:
- Sticky sessions after write:
1. User writes to leader
2. Application stores LSN in session cookie
3. Next read checks replica LSN >= session LSN
4. If replica behind, route to leader temporarily
5. After replica catches up, route back to replica
-- PostgreSQL 10+ function
SELECT pg_last_wal_replay_lsn(); -- On replica
SELECT pg_current_wal_lsn(); -- On leader
# App logic (pseudocode); replica_is_behind() is a hypothetical helper that runs
# the server-side LSN check sketched below instead of comparing LSN strings
if replica_is_behind(session_lsn):
    route_to_leader()
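A minimal version of that server-side check (the '0/0' literal stands in for the session LSN stored in the cookie):
-- Run on the replica: has it replayed past the LSN captured after the user's write?
SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0') >= 0 AS caught_up;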
Challenge 2: Hot partition (celebrity posts)
Problem:
- Celebrity with 100M followers posts content
- Single partition overwhelmed
- Query latency spikes to 10 seconds
Solution:
- Identify hot users (> 1M followers)
- Replicate hot user data to all shards
- Denormalize celebrity posts to separate table
- Use materialized views for timeline generation
CREATE MATERIALIZED VIEW celebrity_timeline AS
SELECT * FROM posts
WHERE user_id IN (SELECT user_id FROM celebrities)
ORDER BY created_at DESC;
-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view
-- (id is assumed here to be the primary key of posts)
CREATE UNIQUE INDEX ON celebrity_timeline (id);
-- Refresh every 5 minutes
REFRESH MATERIALIZED VIEW CONCURRENTLY celebrity_timeline;
Challenge 3: Managing 20 read replicas
Problem:
- Manual management of 20 replicas is error-prone
- Need to add/remove replicas dynamically
Solution:
- Kubernetes + Zalando Postgres Operator
- Auto-scaling based on CPU/query load
- Example: Scale from 20 to 30 replicas during peak hours
apiVersion: acid.zalan.do/v1
kind: postgresql
spec:
  numberOfInstances: 20   # Auto-scaled by HPA
  resources:
    requests:
      cpu: 4
      memory: 16Gi
    limits:
      cpu: 8
      memory: 32Gi
4.5. Key metrics
Performance:
- Query throughput: 50K qps (48K reads, 2K writes)
- Average read latency: 5ms
- Average write latency: 15ms
- Replication lag: 100-500ms (acceptable for social media)
Availability:
- Uptime: 99.9%
- Read replicas can fail without user impact
Cost:
- Infrastructure: $80K/month (AWS)
- 50 shards x (1 leader + 5 replicas) = 300 instances
- Mostly r6g.xlarge (4 vCPU, 32GB RAM)
5. Lessons Learned (Cross-case Analysis)
5.1. Common patterns
✅ What works:
1. Connection pooling (PgBouncer) - Essential for high traffic
2. Read replicas - Cheapest way to scale reads
3. Monitoring with Prometheus - Early problem detection
4. Automated failover (Patroni) - Reduces MTTR
5. Table partitioning - Improves query performance
6. Backup automation - Prevents human error
7. Regular DR drills - Validates procedures
8. Documentation - Critical for incident response
❌ What doesn't work:
1. Over-sharding - Adds complexity without benefit
2. Premature optimization - YAGNI applies to databases too
3. Ignoring replication lag - Causes data consistency issues
4. Manual processes - Error-prone and slow
5. Single point of failure - No HA = no production
5.2. Cost optimization techniques
1. Right-sizing instances:
- Start small, scale up based on metrics
- Use burstable instances (t3/t4g) for dev/staging
- Reserved instances for predictable workloads (40% savings)
2. Storage optimization (see the sizing query after this list):
- gp3 instead of io2 for most workloads (60% cheaper)
- Shrink bloated old partitions with pg_squeeze; compress archived dumps
- Archive to S3 for long-term retention
3. Reduce replica count:
- 2-3 replicas sufficient for most workloads
- Use read cache (Redis) before adding replicas
4. Connection pooling:
- Reduces instance size requirements
- 500 connections → 100 actual DB connections
5. Serverless options:
- AWS RDS Proxy + Aurora Serverless for variable workload
- Pay per request instead of fixed capacity
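For the right-sizing and storage items above, the usual starting point is knowing where the space actually goes; a sketch:
-- Largest relations by total size (heap + indexes + TOAST)
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind IN ('r', 'p', 'm')
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 20;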
5.3. When to NOT use Patroni
Consider alternatives if:
1. Single instance is sufficient (< 100 qps)
2. Cloud-managed HA available (RDS, Cloud SQL)
3. Don't have skilled PostgreSQL DBA
4. Budget very limited
5. Development/testing only
Use Patroni when:
1. Need full control over configuration
2. On-premises or hybrid cloud
3. Compliance requires self-managed
4. Cost optimization vs managed services
5. High availability is critical
6. Lab Exercises
Lab 1: Calculate capacity planning
Tasks:
- Estimate queries per second for your use case (a sampling query follows this list)
- Calculate required connections
- Size instance (CPU, RAM, storage)
- Estimate replication lag for replica count
- Calculate total infrastructure cost
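For the queries-per-second estimate, the transaction counters in pg_stat_database can be sampled over an interval (a rough sketch; run it twice, e.g. 60 seconds apart, and divide the delta by the interval):
SELECT datname, xact_commit + xact_rollback AS total_xacts
FROM pg_stat_database
WHERE datname = current_database();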
Lab 2: Design multi-tenant architecture
Tasks:
- Choose tenancy model (shared vs dedicated)
- Implement row-level security
- Create backup strategy per tenant
- Design migration procedure
- Test noisy neighbor mitigation
Lab 3: Implement read replica scaling
Tasks:
- Add read replica to cluster
- Implement read/write routing in application
- Measure replication lag
- Test failover with replicas
- Monitor query distribution
Lab 4: Cost optimization analysis
Tasks:
- Audit current infrastructure costs
- Identify optimization opportunities
- Implement connection pooling
- Right-size instances
- Calculate cost savings
7. Summary
Architecture Patterns Summary
| Pattern | Best For | Complexity | Cost |
|---|---|---|---|
| Single Leader + Replicas | Read-heavy | Low | Low |
| Multi-datacenter | Geographic distribution | High | High |
| Sharding | Horizontal scaling | Very High | Medium |
| Multi-tenant | SaaS applications | Medium | Low |
Key Takeaways
1. Connection pooling is non-negotiable at scale
2. Read replicas are the easiest way to scale
3. Monitoring and alerting prevent incidents
4. Backup and restore must be tested regularly
5. Documentation saves time during incidents
6. Automation reduces human error
7. Cost optimization is ongoing effort
8. Right-sizing prevents over-provisioning
Next Steps
Lesson 26 will cover Automation with Ansible:
- Ansible playbooks for Patroni deployment
- Configuration management automation
- Automated testing frameworks
- CI/CD integration for database changes
- Infrastructure as Code