Lesson 25: Real-world Case Studies
Analysis of real production architectures, scaling strategies, cost optimization, and lessons learned from real-world projects.
Objectives
After this lesson, you will:
- Learn from production PostgreSQL HA deployments
- Understand scaling strategies for high traffic
- Analyze cost optimization techniques
- Study incident post-mortems
- Apply best practices from real scenarios
1. Case Study 1: E-commerce Platform (High Transaction Volume)
1.1. Company profile
Company: Online Retail Platform
Scale: 50M users, 500K daily transactions
Traffic: 10K queries/second peak
Data size: 5TB
Industry: E-commerce
1.2. Architecture
Production Setup:
├─ PostgreSQL 18 + Patroni
├─ 5-node cluster (3 DC1 + 2 DC2)
│ ├─ Leader: AWS r6g.4xlarge (16 vCPU, 128GB RAM)
│ ├─ Replicas: AWS r6g.2xlarge (8 vCPU, 64GB RAM)
│ └─ Storage: io2 SSD, 20K IOPS
├─ PgBouncer connection pooling (transaction mode)
├─ HAProxy load balancing
├─ Redis caching layer
└─ Monitoring: Prometheus + Grafana
Database separation:
├─ orders_db (heavy writes)
├─ products_db (mostly reads)
├─ users_db (mixed workload)
└─ analytics_db (read replica for reports)
1.3. Configuration highlights
# Patroni configuration
postgresql:
  parameters:
    # Memory
    shared_buffers: 32GB
    effective_cache_size: 96GB
    work_mem: 32MB
    maintenance_work_mem: 2GB
    # Connections
    max_connections: 500
    # Write performance
    wal_buffers: 64MB
    checkpoint_completion_target: 0.9
    max_wal_size: 16GB
    min_wal_size: 4GB
    # Query performance
    random_page_cost: 1.1  # SSD
    effective_io_concurrency: 200
    # Parallelism
    max_parallel_workers: 8
    max_parallel_workers_per_gather: 4
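Once Patroni has applied these parameters, the running values can be read back from pg_settings as a quick sanity check (a small verification query, not part of the original runbook):
SELECT name, setting, unit, source
FROM pg_settings
WHERE name IN ('shared_buffers', 'effective_cache_size', 'work_mem',
               'max_connections', 'max_wal_size', 'random_page_cost');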
1.4. Challenges and solutions
Challenge 1: Connection exhaustion
Problem:
- Peak traffic caused max_connections limit to be hit
- Application errors: "FATAL: too many connections"
Solution:
- Implemented PgBouncer with transaction pooling
- Reduced max_connections from 1000 to 500
- PgBouncer pool_size=100 per database
- Result: Handled 10K app connections with 500 DB connections
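On the database side, connection pressure can be watched by comparing pg_stat_activity with max_connections; a minimal monitoring sketch (not taken from the case study itself):
-- Current sessions vs. the configured limit
SELECT count(*) AS current_connections,
       current_setting('max_connections')::int AS max_connections,
       round(100.0 * count(*) / current_setting('max_connections')::int, 1) AS pct_used
FROM pg_stat_activity;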
Challenge 2: Replication lag during flash sales
Problem:
- Flash sales caused 50K writes/second
- Replicas lagged by 5-10 seconds
- Read queries returned stale data
Solution:
- Increased wal_sender_timeout and wal_receiver_timeout
- Tuned checkpoint_completion_target to 0.9
- Required applied-on-standby commits for critical order writes; synchronous_commit is not a table-level option, so it is set per transaction:
BEGIN;
SET LOCAL synchronous_commit = 'remote_apply';
-- critical order INSERT/UPDATE statements here
COMMIT;
- Separated analytics queries to dedicated read replica
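During the flash sales the lag itself was the symptom to watch; a monitoring sketch run on the leader (assumes standard streaming replication):
-- Byte and time lag per standby
SELECT application_name,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS replay_lag_bytes,
       replay_lag
FROM pg_stat_replication;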
Challenge 3: Disk I/O bottleneck
Problem:
- Disk I/O saturated at 95%+ during peak
- Query latency increased from 5ms to 500ms
Solution:
- Upgraded storage from gp3 to io2 (20K IOPS)
- Implemented table partitioning for orders table
- Range partitioning by order_date (daily)
- Automated partition management
- Added btree indexes on frequently queried columns
- Result: I/O dropped to 40%, latency back to 5-10ms
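A simplified sketch of the range-partitioning scheme described above (column names and the identity key are illustrative, not taken from the case study):
-- Daily range partitions on order_date
CREATE TABLE orders (
    order_id     BIGINT GENERATED ALWAYS AS IDENTITY,
    customer_id  BIGINT NOT NULL,
    order_date   DATE NOT NULL,
    total_amount NUMERIC(12,2),
    PRIMARY KEY (order_id, order_date)   -- partition key must be part of the PK
) PARTITION BY RANGE (order_date);
CREATE TABLE orders_2025_01_01 PARTITION OF orders
    FOR VALUES FROM ('2025-01-01') TO ('2025-01-02');
-- Indexes on the partitioned parent cascade to every partition (PostgreSQL 11+)
CREATE INDEX ON orders (customer_id);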
1.5. Key metrics
Performance:
- Query throughput: 10K qps peak
- Average query latency: 8ms
- 99th percentile latency: 50ms
- Replication lag: < 100ms
- Failover time: 15 seconds
Availability:
- Uptime: 99.99% (52 minutes downtime/year)
- Unplanned downtime: 2 incidents, 15 minutes total
- Planned maintenance: 0 downtime (rolling updates)
Cost:
- Infrastructure: $15K/month (AWS)
- Staffing: 2 DBAs + 1 SRE
- Total: ~$50K/month
2. Case Study 2: SaaS Application (Multi-tenant)
2.1. Company profile
Company: Project Management SaaS
Scale: 100K tenants, 5M users
Traffic: 2K queries/second average
Data size: 2TB
Industry: SaaS
2.2. Architecture
Multi-tenant Strategy:
├─ Shared database, separate schemas per tenant
├─ Row-level security (RLS) for data isolation
├─ Connection pooling per tenant
└─ Automated backup per tenant
Cluster Setup:
├─ 3-node Patroni cluster
├─ PostgreSQL 18, self-managed on GCP (Cloud SQL not used)
├─ Compute Engine n2-highmem-8 (8 vCPU, 64GB RAM)
├─ Persistent SSD, 10K IOPS
└─ Automated daily backups to GCS
2.3. Multi-tenancy implementation
-- Schema per tenant
CREATE SCHEMA tenant_12345;
CREATE SCHEMA tenant_12346;
-- Row-level security
CREATE POLICY tenant_isolation ON users
USING (tenant_id = current_setting('app.current_tenant')::bigint);
ALTER TABLE users ENABLE ROW LEVEL SECURITY;
-- Application sets tenant context
SET app.current_tenant = '12345';
-- Query automatically filtered by RLS
SELECT * FROM users; -- Only sees tenant 12345's data
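Two hardening details worth noting for this pattern (a sketch, not part of the original setup): table owners bypass RLS unless it is forced, and current_setting() raises an error when the tenant GUC is unset.
-- Apply the policy to the table owner as well
ALTER TABLE users FORCE ROW LEVEL SECURITY;
-- Variant of the policy that returns no rows instead of erroring when the GUC is unset
DROP POLICY IF EXISTS tenant_isolation ON users;
CREATE POLICY tenant_isolation ON users
    USING (tenant_id = current_setting('app.current_tenant', true)::bigint);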
2.4. Challenges and solutions
Challenge 1: Large tenant impact
Problem:
- One tenant (10% of data) caused high CPU usage
- Impacted all other tenants
- "Noisy neighbor" problem
Solution:
- Implemented query timeout per tenant
ALTER ROLE tenant_12345 SET statement_timeout = '30s';
- Added work_mem limit per tenant
ALTER ROLE tenant_12345 SET work_mem = '8MB';
- Moved largest tenants to dedicated instances
- Implemented fair queuing with pg_cron
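When each tenant connects as its own role (as in the ALTER ROLE commands above), the heaviest tenants can be identified from pg_stat_statements; a sketch assuming the extension is installed (column names are the PostgreSQL 13+ ones):
-- Top roles by total execution time
SELECT r.rolname,
       round(sum(s.total_exec_time)) AS total_ms,
       sum(s.calls) AS calls
FROM pg_stat_statements s
JOIN pg_roles r ON r.oid = s.userid
GROUP BY r.rolname
ORDER BY total_ms DESC
LIMIT 10;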
Challenge 2: Backup/restore for specific tenant
Problem:
- Needed to restore one tenant's data
- Full restore would impact all tenants
Solution:
- Implemented per-schema backup script:
#!/bin/bash
set -euo pipefail
TENANT_ID=$1
pg_dump -n "tenant_${TENANT_ID}" myapp > "tenant_${TENANT_ID}_backup.sql"
- Logical backup to S3 per tenant, daily
- PITR covers the full database; the per-tenant logical dumps provide granular restore
Challenge 3: Schema migration across 100K tenants
Problem:
- Need to add column to table
- 100K schemas = 100K migrations
- Can't hold lock that long
Solution:
- Multi-phase migration:
1. Add column as nullable (fast, no rewrite)
2. Backfill data in batches (chunked updates)
3. Add default value (after backfill)
4. Add NOT NULL constraint (after validation)
-- Phase 1: Add column as nullable (instant, no table rewrite).
-- In the schema-per-tenant model this DDL runs once per tenant schema.
ALTER TABLE users ADD COLUMN last_login_at TIMESTAMP;
-- Phase 2: Backfill, one tenant at a time
-- (COMMIT inside DO requires PostgreSQL 11+ and must not run inside an outer transaction)
DO $$
DECLARE
    tenant RECORD;
BEGIN
    FOR tenant IN
        SELECT schema_name FROM information_schema.schemata
        WHERE schema_name LIKE 'tenant_%'
    LOOP
        EXECUTE format(
            'UPDATE %I.users SET last_login_at = created_at WHERE last_login_at IS NULL',
            tenant.schema_name);
        COMMIT;  -- Commit per tenant to keep transactions short
    END LOOP;
END $$;
-- Phase 3: Add default (after backfill)
ALTER TABLE users ALTER COLUMN last_login_at SET DEFAULT now();
-- Phase 4: Add NOT NULL (after validation)
ALTER TABLE users ALTER COLUMN last_login_at SET NOT NULL;
2.5. Key metrics
Performance:
- Query throughput: 2K qps average
- Average query latency: 15ms
- Replication lag: < 50ms
- Largest tenant: 50GB (isolated)
Availability:
- Uptime: 99.95%
- Failover time: 20 seconds
Cost:
- Infrastructure: $5K/month (GCP)
- Staffing: 1 DBA
- Cost per tenant: $0.05/month
3. Case Study 3: Financial Services (Compliance-heavy)
3.1. Company profile
Company: Online Banking Platform
Scale: 1M users, $100M transactions/day
Traffic: 500 queries/second
Data size: 10TB
Industry: Financial Services (heavily regulated)
3.2. Architecture
Compliance-focused Setup:
├─ PostgreSQL 18 + Patroni (on-premises)
├─ 5-node cluster + 2 DR site nodes
├─ HPE servers (bare metal, 32-core, 256GB RAM)
├─ Enterprise SSD RAID 10
├─ Full encryption at rest (LUKS)
├─ SSL/TLS for all connections
├─ pgAudit enabled (log all queries)
├─ Backup retention: 7 years (compliance)
└─ Disaster recovery tested quarterly
Security measures:
├─ Network: Air-gapped from internet
├─ Authentication: Client certificates + SCRAM-SHA-256
├─ Authorization: Row-level security for sensitive data
├─ Auditing: All queries logged to SIEM
└─ Monitoring: 24/7 SOC
3.3. Compliance configuration
-- Enable pgAudit (requires pgaudit in shared_preload_libraries and a restart)
CREATE EXTENSION pgaudit;
ALTER SYSTEM SET pgaudit.log = 'all';
ALTER SYSTEM SET pgaudit.log_catalog = off;
ALTER SYSTEM SET pgaudit.log_parameter = on;
ALTER SYSTEM SET pgaudit.log_relation = on;
-- Immutable audit table
CREATE TABLE audit_log (
id BIGSERIAL PRIMARY KEY,
timestamp TIMESTAMPTZ NOT NULL DEFAULT now(),
user_name TEXT NOT NULL,
query TEXT NOT NULL,
client_ip INET NOT NULL
) WITH (fillfactor=100); -- No updates, append-only
-- Prevent deletion (compliance)
CREATE RULE no_delete AS ON DELETE TO audit_log DO INSTEAD NOTHING;
CREATE RULE no_update AS ON UPDATE TO audit_log DO INSTEAD NOTHING;
-- Separate tablespace on WORM (Write Once Read Many) storage
CREATE TABLESPACE audit_ts LOCATION '/mnt/worm_storage/audit';
ALTER TABLE audit_log SET TABLESPACE audit_ts;
3.4. Challenges and solutions
Challenge 1: 7-year backup retention
Problem:
- Compliance requires 7 years of backups
- 10TB database = 365 x 7 = 2,555 daily backups
- Storage costs astronomical
Solution:
- Implemented tiered backup strategy:
- Daily full backups: 30 days (hot storage)
- Weekly full backups: 1 year (warm storage)
- Monthly full backups: 7 years (cold storage - tape)
- Compression with pgBackRest
- Result: Reduced storage from 25PB to 5PB
Challenge 2: Zero-tolerance for data loss (RPO = 0)
Problem:
- Banking regulations require no data loss
- Async replication has lag window
Solution:
- Synchronous replication to 2 replicas
ALTER SYSTEM SET synchronous_standby_names = 'ANY 2 (node2, node3, node4)';
ALTER SYSTEM SET synchronous_commit = 'remote_apply';
- Trade-off: 10ms additional latency
- Acceptable for financial transactions
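Whether the ANY 2 quorum is actually in effect can be verified on the leader (a quick check, not from the case study):
-- sync_state should report 'quorum' for the candidate standbys
SELECT application_name, state, sync_state, sync_priority
FROM pg_stat_replication;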
Challenge 3: Disaster recovery drills
Problem:
- Quarterly DR drills required by auditors
- Can't disrupt production
Solution:
- Automated DR failover testing:
1. Clone production to DR site (logical replication)
2. Promote DR site to primary
3. Run smoke tests (read-only queries)
4. Measure RTO (target: < 1 hour)
5. Restore production primary
6. Document results for audit
- Implemented with Ansible playbooks
- Full drill takes 2 hours (outside business hours)
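The smoke tests in step 3 can be a short read-only script run against the promoted DR node; a minimal sketch (the transactions and accounts tables are illustrative):
-- Confirm the DR node was promoted (expected: false)
SELECT pg_is_in_recovery();
-- Confirm recent data arrived before the failover point
SELECT now() - max(created_at) AS data_age FROM transactions;
-- Basic spot check
SELECT count(*) FROM accounts;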
3.5. Key metrics
Performance:
- Query throughput: 500 qps
- Average query latency: 20ms (with sync replication)
- Replication lag: 0ms (synchronous)
- Failover time: 30 seconds
Availability:
- Uptime: 99.999% (5 minutes downtime/year)
- Unplanned downtime: 0 (in last 2 years)
Compliance:
- Audit log retention: 7 years
- Backup retention: 7 years
- DR drills: Quarterly (100% success rate)
Cost:
- Infrastructure: $30K/month (on-prem)
- Staffing: 3 DBAs + 2 security engineers
- Total: ~$100K/month
4. Case Study 4: Social Media Platform (Read-heavy)
4.1. Company profile
Company: Social Media App
Scale: 500M users, 10B posts
Traffic: 50K queries/second (95% reads)
Data size: 50TB
Industry: Social Media
4.2. Architecture
Read-heavy Optimization:
├─ 1 Leader (writes only)
├─ 20 Read replicas (geographically distributed)
├─ CDN for static content
├─ Redis for session/cache
├─ Elasticsearch for search
└─ S3 for media files
Database sharding:
├─ Shard by user_id (hash-based)
├─ 50 shards (1TB each)
├─ Each shard: 1 leader + 5 replicas
└─ Vitess for shard management
4.3. Read scaling strategy
# Application logic (pseudocode): read queries routed to replicas, writes to the leader
if query_type == 'SELECT':
    conn = connect_to_replica()
else:
    conn = connect_to_leader()
# Geographic routing: pick the replica closest to the user
if user_location == 'us-west':
    replica = 'pg-us-west-replica-1'
elif user_location == 'eu-central':
    replica = 'pg-eu-central-replica-1'
else:
    replica = 'pg-us-east-replica-1'
4.4. Challenges and solutions
Challenge 1: Replication lag visible to users
Problem:
- User posts content, immediately refreshes page
- Content not visible (read from lagging replica)
- User thinks post failed
Solution:
- Sticky sessions after write:
1. User writes to leader
2. Application stores LSN in session cookie
3. Next read checks replica LSN >= session LSN
4. If replica behind, route to leader temporarily
5. After replica catches up, route back to replica
-- PostgreSQL 10+ function
SELECT pg_last_wal_replay_lsn(); -- On replica
SELECT pg_current_wal_lsn(); -- On leader
# App logic (pseudocode); replica_is_behind() is a hypothetical helper that runs
# the server-side LSN check sketched below instead of comparing LSN strings
if replica_is_behind(session_lsn):
    route_to_leader()
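A minimal version of that server-side check (the '0/0' literal stands in for the session LSN stored in the cookie):
-- Run on the replica: has it replayed past the LSN captured after the user's write?
SELECT pg_wal_lsn_diff(pg_last_wal_replay_lsn(), '0/0') >= 0 AS caught_up;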
Challenge 2: Hot partition (celebrity posts)
Problem:
- Celebrity with 100M followers posts content
- Single partition overwhelmed
- Query latency spikes to 10 seconds
Solution:
- Identify hot users (> 1M followers)
- Replicate hot user data to all shards
- Denormalize celebrity posts to separate table
- Use materialized views for timeline generation
CREATE MATERIALIZED VIEW celebrity_timeline AS
SELECT * FROM posts
WHERE user_id IN (SELECT user_id FROM celebrities)
ORDER BY created_at DESC;
-- REFRESH ... CONCURRENTLY requires a unique index on the materialized view
-- (id is assumed here to be the primary key of posts)
CREATE UNIQUE INDEX ON celebrity_timeline (id);
-- Refresh every 5 minutes
REFRESH MATERIALIZED VIEW CONCURRENTLY celebrity_timeline;
Challenge 3: Managing 20 read replicas
Problem:
- Manual management of 20 replicas is error-prone
- Need to add/remove replicas dynamically
Solution:
- Kubernetes + Zalando Postgres Operator
- Auto-scaling based on CPU/query load
- Example: Scale from 20 to 30 replicas during peak hours
apiVersion: acid.zalan.do/v1
kind: postgresql
spec:
  numberOfInstances: 20   # Auto-scaled by HPA
  resources:
    requests:
      cpu: 4
      memory: 16Gi
    limits:
      cpu: 8
      memory: 32Gi
4.5. Key metrics
Performance:
- Query throughput: 50K qps (48K reads, 2K writes)
- Average read latency: 5ms
- Average write latency: 15ms
- Replication lag: 100-500ms (acceptable for social media)
Availability:
- Uptime: 99.9%
- Read replicas can fail without user impact
Cost:
- Infrastructure: $80K/month (AWS)
- 50 shards x (1 leader + 5 replicas) = 300 instances
- Mostly r6g.xlarge (4 vCPU, 32GB RAM)
5. Lessons Learned (Cross-case Analysis)
5.1. Common patterns
✅ What works:
1. Connection pooling (PgBouncer) - Essential for high traffic
2. Read replicas - Cheapest way to scale reads
3. Monitoring with Prometheus - Early problem detection
4. Automated failover (Patroni) - Reduces MTTR
5. Table partitioning - Improves query performance
6. Backup automation - Prevents human error
7. Regular DR drills - Validates procedures
8. Documentation - Critical for incident response
❌ What doesn't work:
1. Over-sharding - Adds complexity without benefit
2. Premature optimization - YAGNI applies to databases too
3. Ignoring replication lag - Causes data consistency issues
4. Manual processes - Error-prone and slow
5. Single point of failure - No HA = no production
5.2. Cost optimization techniques
1. Right-sizing instances:
- Start small, scale up based on metrics
- Use burstable instances (t3/t4g) for dev/staging
- Reserved instances for predictable workloads (40% savings)
2. Storage optimization (see the sizing query after this list):
- gp3 instead of io2 for most workloads (60% cheaper)
- Shrink bloated old partitions with pg_squeeze; compress archived dumps
- Archive to S3 for long-term retention
3. Reduce replica count:
- 2-3 replicas sufficient for most workloads
- Use read cache (Redis) before adding replicas
4. Connection pooling:
- Reduces instance size requirements
- 500 connections → 100 actual DB connections
5. Serverless options:
- AWS RDS Proxy + Aurora Serverless for variable workload
- Pay per request instead of fixed capacity
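For the right-sizing and storage items above, the usual starting point is knowing where the space actually goes; a sketch:
-- Largest relations by total size (heap + indexes + TOAST)
SELECT relname,
       pg_size_pretty(pg_total_relation_size(oid)) AS total_size
FROM pg_class
WHERE relkind IN ('r', 'p', 'm')
ORDER BY pg_total_relation_size(oid) DESC
LIMIT 20;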
5.3. When to NOT use Patroni
Consider alternatives if:
1. Single instance is sufficient (< 100 qps)
2. Cloud-managed HA available (RDS, Cloud SQL)
3. Don't have skilled PostgreSQL DBA
4. Budget very limited
5. Development/testing only
Use Patroni when:
1. Need full control over configuration
2. On-premises or hybrid cloud
3. Compliance requires self-managed
4. Cost optimization vs managed services
5. High availability is critical
6. Lab Exercises
Lab 1: Calculate capacity planning
Tasks:
- Estimate queries per second for your use case (a sampling query follows this list)
- Calculate required connections
- Size instance (CPU, RAM, storage)
- Estimate replication lag for replica count
- Calculate total infrastructure cost
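For the queries-per-second estimate, the transaction counters in pg_stat_database can be sampled over an interval (a rough sketch; run it twice, e.g. 60 seconds apart, and divide the delta by the interval):
SELECT datname, xact_commit + xact_rollback AS total_xacts
FROM pg_stat_database
WHERE datname = current_database();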
Lab 2: Design multi-tenant architecture
Tasks:
- Choose tenancy model (shared vs dedicated)
- Implement row-level security
- Create backup strategy per tenant
- Design migration procedure
- Test noisy neighbor mitigation
Lab 3: Implement read replica scaling
Tasks:
- Add read replica to cluster
- Implement read/write routing in application
- Measure replication lag
- Test failover with replicas
- Monitor query distribution
Lab 4: Cost optimization analysis
Tasks:
- Audit current infrastructure costs
- Identify optimization opportunities
- Implement connection pooling
- Right-size instances
- Calculate cost savings
7. Summary
Architecture Patterns Summary
| Pattern | Best For | Complexity | Cost |
|---|---|---|---|
| Single Leader + Replicas | Read-heavy | Low | Low |
| Multi-datacenter | Geographic distribution | High | High |
| Sharding | Horizontal scaling | Very High | Medium |
| Multi-tenant | SaaS applications | Medium | Low |
Key Takeaways
1. Connection pooling is non-negotiable at scale
2. Read replicas are the easiest way to scale
3. Monitoring and alerting prevent incidents
4. Backup and restore must be tested regularly
5. Documentation saves time during incidents
6. Automation reduces human error
7. Cost optimization is ongoing effort
8. Right-sizing prevents over-provisioning
Next Steps
Lesson 26 will cover Automation with Ansible:
- Ansible playbooks for Patroni deployment
- Configuration management automation
- Automated testing frameworks
- CI/CD integration for database changes
- Infrastructure as Code