Lesson 19: Logging and Troubleshooting
Analyzing PostgreSQL, Patroni, and etcd logs, resolving common issues, and applying effective debugging techniques.
Objectives
After this lesson, you will be able to:
- Understand and analyze PostgreSQL logs
- Debug Patroni issues through its logs
- Troubleshoot etcd cluster problems
- Identify and fix common HA issues
- Use debugging tools effectively
- Build troubleshooting runbooks
1. PostgreSQL Logging
1.1. Configure logging
-- Essential logging settings
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_directory = 'log';
ALTER SYSTEM SET log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log';
ALTER SYSTEM SET log_rotation_age = '1d';
ALTER SYSTEM SET log_rotation_size = '100MB';
-- What to log
ALTER SYSTEM SET log_min_duration_statement = '1000'; -- Slow queries (>1s)
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_connections = off; -- Too verbose in production
ALTER SYSTEM SET log_disconnections = off;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0; -- Log all temp file usage
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';
-- Log line format
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
ALTER SYSTEM SET log_statement = 'none'; -- 'none', 'ddl', 'mod', 'all'
ALTER SYSTEM SET log_duration = off;
-- Error detail
ALTER SYSTEM SET log_error_verbosity = 'default'; -- 'terse', 'default', 'verbose'
-- Reload
SELECT pg_reload_conf();
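To confirm the new values took effect after the reload (logging_collector in particular requires a full restart), a quick check against pg_settings helps; a minimal sketch:
-- Show current logging parameters and whether any still require a restart
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name LIKE 'log%';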
1.2. Log file location
# Default location
/var/lib/postgresql/18/data/log/
# Or configured location
sudo -u postgres psql -c "SHOW log_directory;"
sudo -u postgres psql -c "SHOW data_directory;"
# View logs
sudo tail -f /var/lib/postgresql/18/data/log/postgresql-*.log
# Or via journalctl
sudo journalctl -u postgresql -f
1.3. Common log patterns
Connection issues
LOG: connection received: host=10.0.1.50 port=54321
FATAL: password authentication failed for user "app_user"
DETAIL: Connection matched pg_hba.conf line 95: "host all all 0.0.0.0/0 md5"
→ Solution: Check password, pg_hba.conf configuration
Slow queries
LOG: duration: 5432.123 ms statement: SELECT * FROM large_table WHERE ...
→ Solution: Analyze query, add indexes, optimize
Deadlocks
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 67890; blocked by process 12346.
Process 12346 waits for ShareLock on transaction 67891; blocked by process 12345.
HINT: See server log for query details.
→ Solution: Review transaction order, add proper locking
Checkpoint warnings
LOG: checkpoints are occurring too frequently (24 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
→ Solution: Increase max_wal_size
Replication lag
WARNING: could not send data to WAL receiver; retrying
DETAIL: Connection to replica timed out
→ Solution: Check network, replica health
Out of disk space
FATAL: could not write to file "base/16384/12345": No space left on device
→ Solution: Free up disk space, increase storage
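A quick grep pass over the newest log file is often enough to surface the patterns above before reaching for heavier tooling; a minimal sketch assuming the default log location from section 1.2:
# Pick the most recent log file
LOGFILE=$(ls -t /var/lib/postgresql/18/data/log/postgresql-*.log | head -1)
# Count errors and fatals
grep -cE "ERROR|FATAL" "$LOGFILE"
# Surface the specific patterns discussed above, with line numbers
grep -nE "deadlock detected|checkpoints are occurring too frequently|password authentication failed|No space left on device" "$LOGFILE"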
1.4. Analyze logs with pgBadger
# Install pgBadger
sudo apt-get install -y pgbadger
# Or from source
wget https://github.com/darold/pgbadger/archive/v12.2.tar.gz
tar -xzf v12.2.tar.gz
cd pgbadger-12.2
perl Makefile.PL
make && sudo make install
# Generate report
pgbadger /var/lib/postgresql/18/data/log/postgresql-*.log \
-o /tmp/pgbadger-report.html
# Open in browser
# file:///tmp/pgbadger-report.html
# Report includes:
# - Top slow queries
# - Connection statistics
# - Lock analysis
# - Checkpoint activity
# - Temporary file usage
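If you want reports generated regularly, the same command can be scheduled from cron; a minimal sketch (the schedule and output path are assumptions, adjust to your environment):
# /etc/cron.d/pgbadger -- nightly report at 01:00 (hypothetical schedule and output path)
0 1 * * * postgres pgbadger /var/lib/postgresql/18/data/log/postgresql-*.log -o /var/reports/pgbadger-$(date +\%F).html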
2. Patroni Logging
2.1. Patroni log levels
# In patroni.yml
log:
  level: INFO                                  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  format: '%(asctime)s %(levelname)s: %(message)s'
  dateformat: '%Y-%m-%d %H:%M:%S'
  max_queue_size: 1000
  dir: /var/log/patroni
  file_num: 10
  file_size: 25000000                          # 25MB
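With the file logger configured as above, Patroni also writes rotating log files into the configured dir, which can be tailed directly when journald is not available; a sketch (the patroni.log filename is an assumption and may differ between versions):
# Tail the Patroni file log in the directory configured above
sudo tail -f /var/log/patroni/patroni.log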
2.2. View Patroni logs
# Via journalctl (if systemd service)
sudo journalctl -u patroni -f
# Show last 100 lines
sudo journalctl -u patroni -n 100
# Show logs since specific time
sudo journalctl -u patroni --since "1 hour ago"
sudo journalctl -u patroni --since "2024-11-25 10:00:00"
# Search for specific keywords
sudo journalctl -u patroni | grep -i "error\|warning\|fail"
# Export to file
sudo journalctl -u patroni --since today > /tmp/patroni-today.log
2.3. Common Patroni log patterns
Successful bootstrap
2024-11-25 10:00:00 INFO: Selected new etcd server http://10.0.1.11:2379
2024-11-25 10:00:01 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-11-25 10:00:02 INFO: Lock owner: None; I am node1
2024-11-25 10:00:03 INFO: trying to bootstrap a new cluster
2024-11-25 10:00:05 INFO: Running custom bootstrap script: /usr/lib/postgresql/18/bin/initdb
2024-11-25 10:00:15 INFO: postmaster pid=12345
2024-11-25 10:00:16 INFO: initialized a new cluster
2024-11-25 10:00:17 INFO: acquired session lock as a leader ✅
Failed to acquire lock (replica)
2024-11-25 10:05:00 INFO: Lock owner: node1; I am node2
2024-11-25 10:05:01 INFO: does not have lock
2024-11-25 10:05:02 INFO: starting as a secondary
2024-11-25 10:05:05 INFO: replica has been created using basebackup
Failover detected
2024-11-25 11:30:00 WARNING: Request failed to node1:8008: connection refused
2024-11-25 11:30:10 INFO: promoted self to leader by acquiring session lock ✅
2024-11-25 11:30:11 INFO: cleared rewind state after becoming the leader
2024-11-25 11:30:12 INFO: running post_promote script
DCS connection issues
2024-11-25 12:00:00 ERROR: Failed to get list of machines from etcd
2024-11-25 12:00:01 WARNING: Could not activate Linux watchdog device: No such file or directory
2024-11-25 12:00:02 ERROR: watching on etcd failed
2024-11-25 12:00:03 INFO: trying to connect to etcd server http://10.0.1.11:2379
→ Solution: Check etcd cluster health
pg_rewind execution
2024-11-25 13:00:00 INFO: running pg_rewind from node2
2024-11-25 13:00:05 INFO: pg_rewind: servers diverged at WAL position 0/5000000 on timeline 2
2024-11-25 13:00:10 INFO: pg_rewind: rewinding from last common checkpoint at 0/4000000
2024-11-25 13:00:45 INFO: pg_rewind: Done!
2024-11-25 13:00:46 INFO: starting PostgreSQL
Configuration reload
2024-11-25 14:00:00 INFO: Reloading PostgreSQL configuration.
2024-11-25 14:00:01 INFO: Reload configuration: max_connections changed from 100 to 200
2024-11-25 14:00:02 INFO: PostgreSQL configuration reload succeeded
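When reconstructing an incident timeline, filtering the journal for the keywords that appear in the patterns above is usually the fastest first step; a minimal sketch:
# Extract leadership, promotion, rewind and bootstrap events from the last 24 hours
sudo journalctl -u patroni --since "24 hours ago" \
  | grep -Ei "leader|promote|demote|rewind|bootstrap|failover"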
2.4. Enable DEBUG logging
# Temporarily enable debug logging.
# The log section is local configuration, not part of the DCS dynamic config,
# so edit patroni.yml on the node (patronictl edit-config will not apply it) and set:
log:
  level: DEBUG
# Then restart (or reload) Patroni on that node
# Or set via environment variable
sudo systemctl edit patroni
# Add:
[Service]
Environment="PATRONI_LOGLEVEL=DEBUG"
sudo systemctl daemon-reload
sudo systemctl restart patroni
# Warning: DEBUG is VERY verbose!
# Use only for troubleshooting, then disable
3. etcd Logging
3.1. View etcd logs
# Via journalctl
sudo journalctl -u etcd -f
# Last 100 entries
sudo journalctl -u etcd -n 100
# Errors only
sudo journalctl -u etcd -p err
# Specific time range
sudo journalctl -u etcd --since "10:00" --until "11:00"
3.2. etcd log levels
# Configure in etcd service file
# /etc/systemd/system/etcd.service
# --log-level accepts: debug, info, warn, error, panic, fatal
ExecStart=/usr/local/bin/etcd \
  --log-level info \
  --logger zap \
  --log-outputs stderr
3.3. Common etcd log patterns
Cluster healthy
2024-11-25 10:00:00 INFO: raft.node: elected leader at term 5
2024-11-25 10:00:01 INFO: rafthttp: established a TCP streaming connection with peer node2
2024-11-25 10:00:02 INFO: etcdserver: published member info to cluster
Leader election
2024-11-25 11:00:00 WARN: rafthttp: lost TCP streaming connection with peer node1
2024-11-25 11:00:01 INFO: raft.node: node3 elected leader at term 6
2024-11-25 11:00:02 INFO: rafthttp: peer node1 became active
Network partition
2024-11-25 12:00:00 WARN: rafthttp: health check for peer node2 could not connect: dial tcp: i/o timeout
2024-11-25 12:00:05 ERROR: rafthttp: failed to read from node2 on stream: EOF
2024-11-25 12:00:10 WARN: etcdserver: failed to reach quorum
Slow operations
2024-11-25 13:00:00 WARN: etcdserver: apply request took too long [1.234s] expected less than [100ms]
2024-11-25 13:00:01 WARN: etcdserver: read-only range request took too long to execute [523ms]
→ Solution: Check disk I/O, CPU load
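Slow apply warnings are usually disk-latency related; etcd exposes fsync and backend-commit latency on its Prometheus /metrics endpoint, which you can inspect directly (the endpoint address is an assumption based on the example cluster):
# WAL fsync and backend commit latency histograms from etcd's metrics endpoint
curl -s http://10.0.1.11:2379/metrics \
  | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration_seconds" \
  | head -20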
3.4. etcd debugging
# Check cluster health
etcdctl endpoint health --cluster
# Check cluster status
etcdctl endpoint status --cluster --write-out=table
# Check member list
etcdctl member list --write-out=table
# Check alarms
etcdctl alarm list
# Compact history (if too large)
etcdctl compact $(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
# Defrag
etcdctl defrag --cluster
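The etcdctl commands above assume the v3 API and reachable endpoints; exporting these once per shell avoids repeating flags on every call (member addresses are assumptions based on the example cluster):
# Use the v3 API and point etcdctl at all cluster members
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="http://10.0.1.11:2379,http://10.0.1.12:2379,http://10.0.1.13:2379"
etcdctl endpoint health --cluster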
4. Common Issues and Solutions
4.1. Issue: Patroni won't start
Symptoms: systemctl status patroni shows failed
Diagnosis:
# Check logs
sudo journalctl -u patroni -n 50
# Common causes:
# 1. Configuration error
# Error: "yaml.scanner.ScannerError: mapping values are not allowed here"
→ Fix YAML syntax in patroni.yml
# 2. PostgreSQL already running
# Error: "postmaster.pid already exists"
sudo systemctl stop postgresql
sudo rm /var/lib/postgresql/18/data/postmaster.pid   # only after confirming no postmaster process is running
# 3. DCS not reachable
# Error: "Failed to get list of machines from etcd"
→ Check etcd status: systemctl status etcd
# 4. Permission denied
# Error: "could not open file postgresql.conf: Permission denied"
sudo chown -R postgres:postgres /var/lib/postgresql/18/data
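Before restarting Patroni, it is worth validating the configuration file itself; a minimal sketch (the config path is an assumption, and --validate-config is available in recent Patroni releases):
# Syntax-check the YAML first
python3 -c "import yaml; yaml.safe_load(open('/etc/patroni/patroni.yml'))"
# Then let Patroni validate the full configuration
patroni --validate-config /etc/patroni/patroni.yml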
4.2. Issue: No leader in cluster
Symptoms: patronictl list shows no Leader
Diagnosis:
# Check all Patroni instances
for node in node1 node2 node3; do
echo "=== $node ==="
ssh $node "sudo systemctl status patroni"
done
# Check DCS
etcdctl endpoint health --cluster
# Check logs for election issues
sudo journalctl -u patroni | grep -i "leader\|election"
# Common causes:
# 1. DCS has no quorum
→ Ensure at least 2 of 3 etcd nodes running
# 2. All nodes tagged nofailover
→ Check tags in patroni.yml or DCS
# 3. synchronous_mode_strict with no sync replica
→ Disable strict mode or fix replication
4.3. Issue: High replication lag
Symptoms: Replica lag > 100MB
Diagnosis:
-- Check lag
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
replay_lag
FROM pg_stat_replication;
-- Check replica activity
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;
-- Common causes:
-- 1. Long-running queries on replica
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 min';
-- Kill if needed: SELECT pg_terminate_backend(pid);
-- 2. Network bandwidth
-- Check with iftop or netstat
-- 3. Disk I/O on replica
-- Check with iostat -x 1
-- 4. max_wal_senders limit
SHOW max_wal_senders; -- Increase if needed
-- 5. wal_keep_size too small
SHOW wal_keep_size; -- Increase to prevent WAL deletion
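Because Patroni uses replication slots by default, a lagging or orphaned slot can also pin WAL on the primary; checking retained WAL per slot complements the queries above (a minimal sketch, run on the primary):
-- WAL retained by each replication slot
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;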
4.4. Issue: Split-brain detected
Symptoms: Multiple nodes claim to be leader
Diagnosis:
# Check cluster status
patronictl list postgres
# Check each node's view
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:8008/patroni | jq '.role'
done
# CRITICAL: Stop all Patroni immediately if split-brain confirmed
for node in node1 node2 node3; do
ssh $node "sudo systemctl stop patroni"
done
# Recovery procedure: see Lesson 15
4.5. Issue: Failover not happening
Symptoms: Primary down but no promotion
Diagnosis:
# Check Patroni status
patronictl list postgres
# Check logs
sudo journalctl -u patroni | grep -i "failover\|promote"
# Common causes:
# 1. All replicas have the nofailover tag (tags are local config, not shown by show-config)
grep -A5 "^tags:" /etc/patroni/patroni.yml   # on each replica; adjust the path to your patroni.yml
# 2. Replication lag too high
patronictl show-config postgres | grep maximum_lag_on_failover
# 3. DCS connection lost on all replicas
etcdctl endpoint health --cluster
# 4. synchronous_mode_strict enabled with no sync replica
patronictl show-config postgres | grep synchronous_mode
4.6. Issue: Cannot connect to PostgreSQL
Symptoms: Connection refused or timeout
Diagnosis:
# 1. Check PostgreSQL is running (under Patroni, PostgreSQL is managed by the patroni service)
sudo systemctl status patroni
ps aux | grep postgres
# 2. Check listening port
sudo netstat -tuln | grep 5432
sudo ss -tuln | grep 5432
# 3. Test connection locally
psql -h localhost -U postgres -c "SELECT 1"
# 4. Check pg_hba.conf
sudo cat /var/lib/postgresql/18/data/pg_hba.conf | grep -v "^#"
# 5. Check postgresql.conf
sudo -u postgres psql -c "SHOW listen_addresses;"
sudo -u postgres psql -c "SHOW port;"
# 6. Check firewall
sudo iptables -L -n | grep 5432
sudo ufw status
# 7. Test from remote
telnet 10.0.1.11 5432
nc -zv 10.0.1.11 5432
4.7. Issue: pg_rewind failed
Symptoms: Node cannot rejoin after failover
Error: pg_rewind: error: could not find common ancestor
Diagnosis:
# Check logs
sudo journalctl -u patroni | grep pg_rewind
# Common causes:
# 1. wal_log_hints not enabled
sudo -u postgres psql -c "SHOW wal_log_hints;"
# If off, enable and restart
# 2. Timeline too diverged
# Solution: Full reinit
patronictl reinit postgres node1
# 3. Data directory corrupt
# Solution: Remove and rebuild
sudo systemctl stop patroni
sudo rm -rf /var/lib/postgresql/18/data
sudo systemctl start patroni # Patroni will rebuild
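Before deciding on a full rebuild, it can help to confirm how far the node's timeline and checkpoint have diverged using pg_controldata (the binary path assumes the Debian/Ubuntu PostgreSQL 18 layout used earlier in this series):
# Compare timeline and latest checkpoint of the failed node against the current primary
sudo -u postgres /usr/lib/postgresql/18/bin/pg_controldata /var/lib/postgresql/18/data \
  | grep -E "TimeLineID|Latest checkpoint location"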
5. Debugging Tools
5.1. patronictl commands
# List cluster
patronictl list postgres
# Show configuration
patronictl show-config postgres
# Edit configuration
patronictl edit-config postgres
# Restart node
patronictl restart postgres node1
# Reinitialize node
patronictl reinit postgres node1
# Reload configuration
patronictl reload postgres node1
# Pause/resume cluster
patronictl pause postgres
patronictl resume postgres
# History
patronictl history postgres
5.2. etcdctl commands
# Cluster health
etcdctl endpoint health --cluster
# Cluster status
etcdctl endpoint status --cluster --write-out=table
# Member list
etcdctl member list --write-out=table
# Get Patroni data
etcdctl get /service/postgres/ --prefix --keys-only
etcdctl get /service/postgres/leader
# Watch for changes
etcdctl watch /service/postgres/ --prefix
# Delete key (dangerous!)
etcdctl del /service/postgres/leader # Force re-election
5.3. PostgreSQL diagnostic queries
-- Current activity
SELECT * FROM pg_stat_activity WHERE state != 'idle';
-- Blocking queries
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- Replication status
SELECT * FROM pg_stat_replication;
-- WAL status
SELECT * FROM pg_stat_wal_receiver;
-- Database size
SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Table bloat
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
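Long-open transactions are a common root cause behind both table bloat and replication conflicts, so a transaction-age query is a useful companion to the ones above; a minimal sketch:
-- Oldest open transactions, including idle-in-transaction sessions
SELECT pid, usename, state,
       now() - xact_start AS xact_age,
       left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;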
5.4. System diagnostic commands
# CPU usage
top -bn1 | head -20
htop
# Memory
free -h
ps aux --sort=-%mem | head -20
# Disk I/O
iostat -x 1 10
iotop
# Disk space
df -h
du -sh /var/lib/postgresql/18/data/*
# Network connections
netstat -tuln | grep 5432
ss -tuln | grep 5432
# Network traffic
iftop -i eth0
nethogs
# Check for OOM kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "killed process"
6. Troubleshooting Runbook
6.1. Primary node down
# 1. Verify primary is down
patronictl list postgres
# 2. Check if automatic failover occurred
# Timeline should increment, new leader elected
# 3. If no failover, check why:
# - DCS healthy?
etcdctl endpoint health --cluster
# - Replicas eligible? (nofailover is a local tag in each node's patroni.yml)
grep -A5 "^tags:" /etc/patroni/patroni.yml   # on each replica; adjust the path as needed
# - Replication lag acceptable?
patronictl show-config postgres | grep maximum_lag_on_failover
# 4. Check old primary logs
sudo journalctl -u patroni -n 200 | grep -i "error\|fatal"
# 5. Attempt manual intervention if needed
patronictl failover postgres --candidate node2
6.2. Replica not replicating
# 1. Check replica status
patronictl list postgres
# 2. On replica, check replication status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"
# 3. Check if timeline matches
sudo -u postgres psql -c "SELECT timeline_id FROM pg_control_checkpoint();"
# 4. Check primary for replication slot
sudo -u postgres psql -h primary -c "SELECT * FROM pg_replication_slots;"
# 5. Check network connectivity
ping primary
telnet primary 5432
# 6. Restart replication
sudo systemctl restart patroni
# 7. If still failing, reinitialize
patronictl reinit postgres replica-node
6.3. etcd cluster unhealthy
# 1. Check cluster health
etcdctl endpoint health --cluster
# 2. Check cluster status
etcdctl endpoint status --cluster --write-out=table
# 3. Check logs
sudo journalctl -u etcd -n 100
# 4. If quorum lost (< 2 of 3 healthy):
# - Fix or restart unhealthy members
sudo systemctl restart etcd
# 5. If member corrupt:
# - Remove and re-add member
etcdctl member remove <member-id>
etcdctl member add node3 --peer-urls=http://10.0.1.13:2380
# Then restart etcd on node3
# 6. Last resort: Rebuild etcd cluster
# (Requires stopping all services)
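Before any destructive recovery step (removing members or rebuilding the cluster), take a snapshot from a healthy member so you have something to restore from; a minimal sketch (endpoint and backup path are assumptions):
# Snapshot the keyspace from a healthy member before destructive steps
etcdctl --endpoints=http://10.0.1.11:2379 snapshot save /var/backups/etcd-$(date +%F).db
etcdctl snapshot status /var/backups/etcd-$(date +%F).db --write-out=table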
7. Best Practices
✅ DO
- Enable appropriate logging - Balance detail vs volume
- Centralize logs - Use ELK/Grafana Loki
- Set up alerts - Proactive notification
- Regular log review - Weekly analysis
- Document issues - Build knowledge base
- Test scenarios - Practice troubleshooting
- Keep runbooks updated - Living documents
- Monitor disk space - Logs can fill disk
❌ DON'T
- Don't enable DEBUG in production - Too verbose
- Don't ignore warnings - They become errors
- Don't delete logs immediately - Keep for analysis
- Don't skip log rotation - Prevent disk full
- Don't troubleshoot blind - Check logs first
- Don't make changes without logs - Document actions
8. Lab Exercises
Lab 1: Log analysis
Tasks:
1. Configure PostgreSQL slow query logging
2. Generate a workload with slow queries
3. Analyze the logs with pgBadger
4. Identify the top issues
5. Document the findings
Lab 2: Simulate and debug failover
Tasks:
1. Stop the primary node
2. Monitor Patroni logs during failover
3. Trace the timeline of events
4. Calculate downtime from the logs
5. Create a timeline diagram
Lab 3: Debug replication lag
Tasks:
1. Simulate high write load on the primary
2. Observe lag increase on the replica
3. Use diagnostic queries to identify the cause
4. Fix the issue
5. Verify the lag is reduced
Lab 4: Troubleshoot connection issues
Tasks:
1. Misconfigure pg_hba.conf
2. Attempt connections (they will fail)
3. Use logs to identify the issue
4. Fix the configuration
5. Verify connections work
9. Summary
Key Logging Locations
PostgreSQL: /var/lib/postgresql/18/data/log/
Patroni: journalctl -u patroni
etcd: journalctl -u etcd
System: /var/log/syslog, dmesg
Essential Diagnostic Commands
# Cluster status
patronictl list postgres
# PostgreSQL connection
psql -h localhost -U postgres
# Replication status
SELECT * FROM pg_stat_replication;
# etcd health
etcdctl endpoint health --cluster
# Logs
sudo journalctl -u patroni -f
sudo journalctl -u etcd -f
Troubleshooting Workflow
- Identify - What's the symptom?
- Isolate - Which component is failing?
- Investigate - Check logs, metrics
- Diagnose - What's the root cause?
- Fix - Apply solution
- Verify - Confirm resolution
- Document - Record for future
Next Steps
Lesson 20 will cover Security Best Practices:
- SSL/TLS encryption
- Authentication methods
- Network security
- Encryption at rest
- Audit logging
- Security hardening