Lesson 19: Logging and Troubleshooting
Analyzing PostgreSQL, Patroni, and etcd logs, resolving common issues, and applying effective debugging techniques.
Objectives
After this lesson, you will be able to:
- Understand and analyze PostgreSQL logs
- Debug Patroni issues through its logs
- Troubleshoot etcd cluster problems
- Identify and fix common HA issues
- Use debugging tools effectively
- Build troubleshooting runbooks
1. PostgreSQL Logging
1.1. Configure logging
-- Essential logging settings
ALTER SYSTEM SET logging_collector = on;
ALTER SYSTEM SET log_directory = 'log';
ALTER SYSTEM SET log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log';
ALTER SYSTEM SET log_rotation_age = '1d';
ALTER SYSTEM SET log_rotation_size = '100MB';
-- What to log
ALTER SYSTEM SET log_min_duration_statement = '1000'; -- Slow queries (>1s)
ALTER SYSTEM SET log_checkpoints = on;
ALTER SYSTEM SET log_connections = off; -- Too verbose in production
ALTER SYSTEM SET log_disconnections = off;
ALTER SYSTEM SET log_lock_waits = on;
ALTER SYSTEM SET log_temp_files = 0; -- Log all temp file usage
ALTER SYSTEM SET log_autovacuum_min_duration = '1s';
-- Log line format
ALTER SYSTEM SET log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h ';
ALTER SYSTEM SET log_statement = 'none'; -- 'none', 'ddl', 'mod', 'all'
ALTER SYSTEM SET log_duration = off;
-- Error detail
ALTER SYSTEM SET log_error_verbosity = 'default'; -- 'terse', 'default', 'verbose'
-- Reload
SELECT pg_reload_conf();
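To confirm the new values took effect after the reload (logging_collector in particular requires a full restart), a quick check against pg_settings helps; a minimal sketch:
-- Show current logging parameters and whether any still require a restart
SELECT name, setting, pending_restart
FROM pg_settings
WHERE name LIKE 'log%';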
1.2. Log file location
# Default location
/var/lib/postgresql/18/data/log/
# Or configured location
sudo -u postgres psql -c "SHOW log_directory;"
sudo -u postgres psql -c "SHOW data_directory;"
# View logs
sudo tail -f /var/lib/postgresql/18/data/log/postgresql-*.log
# Or via journalctl
sudo journalctl -u postgresql -f
1.3. Common log patterns
Connection issues
LOG: connection received: host=10.0.1.50 port=54321
FATAL: password authentication failed for user "app_user"
DETAIL: Connection matched pg_hba.conf line 95: "host all all 0.0.0.0/0 md5"
→ Solution: Check password, pg_hba.conf configuration
Slow queries
LOG: duration: 5432.123 ms statement: SELECT * FROM large_table WHERE ...
→ Solution: Analyze query, add indexes, optimize
Deadlocks
ERROR: deadlock detected
DETAIL: Process 12345 waits for ShareLock on transaction 67890; blocked by process 12346.
Process 12346 waits for ShareLock on transaction 67891; blocked by process 12345.
HINT: See server log for query details.
→ Solution: Review transaction order, add proper locking
Checkpoint warnings
LOG: checkpoints are occurring too frequently (24 seconds apart)
HINT: Consider increasing the configuration parameter "max_wal_size".
→ Solution: Increase max_wal_size
Replication lag
WARNING: could not send data to WAL receiver; retrying
DETAIL: Connection to replica timed out
→ Solution: Check network, replica health
Out of disk space
FATAL: could not write to file "base/16384/12345": No space left on device
→ Solution: Free up disk space, increase storage
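A quick grep pass over the newest log file is often enough to surface the patterns above before reaching for heavier tooling; a minimal sketch assuming the default log location from section 1.2:
# Pick the most recent log file
LOGFILE=$(ls -t /var/lib/postgresql/18/data/log/postgresql-*.log | head -1)
# Count errors and fatals
grep -cE "ERROR|FATAL" "$LOGFILE"
# Surface the specific patterns discussed above, with line numbers
grep -nE "deadlock detected|checkpoints are occurring too frequently|password authentication failed|No space left on device" "$LOGFILE"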
1.4. Analyze logs with pgBadger
# Install pgBadger
sudo apt-get install -y pgbadger
# Or from source
wget https://github.com/darold/pgbadger/archive/v12.2.tar.gz
tar -xzf v12.2.tar.gz
cd pgbadger-12.2
perl Makefile.PL
make && sudo make install
# Generate report
pgbadger /var/lib/postgresql/18/data/log/postgresql-*.log \
-o /tmp/pgbadger-report.html
# Open in browser
# file:///tmp/pgbadger-report.html
# Report includes:
# - Top slow queries
# - Connection statistics
# - Lock analysis
# - Checkpoint activity
# - Temporary file usage
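If you want reports generated regularly, the same command can be scheduled from cron; a minimal sketch (the schedule and output path are assumptions, adjust to your environment):
# /etc/cron.d/pgbadger -- nightly report at 01:00 (hypothetical schedule and output path)
0 1 * * * postgres pgbadger /var/lib/postgresql/18/data/log/postgresql-*.log -o /var/reports/pgbadger-$(date +\%F).html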
2. Patroni Logging
2.1. Patroni log levels
# In patroni.yml
log:
  level: INFO                                  # DEBUG, INFO, WARNING, ERROR, CRITICAL
  format: '%(asctime)s %(levelname)s: %(message)s'
  dateformat: '%Y-%m-%d %H:%M:%S'
  max_queue_size: 1000
  dir: /var/log/patroni
  file_num: 10
  file_size: 25000000                          # 25MB
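With the file logger configured as above, Patroni also writes rotating log files into the configured dir, which can be tailed directly when journald is not available; a sketch (the patroni.log filename is an assumption and may differ between versions):
# Tail the Patroni file log in the directory configured above
sudo tail -f /var/log/patroni/patroni.log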
2.2. View Patroni logs
# Via journalctl (if systemd service)
sudo journalctl -u patroni -f
# Show last 100 lines
sudo journalctl -u patroni -n 100
# Show logs since specific time
sudo journalctl -u patroni --since "1 hour ago"
sudo journalctl -u patroni --since "2024-11-25 10:00:00"
# Search for specific keywords
sudo journalctl -u patroni | grep -i "error\|warning\|fail"
# Export to file
sudo journalctl -u patroni --since today > /tmp/patroni-today.log
2.3. Common Patroni log patterns
Successful bootstrap
2024-11-25 10:00:00 INFO: Selected new etcd server http://10.0.1.11:2379
2024-11-25 10:00:01 INFO: No PostgreSQL configuration items changed, nothing to reload.
2024-11-25 10:00:02 INFO: Lock owner: None; I am node1
2024-11-25 10:00:03 INFO: trying to bootstrap a new cluster
2024-11-25 10:00:05 INFO: Running custom bootstrap script: /usr/lib/postgresql/18/bin/initdb
2024-11-25 10:00:15 INFO: postmaster pid=12345
2024-11-25 10:00:16 INFO: initialized a new cluster
2024-11-25 10:00:17 INFO: acquired session lock as a leader ✅
Failed to acquire lock (replica)
2024-11-25 10:05:00 INFO: Lock owner: node1; I am node2
2024-11-25 10:05:01 INFO: does not have lock
2024-11-25 10:05:02 INFO: starting as a secondary
2024-11-25 10:05:05 INFO: replica has been created using basebackup
Failover detected
2024-11-25 11:30:00 WARNING: Request failed to node1:8008: connection refused
2024-11-25 11:30:10 INFO: promoted self to leader by acquiring session lock ✅
2024-11-25 11:30:11 INFO: cleared rewind state after becoming the leader
2024-11-25 11:30:12 INFO: running post_promote script
DCS connection issues
2024-11-25 12:00:00 ERROR: Failed to get list of machines from etcd
2024-11-25 12:00:01 WARNING: Could not activate Linux watchdog device: No such file or directory
2024-11-25 12:00:02 ERROR: watching on etcd failed
2024-11-25 12:00:03 INFO: trying to connect to etcd server http://10.0.1.11:2379
→ Solution: Check etcd cluster health
pg_rewind execution
2024-11-25 13:00:00 INFO: running pg_rewind from node2
2024-11-25 13:00:05 INFO: pg_rewind: servers diverged at WAL position 0/5000000 on timeline 2
2024-11-25 13:00:10 INFO: pg_rewind: rewinding from last common checkpoint at 0/4000000
2024-11-25 13:00:45 INFO: pg_rewind: Done!
2024-11-25 13:00:46 INFO: starting PostgreSQL
Configuration reload
2024-11-25 14:00:00 INFO: Reloading PostgreSQL configuration.
2024-11-25 14:00:01 INFO: Reload configuration: max_connections changed from 100 to 200
2024-11-25 14:00:02 INFO: PostgreSQL configuration reload succeeded
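When reconstructing an incident timeline, filtering the journal for the keywords that appear in the patterns above is usually the fastest first step; a minimal sketch:
# Extract leadership, promotion, rewind and bootstrap events from the last 24 hours
sudo journalctl -u patroni --since "24 hours ago" \
  | grep -Ei "leader|promote|demote|rewind|bootstrap|failover"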
2.4. Enable DEBUG logging
# Temporarily enable debug logging.
# The log section is local configuration, not part of the DCS dynamic config,
# so edit patroni.yml on the node (patronictl edit-config will not apply it) and set:
log:
  level: DEBUG
# Then restart (or reload) Patroni on that node
# Or set via environment variable
sudo systemctl edit patroni
# Add:
[Service]
Environment="PATRONI_LOGLEVEL=DEBUG"
sudo systemctl daemon-reload
sudo systemctl restart patroni
# Warning: DEBUG is VERY verbose!
# Use only for troubleshooting, then disable
3. etcd Logging
3.1. View etcd logs
# Via journalctl
sudo journalctl -u etcd -f
# Last 100 entries
sudo journalctl -u etcd -n 100
# Errors only
sudo journalctl -u etcd -p err
# Specific time range
sudo journalctl -u etcd --since "10:00" --until "11:00"
3.2. etcd log levels
# Configure in etcd service file
# /etc/systemd/system/etcd.service
# --log-level accepts: debug, info, warn, error, panic, fatal
ExecStart=/usr/local/bin/etcd \
  --log-level info \
  --logger zap \
  --log-outputs stderr
3.3. Common etcd log patterns
Cluster healthy
2024-11-25 10:00:00 INFO: raft.node: elected leader at term 5
2024-11-25 10:00:01 INFO: rafthttp: established a TCP streaming connection with peer node2
2024-11-25 10:00:02 INFO: etcdserver: published member info to cluster
Leader election
2024-11-25 11:00:00 WARN: rafthttp: lost TCP streaming connection with peer node1
2024-11-25 11:00:01 INFO: raft.node: node3 elected leader at term 6
2024-11-25 11:00:02 INFO: rafthttp: peer node1 became active
Network partition
2024-11-25 12:00:00 WARN: rafthttp: health check for peer node2 could not connect: dial tcp: i/o timeout
2024-11-25 12:00:05 ERROR: rafthttp: failed to read from node2 on stream: EOF
2024-11-25 12:00:10 WARN: etcdserver: failed to reach quorum
Slow operations
2024-11-25 13:00:00 WARN: etcdserver: apply request took too long [1.234s] expected less than [100ms]
2024-11-25 13:00:01 WARN: etcdserver: read-only range request took too long to execute [523ms]
→ Solution: Check disk I/O, CPU load
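Slow apply warnings are usually disk-latency related; etcd exposes fsync and backend-commit latency on its Prometheus /metrics endpoint, which you can inspect directly (the endpoint address is an assumption based on the example cluster):
# WAL fsync and backend commit latency histograms from etcd's metrics endpoint
curl -s http://10.0.1.11:2379/metrics \
  | grep -E "etcd_disk_wal_fsync_duration_seconds|etcd_disk_backend_commit_duration_seconds" \
  | head -20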
3.4. etcd debugging
# Check cluster health
etcdctl endpoint health --cluster
# Check cluster status
etcdctl endpoint status --cluster --write-out=table
# Check member list
etcdctl member list --write-out=table
# Check alarms
etcdctl alarm list
# Compact history (if too large)
etcdctl compact $(etcdctl endpoint status --write-out="json" | jq -r '.[0].Status.header.revision')
# Defrag
etcdctl defrag --cluster
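The etcdctl commands above assume the v3 API and reachable endpoints; exporting these once per shell avoids repeating flags on every call (member addresses are assumptions based on the example cluster):
# Use the v3 API and point etcdctl at all cluster members
export ETCDCTL_API=3
export ETCDCTL_ENDPOINTS="http://10.0.1.11:2379,http://10.0.1.12:2379,http://10.0.1.13:2379"
etcdctl endpoint health --cluster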
4. Common Issues and Solutions
4.1. Issue: Patroni won't start
Symptoms: systemctl status patroni shows failed
Diagnosis:
# Check logs
sudo journalctl -u patroni -n 50
# Common causes:
# 1. Configuration error
# Error: "yaml.scanner.ScannerError: mapping values are not allowed here"
→ Fix YAML syntax in patroni.yml
# 2. PostgreSQL already running
# Error: "postmaster.pid already exists"
sudo systemctl stop postgresql
sudo rm /var/lib/postgresql/18/data/postmaster.pid   # only after confirming no postmaster process is running
# 3. DCS not reachable
# Error: "Failed to get list of machines from etcd"
→ Check etcd status: systemctl status etcd
# 4. Permission denied
# Error: "could not open file postgresql.conf: Permission denied"
sudo chown -R postgres:postgres /var/lib/postgresql/18/data
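Before restarting Patroni, it is worth validating the configuration file itself; a minimal sketch (the config path is an assumption, and --validate-config is available in recent Patroni releases):
# Syntax-check the YAML first
python3 -c "import yaml; yaml.safe_load(open('/etc/patroni/patroni.yml'))"
# Then let Patroni validate the full configuration
patroni --validate-config /etc/patroni/patroni.yml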
4.2. Issue: No leader in cluster
Symptoms: patronictl list shows no Leader
Diagnosis:
# Check all Patroni instances
for node in node1 node2 node3; do
echo "=== $node ==="
ssh $node "sudo systemctl status patroni"
done
# Check DCS
etcdctl endpoint health --cluster
# Check logs for election issues
sudo journalctl -u patroni | grep -i "leader\|election"
# Common causes:
# 1. DCS has no quorum
→ Ensure at least 2 of 3 etcd nodes running
# 2. All nodes tagged nofailover
→ Check tags in patroni.yml or DCS
# 3. synchronous_mode_strict with no sync replica
→ Disable strict mode or fix replication
4.3. Issue: High replication lag
Symptoms: Replica lag > 100MB
Diagnosis:
-- Check lag
SELECT application_name,
pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
replay_lag
FROM pg_stat_replication;
-- Check replica activity
SELECT * FROM pg_stat_activity WHERE wait_event IS NOT NULL;
-- Common causes:
-- 1. Long-running queries on replica
SELECT pid, query_start, state, query
FROM pg_stat_activity
WHERE state = 'active' AND now() - query_start > interval '5 min';
-- Kill if needed: SELECT pg_terminate_backend(pid);
-- 2. Network bandwidth
-- Check with iftop or netstat
-- 3. Disk I/O on replica
-- Check with iostat -x 1
-- 4. max_wal_senders limit
SHOW max_wal_senders; -- Increase if needed
-- 5. wal_keep_size too small
SHOW wal_keep_size; -- Increase to prevent WAL deletion
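Because Patroni uses replication slots by default, a lagging or orphaned slot can also pin WAL on the primary; checking retained WAL per slot complements the queries above (a minimal sketch, run on the primary):
-- WAL retained by each replication slot
SELECT slot_name, active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;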
4.4. Issue: Split-brain detected
Symptoms: Multiple nodes claim to be leader
Diagnosis:
# Check cluster status
patronictl list postgres
# Check each node's view
for node in node1 node2 node3; do
echo "=== $node ==="
curl -s http://$node:8008/patroni | jq '.role'
done
# CRITICAL: Stop all Patroni immediately if split-brain confirmed
for node in node1 node2 node3; do
ssh $node "sudo systemctl stop patroni"
done
# Recovery procedure: see Lesson 15
4.5. Issue: Failover not happening
Symptoms: Primary down but no promotion
Diagnosis:
# Check Patroni status
patronictl list postgres
# Check logs
sudo journalctl -u patroni | grep -i "failover\|promote"
# Common causes:
# 1. All replicas have the nofailover tag (tags are local config, not shown by show-config)
grep -A5 "^tags:" /etc/patroni/patroni.yml   # on each replica; adjust the path to your patroni.yml
# 2. Replication lag too high
patronictl show-config postgres | grep maximum_lag_on_failover
# 3. DCS connection lost on all replicas
etcdctl endpoint health --cluster
# 4. synchronous_mode_strict enabled with no sync replica
patronictl show-config postgres | grep synchronous_mode
4.6. Issue: Cannot connect to PostgreSQL
Symptoms: Connection refused or timeout
Diagnosis:
# 1. Check PostgreSQL is running (under Patroni, PostgreSQL is managed by the patroni service)
sudo systemctl status patroni
ps aux | grep postgres
# 2. Check listening port
sudo netstat -tuln | grep 5432
sudo ss -tuln | grep 5432
# 3. Test connection locally
psql -h localhost -U postgres -c "SELECT 1"
# 4. Check pg_hba.conf
sudo cat /var/lib/postgresql/18/data/pg_hba.conf | grep -v "^#"
# 5. Check postgresql.conf
sudo -u postgres psql -c "SHOW listen_addresses;"
sudo -u postgres psql -c "SHOW port;"
# 6. Check firewall
sudo iptables -L -n | grep 5432
sudo ufw status
# 7. Test from remote
telnet 10.0.1.11 5432
nc -zv 10.0.1.11 5432
4.7. Issue: pg_rewind failed
Symptoms: Node cannot rejoin after failover
Error: pg_rewind: error: could not find common ancestor
Diagnosis:
# Check logs
sudo journalctl -u patroni | grep pg_rewind
# Common causes:
# 1. wal_log_hints not enabled
sudo -u postgres psql -c "SHOW wal_log_hints;"
# If off, enable and restart
# 2. Timeline too diverged
# Solution: Full reinit
patronictl reinit postgres node1
# 3. Data directory corrupt
# Solution: Remove and rebuild
sudo systemctl stop patroni
sudo rm -rf /var/lib/postgresql/18/data
sudo systemctl start patroni # Patroni will rebuild
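Before deciding on a full rebuild, it can help to confirm how far the node's timeline and checkpoint have diverged using pg_controldata (the binary path assumes the Debian/Ubuntu PostgreSQL 18 layout used earlier in this series):
# Compare timeline and latest checkpoint of the failed node against the current primary
sudo -u postgres /usr/lib/postgresql/18/bin/pg_controldata /var/lib/postgresql/18/data \
  | grep -E "TimeLineID|Latest checkpoint location"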
5. Debugging Tools
5.1. patronictl commands
# List cluster
patronictl list postgres
# Show configuration
patronictl show-config postgres
# Edit configuration
patronictl edit-config postgres
# Restart node
patronictl restart postgres node1
# Reinitialize node
patronictl reinit postgres node1
# Reload configuration
patronictl reload postgres node1
# Pause/resume cluster
patronictl pause postgres
patronictl resume postgres
# History
patronictl history postgres
5.2. etcdctl commands
# Cluster health
etcdctl endpoint health --cluster
# Cluster status
etcdctl endpoint status --cluster --write-out=table
# Member list
etcdctl member list --write-out=table
# Get Patroni data
etcdctl get /service/postgres/ --prefix --keys-only
etcdctl get /service/postgres/leader
# Watch for changes
etcdctl watch /service/postgres/ --prefix
# Delete key (dangerous!)
etcdctl del /service/postgres/leader # Force re-election
5.3. PostgreSQL diagnostic queries
-- Current activity
SELECT * FROM pg_stat_activity WHERE state != 'idle';
-- Blocking queries
SELECT blocked_locks.pid AS blocked_pid,
blocked_activity.usename AS blocked_user,
blocking_locks.pid AS blocking_pid,
blocking_activity.usename AS blocking_user,
blocked_activity.query AS blocked_statement,
blocking_activity.query AS current_statement_in_blocking_process
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks
ON blocking_locks.locktype = blocked_locks.locktype
AND blocking_locks.database IS NOT DISTINCT FROM blocked_locks.database
AND blocking_locks.relation IS NOT DISTINCT FROM blocked_locks.relation
AND blocking_locks.page IS NOT DISTINCT FROM blocked_locks.page
AND blocking_locks.tuple IS NOT DISTINCT FROM blocked_locks.tuple
AND blocking_locks.virtualxid IS NOT DISTINCT FROM blocked_locks.virtualxid
AND blocking_locks.transactionid IS NOT DISTINCT FROM blocked_locks.transactionid
AND blocking_locks.classid IS NOT DISTINCT FROM blocked_locks.classid
AND blocking_locks.objid IS NOT DISTINCT FROM blocked_locks.objid
AND blocking_locks.objsubid IS NOT DISTINCT FROM blocked_locks.objsubid
AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;
-- Replication status
SELECT * FROM pg_stat_replication;
-- WAL status
SELECT * FROM pg_stat_wal_receiver;
-- Database size
SELECT datname, pg_size_pretty(pg_database_size(datname))
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Table bloat
SELECT schemaname, tablename,
pg_size_pretty(pg_total_relation_size(schemaname||'.'||tablename)) AS total_size,
n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC
LIMIT 20;
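Long-open transactions are a common root cause behind both table bloat and replication conflicts, so a transaction-age query is a useful companion to the ones above; a minimal sketch:
-- Oldest open transactions, including idle-in-transaction sessions
SELECT pid, usename, state,
       now() - xact_start AS xact_age,
       left(query, 60) AS query_snippet
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
LIMIT 10;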
5.4. System diagnostic commands
# CPU usage
top -bn1 | head -20
htop
# Memory
free -h
ps aux --sort=-%mem | head -20
# Disk I/O
iostat -x 1 10
iotop
# Disk space
df -h
du -sh /var/lib/postgresql/18/data/*
# Network connections
netstat -tuln | grep 5432
ss -tuln | grep 5432
# Network traffic
iftop -i eth0
nethogs
# Check for OOM kills
dmesg | grep -i "out of memory"
journalctl -k | grep -i "killed process"
6. Troubleshooting Runbook
6.1. Primary node down
# 1. Verify primary is down
patronictl list postgres
# 2. Check if automatic failover occurred
# Timeline should increment, new leader elected
# 3. If no failover, check why:
# - DCS healthy?
etcdctl endpoint health --cluster
# - Replicas eligible? (nofailover is a local tag in each node's patroni.yml)
grep -A5 "^tags:" /etc/patroni/patroni.yml   # on each replica; adjust the path as needed
# - Replication lag acceptable?
patronictl show-config postgres | grep maximum_lag_on_failover
# 4. Check old primary logs
sudo journalctl -u patroni -n 200 | grep -i "error\|fatal"
# 5. Attempt manual intervention if needed
patronictl failover postgres --candidate node2
6.2. Replica not replicating
# 1. Check replica status
patronictl list postgres
# 2. On replica, check replication status
sudo -u postgres psql -c "SELECT * FROM pg_stat_wal_receiver;"
# 3. Check if timeline matches
sudo -u postgres psql -c "SELECT timeline_id FROM pg_control_checkpoint();"
# 4. Check primary for replication slot
sudo -u postgres psql -h primary -c "SELECT * FROM pg_replication_slots;"
# 5. Check network connectivity
ping primary
telnet primary 5432
# 6. Restart replication
sudo systemctl restart patroni
# 7. If still failing, reinitialize
patronictl reinit postgres replica-node
6.3. etcd cluster unhealthy
# 1. Check cluster health
etcdctl endpoint health --cluster
# 2. Check cluster status
etcdctl endpoint status --cluster --write-out=table
# 3. Check logs
sudo journalctl -u etcd -n 100
# 4. If quorum lost (< 2 of 3 healthy):
# - Fix or restart unhealthy members
sudo systemctl restart etcd
# 5. If member corrupt:
# - Remove and re-add member
etcdctl member remove <member-id>
etcdctl member add node3 --peer-urls=http://10.0.1.13:2380
# Then restart etcd on node3
# 6. Last resort: Rebuild etcd cluster
# (Requires stopping all services)
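Before any destructive recovery step (removing members or rebuilding the cluster), take a snapshot from a healthy member so you have something to restore from; a minimal sketch (endpoint and backup path are assumptions):
# Snapshot the keyspace from a healthy member before destructive steps
etcdctl --endpoints=http://10.0.1.11:2379 snapshot save /var/backups/etcd-$(date +%F).db
etcdctl snapshot status /var/backups/etcd-$(date +%F).db --write-out=table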
7. Best Practices
✅ DO
- Enable appropriate logging - Balance detail vs volume
- Centralize logs - Use ELK/Grafana Loki
- Set up alerts - Proactive notification
- Regular log review - Weekly analysis
- Document issues - Build knowledge base
- Test scenarios - Practice troubleshooting
- Keep runbooks updated - Living documents
- Monitor disk space - Logs can fill disk
❌ DON'T
- Don't enable DEBUG in production - Too verbose
- Don't ignore warnings - They become errors
- Don't delete logs immediately - Keep for analysis
- Don't skip log rotation - Prevent disk full
- Don't troubleshoot blind - Check logs first
- Don't make changes without logs - Document actions
8. Lab Exercises
Lab 1: Log analysis
Tasks:
1. Configure PostgreSQL slow query logging
2. Generate a workload with slow queries
3. Analyze the logs with pgBadger
4. Identify the top issues
5. Document the findings
Lab 2: Simulate and debug failover
Tasks:
1. Stop the primary node
2. Monitor Patroni logs during failover
3. Trace the timeline of events
4. Calculate downtime from the logs
5. Create a timeline diagram
Lab 3: Debug replication lag
Tasks:
1. Simulate high write load on the primary
2. Observe lag increase on the replica
3. Use diagnostic queries to identify the cause
4. Fix the issue
5. Verify the lag is reduced
Lab 4: Troubleshoot connection issues
Tasks:
1. Misconfigure pg_hba.conf
2. Attempt connections (they will fail)
3. Use logs to identify the issue
4. Fix the configuration
5. Verify connections work
9. Summary
Key Logging Locations
PostgreSQL: /var/lib/postgresql/18/data/log/
Patroni: journalctl -u patroni
etcd: journalctl -u etcd
System: /var/log/syslog, dmesg
Essential Diagnostic Commands
# Cluster status
patronictl list postgres
# PostgreSQL connection
psql -h localhost -U postgres
# Replication status
SELECT * FROM pg_stat_replication;
# etcd health
etcdctl endpoint health --cluster
# Logs
sudo journalctl -u patroni -f
sudo journalctl -u etcd -f
Troubleshooting Workflow
- Identify - What's the symptom?
- Isolate - Which component is failing?
- Investigate - Check logs, metrics
- Diagnose - What's the root cause?
- Fix - Apply solution
- Verify - Confirm resolution
- Document - Record for future
Next Steps
Lesson 20 will cover Security Best Practices:
- SSL/TLS encryption
- Authentication methods
- Network security
- Encryption at rest
- Audit logging
- Security hardening