Lesson 17: Monitoring a Patroni Cluster
Set up a monitoring stack with Prometheus + Grafana, use postgres_exporter, and configure alerting rules for the HA cluster.
Objectives
After this lesson, you will be able to:
- Understand the key metrics of a PostgreSQL HA cluster
- Set up Prometheus + Grafana for monitoring
- Configure postgres_exporter and Patroni's built-in metrics endpoint
- Create dashboards and alerting rules
- Monitor etcd cluster health
- Apply observability best practices
1. Why Monitoring Matters
1.1. Monitoring goals
Visibility:
✅ Know cluster health in real-time
✅ Detect issues before users notice
✅ Track performance trends
✅ Capacity planning data
✅ Audit trail for incidents
Key questions to answer (a quick-check sketch follows this list):
- Is the cluster healthy?
- Is replication working?
- What's the lag?
- Is there any failover?
- Are connections saturated?
- What's the query performance?
- Is etcd healthy?
- Are backups running?
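Several of these questions can be answered in seconds from the command line, even before any dashboards exist. A minimal sketch, assuming the Patroni REST API on port 8008, etcd on 2379, a Patroni config at /etc/patroni/config.yml, and jq installed (adjust paths and ports to your setup):
# Who is leader, what state is each member in, and how far behind are replicas?
patronictl -c /etc/patroni/config.yml list
# The same membership data as JSON via the Patroni REST API
curl -s http://localhost:8008/cluster | jq .
# etcd health on the local member
curl -s http://localhost:2379/health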
1.2. The four golden signals
Latency: How long do requests take?
-- Query execution time
SELECT query, mean_exec_time, calls
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 10;
Traffic: How many requests?
-- Connection count
SELECT count(*) FROM pg_stat_activity;
-- Total transactions (cumulative counters; sample twice and divide by the interval to get TPS)
SELECT xact_commit + xact_rollback AS total_xacts
FROM pg_stat_database
WHERE datname = 'mydb';
Errors: What's failing?
-- pg_stat_statements does not track errors; rollbacks and deadlocks per database are a useful proxy
SELECT datname, xact_rollback, deadlocks
FROM pg_stat_database
WHERE datname NOT IN ('template0', 'template1');
Saturation: How full are resources?
-- Connection usage vs. max_connections
SELECT count(*) AS connections,
       (SELECT setting::int FROM pg_settings WHERE name = 'max_connections') AS max_connections
FROM pg_stat_activity;
2. Metrics to Monitor
2.1. Cluster-level metrics
Cluster health
✅ Number of nodes up/down
✅ Current leader
✅ Failover count
✅ Timeline number
✅ Cluster configuration version
Replication health (see the query sketch after this list)
✅ Replication lag (bytes and time)
✅ WAL sender/receiver status
✅ Sync vs async replica count
✅ Replication slot usage
✅ WAL segment generation rate
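A minimal sketch of how to check the replication items above directly on the primary, assuming local access as the postgres OS user:
# Per-replica replication state and lag, in bytes and seconds
sudo -u postgres psql -x -c "
SELECT application_name, state, sync_state,
       pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) AS lag_bytes,
       EXTRACT(EPOCH FROM replay_lag) AS lag_seconds
FROM pg_stat_replication;"
# Replication slot usage (inactive slots retain WAL)
sudo -u postgres psql -c "SELECT slot_name, slot_type, active FROM pg_replication_slots;"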
2.2. PostgreSQL metrics
Connection metrics
-- Active connections by state
SELECT state, count(*)
FROM pg_stat_activity
GROUP BY state;
-- state | count
-- -------------------+-------
-- active | 15
-- idle | 50
-- idle in transaction| 2
Database size and growth
-- Database sizes
SELECT datname,
pg_size_pretty(pg_database_size(datname)) AS size
FROM pg_database
ORDER BY pg_database_size(datname) DESC;
-- Growth rate (need historical data)
SELECT now(),
pg_database_size('mydb') AS size_bytes;
Transaction rate
-- Total transactions (cumulative; compute the delta over time for TPS)
SELECT datname,
xact_commit + xact_rollback AS total_xacts,
xact_commit,
xact_rollback
FROM pg_stat_database
WHERE datname = 'mydb';
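Because these counters are cumulative, a TPS figure needs two samples. A rough sketch, assuming local access as the postgres OS user and the example database name 'mydb':
# Sample total transactions twice, 10 seconds apart, and compute the rate
Q="SELECT xact_commit + xact_rollback FROM pg_stat_database WHERE datname = 'mydb'"
A=$(sudo -u postgres psql -Atc "$Q")
sleep 10
B=$(sudo -u postgres psql -Atc "$Q")
echo "TPS ~ $(( (B - A) / 10 ))"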
Cache hit ratio
-- Buffer cache hit ratio (should be > 95%)
SELECT
sum(heap_blks_read) AS heap_read,
sum(heap_blks_hit) AS heap_hit,
sum(heap_blks_hit) / nullif(sum(heap_blks_hit) + sum(heap_blks_read), 0) * 100 AS cache_hit_ratio
FROM pg_statio_user_tables;
Index usage
-- Tables with missing indexes (high seq scans)
SELECT schemaname, relname, seq_scan, seq_tup_read,
idx_scan, seq_tup_read / nullif(seq_scan, 0) AS avg_seq_tup
FROM pg_stat_user_tables
WHERE seq_scan > 0
ORDER BY seq_tup_read DESC
LIMIT 10;
Vacuum and autovacuum
-- Last vacuum/analyze
SELECT schemaname, relname,
last_vacuum,
last_autovacuum,
last_analyze,
last_autoanalyze,
n_dead_tup
FROM pg_stat_user_tables
ORDER BY n_dead_tup DESC;
Locks
-- Locks currently being waited on (not yet granted)
SELECT locktype, relation::regclass, mode, granted
FROM pg_locks
WHERE NOT granted;
Long-running queries
-- Queries running > 5 minutes
SELECT pid,
now() - query_start AS duration,
state,
query
FROM pg_stat_activity
WHERE state = 'active'
AND now() - query_start > interval '5 minutes'
ORDER BY duration DESC;
2.3. Patroni metrics
Via the Patroni REST API (http://node:8008/metrics). Exact metric names vary between Patroni versions (newer releases expose, for example, patroni_primary and patroni_postgres_timeline), so verify against your own /metrics output:
# Patroni metrics
patroni_patroni_info{scope="postgres",version="3.2.0"}
patroni_postgres_running{scope="postgres"} 1
patroni_postmaster_start_time{scope="postgres"} 1732531200
patroni_timeline{scope="postgres"} 3
patroni_cluster_unlocked{scope="postgres"} 0
# Replication metrics
patroni_replication_lag_bytes{application_name="node2"} 0
patroni_xlog_location{scope="postgres"} 100663296
patroni_xlog_replayed_location{scope="postgres"} 100663296
patroni_is_leader{scope="postgres"} 1
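A quick sketch for checking leadership across all nodes from the shell, using the example node IPs from this guide; the exact leader metric name depends on your Patroni version:
for host in 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo -n "$host: "
  curl -s "http://$host:8008/metrics" | grep -E '^patroni_(is_leader|primary|master) ' || echo "leader metric not found"
done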
2.4. etcd metrics
Via etcd metrics endpoint (http://node:2379/metrics):
# etcd health
etcd_server_has_leader 1
etcd_server_is_leader 0
etcd_server_leader_changes_seen_total 2
# Performance
etcd_disk_backend_commit_duration_seconds_bucket
etcd_network_peer_round_trip_time_seconds_bucket
# Cluster size
etcd_cluster_version{cluster_version="3.5"}
etcd_server_id{server_id="node1"}
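A small sketch to spot-check these etcd health metrics on every member, assuming plain HTTP on the client port as elsewhere in this guide:
for host in 10.0.1.11 10.0.1.12 10.0.1.13; do
  echo "== $host =="
  curl -s "http://$host:2379/metrics" | grep -E '^etcd_server_(has_leader|is_leader|leader_changes_seen_total) '
done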
2.5. System metrics
# CPU usage
top
htop
# Memory
free -h
# Disk I/O
iostat -x 1
# Disk space
df -h
# Network
netstat -s
ss -s
3. Prometheus Setup
3.1. Install Prometheus
# Download
cd /tmp
wget https://github.com/prometheus/prometheus/releases/download/v2.48.0/prometheus-2.48.0.linux-amd64.tar.gz
tar -xzf prometheus-2.48.0.linux-amd64.tar.gz
sudo mv prometheus-2.48.0.linux-amd64 /opt/prometheus
# Create user
sudo useradd --no-create-home --shell /bin/false prometheus
# Create directories
sudo mkdir -p /etc/prometheus /var/lib/prometheus
sudo chown prometheus:prometheus /var/lib/prometheus
3.2. Configure Prometheus
# /etc/prometheus/prometheus.yml
global:
scrape_interval: 15s
evaluation_interval: 15s
external_labels:
cluster: 'postgres-ha'
environment: 'production'
# Alertmanager configuration
alerting:
alertmanagers:
- static_configs:
- targets:
- localhost:9093
# Load rules
rule_files:
- "alerts/*.yml"
# Scrape configurations
scrape_configs:
# Prometheus itself
- job_name: 'prometheus'
static_configs:
- targets: ['localhost:9090']
# PostgreSQL (via postgres_exporter)
- job_name: 'postgresql'
static_configs:
- targets:
- 10.0.1.11:9187 # node1
- 10.0.1.12:9187 # node2
- 10.0.1.13:9187 # node3
relabel_configs:
- source_labels: [__address__]
regex: '([^:]+):.*'
target_label: instance
# Patroni (via REST API)
- job_name: 'patroni'
static_configs:
- targets:
- 10.0.1.11:8008
- 10.0.1.12:8008
- 10.0.1.13:8008
metrics_path: /metrics
# etcd
- job_name: 'etcd'
static_configs:
- targets:
- 10.0.1.11:2379
- 10.0.1.12:2379
- 10.0.1.13:2379
scheme: http
# Node exporter (system metrics)
- job_name: 'node'
static_configs:
- targets:
- 10.0.1.11:9100
- 10.0.1.12:9100
- 10.0.1.13:9100
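Before starting the service, the configuration can be validated with promtool, which ships in the Prometheus tarball:
/opt/prometheus/promtool check config /etc/prometheus/prometheus.yml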
3.3. Create systemd service
# /etc/systemd/system/prometheus.service
sudo tee /etc/systemd/system/prometheus.service <<EOF
[Unit]
Description=Prometheus
Wants=network-online.target
After=network-online.target
[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/opt/prometheus/prometheus \\
--config.file=/etc/prometheus/prometheus.yml \\
--storage.tsdb.path=/var/lib/prometheus/ \\
--storage.tsdb.retention.time=30d \\
--web.console.templates=/opt/prometheus/consoles \\
--web.console.libraries=/opt/prometheus/console_libraries
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
# Set permissions
sudo chown prometheus:prometheus /etc/prometheus/prometheus.yml
# Start
sudo systemctl daemon-reload
sudo systemctl start prometheus
sudo systemctl enable prometheus
# Verify
sudo systemctl status prometheus
curl http://localhost:9090/metrics
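Once Prometheus is up, the targets API is a quick way to confirm that every exporter is being scraped successfully (jq assumed installed):
curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | "\(.labels.job)  \(.labels.instance)  \(.health)"'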
4. Exporters Setup
4.1. postgres_exporter
# Install on each PostgreSQL node
cd /tmp
wget https://github.com/prometheus-community/postgres_exporter/releases/download/v0.15.0/postgres_exporter-0.15.0.linux-amd64.tar.gz
tar -xzf postgres_exporter-0.15.0.linux-amd64.tar.gz
sudo mv postgres_exporter-0.15.0.linux-amd64/postgres_exporter /usr/local/bin/
# Create monitoring user in PostgreSQL
sudo -u postgres psql -c "
CREATE USER postgres_exporter WITH PASSWORD 'exporter_password';
GRANT pg_monitor TO postgres_exporter;
"
# Create connection file
sudo tee /etc/postgres_exporter.env <<EOF
DATA_SOURCE_NAME=postgresql://postgres_exporter:exporter_password@localhost:5432/postgres?sslmode=disable
EOF
sudo chmod 600 /etc/postgres_exporter.env
Custom queries (optional):
# /etc/postgres_exporter/queries.yml
pg_replication:
query: |
SELECT
application_name,
client_addr,
state,
COALESCE(pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn), 0) AS lag_bytes,
EXTRACT(EPOCH FROM replay_lag) AS replay_lag_seconds
FROM pg_stat_replication
metrics:
- application_name:
usage: "LABEL"
description: "Application name"
- client_addr:
usage: "LABEL"
description: "Client address"
- state:
usage: "LABEL"
description: "Replication state"
- lag_bytes:
usage: "GAUGE"
description: "Replication lag in bytes"
- replay_lag_seconds:
usage: "GAUGE"
description: "Replay lag in seconds"
Systemd service:
# /etc/systemd/system/postgres_exporter.service
sudo tee /etc/systemd/system/postgres_exporter.service <<EOF
[Unit]
Description=Prometheus PostgreSQL Exporter
After=network.target
[Service]
Type=simple
User=postgres
EnvironmentFile=/etc/postgres_exporter.env
ExecStart=/usr/local/bin/postgres_exporter \\
--web.listen-address=:9187 \\
--extend.query-path=/etc/postgres_exporter/queries.yml
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
# Start
sudo systemctl daemon-reload
sudo systemctl start postgres_exporter
sudo systemctl enable postgres_exporter
# Verify
curl http://localhost:9187/metrics | grep pg_
4.2. node_exporter
# Install on each node
cd /tmp
wget https://github.com/prometheus/node_exporter/releases/download/v1.7.0/node_exporter-1.7.0.linux-amd64.tar.gz
tar -xzf node_exporter-1.7.0.linux-amd64.tar.gz
sudo mv node_exporter-1.7.0.linux-amd64/node_exporter /usr/local/bin/
# Systemd service
sudo tee /etc/systemd/system/node_exporter.service <<EOF
[Unit]
Description=Prometheus Node Exporter
After=network.target
[Service]
Type=simple
ExecStart=/usr/local/bin/node_exporter \\
--web.listen-address=:9100
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
# Start
sudo systemctl daemon-reload
sudo systemctl start node_exporter
sudo systemctl enable node_exporter
# Verify
curl http://localhost:9100/metrics | head -20
4.3. Patroni metrics endpoint
Already built-in! Patroni exposes metrics at:
# Check Patroni metrics
curl http://localhost:8008/metrics
# Sample output:
# patroni_postgres_running 1
# patroni_postmaster_start_time 1732531200
# patroni_timeline 3
# patroni_cluster_unlocked 0
# patroni_is_leader 1
5. Grafana Setup
5.1. Install Grafana
# Add repository
sudo apt-get install -y software-properties-common
sudo add-apt-repository "deb https://packages.grafana.com/oss/deb stable main"
wget -q -O - https://packages.grafana.com/gpg.key | sudo apt-key add -
# Install
sudo apt-get update
sudo apt-get install -y grafana
# Start
sudo systemctl start grafana-server
sudo systemctl enable grafana-server
# Access: http://your-server:3000
# Default credentials: admin / admin
5.2. Add Prometheus data source
1. Login to Grafana (http://localhost:3000)
2. Go to Configuration → Data Sources
3. Click "Add data source"
4. Select "Prometheus"
5. URL: http://localhost:9090
6. Click "Save & Test" (or provision the data source as code; see the sketch below)
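Alternatively, the data source can be provisioned as code, which is easier to reproduce across environments. A sketch, assuming the default provisioning directory of the Debian/Ubuntu Grafana package:
sudo tee /etc/grafana/provisioning/datasources/prometheus.yml <<EOF
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://localhost:9090
    isDefault: true
EOF
sudo systemctl restart grafana-server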
5.3. Import dashboards
PostgreSQL dashboard:
1. Go to Dashboards → Import
2. Enter dashboard ID: 9628 (PostgreSQL Database)
3. Select Prometheus data source
4. Click Import
Patroni dashboard (custom):
{
"dashboard": {
"title": "Patroni Cluster Overview",
"panels": [
{
"title": "Cluster Status",
"targets": [
{
"expr": "patroni_postgres_running"
}
]
},
{
"title": "Leader",
"targets": [
{
"expr": "patroni_is_leader"
}
]
},
{
"title": "Timeline",
"targets": [
{
"expr": "patroni_timeline"
}
]
},
{
"title": "Replication Lag",
"targets": [
{
"expr": "patroni_replication_lag_bytes"
}
]
}
]
}
}
etcd dashboard:
Dashboard ID: 3070 (etcd by Prometheus)
Node exporter dashboard:
Dashboard ID: 1860 (Node Exporter Full)
6. Alerting Rules
6.1. PostgreSQL alerts
# /etc/prometheus/alerts/postgresql.yml
groups:
- name: postgresql
interval: 30s
rules:
# PostgreSQL down
- alert: PostgreSQLDown
expr: pg_up == 0
for: 1m
labels:
severity: critical
annotations:
summary: "PostgreSQL instance {{ $labels.instance }} is down"
description: "PostgreSQL on {{ $labels.instance }} has been down for more than 1 minute"
# High replication lag
- alert: PostgreSQLReplicationLag
expr: pg_replication_lag_bytes > 104857600 # 100MB
for: 5m
labels:
severity: warning
annotations:
summary: "High replication lag on {{ $labels.instance }}"
description: "Replication lag is {{ $value }} bytes (>100MB)"
# Too many connections
- alert: PostgreSQLTooManyConnections
expr: |
sum by (instance) (pg_stat_activity_count) /
on (instance) pg_settings_max_connections * 100 > 80
for: 5m
labels:
severity: warning
annotations:
summary: "Too many connections on {{ $labels.instance }}"
description: "Connection usage is {{ $value }}% (>80%)"
# Replication slot lag (requires a custom query exposing slot lag, similar to the pg_replication example above)
- alert: PostgreSQLReplicationSlotLag
expr: pg_replication_slots_lag_bytes > 1073741824 # 1GB
for: 10m
labels:
severity: warning
annotations:
summary: "Replication slot {{ $labels.slot_name }} lag high"
description: "Slot lag is {{ $value }} bytes (>1GB)"
# Long-running queries
- alert: PostgreSQLLongRunningQueries
expr: pg_stat_activity_max_tx_duration > 3600 # 1 hour
for: 5m
labels:
severity: warning
annotations:
summary: "Long-running query on {{ $labels.instance }}"
description: "Query running for {{ $value }} seconds (>1h)"
# Dead tuples
- alert: PostgreSQLDeadTuples
expr: pg_stat_user_tables_n_dead_tup > 10000
for: 30m
labels:
severity: warning
annotations:
summary: "High dead tuples on table {{ $labels.table }}"
description: "Table has {{ $value }} dead tuples"
# Cache hit ratio low
- alert: PostgreSQLCacheHitRatio
expr: |
sum by (instance) (rate(pg_stat_database_blks_hit[5m])) /
(sum by (instance) (rate(pg_stat_database_blks_hit[5m])) + sum by (instance) (rate(pg_stat_database_blks_read[5m]))) * 100 < 95
for: 15m
labels:
severity: warning
annotations:
summary: "Low cache hit ratio on {{ $labels.instance }}"
description: "Cache hit ratio is {{ $value }}% (<95%)"
6.2. Patroni alerts
# /etc/prometheus/alerts/patroni.yml
groups:
- name: patroni
interval: 30s
rules:
# Patroni down
- alert: PatroniDown
expr: up{job="patroni"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "Patroni on {{ $labels.instance }} is down"
# No leader
- alert: PatroniNoLeader
expr: sum(patroni_is_leader) == 0
for: 30s
labels:
severity: critical
annotations:
summary: "Patroni cluster has no leader"
description: "No node is acting as leader in the cluster"
# Multiple leaders (split-brain)
- alert: PatroniMultipleLeaders
expr: sum(patroni_is_leader) > 1
for: 10s
labels:
severity: critical
annotations:
summary: "Multiple Patroni leaders detected (split-brain)"
description: "{{ $value }} nodes claim to be leader"
# Timeline mismatch
- alert: PatroniTimelineMismatch
expr: count(count_values("timeline", patroni_timeline)) > 1
for: 1m
labels:
severity: critical
annotations:
summary: "Patroni timeline mismatch"
description: "Nodes are on different timelines"
# Failover detected
- alert: PatroniFailover
expr: changes(patroni_timeline[5m]) > 0
labels:
severity: warning
annotations:
summary: "Patroni failover detected"
description: "Timeline changed, indicating a failover occurred"
# Cluster unlocked
- alert: PatroniClusterUnlocked
expr: patroni_cluster_unlocked == 1
for: 30s
labels:
severity: warning
annotations:
summary: "Patroni cluster is unlocked"
6.3. etcd alerts
# /etc/prometheus/alerts/etcd.yml
groups:
- name: etcd
interval: 30s
rules:
# etcd down
- alert: EtcdDown
expr: up{job="etcd"} == 0
for: 1m
labels:
severity: critical
annotations:
summary: "etcd node {{ $labels.instance }} is down"
# No leader
- alert: EtcdNoLeader
expr: etcd_server_has_leader == 0
for: 1m
labels:
severity: critical
annotations:
summary: "etcd cluster has no leader"
# High leader changes
- alert: EtcdFrequentLeaderChanges
expr: increase(etcd_server_leader_changes_seen_total[1h]) > 3
labels:
severity: warning
annotations:
summary: "etcd frequent leader changes"
description: "{{ $value }} leader changes in last hour"
# High commit latency
- alert: EtcdHighCommitLatency
expr: |
histogram_quantile(0.99,
rate(etcd_disk_backend_commit_duration_seconds_bucket[5m])
) > 0.25
for: 5m
labels:
severity: warning
annotations:
summary: "etcd high commit latency"
description: "99th percentile commit latency is {{ $value }}s"
7. Alertmanager Setup
7.1. Install Alertmanager
# Download
cd /tmp
wget https://github.com/prometheus/alertmanager/releases/download/v0.26.0/alertmanager-0.26.0.linux-amd64.tar.gz
tar -xzf alertmanager-0.26.0.linux-amd64.tar.gz
sudo mv alertmanager-0.26.0.linux-amd64 /opt/alertmanager
# Create user
sudo useradd --no-create-home --shell /bin/false alertmanager
# Create directories
sudo mkdir -p /etc/alertmanager /var/lib/alertmanager
sudo chown alertmanager:alertmanager /var/lib/alertmanager
7.2. Configure Alertmanager
# /etc/alertmanager/alertmanager.yml
global:
resolve_timeout: 5m
slack_api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
route:
group_by: ['alertname', 'cluster', 'service']
group_wait: 10s
group_interval: 10s
repeat_interval: 12h
receiver: 'default'
routes:
- match:
severity: critical
receiver: 'critical'
- match:
severity: warning
receiver: 'warning'
receivers:
- name: 'default'
slack_configs:
- channel: '#alerts'
title: 'PostgreSQL HA Alert'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
- name: 'critical'
slack_configs:
- channel: '#alerts-critical'
title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
email_configs:
- to: 'oncall@example.com'
from: 'alerts@example.com'
smarthost: 'smtp.example.com:587'
auth_username: 'alerts@example.com'
auth_password: 'password'
- name: 'warning'
slack_configs:
- channel: '#alerts'
title: '⚠️ Warning: {{ .GroupLabels.alertname }}'
text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
inhibit_rules:
- source_match:
severity: 'critical'
target_match:
severity: 'warning'
equal: ['alertname', 'instance']
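The configuration can be validated with amtool, which is included in the Alertmanager tarball:
/opt/alertmanager/amtool check-config /etc/alertmanager/alertmanager.yml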
7.3. Start Alertmanager
# Systemd service
sudo tee /etc/systemd/system/alertmanager.service <<EOF
[Unit]
Description=Prometheus Alertmanager
After=network.target
[Service]
User=alertmanager
Group=alertmanager
Type=simple
ExecStart=/opt/alertmanager/alertmanager \\
--config.file=/etc/alertmanager/alertmanager.yml \\
--storage.path=/var/lib/alertmanager/ \\
--web.listen-address=:9093
Restart=on-failure
RestartSec=5
[Install]
WantedBy=multi-user.target
EOF
# Start
sudo systemctl daemon-reload
sudo systemctl start alertmanager
sudo systemctl enable alertmanager
# Access: http://localhost:9093
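To verify routing and notification delivery end to end, a synthetic alert can be posted directly to the Alertmanager API. A sketch:
curl -s -XPOST http://localhost:9093/api/v2/alerts \
  -H 'Content-Type: application/json' \
  -d '[{"labels": {"alertname": "TestAlert", "severity": "warning", "instance": "test"},
        "annotations": {"description": "Synthetic test alert sent via curl"}}]'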
8. Best Practices
✅ DO
- Monitor proactively - Don't wait for users to report issues
- Set meaningful thresholds - Based on your workload
- Test alerts - Ensure notifications work
- Document runbooks - Link alerts to resolution steps
- Keep metrics retention - 30 days minimum, 1 year recommended
- Use labels wisely - For filtering and grouping
- Monitor the monitors - Alert if Prometheus/Grafana is down (see the rule sketch after this list)
- Regular dashboard reviews - Update as needs change
- Track SLOs/SLIs - Define and measure service levels
- Correlate metrics - CPU + disk + query time together
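As an example of "monitor the monitors", a minimal self-monitoring rule: any scrape target that stops responding fires a warning. A sketch, written to a new rule file under the alerts directory configured earlier:
sudo tee /etc/prometheus/alerts/meta.yml <<'EOF'
groups:
  - name: meta
    rules:
      - alert: PrometheusTargetDown
        expr: up == 0
        for: 2m
        labels:
          severity: warning
        annotations:
          summary: "Scrape target {{ $labels.job }}/{{ $labels.instance }} is not responding"
EOF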
❌ DON'T
- Don't over-alert - Alert fatigue is real
- Don't ignore warnings - They become criticals
- Don't forget to update - Dashboards and alerts evolve
- Don't expose metrics publicly - Security risk
- Don't rely on single monitoring - Have backups
- Don't collect everything - Focus on what matters
- Don't ignore baselines - Know your normal
- Don't skip testing - Test failover detection
9. Lab Exercises
Lab 1: Setup monitoring stack
Tasks:
1. Install Prometheus on the monitoring server
2. Install postgres_exporter on all nodes
3. Install node_exporter on all nodes
4. Configure scrape targets
5. Verify metrics collection
6. Install Grafana
7. Add the Prometheus data source
8. Import the PostgreSQL dashboard
Lab 2: Create custom dashboard
Tasks:
1. Create a new dashboard in Grafana
2. Add a panel for replication lag
3. Add a panel for connection count
4. Add a panel for TPS
5. Add a panel for cache hit ratio
6. Create variables for node selection
7. Save and share the dashboard
Lab 3: Configure alerting
Tasks:
1. Install Alertmanager
2. Create alert rules for PostgreSQL
3. Create alert rules for Patroni
4. Configure Slack notifications
5. Test alerts by triggering their conditions
6. Verify notification delivery
Lab 4: Simulate and monitor failover
Tasks:
1. Open the Grafana dashboard
2. Stop the primary node
3. Watch the metrics during failover
4. Verify that alerts are triggered
5. Document the timeline
6. Calculate downtime from the metrics (see the sketch below)
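For step 6, downtime can be estimated from Prometheus itself: query the leader count over the test window and see how long it stayed at zero. A sketch, assuming the patroni_is_leader metric used throughout this lesson, a test finished within the last 30 minutes, and GNU date plus jq available:
curl -sG http://localhost:9090/api/v1/query_range \
  --data-urlencode 'query=sum(patroni_is_leader)' \
  --data-urlencode "start=$(date -d '-30 min' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=5' | jq '.data.result[0].values'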
10. Summary
Key Metrics Summary
| Category | Metric | Threshold |
|---|---|---|
| Replication | Lag bytes | < 10MB |
| Replication | Lag time | < 10s |
| Connections | Usage % | < 80% |
| Cache | Hit ratio | > 95% |
| Queries | Long-running | < 1 hour |
| Disk | Usage % | < 85% |
| CPU | Usage % | < 80% sustained |
Monitoring Stack
Prometheus: Metrics collection and storage
├─ postgres_exporter: PostgreSQL metrics
├─ node_exporter: System metrics
├─ Patroni: HA cluster metrics (built-in)
└─ etcd: DCS metrics (built-in)
Grafana: Visualization and dashboards
Alertmanager: Alert routing and notifications
Next Steps
Lesson 18 will cover Performance Tuning:
- PostgreSQL configuration optimization
- Connection pooling with PgBouncer
- Load balancing with HAProxy
- Query optimization techniques
- Read replica scaling strategies