Lesson 11: Patroni Callbacks
Create callback scripts (on_start, on_stop, on_role_change), write custom scripts for notifications, and integrate with monitoring systems.
Objectives
After this lesson, you will be able to:
- Explain what Patroni callbacks are and when they are triggered
- Implement custom scripts for lifecycle events
- Configure callbacks for automation tasks
- Handle role changes (primary ↔ replica)
- Set up notifications and monitoring hooks
- Troubleshoot callback failures
1. Callbacks Overview
1.1. Callbacks là gì?
Callbacks = Custom scripts that Patroni executes at lifecycle events of the cluster.
Use cases:
- 🔔 Notifications: Alert team khi failover xảy ra
- 🔧 Automation: Update DNS, load balancer configs
- 📊 Monitoring: Push metrics to monitoring system
- 🚦 Traffic management: Redirect application traffic
- 🔐 Security: Rotate credentials, update firewall rules
- 📝 Logging: Custom audit logs
1.2. Available callbacks
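Patroni runs the following scripts around lifecycle events. The first five are configured under postgresql.callbacks; pre_promote is a separate postgresql.pre_promote setting:
| Callback | Trigger | Use Case |
|---|---|---|
| on_start | When PostgreSQL starts | Post-start checks, log the event |
| on_stop | When PostgreSQL stops | Cleanup, notify applications |
| on_restart | When PostgreSQL restarts (role unchanged) | Log restart event |
| on_reload | When a configuration reload is triggered | Verify config changes |
| on_role_change | When the node is promoted or demoted (primary ↔ replica) | Most important: update DNS, LB |
| pre_promote | After the leader lock is acquired, before the replica is promoted; a non-zero exit code aborts the promotion | Final checks before promotion |
Patroni's documented settings do not include a post_promote hook; post-promotion work such as updating monitoring or sending alerts belongs in on_role_change.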
1.3. Callback execution flow
Example: Failover scenario
Old Primary crashes
↓
Patroni detects failure (after TTL expires)
↓
Patroni selects best replica (node2)
↓
node2 acquires the leader lock; the pre_promote script (if configured) runs on node2
↓
PostgreSQL promoted to primary (pg_promote)
↓
on_role_change callback runs on node2 (role argument: master, or primary on newer Patroni releases)
↓
Other replicas detect the new leader and start following it (their role is unchanged, so on_role_change is not expected to fire on them)
↓
Failover complete
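1.4. Callback arguments and environment
Patroni invokes each callback with three positional arguments (action, role, cluster name) rather than with dedicated environment variables:
| Argument | Description | Example |
|---|---|---|
| $1 | Action that triggered the callback | on_start, on_role_change |
| $2 | Role of the node after the event | master (newer Patroni releases may report primary), replica |
| $3 | Cluster name (scope) | postgres |
The callback process also inherits the environment of the Patroni process. Variables such as PATRONI_SCOPE or PATRONI_NAME are therefore visible only if Patroni itself was started with them set, which is not the case when everything is configured in patroni.yml. The scripts in this lesson use PATRONI_ROLE, PATRONI_SCOPE, and PATRONI_NAME as ordinary shell variables, so they should be populated from the arguments above.
For on_role_change:
| Item | Value |
|---|---|
| $2 (new role) | New role: master/primary or replica |
| Previous role | Not passed by Patroni; a script that needs it must record the last role itself (for example in a state file) |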
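Patroni's documentation describes callbacks as being called with the action, the role, and the cluster name as positional arguments. A minimal sketch of reading them is shown here; the function name parse_callback_args and the CB_* variable names are illustrative assumptions, not part of Patroni:

```bash
#!/bin/bash
# Sketch: read the three positional arguments a callback receives.
# parse_callback_args and the CB_* names are hypothetical helpers.
parse_callback_args() {
    CB_ACTION="${1:-unknown}"    # e.g. on_start, on_role_change
    CB_ROLE="${2:-unknown}"      # e.g. master/primary, replica
    CB_CLUSTER="${3:-unknown}"   # cluster name (scope)
    # Treat the legacy name "master" as "primary" so one script handles both.
    if [ "$CB_ROLE" = "master" ]; then
        CB_ROLE="primary"
    fi
}

parse_callback_args "$@"
echo "action=$CB_ACTION role=$CB_ROLE cluster=$CB_CLUSTER"
```

Normalizing the role name in one place keeps the rest of a script independent of which Patroni release invokes it.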
2. Configure Callbacks in Patroni
2.1. Basic configuration
In patroni.yml:
scope: postgres
name: node1
postgresql:
  callbacks:
    on_start: /var/lib/postgresql/callbacks/on_start.sh
    on_stop: /var/lib/postgresql/callbacks/on_stop.sh
    on_restart: /var/lib/postgresql/callbacks/on_restart.sh
    on_reload: /var/lib/postgresql/callbacks/on_reload.sh
    on_role_change: /var/lib/postgresql/callbacks/on_role_change.sh
Key points:
- Paths must be absolute
- Scripts must be executable (chmod +x)
- Owned by postgres user
- Should complete quickly (<30 seconds)
- Non-zero exit code = callback failed (logged but doesn't block operation)
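The key points above can be checked before Patroni ever calls a script. The following is a small sketch, assuming a hypothetical helper named check_callback that verifies a path is absolute, exists, and is executable:

```bash
#!/bin/bash
# Sketch: validate one callback path against the key points above.
# check_callback is a hypothetical helper, not a Patroni command.
check_callback() {
    local path="$1"
    case "$path" in
        /*) ;;                                   # absolute path, fine
        *)  echo "not absolute: $path"; return 1 ;;
    esac
    [ -f "$path" ] || { echo "missing: $path"; return 1; }
    [ -x "$path" ] || { echo "not executable: $path"; return 1; }
    echo "ok: $path"
}
```

It could be run in a loop over the five script paths from patroni.yml, for example `check_callback /var/lib/postgresql/callbacks/on_start.sh`.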
2.2. Create callback directory
# On all 3 nodes
sudo mkdir -p /var/lib/postgresql/callbacks
sudo chown postgres:postgres /var/lib/postgresql/callbacks
sudo chmod 750 /var/lib/postgresql/callbacks
3. Implement Callback Scripts
3.1. on_start callback
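Use case: Validation and mount checks at startup. Patroni invokes on_start as PostgreSQL is started, so a failing check is logged but is not expected to keep PostgreSQL from starting.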
Script: /var/lib/postgresql/callbacks/on_start.sh
#!/bin/bash
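# on_start.sh - Runs when PostgreSQL starts
set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_ROLE="${2:-${PATRONI_ROLE:-unknown}}"
PATRONI_SCOPE="${3:-${PATRONI_SCOPE:-unknown}}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"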
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
# Logging function
log() {
echo "[$TIMESTAMP] [ON_START] $1" | tee -a "$LOG_FILE"
}
log "Starting PostgreSQL on $PATRONI_NAME"
log "Role: $PATRONI_ROLE"
log "Cluster: $PATRONI_SCOPE"
# Check disk space
DISK_USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
log "ERROR: Disk usage is ${DISK_USAGE}% - critically high!"
exit 1
fi
log "Disk usage: ${DISK_USAGE}%"
# Check if data directory is mounted
if ! mountpoint -q /var/lib/postgresql/18/data; then
log "WARNING: Data directory is not a mount point"
fi
# Check network connectivity to etcd
for ETCD_HOST in 10.0.1.11 10.0.1.12 10.0.1.13; do
if ! nc -zw3 "$ETCD_HOST" 2379 2>/dev/null; then
log "ERROR: Cannot reach etcd at $ETCD_HOST:2379"
exit 1
fi
done
log "etcd connectivity verified"
log "Pre-start checks passed"
exit 0
Create script:
# On all nodes
sudo tee /var/lib/postgresql/callbacks/on_start.sh > /dev/null << 'EOF'
#!/bin/bash
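set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_ROLE="${2:-${PATRONI_ROLE:-unknown}}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"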
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_START] $1" | tee -a "$LOG_FILE"; }
log "Starting PostgreSQL on $PATRONI_NAME (Role: $PATRONI_ROLE)"
# Disk space check
DISK_USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
log "ERROR: Disk usage ${DISK_USAGE}% too high"
exit 1
fi
log "Disk usage: ${DISK_USAGE}%"
log "Pre-start checks passed"
exit 0
EOF
sudo chmod +x /var/lib/postgresql/callbacks/on_start.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_start.sh
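The disk check in on_start.sh parses human-readable df -h output. A slightly more robust sketch, assuming a hypothetical helper disk_usage_pct, uses the POSIX output format (df -P), which keeps each filesystem on a single line:

```bash
#!/bin/bash
# Sketch: return the usage percentage (digits only) for the filesystem
# that holds the given path. disk_usage_pct is a hypothetical helper.
disk_usage_pct() {
    local path="${1:-/}"
    # -P forces one line per filesystem; column 5 is "Capacity" (e.g. 42%)
    df -P "$path" | awk 'NR==2 { sub(/%/, "", $5); print $5 }'
}

echo "Disk usage of /: $(disk_usage_pct /)%"
```

In the callback, the result can be compared against the 90% threshold exactly as before, for example `[ "$(disk_usage_pct /var/lib/postgresql)" -gt 90 ]`.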
3.2. on_stop callback
Use case: Graceful shutdown notifications.
Script: /var/lib/postgresql/callbacks/on_stop.sh
#!/bin/bash
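# on_stop.sh - Runs when PostgreSQL stops
set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_ROLE="${2:-${PATRONI_ROLE:-unknown}}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"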
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() {
echo "[$TIMESTAMP] [ON_STOP] $1" | tee -a "$LOG_FILE"
}
log "Stopping PostgreSQL on $PATRONI_NAME"
log "Role: $PATRONI_ROLE"
# Notify monitoring system
if command -v curl >/dev/null 2>&1; then
curl -s -X POST http://monitoring.example.com/api/events \
-H "Content-Type: application/json" \
-d "{
\"event\": \"postgresql_stop\",
\"node\": \"$PATRONI_NAME\",
\"role\": \"$PATRONI_ROLE\",
\"timestamp\": \"$TIMESTAMP\"
}" || log "WARNING: Failed to notify monitoring"
fi
log "PostgreSQL stop initiated"
exit 0
Create script:
sudo tee /var/lib/postgresql/callbacks/on_stop.sh > /dev/null << 'EOF'
#!/bin/bash
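set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_ROLE="${2:-${PATRONI_ROLE:-unknown}}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"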
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_STOP] $1" | tee -a "$LOG_FILE"; }
log "Stopping PostgreSQL on $PATRONI_NAME (Role: $PATRONI_ROLE)"
exit 0
EOF
sudo chmod +x /var/lib/postgresql/callbacks/on_stop.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_stop.sh
3.3. on_role_change callback (Most Important!)
Use case: Update DNS, load balancers, send notifications.
Script: /var/lib/postgresql/callbacks/on_role_change.sh
#!/bin/bash
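# on_role_change.sh - Runs when the node is promoted or demoted (primary ↔ replica)
set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_ROLE="${2:-${PATRONI_ROLE:-unknown}}"
PATRONI_SCOPE="${3:-${PATRONI_SCOPE:-unknown}}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"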
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() {
echo "[$TIMESTAMP] [ROLE_CHANGE] $1" | tee -a "$LOG_FILE"
}
log "=========================================="
log "Role change detected on $PATRONI_NAME"
log "Cluster: $PATRONI_SCOPE"
log "Old role: ${PATRONI_OLD_ROLE:-unknown}"
log "New role: $PATRONI_ROLE"
log "=========================================="
# Function: Update DNS
update_dns() {
local NEW_PRIMARY_IP="$1"
log "Updating DNS record for primary.postgres.local -> $NEW_PRIMARY_IP"
# Example using nsupdate (BIND DNS)
# nsupdate -k /etc/dns/Kpostgres.+157+12345.key << EOF
# server dns-server.local
# zone postgres.local
# update delete primary.postgres.local A
# update add primary.postgres.local 60 A $NEW_PRIMARY_IP
# send
# EOF
# Or using API (e.g., Route53, Cloudflare)
# aws route53 change-resource-record-sets --hosted-zone-id Z1234 ...
log "DNS update completed"
}
# Function: Update HAProxy
update_haproxy() {
local NEW_PRIMARY_IP="$1"
log "Notifying HAProxy about new primary: $NEW_PRIMARY_IP"
# Use HAProxy stats socket
# echo "set server postgres/primary addr $NEW_PRIMARY_IP" | \
# socat stdio /var/run/haproxy.sock
log "HAProxy updated"
}
# Function: Send Slack notification
send_notification() {
local MESSAGE="$1"
local WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
log "Sending notification: $MESSAGE"
curl -s -X POST "$WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d "{
\"text\": \"🔄 PostgreSQL Role Change\",
\"attachments\": [{
\"color\": \"warning\",
\"fields\": [
{\"title\": \"Node\", \"value\": \"$PATRONI_NAME\", \"short\": true},
{\"title\": \"Cluster\", \"value\": \"$PATRONI_SCOPE\", \"short\": true},
{\"title\": \"Old Role\", \"value\": \"${PATRONI_OLD_ROLE:-N/A}\", \"short\": true},
{\"title\": \"New Role\", \"value\": \"$PATRONI_ROLE\", \"short\": true},
{\"title\": \"Time\", \"value\": \"$TIMESTAMP\", \"short\": false}
]
}]
}" || log "WARNING: Notification failed"
}
# Main logic
case "$PATRONI_ROLE" in
master|primary) # newer Patroni releases may report "primary" instead of "master"
log "This node is now PRIMARY"
# Get this node's IP
NODE_IP=$(hostname -I | awk '{print $1}')
log "Node IP: $NODE_IP"
# Update DNS to point to new primary
update_dns "$NODE_IP"
# Update load balancer
update_haproxy "$NODE_IP"
# Send notification
send_notification "Node $PATRONI_NAME promoted to PRIMARY"
# Set marker file for applications
touch /tmp/postgres_is_primary
rm -f /tmp/postgres_is_replica
log "Primary promotion tasks completed"
;;
replica)
log "This node is now REPLICA"
# Remove primary marker
rm -f /tmp/postgres_is_primary
touch /tmp/postgres_is_replica
# Send notification if demoted from primary
if [ "${PATRONI_OLD_ROLE}" = "master" ]; then
send_notification "Node $PATRONI_NAME demoted to REPLICA"
fi
log "Replica role tasks completed"
;;
*)
log "Unknown role: $PATRONI_ROLE"
exit 1
;;
esac
log "Role change handling completed successfully"
exit 0
Create production-ready script:
sudo tee /var/lib/postgresql/callbacks/on_role_change.sh > /dev/null << 'EOF'
#!/bin/bash
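set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_ROLE="${2:-${PATRONI_ROLE:-unknown}}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"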
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ROLE_CHANGE] $1" | tee -a "$LOG_FILE"; }
log "=========================================="
log "Role change: $PATRONI_NAME"
log "Old role: ${PATRONI_OLD_ROLE:-unknown}"
log "New role: $PATRONI_ROLE"
log "=========================================="
case "$PATRONI_ROLE" in
master|primary) # newer Patroni releases may report "primary" instead of "master"
log "This node is now PRIMARY"
NODE_IP=$(hostname -I | awk '{print $1}')
log "Node IP: $NODE_IP"
# TODO: Update DNS, load balancer, etc.
# update_dns "$NODE_IP"
touch /tmp/postgres_is_primary
rm -f /tmp/postgres_is_replica
;;
replica)
log "This node is now REPLICA"
rm -f /tmp/postgres_is_primary
touch /tmp/postgres_is_replica
;;
*)
log "Unknown role: $PATRONI_ROLE"
exit 1
;;
esac
log "Role change completed"
exit 0
EOF
sudo chmod +x /var/lib/postgresql/callbacks/on_role_change.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_role_change.sh
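The marker-file handling in on_role_change.sh is a good candidate for a small idempotent function. This is a sketch only; set_role_marker and its optional directory argument are illustrative assumptions, and both role names (master and primary) are accepted:

```bash
#!/bin/bash
# Sketch: maintain exactly one role marker file; safe to call repeatedly.
# set_role_marker is a hypothetical helper.
set_role_marker() {
    local role="$1"
    local dir="${2:-/tmp}"
    case "$role" in
        master|primary)
            touch "$dir/postgres_is_primary"
            rm -f "$dir/postgres_is_replica"
            ;;
        replica)
            touch "$dir/postgres_is_replica"
            rm -f "$dir/postgres_is_primary"
            ;;
        *)
            echo "unknown role: $role" >&2
            return 1
            ;;
    esac
}
```

Because touch and rm -f succeed whether or not the file already exists, calling the function twice with the same role leaves the directory in the same state.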
3.4. on_restart callback
Use case: Log restarts, notify about planned maintenance.
sudo tee /var/lib/postgresql/callbacks/on_restart.sh > /dev/null << 'EOF'
#!/bin/bash
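set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_ROLE="${2:-${PATRONI_ROLE:-unknown}}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"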
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_RESTART] $1" | tee -a "$LOG_FILE"; }
log "Restarting PostgreSQL on $PATRONI_NAME (Role: $PATRONI_ROLE)"
exit 0
EOF
sudo chmod +x /var/lib/postgresql/callbacks/on_restart.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_restart.sh
3.5. on_reload callback
Use case: Verify configuration changes were applied.
sudo tee /var/lib/postgresql/callbacks/on_reload.sh > /dev/null << 'EOF'
#!/bin/bash
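set -e
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"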
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_RELOAD] $1" | tee -a "$LOG_FILE"; }
log "Configuration reloaded on $PATRONI_NAME"
# Verify critical settings
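# Patroni normally runs as the postgres OS user, so sudo is unnecessary (and could hang on a password prompt)
MAX_CONN=$(psql -At -c "SHOW max_connections;" 2>/dev/null || echo "unavailable")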
log "max_connections = $MAX_CONN"
exit 0
EOF
sudo chmod +x /var/lib/postgresql/callbacks/on_reload.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_reload.sh
3.6. Create log directory
# On all nodes
sudo mkdir -p /var/log/patroni
sudo chown postgres:postgres /var/log/patroni
sudo chmod 750 /var/log/patroni
4. Update Patroni Configuration
4.1. Add callbacks to patroni.yml
On all 3 nodes, edit /etc/patroni/patroni.yml:
scope: postgres
namespace: /service/
name: node1  # node2, node3 for other nodes
restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008  # Change per node
etcd3:
  hosts: 10.0.1.11:2379,10.0.1.12:2379,10.0.1.13:2379
bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: true
    synchronous_mode_strict: false
    postgresql:
      parameters:
        max_connections: 100
        shared_buffers: 256MB
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432  # Change per node
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin
  authentication:
    replication:
      username: replicator
      password: replicator_password
    superuser:
      username: postgres
      password: postgres_password
  parameters:
    unix_socket_directories: '/var/run/postgresql'
  # ✅ Add callbacks section
  callbacks:
    on_start: /var/lib/postgresql/callbacks/on_start.sh
    on_stop: /var/lib/postgresql/callbacks/on_stop.sh
    on_restart: /var/lib/postgresql/callbacks/on_restart.sh
    on_reload: /var/lib/postgresql/callbacks/on_reload.sh
    on_role_change: /var/lib/postgresql/callbacks/on_role_change.sh
tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
4.2. Reload Patroni configuration
# On all 3 nodes
sudo systemctl reload patroni
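# Verify callbacks are present in the local configuration
# (patronictl show-config prints only the dynamic configuration stored in the DCS,
# while callbacks are a local, per-node setting)
grep -A6 "callbacks:" /etc/patroni/patroni.yml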
5. Test Callbacks
5.1. Test on_restart
# Restart a node
patronictl restart postgres node2
# Check logs
sudo tail -f /var/log/patroni/callbacks.log
# Expected output:
# [2024-11-25 10:30:15] [ON_RESTART] Restarting PostgreSQL on node2
5.2. Test on_reload
# Reload configuration
patronictl reload postgres node2
# Check logs
sudo tail /var/log/patroni/callbacks.log
# Expected:
# [2024-11-25 10:32:45] [ON_RELOAD] Configuration reloaded on node2
5.3. Test on_role_change (Failover)
⚠️ IMPORTANT: Test in non-production!
# 1. Check current primary
patronictl list postgres
# node1 is Leader
# 2. Stop primary
sudo systemctl stop patroni # On node1
# 3. Watch logs on node2 (will become new primary)
sudo tail -f /var/log/patroni/callbacks.log
# Expected output:
# [2024-11-25 10:35:10] [ROLE_CHANGE] ==========================================
# [2024-11-25 10:35:10] [ROLE_CHANGE] Role change: node2
# [2024-11-25 10:35:10] [ROLE_CHANGE] Old role: unknown   (Patroni does not pass the previous role)
# [2024-11-25 10:35:10] [ROLE_CHANGE] New role: master
# [2024-11-25 10:35:10] [ROLE_CHANGE] This node is now PRIMARY
# [2024-11-25 10:35:10] [ROLE_CHANGE] Node IP: 10.0.1.12
# [2024-11-25 10:35:10] [ROLE_CHANGE] Role change completed
# 4. Verify marker file
ls -la /tmp/postgres_is_*
# -rw-r--r-- 1 postgres postgres 0 Nov 25 10:35 /tmp/postgres_is_primary
# 5. Restart node1 (will rejoin as replica)
sudo systemctl start patroni # On node1
# 6. Check node1 logs
sudo tail /var/log/patroni/callbacks.log
# [2024-11-25 10:36:30] [ROLE_CHANGE] Old role: unknown   (Patroni does not pass the previous role)
# [2024-11-25 10:36:30] [ROLE_CHANGE] New role: replica
# [2024-11-25 10:36:30] [ROLE_CHANGE] This node is now REPLICA
6. Advanced Callback Examples
6.1. DNS update using nsupdate
Prerequisites: BIND DNS server with dynamic DNS (DDNS) updates enabled.
#!/bin/bash
# Update DNS via nsupdate
update_dns() {
local NEW_PRIMARY_IP="$1"
local DNS_KEY="/etc/dns/Kpostgres.+157+12345.key"
local DNS_SERVER="dns.example.com"
local ZONE="postgres.local"
local RECORD="primary.postgres.local"
log "Updating DNS: $RECORD -> $NEW_PRIMARY_IP"
nsupdate -k "$DNS_KEY" << EOF
server $DNS_SERVER
zone $ZONE
update delete $RECORD A
update add $RECORD 60 A $NEW_PRIMARY_IP
send
EOF
if [ $? -eq 0 ]; then
log "DNS updated successfully"
else
log "ERROR: DNS update failed"
return 1
fi
}
# In on_role_change.sh
if [ "$PATRONI_ROLE" = "master" ]; then
NODE_IP=$(hostname -I | awk '{print $1}')
update_dns "$NODE_IP"
fi
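Before pointing the callback at a live DNS server, it can help to generate the nsupdate batch separately and inspect it. The sketch below assumes a hypothetical helper build_nsupdate_batch that only prints the same commands used in update_dns above:

```bash
#!/bin/bash
# Sketch: print the nsupdate command batch without sending it.
# build_nsupdate_batch is a hypothetical helper; arguments are
# server, zone, record, TTL, and the new primary IP.
build_nsupdate_batch() {
    local server="$1" zone="$2" record="$3" ttl="$4" ip="$5"
    printf 'server %s\n' "$server"
    printf 'zone %s\n' "$zone"
    printf 'update delete %s A\n' "$record"
    printf 'update add %s %s A %s\n' "$record" "$ttl" "$ip"
    printf 'send\n'
}

# Dry run with the example values from this section
build_nsupdate_batch dns.example.com postgres.local primary.postgres.local 60 10.0.1.12
```

Once the output looks right, the same batch can be piped into nsupdate, for example `build_nsupdate_batch ... | nsupdate -k "$DNS_KEY"`.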
6.2. HAProxy backend update
Via stats socket:
update_haproxy() {
local NEW_PRIMARY_IP="$1"
local HAPROXY_SOCKET="/var/run/haproxy.sock"
log "Updating HAProxy: primary backend -> $NEW_PRIMARY_IP"
echo "set server postgres-primary/node addr $NEW_PRIMARY_IP port 5432" | \
socat stdio "$HAPROXY_SOCKET"
echo "set server postgres-primary/node state ready" | \
socat stdio "$HAPROXY_SOCKET"
log "HAProxy backend updated"
}
6.3. Consul service registration
register_in_consul() {
local ROLE="$1"
local NODE_IP="$2"
log "Registering in Consul: $PATRONI_NAME as $ROLE"
curl -s -X PUT "http://consul.local:8500/v1/agent/service/register" \
-H "Content-Type: application/json" \
-d "{
\"Name\": \"postgres-$ROLE\",
\"ID\": \"postgres-$PATRONI_NAME\",
\"Address\": \"$NODE_IP\",
\"Port\": 5432,
\"Tags\": [\"$ROLE\", \"patroni\"],
\"Check\": {
\"TCP\": \"$NODE_IP:5432\",
\"Interval\": \"10s\",
\"Timeout\": \"2s\"
}
}"
log "Consul registration completed"
}
# Usage
NODE_IP=$(hostname -I | awk '{print $1}')
register_in_consul "$PATRONI_ROLE" "$NODE_IP"
6.4. Email notification
send_email_alert() {
local SUBJECT="$1"
local BODY="$2"
local RECIPIENT="ops-team@example.com"
log "Sending email alert: $SUBJECT"
echo "$BODY" | mail -s "$SUBJECT" "$RECIPIENT"
log "Email sent to $RECIPIENT"
}
# In on_role_change.sh
if [ "$PATRONI_ROLE" = "master" ]; then
send_email_alert \
"[ALERT] PostgreSQL Failover: $PATRONI_NAME promoted to PRIMARY" \
"Cluster: $PATRONI_SCOPE
Node: $PATRONI_NAME
Old Role: ${PATRONI_OLD_ROLE}
New Role: $PATRONI_ROLE
Time: $TIMESTAMP
Action required: Verify cluster health"
fi
6.5. Slack/Teams webhook
Detailed Slack notification:
send_slack_alert() {
local WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
local COLOR="$1" # good, warning, danger
local TITLE="$2"
local MESSAGE="$3"
curl -s -X POST "$WEBHOOK_URL" \
-H "Content-Type: application/json" \
-d "{
\"username\": \"Patroni Monitor\",
\"icon_emoji\": \":database:\",
\"attachments\": [{
\"color\": \"$COLOR\",
\"title\": \"$TITLE\",
\"text\": \"$MESSAGE\",
\"fields\": [
{\"title\": \"Cluster\", \"value\": \"$PATRONI_SCOPE\", \"short\": true},
{\"title\": \"Node\", \"value\": \"$PATRONI_NAME\", \"short\": true},
{\"title\": \"Old Role\", \"value\": \"${PATRONI_OLD_ROLE:-N/A}\", \"short\": true},
{\"title\": \"New Role\", \"value\": \"$PATRONI_ROLE\", \"short\": true},
{\"title\": \"Timestamp\", \"value\": \"$TIMESTAMP\", \"short\": false}
],
\"footer\": \"PostgreSQL HA\",
\"footer_icon\": \"https://www.postgresql.org/media/img/about/press/elephant.png\"
}]
}"
}
# Usage
if [ "$PATRONI_ROLE" = "master" ]; then
send_slack_alert "warning" \
"🚨 Failover Event" \
"Node $PATRONI_NAME has been promoted to PRIMARY"
fi
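The webhook payloads above interpolate shell variables directly into JSON, so a message containing a double quote or a newline would produce invalid JSON. A minimal sketch of escaping a value first, assuming a hypothetical helper json_escape written in plain bash:

```bash
#!/bin/bash
# Sketch: escape a string for use inside a JSON string literal.
# json_escape is a hypothetical helper; it covers backslash, double
# quote, newline, and tab, which are the common cases in alert text.
json_escape() {
    local s="$1"
    s="${s//\\/\\\\}"      # backslashes first, so later escapes survive
    s="${s//\"/\\\"}"      # double quotes
    s="${s//$'\n'/\\n}"    # newlines
    s="${s//$'\t'/\\t}"    # tabs
    printf '%s' "$s"
}

MESSAGE='Node "node2" promoted'
printf '{"text": "%s"}\n' "$(json_escape "$MESSAGE")"
```

If jq is available on the nodes, building the payload with it is another option; the pure-bash version avoids adding a dependency to the callback.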
6.6. Metrics push to monitoring
Push to Prometheus Pushgateway:
push_metrics() {
local PUSHGATEWAY="http://pushgateway.local:9091"
local JOB="patroni_callbacks"
log "Pushing metrics to Prometheus"
cat << EOF | curl -s --data-binary @- "$PUSHGATEWAY/metrics/job/$JOB/instance/$PATRONI_NAME"
# TYPE patroni_role_change counter
# HELP patroni_role_change Number of role changes
patroni_role_change{cluster="$PATRONI_SCOPE",node="$PATRONI_NAME",new_role="$PATRONI_ROLE"} 1
# TYPE patroni_role_change_timestamp gauge
# HELP patroni_role_change_timestamp Timestamp of last role change
patroni_role_change_timestamp{cluster="$PATRONI_SCOPE",node="$PATRONI_NAME"} $(date +%s)
EOF
log "Metrics pushed"
}
7. Callback Best Practices
✅ DO
- Keep callbacks fast
  - Complete within 10-30 seconds
  - Long tasks → background jobs
- Use proper logging
  - Log all actions
  - Include timestamps
  - Rotate logs
- Handle errors gracefully
  - Use set -e carefully
  - Catch errors, log, continue
  - Non-zero exit = warning, not failure
- Test thoroughly
  - Test in staging
  - Simulate all scenarios
  - Verify idempotency
- Make scripts idempotent
  - Can run multiple times safely
  - Check before modify
- Use absolute paths
  - Don't rely on PATH
  - Specify full paths
- Secure credentials
  - Don't hardcode passwords
  - Use environment variables or secrets manager
- Monitor callback execution
  - Alert on failures
  - Track execution time
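The "catch errors, log, continue" advice can be wrapped in one small function so that optional steps never abort the callback, even under set -e. This is a sketch; try_step is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch: run an optional step, log a warning if it fails, never abort.
# try_step is a hypothetical helper; the first argument is a label,
# the remaining arguments are the command to run.
try_step() {
    local name="$1"
    shift
    if "$@"; then
        echo "[OK] $name"
    else
        local rc=$?
        echo "[WARN] $name failed (exit $rc), continuing" >&2
    fi
    return 0
}

try_step "noop" true
```

In a role-change script it might be used as `try_step "slack" send_notification "..."`, so a failed notification is logged while the remaining steps still run.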
❌ DON'T
- Don't block for long time
  - Patroni waits for callbacks
  - Long delays → slower failover
- Don't rely on network during failover
  - Network may be partitioned
  - Have fallback mechanisms
- Don't fail the callback unnecessarily
  - Exit 0 even if notification fails
  - Log errors but continue
- Don't run database queries in callbacks
  - PostgreSQL may not be ready
  - Can cause deadlocks
- Don't modify PostgreSQL configuration
  - Let Patroni manage config
  - Use Patroni's parameters
- Don't use interactive commands
  - No user input
  - Must run unattended
8. Troubleshoot Callback Issues
8.1. Callback not executing
Check:
# 1. Verify script exists
ls -la /var/lib/postgresql/callbacks/on_role_change.sh
# 2. Check executable permissions
# Should be: -rwxr-xr-x postgres postgres
sudo chmod +x /var/lib/postgresql/callbacks/on_role_change.sh
# 3. Check ownership
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_role_change.sh
# 4. Verify path in patroni.yml
grep -A5 "callbacks:" /etc/patroni/patroni.yml
# 5. Check Patroni logs
sudo journalctl -u patroni -n 100 | grep -i callback
8.2. Callback failing
Check logs:
# Patroni logs
sudo journalctl -u patroni | grep "callback.*failed"
# Callback logs
sudo tail -f /var/log/patroni/callbacks.log
# Test script manually
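# Pass the same arguments Patroni would: <action> <role> <cluster-name>
sudo -u postgres /var/lib/postgresql/callbacks/on_role_change.sh on_role_change replica postgres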
Common issues:
- Syntax error: Run bash -n script.sh to check
- Missing dependency: Install required tools (curl, nc, etc.)
- Permission denied: Check file/directory permissions
- Timeout: Script taking too long
8.3. Callback causing slow failover
Measure callback execution time:
# Add timing to script
START_TIME=$(date +%s)
# ... your callback logic ...
END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
log "Callback completed in ${DURATION} seconds"
# If DURATION > 30, investigate and optimize
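Instead of adding start and end timestamps to every script by hand, the measurement can live in one wrapper. The sketch below uses bash's built-in SECONDS counter; run_timed is a hypothetical helper name:

```bash
#!/bin/bash
# Sketch: run a command, report how long it took, preserve its exit code.
# run_timed is a hypothetical helper; SECONDS is a bash built-in that
# counts whole seconds since the shell started.
run_timed() {
    local label="$1"
    shift
    local start=$SECONDS rc=0
    "$@" || rc=$?
    echo "$label finished in $((SECONDS - start))s (exit $rc)"
    return "$rc"
}

run_timed "noop" true
```

Wrapping individual steps (DNS update, load balancer update, notification) this way shows which one dominates the total callback time.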
9. Production Callback Template
Complete production-ready template:
#!/bin/bash
# Patroni callback template
# File: /var/lib/postgresql/callbacks/on_role_change.sh
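set -euo pipefail # Exit on error, undefined vars, pipe failures
# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>
export PATRONI_ROLE="${2:-${PATRONI_ROLE:-}}"
export PATRONI_SCOPE="${3:-${PATRONI_SCOPE:-}}"
export PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"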
# Configuration
readonly LOG_FILE="/var/log/patroni/callbacks.log"
readonly LOCK_FILE="/tmp/callback_role_change.lock"
readonly TIMEOUT=30
readonly SLACK_WEBHOOK="${SLACK_WEBHOOK_URL:-}"
# Logging function
log() {
local LEVEL="$1"
shift
local MESSAGE="$*"
local TIMESTAMP
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
echo "[$TIMESTAMP] [$LEVEL] [ROLE_CHANGE] $MESSAGE" | tee -a "$LOG_FILE"
}
# Error handler
error_exit() {
log "ERROR" "$1"
cleanup
exit 1
}
# Cleanup function
cleanup() {
# The lock is created with mkdir below, so it is a directory and needs rm -rf
rm -rf "$LOCK_FILE"
}
# Ensure only one instance runs
if ! mkdir "$LOCK_FILE" 2>/dev/null; then
log "WARN" "Another callback instance is running, exiting"
exit 0
fi
trap cleanup EXIT
# Set timeout
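# The timed section runs in a child bash, so export the helpers and settings it uses
export -f log error_exit cleanup
export LOG_FILE LOCK_FILE SLACK_WEBHOOK
timeout "$TIMEOUT" bash << 'SCRIPT' || error_exit "Callback failed or timed out after ${TIMEOUT}s"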
log "INFO" "=========================================="
log "INFO" "Role change detected"
log "INFO" "Cluster: ${PATRONI_SCOPE:-unknown}"
log "INFO" "Node: ${PATRONI_NAME:-unknown}"
log "INFO" "Old role: ${PATRONI_OLD_ROLE:-unknown}"
log "INFO" "New role: ${PATRONI_ROLE:-unknown}"
log "INFO" "=========================================="
# Main logic
case "${PATRONI_ROLE:-}" in
master|primary) # newer Patroni releases may report "primary" instead of "master"
log "INFO" "Handling promotion to PRIMARY"
# Get node IP
NODE_IP=$(hostname -I | awk '{print $1}')
log "INFO" "Node IP: $NODE_IP"
# Update DNS (implement your logic)
# update_dns "$NODE_IP" || log "WARN" "DNS update failed"
# Update load balancer (implement your logic)
# update_load_balancer "$NODE_IP" || log "WARN" "LB update failed"
# Send notification
if [ -n "$SLACK_WEBHOOK" ]; then
curl -s -X POST "$SLACK_WEBHOOK" \
-H "Content-Type: application/json" \
-d "{\"text\": \"🚨 Failover: $PATRONI_NAME promoted to PRIMARY\"}" \
|| log "WARN" "Slack notification failed"
fi
# Set marker files
touch /tmp/postgres_is_primary
rm -f /tmp/postgres_is_replica
log "INFO" "PRIMARY promotion tasks completed"
;;
replica)
log "INFO" "Handling demotion to REPLICA"
# Remove primary marker
rm -f /tmp/postgres_is_primary
touch /tmp/postgres_is_replica
# Notify if demoted from primary
if [ "${PATRONI_OLD_ROLE:-}" = "master" ]; then
log "WARN" "Node demoted from PRIMARY to REPLICA"
# Send alert
fi
log "INFO" "REPLICA tasks completed"
;;
*)
error_exit "Unknown role: ${PATRONI_ROLE:-unknown}"
;;
esac
log "INFO" "Callback completed successfully"
exit 0
SCRIPT
10. Lab Exercises
Lab 1: Setup basic callbacks
Tasks:
- Create callback directory and scripts
- Add callbacks to patroni.yml
- Reload Patroni
- Test with patronictl restart
Lab 2: Test failover callbacks
Tasks:
- Monitor callback logs: tail -f /var/log/patroni/callbacks.log
- Stop primary: sudo systemctl stop patroni
- Verify on_role_change executed on new primary
- Check marker files: /tmp/postgres_is_*
- Restart old primary, verify it rejoins as replica
Lab 3: Implement Slack notifications
Tasks:
- Get Slack webhook URL
- Add notification to on_role_change.sh
- Test by triggering failover
- Verify message received in Slack
Lab 4: Measure callback performance
Tasks:
- Add timing to all callbacks
- Trigger various events (restart, reload, failover)
- Analyze callback execution times
- Optimize slow callbacks
11. Summary
Key Takeaways
✅ Callbacks = Custom automation at lifecycle events
✅ on_role_change = Most critical callback for failover automation
✅ Keep callbacks fast (<30s) for quick failover
✅ Log everything for debugging
✅ Test thoroughly before production
✅ Handle errors gracefully - don't block operations
Common Use Cases
| Callback | Common Actions |
|---|---|
| on_start | Startup checks, mount verification |
| on_stop | Cleanup, notifications |
| on_role_change | Update DNS, LB, send alerts |
| on_restart | Log maintenance events |
| on_reload | Verify config changes |
Current architecture
✅ 3 VMs prepared (Lesson 4)
✅ PostgreSQL 18 installed (Lesson 5)
✅ etcd cluster running (Lesson 6)
✅ Patroni installed (Lesson 7)
✅ Patroni configured (Lesson 8)
✅ Cluster bootstrapped (Lesson 9)
✅ Replication configured (Lesson 10)
✅ Callbacks implemented (Lesson 11)
Next: REST API usage
Preparing for Lesson 12
Lesson 12 will cover the Patroni REST API:
- Health check endpoints
- Cluster status queries
- Configuration management via API
- Integration with load balancers
- Monitoring and metrics