Lesson 11: Patroni Callbacks

Create callback scripts (on_start, on_stop, on_role_change), write custom scripts for notifications, and integrate with monitoring systems.

XDEV ASIA


Objectives

After this lesson, you will be able to:

  • Explain what Patroni callbacks are and when they are triggered
  • Implement custom scripts for lifecycle events
  • Configure callbacks for automation tasks
  • Handle role changes (primary ↔ replica)
  • Set up notifications and monitoring hooks
  • Troubleshoot callback failures

1. Callbacks Overview

1.1. What are callbacks?

Callbacks are custom scripts that Patroni executes at lifecycle events of the cluster.

Use cases:

  • 🔔 Notifications: Alert team khi failover xảy ra
  • 🔧 Automation: Update DNS, load balancer configs
  • 📊 Monitoring: Push metrics to monitoring system
  • 🚦 Traffic management: Redirect application traffic
  • 🔐 Security: Rotate credentials, update firewall rules
  • 📝 Logging: Custom audit logs

1.2. Available callbacks

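Patroni provides the following callback events (configured under postgresql.callbacks):

Callback | Trigger | Use case
on_start | PostgreSQL has started | Post-start checks, notify monitoring
on_stop | PostgreSQL has stopped | Cleanup, notify applications
on_restart | PostgreSQL has restarted | Log restart event
on_reload | A configuration reload is triggered | Verify config changes
on_role_change | The node is promoted or demoted | Most important: update DNS, LB

Patroni also supports two related hooks that are configured directly under postgresql rather than under callbacks:

Setting | Trigger | Use case
pre_promote | After acquiring the leader lock, before the replica is promoted | Fencing and final checks; a non-zero exit code aborts the promotion
before_stop | Immediately before PostgreSQL is stopped | Drain connections, notify applications

There is no post_promote hook; use on_role_change for post-promotion tasks such as updating monitoring and sending alerts.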

1.3. Callback execution flow

Example: Failover scenario

Old primary crashes
       ↓
Leader key expires in the DCS (after TTL)
       ↓
Healthiest replica (node2) acquires the leader lock
       ↓
postgresql.pre_promote script runs on node2, if configured (non-zero exit aborts the promotion)
       ↓
PostgreSQL on node2 is promoted to primary
       ↓
on_role_change callback runs on node2 (role = master or primary, depending on the Patroni version)
       ↓
Other replicas follow the new leader (their role does not change, so on_role_change does not run there)
       ↓
Failover complete

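1.4. Callback arguments and environment

Patroni passes three positional arguments to every callback script:

Argument | Description | Example
$1 | Action (name of the callback) | on_start, on_role_change
$2 | Role of the node after the event | master or primary (depending on the Patroni version), replica
$3 | Cluster name (scope) | postgres

Patroni does not set callback-specific environment variables. The script inherits the environment of the Patroni process, so variables such as PATRONI_NAME or PATRONI_SCOPE exist only if Patroni itself was configured through them. The node name and the previous role are not passed; a script that needs them must derive them itself, for example from hostname or from a marker file written on the previous run.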
2. Configure Callbacks in Patroni

2.1. Basic configuration

In patroni.yml:

scope: postgres
name: node1

postgresql:
  callbacks:
    on_start: /var/lib/postgresql/callbacks/on_start.sh
    on_stop: /var/lib/postgresql/callbacks/on_stop.sh
    on_restart: /var/lib/postgresql/callbacks/on_restart.sh
    on_reload: /var/lib/postgresql/callbacks/on_reload.sh
    on_role_change: /var/lib/postgresql/callbacks/on_role_change.sh

Key points:

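  • Paths must be absolute
  • Scripts must be executable (chmod +x)
  • Scripts must be readable and executable by the OS user that runs Patroni (usually postgres)
  • Scripts should complete quickly (<30 seconds)
  • Patroni runs callbacks asynchronously and does not wait for them, so a non-zero exit code never blocks the operation; at most it shows up in the logs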
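
The file requirements above can be checked before Patroni is reloaded. The following is a minimal sketch, assuming it runs as the same OS user as Patroni; check_callback is a hypothetical helper, not a Patroni command.

```bash
#!/bin/bash
# Sketch: verify that a callback path is absolute, exists and is executable.
# check_callback is a hypothetical helper, not a Patroni command.
check_callback() {
    local path="$1"
    case "$path" in
        /*) ;;                                        # absolute path: ok
        *)  echo "not absolute: $path"; return 1 ;;
    esac
    [ -f "$path" ] || { echo "missing: $path"; return 1; }
    [ -x "$path" ] || { echo "not executable: $path"; return 1; }
    echo "ok: $path"
}

# Example: check every callback script in the directory used in this lesson
for script in /var/lib/postgresql/callbacks/*.sh; do
    [ -e "$script" ] || continue    # the glob matched nothing
    check_callback "$script" || true
done
```

Running it with sudo -u postgres gives the same view of permissions that Patroni has.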

2.2. Create callback directory

# On all 3 nodes
sudo mkdir -p /var/lib/postgresql/callbacks
sudo chown postgres:postgres /var/lib/postgresql/callbacks
sudo chmod 750 /var/lib/postgresql/callbacks

3. Implement Callback Scripts

3.1. on_start callback

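Use case: Post-start validation, mount checks. Patroni runs on_start once PostgreSQL is up, so a failing check here can raise an alert but cannot prevent the start.

Script: /var/lib/postgresql/callbacks/on_start.sh

#!/bin/bash
# on_start.sh - Runs after PostgreSQL has started

set -e

# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>.
# It does not export PATRONI_* variables for callbacks, so fill in the names
# used below from the positional arguments (node name falls back to hostname).
PATRONI_ROLE="${2:-unknown}"
PATRONI_SCOPE="${3:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

# Logging function
log() {
    echo "[$TIMESTAMP] [ON_START] $1" | tee -a "$LOG_FILE"
}

log "PostgreSQL started on $PATRONI_NAME"
log "Role: $PATRONI_ROLE"
log "Cluster: $PATRONI_SCOPE"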

# Check disk space
DISK_USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
    log "ERROR: Disk usage is ${DISK_USAGE}% - critically high!"
    exit 1
fi
log "Disk usage: ${DISK_USAGE}%"

# Check if data directory is mounted
if ! mountpoint -q /var/lib/postgresql/18/data; then
    log "WARNING: Data directory is not a mount point"
fi

# Check network connectivity to etcd
for ETCD_HOST in 10.0.1.11 10.0.1.12 10.0.1.13; do
    if ! nc -zw3 "$ETCD_HOST" 2379 2>/dev/null; then
        log "ERROR: Cannot reach etcd at $ETCD_HOST:2379"
        exit 1
    fi
done
log "etcd connectivity verified"

log "Post-start checks passed"
exit 0

Create script:

# On all nodes
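sudo tee /var/lib/postgresql/callbacks/on_start.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
# Patroni passes: $1 = action, $2 = role, $3 = cluster name (no PATRONI_* env vars)
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_START] $1" | tee -a "$LOG_FILE"; }

log "PostgreSQL started on $PATRONI_NAME (Role: $PATRONI_ROLE)"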

# Disk space check
DISK_USAGE=$(df -h /var/lib/postgresql | awk 'NR==2 {print $5}' | sed 's/%//')
if [ "$DISK_USAGE" -gt 90 ]; then
    log "ERROR: Disk usage ${DISK_USAGE}% too high"
    exit 1
fi
log "Disk usage: ${DISK_USAGE}%"

log "Post-start checks passed"
exit 0
EOF

sudo chmod +x /var/lib/postgresql/callbacks/on_start.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_start.sh

3.2. on_stop callback

Use case: Graceful shutdown notifications.

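Script: /var/lib/postgresql/callbacks/on_stop.sh

#!/bin/bash
# on_stop.sh - Runs after PostgreSQL has stopped

set -e

# Patroni passes: $1 = action, $2 = role, $3 = cluster name.
# PATRONI_* variables are not exported for callbacks, so set them here.
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

log() {
    echo "[$TIMESTAMP] [ON_STOP] $1" | tee -a "$LOG_FILE"
}

log "PostgreSQL stopped on $PATRONI_NAME"
log "Role: $PATRONI_ROLE"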

# Notify monitoring system
if command -v curl >/dev/null 2>&1; then
    curl -s -X POST http://monitoring.example.com/api/events \
        -H "Content-Type: application/json" \
        -d "{
            \"event\": \"postgresql_stop\",
            \"node\": \"$PATRONI_NAME\",
            \"role\": \"$PATRONI_ROLE\",
            \"timestamp\": \"$TIMESTAMP\"
        }" || log "WARNING: Failed to notify monitoring"
fi

log "PostgreSQL stop recorded"
exit 0

Create script:

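sudo tee /var/lib/postgresql/callbacks/on_stop.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
# Patroni passes: $1 = action, $2 = role, $3 = cluster name (no PATRONI_* env vars)
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_STOP] $1" | tee -a "$LOG_FILE"; }

log "PostgreSQL stopped on $PATRONI_NAME (Role: $PATRONI_ROLE)"
exit 0
EOF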

sudo chmod +x /var/lib/postgresql/callbacks/on_stop.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_stop.sh
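
The monitoring notification in on_stop.sh interpolates shell variables directly into a JSON string, which produces invalid JSON if a value contains a double quote or a backslash. A minimal sketch of an escaping helper in pure bash follows; json_escape is a hypothetical name, and it handles only backslash, double quote, newline and tab, which is an assumption about what the values can contain.

```bash
#!/bin/bash
# Sketch: escape a string for embedding in a JSON string literal.
# json_escape is a hypothetical helper; it covers backslash, double quote,
# newline and tab only.
json_escape() {
    local s="$1"
    s=${s//\\/\\\\}        # backslash -> \\
    s=${s//\"/\\\"}        # double quote -> \"
    s=${s//$'\n'/\\n}      # newline -> \n
    s=${s//$'\t'/\\t}      # tab -> \t
    printf '%s' "$s"
}

NODE='node"1'
printf '{"event": "postgresql_stop", "node": "%s"}\n' "$(json_escape "$NODE")"
```

If jq is available on the nodes, building the payload with it is a sturdier alternative to hand-written escaping.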

3.3. on_role_change callback (Most Important!)

Use case: Update DNS, load balancers, send notifications.

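Script: /var/lib/postgresql/callbacks/on_role_change.sh

#!/bin/bash
# on_role_change.sh - Runs when role changes (master ↔ replica)

set -e

# Patroni invokes callbacks as: <script> <action> <role> <cluster-name>.
# It does not export PATRONI_* variables for callbacks, so fill in the names
# used below from the positional arguments (node name falls back to hostname).
PATRONI_ROLE="${2:-unknown}"
PATRONI_SCOPE="${3:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"

# Newer Patroni releases pass "primary" instead of "master"; normalize it.
if [ "$PATRONI_ROLE" = "primary" ]; then
    PATRONI_ROLE="master"
fi

# The previous role is not passed either; infer it from the marker files
# this script wrote on its last run.
if [ -e /tmp/postgres_is_primary ]; then
    PATRONI_OLD_ROLE="master"
elif [ -e /tmp/postgres_is_replica ]; then
    PATRONI_OLD_ROLE="replica"
else
    PATRONI_OLD_ROLE="unknown"
fi

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

log() {
    echo "[$TIMESTAMP] [ROLE_CHANGE] $1" | tee -a "$LOG_FILE"
}

log "=========================================="
log "Role change detected on $PATRONI_NAME"
log "Cluster: $PATRONI_SCOPE"
log "Old role: ${PATRONI_OLD_ROLE:-unknown}"
log "New role: $PATRONI_ROLE"
log "=========================================="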

# Function: Update DNS
update_dns() {
    local NEW_PRIMARY_IP="$1"
    
    log "Updating DNS record for primary.postgres.local -> $NEW_PRIMARY_IP"
    
    # Example using nsupdate (BIND DNS)
    # nsupdate -k /etc/dns/Kpostgres.+157+12345.key << EOF
    # server dns-server.local
    # zone postgres.local
    # update delete primary.postgres.local A
    # update add primary.postgres.local 60 A $NEW_PRIMARY_IP
    # send
    # EOF
    
    # Or using API (e.g., Route53, Cloudflare)
    # aws route53 change-resource-record-sets --hosted-zone-id Z1234 ...
    
    log "DNS update completed"
}

# Function: Update HAProxy
update_haproxy() {
    local NEW_PRIMARY_IP="$1"
    
    log "Notifying HAProxy about new primary: $NEW_PRIMARY_IP"
    
    # Use HAProxy stats socket
    # echo "set server postgres/primary addr $NEW_PRIMARY_IP" | \
    #     socat stdio /var/run/haproxy.sock
    
    log "HAProxy updated"
}

# Function: Send Slack notification
send_notification() {
    local MESSAGE="$1"
    local WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    
    log "Sending notification: $MESSAGE"
    
    curl -s -X POST "$WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{
            \"text\": \"🔄 PostgreSQL Role Change\",
            \"attachments\": [{
                \"color\": \"warning\",
                \"fields\": [
                    {\"title\": \"Node\", \"value\": \"$PATRONI_NAME\", \"short\": true},
                    {\"title\": \"Cluster\", \"value\": \"$PATRONI_SCOPE\", \"short\": true},
                    {\"title\": \"Old Role\", \"value\": \"${PATRONI_OLD_ROLE:-N/A}\", \"short\": true},
                    {\"title\": \"New Role\", \"value\": \"$PATRONI_ROLE\", \"short\": true},
                    {\"title\": \"Time\", \"value\": \"$TIMESTAMP\", \"short\": false}
                ]
            }]
        }" || log "WARNING: Notification failed"
}

# Main logic
case "$PATRONI_ROLE" in
    master)
        log "This node is now PRIMARY"
        
        # Get this node's IP
        NODE_IP=$(hostname -I | awk '{print $1}')
        log "Node IP: $NODE_IP"
        
        # Update DNS to point to new primary
        update_dns "$NODE_IP"
        
        # Update load balancer
        update_haproxy "$NODE_IP"
        
        # Send notification
        send_notification "Node $PATRONI_NAME promoted to PRIMARY"
        
        # Set marker file for applications
        touch /tmp/postgres_is_primary
        rm -f /tmp/postgres_is_replica
        
        log "Primary promotion tasks completed"
        ;;
        
    replica)
        log "This node is now REPLICA"
        
        # Remove primary marker
        rm -f /tmp/postgres_is_primary
        touch /tmp/postgres_is_replica
        
        # Send notification if demoted from primary
        if [ "${PATRONI_OLD_ROLE}" = "master" ]; then
            send_notification "Node $PATRONI_NAME demoted to REPLICA"
        fi
        
        log "Replica role tasks completed"
        ;;
        
    *)
        log "Unknown role: $PATRONI_ROLE"
        exit 1
        ;;
esac

log "Role change handling completed successfully"
exit 0

Create production-ready script:

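sudo tee /var/lib/postgresql/callbacks/on_role_change.sh > /dev/null << 'EOF'
#!/bin/bash
set -e

# Patroni passes: $1 = action, $2 = role, $3 = cluster name (no PATRONI_* env vars)
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"
# Newer Patroni releases pass "primary" instead of "master"
if [ "$PATRONI_ROLE" = "primary" ]; then PATRONI_ROLE="master"; fi
# The previous role is not passed; infer it from the marker files
if [ -e /tmp/postgres_is_primary ]; then PATRONI_OLD_ROLE="master"
elif [ -e /tmp/postgres_is_replica ]; then PATRONI_OLD_ROLE="replica"
else PATRONI_OLD_ROLE="unknown"; fi

LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ROLE_CHANGE] $1" | tee -a "$LOG_FILE"; }

log "=========================================="
log "Role change: $PATRONI_NAME"
log "Old role: ${PATRONI_OLD_ROLE:-unknown}"
log "New role: $PATRONI_ROLE"
log "=========================================="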

case "$PATRONI_ROLE" in
    master)
        log "This node is now PRIMARY"
        NODE_IP=$(hostname -I | awk '{print $1}')
        log "Node IP: $NODE_IP"
        
        # TODO: Update DNS, load balancer, etc.
        # update_dns "$NODE_IP"
        
        touch /tmp/postgres_is_primary
        rm -f /tmp/postgres_is_replica
        ;;
        
    replica)
        log "This node is now REPLICA"
        rm -f /tmp/postgres_is_primary
        touch /tmp/postgres_is_replica
        ;;
        
    *)
        log "Unknown role: $PATRONI_ROLE"
        exit 1
        ;;
esac

log "Role change completed"
exit 0
EOF

sudo chmod +x /var/lib/postgresql/callbacks/on_role_change.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_role_change.sh
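
The role string that on_role_change receives can differ between environments: Patroni 3.x passes master, newer releases are expected to pass primary, and a standby cluster can report standby_leader. A minimal sketch of mapping these to one canonical name before the case statement follows; normalize_role is a hypothetical helper, and the default of replica exists only so the demo runs without arguments.

```bash
#!/bin/bash
# Sketch: map the role string passed by Patroni to one canonical name.
# normalize_role is a hypothetical helper for this lesson's scripts.
normalize_role() {
    case "$1" in
        master|primary)  echo "primary" ;;
        replica)         echo "replica" ;;
        standby_leader)  echo "standby_leader" ;;
        *)               echo "unknown"; return 1 ;;
    esac
}

# In a callback, $2 is the role; "replica" is a demo default only
ROLE=$(normalize_role "${2:-replica}") || ROLE="unknown"
echo "normalized role: $ROLE"
```

Matching on the normalized value keeps the script working across a Patroni upgrade instead of falling into the unknown-role branch.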

3.4. on_restart callback

Use case: Log restarts, notify about planned maintenance.

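sudo tee /var/lib/postgresql/callbacks/on_restart.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
# Patroni passes: $1 = action, $2 = role, $3 = cluster name (no PATRONI_* env vars)
PATRONI_ROLE="${2:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_RESTART] $1" | tee -a "$LOG_FILE"; }

log "Restarting PostgreSQL on $PATRONI_NAME (Role: $PATRONI_ROLE)"
exit 0
EOF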

sudo chmod +x /var/lib/postgresql/callbacks/on_restart.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_restart.sh

3.5. on_reload callback

Use case: Verify configuration changes were applied.

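sudo tee /var/lib/postgresql/callbacks/on_reload.sh > /dev/null << 'EOF'
#!/bin/bash
set -e
# Patroni passes: $1 = action, $2 = role, $3 = cluster name (no PATRONI_* env vars)
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"
LOG_FILE="/var/log/patroni/callbacks.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
log() { echo "[$TIMESTAMP] [ON_RELOAD] $1" | tee -a "$LOG_FILE"; }

log "Configuration reloaded on $PATRONI_NAME"

# Verify critical settings. The callback already runs as the postgres OS user,
# so psql is called directly (no sudo); a failed query must not abort the callback.
MAX_CONN=$(psql -X -At -c "SHOW max_connections;" 2>/dev/null || echo "unknown")
log "max_connections = $MAX_CONN"

exit 0
EOF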

sudo chmod +x /var/lib/postgresql/callbacks/on_reload.sh
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_reload.sh

3.6. Create log directory

# On all nodes
sudo mkdir -p /var/log/patroni
sudo chown postgres:postgres /var/log/patroni
sudo chmod 750 /var/log/patroni

4. Update Patroni Configuration

4.1. Add callbacks to patroni.yml

On all 3 nodes, edit /etc/patroni/patroni.yml:

scope: postgres
namespace: /service/
name: node1  # node2, node3 for other nodes

restapi:
  listen: 0.0.0.0:8008
  connect_address: 10.0.1.11:8008  # Change per node

etcd3:
  hosts: 10.0.1.11:2379,10.0.1.12:2379,10.0.1.13:2379

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    synchronous_mode: true
    synchronous_mode_strict: false
    
    postgresql:
      parameters:
        max_connections: 100
        shared_buffers: 256MB
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10

postgresql:
  listen: 0.0.0.0:5432
  connect_address: 10.0.1.11:5432  # Change per node
  data_dir: /var/lib/postgresql/18/data
  bin_dir: /usr/lib/postgresql/18/bin
  
  authentication:
    replication:
      username: replicator
      password: replicator_password
    superuser:
      username: postgres
      password: postgres_password
  
  parameters:
    unix_socket_directories: '/var/run/postgresql'
  
  # ✅ Add callbacks section
  callbacks:
    on_start: /var/lib/postgresql/callbacks/on_start.sh
    on_stop: /var/lib/postgresql/callbacks/on_stop.sh
    on_restart: /var/lib/postgresql/callbacks/on_restart.sh
    on_reload: /var/lib/postgresql/callbacks/on_reload.sh
    on_role_change: /var/lib/postgresql/callbacks/on_role_change.sh

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false

4.2. Reload Patroni configuration

# On all 3 nodes
sudo systemctl reload patroni

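# Verify the callbacks section in the local configuration
# (patronictl show-config prints only the dynamic configuration stored in the DCS)
sudo grep -A5 "callbacks:" /etc/patroni/patroni.yml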

5. Test Callbacks

5.1. Test on_restart

# Restart a node
patronictl restart postgres node2

# Check logs
sudo tail -f /var/log/patroni/callbacks.log

# Expected output:
# [2024-11-25 10:30:15] [ON_RESTART] Restarting PostgreSQL on node2

5.2. Test on_reload

# Reload configuration
patronictl reload postgres node2

# Check logs
sudo tail /var/log/patroni/callbacks.log

# Expected:
# [2024-11-25 10:32:45] [ON_RELOAD] Configuration reloaded on node2

5.3. Test on_role_change (Failover)

⚠️ IMPORTANT: Test in non-production!

# 1. Check current primary
patronictl list postgres
# node1 is Leader

# 2. Stop primary
sudo systemctl stop patroni  # On node1

# 3. Watch logs on node2 (will become new primary)
sudo tail -f /var/log/patroni/callbacks.log

# Expected output:
# [2024-11-25 10:35:10] [ROLE_CHANGE] ==========================================
# [2024-11-25 10:35:10] [ROLE_CHANGE] Role change: node2
# [2024-11-25 10:35:10] [ROLE_CHANGE] Old role: replica   # or "unknown" if no marker file exists yet
# [2024-11-25 10:35:10] [ROLE_CHANGE] New role: master
# [2024-11-25 10:35:10] [ROLE_CHANGE] This node is now PRIMARY
# [2024-11-25 10:35:10] [ROLE_CHANGE] Node IP: 10.0.1.12
# [2024-11-25 10:35:10] [ROLE_CHANGE] Role change completed

# 4. Verify marker file
ls -la /tmp/postgres_is_*
# -rw-r--r-- 1 postgres postgres 0 Nov 25 10:35 /tmp/postgres_is_primary

# 5. Restart node1 (will rejoin as replica)
sudo systemctl start patroni  # On node1

# 6. Check node1 logs
sudo tail /var/log/patroni/callbacks.log
# [2024-11-25 10:36:30] [ROLE_CHANGE] Old role: master
# [2024-11-25 10:36:30] [ROLE_CHANGE] New role: replica
# [2024-11-25 10:36:30] [ROLE_CHANGE] This node is now REPLICA

6. Advanced Callback Examples

6.1. DNS update using nsupdate

Prerequisites: BIND DNS server with dynamic DNS (DDNS) updates enabled.

#!/bin/bash
# Update DNS via nsupdate

update_dns() {
    local NEW_PRIMARY_IP="$1"
    local DNS_KEY="/etc/dns/Kpostgres.+157+12345.key"
    local DNS_SERVER="dns.example.com"
    local ZONE="postgres.local"
    local RECORD="primary.postgres.local"
    
    log "Updating DNS: $RECORD -> $NEW_PRIMARY_IP"
    
    nsupdate -k "$DNS_KEY" << EOF
server $DNS_SERVER
zone $ZONE
update delete $RECORD A
update add $RECORD 60 A $NEW_PRIMARY_IP
send
EOF
    
    if [ $? -eq 0 ]; then
        log "DNS updated successfully"
    else
        log "ERROR: DNS update failed"
        return 1
    fi
}

# In on_role_change.sh
if [ "$PATRONI_ROLE" = "master" ]; then
    NODE_IP=$(hostname -I | awk '{print $1}')
    update_dns "$NODE_IP"
fi
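
hostname -I prints whatever addresses the OS reports, so the first field can be empty or malformed on a misconfigured host. A minimal sketch of a guard to run before update_dns follows; is_ipv4 is a hypothetical helper and checks only the syntax of a dotted-quad IPv4 address, not whether it is the address clients should use.

```bash
#!/bin/bash
# Sketch: syntactic IPv4 check to run before pushing an address into DNS.
# is_ipv4 is a hypothetical helper.
is_ipv4() {
    local ip="$1" octet
    [[ "$ip" =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
    for octet in "${BASH_REMATCH[@]:1}"; do
        (( 10#$octet <= 255 )) || return 1    # 10# forces base 10 for leading zeros
    done
}

NODE_IP="10.0.1.12"    # in the callback: $(hostname -I | awk '{print $1}')
if is_ipv4 "$NODE_IP"; then
    echo "would update DNS to $NODE_IP"
else
    echo "refusing to update DNS with invalid address: '$NODE_IP'" >&2
fi
```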

6.2. HAProxy backend update

Via stats socket:

update_haproxy() {
    local NEW_PRIMARY_IP="$1"
    local HAPROXY_SOCKET="/var/run/haproxy.sock"
    
    log "Updating HAProxy: primary backend -> $NEW_PRIMARY_IP"
    
    echo "set server postgres-primary/node addr $NEW_PRIMARY_IP port 5432" | \
        socat stdio "$HAPROXY_SOCKET"
    
    echo "set server postgres-primary/node state ready" | \
        socat stdio "$HAPROXY_SOCKET"
    
    log "HAProxy backend updated"
}

6.3. Consul service registration

register_in_consul() {
    local ROLE="$1"
    local NODE_IP="$2"
    
    log "Registering in Consul: $PATRONI_NAME as $ROLE"
    
    curl -s -X PUT "http://consul.local:8500/v1/agent/service/register" \
        -H "Content-Type: application/json" \
        -d "{
            \"Name\": \"postgres-$ROLE\",
            \"ID\": \"postgres-$PATRONI_NAME\",
            \"Address\": \"$NODE_IP\",
            \"Port\": 5432,
            \"Tags\": [\"$ROLE\", \"patroni\"],
            \"Check\": {
                \"TCP\": \"$NODE_IP:5432\",
                \"Interval\": \"10s\",
                \"Timeout\": \"2s\"
            }
        }"
    
    log "Consul registration completed"
}

# Usage
NODE_IP=$(hostname -I | awk '{print $1}')
register_in_consul "$PATRONI_ROLE" "$NODE_IP"

6.4. Email notification

send_email_alert() {
    local SUBJECT="$1"
    local BODY="$2"
    local RECIPIENT="ops-team@example.com"
    
    log "Sending email alert: $SUBJECT"
    
    echo "$BODY" | mail -s "$SUBJECT" "$RECIPIENT"
    
    log "Email sent to $RECIPIENT"
}

# In on_role_change.sh
if [ "$PATRONI_ROLE" = "master" ]; then
    send_email_alert \
        "[ALERT] PostgreSQL Failover: $PATRONI_NAME promoted to PRIMARY" \
        "Cluster: $PATRONI_SCOPE
Node: $PATRONI_NAME
Old Role: ${PATRONI_OLD_ROLE}
New Role: $PATRONI_ROLE
Time: $TIMESTAMP

Action required: Verify cluster health"
fi

6.5. Slack/Teams webhook

Detailed Slack notification:

send_slack_alert() {
    local WEBHOOK_URL="https://hooks.slack.com/services/YOUR/WEBHOOK/URL"
    local COLOR="$1"  # good, warning, danger
    local TITLE="$2"
    local MESSAGE="$3"
    
    curl -s -X POST "$WEBHOOK_URL" \
        -H "Content-Type: application/json" \
        -d "{
            \"username\": \"Patroni Monitor\",
            \"icon_emoji\": \":database:\",
            \"attachments\": [{
                \"color\": \"$COLOR\",
                \"title\": \"$TITLE\",
                \"text\": \"$MESSAGE\",
                \"fields\": [
                    {\"title\": \"Cluster\", \"value\": \"$PATRONI_SCOPE\", \"short\": true},
                    {\"title\": \"Node\", \"value\": \"$PATRONI_NAME\", \"short\": true},
                    {\"title\": \"Old Role\", \"value\": \"${PATRONI_OLD_ROLE:-N/A}\", \"short\": true},
                    {\"title\": \"New Role\", \"value\": \"$PATRONI_ROLE\", \"short\": true},
                    {\"title\": \"Timestamp\", \"value\": \"$TIMESTAMP\", \"short\": false}
                ],
                \"footer\": \"PostgreSQL HA\",
                \"footer_icon\": \"https://www.postgresql.org/media/img/about/press/elephant.png\"
            }]
        }"
}

# Usage
if [ "$PATRONI_ROLE" = "master" ]; then
    send_slack_alert "warning" \
        "🚨 Failover Event" \
        "Node $PATRONI_NAME has been promoted to PRIMARY"
fi

6.6. Metrics push to monitoring

Push to Prometheus Pushgateway:

push_metrics() {
    local PUSHGATEWAY="http://pushgateway.local:9091"
    local JOB="patroni_callbacks"
    
    log "Pushing metrics to Prometheus"
    
    cat << EOF | curl -s --data-binary @- "$PUSHGATEWAY/metrics/job/$JOB/instance/$PATRONI_NAME"
# TYPE patroni_role_change counter
# HELP patroni_role_change Number of role changes
patroni_role_change{cluster="$PATRONI_SCOPE",node="$PATRONI_NAME",new_role="$PATRONI_ROLE"} 1

# TYPE patroni_role_change_timestamp gauge
# HELP patroni_role_change_timestamp Timestamp of last role change
patroni_role_change_timestamp{cluster="$PATRONI_SCOPE",node="$PATRONI_NAME"} $(date +%s)
EOF
    
    log "Metrics pushed"
}
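
The Pushgateway replaces previously pushed metrics for the same job and instance grouping, so a counter pushed with the value 1 on every event stays at 1; the timestamp gauge is the series that carries information. The sketch below renders such a payload as plain text so it can be inspected before it is sent. render_role_metrics and the metric name are hypothetical.

```bash
#!/bin/bash
# Sketch: build the text exposition payload for a role change.
# render_role_metrics and the metric name are hypothetical.
render_role_metrics() {
    local scope="$1" node="$2" role="$3" now="$4"
    cat <<EOF
# HELP patroni_last_role_change_timestamp_seconds Unix time of the last role change
# TYPE patroni_last_role_change_timestamp_seconds gauge
patroni_last_role_change_timestamp_seconds{cluster="$scope",node="$node",role="$role"} $now
EOF
}

render_role_metrics postgres node2 master "$(date +%s)"
```

The output can be piped into the same curl --data-binary @- call that push_metrics uses.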

7. Callback Best Practices

✅ DO

  1. Keep callbacks fast
    • Complete within 10-30 seconds
    • Long tasks → background jobs
  2. Use proper logging
    • Log all actions
    • Include timestamps
    • Rotate logs
  3. Handle errors gracefully
    • Use set -e carefully
    • Catch errors, log, continue
    • Non-zero exit = warning, not failure
  4. Test thoroughly
    • Test in staging
    • Simulate all scenarios
    • Verify idempotency
  5. Make scripts idempotent
    • Can run multiple times safely
    • Check before modify
  6. Use absolute paths
    • Don't rely on PATH
    • Specify full paths
  7. Secure credentials
    • Don't hardcode passwords
    • Use environment variables or secrets manager
  8. Monitor callback execution
    • Alert on failures
    • Track execution time

❌ DON'T

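  1. Don't run long tasks in the callback
    • Patroni runs callbacks asynchronously and may kill one that is still running when the next event fires
    • Slow DNS or load balancer updates → clients reach the new primary later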
  2. Don't rely on network during failover
    • Network may be partitioned
    • Have fallback mechanisms
  3. Don't fail the callback unnecessarily
    • Exit 0 even if notification fails
    • Log errors but continue
  4. Don't run database queries in callbacks
    • PostgreSQL may not be ready
    • Can cause deadlocks
  5. Don't modify PostgreSQL configuration
    • Let Patroni manage config
    • Use Patroni's parameters
  6. Don't use interactive commands
    • No user input
    • Must run unattended

8. Troubleshoot Callback Issues

8.1. Callback not executing

Check:

# 1. Verify script exists
ls -la /var/lib/postgresql/callbacks/on_role_change.sh

# 2. Check executable permissions
# Should be: -rwxr-xr-x postgres postgres
sudo chmod +x /var/lib/postgresql/callbacks/on_role_change.sh

# 3. Check ownership
sudo chown postgres:postgres /var/lib/postgresql/callbacks/on_role_change.sh

# 4. Verify path in patroni.yml
grep -A5 "callbacks:" /etc/patroni/patroni.yml

# 5. Check Patroni logs
sudo journalctl -u patroni -n 100 | grep -i callback

8.2. Callback failing

Check logs:

# Patroni logs
sudo journalctl -u patroni | grep "callback.*failed"

# Callback logs
sudo tail -f /var/log/patroni/callbacks.log

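# Test script manually, passing the same arguments Patroni would (action, role, cluster name)
sudo -u postgres /var/lib/postgresql/callbacks/on_role_change.sh on_role_change replica postgres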

Common issues:

  • Syntax error: Run bash -n script.sh to check
  • Missing dependency: Install required tools (curl, nc, etc.)
  • Permission denied: Check file/directory permissions
  • Timeout: Script taking too long
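
Missing dependencies are easiest to catch with a preflight check run as the postgres user, since the PATH of the Patroni service can differ from an interactive shell. A minimal sketch follows; missing_tools is a hypothetical helper and the tool list is only an example.

```bash
#!/bin/bash
# Sketch: report which required tools are missing from PATH.
# missing_tools is a hypothetical helper; the tool list is an example.
missing_tools() {
    local tool missing=""
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || missing="$missing $tool"
    done
    printf '%s' "${missing# }"      # strip the leading space
}

MISSING=$(missing_tools curl nc socat)
if [ -n "$MISSING" ]; then
    echo "Missing tools: $MISSING"
else
    echo "All required tools found"
fi
```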

8.3. Callback causing slow failover

Measure callback execution time:

# Add timing to script
START_TIME=$(date +%s)

# ... your callback logic ...

END_TIME=$(date +%s)
DURATION=$((END_TIME - START_TIME))
log "Callback completed in ${DURATION} seconds"

# If DURATION > 30, investigate and optimize
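
To find which step is slow rather than only the total, each step can be timed individually. The sketch below uses bash's built-in SECONDS counter (whole-second resolution); timed is a hypothetical wrapper that preserves the exit code of the wrapped command, and sleep 1 stands in for a real step.

```bash
#!/bin/bash
# Sketch: time an individual step using bash's built-in SECONDS counter.
# timed is a hypothetical wrapper; it preserves the wrapped command's exit code.
timed() {
    local label="$1"; shift
    local start=$SECONDS rc=0
    "$@" || rc=$?
    echo "$label took $((SECONDS - start))s (exit $rc)"
    return "$rc"
}

timed "dns_update" sleep 1    # sleep stands in for a real step
```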

9. Production Callback Template

Complete production-ready template:

#!/bin/bash
# Patroni callback template
# File: /var/lib/postgresql/callbacks/on_role_change.sh
# Invoked by Patroni as: on_role_change.sh <action> <role> <cluster-name>

set -uo pipefail  # Treat undefined vars as errors and report pipe failures; other errors are handled explicitly

# Configuration
readonly LOG_FILE="/var/log/patroni/callbacks.log"
readonly LOCK_DIR="/tmp/callback_role_change.lock"
readonly HTTP_TIMEOUT=10   # seconds, per outbound request
readonly SLACK_WEBHOOK="${SLACK_WEBHOOK_URL:-}"

# Patroni passes action, role and cluster name as positional arguments and does
# not export PATRONI_* variables for callbacks, so populate them here.
PATRONI_ROLE="${2:-unknown}"
PATRONI_SCOPE="${3:-unknown}"
PATRONI_NAME="${PATRONI_NAME:-$(hostname -s)}"
if [ "$PATRONI_ROLE" = "primary" ]; then
    PATRONI_ROLE="master"   # newer Patroni releases pass "primary"
fi

# The previous role is not passed; infer it from the marker files
if [ -e /tmp/postgres_is_primary ]; then
    PATRONI_OLD_ROLE="master"
elif [ -e /tmp/postgres_is_replica ]; then
    PATRONI_OLD_ROLE="replica"
else
    PATRONI_OLD_ROLE="unknown"
fi

# Logging function
log() {
    local LEVEL="$1"
    shift
    local TIMESTAMP
    TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$TIMESTAMP] [$LEVEL] [ROLE_CHANGE] $*" | tee -a "$LOG_FILE"
}

# Cleanup function (runs on exit via trap)
cleanup() {
    rmdir "$LOCK_DIR" 2>/dev/null || true
}

# Error handler
error_exit() {
    log "ERROR" "$1"
    exit 1
}

# Ensure only one instance runs (mkdir is atomic)
if ! mkdir "$LOCK_DIR" 2>/dev/null; then
    log "WARN" "Another callback instance is running, exiting"
    exit 0
fi

trap cleanup EXIT

log "INFO" "=========================================="
log "INFO" "Role change detected"
log "INFO" "Cluster: $PATRONI_SCOPE"
log "INFO" "Node: $PATRONI_NAME"
log "INFO" "Old role: $PATRONI_OLD_ROLE"
log "INFO" "New role: $PATRONI_ROLE"
log "INFO" "=========================================="

# Main logic
case "$PATRONI_ROLE" in
    master)
        log "INFO" "Handling promotion to PRIMARY"

        # Get node IP
        NODE_IP=$(hostname -I | awk '{print $1}')
        log "INFO" "Node IP: $NODE_IP"

        # Update DNS (implement your logic)
        # update_dns "$NODE_IP" || log "WARN" "DNS update failed"

        # Update load balancer (implement your logic)
        # update_load_balancer "$NODE_IP" || log "WARN" "LB update failed"

        # Send notification (bounded by HTTP_TIMEOUT so it cannot hang the callback)
        if [ -n "$SLACK_WEBHOOK" ]; then
            curl -s --max-time "$HTTP_TIMEOUT" -X POST "$SLACK_WEBHOOK" \
                -H "Content-Type: application/json" \
                -d "{\"text\": \"🚨 Failover: $PATRONI_NAME promoted to PRIMARY\"}" \
                || log "WARN" "Slack notification failed"
        fi

        # Set marker files
        touch /tmp/postgres_is_primary
        rm -f /tmp/postgres_is_replica

        log "INFO" "PRIMARY promotion tasks completed"
        ;;

    replica)
        log "INFO" "Handling demotion to REPLICA"

        # Remove primary marker
        rm -f /tmp/postgres_is_primary
        touch /tmp/postgres_is_replica

        # Notify if demoted from primary
        if [ "$PATRONI_OLD_ROLE" = "master" ]; then
            log "WARN" "Node demoted from PRIMARY to REPLICA"
            # Send alert
        fi

        log "INFO" "REPLICA tasks completed"
        ;;

    *)
        error_exit "Unknown role: $PATRONI_ROLE"
        ;;
esac

log "INFO" "Callback completed successfully"
exit 0

10. Lab Exercises

Lab 1: Setup basic callbacks

Tasks:

  1. Create callback directory and scripts
  2. Add callbacks to patroni.yml
  3. Reload Patroni
  4. Test with patronictl restart

Lab 2: Test failover callbacks

Tasks:

  1. Monitor callback logs: tail -f /var/log/patroni/callbacks.log
  2. Stop primary: sudo systemctl stop patroni
  3. Verify on_role_change executed on new primary
  4. Check marker files: /tmp/postgres_is_*
  5. Restart old primary, verify it rejoins as replica

Lab 3: Implement Slack notifications

Tasks:

  1. Get Slack webhook URL
  2. Add notification to on_role_change.sh
  3. Test by triggering failover
  4. Verify message received in Slack

Lab 4: Measure callback performance

Tasks:

  1. Add timing to all callbacks
  2. Trigger various events (restart, reload, failover)
  3. Analyze callback execution times
  4. Optimize slow callbacks

11. Summary

Key Takeaways

✅ Callbacks = Custom automation at lifecycle events

✅ on_role_change = Most critical callback for failover automation

✅ Keep callbacks fast (<30s) for quick failover

✅ Log everything for debugging

✅ Test thoroughly before production

✅ Handle errors gracefully - don't block operations

Common Use Cases

Callback | Common actions
on_start | Post-start checks, mount verification
on_stop | Cleanup, notifications
on_role_change | Update DNS, LB, send alerts
on_restart | Log maintenance events
on_reload | Verify config changes

Current architecture

✅ 3 VMs prepared (Lesson 4)
✅ PostgreSQL 18 installed (Lesson 5)
✅ etcd cluster running (Lesson 6)
✅ Patroni installed (Lesson 7)
✅ Patroni configured (Lesson 8)
✅ Cluster bootstrapped (Lesson 9)
✅ Replication configured (Lesson 10)
✅ Callbacks implemented (Lesson 11)

Next: REST API usage

Preparing for Lesson 12

Lesson 12 covers the Patroni REST API:

  • Health check endpoints
  • Cluster status queries
  • Configuration management via API
  • Integration with load balancers
  • Monitoring and metrics