Lesson 26: Automation with Ansible

Build Ansible playbooks for deployment, configuration management, automated testing, and CI/CD integration for a PostgreSQL HA cluster.

XDEV ASIA

Objectives

After this lesson, you will be able to:

  • Automate Patroni cluster deployment with Ansible
  • Manage configuration with playbooks
  • Implement automated testing
  • Integrate database changes into CI/CD
  • Use Infrastructure as Code principles

1. Ansible Basics for PostgreSQL

1.1. Install Ansible

# Install Ansible
sudo apt-get update
sudo apt-get install -y ansible

# Or via pip
pip3 install ansible

# Verify
ansible --version
# ansible [core 2.15.5]

1.2. Inventory file

# inventory.ini
[postgres_cluster]
pg-node1 ansible_host=10.0.1.11 ansible_user=ubuntu
pg-node2 ansible_host=10.0.1.12 ansible_user=ubuntu
pg-node3 ansible_host=10.0.1.13 ansible_user=ubuntu

[etcd_cluster]
# etcd runs on the PostgreSQL nodes; reuse the same host names so that
# Ansible treats each machine as one host instead of two
pg-node1
pg-node2
pg-node3

[all:vars]
ansible_python_interpreter=/usr/bin/python3
ansible_ssh_private_key_file=~/.ssh/id_rsa
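
Because the etcd and PostgreSQL groups point at the same machines, it can be worth checking that no address is listed under two different host names: Ansible treats every distinct name as a separate host and would run `hosts: all` plays twice on that machine. The following is a minimal sketch of such a check, not part of Ansible itself. The sample inventory text and the function name `find_shared_addresses` are illustrative assumptions.

```python
from collections import defaultdict

# Hypothetical sample that mirrors the layout of an INI inventory
SAMPLE_INVENTORY = """\
[postgres_cluster]
pg-node1 ansible_host=10.0.1.11 ansible_user=ubuntu
pg-node2 ansible_host=10.0.1.12 ansible_user=ubuntu

[etcd_cluster]
etcd-node1 ansible_host=10.0.1.11 ansible_user=ubuntu

[all:vars]
ansible_python_interpreter=/usr/bin/python3
"""


def find_shared_addresses(inventory_text):
    """Return {address: [host names]} for addresses used by more than one name."""
    names_by_address = defaultdict(set)
    in_vars_section = False
    for raw in inventory_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blanks and comments
        if line.startswith("["):
            # [group:vars] sections hold variables, not hosts
            in_vars_section = line.rstrip("]").endswith(":vars")
            continue
        if in_vars_section:
            continue
        name, *pairs = line.split()
        for pair in pairs:
            key, _, value = pair.partition("=")
            if key == "ansible_host":
                names_by_address[value].add(name)
    return {
        address: sorted(names)
        for address, names in names_by_address.items()
        if len(names) > 1
    }


if __name__ == "__main__":
    print(find_shared_addresses(SAMPLE_INVENTORY))
```

An empty result means every address is reachable under exactly one inventory name.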

1.3. Ansible configuration

# ansible.cfg
[defaults]
inventory = inventory.ini
host_key_checking = False
retry_files_enabled = False
gathering = smart
fact_caching = jsonfile
fact_caching_connection = /tmp/ansible_facts
fact_caching_timeout = 86400

[ssh_connection]
ssh_args = -o ControlMaster=auto -o ControlPersist=60s
pipelining = True

2. Complete Patroni Deployment Playbook

2.1. Main playbook

# site.yml
---
- name: Deploy PostgreSQL HA Cluster with Patroni
  hosts: all
  become: yes
  vars_files:
    - vars/main.yml
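  # Tags let --tags select individual roles; the haproxy and monitoring
  # roles are not shown in this lesson
  roles:
    - { role: common, tags: [common] }
    - { role: etcd, tags: [etcd] }
    - { role: postgresql, tags: [postgresql] }
    - { role: patroni, tags: [patroni] }
    - { role: haproxy, tags: [haproxy] }
    - { role: monitoring, tags: [monitoring] }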

2.2. Variables

# vars/main.yml
---
# PostgreSQL
postgresql_version: 18
postgresql_data_dir: /var/lib/postgresql/{{ postgresql_version }}/data
postgresql_bin_dir: /usr/lib/postgresql/{{ postgresql_version }}/bin

# Patroni
patroni_scope: postgres-cluster
patroni_namespace: /service/

# etcd
etcd_version: 3.5.11
etcd_data_dir: /var/lib/etcd
etcd_initial_cluster_token: etcd-cluster-token

# Cluster
cluster_nodes:
  - { name: pg-node1, ip: 10.0.1.11 }
  - { name: pg-node2, ip: 10.0.1.12 }
  - { name: pg-node3, ip: 10.0.1.13 }

# Passwords (use Ansible Vault in production!)
postgres_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  ...encrypted...
replicator_password: !vault |
  $ANSIBLE_VAULT;1.1;AES256
  ...encrypted...

2.3. Common role

# roles/common/tasks/main.yml
---
- name: Update apt cache
  apt:
    update_cache: yes
    cache_valid_time: 3600

- name: Install common packages
  apt:
    name:
      - curl
      - wget
      - vim
      - git
      - htop
      - net-tools
      - python3
      - python3-pip
    state: present

- name: Set timezone
  timezone:
    name: Asia/Ho_Chi_Minh

- name: Configure sysctl for PostgreSQL
  sysctl:
    name: "{{ item.name }}"
    value: "{{ item.value }}"
    state: present
    reload: yes
  loop:
    - { name: 'vm.swappiness', value: '1' }
    - { name: 'vm.overcommit_memory', value: '2' }
    - { name: 'vm.dirty_ratio', value: '10' }
    - { name: 'vm.dirty_background_ratio', value: '3' }
    - { name: 'net.ipv4.tcp_keepalive_time', value: '60' }
    - { name: 'net.ipv4.tcp_keepalive_intvl', value: '10' }
    - { name: 'net.ipv4.tcp_keepalive_probes', value: '6' }

- name: Set system limits
  pam_limits:
    domain: postgres
    limit_type: "{{ item.type }}"
    limit_item: "{{ item.item }}"
    value: "{{ item.value }}"
  loop:
    - { type: 'soft', item: 'nofile', value: '65536' }
    - { type: 'hard', item: 'nofile', value: '65536' }
    - { type: 'soft', item: 'nproc', value: '8192' }
    - { type: 'hard', item: 'nproc', value: '8192' }

2.4. etcd role

# roles/etcd/tasks/main.yml
---
- name: Create etcd user
  user:
    name: etcd
    shell: /bin/false
    system: yes
    home: "{{ etcd_data_dir }}"

- name: Download etcd
  get_url:
    url: "https://github.com/etcd-io/etcd/releases/download/v{{ etcd_version }}/etcd-v{{ etcd_version }}-linux-amd64.tar.gz"
    dest: /tmp/etcd.tar.gz

- name: Extract etcd
  unarchive:
    src: /tmp/etcd.tar.gz
    dest: /tmp
    remote_src: yes

- name: Install etcd binaries
  copy:
    src: "/tmp/etcd-v{{ etcd_version }}-linux-amd64/{{ item }}"
    dest: /usr/local/bin/{{ item }}
    mode: '0755'
    remote_src: yes
  loop:
    - etcd
    - etcdctl

- name: Create etcd data directory
  file:
    path: "{{ etcd_data_dir }}"
    state: directory
    owner: etcd
    group: etcd
    mode: '0755'

- name: Template etcd systemd service
  template:
    src: etcd.service.j2
    dest: /etc/systemd/system/etcd.service
  notify: restart etcd

- name: Start and enable etcd
  systemd:
    name: etcd
    state: started
    enabled: yes
    daemon_reload: yes
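
{# roles/etcd/templates/etcd.service.j2 #}
{# Note: --name must equal this node's name in cluster_nodes, so the machine hostname has to match it #}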
[Unit]
Description=etcd key-value store
Documentation=https://github.com/etcd-io/etcd
After=network.target

[Service]
Type=notify
User=etcd
ExecStart=/usr/local/bin/etcd \
  --name {{ ansible_hostname }} \
  --data-dir {{ etcd_data_dir }} \
  --initial-advertise-peer-urls http://{{ ansible_default_ipv4.address }}:2380 \
  --listen-peer-urls http://{{ ansible_default_ipv4.address }}:2380 \
  --listen-client-urls http://{{ ansible_default_ipv4.address }}:2379,http://127.0.0.1:2379 \
  --advertise-client-urls http://{{ ansible_default_ipv4.address }}:2379 \
  --initial-cluster-token {{ etcd_initial_cluster_token }} \
  --initial-cluster {% for node in cluster_nodes %}{{ node.name }}=http://{{ node.ip }}:2380{% if not loop.last %},{% endif %}{% endfor %} \
  --initial-cluster-state new
Restart=on-failure
RestartSec=5

[Install]
WantedBy=multi-user.target
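
The Jinja loop in the `--initial-cluster` line is easier to reason about when written out as plain code. The sketch below reproduces the same string in Python from the `cluster_nodes` values in `vars/main.yml`, and adds a small check that a node's `--name` appears in that list. The function names are illustrative, not part of etcd or Ansible.

```python
# Values copied from cluster_nodes in vars/main.yml
CLUSTER_NODES = [
    {"name": "pg-node1", "ip": "10.0.1.11"},
    {"name": "pg-node2", "ip": "10.0.1.12"},
    {"name": "pg-node3", "ip": "10.0.1.13"},
]


def initial_cluster(nodes, peer_port=2380):
    """Build the comma-separated value passed to etcd's --initial-cluster flag."""
    return ",".join(
        f"{node['name']}=http://{node['ip']}:{peer_port}" for node in nodes
    )


def name_is_declared(etcd_name, nodes):
    """True when the --name given to etcd appears in the cluster definition."""
    return any(node["name"] == etcd_name for node in nodes)


if __name__ == "__main__":
    print(initial_cluster(CLUSTER_NODES))
    print(name_is_declared("pg-node1", CLUSTER_NODES))
```

If `name_is_declared` returns False for a node, etcd on that node will not find itself in the initial cluster and will refuse to start.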

2.5. PostgreSQL role

# roles/postgresql/tasks/main.yml
---
- name: Add PostgreSQL apt key
  apt_key:
    url: https://www.postgresql.org/media/keys/ACCC4CF8.asc
    state: present

- name: Add PostgreSQL repository
  apt_repository:
    repo: "deb http://apt.postgresql.org/pub/repos/apt/ {{ ansible_distribution_release }}-pgdg main"
    state: present

- name: Install PostgreSQL
  apt:
    name:
      - "postgresql-{{ postgresql_version }}"
      - "postgresql-contrib-{{ postgresql_version }}"
      - "postgresql-server-dev-{{ postgresql_version }}"
    state: present
    update_cache: yes

- name: Stop and disable PostgreSQL (managed by Patroni)
  systemd:
    name: "postgresql@{{ postgresql_version }}-main"
    state: stopped
    enabled: no
  ignore_errors: yes

- name: Create PostgreSQL directories
  file:
    path: "{{ item }}"
    state: directory
    owner: postgres
    group: postgres
    mode: '0700'
  loop:
    - "{{ postgresql_data_dir }}"
    - /var/lib/postgresql/wal_archive
    - /var/lib/postgresql/backups

2.6. Patroni role

# roles/patroni/tasks/main.yml
---
- name: Install Python dependencies
  pip:
    name:
      - patroni[etcd3]  # etcd 3.5 serves only the v3 API by default
      - psycopg2-binary
      - python-etcd
    state: present
    executable: pip3

- name: Create Patroni configuration directory
  file:
    path: /etc/patroni
    state: directory
    owner: postgres
    group: postgres
    mode: '0755'

- name: Template Patroni configuration
  template:
    src: patroni.yml.j2
    dest: /etc/patroni/patroni.yml
    owner: postgres
    group: postgres
    mode: '0600'
  notify: restart patroni

- name: Template Patroni systemd service
  template:
    src: patroni.service.j2
    dest: /etc/systemd/system/patroni.service
  notify: restart patroni

- name: Start and enable Patroni
  systemd:
    name: patroni
    state: started
    enabled: yes
    daemon_reload: yes

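- name: Wait for Patroni to be ready
  wait_for:
    # The REST API binds to the node IP (see restapi.listen), not to 127.0.0.1
    host: "{{ ansible_default_ipv4.address }}"
    port: 8008
    timeout: 60

{# roles/patroni/templates/patroni.yml.j2 #}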
scope: {{ patroni_scope }}
name: {{ ansible_hostname }}

restapi:
  listen: {{ ansible_default_ipv4.address }}:8008
  connect_address: {{ ansible_default_ipv4.address }}:8008

# etcd 3.5 disables the v2 API by default, so use Patroni's etcd3 (v3 API) section
etcd3:
  hosts: {% for node in cluster_nodes %}{{ node.ip }}:2379{% if not loop.last %},{% endif %}{% endfor %}

bootstrap:
  dcs:
    ttl: 30
    loop_wait: 10
    retry_timeout: 10
    maximum_lag_on_failover: 1048576
    postgresql:
      use_pg_rewind: true
      use_slots: true
      parameters:
        max_connections: 100
        shared_buffers: 256MB
        effective_cache_size: 1GB
        maintenance_work_mem: 64MB
        checkpoint_completion_target: 0.9
        wal_buffers: 16MB
        default_statistics_target: 100
        random_page_cost: 1.1
        effective_io_concurrency: 200
        work_mem: 2621kB
        min_wal_size: 1GB
        max_wal_size: 4GB
        max_worker_processes: 4
        max_parallel_workers_per_gather: 2
        max_parallel_workers: 4
        max_parallel_maintenance_workers: 2
        wal_level: replica
        max_wal_senders: 10
        max_replication_slots: 10
        hot_standby: on
        archive_mode: on
        archive_command: 'test ! -f /var/lib/postgresql/wal_archive/%f && cp %p /var/lib/postgresql/wal_archive/%f'

  initdb:
    - encoding: UTF8
    - data-checksums

  pg_hba:
    - host replication replicator 0.0.0.0/0 scram-sha-256
    - host all all 0.0.0.0/0 scram-sha-256

postgresql:
  listen: 0.0.0.0:5432
  connect_address: {{ ansible_default_ipv4.address }}:5432
  data_dir: {{ postgresql_data_dir }}
  bin_dir: {{ postgresql_bin_dir }}
  authentication:
    # to_json quotes the values so special characters cannot break the YAML
    replication:
      username: replicator
      password: {{ replicator_password | to_json }}
    superuser:
      username: postgres
      password: {{ postgres_password | to_json }}

tags:
  nofailover: false
  noloadbalance: false
  clonefrom: false
  nosync: false
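
The `ttl`, `loop_wait`, and `retry_timeout` values in the `bootstrap.dcs` section are related: Patroni expects `loop_wait + 2 * retry_timeout <= ttl`, otherwise the leader key can expire while a healthy leader is still retrying. The sketch below checks that rule before a template is rendered. The function name is an illustrative assumption; the numbers are the ones used in this lesson.

```python
def timing_is_consistent(ttl, loop_wait, retry_timeout):
    """Check the rule loop_wait + 2 * retry_timeout <= ttl (all in seconds)."""
    return loop_wait + 2 * retry_timeout <= ttl


if __name__ == "__main__":
    # Values from the bootstrap.dcs section of patroni.yml.j2
    print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=10))
    # Raising retry_timeout without raising ttl breaks the rule
    print(timing_is_consistent(ttl=30, loop_wait=10, retry_timeout=15))
```

With the lesson's values the sum is exactly 30, so there is no headroom: any increase in `loop_wait` or `retry_timeout` needs a matching increase in `ttl`.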

3. Deployment

3.1. Run playbook

# Dry run (check mode)
ansible-playbook site.yml --check

# Execute
ansible-playbook site.yml

# With verbose output
ansible-playbook site.yml -vvv

# Specific tags
ansible-playbook site.yml --tags "postgresql,patroni"

3.2. Verify deployment

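# verify.yml
---
- name: Verify Patroni cluster
  hosts: postgres_cluster
  become: yes  # patroni.yml is mode 0600 and owned by postgres
  vars_files:
    - vars/main.yml  # provides postgres_password
  tasks:
    - name: Check Patroni service
      systemd:
        name: patroni
        state: started
      check_mode: yes  # report only; do not start the service here
      register: patroni_status
      failed_when: patroni_status.changed

    - name: Get cluster status
      command: patronictl -c /etc/patroni/patroni.yml list
      register: cluster_status
      changed_when: false

    - name: Display cluster status
      debug:
        var: cluster_status.stdout_lines

    - name: Check PostgreSQL connectivity
      # From the community.postgresql collection; needs psycopg2 on the node
      postgresql_ping:
        db: postgres
        login_host: localhost
        login_user: postgres
        login_password: "{{ postgres_password }}"
      become_user: postgres

ansible-playbook verify.yml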

4. Configuration Management

4.1. Dynamic configuration update

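# update_config.yml
---
- name: Update Patroni configuration
  hosts: postgres_cluster
  become: yes
  vars_files:
    - vars/main.yml  # provides patroni_scope
  vars:
    new_max_connections: 200
  tasks:
    - name: Update DCS configuration
      # --force skips the interactive confirmation prompt
      shell: |
        patronictl -c /etc/patroni/patroni.yml edit-config --force --apply - <<EOF
        postgresql:
          parameters:
            max_connections: {{ new_max_connections }}
        EOF
      run_once: true

    - name: Give Patroni time to pick up the change
      pause:
        seconds: 15

    - name: Check whether any member reports a pending restart
      command: patronictl -c /etc/patroni/patroni.yml list
      register: cluster_list
      changed_when: false
      run_once: true

    - name: Restart nodes one at a time if needed
      shell: patronictl -c /etc/patroni/patroni.yml restart {{ patroni_scope }} {{ ansible_hostname }} --force
      throttle: 1
      when: "'Pending restart' in cluster_list.stdout"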
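
Matching text in command output to detect a pending restart is brittle. A structured alternative is to parse the JSON form of the member list. The sketch below assumes that `patronictl list -f json` emits one object per member with `Member` and `Pending restart` keys; those key names and the sample payload are assumptions, so check them against the output of your Patroni version before relying on this.

```python
import json

# Hypothetical sample of `patronictl list -f json` output (key names assumed)
SAMPLE_LIST_JSON = """
[
  {"Cluster": "postgres-cluster", "Member": "pg-node1", "Role": "Leader",
   "State": "running", "Pending restart": "*"},
  {"Cluster": "postgres-cluster", "Member": "pg-node2", "Role": "Replica",
   "State": "streaming"}
]
"""


def members_pending_restart(list_json):
    """Return the names of members whose entry carries a pending-restart marker."""
    members = json.loads(list_json)
    return [m["Member"] for m in members if m.get("Pending restart")]


if __name__ == "__main__":
    print(members_pending_restart(SAMPLE_LIST_JSON))
```

The returned names could then drive a loop that restarts only the affected members, one at a time.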

4.2. Backup automation

# backup.yml
---
- name: Perform PostgreSQL backup
  # First host in the group, which is not necessarily the leader;
  # pg_basebackup also works against a replica
  hosts: postgres_cluster[0]
  become: yes
  become_user: postgres
  vars:
    backup_dir: /var/lib/postgresql/backups
    backup_retention_days: 7
  tasks:
    - name: Create backup directory
      file:
        path: "{{ backup_dir }}"
        state: directory
        mode: '0700'

    - name: Run pg_basebackup
      # No "creates:" here: a glob such as backup_* would match any earlier
      # backup, and the task would never run again
      shell: |
        pg_basebackup -D {{ backup_dir }}/backup_$(date +%Y%m%d_%H%M%S) \
          -Ft -z -Xs -P

    - name: Find backups older than the retention period
      find:
        paths: "{{ backup_dir }}"
        patterns: 'backup_*'
        file_type: directory
        age: "{{ backup_retention_days }}d"
      register: old_backups

    - name: Delete old backups
      file:
        path: "{{ item.path }}"
        state: absent
      loop: "{{ old_backups.files }}"
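
The retention rule can also be expressed in terms of the timestamp embedded in each backup directory name, which is independent of file modification times. The sketch below is one way to do that, assuming the `backup_YYYYmmdd_HHMMSS` naming used by the playbook. The function name and sample names are illustrative.

```python
from datetime import datetime, timedelta


def expired_backups(names, now, retention_days=7):
    """Return backup names whose embedded timestamp is older than the retention window."""
    cutoff = now - timedelta(days=retention_days)
    expired = []
    for name in names:
        try:
            taken = datetime.strptime(name, "backup_%Y%m%d_%H%M%S")
        except ValueError:
            continue  # ignore anything that does not follow the naming scheme
        if taken < cutoff:
            expired.append(name)
    return sorted(expired)


if __name__ == "__main__":
    names = ["backup_20240101_020000", "backup_20240109_020000", "notes.txt"]
    print(expired_backups(names, now=datetime(2024, 1, 10, 12, 0, 0)))
```

Unlike the `find` task, which looks at modification time, this version only trusts the name, so a backup that was copied or touched later is still aged correctly.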

5. Testing Automation

5.1. Molecule for testing

# Install Molecule (quotes keep the shell from expanding the brackets)
pip3 install molecule 'molecule-plugins[docker]'

# Initialize Molecule scenario
cd roles/patroni
molecule init scenario --driver-name docker

# roles/patroni/molecule/default/molecule.yml
---
dependency:
  name: galaxy
driver:
  name: docker
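# Note: the stock ubuntu image does not run systemd, so the systemd tasks in
# the role and in verify.yml need a systemd-enabled image or container setup
platforms: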
  - name: pg-node1
    image: ubuntu:22.04
    pre_build_image: true
  - name: pg-node2
    image: ubuntu:22.04
    pre_build_image: true
  - name: pg-node3
    image: ubuntu:22.04
    pre_build_image: true
provisioner:
  name: ansible
verifier:
  name: ansible

# roles/patroni/molecule/default/verify.yml
---
- name: Verify
  hosts: all
  tasks:
    - name: Check Patroni is running
      systemd:
        name: patroni
        state: started
      register: result
      failed_when: result.status.ActiveState != 'active'

    - name: Check cluster has leader
      shell: patronictl -c /etc/patroni/patroni.yml list | grep Leader
      register: leader_check
      failed_when: leader_check.rc != 0

# Run tests
molecule test

5.2. Testinfra for validation

pip3 install pytest-testinfra

# tests/test_patroni.py
# The "host" fixture is provided by the pytest-testinfra plugin.

def test_patroni_service(host):
    """Test Patroni service is running"""
    service = host.service("patroni")
    assert service.is_running
    assert service.is_enabled

def test_postgresql_port(host):
    """Test PostgreSQL port is listening"""
    assert host.socket("tcp://0.0.0.0:5432").is_listening

def test_patroni_rest_api(host):
    """Test Patroni REST API (it binds to the node IP, not 0.0.0.0)"""
    sockets = host.socket.get_listening_sockets()
    assert any(s.startswith("tcp://") and s.endswith(":8008") for s in sockets)

def test_etcd_connectivity(host):
    """Test etcd cluster health"""
    cmd = host.run("etcdctl endpoint health")
    assert cmd.rc == 0
    assert "healthy" in cmd.stdout

def test_cluster_has_leader(host):
    """Test cluster has exactly one leader"""
    cmd = host.run("patronictl -c /etc/patroni/patroni.yml list")
    assert cmd.rc == 0
    assert cmd.stdout.count("Leader") == 1

def test_replication_lag(host):
    """Test replication lag is low (one row per connected replica)"""
    cmd = host.run("sudo -u postgres psql -Atc \"SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), replay_lsn) FROM pg_stat_replication;\"")
    if cmd.rc == 0 and cmd.stdout.strip():
        for line in cmd.stdout.split():
            assert int(line) < 1048576  # < 1 MiB lag

# Run tests (without --hosts, testinfra checks the local machine)
pytest --hosts='ansible://postgres_cluster' --ansible-inventory=inventory.ini --sudo tests/test_patroni.py -v

6. CI/CD Integration

6.1. GitLab CI example

# .gitlab-ci.yml
stages:
  - lint
  - test
  - deploy_staging
  - deploy_production

variables:
  ANSIBLE_FORCE_COLOR: "true"

lint:
  stage: lint
  image: python:3.11
  before_script:
    - pip install ansible-lint yamllint
  script:
    - ansible-lint site.yml
    - yamllint .
  only:
    - merge_requests
    - main

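test:
  stage: test
  image: python:3.11
  # Molecule's docker driver needs a runner with access to a Docker daemon
  before_script:
    - pip install molecule 'molecule-plugins[docker]' pytest-testinfra
  script:
    - cd roles/patroni  # the scenario lives under the role directory
    - molecule test
  only:
    - merge_requests
    - main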

deploy_staging:
  stage: deploy_staging
  image: python:3.11
  before_script:
    - pip install ansible
    - eval $(ssh-agent -s)
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
  script:
    - ansible-playbook -i inventory/staging.ini site.yml
  only:
    - main
  environment:
    name: staging

deploy_production:
  stage: deploy_production
  image: python:3.11
  before_script:
    - pip install ansible
    - eval $(ssh-agent -s)
    - echo "$SSH_PRIVATE_KEY" | tr -d '\r' | ssh-add -
  script:
    - ansible-playbook -i inventory/production.ini site.yml
  only:
    - tags
  when: manual
  environment:
    name: production

6.2. GitHub Actions example

# .github/workflows/deploy.yml
name: Deploy Patroni Cluster

on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install ansible-lint yamllint
      - name: Run linters
        run: |
          ansible-lint site.yml
          yamllint .

  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install Molecule
        run: |
          pip install molecule 'molecule-plugins[docker]'
      - name: Run Molecule tests
        working-directory: roles/patroni
        run: |
          molecule test

  deploy_staging:
    needs: [lint, test]
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install Ansible
        run: pip install ansible
      - name: Deploy to staging
        env:
          ANSIBLE_HOST_KEY_CHECKING: False
        run: |
          echo "${{ secrets.SSH_PRIVATE_KEY }}" > private_key
          chmod 600 private_key
          ansible-playbook -i inventory/staging.ini site.yml --private-key=private_key

7. Best Practices

✅ DO

  1. Use Ansible Vault - Encrypt secrets
  2. Idempotent playbooks - Can run multiple times
  3. Test in Molecule - Before production
  4. Version control - Git for all playbooks
  5. Document variables - Clear README
  6. Use roles - Modular organization
  7. Tag tasks - Selective execution
  8. CI/CD integration - Automated testing
  9. Dry runs - Always --check first
  10. Backup before changes - Safety net
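
Idempotence (item 2) can be checked mechanically: run the playbook twice and confirm that the second run reports `changed=0` and `failed=0` for every host. The sketch below parses the PLAY RECAP section of `ansible-playbook` output to make that check. The sample recap text and function names are illustrative assumptions, not part of Ansible.

```python
import re

# One recap line looks like: "host : ok=12 changed=0 unreachable=0 failed=0 ..."
RECAP_LINE = re.compile(r"^(?P<host>\S+)\s*:\s*(?P<counters>(?:\w+=\d+\s*)+)$")

# Hypothetical output of a second playbook run
SECOND_RUN = """\
PLAY RECAP *********************************************************************
pg-node1                   : ok=12   changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
pg-node2                   : ok=12   changed=0    unreachable=0    failed=0    skipped=1    rescued=0    ignored=0
"""


def parse_recap(output):
    """Return {host: {counter: value}} for every recap line in the output."""
    results = {}
    for line in output.splitlines():
        match = RECAP_LINE.match(line.strip())
        if match:
            pairs = (item.split("=") for item in match["counters"].split())
            results[match["host"]] = {key: int(value) for key, value in pairs}
    return results


def is_idempotent(output):
    """True when at least one host is reported and none changed or failed."""
    recap = parse_recap(output)
    return bool(recap) and all(
        counters.get("changed", 1) == 0 and counters.get("failed", 1) == 0
        for counters in recap.values()
    )


if __name__ == "__main__":
    print(is_idempotent(SECOND_RUN))
```

In a CI job, the captured output of the second run could be passed to `is_idempotent` and the job failed when it returns False.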

❌ DON'T

  1. Don't hardcode secrets - Use Vault
  2. Don't skip testing - Staging first
  3. Don't use shell when module exists - Use PostgreSQL modules
  4. Don't ignore failed tasks - Handle errors
  5. Don't run without backups - Always backup first

8. Lab Exercises

Lab 1: Deploy cluster with Ansible

Tasks:

  1. Setup inventory for 3 nodes
  2. Create playbook with roles
  3. Deploy etcd cluster
  4. Deploy PostgreSQL + Patroni
  5. Verify cluster health

Lab 2: Configuration management

Tasks:

  1. Update max_connections via playbook
  2. Automate nightly backups
  3. Create playbook for DR failover
  4. Test configuration rollback
  5. Document all playbooks

Lab 3: CI/CD pipeline

Tasks:

  1. Setup GitLab/GitHub Actions
  2. Add linting stage
  3. Add Molecule testing
  4. Deploy to staging automatically
  5. Manual approval for production

Lab 4: Testing with Molecule

Tasks:

  1. Initialize Molecule scenario
  2. Write verification tests
  3. Test role in Docker containers
  4. Validate cluster functionality
  5. Integrate into CI pipeline

9. Summary

Automation Benefits

Manual vs automated (illustrative figures):
- Deployment time: 4 hours → 15 minutes
- Error rate: 30% → < 1%
- Consistency: Variable → 100%
- Documentation: Outdated → Self-documenting
- Repeatability: Difficult → Trivial

Key Ansible Concepts

Inventory: Define hosts
Playbooks: Define tasks
Roles: Modular organization
Variables: Configuration data
Vault: Secret management
Modules: Reusable components
Handlers: Triggered actions
Tags: Selective execution

Next Steps

Lesson 27 covers Disaster Recovery Drills:

  • DR planning procedures
  • Testing methodologies
  • Incident response workflows
  • Post-mortem analysis
  • Full DR simulation labs