PGVIP — Аварийное переключение на standby PostgreSQL
Более подробно инcтрумент описан на отдельном сайте pgvip.org


Описание

PGVIP - служба аварийного однократного переключения (auto failover) с основной(master) на резервную(standby) базу данных PostgreSQL с использованием виртуального(плавающего) IP-адреса

Служба устанавливается на один сервер, являющийся единой точной отказа (nginx proxy, сервер приложений 1С), или на два любых сервера (далее по тексту web-сервера). На серверах баз данных устанавливать что-либо не требуется, web-сервер подключается к ним по ssh и выполняет команды из конфигурационного файла. Первичную настройку резервной базы данных и обратное переключение администратор выполняет вручную.

Схемы аварийного переключения:
для одного web-сервера
Failover schema for one web server


для нескольких web-серверов
Failover schema for multiple web servers


Технически служба PGVIP реализована как обертка(wrapper), которая выполняет команды операционной системы из конфигурационного файла. Изначально была реализована как bash-скрипт, потом переписана на Си.

Условия, положенные в основу разработки:
более подробно в разделе Ключевые особенности

Основные действия, выполняемые службой pgvip: более подробно в разделе Действия


Пример переключения

Состояние (pgvip status)
[root@web-server ~]# pgvip status
status: SUCCESS (duration: 0 00:00:32)

+---------+--------------+-------------------------+---------------+---------------+
| Role    | IP address   | Database state          | 192.168.56.10 | VIP auto down |
+---------+--------------+-------------------------+---------------+---------------+
| master  | 192.168.56.1 | read-write              | up on enp0s3  | executing     |
| standby | 192.168.56.2 | in recovery, lag 0 min. | down          | executing     |
+---------+--------------+-------------------------+---------------+---------------+


[root@web-server ~]# pgvip status
status: ERROR (duration: 0 00:00:14, cause: master db is not read-write)

+---------+--------------+-------------------------+---------------+---------------+
| Role    | IP address   | Database state          | 192.168.56.10 | VIP auto down |
+---------+--------------+-------------------------+---------------+---------------+
| master  | 192.168.56.1 | not available           | not available | not available |
| standby | 192.168.56.2 | in recovery, lag 0 min. | down          | executing     |
+---------+--------------+-------------------------+---------------+---------------+


[root@web-server ~]# pgvip status
status: STANDBY_PROMOTED (duration: 0 00:00:06)

+---------+--------------+----------------+---------------+---------------+
| Role    | IP address   | Database state | 192.168.56.10 | VIP auto down |
+---------+--------------+----------------+---------------+---------------+
| master  | 192.168.56.1 | not available  | not available | not available |
| standby | 192.168.56.2 | read-write     | up on enp0s3  | executing     |
+---------+--------------+----------------+---------------+---------------+


Системный журнал (journalctl -fu pgvip or tail -f /var/log/pgsuite/pgvip.log)
[root@web-server ~]# journalctl -fu pgvip
Feb 08 09:49:17 web-server systemd[1]: Started pgvip (PostgreSQL master-standby auto failover).
Feb 08 09:49:17 web-server systemd[1]: Starting pgvip (PostgreSQL master-standby auto failover)...
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      thread started, thread_id: 2718
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ADMINISTRATION       thread started, thread_id: 2719
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ADMINISTRATION       binded to unix socket "/tmp/.s.PGVIP.0"
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ADMINISTRATION       listening
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_MASTER_DB    thread started, thread_id: 2714
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_MASTER_VIP   thread started, thread_id: 2715
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_STANDBY_DB   thread started, thread_id: 2716
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_STANDBY_VIP  thread started, thread_id: 2717
...
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    PGSUITE-905 OS command executed with error (errno: 124)
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    command:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    timeout 5 ssh 192.168.56.1 "su - postgres -c "psql -Aqtc 'select pg_is_in_recovery() or pg_current_xact_id() is null, extract(epoch from now()-pg_last_xact_replay_timestamp())'""
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    output:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   PGSUITE-905 OS command executed with error (errno: 124)
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   command:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   timeout 5 ssh 192.168.56.1 "flock /tmp/pgvip.time -c 'date +%s > /tmp/pgvip.time' ; flock -n /tmp/pgvip.vip_auto_down -c '' ; echo -n $?, ; ip address show | awk '/ inet 192.168.56.10\// {print $NF}'"
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   output:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP
...
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing action "action_standby_db_promote"
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      condition: (time_now()-status.value_time)>g_time_standby_promote_delay_int && status.master_db_state!=G_DB_STATE_READ_WRITE && status.standby_db_state==G_DB_STATE_IN_RECOVERY && status.standby_db_lag<=g_time_standby_allowable_lag_int
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      status: ERROR (duration: 0 00:00:41, cause: master db is not read-write)
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+-------------------------+---------------+---------------+
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | Role    | IP address   | Database state          | 192.168.56.10 | VIP auto down |
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+-------------------------+---------------+---------------+
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | master  | 192.168.56.1 | not available           | not available | not available |
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | standby | 192.168.56.2 | in recovery, lag 0 min. | down          | executing     |
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+-------------------------+---------------+---------------+
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.1 "ip address del 192.168.56.10/24 dev `ip address show | awk '/ inet 192.168.56.10\// {print $NF}'`"
Feb 08 09:50:59 web-server pgvip[2713]: WARN  ACTION_EXECUTOR      PGSUITE-905 OS command executed with error (errno: 255)
Feb 08 09:50:59 web-server pgvip[2713]: WARN  ACTION_EXECUTOR      output:
Feb 08 09:50:59 web-server pgvip[2713]: WARN  ACTION_EXECUTOR
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.2 "su - postgres -c "\$(pg_config --bindir)/pg_ctl promote -W -D \$(psql -Aqtc 'show data_directory')""
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      output:
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      server promoting
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      command executed successfully
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.2 "ip address add 192.168.56.10/24 dev `ip address show | awk '/ inet 192.168.56.2\// {print $NF}'`"
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      output:
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      command executed successfully
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      action "action_standby_db_promote" executed successfully
...
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing action "action_master_db_break"
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      condition: status.master_db_state==G_DB_STATE_READ_WRITE && status.standby_db_state==G_DB_STATE_READ_WRITE
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      status: STANDBY_PROMOTED (duration: 0 00:01:25)
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+----------------+---------------+---------------+
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | Role    | IP address   | Database state | 192.168.56.10 | VIP auto down |
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+----------------+---------------+---------------+
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | master  | 192.168.56.1 | read-write     | down          | executing     |
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | standby | 192.168.56.2 | read-write     | up on enp0s3  | executing     |
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+----------------+---------------+---------------+
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.1 "su - postgres -c "psql -c 'alter system set listen_addresses=pgvip_broken'" ; ps -e | awk '$4=="postmaster" || $4=="postgres" {print $1}' | xargs -r kill -9"
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      output:
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      ALTER SYSTEM
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      command executed successfully
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      action "action_master_db_break" executed successfully