PGVIP — Auto failover to standby PostgreSQL server
Tool is described in more detail on a separate website pgvip.org


Description

PGVIP - auto failover service from master to standby PostgreSQL database using virtual (floating) IP address

The service is installed on one server, which is a single point of failure (example, nginx proxy), or on any two servers (hereinafter referred to as web servers). There is no need to install anything on the database servers; the web server connects to them via ssh and executes commands from the configuration file. Administrator performs initial setup of the standby database and reverse switch manually.

Failover schemas:
for one web server
Failover schema for one web server


for multiple web servers
Failover schema for multiple web servers


Technically, the PGVIP service is implemented as a wrapper that executes operating system commands from a configuration file. It was originally developed as a bash script, then rewritten in C language.

Conditions underling the development:
more details in the section Important qualities

Main actions performed by the pgvip service: more details in the section Actions


Failover example

Status (pgvip status)
[root@web-server ~]# pgvip status
status: SUCCESS (duration: 0 00:00:32)

+---------+--------------+-------------------------+---------------+---------------+
| Role    | IP address   | Database state          | 192.168.56.10 | VIP auto down |
+---------+--------------+-------------------------+---------------+---------------+
| master  | 192.168.56.1 | read-write              | up on enp0s3  | executing     |
| standby | 192.168.56.2 | in recovery, lag 0 min. | down          | executing     |
+---------+--------------+-------------------------+---------------+---------------+


[root@web-server ~]# pgvip status
status: ERROR (duration: 0 00:00:14, cause: master db is not read-write)

+---------+--------------+-------------------------+---------------+---------------+
| Role    | IP address   | Database state          | 192.168.56.10 | VIP auto down |
+---------+--------------+-------------------------+---------------+---------------+
| master  | 192.168.56.1 | not available           | not available | not available |
| standby | 192.168.56.2 | in recovery, lag 0 min. | down          | executing     |
+---------+--------------+-------------------------+---------------+---------------+


[root@web-server ~]# pgvip status
status: STANDBY_PROMOTED (duration: 0 00:00:06)

+---------+--------------+----------------+---------------+---------------+
| Role    | IP address   | Database state | 192.168.56.10 | VIP auto down |
+---------+--------------+----------------+---------------+---------------+
| master  | 192.168.56.1 | not available  | not available | not available |
| standby | 192.168.56.2 | read-write     | up on enp0s3  | executing     |
+---------+--------------+----------------+---------------+---------------+


System journal (journalctl -fu pgvip or tail -f /var/log/pgsuite/pgvip.log)
[root@web-server ~]# journalctl -fu pgvip
Feb 08 09:49:17 web-server systemd[1]: Started pgvip (PostgreSQL master-standby auto failover).
Feb 08 09:49:17 web-server systemd[1]: Starting pgvip (PostgreSQL master-standby auto failover)...
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      thread started, thread_id: 2718
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ADMINISTRATION       thread started, thread_id: 2719
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ADMINISTRATION       binded to unix socket "/tmp/.s.PGVIP.0"
Feb 08 09:49:17 web-server pgvip[2713]: INFO  ADMINISTRATION       listening
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_MASTER_DB    thread started, thread_id: 2714
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_MASTER_VIP   thread started, thread_id: 2715
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_STANDBY_DB   thread started, thread_id: 2716
Feb 08 09:49:17 web-server pgvip[2713]: INFO  CHECKER_STANDBY_VIP  thread started, thread_id: 2717
...
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    PGSUITE-905 OS command executed with error (errno: 124)
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    command:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    timeout 5 ssh 192.168.56.1 "su - postgres -c "psql -Aqtc 'select pg_is_in_recovery() or pg_current_xact_id() is null, extract(epoch from now()-pg_last_xact_replay_timestamp())'""
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB    output:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_DB
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   PGSUITE-905 OS command executed with error (errno: 124)
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   command:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   timeout 5 ssh 192.168.56.1 "flock /tmp/pgvip.time -c 'date +%s > /tmp/pgvip.time' ; flock -n /tmp/pgvip.vip_auto_down -c '' ; echo -n $?, ; ip address show | awk '/ inet 192.168.56.10\// {print $NF}'"
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP   output:
Feb 08 09:50:15 web-server pgvip[2713]: WARN  CHECKER_MASTER_VIP
...
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing action "action_standby_db_promote"
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      condition: (time_now()-status.value_time)>g_time_standby_promote_delay_int && status.master_db_state!=G_DB_STATE_READ_WRITE && status.standby_db_state==G_DB_STATE_IN_RECOVERY && status.standby_db_lag<=g_time_standby_allowable_lag_int
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      status: ERROR (duration: 0 00:00:41, cause: master db is not read-write)
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+-------------------------+---------------+---------------+
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | Role    | IP address   | Database state          | 192.168.56.10 | VIP auto down |
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+-------------------------+---------------+---------------+
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | master  | 192.168.56.1 | not available           | not available | not available |
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | standby | 192.168.56.2 | in recovery, lag 0 min. | down          | executing     |
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+-------------------------+---------------+---------------+
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:50:56 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.1 "ip address del 192.168.56.10/24 dev `ip address show | awk '/ inet 192.168.56.10\// {print $NF}'`"
Feb 08 09:50:59 web-server pgvip[2713]: WARN  ACTION_EXECUTOR      PGSUITE-905 OS command executed with error (errno: 255)
Feb 08 09:50:59 web-server pgvip[2713]: WARN  ACTION_EXECUTOR      output:
Feb 08 09:50:59 web-server pgvip[2713]: WARN  ACTION_EXECUTOR
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.2 "su - postgres -c "\$(pg_config --bindir)/pg_ctl promote -W -D \$(psql -Aqtc 'show data_directory')""
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      output:
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      server promoting
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      command executed successfully
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:50:59 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.2 "ip address add 192.168.56.10/24 dev `ip address show | awk '/ inet 192.168.56.2\// {print $NF}'`"
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      output:
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      command executed successfully
Feb 08 09:51:00 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      action "action_standby_db_promote" executed successfully
...
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing action "action_master_db_break"
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      condition: status.master_db_state==G_DB_STATE_READ_WRITE && status.standby_db_state==G_DB_STATE_READ_WRITE
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      status: STANDBY_PROMOTED (duration: 0 00:01:25)
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+----------------+---------------+---------------+
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | Role    | IP address   | Database state | 192.168.56.10 | VIP auto down |
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+----------------+---------------+---------------+
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | master  | 192.168.56.1 | read-write     | down          | executing     |
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      | standby | 192.168.56.2 | read-write     | up on enp0s3  | executing     |
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      +---------+--------------+----------------+---------------+---------------+
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      executing command:
Feb 08 09:52:29 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      timeout 5 ssh 192.168.56.1 "su - postgres -c "psql -c 'alter system set listen_addresses=pgvip_broken'" ; ps -e | awk '$4=="postmaster" || $4=="postgres" {print $1}' | xargs -r kill -9"
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      output:
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      ALTER SYSTEM
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      command executed successfully
Feb 08 09:52:30 web-server pgvip[2713]: INFO  ACTION_EXECUTOR      action "action_master_db_break" executed successfully