PGVIP — Auto failover to standby PostgreSQL server
Tool is described in more detail on a separate website pgvip.org
PGVIP - auto failover service from master to standby PostgreSQL database using virtual (floating) IP address
The service is installed on one server, which is a single point of failure (example, nginx proxy), or on any two servers (hereinafter referred to as web servers). There is no need to install anything on the database servers; the web server connects to them via ssh and executes commands from the configuration file. Administrator performs initial setup of the standby database and reverse switch manually.
Failover schemas:
for one web server

for multiple web servers

Technically, the PGVIP service is implemented as a wrapper that executes operating system commands from a configuration file. It was originally developed as a bash script, then rewritten in C language.
Conditions underling the development:
- small projects in which installing a cluster (for example, Patroni) is impractical
- no requirements for a standby database
- the administrator manually creates a standby database and after failover switches it back using standard tools (pg_basebackup)
- all commands are customizable; the administrator can run any command on the command line to check it
- direct connection to the master database without intermediate layers, VIP is activated (up) on the network interface with a physical address
- compilation with one command without configuration and libraries
- actions are performed only depending on the current state, this allows to avoid saving information about the state of databases and not synchronizing web servers
Main actions performed by the pgvip service:
- checking the availability of the master database and the lag of the standby database, the status of the virtual IP address (up/down)
- managing a virtual IP address, running a script to automatically turn it off (necessary if the connection is lost)
- activation (promotion) of a standby database and transfer of a virtual IP address
Status (pgvip status)
[root@web-server ~]# pgvip status status: SUCCESS (duration: 0 00:00:32) +---------+--------------+-------------------------+---------------+---------------+ | Role | IP address | Database state | 192.168.56.10 | VIP auto down | +---------+--------------+-------------------------+---------------+---------------+ | master | 192.168.56.1 | read-write | up on enp0s3 | executing | | standby | 192.168.56.2 | in recovery, lag 0 min. | down | executing | +---------+--------------+-------------------------+---------------+---------------+ [root@web-server ~]# pgvip status status: ERROR (duration: 0 00:00:14, cause: master db is not read-write) +---------+--------------+-------------------------+---------------+---------------+ | Role | IP address | Database state | 192.168.56.10 | VIP auto down | +---------+--------------+-------------------------+---------------+---------------+ | master | 192.168.56.1 | not available | not available | not available | | standby | 192.168.56.2 | in recovery, lag 0 min. | down | executing | +---------+--------------+-------------------------+---------------+---------------+ [root@web-server ~]# pgvip status status: STANDBY_PROMOTED (duration: 0 00:00:06) +---------+--------------+----------------+---------------+---------------+ | Role | IP address | Database state | 192.168.56.10 | VIP auto down | +---------+--------------+----------------+---------------+---------------+ | master | 192.168.56.1 | not available | not available | not available | | standby | 192.168.56.2 | read-write | up on enp0s3 | executing | +---------+--------------+----------------+---------------+---------------+
System journal (journalctl -fu pgvip or tail -f /var/log/pgsuite/pgvip.log)
[root@web-server ~]# journalctl -fu pgvip Feb 08 09:49:17 web-server systemd[1]: Started pgvip (PostgreSQL master-standby auto failover). Feb 08 09:49:17 web-server systemd[1]: Starting pgvip (PostgreSQL master-standby auto failover)... Feb 08 09:49:17 web-server pgvip[2713]: INFO ACTION_EXECUTOR thread started, thread_id: 2718 Feb 08 09:49:17 web-server pgvip[2713]: INFO ADMINISTRATION thread started, thread_id: 2719 Feb 08 09:49:17 web-server pgvip[2713]: INFO ADMINISTRATION binded to unix socket "/tmp/.s.PGVIP.0" Feb 08 09:49:17 web-server pgvip[2713]: INFO ADMINISTRATION listening Feb 08 09:49:17 web-server pgvip[2713]: INFO CHECKER_MASTER_DB thread started, thread_id: 2714 Feb 08 09:49:17 web-server pgvip[2713]: INFO CHECKER_MASTER_VIP thread started, thread_id: 2715 Feb 08 09:49:17 web-server pgvip[2713]: INFO CHECKER_STANDBY_DB thread started, thread_id: 2716 Feb 08 09:49:17 web-server pgvip[2713]: INFO CHECKER_STANDBY_VIP thread started, thread_id: 2717 ... Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_DB PGSUITE-905 OS command executed with error (errno: 124) Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_DB command: Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_DB timeout 5 ssh 192.168.56.1 "su - postgres -c "psql -Aqtc 'select pg_is_in_recovery() or pg_current_xact_id() is null, extract(epoch from now()-pg_last_xact_replay_timestamp())'"" Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_DB output: Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_DB Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_VIP PGSUITE-905 OS command executed with error (errno: 124) Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_VIP command: Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_VIP timeout 5 ssh 192.168.56.1 "flock /tmp/pgvip.time -c 'date +%s > /tmp/pgvip.time' ; flock -n /tmp/pgvip.vip_auto_down -c '' ; echo -n $?, ; ip address show | awk '/ inet 192.168.56.10\// {print $NF}'" Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_VIP output: Feb 08 09:50:15 web-server pgvip[2713]: WARN CHECKER_MASTER_VIP ... Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR executing action "action_standby_db_promote" Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR condition: (time_now()-status.value_time)>g_time_standby_promote_delay_int && status.master_db_state!=G_DB_STATE_READ_WRITE && status.standby_db_state==G_DB_STATE_IN_RECOVERY && status.standby_db_lag<=g_time_standby_allowable_lag_int Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR status: ERROR (duration: 0 00:00:41, cause: master db is not read-write) Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR +---------+--------------+-------------------------+---------------+---------------+ Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR | Role | IP address | Database state | 192.168.56.10 | VIP auto down | Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR +---------+--------------+-------------------------+---------------+---------------+ Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR | master | 192.168.56.1 | not available | not available | not available | Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR | standby | 192.168.56.2 | in recovery, lag 0 min. | down | executing | Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR +---------+--------------+-------------------------+---------------+---------------+ Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR executing command: Feb 08 09:50:56 web-server pgvip[2713]: INFO ACTION_EXECUTOR timeout 5 ssh 192.168.56.1 "ip address del 192.168.56.10/24 dev `ip address show | awk '/ inet 192.168.56.10\// {print $NF}'`" Feb 08 09:50:59 web-server pgvip[2713]: WARN ACTION_EXECUTOR PGSUITE-905 OS command executed with error (errno: 255) Feb 08 09:50:59 web-server pgvip[2713]: WARN ACTION_EXECUTOR output: Feb 08 09:50:59 web-server pgvip[2713]: WARN ACTION_EXECUTOR Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR executing command: Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR timeout 5 ssh 192.168.56.2 "su - postgres -c "\$(pg_config --bindir)/pg_ctl promote -W -D \$(psql -Aqtc 'show data_directory')"" Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR output: Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR server promoting Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR command executed successfully Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR executing command: Feb 08 09:50:59 web-server pgvip[2713]: INFO ACTION_EXECUTOR timeout 5 ssh 192.168.56.2 "ip address add 192.168.56.10/24 dev `ip address show | awk '/ inet 192.168.56.2\// {print $NF}'`" Feb 08 09:51:00 web-server pgvip[2713]: INFO ACTION_EXECUTOR output: Feb 08 09:51:00 web-server pgvip[2713]: INFO ACTION_EXECUTOR Feb 08 09:51:00 web-server pgvip[2713]: INFO ACTION_EXECUTOR command executed successfully Feb 08 09:51:00 web-server pgvip[2713]: INFO ACTION_EXECUTOR action "action_standby_db_promote" executed successfully ... Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR executing action "action_master_db_break" Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR condition: status.master_db_state==G_DB_STATE_READ_WRITE && status.standby_db_state==G_DB_STATE_READ_WRITE Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR status: STANDBY_PROMOTED (duration: 0 00:01:25) Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR +---------+--------------+----------------+---------------+---------------+ Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR | Role | IP address | Database state | 192.168.56.10 | VIP auto down | Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR +---------+--------------+----------------+---------------+---------------+ Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR | master | 192.168.56.1 | read-write | down | executing | Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR | standby | 192.168.56.2 | read-write | up on enp0s3 | executing | Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR +---------+--------------+----------------+---------------+---------------+ Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR executing command: Feb 08 09:52:29 web-server pgvip[2713]: INFO ACTION_EXECUTOR timeout 5 ssh 192.168.56.1 "su - postgres -c "psql -c 'alter system set listen_addresses=pgvip_broken'" ; ps -e | awk '$4=="postmaster" || $4=="postgres" {print $1}' | xargs -r kill -9" Feb 08 09:52:30 web-server pgvip[2713]: INFO ACTION_EXECUTOR output: Feb 08 09:52:30 web-server pgvip[2713]: INFO ACTION_EXECUTOR ALTER SYSTEM Feb 08 09:52:30 web-server pgvip[2713]: INFO ACTION_EXECUTOR Feb 08 09:52:30 web-server pgvip[2713]: INFO ACTION_EXECUTOR command executed successfully Feb 08 09:52:30 web-server pgvip[2713]: INFO ACTION_EXECUTOR action "action_master_db_break" executed successfully