
Implementing High‑Availability PostgreSQL with Keepalived: Architecture, Setup, and Failover Procedures

This article explains how to use Keepalived together with PostgreSQL to build a two‑node high‑availability cluster, covering Keepalived's VRRP mechanism, host planning, installation steps, asynchronous master‑slave replication configuration, monitoring scripts, and detailed failover drills.


1. Keepalived Working Principle

Keepalived provides VRRP and health‑check functions, enabling a floating virtual IP (VIP) for a two‑node high‑availability solution.

Keepalived implements the Virtual Router Redundancy Protocol (VRRP): the participating routers form a virtual router group with one master and one or more backups. The master periodically sends VRRP advertisements and answers for the VIP; if the backups stop hearing those advertisements, a new master is elected (highest priority wins, with ties broken by the highest IP address) and takes over the VIP.
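The election rule can be sketched in a few lines of shell (a simplified illustration of the VRRP rule, not Keepalived's actual implementation):

```shell
#!/bin/bash
# Simplified model of VRRP master election: highest priority wins,
# and on a priority tie the highest IP address wins.
elect_master() {
  # Arguments: "priority:ip" pairs, e.g. "100:192.168.254.128"
  printf '%s\n' "$@" |
    sort -t: -k1,1nr -k2,2Vr |   # priority descending, then IP descending
    head -n1 |
    cut -d: -f2
}
elect_master "100:192.168.254.128" "90:192.168.254.129"
# -> 192.168.254.128
```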

Download links: https://www.keepalived.org/download.html https://github.com/acassen/keepalived

2. Architecture

2.1 Host Planning

Role     Host IP            VIP              Software
Primary  192.168.254.128    192.168.254.11   PostgreSQL, Keepalived
Standby  192.168.254.129    192.168.254.11   PostgreSQL, Keepalived

2.2 Architecture Diagram

(Diagram: clients connect to the VIP 192.168.254.11, which Keepalived floats between the primary 192.168.254.128 and the standby 192.168.254.129; the two nodes are linked by asynchronous streaming replication.)

3. Environment Preparation

3.1 Disable Firewall and SELinux

[root@localhost ~]# systemctl stop firewalld
[root@localhost ~]# systemctl disable firewalld
[root@localhost keepalived]# vi /etc/selinux/config
SELINUX=disabled
[root@localhost keepalived]# getenforce
Disabled

3.2 Install Keepalived

[root@localhost ~]# yum install gcc openssl openssl-devel libnl libnl-devel ipvsadm -y
[root@localhost ~]# cd /usr/local/src/
[root@localhost src]# tar -zxf keepalived-2.2.4.tar.gz
[root@localhost src]# cd keepalived-2.2.4
[root@localhost keepalived-2.2.4]# ./configure --prefix=/usr/local/keepalived
[root@localhost keepalived-2.2.4]# make && make install
[root@localhost keepalived]# ln -s /usr/local/src/keepalived-2.2.4/keepalived/etc/init.d/keepalived /etc/init.d/
[root@localhost keepalived]# ln -s /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
[root@localhost keepalived]# ln -s /usr/local/keepalived/sbin/keepalived /usr/local/sbin/
[root@localhost keepalived]# mkdir /etc/keepalived/
[root@localhost keepalived]# ln -s /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
[root@localhost ~]# cat /etc/keepalived/keepalived.conf
[root@localhost keepalived-2.2.4]# systemctl start keepalived
[root@localhost keepalived-2.2.4]# systemctl status keepalived

3.3 PostgreSQL Asynchronous Master‑Slave Replication Setup

On the primary server (192.168.254.128):

postgres=# create user replia replication superuser password '123qwert';
postgres=# create database keepalived owner replia;
postgres=# \c keepalived replia
keepalived=> create table sr_delay(id int4, last_alive timestamp(0) without time zone);
keepalived=> INSERT INTO sr_delay VALUES(1, now());

Configure pg_hba.conf to allow replication:

host    replication     all             192.168.254.0/24      md5

Set the PostgreSQL environment variables (e.g., in ~/.bash_profile) and reload the configuration with pg_ctl reload.
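For reference, the streaming-replication settings behind this setup might look like the sketch below (the paths and IPs are the ones used in this article; on PostgreSQL 12+ the standby also needs an empty standby.signal file in its data directory):

```ini
# postgresql.conf on the primary (192.168.254.128), minimal sketch
wal_level = replica
max_wal_senders = 10
archive_mode = on
archive_command = 'cp %p /data/pg_archive/%f'

# postgresql.conf on the standby (192.168.254.129); also create an empty
# standby.signal file in the data directory (PostgreSQL 12+)
primary_conninfo = 'host=192.168.254.128 port=5432 user=replia password=123qwert'
```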

3.4 PostgreSQL Database Configuration

postgres=# create user keepalived password '123qwert' CONNECTION LIMIT 4;
CREATE ROLE
postgres=# create database keepalived owner keepalived;
CREATE DATABASE
postgres=# \c keepalived keepalived
keepalived=> create table sr_delay(id int4, last_alive timestamp(0) without time zone);
CREATE TABLE

Create a trigger function to prevent deletion of the single row in sr_delay :

CREATE FUNCTION cannt_delete ()
RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
  RAISE EXCEPTION 'Table sr_delay can not delete !';
END;
$$;

CREATE TRIGGER trigger_sr_delay_del BEFORE DELETE ON sr_delay FOR EACH ROW EXECUTE PROCEDURE cannt_delete();
CREATE TRIGGER trigger_sr_delay_tru BEFORE TRUNCATE ON sr_delay FOR STATEMENT EXECUTE PROCEDURE cannt_delete();
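With both triggers in place, the heartbeat row cannot be removed; any attempt raises the exception from cannt_delete():

```sql
keepalived=> DELETE FROM sr_delay WHERE id = 1;
ERROR:  Table sr_delay can not delete !
keepalived=> TRUNCATE sr_delay;
ERROR:  Table sr_delay can not delete !
```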

3.5 Keepalived Configuration

[root@localhost ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
   notification_email { [email protected] }
   notification_email_from [email protected]
   smtp_server 192.168.254.128
   smtp_connect_timeout 30
   router_id LVS_DEVEL
   vrrp_skip_check_adv_addr
   ! vrrp_strict is omitted: strict RFC mode rejects the PASS authentication used below
   vrrp_garp_interval 0
   vrrp_gna_interval 0
}

vrrp_script check_pg_alived {
   script "/data/scripts/pg_monitor.sh"
   interval 10
   fall 3
}

vrrp_instance VI_1 {
   state MASTER
   nopreempt
   interface ens33
   virtual_router_id 51
   priority 100
   advert_int 1
   authentication {
       auth_type PASS
       auth_pass 1111
   }
   virtual_ipaddress { 192.168.254.11 }
   track_script { check_pg_alived }
   notify_master "/data/scripts/failover.sh"
   notify_fault  "/data/scripts/fault.sh"
}

On the backup node, change priority to 90, state to BACKUP , and adjust smtp_server to 192.168.254.129 .
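For clarity, only these lines differ on the backup node. Note that the Keepalived documentation describes nopreempt as effective only when state is BACKUP, so many deployments set state BACKUP on both nodes and let priority decide the initial master; the fragment below keeps the article's values:

```
global_defs {
   smtp_server 192.168.254.129     ! 192.168.254.128 on the master
}
vrrp_instance VI_1 {
   state BACKUP                    ! MASTER on the master node
   priority 90                     ! 100 on the master node
}
```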

3.6 Database Monitoring Scripts

pg_monitor.sh

#!/bin/bash
pgport=5432
pguser=keepalived
pgdb=keepalived
pgpwd='123qwert'
LANG=en_US.utf8
PGHOME=/usr/local/pgsql/
LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib
PATH=$PGHOME/bin:$PATH:.
MONITOR_LOG="/data/scripts/pg_monitor.log"
SQL1="UPDATE sr_delay SET last_alive = now();"
SQL2='SELECT 1;'
keeplognums=30000

# Trim the log before any early exit so it cannot grow without bound.
if [ ! -f "${MONITOR_LOG}" ]; then touch "${MONITOR_LOG}"; fi
lognums=$(wc -l < "${MONITOR_LOG}")
if [ "$lognums" -gt "$keeplognums" ]; then
  catnum=$((lognums - keeplognums))
  sed -i "1,${catnum}d" "${MONITOR_LOG}"
fi

# A standby must not run the primary health check.
standby_flg=$(psql -p "$pgport" -U postgres -At -c "SELECT pg_is_in_recovery();")
if [ "$standby_flg" == "t" ]; then
  echo -e "$(date +%F\ %T): This is a standby database, exit!" >> "$MONITOR_LOG"
  exit 0
fi
export PGPASSWORD=$pgpwd

# Heartbeat: refresh the timestamp the standby reads to detect a dead primary.
echo "$SQL1" | psql -At -p "$pgport" -U "$pguser" -d "$pgdb" >> "$MONITOR_LOG"

# Health check: a simple SELECT through a normal connection.
echo "$SQL2" | psql -At -p "$pgport" -U "$pguser" -d "$pgdb"
if [ $? -eq 0 ]; then
  echo -e "$(date +%F\ %T): Primary db is healthy." >> "$MONITOR_LOG"
  exit 0
else
  echo -e "$(date +%F\ %T): Attention: Primary db is not healthy!" >> "$MONITOR_LOG"
  exit 1
fi

failover.sh

#!/bin/bash
export PGPORT=5432
export PGUSER=keepalived
export PG_OS_USER=postgres
export PGDBNAME=keepalived
export LANG=en_US.UTF-8
export PGMIP=127.0.0.1
LOGFILE='/data/scripts/failover.log'
sr_allowed_delay_time=100

SQL1='select pg_is_in_recovery();'
SQL2="select count(1) as delay_time from sr_delay where now() < (last_alive + interval '${sr_allowed_delay_time} seconds');"
# Wait out the allowed delay window so a brief primary hiccup does not
# trigger an unnecessary promotion.
sleep $sr_allowed_delay_time

db_role=$(echo "$SQL1" | psql -At -p "$PGPORT" -U "$PGUSER" -d "$PGDBNAME" -w)
db_sr_delaytime=$(echo "$SQL2" | psql -p "$PGPORT" -d "$PGDBNAME" -U "$PGUSER" -At -w)
SWITCH_COMMAND='pg_ctl promote -D /data/pg_data/'

# 'f' means this node is already a primary: nothing to switch.
if [ "$db_role" == 'f' ]; then
  echo -e "$(date +%F\ %T) Attention: the current PostgreSQL instance is already the primary; it cannot be switched." >> "$LOGFILE"
  exit 0
fi

# A heartbeat inside the window means the primary is still alive: do not promote.
if [ "$db_sr_delaytime" -gt 0 ]; then
  echo -e "$(date +%F\ %T) Attention: the primary database is still healthy; the standby will not be switched." >> "$LOGFILE"
  exit 0
fi

# No heartbeat inside the window: promote this standby to primary.
echo -e "$(date +%F\ %T) Attention: the current database is a standby; ready to switch to primary!" >> "$LOGFILE"
su - "$PG_OS_USER" -c "$SWITCH_COMMAND"
if [ $? -eq 0 ]; then
  echo -e "$(date +%F\ %T) Success: the standby database was promoted to primary." >> "$LOGFILE"
  exit 0
else
  echo -e "$(date +%F\ %T) Error: the standby database failed to switch to primary!" >> "$LOGFILE"
  exit 1
fi
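The promotion decision above boils down to two inputs: the node's recovery flag and the heartbeat count from sr_delay. Isolated as a pure function (a sketch for illustration, not part of the article's scripts), the rule is easy to test on its own:

```shell
#!/bin/bash
# Pure decision function mirroring failover.sh: takes the recovery flag
# ('t' = standby, 'f' = primary) and the heartbeat count from sr_delay,
# and prints the action to take.
decide_failover() {
  local db_role=$1 heartbeat_count=$2
  if [ "$db_role" = "f" ]; then
    echo "skip: already primary"
  elif [ "$heartbeat_count" -gt 0 ]; then
    echo "skip: primary still alive"
  else
    echo "promote"
  fi
}
decide_failover t 0   # -> promote
decide_failover t 1   # -> skip: primary still alive
decide_failover f 0   # -> skip: already primary
```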

fault.sh

#!/bin/bash
PGPORT=5432
PGMIP=192.168.254.128
LOGFILE=/data/scripts/pg_db_fault.log

# The first line of postmaster.pid holds the postmaster PID.
PGPID="$(head -n1 /data/pg_data/postmaster.pid)"

# Leave the VRRP group first, then force-stop the broken primary so it
# cannot keep accepting writes (fencing against split-brain).
systemctl stop keepalived
kill -9 "$PGPID"
if [ $? -eq 0 ]; then
  echo -e "$(date +%F\ %T) Error: Because the primary DB encountered an unknown problem, the PostgreSQL database has been stopped!" >> "$LOGFILE"
  exit 1
fi

Make scripts executable:

chmod 755 /data/scripts/pg_monitor.sh /data/scripts/failover.sh /data/scripts/fault.sh

4. Switch Drill

4.1 First Switch

Stop the primary database on 192.168.254.128:

[root@localhost ~]# su - postgres
[postgres@localhost ~]$ pg_ctl -D /data/pg_data/ stop

Verify that the standby (192.168.254.129) has taken over the VIP and been promoted to primary, then run pg_rewind on the old primary to resynchronize it from the new primary so it can rejoin as a standby:

pg_rewind -R --target-pgdata '/data/pg_data' --source-server 'host=192.168.254.129 port=5432 user=replia password=123qwert dbname=postgres'

If WAL files have been archived, copy them back:

scp [email protected]:/data/pg_archive/00000002* /data/pg_archive/

Update postgresql.conf with:

restore_command='cp /data/pg_archive/%f %p'

Start PostgreSQL and confirm the walreceiver is running:

[postgres@localhost ~]$ pg_ctl -D /data/pg_data/ start
[postgres@localhost ~]$ ps axu | grep walreceiver

Start Keepalived on the primary node and check its status:

[root@localhost ~]# systemctl start keepalived
[root@localhost ~]# systemctl status keepalived

4.2 Second Switch

Stop the database on the standby (192.168.254.129) and repeat the rewind process to make the original primary become the new standby.

[postgres@localhost ~]$ pg_ctl -D /data/pg_data/ stop -mf
pg_rewind -R --target-pgdata '/data/pg_data' --source-server 'host=192.168.254.128 port=5432 user=replia password=123qwert dbname=postgres'

Copy any missing WAL files if necessary and restart PostgreSQL, then start Keepalived on both nodes and verify the VIP has moved to the new master.
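A quick way to confirm where the VIP lives is to grep the interface addresses. The check can be wrapped so it also works on canned `ip addr` output (a sketch using this article's VIP; Keepalived adds the VIP as a /32 by default):

```shell
#!/bin/bash
# Succeeds when the given VIP appears in `ip addr` output read from stdin.
# On a live node:  ip addr show ens33 | has_vip 192.168.254.11
has_vip() {
  grep -qF "inet ${1}/"
}
# Canned output resembling the new master after takeover:
sample='inet 192.168.254.129/24 brd 192.168.254.255 scope global ens33
    inet 192.168.254.11/32 scope global ens33'
echo "$sample" | has_vip 192.168.254.11 && echo "VIP is on this node"
```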

5. Summary

1. The failover delay logic can be refined for more precise switching.

2. Automating pg_rewind with scripts reduces manual intervention.

3. Keepalived is highly flexible, allowing custom health checks and rules to be added via scripts.
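Point 2 can be sketched as a small wrapper that rebuilds a demoted node from the surviving primary (illustration only: the hosts, paths, and credentials are the ones used in this article, and the command string can be inspected without actually running pg_rewind):

```shell
#!/bin/bash
# Assemble the pg_rewind invocation used in the drills from its parts,
# so the same wrapper serves both switch directions.
build_rewind_cmd() {
  local pgdata=$1 source_host=$2
  echo "pg_rewind -R --target-pgdata '${pgdata}' --source-server 'host=${source_host} port=5432 user=replia password=123qwert dbname=postgres'"
}
rewind_from() {
  # Stop the local instance first, then rewind from the surviving primary.
  su - postgres -c "pg_ctl -D /data/pg_data/ stop -mf"
  su - postgres -c "$(build_rewind_cmd /data/pg_data "$1")"
}
# Show the command that would run when rewinding from 192.168.254.129:
build_rewind_cmd /data/pg_data 192.168.254.129
```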

Tags: High Availability, PostgreSQL, Database Replication, failover, VRRP, Keepalived
Written by Full-Stack Internet Architecture, a blog introducing full-stack Internet architecture technologies centered on Java.
