Implementing High‑Availability PostgreSQL with Keepalived: Architecture, Setup, and Failover Procedures
This article explains how to use Keepalived together with PostgreSQL to build a two‑node high‑availability cluster, covering Keepalived's VRRP mechanism, host planning, installation steps, asynchronous master‑slave replication configuration, monitoring scripts, and detailed failover drills.
1. Keepalived Working Principle
Keepalived provides VRRP and health‑check functions, enabling a floating virtual IP (VIP) for a two‑node high‑availability solution.
It is based on the Virtual Router Redundancy Protocol (VRRP) and creates a router group with one master and multiple backups; the master advertises the VIP and backups take over if the master fails.
Download links: https://www.keepalived.org/download.html https://github.com/acassen/keepalived
2. Architecture
2.1 Host Planning
2.2 Architecture Diagram
3. Environment Preparation
3.1 Disable Firewall and SELinux
[root@localhost ~]# systemctl stop firewalld
[root@localhost keepalived]# vi /etc/selinux/config
SELINUX=disabled
[root@localhost keepalived]# getenforce
Disabled3.2 Install Keepalived
[root@localhost ~]# yum install gcc openssl openssl‑devel libnl libnl‑devel ipvsadm -y
[root@localhost ~]# cd /usr/local/src/
[root@localhost src]# tar -zxf keepalived-2.2.4.tar.gz
[root@localhost src]# cd keepalived-2.2.4
[root@localhost keepalived-2.2.4] ./configure --prefix=/usr/local/keepalived
[root@localhost keepalived-2.2.4] make && make install
[root@localhost keepalived]# ln -s /usr/local/src/keepalived-2.2.4/keepalived/etc/init.d/keepalived /etc/init.d/
[root@localhost keepalived]# ln -s /usr/local/keepalived/etc/sysconfig/keepalived /etc/sysconfig/
[root@localhost keepalived]# ln -s /usr/local/keepalived/sbin/keepalived /usr/local/sbin/
[root@localhost keepalived]# mkdir /etc/keepalived/
[root@localhost keepalived]# ln -s /usr/local/keepalived/etc/keepalived/keepalived.conf /etc/keepalived/
[root@localhost ~]# cat /etc/keepalived/keepalived.conf
[root@localhost keepalived-2.2.4]# systemctl start keepalived
[root@localhost keepalived-2.2.4]# systemctl status keepalived3.3 PostgreSQL Asynchronous Master‑Slave Replication Setup
On the primary server (192.168.254.128):
postgres=# create user replia replication superuser password '123qwert';
postgres=# create database keepalived owner replia;
postgres=# \c keepalived replia
keepalived=> create table sr_delay(id int4, last_alive timestamp(0) without time zone);
keepalived=> INSERT INTO sr_delay VALUES(1, now());Configure pg_hba.conf to allow replication:
host replication all 192.168.254.1/24 md5Set environment variables (e.g., in ~/.bash_profile ) and reload PostgreSQL.
3.4 PostgreSQL Database Configuration
postgres=# create user keepalived password '123qwert' CONNECTION LIMIT 4;
CREATE ROLE
postgres=# create database keepalived owner keepalived;
CREATE DATABASE
postgres=# \c keepalived keepalived
keepalived=> create table sr_delay(id int4, last_alive timestamp(0) without time zone);
CREATE TABLECreate a trigger function to prevent deletion of the single row in sr_delay :
CREATE FUNCTION cannt_delete ()
RETURNS trigger
LANGUAGE plpgsql AS $$
BEGIN
RAISE EXCEPTION 'Table sr_delay can not delete !';
END;
$$;
CREATE TRIGGER trigger_sr_delay_del BEFORE DELETE ON sr_delay FOR EACH ROW EXECUTE PROCEDURE cannt_delete();
CREATE TRIGGER trigger_sr_delay_tru BEFORE TRUNCATE ON sr_delay FOR STATEMENT EXECUTE PROCEDURE cannt_delete();3.5 Keepalived Configuration
[root@localhost ~]# cat /etc/keepalived/keepalived.conf
! Configuration File for keepalived
global_defs {
notification_email { [email protected] }
notification_email_from [email protected]
smtp_server 192.168.254.128
smtp_connect_timeout 30
router_id LVS_DEVEL
vrrp_skip_check_adv_addr
vrrp_strict
vrrp_garp_interval 0
vrrp_gna_interval 0
}
vrrp_script check_pg_alived {
script "/data/scripts/pg_monitor.sh"
interval 10
fall 3
}
vrrp_instance VI_1 {
state MASTER
nopreempt
interface ens33
virtual_router_id 51
priority 100
advert_int 1
authentication {
auth_type PASS
auth_pass 1111
}
virtual_ipaddress { 192.168.254.11 }
track_script { check_pg_alived }
notify_master "/data/scripts/failover.sh"
notify_fault "/data/scripts/fault.sh"
}On the backup node, change priority to 90, state to BACKUP , and adjust smtp_server to 192.168.254.129 .
3.6 Database Monitoring Scripts
pg_monitor.sh
#!/bin/bash
pgport=5432
pguser=keepalived
pgdb=keepalived
pgpwd='123qwert'
LANG=en_US.utf8
PGHOME=/usr/local/pgsql/
LD_LIBRARY_PATH=$PGHOME/lib:/lib64:/usr/lib64:/usr/local/lib64:/lib:/usr/lib:/usr/local/lib
PATH=$PGHOME/bin:$PATH:.
MONITOR_LOG="/data/scripts/pg_monitor.log"
SQL1="UPDATE sr_delay SET last_alive= now();"
SQL2='SELECT 1;'
keeplognums=30000
standby_flg=`psql -p$pgport -Upostgres -At -c "SELECT pg_is_in_recovery();"`
if [ ${standby_flg} == 't' ]; then
echo -e "$(date +%F\ %T): This is a standby database, exit!" > $MONITOR_LOG
exit 0
fi
export PGPASSWORD=$pgpwd
echo $SQL1 | psql -At -p $pgport -U $pguser -d $pgdb >> $MONITOR_LOG
echo $SQL2 | psql -At -p $pgport -U $pguser -d $pgdb
if [ $? -eq 0 ]; then
echo -e "$(date +%F\ %T): Primary db is health." >> $MONITOR_LOG
exit 0
else
echo -e "$(date +%F\ %T): Attention: Primary db is not health!" >> $MONITOR_LOG
exit 1
fi
if [ ! -f ${MONITOR_LOG} ]; then touch ${MONITOR_LOG}; fi
lognums=`cat ${MONITOR_LOG} | wc -l`
catnum=$((${lognums}-${keeplognums}))
if [ $lognums -gt ${keeplognums} ]; then sed -i "1,${catnum}d" ${MONITOR_LOG}; fifailover.sh
#!/bin/bash
export PGPORT=5432
export PGUSER=keepalived
export PG_OS_USER=postgres
export PGDBNAME=keepalived
export LANG=en_US.UTF-8
export PGMIP=127.0.0.1
LOGFILE='/data/scripts/failover.log'
sr_allowed_delay_time=100
SQL1='select pg_is_in_recovery from pg_is_in_recovery();'
SQL2="select count(1) as delay_time from sr_delay where now()< (last_alive + interval '${sr_allowed_delay_time} seconds');"
sleep $sr_allowed_delay_time
db_role=`echo $SQL1 | psql -At -p $PGPORT -U $PGUSER -d $PGDBNAME -w`
db_sr_delaytime=`echo $SQL2 | psql -p $PGPORT -d $PGDBNAME -U $PGUSER -At -w`
SWITCH_COMMAND='pg_ctl promote -D /data/pg_data/'
if [ ${db_role} == 'f' ]; then
echo -e "$(date +%F\ %T) Attention:The current postgreSQL DB is master database, cannot switched!" >> $LOGFILE
exit 0
fi
if [ $db_sr_delaytime -gt 0 ]; then
echo -e "$(date +%F\ %T) Attention:The current master database is health, the standby DB cannot switched!" >> $LOGFILE
exit 0
fi
if [ ! $db_sr_delaytime ]; then
echo -e "$(date +%F\ %T) Attention:The current database is standby, ready to switch master database!" >> $LOGFILE
su - $PG_OS_USER -c "$SWITCH_COMMAND"
elif [ $? -eq 0 ]; then
echo -e "$(date +%F\ %T) success:The current standby database succeeded to switch the primary PG database!" >> $LOGFILE
exit 0
else
echo -e "$(date +%F\ %T) Error: the standby database failed to switch the primary PG database!" >> $LOGFILE
exit 1
fifault.sh
#!/bin/bash
GFILE=/data/scripts/pg_db_fault.log
PGPORT=5432
PGMIP=192.168.254.128
LOGFILE=/data/scripts/pg_db_fault.log
PGPID="`head -n1 /data/pg_data/postmaster.pid`"
systemctl stop keepalived
kill -9 $PGPID
if [ $? -eq 0 ]; then
echo -e "$(date +%F\ %T) Error: Because the primary DB encountered an unknown problem, the PostgreSQL database is stopped!" >> $LOGFILE
systemctl stop keepalived
exit 1
fiMake scripts executable:
chmod 755 /data/scripts/pg_monitor.sh /data/scripts/failover.sh /data/scripts/fault.sh4. Switch Drill
4.1 First Switch
Stop the primary database on 192.168.254.128:
[root@localhost ~]# su - postgres
[postgres@localhost ~]$ pg_ctl -D /data/pg_data/ stopVerify the standby (192.168.254.129) is still serving requests, then use pg_rewind to synchronize the new primary back to the old primary:
pg_rewind -R --target-pgdata '/data/pg_data' --source-server 'host=192.168.254.129 port=5432 user=replia password=123qwert dbname=postgres'If WAL files have been archived, copy them back:
scp [email protected]:/data/pg_archive/00000002* /data/pg_archive/Update postgresql.conf with:
restore_command='cp /data/pg_archive/%f %p'Start PostgreSQL and confirm the walreceiver is running:
[postgres@localhost ~]$ pg_ctl -D /data/pg_data/ start
[postgres@localhost ~]$ ps axu | grep walreceiverStart Keepalived on the primary node and check its status:
[root@localhost ~]# systemctl start keepalived
[root@localhost ~]# systemctl status keepalived4.2 Second Switch
Stop the database on the standby (192.168.254.129) and repeat the rewind process to make the original primary become the new standby.
[postgres@localhost ~]$ pg_ctl -D /data/pg_data/ stop -mf pg_rewind -R --target-pgdata '/data/pg_data' --source-server 'host=192.168.254.128 port=5432 user=replia password=123qwert dbname=postgres'Copy any missing WAL files if necessary and restart PostgreSQL, then start Keepalived on both nodes and verify the VIP has moved to the new master.
5. Summary
1. The failover delay logic can be refined for more precise switching.
2. Automating pg_rewind with scripts reduces manual intervention.
3. Keepalived is highly flexible, allowing custom health checks and rules to be added via scripts.
Full-Stack Internet Architecture
Introducing full-stack Internet architecture technologies centered on Java
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.