
PostgreSQL High Availability (PGHA) at Qunar: Architecture, Customization, Testing, and Metrics

This article details Qunar's implementation of PostgreSQL high‑availability using Patroni, covering solution selection, custom DCS and failover mechanisms, operational impact, comprehensive testing procedures, performance metrics, and future directions for cross‑region HA deployment.

Qunar Tech Salon

Qunar's DBA team needed a robust PostgreSQL high‑availability (HA) solution to meet strict availability targets (multiple 9s) and reduce manual switchover/failover errors.

Background : Hardware upgrades, manual switchover/failover steps, and unpredictable failures required an automated HA layer.

Solution selection : After evaluating popular open‑source PGHA tools, Patroni was chosen for its active community, Python‑based automation, and strong feature set.

Patroni architecture :

Key advantages include support for synchronous replication, switchover with minimal data loss, a high degree of automation with safeguards against split‑brain, and the use of pg_rewind to bring a former master back into the cluster.

Operational impact : Patroni takes full control of the cluster's configuration files. This makes it an invasive component, so operational procedures must be carried out through the HA layer rather than directly against PostgreSQL.

Customizations for Qunar :

DCS (Distributed Configuration Store) was set to ZooKeeper (zk) because Qunar's stack is Java + zk.

zk is deployed as a centralized cluster for easier maintenance.

To avoid unreliable watchdog‑based power‑off, a custom OS console power‑reset ("kill") mechanism was implemented for rapid master shutdown.

Virtual IP (VIP) strategy: a floating VIP is used instead of a proxy layer such as PgBouncer, Pgpool, or HAProxy, which simplifies the architecture.

Both master and slave VIPs were customized to support automatic failover and switchover without service interruption.
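One common way to move a floating VIP with Patroni is a callback script, which Patroni invokes with the action, the role, and the cluster name as arguments. The article does not say how Qunar's VIP customization is implemented, so the following is only a minimal sketch under that assumption. The function name `vip_commands`, the address `192.0.2.10/24`, and the interface `eth0` are hypothetical, and the script prints the commands instead of executing them.

```python
import sys
from typing import List

# The leader role label differs between Patroni versions, so accept both.
LEADER_ROLES = {"master", "primary"}


def vip_commands(action: str, role: str, vip_cidr: str, iface: str) -> List[List[str]]:
    """Return the `ip` commands to run for one Patroni callback event."""
    # Node started as leader or was promoted: bind the VIP.
    if action in ("on_start", "on_role_change") and role in LEADER_ROLES:
        return [["ip", "addr", "add", vip_cidr, "dev", iface]]
    # Node stopped or was demoted: release the VIP.
    if action in ("on_stop", "on_role_change"):
        return [["ip", "addr", "del", vip_cidr, "dev", iface]]
    return []


if __name__ == "__main__" and len(sys.argv) >= 4:
    # Patroni calls the script as: <script> <action> <role> <cluster-name>
    action, role, _cluster = sys.argv[1:4]
    for cmd in vip_commands(action, role, "192.0.2.10/24", "eth0"):
        print(" ".join(cmd))  # dry run: print instead of executing
```

A slave VIP could follow the same pattern with the role test inverted. A production script would also need to handle the case where the address is already bound or already absent.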

Testing : After each customization iteration, comprehensive regression tests are run to ensure all test cases meet expectations.

Metrics :

Target availability: at least 99.95% (≈262.8 minutes of downtime per year), rising to as much as five 9s with the HA layer.

Master failover timeline includes TTL, safety margin, loop_wait, OPS reset, promote time, and VIP bind (≈10 seconds total).

Switchover for master or slave typically completes within 3 seconds; slave failover within 10 seconds.
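The availability targets and failover times above can be related with some simple arithmetic. The sketch below converts an availability figure into a yearly downtime budget and estimates how many failovers of a given duration fit into it. It assumes a 365-day year, and the function names are illustrative.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a non-leap year


def downtime_minutes_per_year(availability: float) -> float:
    """Yearly downtime budget, in minutes, for an availability such as 0.9995."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def failovers_within_budget(availability: float, seconds_per_failover: float) -> int:
    """Number of whole failovers of the given duration that fit in the budget."""
    budget_seconds = downtime_minutes_per_year(availability) * 60
    return int(budget_seconds // seconds_per_failover)


if __name__ == "__main__":
    for label, availability in (("99.95%", 0.9995), ("99.999%", 0.99999)):
        minutes = downtime_minutes_per_year(availability)
        count = failovers_within_budget(availability, 10)
        print(f"{label}: {minutes:.3f} min/year, about {count} failovers of 10 s")
```

At five 9s the budget is about 5.26 minutes per year, which leaves room for roughly 31 master failovers of 10 seconds each, provided nothing else causes downtime.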

Formulas for safe parameter settings:

ttl - safety_margin > loop_wait
ttl >= loop_wait + 2 * retry_timeout
# Patroni defaults: ttl=30s, loop_wait=10s, retry_timeout=10s, safety_margin=5s
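A small validator can catch unsafe timing combinations before they reach a cluster. The sketch below checks the watchdog margin rule from above, and for the second rule it follows the form given in Patroni's documentation, `loop_wait + 2 * retry_timeout <= ttl`. The class and function names are illustrative, and the values in the usage lines are examples only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HaTimings:
    """Timing parameters in seconds; field names mirror the Patroni settings."""
    ttl: int
    loop_wait: int
    retry_timeout: int
    safety_margin: int


def check_timings(t: HaTimings) -> List[str]:
    """Return a list of violated rules; an empty list means the settings pass."""
    problems = []
    # The watchdog must fire only after at least one full HA loop could run.
    if not t.ttl - t.safety_margin > t.loop_wait:
        problems.append("ttl - safety_margin must be greater than loop_wait")
    # Rule stated in Patroni's documentation for ttl, loop_wait, retry_timeout.
    if not t.loop_wait + 2 * t.retry_timeout <= t.ttl:
        problems.append("loop_wait + 2 * retry_timeout must not exceed ttl")
    return problems


if __name__ == "__main__":
    ok = HaTimings(ttl=30, loop_wait=5, retry_timeout=10, safety_margin=5)
    bad = HaTimings(ttl=20, loop_wait=10, retry_timeout=10, safety_margin=10)
    print(check_timings(ok))
    print(check_timings(bad))
```

Running such a check in a deployment pipeline is one way to make sure a shortened `ttl`, chosen to speed up failover, does not silently break either rule.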

Future directions :

Namespace‑based connection strings without fixed host/IP, enabling cross‑region deployments.

Language‑specific connection pool implementations (Python, Java, Go) that map logical namespaces to physical hosts.

Leveraging native JDBC failover for Qunar's Java‑centric stack.
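The namespace idea above can be illustrated with a small resolver that turns a logical name into a multi-host connection string. This is a sketch only: the registry contents, host names, and function names are hypothetical, and the article does not describe Qunar's actual design. It relies on two existing client features, libpq's multi-host syntax with `target_session_attrs=read-write` and the PostgreSQL JDBC driver's `targetServerType` parameter.

```python
from typing import Dict, List, Tuple

Member = Tuple[str, int]  # (host, port)

# Hypothetical registry: logical namespace -> physical cluster members.
REGISTRY: Dict[str, List[Member]] = {
    "orders_db": [("pg-a.example.internal", 5432), ("pg-b.example.internal", 5432)],
}


def resolve(namespace: str, registry: Dict[str, List[Member]]) -> List[Member]:
    """Look up the physical members behind a logical namespace."""
    try:
        return registry[namespace]
    except KeyError:
        raise LookupError(f"unknown namespace: {namespace}") from None


def libpq_dsn(namespace: str, dbname: str, registry: Dict[str, List[Member]]) -> str:
    """Build a libpq DSN that tries each host and keeps the writable one."""
    members = resolve(namespace, registry)
    hosts = ",".join(host for host, _ in members)
    ports = ",".join(str(port) for _, port in members)
    return f"host={hosts} port={ports} dbname={dbname} target_session_attrs=read-write"


def jdbc_url(namespace: str, dbname: str, registry: Dict[str, List[Member]]) -> str:
    """Build a JDBC URL; older driver versions spell the value `master`."""
    members = resolve(namespace, registry)
    hosts = ",".join(f"{host}:{port}" for host, port in members)
    return f"jdbc:postgresql://{hosts}/{dbname}?targetServerType=primary"


if __name__ == "__main__":
    print(libpq_dsn("orders_db", "orders", REGISTRY))
    print(jdbc_url("orders_db", "orders", REGISTRY))
```

With this approach the application only knows the namespace, so cluster members can be added, moved to another region, or replaced by updating the registry rather than every client configuration.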

Conclusion : The multi‑phase PGHA rollout at Qunar is stable, achieves >5 9s availability, and continues to evolve through community feedback and internal optimizations.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
