Zero‑Downtime Ops: Inside Tencent’s Panshi High‑Availability Platform

At the 2020 GOPS Global Operations Conference, Tencent senior operations engineer Xie Hailin detailed the design and implementation of the Panshi platform, a high‑availability solution that unifies change management, fault handling, continuous operation, and disaster recovery to keep payment services running without interruption across billions of daily transactions.

Efficient Ops

Dec 1, 2020

Based on Xie Hailin’s talk at the 2020 GOPS Global Operations Conference (Shenzhen), this article introduces the “Panshi” platform, a financial‑grade high‑availability operations system built by Tencent.

1. Background of Panshi

After the 2014 “red‑packet war”, peak payment traffic grew to tens of thousands of transactions per second, settlement operations reached millions per second, and daily volume reached the billions. The system comprises tens of thousands of servers that generate trillions of log entries each day. The business requires sub‑200 ms response times, strict availability, and real‑time balance visibility.

Frequent releases, hardware unreliability, software bugs, and human error further increase operational risk.

2. Overall Solution

The guiding principle is to “fully cover uncertainty and minimise the impact of unknown risks”. Three dedicated platforms were built:

Unified Change – ensure loss‑less releases.

Fault Handling – minimise business impact when failures occur.

Continuous Operation – continuously reduce hidden risks.

All three aim to keep the platform as stable as a rock.

3. Unified Change System

The change system follows five key practices:

System‑wide login elimination – direct logins are removed; every change goes through approved workflows and tools.

Gradual (gray) rollout – releases are phased from low‑priority services to high‑priority ones.

Instant rollback – any problem triggers an immediate revert.

Consistency (publish‑and‑take‑effect) – a release takes effect as soon as it is published.

Change record – every deployment is fully traceable.

During gray rollout, traffic is increased step by step. Each step is validated through automated checks and single‑side deployment so that the live system is not affected.
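The stepwise rollout described here can be sketched as a loop that raises the traffic share, runs the automated checks, and reverts on the first failure. This is a minimal illustration, not Panshi's implementation: `gray_rollout`, `set_traffic`, `health_check`, `rollback`, and the percentage steps are all hypothetical names and values chosen for the example.

```python
from typing import Callable, List


def gray_rollout(steps: List[int],
                 set_traffic: Callable[[int], None],
                 health_check: Callable[[int], bool],
                 rollback: Callable[[], None]) -> bool:
    """Shift traffic to the new version step by step; revert on the first failed check."""
    for percent in steps:
        set_traffic(percent)            # route `percent`% of traffic to the new version
        if not health_check(percent):   # automated validation for this step
            rollback()                  # any problem triggers an immediate revert
            return False
    return True


# Usage with in-memory stand-ins for the real traffic and check systems.
history: List[str] = []
ok = gray_rollout(
    steps=[1, 10, 50, 100],                       # assumed step sizes
    set_traffic=lambda p: history.append(f"traffic={p}"),
    health_check=lambda p: p < 50,                # pretend the 50% step fails its checks
    rollback=lambda: history.append("rollback"),
)
```

In this run the rollout stops at the 50% step and records a rollback, so the 100% step is never attempted.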

Three safety nets are provided:

Switch traffic to the opposite side if the current side fails.

Version rollback with gray‑scale support.

Baseline rollback – force a revert to the previous stable version when both the traffic switch and the version rollback fail.
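One way to read the three safety nets is as an ordered fallback chain: try the cheapest recovery first and move to the next only if it fails. The sketch below assumes that ordering; `recover` and the action names are hypothetical, and each action is a stand-in that returns `True` on success.

```python
from typing import Callable, List, Optional, Tuple

Action = Tuple[str, Callable[[], bool]]


def recover(safety_nets: List[Action]) -> Optional[str]:
    """Try each safety net in order and return the name of the first that succeeds."""
    for name, action in safety_nets:
        try:
            if action():
                return name
        except Exception:
            # A safety net that raises is treated as failed; fall through to the next.
            continue
    return None  # nothing worked; a human has to step in


def failing_switch() -> bool:
    raise RuntimeError("opposite side unavailable")


used = recover([
    ("switch_traffic", failing_switch),        # switch to the opposite side
    ("version_rollback", lambda: False),       # gray-scale version rollback
    ("baseline_rollback", lambda: True),       # forced revert to the stable baseline
])
```

Here the traffic switch raises and the version rollback reports failure, so `used` ends up as `"baseline_rollback"`.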

4. Fault Handling System

Faults fall into two groups: high‑frequency known failures, which are handled by self‑healing, and all other failures, which require manual intervention. The handling workflow includes:

Fault discovery.

Fault localisation.

Fault remediation.

Post‑mortem review.

Fault‑drill exercises.
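The split between self-healing and manual handling can be illustrated as a lookup: a fault type with a registered playbook is remediated automatically, and anything else is queued for a person. The fault types, playbook actions, and `manual_queue` below are invented for the example and are not taken from the talk.

```python
from typing import Callable, Dict, List

# Hypothetical playbooks for high-frequency known failures.
PLAYBOOKS: Dict[str, Callable[[str], str]] = {
    "disk_full": lambda host: f"cleaned logs on {host}",
    "process_down": lambda host: f"restarted process on {host}",
}

# Faults without a playbook wait here for manual intervention.
manual_queue: List[str] = []


def handle_fault(kind: str, host: str) -> str:
    """Self-heal known fault types; escalate everything else."""
    playbook = PLAYBOOKS.get(kind)
    if playbook is not None:
        return playbook(host)              # known failure: run the self-healing action
    manual_queue.append(f"{kind}@{host}")  # unknown failure: hand over to on-call staff
    return "escalated"
```

Calling `handle_fault("disk_full", "h1")` runs the playbook, while an unregistered type such as `handle_fault("unknown_error", "h2")` is appended to `manual_queue`.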

Key challenges include massive data volume, fast and accurate alarm detection, and efficient remediation.

Alarm management evolved through three generations:

First‑generation: manual configuration, prone to drift.

Second‑generation: AIOps 1.0 – algorithmic thresholds learned offline.

Third‑generation: AIOps 2.0 – real‑time learning, area‑based alarm aggregation, and anomaly classification.
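The talk summary does not say which algorithms AIOps 2.0 uses. As a stand-in for the idea of a threshold that is learned in real time rather than configured by hand, the sketch below keeps a running mean and variance with Welford's algorithm and flags values more than `k` standard deviations from the mean. The class name, `k`, and `warmup` are assumptions made for the example.

```python
import math


class OnlineThreshold:
    """Running mean/variance (Welford's algorithm) with a k-sigma anomaly test."""

    def __init__(self, k: float = 3.0, warmup: int = 5) -> None:
        self.k = k            # how many standard deviations count as anomalous
        self.warmup = warmup  # samples to learn from before flagging anything
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0         # running sum of squared deviations from the mean

    def observe(self, x: float) -> bool:
        """Return True if x is anomalous against the data seen so far, then learn from x."""
        anomalous = False
        if self.n >= self.warmup:
            std = math.sqrt(self.m2 / (self.n - 1))
            anomalous = abs(x - self.mean) > self.k * max(std, 1e-9)
        # Welford update; note that this simple version also learns from anomalies.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)
        return anomalous


detector = OnlineThreshold()
flags = [detector.observe(v) for v in [100, 102, 98, 101, 99, 100, 500]]
```

With this input, only the final value of 500 is flagged. A production detector would also need to handle seasonality and avoid absorbing anomalies into its baseline, which this sketch does not attempt.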

To reduce noise, alarms are filtered by impact area and severity; low‑priority alerts are monitored, while high‑priority alerts trigger immediate response and automatic tagging.
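Filtering by impact area and severity might look like the following: alarms are grouped by area, and each area is routed to immediate response or to passive monitoring according to its most severe alarm. The `Alarm` fields, the area labels, and the `page_at` cutoff are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    area: str       # impact area, e.g. a service or zone (hypothetical labels)
    severity: int   # higher means more severe
    message: str


def triage(alarms: List[Alarm], page_at: int = 3) -> Dict[str, Dict[str, List[str]]]:
    """Aggregate alarms per area, then route each area by its highest severity."""
    grouped: Dict[str, List[Alarm]] = defaultdict(list)
    for alarm in alarms:
        grouped[alarm.area].append(alarm)

    result: Dict[str, Dict[str, List[str]]] = {"respond": {}, "monitor": {}}
    for area, items in grouped.items():
        worst = max(a.severity for a in items)
        bucket = "respond" if worst >= page_at else "monitor"
        result[bucket][area] = [a.message for a in items]
    return result


routed = triage([
    Alarm("payment-core", 2, "timeout spike"),
    Alarm("reporting", 1, "slow query"),
    Alarm("payment-core", 4, "error rate high"),
])
```

Both `payment-core` alarms land in one `respond` entry, so the on-call engineer sees a single aggregated item for that area instead of two separate alerts.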

5. Future of Panshi Operations

The roadmap focuses on platform‑level service capabilities: unified on‑call, continuous operation, SRE metrics, and an operations delivery pipeline. The PaaS platform will integrate fault handling, change management, capacity planning, and an open SRE API.

Infrastructure will provide configuration centers, workflow engines, operation channels, and data pipelines, with an emphasis on integration, openness, and intelligence to further reduce manual dependence.

Ultimately, the goal is to deliver at least 60 % of operations capability out‑of‑the‑box for any service that integrates with Panshi, and to enable teams to reach 90 % through self‑service extensions.

