How Tencent’s BlueKing Automates Fault Recovery and Zero‑Touch Game Server Launch
This article explains how Tencent Games' BlueKing platform redesigns operations by building open‑source PaaS capabilities, automating fault self‑healing, enabling fully automated game server region launches, supporting self‑service change releases, using big data for real‑time decisions, and moving toward open‑source and hybrid‑cloud solutions.
1. Fault Auto‑Recovery
Automatic recovery, rather than traditional manual repair, has become a basic requirement. BlueKing implements automatic alarm handling using Fault Tree Analysis (FTA): alerts are classified as either critical alarms or pre‑warnings, and each class has its own processing or analysis logic.
Faults directly affect user experience and revenue; monitoring and automatic recovery are essential.
BlueKing provides a SaaS‑style "fault self‑healing" app that lets operators drag‑and‑drop to create fault logic trees for common alerts, and an IDE for custom complex logic.
During a test, BlueKing automatically traced a port alarm to a dead process and restarted that process in 1 minute 13 seconds.
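The port‑alarm case above can be sketched as a small logic tree. This is only an illustration of the idea, not BlueKing's implementation: the `Node` class, the alert fields (`type`, `process_alive`, `process`), and the returned action strings are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass
class Node:
    """One node in a hypothetical fault logic tree."""
    name: str
    check: Callable[[dict], bool]                    # does this branch apply to the alert?
    action: Optional[Callable[[dict], str]] = None   # recovery step, if any
    children: Tuple["Node", ...] = ()


def handle(node: Node, alert: dict) -> Optional[str]:
    """Walk the tree depth-first and run the first matching recovery action."""
    if not node.check(alert):
        return None
    for child in node.children:
        result = handle(child, alert)
        if result is not None:
            return result
    return node.action(alert) if node.action else None


# Hypothetical tree: a port alarm is either a dead process (restart it)
# or something else (hand it to a human).
port_alarm_tree = Node(
    name="port alarm",
    check=lambda a: a["type"] == "port_unreachable",
    children=(
        Node(
            name="process dead",
            check=lambda a: not a["process_alive"],
            action=lambda a: f"restart {a['process']}",
        ),
        Node(
            name="process alive",
            check=lambda a: a["process_alive"],
            action=lambda a: f"escalate {a['process']}",
        ),
    ),
)

dead = {"type": "port_unreachable", "process_alive": False, "process": "gamesvr"}
alive = {"type": "port_unreachable", "process_alive": True, "process": "gamesvr"}
other = {"type": "disk_full", "process_alive": True, "process": "gamesvr"}

restart_result = handle(port_alarm_tree, dead)
escalate_result = handle(port_alarm_tree, alive)
ignored_result = handle(port_alarm_tree, other)
```

In a real system the actions would call out to job execution rather than return strings. Keeping diagnosis (`check`) separate from recovery (`action`) is what would let a drag‑and‑drop editor compose the same pieces into different trees.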
In the first half of 2015, BlueKing handled 3.31 million alerts, achieving a 100% success rate for the 3.03 million pre‑warnings and a 94.25% success rate for the 280,000 alarms, saving over 10,000 man‑hours.
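As a quick check on these figures, the combined success rate can be derived from the two per‑class rates. The input counts are the rounded numbers quoted above, so the result is approximate.

```python
# Rounded figures quoted in the article for the first half of 2015.
pre_warnings = 3_030_000
alarms = 280_000
pre_warning_success = 1.00
alarm_success = 0.9425

total = pre_warnings + alarms
handled = pre_warnings * pre_warning_success + alarms * alarm_success
overall = handled / total

print(f"{total:,} alerts, overall success rate about {overall:.2%}")
```

The two classes do sum to the quoted 3.31 million alerts, and the weighted success rate comes out at roughly 99.5%.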
2. Automated Game Server Region Launch
New game regions (servers) have to be opened frequently; BlueKing automates the entire workflow in four stages.
Stage 1: Automated Physical Deployment
Operators replace manual scripts with a BlueKing tool that calls atomic components to allocate resources and deploy servers.
Stage 2: Automated Environment Deployment
Additional steps such as time reset, test‑data cleanup, and website updates are scripted and integrated into the same tool.
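Stages 1 and 2 amount to chaining small, reusable steps in one tool. A minimal sketch, assuming each atomic component is a function that reads and updates a shared context; the step names, the context keys, and `run_launch` are hypothetical and stand in for whatever the real components do.

```python
from typing import Callable, Dict, List

Step = Callable[[Dict], None]


# Stage 1: physical deployment (hypothetical stand-ins for atomic components).
def allocate_resources(ctx: Dict) -> None:
    ctx["hosts"] = [f"host-{i}" for i in range(ctx["host_count"])]


def deploy_server(ctx: Dict) -> None:
    ctx["deployed"] = list(ctx["hosts"])


# Stage 2: environment deployment.
def reset_time(ctx: Dict) -> None:
    ctx["time_reset"] = True


def clean_test_data(ctx: Dict) -> None:
    ctx["test_data_cleaned"] = True


def update_website(ctx: Dict) -> None:
    ctx["website_updated"] = True


LAUNCH_STEPS: List[Step] = [
    allocate_resources,
    deploy_server,
    reset_time,
    clean_test_data,
    update_website,
]


def run_launch(steps: List[Step], ctx: Dict) -> Dict:
    """Run steps in order, stop at the first failure, and log completed steps."""
    for step in steps:
        try:
            step(ctx)
        except Exception as exc:
            raise RuntimeError(f"step {step.__name__} failed") from exc
        ctx.setdefault("log", []).append(step.__name__)
    return ctx


launch_ctx = run_launch(LAUNCH_STEPS, {"host_count": 2})
```

Because each step is independent, the same list can be reordered or extended per game without touching the runner.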
Stage 3: Automated Decision Support
Product planners define opening rules; BlueKing’s data platform pulls real‑time metrics from IDC, computes whether a new region should be opened, and triggers the process automatically, with an optional manual confirmation step during testing.
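The decision step can be pictured as a rule evaluated against live metrics, with an optional confirmation gate. The sketch below assumes two example criteria (online‑to‑capacity ratio and registered‑user count); the article does not say which metrics planners actually use, so the `OpeningRule` fields and thresholds are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class OpeningRule:
    """Hypothetical opening rule set by product planners."""
    max_online_ratio: float   # open a region once online/capacity reaches this
    max_registered: int       # or once registered users reach this


def should_open_region(rule: OpeningRule, online: int, capacity: int,
                       registered: int) -> bool:
    """Return True when either threshold in the rule is reached."""
    return (online / capacity >= rule.max_online_ratio
            or registered >= rule.max_registered)


def decide(rule: OpeningRule, online: int, capacity: int, registered: int,
           confirm: Optional[Callable[[], bool]] = None) -> str:
    """Return 'hold', 'awaiting_confirmation', or 'launch'.

    `confirm` models the optional manual confirmation used during testing;
    pass None to launch fully automatically.
    """
    if not should_open_region(rule, online, capacity, registered):
        return "hold"
    if confirm is not None and not confirm():
        return "awaiting_confirmation"
    return "launch"


rule = OpeningRule(max_online_ratio=0.8, max_registered=50_000)

quiet = decide(rule, online=3_000, capacity=10_000, registered=20_000)
busy_auto = decide(rule, online=8_500, capacity=10_000, registered=20_000)
busy_unconfirmed = decide(rule, online=8_500, capacity=10_000,
                          registered=20_000, confirm=lambda: False)
```

Making the confirmation a pluggable callback means the manual gate can be removed once the rule is trusted, without changing the rule itself.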
Stage 4: Generic Launch Tool
After many game‑specific tools were built, experts consolidated common patterns into a universal launch tool that can be adopted across different games, reducing maintenance overhead.
3. Self‑Service Change Release
Similar automation applies to scaling, configuration changes, and deployments; any operation that can be expressed as Linux/Windows commands can be wrapped in BlueKing apps, allowing operators to provide solutions rather than manual labor.
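To illustrate wrapping a command as a reusable app, here is a minimal sketch built on Python's standard `subprocess` module. The `CommandApp` class and its result format are hypothetical, not a BlueKing API; the example command runs the current Python interpreter so that it works on both Linux and Windows.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class CommandApp:
    """Hypothetical wrapper that exposes a fixed command as a named app."""
    name: str
    argv: List[str]
    timeout_s: float = 30.0

    def run(self, *extra_args: str) -> Dict:
        """Run the command with extra arguments and return a structured result."""
        proc = subprocess.run(
            [*self.argv, *extra_args],
            capture_output=True,
            text=True,
            timeout=self.timeout_s,
        )
        return {
            "app": self.name,
            "ok": proc.returncode == 0,
            "stdout": proc.stdout.strip(),
            "stderr": proc.stderr.strip(),
        }


# Stand-in for a real release command: echo back the version to deploy.
release_app = CommandApp(
    name="release-version",
    argv=[sys.executable, "-c", "import sys; print(sys.argv[1])"],
)
release_result = release_app.run("v1.2.3")
```

Passing arguments as a list rather than a shell string avoids quoting problems when the requesting team supplies parameters through a self‑service form.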
Early BlueKing tools accounted for 90 % of operations; after standardization, they now represent less than 40 %.
4. Big‑Data‑Assisted Operation
BlueKing’s data platform streams real‑time metrics via Kafka and Storm, enabling multi‑dimensional monitoring, automated capacity expansion decisions, and product‑level user behavior analysis.
Detect a simultaneous rise in online users and drop in logins to trigger alerts.
Calculate required server count based on CPU load and network traffic thresholds.
Segment users by download speed to target retention incentives.
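The three uses above can each be reduced to a small function over already‑aggregated metrics. The sketch below leaves out the Kafka and Storm streaming layer entirely; all function names, thresholds, and units are assumptions made for illustration.

```python
import math
from typing import Dict, List


def login_anomaly(prev_online: int, cur_online: int,
                  prev_logins: int, cur_logins: int,
                  min_change: float = 0.2) -> bool:
    """True when online users rise while logins fall, each by at least min_change."""
    online_up = (cur_online - prev_online) / prev_online >= min_change
    logins_down = (prev_logins - cur_logins) / prev_logins >= min_change
    return online_up and logins_down


def servers_needed(total_cpu_load: float, total_traffic_mbps: float,
                   cpu_per_server: float, traffic_per_server_mbps: float) -> int:
    """Server count that keeps both CPU load and traffic under per-server limits."""
    by_cpu = math.ceil(total_cpu_load / cpu_per_server)
    by_traffic = math.ceil(total_traffic_mbps / traffic_per_server_mbps)
    return max(by_cpu, by_traffic)


def segment_by_speed(speeds_kbps: Dict[str, float],
                     slow_below: float = 200.0) -> Dict[str, List[str]]:
    """Split users into 'slow' and 'normal' download-speed segments."""
    segments: Dict[str, List[str]] = {"slow": [], "normal": []}
    for user, speed in speeds_kbps.items():
        segments["slow" if speed < slow_below else "normal"].append(user)
    return segments


alert = login_anomaly(prev_online=1000, cur_online=1300,
                      prev_logins=500, cur_logins=300)
count = servers_needed(total_cpu_load=640, total_traffic_mbps=9000,
                       cpu_per_server=80, traffic_per_server_mbps=1000)
segments = segment_by_speed({"u1": 120.0, "u2": 800.0, "u3": 199.9})
```

In `servers_needed`, taking the maximum of the two estimates means the tighter constraint drives the expansion decision: here traffic needs 9 servers while CPU needs only 8.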
5. Open Source and Hybrid‑Cloud Plans
BlueKing’s core components (the configuration, job, and control platforms) have been deployed for gaming, finance, e‑commerce, and media customers on both public and private clouds. An open‑source release of the configuration platform is planned for year‑end, pending internal review.
Current focus remains on internal deployments; large‑scale commercial private‑cloud offerings are not planned within the next six months.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.