How Tencent’s Blue Whale Powers Unattended Ops, SaaS Automation, and DevOps Value
The talk outlines Tencent’s Blue Whale platform, describing how automated publishing tools, unattended change processes, fault‑handling strategies, alert‑driven self‑healing, low‑cost tool culture, and a thriving DevOps ecosystem together transform operations from routine maintenance to high‑value, scalable services.
4. SaaS
4.1 Publishing Tool
The publishing tool is a low‑cost system, built in a few hours, that simply invokes scripts or other actions to reduce manual effort. Once the work is turned into a scheduled automation, operators configure the workflow once and hand execution to an outsourced role, which frees them from repetitive tasks.
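A minimal sketch of such a "configure once, replay on demand" job, assuming each action is an ordinary command-line script. The `PublishJob` class and the two stand-in commands are hypothetical and only illustrate the idea; they are not Blue Whale's actual interface.

```python
import subprocess
import sys
from dataclasses import dataclass, field


@dataclass
class PublishJob:
    """A named list of commands, configured once and replayed on demand."""
    name: str
    commands: list
    log: list = field(default_factory=list)

    def run(self) -> bool:
        """Run each command in order; stop at the first non-zero exit code."""
        for cmd in self.commands:
            result = subprocess.run(cmd, capture_output=True, text=True)
            self.log.append(result.stdout.strip())
            if result.returncode != 0:
                return False
        return True


# Stand-in commands: a real job would call the team's own deploy scripts.
job = PublishJob(
    name="publish-demo",
    commands=[
        [sys.executable, "-c", "print('sync package')"],
        [sys.executable, "-c", "print('restart service')"],
    ],
)
ok = job.run()
```

Because the job is plain data plus a `run` method, whoever triggers it needs no knowledge of the scripts behind it, which is what makes handing execution to another role practical.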
When many such systems exist, they are gradually integrated. Because the business deals with non‑standardized workloads, the SaaS layer standardizes processes step by step; for example, various expansion operations are consolidated into a common scaling system.
Our philosophy differs from that of many companies, which standardize first and then automate: we automate first, then standardize.
Unattended Change
We offer a service called "open zone" for game servers, which traditionally takes a long time. Operators break the whole operation into atomic steps, externalize the deployment, and embed decision logic so the process can run without human intervention.
For example, the fourth node in the workflow is a WeChat review: until the unattended flow is trusted, the product team must approve each run to avoid uncontrolled mass openings.
Interactive actions such as WeChat or SMS confirmations become automated nodes. During debugging, a button can bypass the confirmation step, achieving fully unattended zone openings for many games.
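The flow described above can be sketched as a list of atomic steps in which the confirmation is just another node. The step names, the `approve` callback (standing in for the WeChat or SMS confirmation), and the `skip_approval` flag (standing in for the bypass button) are all illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    action: Callable[[], bool]      # returns True on success
    needs_approval: bool = False    # e.g. a WeChat or SMS confirmation


def run_workflow(steps, approve, skip_approval=False):
    """Run steps in order and return the names of the steps that completed.

    `approve(step_name)` stands in for the interactive confirmation.
    `skip_approval=True` mimics the bypass button.
    """
    done = []
    for step in steps:
        if step.needs_approval and not skip_approval:
            if not approve(step.name):
                break   # rejected: stop before running anything further
        if not step.action():
            break       # failed step: halt so a person can investigate
        done.append(step.name)
    return done


steps = [
    Step("allocate_servers", lambda: True),
    Step("deploy_packages", lambda: True),
    Step("init_database", lambda: True),
    Step("product_review", lambda: True, needs_approval=True),
    Step("open_zone", lambda: True),
]

rejected = run_workflow(steps, approve=lambda name: False)
unattended = run_workflow(steps, approve=lambda name: False, skip_approval=True)
```

With the review rejected, the run stops after the third step; with the bypass enabled, all five steps complete without anyone being asked.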
4.2 Fault Handling – How to Build SaaS
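MTBF = MTTF + MTTR, where MTTF is the mean time to failure and MTTR is the mean time to repair. In other words: keep services running as long as possible and fix failures quickly.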
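With MTTF as the mean time to failure and MTTR as the mean time to repair, the standard steady-state availability follows directly from the identity above. The numbers in the comment are illustrative and are not figures from the talk.

```latex
\[
A \;=\; \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}
  \;=\; \frac{\mathrm{MTTF}}{\mathrm{MTBF}}
\]
% Illustrative: MTTF = 999 h, MTTR = 1 h  =>  MTBF = 1000 h,  A = 999/1000 = 0.999
```

Availability rises either by lengthening MTTF or by shortening MTTR; faster repair is the lever that operations controls most directly.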
High availability cannot rely solely on architecture; operations must still intervene when failures occur.
Operations handle three core tasks during incidents:
Process alerts
Set up pre‑warnings
Restore high availability
4.2 Alerting and Self‑Healing
Our alert system categorizes alerts using a tree‑structured workflow. Different operators may construct different fault trees for the same alert based on their business understanding.
Fault‑tree configuration can be done by writing code (the most flexible) or by dragging logic in a visual page builder; the system then executes the defined chain automatically.
The correctness and reliability of the fault tree directly affect the self‑healing success rate. Automated handling is much faster than manual intervention, even though it performs the same steps.
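As a rough illustration of the code-based option, a fault tree can be modeled as nested decision nodes whose leaves name a remediation. The `Node` structure, the disk-alert fields, and the thresholds below are hypothetical, not the platform's real schema.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    """Either a decision (check plus two branches) or a leaf (action name)."""
    check: Optional[Callable[[dict], bool]] = None
    if_true: Optional["Node"] = None
    if_false: Optional["Node"] = None
    action: Optional[str] = None


def resolve(node, alert):
    """Walk the tree for one alert and return the action at the leaf reached."""
    while node.action is None:
        node = node.if_true if node.check(alert) else node.if_false
    return node.action


# One operator's tree for a disk alert; another operator could build a
# different tree for the same alert.
disk_tree = Node(
    check=lambda a: a["usage_pct"] >= 90,
    if_true=Node(
        check=lambda a: a["log_dir_gb"] >= 10,
        if_true=Node(action="clean_old_logs"),
        if_false=Node(action="page_operator"),
    ),
    if_false=Node(action="ignore"),
)

first = resolve(disk_tree, {"usage_pct": 95, "log_dir_gb": 40})
second = resolve(disk_tree, {"usage_pct": 95, "log_dir_gb": 2})
third = resolve(disk_tree, {"usage_pct": 50, "log_dir_gb": 2})
```

The tree encodes the same checks an operator would perform by hand, so a wrong branch condition produces a wrong remediation just as quickly as a right one produces a fix.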
We aggregate data from various business systems, parse incoming alerts, match them to known patterns, and trigger appropriate remediation actions—for example, automatically diagnosing network‑related alerts.
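The parse-match-trigger chain might look like the sketch below. The `host | message` alert format, the pattern table, and the remediation names are assumptions made for illustration only.

```python
import re

# Hypothetical pattern table: regex on the alert text -> remediation name.
PATTERNS = [
    (re.compile(r"ping (loss|timeout)", re.I), "run_network_diagnosis"),
    (re.compile(r"disk usage (\d+)%", re.I), "clean_disk"),
    (re.compile(r"process \S+ not running", re.I), "restart_process"),
]


def parse_alert(raw):
    """Split 'host | message' into a dict; the format is an assumption."""
    host, _, message = raw.partition("|")
    return {"host": host.strip(), "message": message.strip()}


def match_remediation(alert):
    """Return the first remediation whose pattern matches, else None."""
    for pattern, remediation in PATTERNS:
        if pattern.search(alert["message"]):
            return remediation
    return None


alert = parse_alert("10.0.0.8 | Ping timeout from gateway")
plan = match_remediation(alert)
unknown = match_remediation(parse_alert("10.0.0.9 | fan speed low"))
```

Alerts that match no known pattern return `None`, which leaves them for a person to handle rather than guessing at a remediation.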
Three independent self‑healing systems are deployed across regions. Each can restore the others, so the capability is lost only if all three fail at the same time.
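One way to picture the mutual-recovery rule is a function that assigns every failed instance to a surviving peer. The region names and the round-robin assignment are illustrative assumptions; the talk does not describe how peers are actually chosen.

```python
def recovery_plan(health):
    """Map each failed instance to a healthy peer that should restore it.

    `health` maps instance name -> True (alive) or False (down).
    Returns {} when nothing is down, and None when no peer is left alive.
    """
    alive = sorted(name for name, ok in health.items() if ok)
    down = sorted(name for name, ok in health.items() if not ok)
    if not down:
        return {}
    if not alive:
        return None   # every instance is lost: only manual recovery remains
    # Spread the failed instances across the surviving peers.
    return {name: alive[i % len(alive)] for i, name in enumerate(down)}


one_down = recovery_plan({"region_a": True, "region_b": False, "region_c": True})
two_down = recovery_plan({"region_a": False, "region_b": False, "region_c": True})
all_down = recovery_plan({"region_a": False, "region_b": False, "region_c": False})
```

Only the last case, where all three instances are down together, yields no plan.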
Our operations support two service types:
Basic services: A 200‑person DevOps ecosystem targets an unattended rate above 60% as a KPI.
Value‑added services: Operations can directly boost business metrics by analyzing user flow, identifying drop‑off points, and providing actionable insights that product teams can use to improve conversion.
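The drop-off analysis mentioned for value-added services reduces to comparing user counts between adjacent funnel steps. The step names and counts below are invented for illustration and are not data from the talk.

```python
def drop_off_report(funnel):
    """Return (from_step, to_step, loss_rate) for each adjacent pair of steps.

    `funnel` is an ordered list of (step_name, user_count).
    """
    report = []
    for (prev_name, prev_n), (next_name, next_n) in zip(funnel, funnel[1:]):
        loss = 0.0 if prev_n == 0 else (prev_n - next_n) / prev_n
        report.append((prev_name, next_name, loss))
    return report


funnel = [
    ("download", 1000),
    ("install", 800),
    ("register", 400),
    ("first_login", 360),
]
report = drop_off_report(funnel)
worst = max(report, key=lambda row: row[2])
```

Here the largest loss is between `install` and `register`, which is the kind of finding a product team can act on.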
4.3 Why Push Costs to the Floor?
1) Validation errors are valuable – Small, low‑cost projects expose flawed ideas early without heavy resource waste.
2) Operations‑centric tools – Tools must be mandatory for product teams but optional for operators, allowing flexibility.
3) Dual value of ops systems – Besides efficiency and safety gains, each system serves as a reference for future platforms; low‑cost systems accelerate this evolution.
4) Maximizing technical ops team efficiency – By enabling ops to develop demos and collaborate with development, the overall workflow improves.
Our ops team is divided into four roles: planning, development, assistance, and business ops. Planners identify automation opportunities, which developers then implement.
4.4 Tool Culture
"Where a tool can be used, a person should not be." When tool‑building costs are low, teams naturally turn small tasks into systems, embedding a strong tool culture.
DevOps Ecosystem
DevOps – a one‑way road.
Once operators experience the personal growth that comes from delivering value‑added services, they rarely go back.
We aim to keep ops staff focused on their own skill set rather than forcing them to learn unrelated languages; low‑cost tools enable sustainable growth.
Internal transformation and external hiring work together to convert existing staff into DevOps roles while recruiting new talent.
Operations deliver three kinds of value:
Self‑service automation that frees operators.
Development of operational systems that relieve product teams of repetitive work.
Data‑driven analysis of operational environments to support business decisions, a capability beyond typical development teams.
That concludes my talk. Thank you!
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.