How NetEase Solved 10 Years of Ops Challenges: From Scripts to Cloud-native Platforms
In this talk, senior NetEase operations engineer Gu Xianjie shares a decade‑long journey tackling technical debt, rapid product growth, and on‑call pain points, describing the evolution from manual scripts to automated platforms, service‑oriented tools, DevOps/SRE practices, and cloud‑native strategies that boosted efficiency and reliability.
Today I will talk less about the scale of our business and more about the problems we encountered in operations, our reflections and explorations, and how we use the cloud.
1. Challenges of System Operations
NetEase has faced many challenges over its 20‑year history, including massive technical debt, rapid product growth that outpaces staff, sudden demand spikes, and legacy systems such as the long‑standing NetEase Passport. On‑call pain points, communication gaps with developers, and repetitive issue reporting also strain the team.
Resource constraints and unconventional operational strategies, such as custom scripts and ad‑hoc fixes, further complicate reliability. The team also deals with dangerous, dull, and "dirty" tasks that demand automation.
2. Evolution of Operations Tools
Initially, operations relied on simple scripts for a few hundred servers. As the business scaled dramatically, the team moved to automation tools such as CFEngine and Puppet 2.0/3.0, and eventually built internal platforms that support modular, repo‑based management.
Tool evolution also addressed bottlenecks: scripts became ineffective beyond 50‑100 machines, prompting the adoption of configuration‑distribution systems and platform services. Cost pressure from leadership drove optimization of server resources and online product consumption.
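One way to see why configuration management tools scale where ad hoc scripts do not is that they describe a desired state and converge toward it, so the same run can be repeated safely on any number of machines. The sketch below is only an illustration of that idea, not NetEase's tooling: the function name `ensure_file` and the example path are hypothetical.

```python
import os
import tempfile


def ensure_file(path: str, content: str) -> bool:
    """Converge `path` to `content`; return True only if a change was made.

    Hypothetical helper illustrating desired-state (idempotent) management.
    """
    desired = content.encode("utf-8")
    try:
        with open(path, "rb") as f:
            if f.read() == desired:
                return False  # already in the desired state, nothing to do
    except FileNotFoundError:
        pass  # file is missing, so it must be created

    # Write to a temporary file in the same directory, then rename it over
    # the target so readers never observe a half-written file.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as tmp:
        tmp.write(desired)
    os.replace(tmp_path, path)
    return True


if __name__ == "__main__":
    workdir = tempfile.mkdtemp()
    target = os.path.join(workdir, "example.conf")  # hypothetical config file
    print(ensure_file(target, "max_connections = 100\n"))  # first run changes the file
    print(ensure_file(target, "max_connections = 100\n"))  # second run is a no-op
```

Because the second run reports no change, the same definition can be pushed repeatedly to every host without tracking which machines have already been updated, which is the property a one-off script usually lacks.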
3. Turning Operations Capabilities into Services
The team pursued service‑oriented approaches in three areas: security, lifecycle maintenance, and foundational services. Projects like the "Zero" platform provide a Rails‑based configuration service, while DDoS automatic defense and log‑analysis pipelines turn security operations into reusable services.
DevOps and SRE practices are integrated across platforms, with RESTful, stateless APIs, strict data formats, and strong authentication (similar to AWS SigV4). Modules are designed to be idempotent, robust, and easily replaceable, enabling rapid iteration and multi‑datacenter support.
4. Facing Cloud Challenges
In the cloud era, traditional ops is being replaced by DevOps and SRE. NetEase Cloud leverages its own SDN network, PaaS layers, and automated workflows to handle account provisioning, cloud‑disk loading, and service delivery. The team also explores AIOps‑driven development to bridge operations and business needs.
These cloud‑native initiatives have dramatically improved operational efficiency, reduced incident resolution time, and increased the speed of feature delivery.
Overall, the NetEase experience demonstrates how systematic automation, platform engineering, and cloud‑native practices can transform legacy operations into a scalable, reliable service ecosystem.
Efficient Ops
This public account is maintained by Xiaotianguo and friends, regularly publishing widely-read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together happily.