How NetEase Solved 10 Years of Ops Challenges: From Scripts to Cloud-native Platforms

In this talk, senior NetEase operations engineer Gu Xianjie shares a decade‑long journey tackling technical debt, rapid product growth, and on‑call pain points, describing the evolution from manual scripts to automated platforms, service‑oriented tools, DevOps/SRE practices, and cloud‑native strategies that boosted efficiency and reliability.

Efficient Ops

Feb 26, 2019

Today I will talk not about the scale of our business, but about the problems we ran into in operations, what we learned from them, what we explored, and how we apply the cloud.

1. Challenges of System Operations

NetEase has faced many challenges over its 20‑year history, including massive technical debt, rapid product growth that outpaces staff, sudden demand spikes, and legacy systems such as the long‑standing NetEase Passport. On‑call pain points, communication gaps with developers, and repetitive issue reporting also strain the team.

Resource constraints and unconventional operational practices, such as custom scripts and ad‑hoc fixes, make reliability even harder to achieve. The team also handles dangerous, dull, and "dirty" tasks that are prime candidates for automation.

2. Evolution of Operations Tools

Initially, operations relied on simple scripts to manage a few hundred servers. As the business scaled dramatically, the team moved to automation tools such as CFEngine and Puppet 2.0/3.0, and eventually built internal platforms that support modular, repo‑based management.
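The practical difference between an ad‑hoc script and a configuration‑management tool is convergence: a tool such as CFEngine or Puppet describes a desired state and changes the machine only when it differs from that state. The sketch below illustrates that idea in a few lines of Python. It is not code from the talk; the function name `ensure_line` and the sample file are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def ensure_line(path: str, line: str) -> bool:
    """Make sure `line` is present in the file at `path`.

    Returns True if the file was changed and False if it was already
    in the desired state, so repeated runs are safe (idempotent).
    """
    lines = []
    if os.path.exists(path):
        with open(path, encoding="utf-8") as fh:
            lines = fh.read().splitlines()

    if line in lines:
        return False  # already converged, nothing to do

    lines.append(line)
    with open(path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(lines) + "\n")
    return True


if __name__ == "__main__":
    # Hypothetical demo: the first run changes the file, the second does not.
    demo_path = os.path.join(tempfile.mkdtemp(), "hosts.allow")
    print(ensure_line(demo_path, "sshd: 10.0.0.0/8"))
    print(ensure_line(demo_path, "sshd: 10.0.0.0/8"))
```

A naive script that always appends the line would duplicate it on every run, which is one reason plain scripts become hard to trust as the number of machines grows.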

Tool evolution also addressed bottlenecks: scripts became ineffective beyond 50‑100 machines, prompting the adoption of configuration‑distribution systems and platform services. Cost pressure from leadership drove optimization of server resources and online product consumption.

3. Turning Operations Capabilities into Services

The team pursued service‑oriented approaches in three areas: security, lifecycle maintenance, and foundational services. Projects like the "Zero" platform provide a Rails‑based configuration service, while DDoS automatic defense and log‑analysis pipelines turn security operations into reusable services.

DevOps and SRE practices are integrated across platforms, with RESTful, stateless APIs, strict data formats, and strong authentication (similar to AWS SigV4). Modules are designed to be idempotent, robust, and easily replaceable, enabling rapid iteration and multi‑datacenter support.
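To make the authentication point concrete, here is a minimal sketch of HMAC‑based request signing for a stateless API. It follows the general idea behind AWS SigV4 (sign a canonical form of the request with a shared secret) but is deliberately simplified and is neither SigV4 itself nor NetEase's actual scheme. The function names, the canonical‑string layout, and the 300‑second clock‑skew window are all assumptions made for this example.

```python
import hashlib
import hmac
from urllib.parse import urlencode


def canonical_request(method, path, query, body, timestamp):
    """Build a deterministic string to sign (layout is an assumption)."""
    canonical_query = urlencode(sorted(query.items()))  # stable parameter order
    body_hash = hashlib.sha256(body).hexdigest()
    return "\n".join([method.upper(), path, canonical_query, timestamp, body_hash])


def sign_request(secret, method, path, query, body, timestamp):
    """Return a hex HMAC-SHA256 signature over the canonical request."""
    message = canonical_request(method, path, query, body, timestamp).encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()


def verify_request(secret, signature, method, path, query, body, timestamp,
                   now, max_skew=300):
    """Verify a signature without any server-side session state.

    `timestamp` is epoch seconds as a string; `now` is epoch seconds as an int.
    Requests older or newer than `max_skew` seconds are rejected to limit replay.
    """
    if abs(now - int(timestamp)) > max_skew:
        return False
    expected = sign_request(secret, method, path, query, body, timestamp)
    return hmac.compare_digest(expected, signature)  # constant-time comparison


if __name__ == "__main__":
    # Hypothetical secret, endpoint, and payload for demonstration only.
    secret = b"demo-shared-secret"
    ts = "1700000000"
    sig = sign_request(secret, "POST", "/v1/hosts", {"dc": "hz"}, b"{}", ts)
    print(verify_request(secret, sig, "POST", "/v1/hosts", {"dc": "hz"}, b"{}",
                         ts, now=1700000060))
```

Because everything the server needs is carried in the request and checked against the shared secret, any stateless API node can verify a call, which fits the multi‑datacenter goal described above.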

4. Facing Cloud Challenges

In the cloud era, traditional ops is giving way to DevOps and SRE. NetEase Cloud relies on its own SDN, PaaS layers, and automated workflows to handle account provisioning, cloud‑disk loading, and service delivery. The team is also exploring AIOps‑driven development to bridge operations and business needs.

These cloud‑native initiatives have dramatically improved operational efficiency, reduced incident resolution time, and increased the speed of feature delivery.

Overall, the NetEase experience demonstrates how systematic automation, platform engineering, and cloud‑native practices can transform legacy operations into a scalable, reliable service ecosystem.
