Achieving Non‑Stop Game Maintenance: Practices and Network Solutions at NetEase Games
This article details how NetEase Games implemented non‑stop maintenance for large‑scale online titles by reusing resources with Docker virtualization, evaluating Calico and the Flannel overlay network, and building a custom SDN solution, ultimately improving operational efficiency and player experience.
Game downtime for maintenance has long been a challenge for online titles, causing reduced new user registrations and revenue on update days. NetEase Games addressed this by implementing a non‑stop maintenance strategy for major games such as Chu Liuxiang, Onmyoji, and Tomorrow After, dramatically improving operational efficiency.
The speaker, senior operations manager Richard, explained that traditional approaches rely on selecting low‑traffic windows and shortening maintenance time, which often leads to higher failure rates. Industry solutions like gray‑scale or blue‑green deployments were considered but rejected due to development constraints and resource inefficiencies.
NetEase’s breakthrough was to treat the total number of online users as constant during maintenance, allowing two service versions to share the same total compute resources. Initial attempts at multi‑process servers proved problematic, so the team turned to Docker virtualization and host‑over‑commit to isolate environments while running both old (A) and new (B) services on the same hardware.
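The resource-reuse idea can be illustrated with a small model. This is only a sketch under stated assumptions: the `Host` class, the capacity figure, and the batch size of 500 are hypothetical and do not come from the talk. It shows that while players move from the old service (A) to the new one (B), the combined load on the host never exceeds what A was already serving.

```python
from dataclasses import dataclass


@dataclass
class Host:
    """One physical host running the old (A) and new (B) service side by side."""
    capacity: int        # players the hardware can serve in total (assumed unit)
    players_a: int = 0   # players still connected to the old version
    players_b: int = 0   # players already connected to the new version

    def load(self) -> int:
        """Combined load of both versions on this host."""
        return self.players_a + self.players_b

    def migrate(self, count: int) -> None:
        """Move up to `count` players from A to B; the total never changes."""
        moved = min(count, self.players_a)
        self.players_a -= moved
        self.players_b += moved


host = Host(capacity=5000, players_a=4000)
peak = host.load()
while host.players_a > 0:
    host.migrate(500)              # hypothetical batch size
    peak = max(peak, host.load())  # track the highest combined load seen

print(host.players_a, host.players_b, peak)  # prints: 0 4000 4000
```

Because the peak equals the starting load, both containers can be given limits that together exceed the host's capacity (over-commit) without the actual usage doing so, provided the constant-population assumption holds.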
Extensive testing was performed on Docker performance, compatibility, Kubernetes (K8s) cluster deployment, and an internal image‑management platform. As the cluster grew, network congestion from virtual network devices emerged as a new bottleneck.
To solve networking issues, the team first tried a simple host network approach, which overloaded physical routers. They then evaluated Calico, which automates host‑network configuration but still depends on physical devices, and Flannel, an overlay network that encapsulates packets and stores routing information in etcd. While Flannel reduced physical device usage, it did not fully meet large‑scale deployment needs.
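The overlay idea described for Flannel can be sketched as follows. This is a simplified illustration, not Flannel's actual code: a plain dictionary stands in for the subnet information kept in etcd, and all addresses, `find_host`, and `encapsulate` are hypothetical. The point is that the physical network only ever sees host-to-host packets, which is why an overlay puts less routing state on physical devices.

```python
import ipaddress

# Stand-in for the routing information stored in etcd:
# container subnet -> IP of the host that owns it (all values hypothetical).
SUBNET_LEASES = {
    ipaddress.ip_network("10.244.1.0/24"): "192.168.0.11",
    ipaddress.ip_network("10.244.2.0/24"): "192.168.0.12",
}


def find_host(container_ip: str) -> str:
    """Return the host IP whose leased subnet contains the container IP."""
    addr = ipaddress.ip_address(container_ip)
    for subnet, host_ip in SUBNET_LEASES.items():
        if addr in subnet:
            return host_ip
    raise LookupError(f"no lease covers {container_ip}")


def encapsulate(src_host: str, dst_container: str, payload: bytes) -> dict:
    """Wrap a container packet in an outer header addressed host to host."""
    return {
        "outer_src": src_host,
        "outer_dst": find_host(dst_container),  # resolved from the lease table
        "inner_dst": dst_container,
        "payload": payload,
    }


packet = encapsulate("192.168.0.11", "10.244.2.7", b"hello")
print(packet["outer_dst"])  # prints: 192.168.0.12
```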
Consequently, NetEase developed a custom SDN solution called Gon based on the OpenFlow protocol. The architecture uses Open vSwitch (OVS) on each host, managed by a Network Virtualization Controller (NVC), and introduces elastic virtual gateways (IGW and BGW) for internal‑external traffic separation. The SDN can handle 10 Gbps per gateway, 80 k packets per second, and up to 500 k concurrent connections, supporting VPC, IP reuse, and hybrid Docker‑private‑cloud networking.
Because the maintenance model requires both A and B pods to run simultaneously on the same node, the team combined podAffinity with nodeSelector and podAntiAffinity to enforce strict placement rules, ensuring resource reuse without conflicts.
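One way such placement rules might be expressed is sketched below as a Python function that builds a pod manifest. The talk does not give the actual manifests, so the label keys (`server`, `version`, `pool`), the `build_b_pod` helper, the image name, and the choice of one B pod per node are all assumptions. The affinity field names themselves (`nodeSelector`, `podAffinity`, `podAntiAffinity`, `requiredDuringSchedulingIgnoredDuringExecution`, `topologyKey`) are standard Kubernetes pod spec fields.

```python
import json


def build_b_pod(server_id: str, node_pool: str, image: str) -> dict:
    """Build a manifest for the new-version (B) pod of one game server."""
    # Must share a node with the old-version (A) pod of the same game server.
    same_server_old_version = {
        "labelSelector": {"matchLabels": {"server": server_id, "version": "a"}},
        "topologyKey": "kubernetes.io/hostname",
    }
    # Must not share a node with any other B pod (assumes one server per node).
    other_b_pods = {
        "labelSelector": {"matchLabels": {"version": "b"}},
        "topologyKey": "kubernetes.io/hostname",
    }
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {
            "name": f"{server_id}-b",
            "labels": {"server": server_id, "version": "b"},
        },
        "spec": {
            # Restrict scheduling to a labelled pool of game hosts.
            "nodeSelector": {"pool": node_pool},
            "affinity": {
                "podAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        same_server_old_version
                    ]
                },
                "podAntiAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": [
                        other_b_pods
                    ]
                },
            },
            "containers": [{"name": "game", "image": image}],
        },
    }


manifest = build_b_pod("s1001", "game", "registry.example.com/game:b")
print(json.dumps(manifest, indent=2))
```

In this sketch the required pod affinity pins B to the node where A already runs, the anti-affinity term keeps two B pods from stacking on one node, and the node selector limits both to hosts set aside for game servers.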
The final outcome was a successful non‑stop maintenance process that did not interrupt players, increased player satisfaction, and reduced overtime for the operations team. Additional benefits included faster new‑server provisioning, self‑healing of faulty nodes, API‑driven resource scheduling, and overall higher engineering productivity.
In summary, the solution combined Docker‑based resource reuse, a custom overlay SDN, and advanced Kubernetes scheduling to achieve zero‑downtime maintenance, demonstrating that innovative engineering can turn traditionally labor‑intensive operations into streamlined, value‑adding processes.
NetEase Game Operations Platform
The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.