Databases 20 min read

How NetEase Built an Automated DBA Platform with AIOps for Massive Scale

This article details NetEase's journey in designing and implementing a large‑scale database automation platform, covering its requirements, tool‑based operations, architecture, AIOps integration, and the practical lessons learned for managing thousands of database clusters efficiently.

Efficient Ops

Database Automation Platform Requirements and Design Goals

NetEase outlines the challenges faced by DBAs, including massive service volume, complex distributed environments, limited DBA resources, and a wide range of operational tasks such as deployment, change, permission management, backup, scaling, migration, and troubleshooting. To address these, the platform aims to increase automation, enforce standardized procedures, improve workflow efficiency, and enhance overall DBA control across thousands of clusters.

Tool‑Based Operations Phase

NetEase describes its evolution from manual and semi‑automatic scripts to a more systematic approach using CMDB, Zabbix for monitoring, Fabric for distributed operations, MHA for high‑availability switching, DataX for data import/export, and various open‑source tools for backup, recovery, and schema changes. The platform also integrates custom scripts to handle large‑scale change requests, often processing hundreds of change tickets per day.
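One way to picture handling hundreds of change tickets per day is to group them by target cluster, so each cluster receives a single ordered batch. The sketch below is illustrative only: the `ChangeTicket` fields, the `group_tickets` helper, and the sample statements are assumptions, not NetEase's actual ticket schema or scripts.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeTicket:
    # Hypothetical ticket shape; the article does not describe the real schema.
    ticket_id: int
    cluster: str
    statement: str


def group_tickets(tickets):
    """Group tickets by target cluster, ordered by ticket id within each cluster."""
    batches = defaultdict(list)
    for ticket in sorted(tickets, key=lambda t: t.ticket_id):
        batches[ticket.cluster].append(ticket)
    return dict(batches)


# Sample data, invented for illustration.
tickets = [
    ChangeTicket(3, "orders", "ALTER TABLE t_order ADD COLUMN note VARCHAR(64)"),
    ChangeTicket(1, "users", "CREATE INDEX idx_email ON t_user (email)"),
    ChangeTicket(2, "orders", "ALTER TABLE t_order ADD INDEX idx_ctime (ctime)"),
]

batches = group_tickets(tickets)
for cluster, batch in batches.items():
    print(cluster, [t.ticket_id for t in batch])
```

Batching per cluster keeps changes to the same cluster in a predictable order, while batches for different clusters could be dispatched independently.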

Building the DBA Automation Platform

The architecture consists of three layers: a front‑end DBA management portal, a user ticket system, and a web‑based data query platform. Core components include a CMDB for automatic asset discovery, an alerting system, and a client‑side agent that replaces SSH for remote execution. The platform automates configuration collection, health monitoring, backup management, and slow‑query analysis, feeding data into centralized storage for reporting and risk assessment.
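Slow-query analysis usually starts by collapsing statements that differ only in their literal values into one fingerprint, then ranking fingerprints by total time. The following is a minimal sketch of that general technique, not the platform's implementation; `fingerprint`, `top_slow_queries`, and the sample entries are hypothetical, and the regular expressions cover only simple literals.

```python
import re
from collections import Counter


def fingerprint(sql):
    """Normalize a SQL statement so queries differing only in literals match."""
    s = sql.strip().lower()
    s = re.sub(r"'(?:[^'\\]|\\.)*'", "?", s)              # single-quoted strings
    s = re.sub(r"\b\d+(?:\.\d+)?\b", "?", s)              # standalone numbers
    s = re.sub(r"\(\s*\?(?:\s*,\s*\?)*\s*\)", "(?+)", s)  # lists of placeholders
    s = re.sub(r"\s+", " ", s)                            # collapse whitespace
    return s


def top_slow_queries(entries, limit=3):
    """entries: iterable of (sql, seconds). Rank fingerprints by total time."""
    total = Counter()
    for sql, seconds in entries:
        total[fingerprint(sql)] += seconds
    return total.most_common(limit)


# Sample slow-log entries, invented for illustration.
entries = [
    ("SELECT * FROM t_order WHERE id = 42", 1.5),
    ("select * from t_order  where id = 7", 2.5),
    ("SELECT name FROM t_user WHERE email = 'a@b.com'", 0.5),
    ("SELECT * FROM t_order WHERE id IN (1, 2, 3)", 1.0),
]

ranked = top_slow_queries(entries)
for pattern, seconds in ranked:
    print(f"{seconds:6.1f}s  {pattern}")
```

Aggregating by fingerprint rather than by raw text means one frequently repeated query pattern surfaces at the top of a report even when every execution carries different parameters.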

NetEase AIOps Exploration and DBA Platform

NetEase integrates AIOps to correlate massive monitoring alerts, perform root‑cause analysis, and trigger automated remediation such as scaling or rate‑limiting. By feeding enriched database metrics into the AIOps engine, the system can identify faulty modules, suggest self‑healing actions, and provide detailed context to both operations and development teams, improving overall service reliability.
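Alert correlation and root-cause analysis can be illustrated with a simple rule: group alerts that arrive close together in time, then treat as likely root causes the alerting components whose own dependencies are quiet. The sketch below assumes a hand-written dependency map; `DEPENDS_ON`, the component names, and the 60-second window are all hypothetical, and a production AIOps engine would rely on far richer signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: float  # seconds since epoch
    component: str
    message: str


# Hypothetical topology: each component maps to the components it depends on.
DEPENDS_ON = {
    "web-api": ["order-service"],
    "order-service": ["mysql-orders"],
    "mysql-orders": [],
}


def correlate(alerts, window=60.0):
    """Split time-sorted alerts into groups; a gap longer than window starts a new group."""
    groups, current = [], []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if current and alert.timestamp - current[-1].timestamp > window:
            groups.append(current)
            current = []
        current.append(alert)
    if current:
        groups.append(current)
    return groups


def root_causes(group, depends_on=DEPENDS_ON):
    """Return alerting components none of whose direct dependencies are alerting."""
    alerting = {a.component for a in group}
    return sorted(
        c for c in alerting
        if not any(dep in alerting for dep in depends_on.get(c, []))
    )


# Sample alerts, invented for illustration.
alerts = [
    Alert(100.0, "web-api", "5xx rate high"),
    Alert(110.0, "mysql-orders", "replication lag"),
    Alert(105.0, "order-service", "timeouts"),
    Alert(500.0, "web-api", "latency high"),
]

groups = correlate(alerts)
for group in groups:
    print([a.component for a in group], "->", root_causes(group))
```

In the first group all three tiers alert within seconds of each other, so only the database is reported as the candidate root cause; the later isolated alert is attributed to the web tier itself.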

Tags: monitoring, operations, scalability, AIOps, DBA, Database Automation
Written by

Efficient Ops

This public account is maintained by Xiaotianguo and friends and regularly publishes widely read original technical articles. It focuses on operations transformation and aims to support readers throughout their operations careers.
