
Practical Experience in Operating NetEase's Big Data Platform: Architecture, EasyOps, Monitoring, and Optimization

This presentation by NetEase senior SRE Jin Chuan details the current state of NetEase's big data platform, introduces the internally built EasyOps management system, explains a generic Ansible‑based operation framework, describes Prometheus/Grafana monitoring and alerting, and shares practical lessons on network, storage, and cloud migration for large‑scale Hadoop services.

DataFunTalk

Jin Chuan, a senior SRE from NetEase Hangzhou Research Institute's big data infrastructure team, delivered a comprehensive talk on the practical operation of NetEase's big data platform.

The agenda covered five parts: an overview of NetEase's big data applications, the status and design of the internally named EasyOps platform, a generic big data service operation framework, a Prometheus‑based monitoring and alerting solution, and finally, hands‑on operational experience.

The platform supports major NetEase products such as Cloud Music and Yanxuan, built on a Hadoop ecosystem with over 22 components and an internal "YouShu" middle platform with about 27 components. Offline clusters are divided into six groups, while two real‑time clusters run Spark Streaming or Flink jobs.

EasyOps was created to replace Ambari, addressing long-standing deployment and management pain points. It provides service instance details, host inventories, configuration management (including version history and the propagation of arbitrary parameter changes), and Grafana dashboards for unified monitoring.

The generic operation framework is built on Ansible. Roles are organized per service with standard directories: defaults (default variables), tasks (task scripts), templates (configuration templates), and vars (dynamic variables). This structure enables rapid development of service‑specific operation playbooks.
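As a sketch of what such a role might look like, here is a hypothetical `hdfs-datanode` role following the standard layout described above (the service name, variable names, and file contents are illustrative, not taken from NetEase's actual playbooks):

```yaml
# roles/hdfs-datanode/ layout (illustrative):
#   defaults/main.yml   - default variables
#   tasks/main.yml      - task scripts
#   templates/          - configuration templates (e.g. hdfs-site.xml.j2)
#   vars/main.yml       - dynamic variables

# roles/hdfs-datanode/tasks/main.yml -- a minimal sketch
- name: Render HDFS configuration from template
  template:
    src: hdfs-site.xml.j2
    dest: "{{ hadoop_conf_dir }}/hdfs-site.xml"
  notify: restart datanode

- name: Ensure the DataNode service is running
  service:
    name: hadoop-hdfs-datanode
    state: started
    enabled: true
```

With a per-service role in this shape, a new operation playbook typically only needs to compose existing roles and override a few variables.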

Monitoring and alerting rely on Prometheus, Grafana, and a customized time‑series database (NTSDB, derived from InfluxDB). A high‑availability architecture uses a watchdog to restart failed Prometheus instances and an alarm manager that triggers Ansible to redeploy missing instances. Metrics are collected directly from services exposing Prometheus endpoints, while JVM metrics use the Micrometer plugin.
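To make the flow concrete, a minimal sketch of the two pieces involved is shown below, assuming a hypothetical `hdfs-datanode` scrape job (host names, ports, and thresholds are placeholders, not NetEase's actual configuration):

```yaml
# prometheus.yml -- scrape a service that exposes a Prometheus endpoint
scrape_configs:
  - job_name: hdfs-datanode
    static_configs:
      - targets: ['dn01:9864', 'dn02:9864']

# rules/availability.yml (a separate rule file) -- fires when an instance
# disappears; an alert handler could then trigger Ansible to redeploy it
groups:
  - name: availability
    rules:
      - alert: InstanceDown
        expr: up{job="hdfs-datanode"} == 0
        for: 5m
        labels:
          severity: critical
```

The `up` metric is generated by Prometheus itself for every scrape target, which makes it a convenient signal for the redeploy-on-failure loop described above.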

Log collection employs an internal DSAgent that forwards logs to Kafka, where custom analysis pipelines extract anomalies and aggregate metrics, storing results in Elasticsearch, NTSDB, or MySQL for visualization and alerting via Grafana.
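The anomaly-extraction and aggregation steps in such a pipeline can be sketched as pure functions over log lines. The log-line format and function names below are hypothetical stand-ins (the real DSAgent output format is not described in the talk):

```python
import re
from collections import Counter

# Hypothetical log-line format: "<date> <time> <LEVEL> <message>".
# The real DSAgent/Kafka payload format may differ.
LOG_RE = re.compile(r"^(?P<ts>\S+ \S+) (?P<level>INFO|WARN|ERROR) (?P<msg>.*)$")

def aggregate_levels(lines):
    """Count log lines per severity level; unparsable lines are tallied
    under 'UNPARSED'. The result could be written to NTSDB or MySQL."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        counts[m.group("level") if m else "UNPARSED"] += 1
    return dict(counts)

def extract_anomalies(lines):
    """Return the messages of ERROR-level lines, a stand-in for the
    anomaly-extraction step feeding Elasticsearch."""
    return [
        m.group("msg")
        for m in (LOG_RE.match(line) for line in lines)
        if m and m.group("level") == "ERROR"
    ]
```

In production these functions would run inside a Kafka consumer; here they are kept free of I/O so the parsing logic is testable in isolation.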

Operational lessons include adopting a spine‑leaf network architecture to handle the massive east‑west and north‑south traffic of Hadoop workloads, separating storage and compute to achieve at least 20% cost savings, and using HDFS Router/Federation and Yarn Node Labels for efficient resource management.
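For reference, the HDFS Router and Yarn Node Label mechanisms are driven by admin CLI commands along these lines (a sketch, not meant to be run verbatim: label names, node names, ports, and mount paths are placeholders, and the commands assume the node-label store and Router-based federation are already enabled):

```shell
# Register node labels with the ResourceManager
yarn rmadmin -addToClusterNodeLabels "ssd(exclusive=true),streaming"

# Pin specific NodeManagers to a label so queues can target them
yarn rmadmin -replaceLabelsOnNode "nm01:8041=ssd nm02:8041=streaming"

# Add an HDFS Router mount point mapping a global path to a nameservice
hdfs dfsrouteradmin -add /warehouse ns1 /warehouse
```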

For cloud migration, Alluxio is introduced as an abstraction layer to hide underlying storage differences (e.g., S3, OBS, OSS), allowing compute frameworks to run unchanged on various cloud providers.
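A minimal sketch of this pattern, assuming an S3-compatible object store (the bucket name and endpoint are placeholders): Alluxio's root mount points at the cloud store, so jobs keep addressing `alluxio://` paths regardless of the provider underneath.

```properties
# alluxio-site.properties (illustrative values)
alluxio.master.mount.table.root.ufs=s3://warehouse-bucket/
alluxio.underfs.s3.endpoint=s3.example-cloud.com
```

Switching providers then becomes a change to the under-storage mount rather than to every compute job.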

In conclusion, the speaker summarized key performance‑optimization principles and invited the audience to continue the discussion.

Tags: monitoring, Big Data, SRE, Prometheus, cloud, platform operations, Ansible
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
