NetEase Big Data Platform: HDFS Optimization and Practice
This article presents NetEase's big data platform architecture, detailing multi‑layer storage and compute design, HDFS deployment challenges, NameNode and NameSpace performance optimizations, cluster scaling strategies, data tiering, hardware upgrades, and real‑world business use cases, illustrating practical large‑scale big data engineering.
NetEase has operated Hadoop for over ten years, building a comprehensive big data platform that spans six logical layers: application development, application scenarios, data computation, data management, data storage, and data sources, with cross‑cloud deployment and a focus on open‑source principles.
The platform provides visual development tools for SQL‑based data queries and job execution; supports structured, semi‑structured (e.g., JSON), and unstructured (audio/video) data; and integrates auxiliary services such as Azkaban for job scheduling, Kerberos for authentication, and unified metadata management.
HDFS serves as the core distributed storage, with optimizations for high‑throughput workloads (e.g., backing HBase) and a robust architecture featuring active/standby NameNodes, multiple DataNodes, and JournalNodes for high availability.
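The active/standby NameNode layout described above corresponds to the standard HDFS HA-with-QJM deployment. A minimal illustrative `hdfs-site.xml` sketch follows; the nameservice ID and hostnames are placeholders, not NetEase's actual configuration:

```xml
<!-- hdfs-site.xml: illustrative HA sketch; ns1/nn1/nn2/jn* are placeholder names -->
<property>
  <name>dfs.nameservices</name>
  <value>ns1</value>
</property>
<property>
  <name>dfs.ha.namenodes.ns1</name>
  <value>nn1,nn2</value>
</property>
<property>
  <!-- Edits are written to a quorum of JournalNodes -->
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1:8485;jn2:8485;jn3:8485/ns1</value>
</property>
<property>
  <!-- ZKFC-driven automatic failover between active and standby -->
  <name>dfs.ha.automatic-failover.enabled</name>
  <value>true</value>
</property>
```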
Key challenges addressed include rapid cluster scaling to thousands of nodes, managing massive metadata growth, and reducing NameNode startup latency. Optimizations involve parallel loading of FSImage and INODE structures, concurrent validation, multithreaded metadata parsing, and configurable lease handling to minimize RPC overhead during DataNode full‑report cycles.
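Upstream Hadoop (3.3+) ships a comparable parallel FSImage load feature (HDFS-14617); the fragment below illustrates that mechanism and is not necessarily NetEase's exact implementation. Values shown are the upstream defaults:

```xml
<!-- hdfs-site.xml: parallel FSImage loading (Hadoop 3.3+, HDFS-14617) -->
<property>
  <!-- Write the image in multiple sub-sections so it can be loaded in parallel -->
  <name>dfs.image.parallel.load</name>
  <value>true</value>
</property>
<property>
  <!-- Target number of sub-sections per INode section in the image -->
  <name>dfs.image.parallel.target.sections</name>
  <value>12</value>
</property>
<property>
  <!-- Threads used to load image sections concurrently at startup -->
  <name>dfs.image.parallel.threads</name>
  <value>4</value>
</property>
```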
Performance improvements also encompass namespace isolation via Router‑Based Federation (RBF), RPC priority queues, and resource‑aware scheduling to protect critical workloads, achieving up to 80% faster NameNode restarts and 20%+ RPC throughput gains.
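The RPC priority queues mentioned above are typically realized with Hadoop's FairCallQueue, which deprioritizes heavy callers so critical workloads keep getting served. A hedged sketch, assuming the NameNode RPC port is 8020 (substitute your own port in the property names):

```xml
<!-- core-site.xml: FairCallQueue on the NameNode's 8020 RPC port (port is an assumption) -->
<property>
  <name>ipc.8020.callqueue.impl</name>
  <value>org.apache.hadoop.ipc.FairCallQueue</value>
</property>
<property>
  <!-- DecayRpcScheduler ranks callers by recent request volume -->
  <name>ipc.8020.scheduler.impl</name>
  <value>org.apache.hadoop.ipc.DecayRpcScheduler</value>
</property>
<property>
  <!-- Push back on abusive clients instead of queueing their calls -->
  <name>ipc.8020.backoff.enable</name>
  <value>true</value>
</property>
```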
Data management strategies feature tiered storage (hot vs. cold clusters), storage‑compute separation, and the adoption of high‑density hardware (NVMe) to boost I/O performance while reducing costs, with RS(6,3) erasure coding (six data blocks plus three parity blocks) applied to cold data for space efficiency.
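The space efficiency of RS(6,3) over plain 3x replication follows directly from the block arithmetic: replication stores every byte three times, while RS(6,3) stores nine blocks for every six blocks of data. A small illustrative calculation:

```python
def storage_footprint(logical_bytes, data_blocks, parity_blocks):
    """Raw bytes needed to store `logical_bytes` under an RS(data, parity)
    erasure-coding scheme; 3x replication is the special case RS(1, 2)."""
    return logical_bytes * (data_blocks + parity_blocks) / data_blocks

pb = 10**15                                  # 1 PB of cold data
replicated = storage_footprint(pb, 1, 2)     # 3x replication -> 3 PB raw
erasure    = storage_footprint(pb, 6, 3)     # RS(6,3)        -> 1.5 PB raw
print(replicated / pb)               # 3.0
print(erasure / pb)                  # 1.5
print(1 - erasure / replicated)      # 0.5 -> 50% raw-capacity savings
```

Both schemes here tolerate the loss of any two replicas/blocks (RS(6,3) tolerates three), which is why trading replication for erasure coding on cold data cuts raw capacity roughly in half without sacrificing durability.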
Operational monitoring leverages Matrix metrics and custom scripts to track RPC latency, DataNode heartbeats, and system resource usage, enabling rapid detection of anomalies and ensuring round‑the‑clock service availability.
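One common way such scripts obtain RPC latency is the NameNode's built-in `/jmx` endpoint (e.g., `http://<namenode>:9870/jmx`), which exposes `RpcQueueTimeAvgTime` per RPC port. A minimal parsing sketch, assuming port 8020 and exercising it on a fabricated sample in the same shape the endpoint returns:

```python
import json

def rpc_queue_time_avg(jmx_json: str, port: int = 8020) -> float:
    """Extract average RPC queue time (ms) for a NameNode RPC port
    from the JSON body served by the /jmx endpoint."""
    beans = json.loads(jmx_json)["beans"]
    name = f"Hadoop:service=NameNode,name=RpcActivityForPort{port}"
    for bean in beans:
        if bean.get("name") == name:
            return bean["RpcQueueTimeAvgTime"]
    raise KeyError(name)

# Fabricated sample payload for illustration only:
sample = json.dumps({"beans": [
    {"name": "Hadoop:service=NameNode,name=RpcActivityForPort8020",
     "RpcQueueTimeAvgTime": 1.7}
]})
print(rpc_queue_time_avg(sample))  # 1.7
```

In practice the JSON would be fetched over HTTP and the extracted value compared against an alert threshold; a rising queue-time average is an early signal that the NameNode's RPC handlers are saturating.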
Real‑world use cases span NetEase's consumer services (Music, News, Yanxuan) and enterprise offerings (supply chain, finance, media), demonstrating the platform's ability to handle petabyte‑scale workloads with high reliability.
The article concludes with a call for community engagement and highlights upcoming resources such as a downloadable big‑data PPT collection.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.