NetEase Data Infrastructure: Database Technologies and Big Data Platform Overview
This article presents NetEase Hangzhou Research Institute's experience in building a data infrastructure, covering database innovations such as InnoSQL, NTSDB, and InnoRocks, as well as the integration of big‑data components like HDFS, Spark, Impala, and Kudu to enable efficient storage, processing, and real‑time analytics.
NetEase Hangzhou Research Institute focuses on platform technology development and data governance solutions, aiming to reduce user costs, enrich platform functions, and meet large‑scale usage demands.
The institute has developed several database technologies, including InnoSQL (a MySQL branch), the newly created NTSDB time‑series database, and InnoRocks built on RocksDB, which offers higher write performance, better compression (reducing storage to 20‑30% of InnoDB), and more stable latency for workloads such as logs and orders.
Performance tests show that InnoRocks writes faster than InnoDB, while InnoDB remains faster for reads. InnoRocks also uses significantly less storage, which makes it suitable for workloads that need high write throughput or high compression, or that need a low-latency store to replace a cache.
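To make the compression figure concrete, here is a back-of-the-envelope sizing sketch. It assumes only the 20-30% ratio quoted above; the function name and the 1,000 GB input are hypothetical, and real savings depend on the data and the compression settings.

```python
def estimate_innorocks_size_gb(innodb_size_gb, ratio_low=0.20, ratio_high=0.30):
    """Return a (low, high) estimate of the InnoRocks footprint in GB.

    The default ratios come from the 20-30% figure quoted in the article;
    they are not a guarantee for any particular data set.
    """
    if innodb_size_gb < 0:
        raise ValueError("size must be non-negative")
    return innodb_size_gb * ratio_low, innodb_size_gb * ratio_high


# Hypothetical example: a 1,000 GB InnoDB table of order or log data.
low_gb, high_gb = estimate_innorocks_size_gb(1000)
print(f"Estimated InnoRocks footprint: {low_gb:.0f}-{high_gb:.0f} GB")
```

Under that assumption, a 1,000 GB InnoDB table would occupy roughly 200 to 300 GB in InnoRocks.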
The big-data platform integrates open-source components that the team has hardened for production use and patched internally. Data ingestion is handled by NDC, which performs full imports from sources such as Oracle and log files, while storage relies on HDFS and HBase. Offline computation runs on Spark, and self-service analytics are built on Impala and Kudu.
Impala serves ad-hoc queries on massive data sets. Its strengths are stateless nodes, metadata caching, Hive compatibility, and operator push-down; its weaknesses are the lack of a unified master and coarse-grained permissions. To address these gaps, the team added ZooKeeper-based load balancing, SQL persistence, fine-grained permission control, and metadata synchronization with Hive.
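Because Impala nodes are stateless, any live node can accept a query, which is what makes registry-based load balancing possible. The following is a minimal sketch of that idea, not the team's implementation: the `ImpaladRegistry` class and the host names are hypothetical, and an in-memory list stands in for the membership data that a ZooKeeper-backed deployment would keep in the coordination service.

```python
import threading


class ImpaladRegistry:
    """In-memory stand-in for a ZooKeeper-backed list of live Impala nodes.

    Hypothetical sketch: in a real deployment, membership would be driven by
    the coordination service rather than by explicit register/deregister calls.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._nodes = []
        self._next = 0

    def register(self, address):
        """Add a node to the live set (no-op if already present)."""
        with self._lock:
            if address not in self._nodes:
                self._nodes.append(address)

    def deregister(self, address):
        """Remove a node, for example after it fails a health check."""
        with self._lock:
            if address in self._nodes:
                self._nodes.remove(address)

    def pick(self):
        """Return the next live node in round-robin order."""
        with self._lock:
            if not self._nodes:
                raise RuntimeError("no live Impala nodes registered")
            node = self._nodes[self._next % len(self._nodes)]
            self._next += 1
            return node


registry = ImpaladRegistry()
for host in ("impalad-1:21050", "impalad-2:21050", "impalad-3:21050"):
    registry.register(host)

first_four = [registry.pick() for _ in range(4)]  # cycles through all three, then wraps
registry.deregister("impalad-2:21050")            # simulate a node going away
after_failure = [registry.pick() for _ in range(2)]
```

Round-robin is the simplest policy to show; a production balancer might instead weight nodes by load or running query count.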
Kudu provides a storage layer that supports both batch and real-time analytics and allows joint queries across HDFS/Hive and Kudu tables. Its columnar storage, together with tablet splitting, TTL, runtime filters, and bitmap indexes, improves query performance, especially for multi-dimensional filtering, with speedups of up to ten times in TPC-H benchmarks.
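To illustrate why a bitmap index helps multi-dimensional filtering, here is a minimal, self-contained sketch. It is not Kudu's implementation: the `BitmapIndex` class, the column names, and the sample rows are all hypothetical, and Python integers stand in for the compressed bitmaps a real engine would use. Each (column, value) pair gets one bitmap in which bit i is set when row i has that value, so a conjunction of equality predicates reduces to bitwise AND.

```python
from collections import defaultdict


class BitmapIndex:
    """Toy bitmap index: one integer bitmap per (column, value) pair."""

    def __init__(self):
        # column name -> value -> bitmap stored as a Python int
        self._bitmaps = defaultdict(lambda: defaultdict(int))
        self._row_count = 0

    def add_row(self, row):
        """Index a row (a dict of column -> value) and return its row id."""
        row_id = self._row_count
        for column, value in row.items():
            self._bitmaps[column][value] |= 1 << row_id
        self._row_count += 1
        return row_id

    def match(self, **predicates):
        """Return ids of rows satisfying every column == value predicate."""
        result = (1 << self._row_count) - 1  # start with all rows selected
        for column, value in predicates.items():
            # An unknown column or value yields an empty bitmap (0).
            result &= self._bitmaps.get(column, {}).get(value, 0)
        return [i for i in range(self._row_count) if (result >> i) & 1]


index = BitmapIndex()
for row in [
    {"region": "east", "device": "ios", "status": "paid"},
    {"region": "east", "device": "android", "status": "paid"},
    {"region": "west", "device": "ios", "status": "paid"},
    {"region": "east", "device": "ios", "status": "refunded"},
]:
    index.add_row(row)

east_ios = index.match(region="east", device="ios")                      # rows 0 and 3
east_ios_paid = index.match(region="east", device="ios", status="paid")  # row 0 only
```

Adding another filter dimension costs one more AND over the bitmaps instead of another pass over the rows, which is why this kind of index suits multi-dimensional filtering.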
The author, Jiang Hongxiang, is the chief architect of NetEase Data Science Center and co‑author of "MySQL Kernel: InnoDB Storage Engine Vol.1," leading the development of database kernels, time‑series databases, and big‑data platforms.