
Practical Application of Apache Kudu at NetEase: Architecture, Use Cases, Challenges and Future Directions

This article explains Apache Kudu’s architecture, schema design, update mechanism, and how NetEase leverages it for real‑time data ingestion, dimension table joins, data‑warehouse ETL, and A/B testing, while also discussing encountered issues and upcoming feature requests.

DataFunTalk

Kudu Positioning and Architecture

Apache Kudu is a columnar storage engine that integrates with OLAP engines such as Impala, Presto and Spark, offering low‑latency random reads/writes together with high‑throughput batch queries.

Unlike HBase or Cassandra, Kudu requires a declared schema, enabling richer metadata for query optimization, column‑level encoding (bitshuffle, run‑length, dictionary) and space savings.

Key features of its columnar design include space efficiency, predicate push‑down, and vectorized execution.

Kudu Schema and Partitioning

Kudu stores data in tables that are split into tablets; each tablet contains rowsets that can be in‑memory (MemRowSet) or on‑disk (DiskRowSet). Tables are created with range and/or hash partitioning, which can be combined to balance load and support fast look‑ups.
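To make the combined partitioning concrete, here is a minimal Python sketch of how a row could be routed to a tablet by a hash bucket on the key plus a range partition on a time column. The bucket count, range bounds, and hash function are all illustrative assumptions; Kudu uses its own internal hashing and partition metadata.

```python
import hashlib

HASH_BUCKETS = 4
# Hypothetical monthly range partitions, keyed by upper bound (exclusive).
RANGE_UPPER_BOUNDS = ["2023-01", "2023-02", "2023-03"]

def hash_bucket(key: str, buckets: int = HASH_BUCKETS) -> int:
    # Stable hash of the key into a bucket; Kudu uses its own hash internally.
    digest = hashlib.md5(key.encode()).digest()
    return int.from_bytes(digest[:4], "big") % buckets

def range_index(month: str) -> int:
    # First range partition whose upper bound lies above the value.
    for i, bound in enumerate(RANGE_UPPER_BOUNDS):
        if month < bound:
            return i
    return len(RANGE_UPPER_BOUNDS)  # the open-ended last partition

def route(key: str, month: str) -> tuple:
    # A row lands in the tablet identified by (range partition, hash bucket).
    return (range_index(month), hash_bucket(key))
```

Hashing spreads concurrent writes across buckets (avoiding a hot tail tablet), while range partitioning keeps time-bounded scans local to a few tablets.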

Masters manage metadata while tablet servers hold the actual data. Kudu uses the Raft consensus protocol to replicate data across multiple nodes, ensuring high availability.
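With Raft, a tablet stays writable as long as a strict majority of its replicas is alive. The quorum arithmetic is simple enough to sketch:

```python
def majority(replicas: int) -> int:
    # Raft needs a strict majority of replicas to commit a write.
    return replicas // 2 + 1

def tolerated_failures(replicas: int) -> int:
    # Number of replicas that can fail while the tablet stays available.
    return replicas - majority(replicas)
```

With the common replication factor of three, a tablet survives one failed server; five replicas survive two.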

Update Design

Updates are not applied in place. Each update to an already‑flushed row is recorded as a REDO delta in the DeltaMemStore and later flushed to delta files; during compaction, applied deltas are rewritten as UNDO records so that older snapshots can still be reconstructed. Base data and its UNDO/REDO deltas live together in the same RowSet, so point lookups remain fast and lock‑free, although a read must merge the base row with every delta visible at its snapshot timestamp.

Kudu also performs LSM‑style compactions to keep read costs bounded: minor delta compaction merges small delta files, major delta compaction applies accumulated deltas to the base column data, and merging compaction combines overlapping DiskRowSets.
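As a rough mental model of how a snapshot read combines base data with deltas, here is a toy Python sketch. All names and structures are hypothetical simplifications; Kudu's actual C++ implementation is far more involved.

```python
# Hypothetical in-memory model: base data plus timestamped REDO deltas.
base = {"row1": {"clicks": 10}}
redo_deltas = [
    (100, "row1", {"clicks": 11}),  # (commit_ts, key, column updates)
    (200, "row1", {"clicks": 12}),
]

def read_at(key: str, snapshot_ts: int) -> dict:
    # Reconstruct the row as of `snapshot_ts` by replaying committed deltas.
    row = dict(base[key])
    for ts, k, update in redo_deltas:
        if k == key and ts <= snapshot_ts:
            row.update(update)
    return row
```

A reader at timestamp 50 sees the base row untouched; a reader at 150 sees the first update applied but not the second. Compaction exists precisely to keep the list of deltas a read must replay short.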

Production Practices at NetEase

Real‑time Data Ingestion

Before Kudu, user‑behavior logs were written to HBase and later copied to OLAP stores, causing latency. With Kudu, streaming engines write directly to Kudu, enabling immediate updates and analytical queries.

Dimension Table Joins

Kudu synchronizes MySQL dimension tables via NDC, allowing real‑time joins between event logs and reference data without an extra ETL step.
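The join pattern itself is straightforward; here is a toy sketch in which a plain dict stands in for the Kudu-resident dimension table (field names and the `enrich` helper are illustrative, not NetEase's actual code):

```python
# `dim_table` stands in for a Kudu table that NDC keeps in sync with MySQL.
dim_table = {
    1: {"product": "search", "owner": "team-a"},
    2: {"product": "feed", "owner": "team-b"},
}

def enrich(event: dict) -> dict:
    # Join a streaming event with its dimension attributes at read time.
    return {**event, **dim_table.get(event["product_id"], {})}
```

Because Kudu supports low-latency point lookups and in-place upserts, the dimension data stays current without periodically rebuilding a snapshot, which is what removes the extra ETL step.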

Real‑time Data Warehouse ETL

Replacing Oracle with Kudu improved scalability and integration with the Hadoop ecosystem.

A/B Testing Pipeline

Switching from an HDFS‑Spark pipeline to a Kafka‑Flink‑Kudu architecture reduced data latency from days to near‑real‑time, with Kudu handling both incremental updates and out‑of‑window corrections.

Encountered Issues

Load imbalance: Range‑hash partitions caused hotspot nodes; a custom load‑balancing algorithm redistributed tablets evenly.
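The article does not describe the algorithm NetEase used, but a common greedy approach (an assumption here, not their actual code) repeatedly moves one tablet from the most‑loaded to the least‑loaded server:

```python
def rebalance(load: dict) -> tuple:
    # load: tablet-server -> tablet count. Greedily move one tablet at a time
    # from the most-loaded to the least-loaded server until counts differ by <= 1.
    load = dict(load)
    moves = []
    while True:
        hi = max(load, key=load.get)
        lo = min(load, key=load.get)
        if load[hi] - load[lo] <= 1:
            return load, moves
        load[hi] -= 1
        load[lo] += 1
        moves.append((hi, lo))
```

A production balancer would additionally weight moves by tablet size and avoid co-locating replicas of the same tablet, but the core idea is the same: minimize the spread of per-server load.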

Complex table design: Lack of secondary indexes required careful primary‑key and partition‑key choices; NetEase implemented internal secondary indexes to simplify design.

Beyond these two issues, each business scenario still required bespoke schema design, since no single primary‑key and partitioning layout suits every workload.

Future Outlook

Bloom Filter Support

Bloom filters can be pushed down to storage to filter large tables during joins; recent Kudu releases already support this.
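To illustrate the mechanism, here is a minimal Bloom filter sketch in Python. This is only a conceptual model; Kudu's runtime-filter implementation differs in hashing and layout.

```python
import hashlib

class BloomFilter:
    # Minimal Bloom filter: k hash probes set/check bits in an integer bitmap.
    def __init__(self, size: int = 1024, hashes: int = 3):
        self.size, self.hashes, self.bits = size, hashes, 0

    def _probes(self, item: str):
        for i in range(self.hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for p in self._probes(item):
            self.bits |= 1 << p

    def might_contain(self, item: str) -> bool:
        # False means definitely absent; True may be a false positive.
        return all(self.bits >> p & 1 for p in self._probes(item))
```

During a join, the engine can build such a filter from the small side's join keys and push it into scans of the large Kudu table, so rows that cannot possibly match are skipped at the storage layer instead of being shipped to the query engine.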

Flexible Hash Buckets

Allowing dynamic adjustment of hash bucket counts per range partition is a requested improvement (KUDU‑2671).

Multi‑row Transactions

Kudu currently lacks multi‑row transaction support; implementation is tracked under KUDU‑2612.

Schema Flexibility

Supporting tables without a primary key would broaden use cases (KUDU‑1879).

Tags: Big Data, Real-time Analytics, Data Warehouse, Distributed Storage, NetEase, Apache Kudu
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.