Hudi Data Lake Implementation and Optimization Practice at vivo
Vivo's big-data team deployed Apache Hudi to build a lakehouse that unifies streaming and batch workloads. It leverages COW and MOR storage modes, automates small-file clustering and compaction, and applies extensive version, streaming, batch, and lifecycle optimizations, delivering minute-level streaming latency, ingestion of 100 million records per minute, and query speeds up to 20%+ faster than Hive.
This article introduces how vivo's big data team implemented Apache Hudi to enable lakehouse acceleration for business departments, focusing on streaming-batch integration, real-time pipeline optimization, and wide-table stitching scenarios.
1. Hudi Basic Capabilities
Streaming-Batch Integration: Unlike Hive, Hudi data written via Spark/Flink can be continuously read using Spark/Flink engines in streaming mode. The same Hudi data source supports both batch and streaming reads.
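The "same source, two read modes" idea can be modeled in a few lines. This is an illustrative toy, not Hudi's API: the class name `ToyLakeTable` and its methods are invented here; in Hudi the equivalent is a snapshot query versus an incremental/streaming query keyed by commit time.

```python
class ToyLakeTable:
    """Toy table whose commits can be read as a full snapshot or as a stream."""

    def __init__(self):
        self.commits = []  # list of (commit_time, records)

    def write(self, commit_time, records):
        self.commits.append((commit_time, records))

    def batch_read(self):
        # Batch mode: full snapshot across all commits.
        return [r for _, recs in self.commits for r in recs]

    def stream_read(self, since_commit):
        # Streaming mode: only records committed after `since_commit`.
        return [r for t, recs in self.commits if t > since_commit for r in recs]


table = ToyLakeTable()
table.write("20240101000000", ["a", "b"])
table.write("20240101000500", ["c"])

print(table.batch_read())                   # full snapshot: ['a', 'b', 'c']
print(table.stream_read("20240101000000"))  # incremental tail: ['c']
```

The point is that both reads hit the same commit history; batch and streaming consumers do not need separate copies of the data.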
COW (Copy On Write): Each update rewrites the affected file into a new data version. Writes incur write amplification, but queries are efficient because no merging is required. Similar to Java's CopyOnWriteArrayList.
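A minimal sketch of the copy-on-write trade-off, assuming a file group holds a key-value map (the class and field names here are hypothetical, not Hudi internals):

```python
class CowFileGroup:
    """Each update rewrites the whole base file as a new version (copy-on-write)."""

    def __init__(self, records):
        self.versions = [dict(records)]  # version 0: the initial base file

    def update(self, key, value):
        new_version = dict(self.versions[-1])  # copy the latest file slice...
        new_version[key] = value               # ...apply the change...
        self.versions.append(new_version)      # ...and commit it as a new version

    def read(self):
        return self.versions[-1]  # no merge needed at query time


fg = CowFileGroup({"k1": 1, "k2": 2})
fg.update("k1", 10)
print(fg.read())          # {'k1': 10, 'k2': 2}
print(len(fg.versions))   # 2 versions retained: write amplification, cheap reads
```

One small update produced a full second copy of the data, which is exactly why COW favors read-heavy workloads.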
MOR (Merge On Read): Updates/inserts are written to Avro log files instead of parquet. Data is grouped by FileGroup, consisting of base files (parquet) and log files. During reads, base and log files are merged in memory. This provides faster writes but slower queries.
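The inverse trade-off can be sketched the same way: writes append to a log, and the merge cost moves to read time. Again a toy model, not Hudi's actual file handling (base files are parquet and log files Avro in Hudi; here both are plain Python structures):

```python
class MorFileGroup:
    """Updates go to an append-only log; readers merge base + log in memory."""

    def __init__(self, base_records):
        self.base = dict(base_records)  # base file (parquet in Hudi)
        self.log = []                   # log file entries (Avro in Hudi)

    def upsert(self, key, value):
        self.log.append((key, value))  # fast append-only write, no rewrite

    def read(self):
        merged = dict(self.base)       # the merge happens at read time
        for key, value in self.log:
            merged[key] = value        # later log entries win
        return merged


mor = MorFileGroup({"k1": 1, "k2": 2})
mor.upsert("k1", 10)
mor.upsert("k3", 3)
print(mor.read())  # {'k1': 10, 'k2': 2, 'k3': 3}
```

Every read pays the merge cost until compaction folds the log back into the base file, which is why MOR favors write-heavy workloads.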
Small File Management: Hudi uses Clustering for COW tables (reorganizing multiple FileGroups into larger files) and Compaction for MOR tables (merging base parquet with log files into new base files).
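The planning step behind clustering is essentially bin packing: group many small files into targets near a desired output size. A hedged sketch of that idea (the greedy strategy and 120 MB target here are assumptions for illustration; Hudi's clustering strategies are configurable and more sophisticated):

```python
def plan_clustering(file_sizes_mb, target_mb=120):
    """Greedily group small files into clusters close to the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)      # close the current group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [5, 10, 20, 40, 60, 80]
print(plan_clustering(small_files))  # [[5, 10, 20, 40], [60], [80]]
```

Each inner list would be rewritten as one larger file, turning six small files into three read-friendly ones.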
2. Component Optimization
Version Upgrade: Upgraded from Hudi 0.12 to 0.14 to pick up newer features, bug fixes, and component capabilities.
Streaming Optimizations: Implemented rate limiting to control commits per ingestion, avoiding OOM issues. Separated clean operator from compaction/clustering to improve stability. Adjusted state.backend.fs.memory-threshold from 20KB to 1KB to reduce JM memory usage.
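The rate-limiting idea is to cap how much data a single commit ingests, so one checkpoint never buffers an unbounded batch. A minimal sketch of that splitting logic (the function name and the per-commit cap are hypothetical, not a Hudi or Flink API):

```python
def split_into_commits(records, max_records_per_commit):
    """Cap each commit's size so a single checkpoint cannot buffer an
    unbounded batch and OOM the writer."""
    commits = []
    for i in range(0, len(records), max_records_per_commit):
        commits.append(records[i:i + max_records_per_commit])
    return commits


backlog = list(range(10))
print(split_into_commits(backlog, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A backlog of ten records becomes three bounded commits instead of one oversized ingestion, which is the memory-safety property the rate limiter provides.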
Batch Optimizations: Optimized Bucket index BulkInsert by pre-sorting based on partition path and bucket ID, and closing idle write handles early, improving write performance by 30-40%. Fixed partition pruning issues in 0.14 that caused driver OOM. Added support for multiple OLAP engines (Presto, StarRocks) to query MOR tables.
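Why pre-sorting helps BulkInsert: once records arrive grouped by (partition path, bucket ID), each write handle receives all of its records contiguously and can be closed the moment the key changes, instead of many handles staying open at once. A toy demonstration under that assumption (record shape and function name are invented for illustration):

```python
def bulk_insert_sorted(records):
    """Sort by (partition_path, bucket_id) so each write handle gets its
    records contiguously and can be closed as soon as the key changes."""
    ordered = sorted(records, key=lambda r: (r["partition"], r["bucket"]))
    open_handle, handles_used, max_open = None, 0, 0
    for r in ordered:
        handle_key = (r["partition"], r["bucket"])
        if handle_key != open_handle:
            open_handle = handle_key   # the previous handle is closed here
            handles_used += 1
        max_open = max(max_open, 1)    # only one handle ever open at a time
    return ordered, handles_used, max_open


records = [
    {"partition": "2024-01-02", "bucket": 1, "v": "a"},
    {"partition": "2024-01-01", "bucket": 0, "v": "b"},
    {"partition": "2024-01-01", "bucket": 0, "v": "c"},
]
ordered, handles, peak = bulk_insert_sorted(records)
print(handles, peak)  # 2 distinct handles used, at most 1 open at once
```

Without the sort, records for different buckets interleave and every touched handle must stay open simultaneously; with it, peak open handles drops to one, which is where the 30-40% write improvement comes from.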
Small File Merging: Fixed TypedProperties serialization issue in clustering, achieving 30%+ performance improvement. Added batch execution support for compaction/clustering procedures. Optimized clean operations by batching partition requests to reduce TimelineServer memory pressure.
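The clean optimization amounts to chunking: instead of asking the TimelineServer for metadata on every partition in one request, partitions are requested in fixed-size batches so per-request memory stays bounded. A sketch of that batching (the function and batch size are illustrative, not Hudi's interface):

```python
def batched(partitions, batch_size):
    """Yield partitions in fixed-size batches instead of all at once,
    bounding TimelineServer memory per request."""
    for i in range(0, len(partitions), batch_size):
        yield partitions[i:i + batch_size]


parts = [f"dt=2024-01-{d:02d}" for d in range(1, 8)]
print([len(b) for b in batched(parts, 3)])  # [3, 3, 1]
```

Seven partitions become three small requests; memory pressure now scales with the batch size rather than the partition count.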
Lifecycle Management: Implemented directory-based data deletion for efficient table lifecycle management without requiring locks or task stops.
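The directory-based approach works because date-partitioned layouts make expiry a pure path computation: pick the partition directories older than the TTL, then drop each directory wholesale, with no record-level deletes and therefore no table locks or task stops. A minimal sketch, assuming `dt=YYYY-MM-DD` partition naming (the function name and naming convention are assumptions for illustration):

```python
from datetime import datetime, timedelta

def expired_partition_dirs(partition_dirs, ttl_days, today):
    """Pick date-named partition directories older than the TTL; each whole
    directory can then be dropped without locking or stopping writers."""
    cutoff = today - timedelta(days=ttl_days)
    expired = []
    for d in partition_dirs:
        dt = datetime.strptime(d.split("=")[1], "%Y-%m-%d")
        if dt < cutoff:
            expired.append(d)
    return expired


dirs = ["dt=2024-01-01", "dt=2024-03-01", "dt=2024-06-01"]
print(expired_partition_dirs(dirs, 30, datetime(2024, 6, 15)))
# ['dt=2024-01-01', 'dt=2024-03-01']
```

Deleting at directory granularity is what keeps lifecycle management cheap: the writer never has to coordinate with an in-flight ingestion over individual records.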
3. Results
Real-time: Supports quasi-real-time ingestion of 100 million records per minute; streaming read latency is at the minute level.
Offline: Supports trillion-record-level batch writes; query performance is comparable to, or 20%+ better than, Hive.
Small file governance: 95%+ of merge tasks complete within 10 minutes.
vivo Internet Technology