Hudi Data Lake Implementation and Optimization Practice at vivo
Vivo's big-data team deployed Apache Hudi to build a lakehouse that unifies streaming and batch workloads. It leverages COW and MOR storage modes, automates small-file clustering and compaction, and applies extensive version, streaming, batch, and lifecycle optimizations, delivering minute-level streaming latency, ingestion of 100 million records per minute, and query speeds up to 20%+ faster than Hive.
This article introduces how vivo's big data team implemented Apache Hudi to enable lakehouse acceleration for business departments, focusing on streaming-batch integration, real-time pipeline optimization, and wide-table stitching scenarios.
1. Hudi Basic Capabilities
Streaming-Batch Integration: Unlike Hive, Hudi data written via Spark/Flink can be continuously read using Spark/Flink engines in streaming mode. The same Hudi data source supports both batch and streaming reads.
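The "same source, two read modes" idea can be modeled in a few lines. This is an illustrative toy, not Hudi's API: the class name `ToyLakeTable` and its methods are invented here; in Hudi the equivalent is a snapshot query versus an incremental/streaming query keyed by commit time.

```python
class ToyLakeTable:
    """Toy table whose commits can be read as a full snapshot or as a stream."""

    def __init__(self):
        self.commits = []  # list of (commit_time, records)

    def write(self, commit_time, records):
        self.commits.append((commit_time, records))

    def batch_read(self):
        # Batch mode: full snapshot across all commits.
        return [r for _, recs in self.commits for r in recs]

    def stream_read(self, since_commit):
        # Streaming mode: only records committed after `since_commit`.
        return [r for t, recs in self.commits if t > since_commit for r in recs]


table = ToyLakeTable()
table.write("20240101000000", ["a", "b"])
table.write("20240101000500", ["c"])

print(table.batch_read())                   # full snapshot: ['a', 'b', 'c']
print(table.stream_read("20240101000000"))  # incremental tail: ['c']
```

The point is that both reads hit the same commit history; batch and streaming consumers do not need separate copies of the data.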
COW (Copy On Write): Each update rewrites the affected file into a new data version. Writes incur write amplification, but queries are efficient because no merging is required. Similar to Java's CopyOnWriteArrayList.
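A minimal sketch of the copy-on-write trade-off, assuming a file group holds a key-value map (the class and field names here are hypothetical, not Hudi internals):

```python
class CowFileGroup:
    """Each update rewrites the whole base file as a new version (copy-on-write)."""

    def __init__(self, records):
        self.versions = [dict(records)]  # version 0: the initial base file

    def update(self, key, value):
        new_version = dict(self.versions[-1])  # copy the latest file slice...
        new_version[key] = value               # ...apply the change...
        self.versions.append(new_version)      # ...and commit it as a new version

    def read(self):
        return self.versions[-1]  # no merge needed at query time


fg = CowFileGroup({"k1": 1, "k2": 2})
fg.update("k1", 10)
print(fg.read())          # {'k1': 10, 'k2': 2}
print(len(fg.versions))   # 2 versions retained: write amplification, cheap reads
```

One small update produced a full second copy of the data, which is exactly why COW favors read-heavy workloads.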
MOR (Merge On Read): Updates/inserts are written to Avro log files instead of parquet. Data is grouped by FileGroup, consisting of base files (parquet) and log files. During reads, base and log files are merged in memory. This provides faster writes but slower queries.
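The inverse trade-off can be sketched the same way: writes append to a log, and the merge cost moves to read time. Again a toy model, not Hudi's actual file handling (base files are parquet and log files Avro in Hudi; here both are plain Python structures):

```python
class MorFileGroup:
    """Updates go to an append-only log; readers merge base + log in memory."""

    def __init__(self, base_records):
        self.base = dict(base_records)  # base file (parquet in Hudi)
        self.log = []                   # log file entries (Avro in Hudi)

    def upsert(self, key, value):
        self.log.append((key, value))  # fast append-only write, no rewrite

    def read(self):
        merged = dict(self.base)       # the merge happens at read time
        for key, value in self.log:
            merged[key] = value        # later log entries win
        return merged


mor = MorFileGroup({"k1": 1, "k2": 2})
mor.upsert("k1", 10)
mor.upsert("k3", 3)
print(mor.read())  # {'k1': 10, 'k2': 2, 'k3': 3}
```

Every read pays the merge cost until compaction folds the log back into the base file, which is why MOR favors write-heavy workloads.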
Small File Management: Hudi uses Clustering for COW tables (reorganizing multiple FileGroups into larger files) and Compaction for MOR tables (merging base parquet with log files into new base files).
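The planning step behind clustering is essentially bin packing: group many small files into targets near a desired output size. A hedged sketch of that idea (the greedy strategy and 120 MB target here are assumptions for illustration; Hudi's clustering strategies are configurable and more sophisticated):

```python
def plan_clustering(file_sizes_mb, target_mb=120):
    """Greedily group small files into clusters close to the target size."""
    groups, current, current_size = [], [], 0
    for size in sorted(file_sizes_mb):
        if current and current_size + size > target_mb:
            groups.append(current)      # close the current group
            current, current_size = [], 0
        current.append(size)
        current_size += size
    if current:
        groups.append(current)
    return groups


small_files = [5, 10, 20, 40, 60, 80]
print(plan_clustering(small_files))  # [[5, 10, 20, 40], [60], [80]]
```

Each inner list would be rewritten as one larger file, turning six small files into three read-friendly ones.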
2. Component Optimization
Version Upgrade: Upgraded from Hudi 0.12 to 0.14 to pick up newer features, bug fixes, and component capabilities.
Streaming Optimizations: Implemented rate limiting to control commits per ingestion, avoiding OOM issues. Separated clean operator from compaction/clustering to improve stability. Adjusted state.backend.fs.memory-threshold from 20KB to 1KB to reduce JM memory usage.
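The rate-limiting idea is to cap how much data a single commit ingests, so one checkpoint never buffers an unbounded batch. A minimal sketch of that splitting logic (the function name and the per-commit cap are hypothetical, not a Hudi or Flink API):

```python
def split_into_commits(records, max_records_per_commit):
    """Cap each commit's size so a single checkpoint cannot buffer an
    unbounded batch and OOM the writer."""
    commits = []
    for i in range(0, len(records), max_records_per_commit):
        commits.append(records[i:i + max_records_per_commit])
    return commits


backlog = list(range(10))
print(split_into_commits(backlog, 4))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

A backlog of ten records becomes three bounded commits instead of one oversized ingestion, which is the memory-safety property the rate limiter provides.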
Batch Optimizations: Optimized Bucket index BulkInsert by pre-sorting based on partition path and bucket ID, and closing idle write handles early, improving write performance by 30-40%. Fixed partition pruning issues in 0.14 that caused driver OOM. Added support for multiple OLAP engines (Presto, StarRocks) to query MOR tables.
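Why pre-sorting helps BulkInsert: once records arrive grouped by (partition path, bucket ID), each write handle receives all of its records contiguously and can be closed the moment the key changes, instead of many handles staying open at once. A toy demonstration under that assumption (record shape and function name are invented for illustration):

```python
def bulk_insert_sorted(records):
    """Sort by (partition_path, bucket_id) so each write handle gets its
    records contiguously and can be closed as soon as the key changes."""
    ordered = sorted(records, key=lambda r: (r["partition"], r["bucket"]))
    open_handle, handles_used, max_open = None, 0, 0
    for r in ordered:
        handle_key = (r["partition"], r["bucket"])
        if handle_key != open_handle:
            open_handle = handle_key   # the previous handle is closed here
            handles_used += 1
        max_open = max(max_open, 1)    # only one handle ever open at a time
    return ordered, handles_used, max_open


records = [
    {"partition": "2024-01-02", "bucket": 1, "v": "a"},
    {"partition": "2024-01-01", "bucket": 0, "v": "b"},
    {"partition": "2024-01-01", "bucket": 0, "v": "c"},
]
ordered, handles, peak = bulk_insert_sorted(records)
print(handles, peak)  # 2 distinct handles used, at most 1 open at once
```

Without the sort, records for different buckets interleave and every touched handle must stay open simultaneously; with it, peak open handles drops to one, which is where the 30-40% write improvement comes from.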
Small File Merging: Fixed TypedProperties serialization issue in clustering, achieving 30%+ performance improvement. Added batch execution support for compaction/clustering procedures. Optimized clean operations by batching partition requests to reduce TimelineServer memory pressure.
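The clean optimization amounts to chunking: instead of asking the TimelineServer for metadata on every partition in one request, partitions are requested in fixed-size batches so per-request memory stays bounded. A sketch of that batching (the function and batch size are illustrative, not Hudi's interface):

```python
def batched(partitions, batch_size):
    """Yield partitions in fixed-size batches instead of all at once,
    bounding TimelineServer memory per request."""
    for i in range(0, len(partitions), batch_size):
        yield partitions[i:i + batch_size]


parts = [f"dt=2024-01-{d:02d}" for d in range(1, 8)]
print([len(b) for b in batched(parts, 3)])  # [3, 3, 1]
```

Seven partitions become three small requests; memory pressure now scales with the batch size rather than the partition count.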
Lifecycle Management: Implemented directory-based data deletion for efficient table lifecycle management without requiring locks or task stops.
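The directory-based approach works because date-partitioned layouts make expiry a pure path computation: pick the partition directories older than the TTL, then drop each directory wholesale, with no record-level deletes and therefore no table locks or task stops. A minimal sketch, assuming `dt=YYYY-MM-DD` partition naming (the function name and naming convention are assumptions for illustration):

```python
from datetime import datetime, timedelta

def expired_partition_dirs(partition_dirs, ttl_days, today):
    """Pick date-named partition directories older than the TTL; each whole
    directory can then be dropped without locking or stopping writers."""
    cutoff = today - timedelta(days=ttl_days)
    expired = []
    for d in partition_dirs:
        dt = datetime.strptime(d.split("=")[1], "%Y-%m-%d")
        if dt < cutoff:
            expired.append(d)
    return expired


dirs = ["dt=2024-01-01", "dt=2024-03-01", "dt=2024-06-01"]
print(expired_partition_dirs(dirs, 30, datetime(2024, 6, 15)))
# ['dt=2024-01-01', 'dt=2024-03-01']
```

Deleting at directory granularity is what keeps lifecycle management cheap: the writer never has to coordinate with an in-flight ingestion over individual records.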
3. Results
Real-time: Supports quasi-real-time ingestion of 100 million records per minute; streaming read latency is at the minute level.
Offline: Supports trillion-record-level batch writes; query performance is comparable to, or 20%+ better than, Hive.
Small file governance: 95%+ of merge tasks complete within 10 minutes.
vivo Internet Technology