
Highlights of the Apache Hudi Asia Technical Salon Hosted by Kuaishou – Practices and Innovations from Leading Companies

The Kuaishou‑hosted Apache Hudi Asia technical salon gathered over 230 attendees and featured seven experts from Kuaishou, Meituan, TikTok, Huawei, JD and others, who shared best practices, architecture designs, and performance optimizations for large‑scale data lake applications across AI, BI, and real‑time workloads.


On March 29, Kuaishou hosted the first Apache Hudi Asia technical salon at its Beijing headquarters, drawing more than 230 on‑site participants and over 16,000 online viewers. Seven technical experts from Kuaishou, Meituan, TikTok, Huawei, JD and other leading companies presented best practices and case studies of Apache Hudi in data‑lake scenarios.

In the opening video address, Vinoth Chandar, CEO & Founder of Onehouse and Apache Hudi PMC Chair, praised Chinese developers for contributing more than half of Hudi's code and core features. He announced the milestone 1.0 release, which includes storage‑format optimizations, an index‑system redesign, enhanced concurrency control, deep Flink integration, and breakthrough incremental processing capabilities. Upcoming 1.1 and 1.2 releases will focus on migrating Spark streaming to Flink, adding Presto and File‑Group‑Reader integrations, index pruning, multi‑language support, and extensions for unstructured and semi‑structured data.

Wang Jing, head of Kuaishou’s Data Platform Department, described the company’s “AI+Data” strategy, which supports 401 million daily active users averaging 126 minutes of usage per day. Kuaishou has built a 10 EB‑scale storage foundation and a million‑core compute cluster, heavily leveraging Spark, Flink, ClickHouse, Doris and Apache Hudi for AI and BI workloads.

Five session highlights:

1. Apache Hudi in Kuaishou AI & BI scenarios – Experts Yu Zhaojing and Zhong Liang explained how Kuaishou’s AI DataLake uses full‑link vectorization, real‑time subscription, and logical wide‑table column stitching. The lake now stores EB‑scale data, sustains TB/s throughput, and achieves sub‑30 s end‑to‑end latency, dramatically reducing costs for recommendation, advertising and search services.
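The logical wide‑table column stitching described above can be illustrated with a minimal sketch: independent upstreams each own a subset of columns and are merged by primary key, so no single writer has to produce the full wide row. All names here (`stitch_wide_rows`, `uid`, the upstream payloads) are hypothetical, not Kuaishou's actual API.

```python
def stitch_wide_rows(partial_batches):
    """Merge partial-column records into wide rows keyed by primary key.

    Each batch is a list of dicts containing the primary key 'uid';
    later batches overwrite overlapping columns (last-writer-wins).
    """
    wide = {}
    for batch in partial_batches:
        for record in batch:
            row = wide.setdefault(record["uid"], {})
            row.update(record)  # stitch this upstream's columns into the row
    return wide

# Example: an ads upstream and a search upstream each own a column subset.
ads = [{"uid": 1, "ad_clicks": 3}, {"uid": 2, "ad_clicks": 0}]
search = [{"uid": 1, "queries": 7}]
rows = stitch_wide_rows([ads, search])
```

The appeal of this shape is that each upstream writes only its own columns at its own cadence; the wide row materializes at read (or merge) time rather than in any single producer.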

2. Meituan’s incremental lake “Beluga” – Wang Mengmeng presented a “one‑table three‑mode” architecture that combines row‑based HFile for fast streaming writes with columnar Parquet for batch processing, managed by an independent MetaServer. Integration with Flink, Spark and Presto yields minute‑level data freshness and significant storage‑compute cost savings.
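The row‑log/columnar split at the heart of this design can be sketched in a few lines: streaming writes take a cheap row‑append path, while a background compaction folds buffered rows into a column‑oriented layout for batch scans. This is an illustrative toy assuming records share one schema, not Meituan's implementation; `MiniBeluga` and its methods are invented names.

```python
class MiniBeluga:
    def __init__(self):
        self.row_log = []   # fast append path for streaming writes (row-based)
        self.columns = {}   # columnar layout produced by compaction

    def write(self, record):
        self.row_log.append(record)

    def compact(self):
        """Fold buffered rows into column vectors, then clear the log."""
        for record in self.row_log:
            for col, value in record.items():
                self.columns.setdefault(col, []).append(value)
        self.row_log.clear()

    def scan_column(self, name):
        # batch readers see compacted columnar data plus any uncompacted rows
        tail = [r[name] for r in self.row_log if name in r]
        return self.columns.get(name, []) + tail

table = MiniBeluga()
table.write({"price": 10, "qty": 1})
table.write({"price": 12, "qty": 2})
table.compact()                       # rows folded into the columnar layout
table.write({"price": 9, "qty": 5})   # fresh streaming write, not yet compacted
```

The read path merging "compacted columns + uncompacted log tail" is what buys minute‑level freshness: readers never wait for compaction to see new data.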

3. TikTok’s Sample Center – Geng Xiaoyu and Yao Xiang described how TikTok replaced a traditional Kafka row store with a unified lake architecture, enabling real‑time sample‑label stitching via Flink, PB‑scale daily processing, and vectorized feature extraction that cuts CPU usage by 30 % and memory by 50 %.
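Real‑time sample‑label stitching differs from column stitching in that labels arrive later than features: whichever side shows up first must be buffered until its partner arrives, much as a Flink keyed‑state join would do. The sketch below is hypothetical (the event tuple shape and function name are invented), not ByteDance's code.

```python
def stitch_samples(events):
    """events: iterable of ("feature"|"label", sample_id, payload).

    Returns completed (sample_id, features, label) triples plus the
    still-unmatched buffer. Duplicate arrivals of the same kind for one
    sample id are last-writer-wins in this toy version.
    """
    pending = {}   # sample_id -> (kind, payload) waiting for its partner
    samples = []   # completed training samples
    for kind, sid, payload in events:
        if sid in pending and pending[sid][0] != kind:
            _, other = pending.pop(sid)
            feat = payload if kind == "feature" else other
            label = payload if kind == "label" else other
            samples.append((sid, feat, label))
        else:
            pending[sid] = (kind, payload)
    return samples, pending

events = [("feature", "s1", {"f": [0.1]}),
          ("label", "s2", 1),          # label arrives before its features
          ("label", "s1", 0)]
samples, pending = stitch_samples(events)
```

A production version would additionally evict `pending` entries on a timer (state TTL), since labels for abandoned sessions may never arrive.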

4. Huawei Cloud large‑scale Hudi practice – Meng Tao outlined challenges such as cluster pressure after switching from batch to real‑time jobs and poor upsert performance. Optimizations included Avro‑free RowData writes (1–10× speedup), log‑offset‑based streaming reads (2× faster), CDC‑driven dynamic schema evolution, and column‑family storage that boosted write throughput 3–5× and accelerated queries by two orders of magnitude. Huawei also built the LDMS lake‑warehouse management service for automated lifecycle management, layout optimization, and index recommendation.
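The intuition behind column‑family storage can be shown with a small sketch: columns are split into families stored separately, so an upsert touching one family rewrites only that family's data rather than the whole wide row. This is an illustrative model (class and counter names are invented), not Huawei's implementation.

```python
class ColumnFamilyStore:
    def __init__(self, families):
        # families: e.g. {"profile": ["name", "city"], "stats": ["clicks"]}
        self.families = families
        self.files = {fam: {} for fam in families}     # fam -> key -> row slice
        self.rewrites = {fam: 0 for fam in families}   # rewrite count per family

    def upsert(self, key, record):
        for fam, cols in self.families.items():
            slice_ = {c: record[c] for c in cols if c in record}
            if slice_:  # only families the upsert touches get rewritten
                self.files[fam].setdefault(key, {}).update(slice_)
                self.rewrites[fam] += 1

    def read(self, key):
        # a full read stitches the row back together across families
        row = {}
        for fam in self.families:
            row.update(self.files[fam].get(key, {}))
        return row

store = ColumnFamilyStore({"profile": ["name", "city"], "stats": ["clicks"]})
store.upsert("u1", {"name": "ann", "city": "bj", "clicks": 1})
store.upsert("u1", {"clicks": 2})   # hot-column update leaves "profile" untouched
```

Frequently updated ("hot") columns land in their own family, which is why write amplification drops sharply for upsert‑heavy workloads.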

5. JD’s data‑lake innovation – Zhang Yue introduced a multi‑model storage architecture with a buffered‑layer + persistent‑layer hierarchy, binary stream copy to bypass serialization, and a toolchain (DataBus, EasyStudio) covering the full data lifecycle. These techniques doubled write performance, cut class‑execution time by 93 %, and reduced compute by 95 %, delivering sub‑5‑minute data freshness for transaction scenarios.
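The "binary stream copy" idea can be contrasted with a conventional pipeline in a short sketch: instead of deserializing every record into objects and re‑serializing on the way out, the already‑encoded bytes are forwarded untouched. Function names below are illustrative, not JD's API.

```python
import json

def forward_deserialize(raw_records):
    # baseline path: decode -> object -> re-encode, paid for every record
    return [json.dumps(json.loads(r)).encode() for r in raw_records]

def forward_binary_copy(raw_records):
    # copy path: the encoded payload passes through as-is (no re-encode);
    # memoryview exposes the buffer without an intermediate object model
    return [bytes(memoryview(r)) for r in raw_records]

records = [b'{"order": 1}', b'{"order": 2}']
```

Skipping the object round trip is where the serialization‑bound CPU savings come from; the trade‑off is that the copy path cannot inspect or transform record contents.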

In the closing speech, Li Yuan, head of Kuaishou’s Data Engine Center, emphasized that data‑lake technology will face even greater business and technical challenges in 2025, and that only continued innovation can keep the industry ahead of the wave.

The event concluded with heartfelt thanks to the seven speakers and partners including Yunqi, DataFun, CSDN, InfoQ, the Open‑Source Society, musp, SegmentFault (思否) and 51CTO, highlighting Apache Hudi’s rapid rise from technical innovation to industry‑wide empowerment.

Tags: Big Data, AI, Batch Processing, Streaming, Data Lake, Apache Hudi
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
