
Building a Doris‑Based Lakehouse Integrated Analytics System at Kuaishou

This article presents Kuaishou's experience of designing and implementing a Doris‑driven lakehouse integrated analytics system, covering the current OLAP landscape, challenges of data duplication and governance, the new architecture with caching and auto‑materialization, implementation details, performance impact, and future work.


Kuaishou uses OLAP extensively across its business, processing billions of queries daily, but the traditional workflow of moving data from the data lake to analytical engines like ClickHouse or Doris incurs high storage costs, latency, and governance overhead.

The team identified key problems: expensive data ingestion, redundant storage, delayed data readiness, and high maintenance effort for ADS models, which often remain unused after dashboards are retired.

To address these issues, a lakehouse‑integrated analytics architecture was built on Doris, combining the strengths of data lakes and data warehouses. The design enables Doris to query lake data directly with performance comparable to native warehouses, while improving data delivery and governance.

The architecture consists of three core components: the Doris execution layer, a metadata and data caching subsystem (leveraging Alluxio), and an automatic materialization subsystem. Metadata such as table schemas and partition information is synchronized from Hive Metastore to a Meta Server, cached there, and periodically refreshed to the Doris Frontend.
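The metadata path described above can be sketched as a TTL-based cache that sits between Doris and Hive Metastore. The sketch below is illustrative only: `MetaCache`, `fetch_fn`, and the metadata shape are hypothetical names standing in for the Meta Server's actual interfaces, which the article does not detail.

```python
import time


class MetaCache:
    """Hypothetical sketch of a Meta Server cache: table schemas and
    partition lists fetched from Hive Metastore are kept locally and
    refreshed only after a TTL expires, so query planning does not
    hit the Metastore on every request."""

    def __init__(self, fetch_fn, ttl_seconds=300):
        self.fetch_fn = fetch_fn   # callable(table_name) -> metadata dict
        self.ttl = ttl_seconds
        self._cache = {}           # table_name -> (metadata, fetched_at)

    def get(self, table):
        entry = self._cache.get(table)
        now = time.time()
        if entry is None or now - entry[1] > self.ttl:
            # Cache miss or stale entry: one round trip to the Metastore.
            metadata = self.fetch_fn(table)
            self._cache[table] = (metadata, now)
            return metadata
        # Fresh entry: serve from cache, no Metastore access.
        return entry[0]
```

Within the TTL window, repeated lookups for the same table return the cached copy, which is the property that removes Hive Metastore from the per-query critical path.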

Cache mechanisms store schema, partition, and split information, reducing reliance on Hive Metastore and HDFS during query planning. Data pre-heating loads frequently accessed partitions into Alluxio ahead of time, so queries benefit from cache hits while the warming work itself adds no noticeable overhead to query response time.
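One simple way to choose pre-heating candidates, consistent with the consumption-driven approach described here, is to rank partitions by how often recent queries touched them. The function below is a minimal sketch under that assumption; the name `select_partitions_to_preheat` and the access-log format are invented for illustration.

```python
from collections import Counter


def select_partitions_to_preheat(access_log, top_n=3):
    """Pick the most frequently accessed partitions as candidates for
    warming into the Alluxio cache (hypothetical sketch).

    access_log: iterable of (table, partition) tuples, e.g. taken from
    recent query history.
    """
    counts = Counter(access_log)
    # most_common returns ((table, partition), count) pairs, hottest first.
    return [partition for partition, _ in counts.most_common(top_n)]
```

A real system would also weigh partition size and recency, but frequency ranking alone already captures the "hot partitions stay cached" behavior the article describes.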

Automatic materialization (auto‑materialization) drives production of ADS models based on consumption patterns. Two discovery methods are supported: rule‑based expert specifications and automatic identification from historical query logs. Materialized views are stored externally (e.g., Hive/Hudi) and registered with Doris via a KwaiMTMV wrapper, allowing transparent query rewrite and efficient execution.
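The query-log discovery method can be approximated by normalizing queries into templates (stripping literals such as dates and limits) and materializing templates that recur often. This is a rough sketch of that idea, not Kuaishou's actual implementation; the normalization rules and threshold are assumptions.

```python
import re
from collections import Counter


def normalize(sql):
    """Collapse queries that differ only in literal constants into one
    template (very rough sketch: real SQL normalization would use a parser)."""
    sql = re.sub(r"'[^']*'", "?", sql)   # string literals -> ?
    sql = re.sub(r"\b\d+\b", "?", sql)   # numeric literals -> ?
    return re.sub(r"\s+", " ", sql).strip().lower()


def mv_candidates(query_log, min_count=2):
    """Return query templates frequent enough to justify a materialized
    view, based on historical query logs (hypothetical threshold)."""
    templates = Counter(normalize(q) for q in query_log)
    return {tmpl for tmpl, count in templates.items() if count >= min_count}
```

Queries that scan the same table with different date filters collapse into a single template, which is exactly the recurring-workload signal that makes a materialized view worthwhile.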

Production of materialized views is orchestrated by a service that generates tasks for the company's offline scheduler, handling incremental updates, lineage‑driven re‑materialization, and large‑scale processing via Spark or Hudi when necessary.
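Lineage-driven re-materialization boils down to: when an upstream table's data changes, schedule a refresh for every view that depends on it. The sketch below assumes a simple view-to-upstreams lineage map and an invented `plan_refresh_tasks` helper; the real orchestration service emits tasks to the company's offline scheduler, whose API is not described in the article.

```python
def plan_refresh_tasks(lineage, updated_tables):
    """Given a view -> upstream-tables lineage map, emit a refresh task
    for every materialized view whose upstream data changed
    (hypothetical sketch of lineage-driven re-materialization).
    """
    updated = set(updated_tables)
    tasks = []
    for view, upstreams in lineage.items():
        changed = set(upstreams) & updated
        if changed:
            # A real task would carry partition ranges and an engine hint
            # (e.g. Spark or Hudi for large incremental rebuilds).
            tasks.append({"view": view, "reason": sorted(changed)})
    return tasks
```

Each generated task records which upstream change triggered it, which is useful both for incremental updates (refresh only affected partitions) and for auditing why a view was rebuilt.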

Performance tests show that the caching layer adds only tens of milliseconds of latency, keeping most queries in the sub-second range. For large, low-frequency queries the system falls back to Spark.

Future work includes implementing history‑based view discovery, enhancing index support for Parquet files, and expanding materialization to more complex metrics.

Tags: Big Data, Caching, Data Warehouse, OLAP, Lakehouse, Doris, Auto Materialization
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
