
How WeChat Implements a StarRocks‑Powered Lakehouse Across Multiple Business Scenarios

WeChat evolved its data platform from Hadoop to ClickHouse and finally to a StarRocks‑based lakehouse, solving data fragmentation and storage redundancy while achieving sub‑second to minute‑level query latency, cutting storage costs by over 65%, halving operational tasks, and reducing offline job time by two hours across several business lines.

Past Memory Big Data

Background: WeChat’s data analysis architecture originally relied on Hadoop, which suffered from slow queries, high data latency, and a cumbersome split between batch and stream processing. To meet the growing demand for personalized experiences in services such as Video Channels, the team built a sub‑second real‑time OLAP warehouse on ClickHouse, achieving massive data ingestion with low latency and exactly‑once ingestion semantics.

Unified Requirement: Although the ClickHouse warehouse solved the "real‑time" and "ultra‑fast" challenges, the architecture still lacked a unified experience, leaving data access fragmented and storage redundant.

Lakehouse Integration with StarRocks: The lakehouse solution is now deployed in multiple WeChat scenarios (Video Channels live streaming, WeChat Keyboard, WeChat Reading, Official Accounts) and runs on a cluster of hundreds of machines with data ingestion approaching a trillion rows. It aims for unified storage, unified interfaces, and unified metadata, delivering a SQL‑first experience in which users no longer need to know the underlying architecture.

Technical Route 1 – Lake‑on‑Warehouse

In this approach, data‑lake technologies (Delta Lake, Hudi, Iceberg, Hive 3.0) and SQL‑on‑Hadoop engines (Presto, Impala) are introduced, along with Hive Metastore for metadata and object storage for persistence. Within WeChat, the stack evolved from Presto + Hive to StarRocks + Iceberg, improving data freshness from hour/day to minute level and query speed from minutes to seconds. Approximately 80 % of large queries are answered by StarRocks within seconds; the remaining very large queries are handled by Spark.

Seconds‑level response for medium‑size tables using StarRocks.

Minute‑level response for very large tables using Spark.

Advantages: low cost, simple implementation, and strong Hadoop compatibility. Drawbacks: higher data latency (5‑10 minutes) and slower queries on ODS/DWD tables, which need local caching for acceleration.
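The split between StarRocks for interactive queries and Spark for very large ones can be pictured as a small routing step. The sketch below is only an illustration: the article does not say how WeChat decides which engine runs a query, so the `QueryPlan` type, the `estimated_scan_bytes` field, and the 1 TiB cutoff are all assumptions.

```python
from dataclasses import dataclass

# Hypothetical cutoff: the article gives no number for "very large".
LARGE_SCAN_BYTES = 1024**4  # 1 TiB, an assumed threshold


@dataclass
class QueryPlan:
    sql: str
    estimated_scan_bytes: int  # assumed to come from table statistics


def choose_engine(plan: QueryPlan, threshold: int = LARGE_SCAN_BYTES) -> str:
    """Return the engine that should run a query over the lake tables."""
    if plan.estimated_scan_bytes < threshold:
        return "starrocks"  # interactive path, seconds-level target
    return "spark"  # batch path, minute-level target


if __name__ == "__main__":
    medium = QueryPlan("SELECT ... FROM medium_table", 20 * 1024**3)
    huge = QueryPlan("SELECT ... FROM very_large_table", 5 * 1024**4)
    print(choose_engine(medium), choose_engine(huge))
```

A size-based rule is the simplest possible policy; a real router might also consider concurrency, cache state, or query shape.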

Technical Route 2 – Warehouse‑Lake Fusion

This route adds cross‑source federation to the warehouse, allowing data to be ingested directly into the warehouse and then cold‑stored into the lake. A Meta Server provides unified metadata, and the SparkLake API supports generic offline computation. Real‑time latency improves to between a few seconds and 2 minutes, and DWD queries become faster, but the solution costs more and is less compatible with Hadoop.

Deployed in WeChat Security, the fusion architecture handles daily tables of tens of billions of rows, with cold‑storage latency of minutes for hourly partitions and hours for daily partitions. A single task’s memory consumption stays around 5 GB, showing minimal impact on the cluster.
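One way to read the cold‑storage step is as a periodic job that finds warehouse partitions that are closed and not yet copied to the lake. The following is a minimal sketch under that assumption; the `Partition` type, the `grace` delay, and the `already_sunk` bookkeeping are hypothetical and not taken from the article.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Partition:
    table: str
    start: datetime
    granularity: str  # "hour" or "day"

    @property
    def end(self) -> datetime:
        step = timedelta(hours=1) if self.granularity == "hour" else timedelta(days=1)
        return self.start + step


def partitions_to_sink(partitions, now, grace=timedelta(minutes=10),
                       already_sunk=frozenset()):
    """Pick closed partitions that have not yet been copied to the lake.

    `grace` is an assumed delay that lets late rows arrive before the copy.
    """
    ready = [p for p in partitions
             if p.end + grace <= now and p not in already_sunk]
    return sorted(ready, key=lambda p: p.start)
```

Under this reading, hourly partitions become eligible shortly after the hour closes and daily partitions only after the day closes.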

Combined Approach

WeChat adopts a hybrid of lake‑on‑warehouse and warehouse‑lake fusion, allowing users to choose the ingestion mode based on cost, performance, and timeliness requirements. Users can start with the lower‑cost lake‑on‑warehouse mode and switch to the warehouse‑centric mode when higher performance is needed, supporting both ultra‑fast BI analysis and general offline compute.
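The choice between the two modes can be sketched as a small decision function. This is an illustrative simplification, not WeChat's actual policy: the function name and parameters are hypothetical, and the only figure taken from the article is the 5 to 10 minute latency quoted for the lake‑on‑warehouse route.

```python
from enum import Enum


class IngestMode(Enum):
    LAKE_ON_WAREHOUSE = "lake-on-warehouse"          # cheaper, 5-10 min latency
    WAREHOUSE_LAKE_FUSION = "warehouse-lake fusion"  # costlier, seconds to 2 min


# Worst case of the 5-10 minute range quoted for the lake-on-warehouse route.
LAKE_WORST_CASE_FRESHNESS_S = 10 * 60


def choose_mode(required_freshness_s: int,
                needs_fast_detail_queries: bool = False) -> IngestMode:
    """Start from the cheaper mode and switch only when it cannot meet the need."""
    if needs_fast_detail_queries:
        return IngestMode.WAREHOUSE_LAKE_FUSION
    if required_freshness_s < LAKE_WORST_CASE_FRESHNESS_S:
        return IngestMode.WAREHOUSE_LAKE_FUSION
    return IngestMode.LAKE_ON_WAREHOUSE
```

For example, a dashboard that tolerates hour‑old data stays on the cheaper mode, while a risk‑control table that needs data within 30 seconds would be ingested through the warehouse.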

Real‑Time Incremental Materialized Views

StarRocks previously offered asynchronous and synchronous materialized views (MV). Asynchronous MVs refresh periodically or manually, requiring full INSERT OVERWRITE of tables or partitions, which is costly for large tables and unsuitable for real‑time scenarios. Synchronous MVs bind the MV result as an index to the base table, invisible to users, and provide transparent acceleration during base‑table queries, but they have several limitations: no complex expressions, no column aliases, no multiple references to the same base column, limited aggregation functions, and tight coupling with the base table.

To meet real‑time, high‑performance needs, WeChat designed incremental MVs with the following characteristics:

Large‑scale tables: only incremental updates, no full refresh.

Strict real‑time requirement: synchronous refresh only.

Multi‑table metric stitching: combine metrics from multiple base tables into a single MV target table.

High‑performance dimension‑table joins during MV writes.
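The first, second, and fourth characteristics can be illustrated with a toy in‑memory model: each write batch updates only the affected aggregate keys, in the same step as the write, and resolves a dimension attribute through a lookup. This is a conceptual sketch, not StarRocks code; the class, the `room_id`/`category` columns, and the sample data are made up for illustration.

```python
from collections import defaultdict


class IncrementalSumView:
    """Toy model of an incrementally maintained SUM ... GROUP BY view."""

    def __init__(self, dimension_table):
        # Hypothetical dimension table: room_id -> category.
        self.dimension_table = dimension_table
        self.totals = defaultdict(int)

    def on_write(self, rows):
        """Apply one batch synchronously; no full recomputation is needed."""
        for room_id, views in rows:
            # Dimension lookup happens at write time, not at query time.
            category = self.dimension_table.get(room_id, "unknown")
            self.totals[category] += views


view = IncrementalSumView({1: "music", 2: "sports"})
view.on_write([(1, 10), (2, 5)])
view.on_write([(1, 3), (9, 1)])
print(dict(view.totals))  # {'music': 13, 'sports': 5, 'unknown': 1}
```

The cost of each update is proportional to the batch size rather than the table size, which is the property that a full INSERT OVERWRITE refresh lacks.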

The new MV roadmap includes:

Multi‑stream synchronous MV → Global dictionary association → Streaming MV (in development) → Lake‑on‑incremental synchronous MV.

Implementation decouples the base ODS table (retention 3‑7 days) from the MV result DWS table (retention 6‑12 months), allowing independent storage policies and simplifying maintenance. It also enables multiple base tables to write into the same MV target, achieving metric stitching.
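Metric stitching and decoupled retention can be modeled with a target table keyed by day and dimension, to which several base streams each contribute their own metric column. The sketch below is a simplified illustration; the class, metric names, and retention values passed to `expire` are assumptions chosen to mirror the ranges mentioned above.

```python
from datetime import date, timedelta


class StitchedTarget:
    """Toy model of one MV target table fed by several base tables."""

    def __init__(self):
        self.rows = {}  # (day, key) -> {metric_name: value}

    def apply(self, day, key, metric, delta):
        """Add a delta from any base stream into its own metric column."""
        row = self.rows.setdefault((day, key), {})
        row[metric] = row.get(metric, 0) + delta

    def expire(self, today, retention_days):
        """Drop rows older than the target's own retention policy."""
        cutoff = today - timedelta(days=retention_days)
        self.rows = {k: v for k, v in self.rows.items() if k[0] >= cutoff}


target = StitchedTarget()
day = date(2024, 1, 1)
target.apply(day, "room-1", "views", 10)        # from a hypothetical view-event table
target.apply(day, "room-1", "gift_amount", 4)   # from a hypothetical gift-event table
target.apply(day, "room-1", "views", 2)
target.expire(today=date(2024, 1, 20), retention_days=365)
print(target.rows[(day, "room-1")])  # {'views': 12, 'gift_amount': 4}
```

Because the target keeps its own rows, the base tables could be expired after a few days without affecting the aggregated history.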

Future Work

Multi‑stream synchronous MV is already in production, but it does not yet support JOINs or generic aggregation functions. The planned streaming MV aims to remove these restrictions and, combined with external‑table MV, to provide a complete lake‑on‑warehouse solution.

Summary and Benefits

The StarRocks‑based lakehouse is now live in several WeChat business lines, with clusters of hundreds of machines and data ingestion near a trillion rows. In a live‑streaming scenario, the lake‑on‑warehouse redesign halved the number of operational tasks for data developers, reduced storage costs by over 65 %, and shortened offline job production time by two hours.

Long‑term goals include a fully SQL‑centric experience where users are unaware of the underlying architecture, unified access/query experience, consistent sub‑second/minute latency across workloads, and standardized SQL interaction.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact admin@besthub.dev and we will review it promptly.
