
Xiaohongshu’s OLAP Architecture Evolution and DorisDB Adoption

This article traces the evolution of Xiaohongshu’s OLAP infrastructure through four stages, from Redshift to Presto, then ClickHouse, and finally DorisDB. It covers the data pipeline, a comparison of OLAP engines, the advertising use case, and the resulting performance and operational benefits.

DataFunTalk

Xiaohongshu, a lifestyle‑sharing platform, grew explosively after 2017, which created demand for robust data analysis and OLAP capabilities. To meet that demand, the company introduced DorisDB, a high‑performance MPP database, and used it to build a unified data service platform that simplifies data pipelines and accelerates high‑concurrency queries.

The OLAP evolution falls into four phases. Phase 1, before 2017, ran on AWS Redshift but ran into scaling limits, contention between ETL and query workloads, and storage bottlenecks. Phase 2 moved ETL to Hadoop/Hive and adopted Presto for flexible query access. Phase 3 added ClickHouse to meet real‑time performance demands. Phase 4 built a real‑time warehouse and data service platform and introduced DorisDB to satisfy low‑latency, high‑concurrency requirements.

The current data analysis architecture is layered. Data collection uses Flume, Kafka, and Canal. Storage spans S3, HDFS, TiDB, HBase, ClickHouse, and DorisDB. Processing is split between batch (Hive/Spark) and streaming (Flink). A data‑sharing layer offers unified access to TiDB, HBase, ClickHouse, and DorisDB, and an application layer serves reporting, ad‑hoc analysis, and complex SQL workloads.

A comparative review of OLAP engines highlights the trade‑offs. ClickHouse offers strong single‑table performance but has limited delete/update support and limited concurrency. DorisDB offers strong multi‑table performance, high concurrency, and MySQL compatibility, but its ecosystem is still immature. TiDB/TiFlash supports OLTP workloads and updates, but its OLAP performance is weaker.

In the advertising data center, the original solution relied on numerous Flink jobs writing to MySQL, Redis, HDFS, and ClickHouse, leading to fragmented logic, limited scalability, and low availability. The new DorisDB‑based solution consolidates business logic within DorisDB, meeting requirements for high‑throughput writes, sub‑100 ms query latency, high QPS, real‑time binlog sync, and multi‑table joins.

DorisDB’s data model supports detail, aggregate, and update tables; partitioning by time and hash‑bucketed keys (e.g., advertiser ID) improves query pruning and concurrency; materialized views accelerate common aggregations across advertiser, user, and creative dimensions.
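The effect of time partitioning plus hash bucketing on query pruning can be illustrated with a small simulation. This is a minimal sketch, not DorisDB's implementation: the daily partition naming, the bucket count of 8, and the use of CRC32 as the hash are all assumptions made for illustration.

```python
import zlib
from datetime import date, timedelta
from typing import List, Optional, Tuple

NUM_BUCKETS = 8  # hypothetical bucket count, not taken from the article


def partition_for(day: date) -> str:
    """Name of the daily partition holding `day` (naming scheme is assumed)."""
    return f"p{day:%Y%m%d}"


def bucket_for(advertiser_id: int, num_buckets: int = NUM_BUCKETS) -> int:
    """Stable bucket for a key; CRC32 stands in for the engine's real hash."""
    return zlib.crc32(str(advertiser_id).encode()) % num_buckets


def tablets_to_scan(
    start: date, end: date, advertiser_id: Optional[int] = None
) -> List[Tuple[str, int]]:
    """Return the (partition, bucket) pairs a query has to read.

    The date range prunes partitions; an equality filter on the
    bucketing key narrows each partition to a single bucket.
    """
    days = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    if advertiser_id is None:
        buckets = list(range(NUM_BUCKETS))  # no key filter: read every bucket
    else:
        buckets = [bucket_for(advertiser_id)]
    return [(partition_for(d), b) for d in days for b in buckets]


if __name__ == "__main__":
    full = tablets_to_scan(date(2024, 1, 1), date(2024, 1, 3))
    pruned = tablets_to_scan(date(2024, 1, 1), date(2024, 1, 3), advertiser_id=42)
    print(len(full), "data shards without a key filter")
    print(len(pruned), "data shards with advertiser_id = 42")
```

In this toy model a three‑day query for one advertiser touches 3 shards instead of 24. Because different advertisers hash to different buckets, concurrent queries also tend to spread across shards, which is the intuition behind the concurrency benefit described above.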

Data ingestion is performed via Flink DorisDB connectors for real‑time ETL, routine load tasks for micro‑batch streams, and broker‑load jobs for offline warehouse imports.
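The mapping from workload type to load path can be written down as a small lookup. This is only a sketch of the routing described above: the workload labels (`realtime_etl`, `micro_batch_stream`, `offline_warehouse`) and the function name are hypothetical, and the code does not call any DorisDB API.

```python
from enum import Enum


class LoadPath(Enum):
    FLINK_CONNECTOR = "Flink DorisDB connector"
    ROUTINE_LOAD = "routine load"
    BROKER_LOAD = "broker load"


# Hypothetical workload labels mapped to the three load paths named in the text.
LOAD_PATH_BY_WORKLOAD = {
    "realtime_etl": LoadPath.FLINK_CONNECTOR,
    "micro_batch_stream": LoadPath.ROUTINE_LOAD,
    "offline_warehouse": LoadPath.BROKER_LOAD,
}


def choose_load_path(workload: str) -> LoadPath:
    """Pick the ingestion path for a workload label, or fail loudly."""
    try:
        return LOAD_PATH_BY_WORKLOAD[workload]
    except KeyError:
        raise ValueError(f"unknown workload: {workload!r}") from None


if __name__ == "__main__":
    for name in LOAD_PATH_BY_WORKLOAD:
        print(name, "->", choose_load_path(name).value)
```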

Performance testing shows each Frontend node handling ~2,000 QPS, with the cluster delivering tens of thousands of QPS and 99th‑percentile query latency under 100 ms, thanks to DorisDB’s MPP architecture and range‑hash sharding.
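The relationship between per‑node and cluster throughput is simple arithmetic, sketched below together with a nearest‑rank p99 calculation. Only the ~2,000 QPS per Frontend figure comes from the text; the `headroom` parameter, the function names, and the sample latencies are assumptions for illustration.

```python
import math
from typing import Sequence


def frontends_needed(
    target_qps: int, per_fe_qps: int = 2000, headroom: float = 0.0
) -> int:
    """Frontend nodes required for `target_qps`.

    `per_fe_qps` defaults to the ~2,000 QPS figure quoted in the text.
    `headroom` is an assumed safety margin (0.3 reserves 30% of each node).
    """
    usable = per_fe_qps * (1 - headroom)
    return math.ceil(target_qps / usable)


def p99(latencies_ms: Sequence[float]) -> float:
    """99th‑percentile latency using the nearest‑rank method."""
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # integer ceil(0.99 * n)
    return ordered[rank - 1]


if __name__ == "__main__":
    print(frontends_needed(20_000))                # nodes at full utilisation
    print(frontends_needed(20_000, headroom=0.3))  # nodes with 30% reserve
    sample = list(range(1, 101))                   # synthetic latencies in ms
    print(p99(sample) < 100)                       # meets a sub-100 ms target?
```

Under these assumptions, 20,000 QPS needs 10 Frontends at full utilisation and 15 with a 30% reserve, which is consistent with a cluster of modest size reaching the tens of thousands of QPS reported.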

Operationally, DorisDB provides multi‑replica FE/BE nodes for high availability and online elastic scaling without downtime, meeting the critical availability needs of the advertising service.

Since early 2023, Xiaohongshu has deployed five DorisDB clusters, two of which are in production, achieving unified data services, simplified real‑time pipelines, and sustained high‑concurrency, low‑latency query performance across business scenarios.

Tags: big data, Flink, ClickHouse, data warehouse, TiDB, OLAP, DorisDB
Written by

DataFunTalk