
Apache Doris at Xiaomi: Architecture Evolution, Performance Optimizations, and Production Practices

This article describes Xiaomi's three years of running Apache Doris across dozens of internal services. It covers the move from a SparkSQL-based Lambda architecture to a unified MPP database, along with performance benchmarks, data ingestion pipelines, Compaction tuning, two-phase commit, single-replica writes, monitoring, and community contributions.

Big Data Technology Architecture

In 2019 Xiaomi introduced Apache Doris to meet growing demand for user-behavior analysis. Three years later it powers dozens of internal services, including advertising, new retail, growth analytics, dashboards, and user profiling, forming a Doris-centric data ecosystem.

The original growth-analytics platform was built on a Lambda architecture using SparkSQL, Kudu, and HDFS. It suffered from high operational cost, high query latency, and resource contention.

To solve these issues, Xiaomi replaced the legacy stack with a Doris-based architecture. Front-end event logs are ingested either in real time, through the Flink Doris Connector using Stream Load, or in batches, from Hive through Broker Load (Hive → Broker → Doris). Users then query Doris directly through Xiaomi's internal Shujing platform.
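On the real-time path, each load is ultimately a Stream Load HTTP request, which the Flink Doris Connector issues on the application's behalf. The minimal Python sketch below only builds such a request with the standard library. The host, port, database, table, label, and credentials are placeholders, and JSON is an assumed payload format. Nothing is sent: a real client must also follow the FE's redirect to a BE, which curl does with `--location-trusted`.

```python
import base64
import json
import urllib.request


def build_stream_load_request(fe_host: str, http_port: int, db: str, table: str,
                              rows: list, label: str, user: str,
                              password: str) -> urllib.request.Request:
    """Build, but do not send, a Stream Load request carrying JSON rows.

    All arguments are placeholders for illustration.
    """
    url = f"http://{fe_host}:{http_port}/api/{db}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    headers = {
        "Authorization": f"Basic {token}",
        # A label names the batch; Doris will not load the same label twice.
        "label": label,
        "format": "json",
        # The body is one JSON array, so ask Doris to strip the outer brackets.
        "strip_outer_array": "true",
    }
    body = json.dumps(rows).encode("utf-8")
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


req = build_stream_load_request("fe-host", 8030, "demo_db", "events",
                                [{"uid": 1, "event": "click"}],
                                "events_20220101_0001", "user", "passwd")
print(req.get_method(), req.full_url)
```

Because a successfully loaded label cannot be loaded again, deriving the label from the batch rather than from the attempt lets a retry be rejected instead of duplicating data.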

Doris was chosen for its query performance, standard SQL support, simple operations, and active community. In benchmarks on a workload of 1 billion rows per day, it cut query latency by up to 85% for event analysis and by 50% for retention and funnel analysis compared with the previous stack.

Production practices include guiding users to choose appropriate partitioning and bucketing, reducing import frequency, and avoiding excessive delete operations, since every import or delete creates a new data version that Compaction must later merge. Compaction parameters are also configured per business cluster to balance query performance against resource usage.
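Reducing import frequency usually comes down to client-side batching: fewer, larger loads leave Compaction fewer versions to merge. The minimal Python sketch below assumes a hypothetical `ImportBatcher` class and a `flush_fn` callback that would issue one load per batch; neither is part of Doris, and the thresholds are illustrative rather than recommendations.

```python
import time
from typing import Callable, List


class ImportBatcher:
    """Hypothetical helper: buffer rows and hand them to `flush_fn` in large batches."""

    def __init__(self, flush_fn: Callable[[List[dict]], None],
                 max_rows: int = 100_000, max_interval_s: float = 10.0,
                 clock: Callable[[], float] = time.monotonic):
        self.flush_fn = flush_fn          # would issue one load per batch
        self.max_rows = max_rows
        self.max_interval_s = max_interval_s
        self.clock = clock                # injectable so the logic is testable
        self.buffer: List[dict] = []
        self.last_flush = clock()

    def add(self, row: dict) -> None:
        self.buffer.append(row)
        # Flush on whichever threshold is hit first: batch size or elapsed time.
        if (len(self.buffer) >= self.max_rows
                or self.clock() - self.last_flush >= self.max_interval_s):
            self.flush()

    def flush(self) -> None:
        if self.buffer:
            self.flush_fn(self.buffer)
            self.buffer = []              # start a fresh list; the flushed one is kept intact
        self.last_flush = self.clock()
```

A time threshold alongside the size threshold bounds how stale data can get when traffic is low, at the cost of smaller batches during quiet periods.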

Compaction was enhanced with QuickCompaction, priority scheduling between Base and Cumulative Compaction, and a producer-consumer model that limits how many files are merged concurrently, which prevents OOM errors and improves efficiency.
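The producer-consumer idea can be illustrated with a small Python simulation. Doris implements Compaction in C++ inside the BE, so this is not its code: the task tuples, the rule that Cumulative Compaction runs before Base Compaction, the file limit, and the `time.sleep` standing in for the merge are all assumptions. What it shows is a shared file budget that caps how many input files all running merges hold at once, regardless of how many worker threads exist.

```python
import itertools
import queue
import threading
import time

# Assumed policy for illustration: Cumulative Compaction is scheduled before Base.
PRIORITY = {"cumulative": 0, "base": 1}
STOP = float("inf")  # sentinel priority that sorts after every real task


class FileBudget:
    """Caps the number of input files held by all running merges at once."""

    def __init__(self, max_files: int):
        self.max_files = max_files
        self.in_use = 0
        self.peak = 0
        self.cond = threading.Condition()

    def acquire(self, n: int) -> int:
        n = min(n, self.max_files)  # an oversized task would otherwise wait forever
        with self.cond:
            while self.in_use + n > self.max_files:
                self.cond.wait()
            self.in_use += n
            self.peak = max(self.peak, self.in_use)
        return n

    def release(self, n: int) -> None:
        with self.cond:
            self.in_use -= n
            self.cond.notify_all()


def run_compactions(tasks, max_files=8, workers=4, merge_s=0.01):
    """tasks: iterable of (tablet_id, kind, num_files). Returns (finished ids, peak files)."""
    q = queue.PriorityQueue()
    budget = FileBudget(max_files)
    seq = itertools.count()  # unique tie-breaker so tuples never compare payloads
    finished, lock = [], threading.Lock()

    def consumer():
        while True:
            _, _, task = q.get()
            if task is None:  # stop sentinel
                return
            tablet_id, _kind, num_files = task
            held = budget.acquire(num_files)
            try:
                time.sleep(merge_s)  # stand-in for reading and merging rowset files
                with lock:
                    finished.append(tablet_id)
            finally:
                budget.release(held)

    threads = [threading.Thread(target=consumer) for _ in range(workers)]
    for t in threads:
        t.start()
    # Producer: enqueue tasks by priority, then one stop sentinel per consumer.
    for tablet_id, kind, num_files in tasks:
        q.put((PRIORITY[kind], next(seq), (tablet_id, kind, num_files)))
    for _ in threads:
        q.put((STOP, next(seq), None))
    for t in threads:
        t.join()
    return finished, budget.peak
```

Bounding files rather than threads is the design choice that matters here: memory use during a merge grows with the number of open input files, so a thread cap alone would not prevent OOM when a few tasks each hold many files.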

Two-phase commit (2PC) was added to Stream Load, enabling exactly-once semantics for Flink and reliable batch imports for Spark, so a failed job no longer leaves partially loaded data visible.
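The visibility rules behind 2PC can be shown with an in-memory Python simulation. `FakeDoris` and `TwoPhaseSink` are hypothetical classes, not Doris or Flink APIs, and recovery is simplified to the case where the job fails before its checkpoint completes. In real Doris, the first phase is a Stream Load sent with the header `two_phase_commit: true`, and a follow-up 2PC request carrying `txn_id` and `txn_operation` (`commit` or `abort`) finishes the transaction.

```python
import itertools


class FakeDoris:
    """Hypothetical in-memory table whose loads use two-phase commit."""

    def __init__(self):
        self.visible = []            # rows that queries can see
        self.pending = {}            # txn_id -> rows written but not yet visible
        self._ids = itertools.count(1)

    def precommit(self, rows):
        txn_id = next(self._ids)
        self.pending[txn_id] = list(rows)
        return txn_id

    def commit(self, txn_id):
        rows = self.pending.pop(txn_id, None)
        if rows is not None:         # committing an already-finished txn is a no-op
            self.visible.extend(rows)

    def abort(self, txn_id):
        self.pending.pop(txn_id, None)


class TwoPhaseSink:
    """Hypothetical Flink-style sink: pre-commit at a checkpoint, commit when it completes."""

    def __init__(self, db):
        self.db = db
        self.buffer = []
        self.precommitted = []       # txn ids waiting for checkpoint completion

    def write(self, row):
        self.buffer.append(row)

    def snapshot_state(self):
        # Phase 1: data reaches the table but stays invisible.
        if self.buffer:
            self.precommitted.append(self.db.precommit(self.buffer))
            self.buffer = []

    def notify_checkpoint_complete(self):
        # Phase 2: the checkpoint succeeded, so make the data visible.
        for txn_id in self.precommitted:
            self.db.commit(txn_id)
        self.precommitted = []

    def recover_after_failure(self):
        # Simplified: the checkpoint never completed, so roll back and let the
        # job replay from its last completed checkpoint.
        for txn_id in self.precommitted:
            self.db.abort(txn_id)
        self.precommitted = []
        self.buffer = []
```

Replaying the same rows after an aborted transaction and then committing leaves each row visible exactly once, which is the property the Flink integration relies on.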

Single-replica writes reduce CPU and memory consumption: instead of every replica sorting, aggregating, and compressing the incoming data, one replica does this work once in memory and the resulting files are copied to the other replicas, saving up to 3× in resources.
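A back-of-the-envelope model shows how a bound like "up to 3×" can arise. The Python below is illustrative only: the cost units are hypothetical, and it assumes a three-replica table where shipping finished files costs less than rebuilding them.

```python
def write_cost(replicas: int, process_cost: float, copy_cost: float,
               single_replica: bool) -> float:
    """Hypothetical cost of loading one batch into a table with `replicas` copies.

    process_cost: units to sort, aggregate, and compress the batch once in memory.
    copy_cost: units to ship the finished files to one additional replica.
    """
    if single_replica:
        # One replica builds the files; the others only receive copies.
        return process_cost + (replicas - 1) * copy_cost
    # Classic path: every replica repeats the full in-memory processing.
    return replicas * process_cost


def savings_ratio(replicas: int, process_cost: float, copy_cost: float) -> float:
    """How many times cheaper single-replica writes are under this model."""
    return (write_cost(replicas, process_cost, copy_cost, single_replica=False)
            / write_cost(replicas, process_cost, copy_cost, single_replica=True))


for copy_cost in (0.0, 1.0, 2.0):
    print(f"copy_cost={copy_cost}: {savings_ratio(3, 10.0, copy_cost):.2f}x")
```

This prints 3.00x, 2.50x, and 2.14x: with three replicas the ratio reaches exactly 3 only when copying is free, and any real copy cost pushes it lower, which is consistent with the "up to" wording.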

Monitoring uses Prometheus, Grafana, and Falcon. In addition, a Cloud-Doris daemon runs the health check `select current_timestamp();` every minute to detect unresponsive nodes.
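The per-minute probe can be sketched in Python as follows. Cloud-Doris is Xiaomi's internal tool, so this is not its code. The `run_query` callables (one per node, for example a wrapper around any MySQL-protocol client), the 5-second deadline, and the `on_unhealthy` callback are assumptions; the client library is left out so the sketch stays self-contained.

```python
import concurrent.futures
import time
from typing import Callable, Dict

HEALTH_SQL = "select current_timestamp();"


def check_nodes(nodes: Dict[str, Callable[[str], object]],
                timeout_s: float = 5.0) -> Dict[str, bool]:
    """Probe every node concurrently; healthy means it answered before the deadline.

    `nodes` maps a node name to a hypothetical run_query(sql) callable.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=max(1, len(nodes)))
    futures = {name: pool.submit(run_query, HEALTH_SQL)
               for name, run_query in nodes.items()}
    # One shared deadline for all probes rather than one timeout per node.
    done, _ = concurrent.futures.wait(futures.values(), timeout=timeout_s)
    # Do not block on hung probes; a real client should also set a socket timeout.
    pool.shutdown(wait=False)
    # A node is unhealthy if its probe timed out or raised an error.
    return {name: fut in done and fut.exception() is None
            for name, fut in futures.items()}


def monitor_forever(nodes: Dict[str, Callable[[str], object]],
                    on_unhealthy: Callable[[str], None],
                    interval_s: float = 60.0) -> None:
    """Run the probe once per interval (every minute by default) and report failures."""
    while True:
        for name, healthy in check_nodes(nodes).items():
            if not healthy:
                on_unhealthy(name)
        time.sleep(interval_s)
```

Treating a slow answer the same as an error is deliberate: a node that accepts connections but cannot return a trivial query in time is exactly the unresponsive case the check is meant to catch.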

Through active contributions to the Doris community, Xiaomi helped implement features such as two-phase commit for Stream Load, single-replica writes, and Compaction memory limits, many of which are now part of the Doris 1.1 and 1.2 releases.

The article concludes by inviting readers to join the Apache Doris community, participate in upcoming events, and explore further real‑time analytics use cases.
