Principles and Practices of Apache Doris: Architecture, Key Technologies, and Real‑World Use Cases
This article gives an overview of Apache Doris: its positioning as a distributed MPP analytical database, its core architecture of FE and BE nodes, key technologies such as vectorized execution and materialized views, integration with Kafka and Elasticsearch, other features, and the roadmap. It closes with case studies from Baidu Statistics and Meituan that illustrate practical deployment and performance characteristics.
Recently, Baidu senior R&D engineer Li Chaoyong was invited to 360 to share the principles and practice of the open‑source database Apache Doris.
Apache Doris Overview
Doris (originally Baidu Palo) is a distributed SQL database built on massively parallel processing (MPP) technology. Baidu open‑sourced it in 2017, and it entered the Apache Incubator in August 2018.
Doris Positioning
MPP‑architecture relational analytical database
PB‑level data with sub‑second query latency
Designed for multi‑dimensional analysis and reporting
Architecture
Doris has a simple architecture consisting of two roles and two processes: Frontend (FE) and Backend (BE). FE stores and maintains cluster metadata, while BE stores physical data.
From a query perspective, FE receives and parses SQL, plans and schedules execution, and returns results; BE executes the physical plan generated by FE. Both FE and BE can scale linearly.
FE includes three node types: leader, follower, and observer. The leader and followers together keep metadata highly available; observers add query capacity without handling metadata writes.
BE ensures data reliability through configurable replication (default three replicas).
Frontend Metadata Management
Metadata is held in memory for fast access and made durable through journaling, checkpointing, and Paxos consensus. Each update is first written to a log file, then applied in memory, and the in‑memory state is periodically checkpointed to disk.
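The log-then-memory-then-checkpoint order can be sketched in a few lines of Python. This is a minimal single-node illustration, not Doris's implementation: the class MetadataStore, the file names journal.log and image.json, and the JSON encoding are all assumptions made for the example, and the consensus step between FE nodes is left out entirely.

```python
import json
import os
import tempfile


class MetadataStore:
    """Toy in-memory catalog with a write-ahead journal and checkpoints (hypothetical)."""

    def __init__(self, directory):
        self.journal_path = os.path.join(directory, "journal.log")
        self.image_path = os.path.join(directory, "image.json")
        self.state = {}
        self._recover()

    def _recover(self):
        # Load the last checkpoint image, then replay journal entries written after it.
        if os.path.exists(self.image_path):
            with open(self.image_path) as f:
                self.state = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.state[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1. Append the update to the journal and force it to disk.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only then apply it to the in-memory state.
        self.state[key] = value

    def checkpoint(self):
        # Write the full state to a temp file, swap it in atomically, then clear the journal.
        tmp = self.image_path + ".tmp"
        with open(tmp, "w") as f:
            json.dump(self.state, f)
        os.replace(tmp, self.image_path)
        open(self.journal_path, "w").close()


workdir = tempfile.mkdtemp()
store = MetadataStore(workdir)
store.put("db1.tbl1", {"replicas": 3})
store.checkpoint()
store.put("db1.tbl2", {"replicas": 1})

# A restarted process rebuilds the same state from image + journal.
recovered = MetadataStore(workdir)
```

Because the journal is flushed before memory is touched, a crash between the two steps loses nothing: recovery replays the entry on restart.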
Data Distribution
User data can reside in HDFS, Kafka, or object storage (BOS, Amazon S3). Data is ingested into Doris, stored on BE nodes with multi‑replica management, and visualized via reporting tools.
Key Technologies
Overall Architecture
FE handles cluster metadata and query planning; BE stores the physical data and executes queries. FE parses each query, creates a logical plan, and distributes the resulting plan to BE nodes for execution.
Distributed Logical Plan
Doris adapts Impala’s query engine, moving the entire planning stage to FE to avoid stale metadata and improve performance.
Distributed Physical Plan
The physical plan is broken into fragments executed by individual BE nodes, with data exchange via RPC. FE aggregates results and returns them through the MySQL protocol.
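As a rough illustration of fragment-based execution, the sketch below runs a query shaped like SELECT city, SUM(clicks) ... GROUP BY city in two phases: each fragment aggregates the rows local to one backend, and a final step merges the partial results. The function names, the backend labels be1 to be3, and the sample rows are invented for the example; the real system exchanges data over RPC rather than in-process calls.

```python
from collections import Counter


def scan_fragment(local_rows):
    """Runs on one backend: partial SUM(clicks) GROUP BY city over local rows."""
    partial = Counter()
    for city, clicks in local_rows:
        partial[city] += clicks
    return partial


def merge_fragments(partials):
    """Final step: merge the partial aggregates produced by every fragment."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts key by key
    return dict(total)


# Hypothetical data placement: each backend holds a slice of the table.
backend_data = {
    "be1": [("beijing", 3), ("shanghai", 1)],
    "be2": [("beijing", 2)],
    "be3": [("shanghai", 4), ("shenzhen", 5)],
}

partials = [scan_fragment(rows) for rows in backend_data.values()]
result = merge_fragments(partials)
```

Only the small partial aggregates cross the network, not the raw rows, which is what makes this split worthwhile.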
Vectorized Execution
Columnar storage enables vectorized processing, reducing CPU cache misses and improving throughput (3‑4× speedup in star‑schema benchmarks).
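The difference in control flow between row-at-a-time and vectorized execution can be sketched as follows. This is an illustration only: the function names, the batch size of 1024, and the price/qty columns are assumptions, and pure Python will not reproduce the speedup, which comes from tight loops over contiguous column data in a compiled engine.

```python
def sum_row_at_a_time(rows, min_price):
    """Row-oriented: filter and aggregate are evaluated once per row."""
    total = 0
    for row in rows:
        if row["price"] > min_price:
            total += row["qty"]
    return total


def sum_vectorized(columns, min_price, batch_size=1024):
    """Column-oriented: each operator processes a whole batch of one column."""
    price, qty = columns["price"], columns["qty"]
    total = 0
    for start in range(0, len(price), batch_size):
        price_batch = price[start:start + batch_size]
        qty_batch = qty[start:start + batch_size]
        # Filter operator: build a selection vector of matching positions.
        selected = [i for i, p in enumerate(price_batch) if p > min_price]
        # Aggregate operator: consume only the selected positions.
        total += sum(qty_batch[i] for i in selected)
    return total


# The same data in both layouts.
rows = [{"price": p, "qty": p % 7} for p in range(5000)]
columns = {
    "price": [r["price"] for r in rows],
    "qty": [r["qty"] for r in rows],
}
```

Both functions return the same answer; the vectorized form simply touches one column at a time, so each inner loop works on values of a single type stored next to each other.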
Materialized Views
To overcome sparse index limitations, Doris supports materialized views that reorder data and pre‑aggregate columns, improving multi‑dimensional query performance at the cost of additional storage.
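The trade-off can be made concrete with a small sketch. Here a base table ordered by (date, city, user) gets a pre-aggregated view keyed on city alone; build_rollup, the column names, and the sample rows are hypothetical and only mimic the idea of reordering and pre-aggregating.

```python
from collections import defaultdict


def build_rollup(rows, dims, metric):
    """Pre-aggregate `metric` by `dims` and keep the result sorted by those dims."""
    agg = defaultdict(int)
    for row in rows:
        agg[tuple(row[d] for d in dims)] += row[metric]
    return sorted(agg.items())


# Base table, ordered by (date, city, user).
base = [
    {"date": "2019-01-01", "city": "beijing", "user": "u1", "pv": 3},
    {"date": "2019-01-01", "city": "shanghai", "user": "u2", "pv": 1},
    {"date": "2019-01-02", "city": "beijing", "user": "u1", "pv": 2},
    {"date": "2019-01-02", "city": "beijing", "user": "u3", "pv": 4},
]

# A view keyed on city: fewer rows, and city is now the leading sort column.
city_view = build_rollup(base, ["city"], "pv")
```

A query that groups or filters by city alone can read the two rows of city_view instead of scanning all four base rows, at the cost of storing the extra copy.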
Two‑Level Partitioning and Tiered Storage
Logical (time‑based) partitions separate hot and cold data, allowing placement on SSD or SATA. Physical hash partitions distribute data across the cluster for balanced load.
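A minimal sketch of the two-level lookup, assuming quarterly range partitions, eight hash buckets, and CRC32 as the hash function. All three are choices made for the example rather than Doris defaults, and the partition names are invented.

```python
import zlib
from datetime import date

# Level 1: time-based range partitions, each tagged with a storage medium.
# (name, exclusive upper bound, medium); older data sits on cheaper disks.
PARTITIONS = [
    ("p2019_q1", date(2019, 4, 1), "SATA"),
    ("p2019_q2", date(2019, 7, 1), "SSD"),
]

# Level 2: hash buckets inside every partition.
NUM_BUCKETS = 8


def locate(event_date, user_id):
    """Return (partition, bucket, medium) for one row."""
    for name, upper, medium in PARTITIONS:
        if event_date < upper:
            bucket = zlib.crc32(str(user_id).encode()) % NUM_BUCKETS
            return name, bucket, medium
    raise ValueError("no partition covers %s" % event_date)


cold = locate(date(2019, 2, 1), 42)
hot = locate(date(2019, 5, 1), 42)
```

The date alone decides the partition and therefore the storage tier, while the hash of the bucketing column spreads rows evenly across nodes, so the same user lands in the same bucket number in every partition.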
Integration with Elasticsearch
Doris can create external tables backed by Elasticsearch, combining Doris’s SQL capabilities with Elasticsearch’s inverted index for point‑lookup workloads.
Kafka Routine Load
Since version 0.10, Doris natively supports streaming ingestion from Kafka via a built‑in routine load task, eliminating the need for external scripts.
Other Features
ACID guarantees for atomic imports and schema changes
LLVM‑based query acceleration
User‑Defined Functions (UDF)
Support for HyperLogLog type
Roadmap
Enhance storage format with richer compression algorithms
Optimize file organization for better I/O efficiency
Cost‑model‑driven query optimization
Separate compute and storage for cloud‑native deployments
Case Studies
Baidu Statistics
Baidu Statistics uses an aggregate‑key model for high‑concurrency reporting (≈1000 QPS) with sub‑60 ms latency, and a duplicate‑key model for fine‑grained multi‑dimensional analysis.
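The contrast between the two data models can be sketched as follows. The loader functions and the (site, day) key are hypothetical; the point is only that an aggregate-key table merges rows with the same key at load time, while a duplicate-key table keeps every row.

```python
def load_aggregate(table, key, pv):
    """Aggregate-key model: rows sharing a key are merged on load (SUM here)."""
    table[key] = table.get(key, 0) + pv


def load_duplicate(table, row):
    """Duplicate-key model: every ingested row is kept as-is."""
    table.append(row)


events = [
    ("site_a", "2019-01-01", "u1", 1),
    ("site_a", "2019-01-01", "u2", 1),
    ("site_b", "2019-01-01", "u1", 1),
]

report_table = {}   # small and fast to serve for fixed reports
detail_table = []   # full detail for ad hoc analysis

for site, day, user, pv in events:
    load_aggregate(report_table, (site, day), pv)
    load_duplicate(detail_table, (site, day, user, pv))

# Only the detail table can still answer per-user questions.
users_on_site_a = {row[2] for row in detail_table if row[0] == "site_a"}
```

The aggregated table answers fixed report queries from far fewer rows, which suits high-concurrency serving; the detail table is larger but supports dimensions the report never anticipated.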
Meituan
Meituan values Doris's rich SQL support, simple architecture, and high availability. It runs a 70‑node Doris cluster storing more than 18 TB, relying on dynamic scaling, Paxos‑based FE high availability, and BE replication.
Conclusion
The presentation covered Doris’s design principles, core components, performance optimizations, ecosystem integrations, and real‑world deployments, providing a solid foundation for practitioners interested in building real‑time analytical data warehouses.
360 Tech Engineering
Official tech channel of 360, building the most professional technology aggregation platform for the brand.