Principles and Practices of Apache Doris: Architecture, Key Technologies, and Real‑World Use Cases

This article presents a comprehensive overview of Apache Doris, covering its positioning as a distributed MPP analytical database, core architecture with FE and BE nodes, key technologies such as vectorized execution and materialized views, integration with Kafka and Elasticsearch, additional features, roadmap, and detailed case studies from Baidu Statistics and Meituan, illustrating its practical deployment and performance characteristics.

360 Tech Engineering

Jul 18, 2019

Recently, Baidu senior R&D engineer Li Chaoyong was invited to 360 to share the principles and practice of the open‑source database Apache Doris.

Apache Doris Overview

Doris (originally Baidu Palo) is a distributed SQL database built on massively parallel processing (MPP) technology. Baidu open‑sourced it in 2017, and it entered the Apache Incubator in August 2018.

Doris Positioning

MPP‑architecture relational analytical database

PB‑level data with sub‑second query latency

Designed for multi‑dimensional analysis and reporting

Architecture

Doris has a simple architecture consisting of two roles and two processes: Frontend (FE) and Backend (BE). FE stores and maintains cluster metadata, while BE stores physical data.

From a query perspective, FE receives and parses SQL, plans and schedules execution, and returns results; BE executes the physical plan generated by FE. Both FE and BE can scale linearly.

FE includes three node types: leader, follower, and observer. Leader and followers provide high‑availability metadata; observers add query capacity without handling writes.

BE ensures data reliability through configurable replication (default three replicas).

Frontend Metadata Management

FE keeps metadata in memory for fast access and relies on Paxos consensus, journaling, and checkpointing for durability. Each update is first appended to a log file and then applied in memory; the in‑memory state is periodically checkpointed to disk.
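The log‑then‑memory‑then‑checkpoint order described above can be sketched on a single node as follows. This is only an illustration of the pattern, not Doris's implementation: the class `MetadataStore`, the file names `journal.log` and `image.json`, and the JSON encoding are all assumptions made for the example, and replication between FE nodes is left out entirely.

```python
import json
import os
import tempfile


class MetadataStore:
    """Toy in-memory metadata store with an edit journal and checkpoints (hypothetical)."""

    def __init__(self, directory):
        self.journal_path = os.path.join(directory, "journal.log")
        self.image_path = os.path.join(directory, "image.json")
        self.state = {}
        self._recover()

    def _recover(self):
        # Load the latest checkpoint image, then replay the journal on top of it.
        if os.path.exists(self.image_path):
            with open(self.image_path) as f:
                self.state = json.load(f)
        if os.path.exists(self.journal_path):
            with open(self.journal_path) as f:
                for line in f:
                    entry = json.loads(line)
                    self.state[entry["key"]] = entry["value"]

    def put(self, key, value):
        # 1. Append the change to the journal and force it to disk.
        with open(self.journal_path, "a") as f:
            f.write(json.dumps({"key": key, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())
        # 2. Only then apply it to the in-memory state.
        self.state[key] = value

    def checkpoint(self):
        # Write the full state to a temporary file, swap it in atomically,
        # then truncate the journal because the image now covers its entries.
        tmp_path = self.image_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(self.state, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp_path, self.image_path)
        open(self.journal_path, "w").close()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        store = MetadataStore(directory)
        store.put("table:orders", {"replicas": 3})
        store.checkpoint()
        store.put("table:users", {"replicas": 3})  # lives only in the journal
        # Simulate a restart: a new instance rebuilds state from image + journal.
        return MetadataStore(directory).state
```

Calling `demo()` shows that an update written after the last checkpoint survives a restart because it is replayed from the journal.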

Data Distribution

Source data can reside in HDFS, Kafka, or object storage such as Baidu Object Storage (BOS) or Amazon S3. It is ingested into Doris, stored on BE nodes under multi‑replica management, and then visualized through reporting tools.

Key Technologies

Overall Architecture

FE manages cluster metadata and handles query planning; BE stores the physical data and executes queries. FE parses each query, creates a logical plan, and distributes it to BE nodes for execution.

Distributed Logical Plan

Doris adapts Impala’s query engine, moving the entire planning stage to FE to avoid stale metadata and improve performance.

Distributed Physical Plan

The physical plan is broken into fragments executed by individual BE nodes, with data exchange via RPC. FE aggregates results and returns them through the MySQL protocol.
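As a rough illustration of fragment‑based execution, the sketch below splits a `SUM ... GROUP BY` into a partial aggregation that runs on each backend's local rows and a final merge step. Everything here is an assumption made for the example: the function names `run_fragment` and `merge_partials` are hypothetical, plain function calls stand in for the RPC data exchange, and the sample rows are invented.

```python
from collections import Counter


def run_fragment(local_rows):
    """Runs on one backend: partial SUM(amount) GROUP BY city over local rows."""
    partial = Counter()
    for city, amount in local_rows:
        partial[city] += amount
    return partial


def merge_partials(partials):
    """Final stage: merge the partial sums received from every fragment."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds values for matching keys
    return dict(total)


# Invented rows held by two hypothetical backends.
be1_rows = [("beijing", 10), ("shanghai", 5)]
be2_rows = [("beijing", 7), ("shenzhen", 3)]

result = merge_partials([run_fragment(be1_rows), run_fragment(be2_rows)])
```

Because each fragment reduces its data before anything is exchanged, only one small partial result per backend has to cross the network.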

Vectorized Execution

Columnar storage enables vectorized processing, reducing CPU cache misses and improving throughput (3‑4× speedup in star‑schema benchmarks).
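The contrast between row‑at‑a‑time and batch‑at‑a‑time processing can be sketched as below. This is a conceptual illustration only: pure Python will not reproduce the cache and throughput effects mentioned above, and the function names, column names, and batch size are assumptions made for the example.

```python
from array import array


def sum_qty_row_at_a_time(rows, threshold):
    """Row-oriented: evaluate the filter and the aggregate once per row."""
    total = 0
    for row in rows:
        if row["price"] > threshold:
            total += row["qty"]
    return total


def sum_qty_batched(price_col, qty_col, threshold, batch_size=1024):
    """Column-oriented: process each operator over a whole batch of values."""
    total = 0
    for start in range(0, len(price_col), batch_size):
        prices = price_col[start:start + batch_size]
        qtys = qty_col[start:start + batch_size]
        # Step 1: run the predicate over the batch, producing a selection vector.
        selected = [i for i, price in enumerate(prices) if price > threshold]
        # Step 2: aggregate only the selected positions of the other column.
        total += sum(qtys[i] for i in selected)
    return total


# Invented sample data stored as two contiguous columns.
price_col = array("d", [5.0, 12.0, 8.0, 20.0])
qty_col = array("q", [1, 2, 3, 4])
rows = [{"price": p, "qty": q} for p, q in zip(price_col, qty_col)]
```

Both functions return the same answer; the batched version touches one column at a time in tight loops, which is the access pattern that columnar storage makes possible.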

Materialized Views

To overcome sparse index limitations, Doris supports materialized views that reorder data and pre‑aggregate columns, improving multi‑dimensional query performance at the cost of additional storage.
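The pre‑aggregation idea can be sketched as follows: a base table keyed by several dimensions is rolled up onto a smaller set of dimensions, so a query on those dimensions scans far fewer rows. The function `build_rollup`, the column names, and the sample rows are hypothetical and chosen only for illustration; they do not reflect Doris's storage format.

```python
from collections import defaultdict


def build_rollup(rows, dims, metric):
    """Pre-aggregate SUM(metric) grouped by dims, with keys kept in sorted order."""
    rollup = defaultdict(int)
    for row in rows:
        key = tuple(row[d] for d in dims)
        rollup[key] += row[metric]
    return dict(sorted(rollup.items()))


# Invented base data: page views per (date, city).
base_rows = [
    {"date": "2019-07-01", "city": "beijing", "pv": 3},
    {"date": "2019-07-01", "city": "shanghai", "pv": 2},
    {"date": "2019-07-02", "city": "beijing", "pv": 4},
]

by_city = build_rollup(base_rows, ["city"], "pv")
by_city_date = build_rollup(base_rows, ["city", "date"], "pv")
```

Here `by_city` holds two rows instead of three, and `by_city_date` stores the same data reordered with `city` as the leading key. Both copies cost extra storage, which is the trade‑off noted above.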

Two‑Level Partitioning and Tiered Storage

Logical (time‑based) partitions separate hot and cold data, allowing placement on SSD or SATA. Physical hash partitions distribute data across the cluster for balanced load.
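A minimal sketch of the two levels, assuming monthly range partitions, CRC32 as the bucketing hash, and a 30‑day threshold for hot data. None of these specifics come from the talk; the names `partition_for`, `bucket_for`, and `storage_medium` are hypothetical.

```python
import zlib
from datetime import date


def partition_for(day: date) -> str:
    """Level 1: logical, time-based partition (one per month in this sketch)."""
    return f"p{day.year}{day.month:02d}"


def bucket_for(key: str, num_buckets: int) -> int:
    """Level 2: physical hash bucket that spreads rows across the cluster."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def storage_medium(day: date, today: date, hot_days: int = 30) -> str:
    """Tiering rule (assumed): recent data goes to SSD, older data to SATA."""
    return "SSD" if (today - day).days <= hot_days else "SATA"


def locate(day: date, key: str, today: date, num_buckets: int = 8):
    """Return (partition, bucket, medium) for one row."""
    return partition_for(day), bucket_for(key, num_buckets), storage_medium(day, today)


today = date(2019, 7, 18)
recent = locate(date(2019, 7, 1), "user_42", today)
old = locate(date(2019, 1, 1), "user_42", today)
```

The same key always maps to the same bucket number within a partition, while the partition alone decides which storage tier the data lands on.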

Integration with Elasticsearch

Doris can create external tables backed by Elasticsearch, combining Doris’s SQL capabilities with Elasticsearch’s inverted index for point‑lookup workloads.

Kafka Routine Load

Since version 0.10, Doris natively supports streaming ingestion from Kafka via a built‑in routine load task, eliminating the need for external scripts.

Other Features

ACID guarantees for atomic imports and schema changes

LLVM‑based query acceleration

User‑Defined Functions (UDF)

Support for HyperLogLog type

Roadmap

Enhance storage format with richer compression algorithms

Optimize file organization for better I/O efficiency

Cost‑model‑driven query optimization

Separate compute and storage for cloud‑native deployments

Case Studies

Baidu Statistics

Baidu Statistics uses an aggregate‑key model for high‑concurrency reporting (≈1000 QPS) with sub‑60 ms latency, and a duplicate‑key model for fine‑grained multi‑dimensional analysis.

Meituan

Meituan values Doris's rich SQL support, simple architecture, and high availability. It runs a 70‑node Doris cluster storing more than 18 TB, relying on dynamic scaling, Paxos‑based FE high availability, and BE replication.

Conclusion

The presentation covered Doris’s design principles, core components, performance optimizations, ecosystem integrations, and real‑world deployments, providing a solid foundation for practitioners interested in building real‑time analytical data warehouses.
