
Unlocking Apache Doris: How Lakehouse Integration Supercharges Data Analytics

This article walks through Apache Doris’s lakehouse‑in‑one architecture: the core value and paradigm of the lakehouse model, the system’s components and use cases, and technical challenges such as file‑format diversity and unstable I/O. It then presents a suite of optimizations, from predicate push‑down and partition pruning to metadata caching and dynamic scheduling, that substantially improve query performance and resource utilization, and closes with the project’s future roadmap.

DataFunSummit

Introduction

The session titled "Apache Doris Lakehouse‑in‑One Technical Analysis" introduces the core value and paradigm of lakehouse technology and outlines the agenda covering lakehouse fundamentals, Doris architecture, use cases, technical challenges, optimization techniques, performance results, future plans, and a Q&A.

Lakehouse Core Value and Paradigm

Lakehouses combine the low‑cost, scalable storage of data lakes (supporting structured, semi‑structured, and unstructured data) with the high‑performance analytics of data warehouses. They use a schema‑on‑read approach for flexibility, but early Hadoop‑based lakes suffered performance bottlenecks that have been mitigated over time.

Traditional lake‑warehouse separation leads to data and application fragmentation, higher costs, and lower efficiency. The lakehouse‑in‑one model unifies storage, metadata, and compute, offering a single API for real‑time queries, batch processing, and AI workloads.

Apache Doris Lakehouse‑in‑One Architecture

Doris acts as a unified lakehouse query engine, integrating storage layers (S3, HDFS) with file formats (Parquet, ORC) and table formats (Hive, Iceberg, Paimon, Hudi, Delta Lake). Its core components include a high‑performance vectorized execution engine, pipeline execution, a cost‑based optimizer, materialized views, JDBC/MySQL‑protocol access, Arrow Flight for AI workloads, and unified metadata management across multiple catalogs.

Use Cases

Enterprise‑level lakehouse acceleration for existing Hadoop/Hive/Spark ecosystems.

Federated analytics across heterogeneous sources (Hive, MySQL, etc.) using a unified query engine.

Lightweight data‑warehouse scenarios for mid‑size enterprises and CDP deployments.

Technical Challenges

Key challenges include diverse data formats (Iceberg, Paimon, Hudi, Parquet, ORC), unstable I/O performance on object stores, and complex resource management (concurrent queries, inaccurate statistics).

Optimization Techniques

File‑Read Optimizations

Predicate push‑down using min‑max and Bloom filters on Parquet/ORC.

Partition pruning to skip irrelevant partitions.

Delayed materialization to read only necessary columns before applying filters.

Dictionary filtering to compare integer codes instead of strings.
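The min‑max pruning idea behind predicate push‑down can be sketched in a few lines of Python. This is an illustrative model, not Doris’s C++ implementation; the `RowGroupStats` shape is invented, standing in for the statistics a Parquet or ORC footer actually stores:

```python
from dataclasses import dataclass

@dataclass
class RowGroupStats:
    """Per-row-group column statistics, as a Parquet footer would store them."""
    min_val: int
    max_val: int

def prune_row_groups(stats, lo, hi):
    """Keep only row groups whose [min, max] range can contain a value in
    [lo, hi]; the rest are skipped without any data I/O."""
    return [i for i, s in enumerate(stats)
            if s.max_val >= lo and s.min_val <= hi]

groups = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]
# Predicate 120 <= x <= 180: only the middle row group can match.
print(prune_row_groups(groups, 120, 180))  # [1]
```

Bloom filters extend the same idea to equality predicates on high‑cardinality columns, where min‑max ranges are too coarse to skip anything.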

I/O Optimizations

Merge small I/O requests into larger ranges.

Local block caching of remote object‑store reads.

Special handling for tiny files, ORC stripes, and row‑store formats.
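Merging small I/O requests can be sketched as follows; this is a simplified model of range coalescing, with the gap and size thresholds chosen arbitrarily for illustration rather than taken from Doris:

```python
def merge_ranges(ranges, max_gap=64 * 1024, max_merged=8 * 1024 * 1024):
    """Coalesce small (offset, length) reads into larger requests when the
    gap between neighbors is under max_gap and the merged request stays
    within max_merged bytes. Cuts request count on object stores, where
    per-request latency dominates small reads."""
    merged = []
    for off, length in sorted(ranges):
        if merged:
            m_off, m_len = merged[-1]
            gap = off - (m_off + m_len)
            new_len = off + length - m_off
            if gap <= max_gap and new_len <= max_merged:
                merged[-1] = (m_off, new_len)
                continue
        merged.append((off, length))
    return merged

# Two nearby column-chunk reads collapse into one request; the distant one stays separate.
reads = [(0, 4096), (5000, 4096), (10_000_000, 4096)]
print(merge_ranges(reads))  # [(0, 9096), (10000000, 4096)]
```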

Metadata Optimizations

Batch split assignment with locality awareness and consistent hashing.

Multi‑level metadata caches (catalog, schema, partition, file list).
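The locality‑aware split assignment above relies on consistent hashing: routing the same file split to the same backend keeps its cached blocks warm, while adding or removing a node only remaps a small fraction of splits. A minimal sketch, with invented node names and a virtual‑node count picked for illustration:

```python
import bisect
import hashlib

def _hash(key: str) -> int:
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")

class ConsistentHashRing:
    """Maps file splits to backend nodes via a hash ring with virtual nodes,
    so each split lands on a stable node for cache locality."""
    def __init__(self, nodes, vnodes=64):
        self.ring = sorted((_hash(f"{n}#{i}"), n)
                           for n in nodes for i in range(vnodes))
        self._points = [p for p, _ in self.ring]

    def node_for(self, split_path: str) -> str:
        # First ring point at or after the split's hash, wrapping around.
        idx = bisect.bisect_left(self._points, _hash(split_path)) % len(self.ring)
        return self.ring[idx][1]

ring = ConsistentHashRing(["be1", "be2", "be3"])
# The same split is always routed to the same backend.
assert ring.node_for("s3://bucket/t/part-0001.parquet") == \
       ring.node_for("s3://bucket/t/part-0001.parquet")
```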

Scheduling Optimizations

Join Runtime Filters (JRF) built from a join’s small (build) side to prune the large (probe) side during execution.

Dynamic partition pruning based on runtime filters.

Dynamic priority scheduling to prevent query starvation.
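The runtime‑filter mechanism can be sketched in pure Python. Real engines typically push a Bloom filter or IN‑list into the probe‑side scan; here a plain set stands in for that filter, and the table and column names are invented:

```python
def runtime_filter_probe(build_rows, probe_rows, build_key, probe_key):
    """Collect the join keys from the small (build) side and use them to
    drop non-matching probe-side rows before the join runs, shrinking the
    data the join operator ever sees."""
    keys = {row[build_key] for row in build_rows}  # stand-in for a Bloom filter
    return [row for row in probe_rows if row[probe_key] in keys]

dim = [{"id": 1}, {"id": 3}]                           # small dimension table
fact = [{"dim_id": i, "v": i * 10} for i in range(6)]  # large fact table
print(runtime_filter_probe(dim, fact, "id", "dim_id"))
# [{'dim_id': 1, 'v': 10}, {'dim_id': 3, 'v': 30}]
```

Dynamic partition pruning applies the same filter one level earlier: if the surviving keys touch only some partitions, the other partitions are never scanned at all.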

Statistics Optimizations

Collecting statistics from Hive Metastore, Iceberg metadata tables, and JDBC system tables to guide join order, execution strategy, and data‑skew handling.
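One way statistics guide join order is a greedy heuristic that starts from the smallest tables, keeping intermediate results small. This sketch is a deliberate simplification of what a cost‑based optimizer does, with made‑up table names and row counts:

```python
def order_joins_by_cardinality(tables, stats):
    """Order join inputs by estimated row count, smallest first. `stats`
    maps table name -> estimated rows, as collected from the Hive
    Metastore, Iceberg metadata tables, or JDBC system tables; tables
    with no statistics sort last."""
    return sorted(tables, key=lambda t: stats.get(t, float("inf")))

stats = {"orders": 10_000_000, "customers": 50_000, "regions": 25}
print(order_joins_by_cardinality(["orders", "customers", "regions"], stats))
# ['regions', 'customers', 'orders']
```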

Other Optimizations

SIMD for fixed‑length fields, reducing virtual function calls, and optimizing nullable handling.

Performance Results

In batch processing (TPC‑DS, 10 TB), Doris outperforms Spark by roughly 30%. In interactive analytics (TPC‑DS, 1 TB), Doris is up to 3× faster than Trino on Iceberg tables and up to 10× faster on Doris‑native tables. Real‑world deployments report up to a 40% reduction in 95th‑percentile latency along with significant CPU savings.

Future Roadmap

Support for writing to external lake formats (Paimon, Iceberg rewrite, snapshots).

Extended format support (Iceberg V3, Variant, Geo types).

Integration with open data catalogs (Gravitino, Unity Catalog, Polaris).

Q&A Highlights

Answers covered Doris’s write capabilities to Hive/Iceberg, current limitations for CDC streaming, the role of materialized views in balancing freshness versus performance, and upcoming enhancements for write‑path optimizations.

Conclusion

The presentation demonstrates how Apache Doris’s lakehouse‑in‑one design delivers unified, high‑performance analytics while addressing the traditional trade‑offs between data freshness and query speed, and outlines a clear path for future feature expansion.

Written by DataFunSummit, the official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
