Building a Real-Time Unified Data Platform with Apache Doris: Insights from SelectDB
SelectDB shares its perspective on modern data analytics stacks, detailing the current challenges, the evolution of data architectures, and how Apache Doris enables a real‑time unified data foundation, while also reviewing Doris 1.2’s latest features, performance gains, and future roadmap.
Guide: This article shares SelectDB’s view on the modern data analysis stack and its work around Apache Doris.
The presentation covers four points: (1) the current state and challenges of data analysis stacks; (2) building a real‑time unified data foundation with Apache Doris; (3) a deep dive into the latest Apache Doris features; (4) information about SelectDB.
01 Current Data Analysis Stack: Status and Challenges
The modern data analysis stack can be divided into two major categories.
Category 1 : Relational databases (e.g., Oracle) serve as data sources, synchronized to a data warehouse where ETL/ELT processes run, and BI tools (Tableau, Quick BI, etc.) present reports. OLAP engines such as ClickHouse or Apache Doris are often used for acceleration.
Category 2 : Logs or third‑party APIs act as sources. Data is streamed via Kafka or Flink to S3/HDFS and managed by lake‑house solutions (Delta Lake, Hudi, Iceberg). It is then processed by batch (Spark, MapReduce), streaming (Flink), or interactive (Impala, Presto) engines before landing in an OLAP system for analytics such as user behavior analysis, A/B testing, and tracing.
Architecture Evolution : Since Hadoop’s birth in 2006, the stack has evolved through three stages.
Stage ① (Hadoop era): enabled massive data computation, moving from TB‑scale to PB‑scale.
Stage ② (Big‑data bloom): emergence of Spark, Flink, Impala, Presto, and OLAP engines like Doris, Kylin, Druid, ClickHouse.
Stage ③ (Cloud integration): cloud‑native warehouses such as Snowflake abstract away component management, offering SaaS‑style data processing.
Data‑warehouse and big‑data technologies are converging, and cloud infrastructure improves both efficiency and resource management.
Modern Data‑Analytics Requirements
Unchanged: performance (more data processed per unit of cost) and timeliness (extracting value from data in real time).
Changing: flexibility (shorter delivery cycles) and democratization (any employee can explore data).
Key Challenges
Multi‑dimensional reporting: high concurrency, millisecond‑level latency.
Ad‑hoc queries: unpredictable, heavy scans, high CPU/IO pressure.
Unified data warehouse: combine online queries and offline ETL without resource contention.
Lake‑house acceleration: break data silos, present a unified business view.
02 Building a Real‑Time Unified Data Foundation with Apache Doris
Apache Doris, a high‑performance MPP real‑time analytical database, graduated from the Apache Incubator in June 2022. It is used for multi‑dimensional reports, ad‑hoc queries, user profiling, real‑time dashboards, log analysis, and lake‑house acceleration. More than 700 enterprises run Doris in production, which has helped harden its stability and service quality.
Typical Scenarios
Ingesting data from relational (OLTP) databases for analytics.
Ingesting log data to generate PV/UV reports and support IoT time‑series analytics.
Case Study: Internet User Growth Analysis Platform
Previously built on Kudu, Spark, YARN, the platform was simplified to a single Doris architecture, delivering 2‑10× performance improvement. Average query latency is ~10 seconds, 95th‑percentile < 30 seconds, handling tens of thousands of SQLs daily on clusters of hundreds of nodes.
03 Apache Doris Latest Features (Version 1.2)
1. Primary‑Key Model Optimization : Replaces the Merge‑on‑Read Unique‑key model with a primary‑key index + Delete‑Bitmap approach, improving real‑time update performance by >10×.
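The primary‑key index plus Delete‑Bitmap idea can be illustrated with a toy model. This is a minimal sketch, not Doris code: the class `MergeOnWriteTable`, its fields, and the in‑memory dictionaries are hypothetical stand‑ins chosen for illustration. The point it shows is that the merge work happens at write time (old row versions are flagged as deleted), so a read only has to skip flagged positions instead of merging versions.

```python
class MergeOnWriteTable:
    """Toy model of a unique-key table with a primary-key index and delete bitmaps.

    All names here are hypothetical; this is not the Doris implementation.
    """

    def __init__(self):
        self.segments = []        # each segment is a list of (key, value) rows
        self.delete_bitmaps = []  # one set of deleted row positions per segment
        self.pk_index = {}        # key -> (segment_id, position) of the live row

    def write_batch(self, rows):
        """Write a batch as a new segment and flag overwritten rows as deleted."""
        # Keep only the last value written for each key within the batch.
        deduped = list(dict(rows).items())
        segment_id = len(self.segments)
        self.segments.append(deduped)
        self.delete_bitmaps.append(set())
        for position, (key, _value) in enumerate(deduped):
            old = self.pk_index.get(key)
            if old is not None:
                old_segment, old_position = old
                # Mark the previous version as deleted instead of rewriting it.
                self.delete_bitmaps[old_segment].add(old_position)
            self.pk_index[key] = (segment_id, position)

    def scan(self):
        """Return live rows by skipping positions flagged in the delete bitmaps."""
        result = {}
        for segment, deleted in zip(self.segments, self.delete_bitmaps):
            for position, (key, value) in enumerate(segment):
                if position not in deleted:
                    result[key] = value
        return result


table = MergeOnWriteTable()
table.write_batch([(1, "a"), (2, "b")])
table.write_batch([(2, "c"), (3, "d")])  # key 2 is updated
live_rows = table.scan()
```

In this sketch the update of key 2 costs one index lookup and one bitmap insertion at write time, and the scan never compares row versions.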
2. Light Schema Change : Adding or dropping a column is applied by updating metadata only, so the change completes in milliseconds instead of the minutes or hours needed to rewrite data files.
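One way to see why a metadata‑only change can be this fast is a toy table in which data files store values under stable column IDs rather than column names. The sketch below is an assumption‑laden illustration, not Doris internals: `LightSchemaTable`, its column‑ID scheme, and the default‑filling read path are hypothetical, and it covers only adding and dropping a column.

```python
class LightSchemaTable:
    """Toy table where schema changes touch metadata only (hypothetical design)."""

    def __init__(self, columns):
        self.next_id = 0
        self.schema = {}  # column name -> (column id, default value)
        self.files = []   # immutable data files: lists of {column id: value}
        for name in columns:
            self._register(name, None)

    def _register(self, name, default):
        # Every column gets a fresh id, so a re-added name never sees old data.
        self.schema[name] = (self.next_id, default)
        self.next_id += 1

    def insert(self, rows):
        """Store rows keyed by column id under the current schema."""
        self.files.append(
            [{self.schema[name][0]: value for name, value in row.items()}
             for row in rows]
        )

    def add_column(self, name, default=None):
        """Metadata-only: existing files are not rewritten."""
        self._register(name, default)

    def drop_column(self, name):
        """Metadata-only: stale values stay on disk but become unreachable."""
        del self.schema[name]

    def read(self):
        """Project stored rows through the current schema, filling defaults."""
        return [
            {name: stored.get(col_id, default)
             for name, (col_id, default) in self.schema.items()}
            for file_rows in self.files
            for stored in file_rows
        ]


table = LightSchemaTable(["id", "city"])
table.insert([{"id": 1, "city": "Paris"}])
table.add_column("score", default=0)
rows_after_add = table.read()
```

Neither `add_column` nor `drop_column` iterates over `self.files`, which is the property that makes the change independent of data volume in this model.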
3. Multi‑Catalog : SelectDB introduced a mechanism that maps an entire Hive Metastore into Doris, automatically syncing schemas and changes, supporting Iceberg and Hudi, and eliminating the need to create millions of external tables.
4. JDBC Data Source : Replaces ODBC with JDBC for better version compatibility and stability.
5. Hot/Cold Data Separation : Stores hot data on local disks and cold data on S3, reducing storage cost by ~70 %. Rowset‑level granularity automatically classifies data, while preserving full functionality (import, query, schema change).
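Rowset‑level tiering can be sketched as a periodic pass that moves old rowsets from local disk to object storage. The talk does not say which rule Doris uses to classify data, so the age‑based cooldown below is an assumption, and `Rowset`, `cool_down`, and `cooldown_seconds` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rowset:
    rowset_id: int
    created_at: float        # seconds since some epoch
    size_bytes: int
    location: str = "local"  # "local" disk or "remote" object storage


def cool_down(rowsets, now, cooldown_seconds):
    """Move local rowsets older than the cooldown to remote storage.

    Returns the number of bytes moved. The age-based rule is an assumption
    made for this sketch.
    """
    moved_bytes = 0
    for rowset in rowsets:
        age = now - rowset.created_at
        if rowset.location == "local" and age >= cooldown_seconds:
            rowset.location = "remote"
            moved_bytes += rowset.size_bytes
    return moved_bytes


rowsets = [Rowset(1, created_at=0, size_bytes=100),
           Rowset(2, created_at=90, size_bytes=50)]
moved = cool_down(rowsets, now=100, cooldown_seconds=60)
```

Because the decision is made per rowset rather than per table or partition, recent writes to an otherwise old table stay on local disk in this model.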
6. New MemTracker : Introduces three‑level memory limits (process, query, operator). Over‑limit queries are auto‑canceled, providing fine‑grained memory observability.
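The three‑level limit can be pictured as a chain of trackers in which every allocation is charged to the operator, its query, and the process. The following is a minimal sketch under that assumption: the `MemTracker` class here, `MemLimitExceeded`, and the byte values are hypothetical and do not mirror the Doris C++ implementation.

```python
class MemLimitExceeded(Exception):
    """Raised when an allocation would push any tracker over its limit."""


class MemTracker:
    """Hierarchical memory tracker (hypothetical sketch, not the Doris class)."""

    def __init__(self, label, limit_bytes, parent=None):
        self.label = label
        self.limit_bytes = limit_bytes
        self.parent = parent
        self.used_bytes = 0

    def _chain(self):
        # Walk from this tracker up to the root (operator -> query -> process).
        node = self
        while node is not None:
            yield node
            node = node.parent

    def consume(self, nbytes):
        # Check every level first so a rejected allocation changes no counter.
        for node in self._chain():
            if node.used_bytes + nbytes > node.limit_bytes:
                raise MemLimitExceeded(f"{node.label} limit exceeded")
        for node in self._chain():
            node.used_bytes += nbytes

    def release(self, nbytes):
        for node in self._chain():
            node.used_bytes -= nbytes


process = MemTracker("process", limit_bytes=100)
query = MemTracker("query", limit_bytes=60, parent=process)
scan_op = MemTracker("scan operator", limit_bytes=40, parent=query)
scan_op.consume(30)  # charged to operator, query, and process
```

A caller that catches `MemLimitExceeded` could cancel the offending query, which corresponds to the auto‑cancel behavior described above; the per‑level counters are what provide the observability.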
7. Additional New Functions : Supports Array type, nested structures, row‑column conversion, JSON storage, a higher‑precision Decimal type, and Java UDFs.
8. Performance Improvements : Over 100 optimizations yield ~4× speedup versus 1.1, surpassing industry competitors by >3× in ClickBench benchmarks and ranking first on popular cloud instance types.
04 About SelectDB
SelectDB is the commercial entity behind Apache Doris, investing heavily in R&D to advance Doris as a world‑leading open‑source analytical database.
05 Q&A Session
Q1: Where is the S3 path for cold‑hot separation stored? A: In the BE node.
Q2: Any SaaS multi‑tenant storage evolution? A: Not yet, but planned for SelectDB Cloud.
Q3: Does Rowset tiering increase small files? A: No. Cold data is accessed infrequently, and local cache eviction mitigates the issue.
Q4: Cloud‑native features roadmap? A: Version 1.3 will add Kubernetes support and multi‑tenant interfaces.
Q5: Advice on choosing ClickHouse, Doris, StarRocks? A: Doris is preferred due to its ClickHouse‑inspired codebase and superior GROUP‑BY and JOIN performance.
Q6: Storage efficiency in the new version? A: Introduced ZSTD compression for strings, improving storage density; integer compression gains are modest.
Q7: Memory‑management improvements? A: New MemTracker limits memory at multiple levels; version 1.3 will further refine limits and ensure exception‑safe code.
Q8: Does the new Unique‑key model affect Segment V2? Can 1.1 upgrade directly to 1.2? A: No impact on Segment V2; the model is backward compatible, allowing seamless rolling upgrades.
End of the sharing session – thank you.