
Building a Stream‑Batch Integrated Data Architecture with Apache Doris at Zongteng Group

This article details how Zongteng Group's data technology architects designed and implemented a new stream‑batch unified data platform using Apache Doris, covering the shortcomings of the early CDH‑based architecture, the selection process, data modeling, ingestion pipelines, performance testing, operational optimizations, and future plans.

DataFunTalk

Fujian Zongteng Network Co., Ltd. (Zongteng Group) faced increasing data demands that the legacy CDH‑based data warehouses could no longer meet due to complex stacks, low efficiency, and poor data quality. In 2022 the group introduced Apache Doris and built a new stream‑batch integrated data architecture centered on Doris.

Early Architecture

The early data warehouse consisted of two independent CDH clusters serving different product lines, data dashboards, and BI reports. While low coupling simplified management, it caused metadata gaps, data silos, high maintenance costs, and divergent technology stacks.

Problems Encountered

Insufficient metadata and data‑quality governance.

Data silos caused by isolated storage.

Data centers spread across multiple locations, leading to high maintenance overhead.

Diverse and inconsistent technology stacks across clusters, making unified operations difficult.

Architecture Selection

After evaluating traditional warehouses, real‑time warehouses, and data lakes, the team chose a real‑time warehouse model that best matched their requirements. A comparative analysis of ClickHouse, Apache Druid, and Apache Doris showed Doris’s clear advantages in simplicity, performance, and ecosystem support, making it the optimal choice.

New Data Architecture

The new architecture simplifies data collection, storage, and computation by:

Integrating DataHub for custom metadata collection and scheduling.

Using Apache SeaTunnel with the Flink Doris Connector for unified full and incremental data ingestion.

Converging storage technologies (ClickHouse, Kudu, HBase) into Doris for unified stream‑batch storage.

Leveraging Doris as the core data plane, combined with Apache Kyuubi’s JDBC engine and Spark‑Doris connector for ETL development.

Data Platform (Data Middle‑Platform)

Based on Doris, the platform provides an indicator center, metadata center, configuration center, ad‑hoc analysis, and data‑service interfaces, already serving hundreds of metrics.

Data Modeling

For the ODS layer the team adopted the Unique model to enable merge‑on‑write, improving real‑time write performance. The DIM/DWD/DWS/ADS layers use the Aggregate model, whose aggregation methods (SUM, MAX, MIN, REPLACE_IF_NOT_NULL) cover a range of analytical needs.
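
The Aggregate model merges rows that share the same key, applying each value column's aggregation type. A minimal Python simulation of that merging behavior (the table columns here are hypothetical; this only mimics Doris's semantics, it is not Doris code):

```python
# Simulate Doris Aggregate-model merging for rows sharing the same key.
# Each value column carries an aggregation type; REPLACE_IF_NOT_NULL keeps
# the existing value when the incoming value is NULL (None).
AGGS = {
    "SUM": lambda old, new: old + new,
    "MAX": max,
    "MIN": min,
    "REPLACE_IF_NOT_NULL": lambda old, new: old if new is None else new,
}

def merge_rows(rows, key_cols, value_aggs):
    """rows: list of dicts; key_cols: key column names;
    value_aggs: {value column: aggregation type}."""
    merged = {}
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        if key not in merged:
            merged[key] = dict(row)
        else:
            acc = merged[key]
            for col, agg in value_aggs.items():
                acc[col] = AGGS[agg](acc[col], row[col])
    return list(merged.values())

rows = [
    {"dt": "2022-10-01", "sku": "A", "sales": 3, "price": 10, "note": "x"},
    {"dt": "2022-10-01", "sku": "A", "sales": 2, "price": 12, "note": None},
]
out = merge_rows(rows, ["dt", "sku"],
                 {"sales": "SUM", "price": "MAX", "note": "REPLACE_IF_NOT_NULL"})
# Both input rows collapse into one: sales summed, price maxed,
# note unchanged because the incoming value was NULL.
print(out)
```
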

Data Ingestion

ODS data is primarily loaded via Stream Load, with historical data imported through Broker Load or Spark Load. DW layer data uses INSERT INTO statements, and many ETL jobs run on Kyuubi‑on‑Spark to reduce Doris memory pressure. Over ten thousand daily DW tasks are scheduled via DolphinScheduler.
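
Stream Load is Doris's HTTP-based import interface: the client issues a PUT to an FE endpoint of the form `/api/<db>/<table>/_stream_load` with a unique label for idempotency. The sketch below only assembles the request components without sending anything; the host, database, and table names are placeholders, and 8030 is the default FE HTTP port:

```python
# Build the components of a Doris Stream Load request (HTTP PUT).
import uuid

def stream_load_request(fe_host, db, table, fmt="csv", column_separator=","):
    url = f"http://{fe_host}:8030/api/{db}/{table}/_stream_load"
    headers = {
        # A unique label makes the load idempotent: retrying with the
        # same label will not import the same batch twice.
        "label": f"{table}_{uuid.uuid4().hex}",
        "format": fmt,
        "column_separator": column_separator,
        # Stream Load bodies can be large; ask the server to ack first.
        "Expect": "100-continue",
    }
    return url, headers

url, headers = stream_load_request("doris-fe.example.com", "ods", "ods_orders")
print(url)
```

In practice the same components map directly onto a `curl -T data.csv` invocation or an HTTP client call with basic auth.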

Practical Experience

The migration was divided into six stages: preparation, verification, stress testing, deployment, promotion, and operation. Key highlights include:

Choosing Apache Doris 0.15‑release for its transactional insert, Unique‑key upsert, SQL blocklist, and multi‑tenant resource groups.

Verifying connectors, federated queries, load methods, and materialized views.

Stress testing with TPC‑DS datasets (1 TB, 5 TB, 10 TB) and achieving up to 269 MB/s write throughput and 680 K ops/s.

Optimizing BE memory parameters to reduce TCMalloc lock contention, e.g.:

```
tc_use_memory_min = 207374182400
tc_enable_aggressive_memory_decommit = false
tc_max_total_thread_cache_bytes = 20737418240
```

Adjusting bucket settings to avoid excessive Tablet counts, which previously caused severe read/write slowdown.
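
A rough way to reason about the tablet count: it is the product of partitions, buckets, and replicas, and common Doris guidance targets tablets on the order of a few GB each. The helper below is a back-of-the-envelope sketch; the 5 GB-per-tablet target is an assumption drawn from that general guidance, not a figure from the article:

```python
import math

def suggest_buckets(partition_size_gb, target_tablet_gb=5):
    """Suggest a bucket count so each tablet holds roughly target_tablet_gb."""
    return max(1, math.ceil(partition_size_gb / target_tablet_gb))

def total_tablets(partitions, buckets, replicas=3):
    """Tablet count is multiplicative across partitions, buckets, replicas."""
    return partitions * buckets * replicas

# A 50 GB daily partition lands near 10 buckets at ~5 GB per tablet;
# a year of daily partitions at 3 replicas is already ~11K tablets,
# which is why oversized bucket counts inflate tablets so quickly.
print(suggest_buckets(50))
print(total_tablets(365, 10, 3))
```
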

Deployment and Operation

During the launch phase the team verified FE/BE ports, network connectivity, and read/write functionality, enabled global_enable_profile, and defined BE resource groups for multi‑tenant isolation. Permissions were scoped per role using SELECT_PRIV, LOAD_PRIV, and CREATE_PRIV.
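
Role-scoped permissions in Doris use standard GRANT statements. A small helper that renders such a statement (the database and role names are hypothetical; only the privilege names come from the article):

```python
def grant_stmt(privs, db, role):
    """Render a Doris GRANT statement scoping privileges to a role."""
    return f"GRANT {', '.join(privs)} ON {db}.* TO ROLE '{role}'"

# An ETL role that may read and load, but not create tables:
stmt = grant_stmt(["SELECT_PRIV", "LOAD_PRIV"], "ods", "etl_writer")
print(stmt)
```
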

Optimization Insights

Key observations:

Higher bucket counts dramatically degrade write speed; reducing from 80 to 9‑10 buckets restored performance.

Concurrent Stream Load, Broker Load, and query workloads compete for resources; careful scheduling mitigates impact.

TCMalloc lock contention was a major bottleneck; tuning memory decommit settings alleviated it.

Summary Benefits

The production Doris cluster (3 FE + 9 BE) now stores 44.4 TB across three replicas, handling 60 % of legacy data and a portion of DW data. Real‑time ingestion exceeds a hundred million rows daily and supports tens of thousands of scheduled tasks, with queries up to five times faster and ingestion up to twice as fast as the previous architecture.

Future Plans

Future work includes integrating a data lake for unified structured and unstructured storage, and further focusing Doris on OLAP workloads while using it as a unified lake‑warehouse query engine.

For more details, refer to the official Apache Doris and SelectDB websites.

Tags: big data, batch processing, streaming, performance testing, data warehouse, database optimization, Apache Doris
Written by DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
