How Tencent Music Scaled Its Content Data Platform with Apache Doris: From ClickHouse to 4.0 Architecture
This article traces the evolution of Tencent Music's content data platform from version 1.0 to 4.0: the migration from ClickHouse to Apache Doris, the introduction of a semantic layer, and optimizations to data ingestion, query performance, and cost that markedly improved data timeliness, operational efficiency, and storage spend.
Business Background
Tencent Music Entertainment operates a massive content library covering recorded music, live performances, audio, and video. To support comprehensive analysis of songs, albums, artists, and other content objects, the Content Library Data Platform integrates data sources into a unified asset with metrics and tag systems, providing inventory, segmentation, metric analysis, and tag selection services for downstream applications.
Data Architecture Evolution
Architecture 1.0
The initial architecture consisted of three layers: a data warehouse layer (ODS‑DWD‑DWS) building wide tables, an acceleration layer using ClickHouse for analysis and Elasticsearch for search, and an application layer creating DataSets for logical views. Problems included delayed data updates, high storage costs, complex coupling between ClickHouse and Elasticsearch, and difficulty maintaining both engines.
Architecture 2.0
To address these issues, the analysis engine was switched from ClickHouse to Apache Doris. Doris offers simple deployment, MySQL compatibility, partial column updates, federated queries across Hive, Iceberg, Hudi, MySQL, and Elasticsearch, and multiple import methods (batch from HDFS/S3, MySQL binlog, Kafka, Flink). The migration solved data latency, reduced storage redundancy, and improved query throughput.
Architecture 3.0
A semantic layer was introduced to centralize metric and tag definitions, turning them into first‑class citizens. This layer decouples the application from physical storage, allowing analysts to define unified metrics and tags, and enabling business users to query without building DataSets.
Architecture 4.0
The semantic layer became the core of the architecture, handling SQL generation for the warehouse, configuring acceleration tasks, and directing application queries. This unified view standardizes metric and tag management, simplifies optimization, and resolves data‑management fragmentation.
Optimization Experience
Data Ingestion
Two approaches for building wide tables were evaluated: generating them in TDW (high storage cost, poor real‑time performance) versus using Flink to aggregate streams and write directly to Doris. The Flink approach reduced storage, improved latency, and allowed flexible ETL reuse.
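The article does not show the Flink job itself; as a rough in-memory illustration of the merge-then-write idea (stream names and columns are hypothetical), each source stream contributes its columns to one wide row per content ID:

```python
from collections import defaultdict

# Sketch only, not Flink code: merge partial records from several
# source streams into one wide row per content ID before the row
# is written to Doris.
wide_rows = defaultdict(dict)

def on_event(source_name: str, record: dict) -> None:
    key = record["song_id"]
    for col, val in record.items():
        if col != "song_id":
            # Prefix columns by source so streams cannot clobber each other.
            wide_rows[key][f"{source_name}_{col}"] = val

on_event("playback", {"song_id": 1, "plays": 10})
on_event("metadata", {"song_id": 1, "title": "Song A"})
print(wide_rows[1])
# {'playback_plays': 10, 'metadata_title': 'Song A'}
```

Because the streams are merged before landing in Doris, no intermediate wide table has to be materialized in TDW.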
Data Model Selection
Doris 1.1.3 supports three data models: Unique (primary-key updates), Aggregate (partial column updates via REPLACE_IF_NOT_NULL), and Duplicate (raw storage). The Aggregate model was chosen to meet the partial-update requirement.
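Doris applies this aggregation internally; a minimal sketch of the REPLACE_IF_NOT_NULL semantics (column names are illustrative) shows why it enables partial updates:

```python
def replace_if_not_null(existing: dict, incoming: dict) -> dict:
    """Mimic Doris' REPLACE_IF_NOT_NULL aggregation: an incoming NULL
    (None) leaves the stored value untouched; anything else replaces it."""
    merged = dict(existing)
    for col, val in incoming.items():
        if val is not None:
            merged[col] = val
    return merged

stored  = {"song_id": 1, "play_count": 100, "artist": "A"}
partial = {"song_id": 1, "play_count": 120, "artist": None}  # touch one column
print(replace_if_not_null(stored, partial))
# {'song_id': 1, 'play_count': 120, 'artist': 'A'}
```

A writer can therefore send only the columns it owns, setting the rest to NULL, and other columns keep their last value.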
Schema Management
Metadata for tags/metrics is stored in a MySQL table mapping business names to internal column IDs (e.g., `song_name` → `a4`). Queries rewrite column names to IDs, allowing name changes without altering Doris tables. Tag lifecycle is managed via metadata status, and pre-allocated ID columns enable rapid onboarding of new tags/metrics with minimal schema-change impact.
Write Optimizations
Flink pre‑aggregation reduces write pressure.
Adaptive batch sizes lower Flink resource usage.
Doris BE tuning (partition size, compaction parameters) improves write stability.
BE submission logic caches BE nodes and randomizes batch submissions for balanced load.
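Two of the points above — the cached, randomized BE selection and the adaptive batch size — can be sketched as follows (this is an illustration of the policy, not Doris client code; thresholds and node names are assumptions):

```python
import random

class BatchWriter:
    """Sketch of the write path: cache the BE node list, pick a node at
    random per batch, and grow or shrink the batch size with latency."""

    def __init__(self, be_nodes, batch_size=5000, min_size=1000, max_size=50000):
        self.be_nodes = list(be_nodes)  # cached once; refreshed out of band
        self.batch_size = batch_size
        self.min_size = min_size
        self.max_size = max_size

    def pick_node(self) -> str:
        # Random choice spreads batches evenly across BE nodes.
        return random.choice(self.be_nodes)

    def adjust(self, latency_s: float, target_s: float = 1.0) -> None:
        # Fast writes earn larger batches; slow writes back off.
        if latency_s < target_s:
            self.batch_size = min(self.max_size, int(self.batch_size * 1.5))
        else:
            self.batch_size = max(self.min_size, self.batch_size // 2)

w = BatchWriter(["be1:9060", "be2:9060", "be3:9060"])
w.adjust(0.4)            # fast write -> grow the batch
print(w.batch_size)      # 7500
w.adjust(2.5)            # slow write -> halve it
print(w.batch_size)      # 3750
```

Growing batches multiplicatively while halving on pressure keeps Flink resource usage low without starving throughput.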
Compaction parameters adjusted during BE tuning:

```
max_XXXX_compaction_thread
max_cumulative_compaction_num_singleton_deltas
```

Query Optimizations
Metrics are stored as partitioned tables in Doris, while tags reside in Elasticsearch. A "Doris on ES" solution combines Doris' distributed query planning with ES's full-text capabilities. Query sessions can enable `es_optimize` to push bucket joins, compress ES results with bitmaps, and limit data transfer, reducing a million-record tag-selection query from 60 seconds to 3.7 seconds.
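Doris performs the bitmap compression internally; as a toy model of why it shrinks the transfer and speeds up the join, matching row IDs from each side can be packed into bitmaps and intersected with a single AND (IDs below are made up):

```python
def to_bitmap(ids):
    """Pack non-negative integer row IDs into one Python int as a bitmap."""
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

def from_bitmap(bm):
    """Unpack a bitmap back into the sorted list of set row IDs."""
    out, i = [], 0
    while bm:
        if bm & 1:
            out.append(i)
        bm >>= 1
        i += 1
    return out

es_hits    = to_bitmap([1, 5, 9, 12])  # rows matching the tag filter in ES
doris_hits = to_bitmap([5, 9, 42])     # rows matching the metric filter in Doris
print(from_bitmap(es_hits & doris_hits))
# [5, 9]
```

Shipping a compact bitmap instead of full result rows is what keeps the cross-engine join cheap.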
Cost Optimizations
Hot data kept in Doris (TTL for one year); older data moved to TDW.
Partition‑level replica settings (3 replicas for recent data, decreasing for older data).
Cold data migrated from SSD to HDD.
Tag/metric lifecycle management reduces unnecessary storage.
Doris' simple FE/BE architecture, MySQL compatibility, and federated query support lower operational overhead.
These measures lowered storage costs by 42% and development/time costs by 40%.
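The tiered-replica idea above can be expressed as a simple policy function (the 90-day and one-year thresholds here are illustrative assumptions, not figures from the article):

```python
from datetime import date

def replica_count(partition_date: date, today: date) -> int:
    """Hypothetical tiering policy: full redundancy for recent partitions,
    fewer replicas as data ages, archival to TDW after the one-year TTL."""
    age_days = (today - partition_date).days
    if age_days <= 90:
        return 3   # hot: recent partitions keep 3 replicas on SSD
    if age_days <= 365:
        return 2   # warm: still in Doris, reduced redundancy
    return 0       # expired: moved out of Doris into TDW

today = date(2022, 12, 31)
print(replica_count(date(2022, 12, 1), today))  # 3
print(replica_count(date(2022, 3, 1), today))   # 2
```

In Doris itself this maps to per-partition replication settings rather than application code.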
Future Plans
Automatic hot/cold data identification using Doris for hot data and Iceberg for cold data.
Materialized views for frequently used tag/metric combinations.
Exploring Doris for data‑warehouse compute tasks to simplify logic and improve timeliness.
The authors thank the Apache Doris community and SelectDB for their support and contributions.