
Data Preparation Practices at Douyin Group for Diverse Application Scenarios

This article explains Douyin Group's large‑scale data applications, introduces the concept and architecture of data preparation, details its four subsystems and modular capabilities, and showcases how these are applied in BI, CDP, and custom scenarios within the Volcano Engine ecosystem.


Introduction – The session focuses on Douyin Group's data preparation practices for various application scenarios, covering data scale, architecture, and the role of data preparation as the foundation for downstream analytics.

Douyin Group's Data Scale and Architecture – Douyin processes petabyte‑level data with peak traffic of 100 million TPS and daily jobs in the millions. Its architecture consists of three layers: the platform layer (data warehouse and compute engines), the entrance layer (access control), and the middle layer (data applications and middle‑platform development).

Data Preparation Overview – Data preparation transforms raw sources into usable data sets through two core modules: data ingestion and data modeling. It is divided into four subsystems: Modeling, Execution, Enhanced Preparation, and System Management.

Modeling Subsystem – Provides low‑code, visual data modeling that abstracts data sources and processing logic into logical models, which become inputs for the execution subsystem.
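As a rough illustration, such a logical model can be thought of as a declarative object that names its data sources and processing steps without fixing how or where they run. The sketch below is hypothetical (the class names, fields, and step vocabulary are assumptions, not the talk's actual API):

```python
from dataclasses import dataclass, field

@dataclass
class SourceRef:
    """Reference to a registered data source (table, file, or stream)."""
    name: str
    source_type: str  # e.g. "hive", "mysql", "kafka"

@dataclass
class LogicalModel:
    """Declarative description of a dataset: sources plus processing steps.

    The model carries no execution details; the execution subsystem
    decides how each step actually runs.
    """
    name: str
    sources: list = field(default_factory=list)
    steps: list = field(default_factory=list)  # e.g. join, filter, aggregate

    def add_step(self, op, **params):
        """Append a logical step and return self for fluent chaining."""
        self.steps.append({"op": op, **params})
        return self

# Example: join orders to users, then aggregate daily GMV
model = (
    LogicalModel(
        name="daily_gmv",
        sources=[SourceRef("orders", "hive"), SourceRef("users", "mysql")],
    )
    .add_step("join", left="orders", right="users", on="user_id")
    .add_step("aggregate", group_by=["date"], metrics={"gmv": "sum(amount)"})
)
```

The point of the abstraction is that the same model can later be compiled to different engines; the visual modeler only needs to emit this structure.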

Execution Subsystem – Converts logical models into executable tasks, handling task generation, execution, monitoring, and resource allocation.
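A minimal sketch of that model-to-task translation might look like the following (task naming, state machine, and sequential scheduling are illustrative assumptions; the real subsystem also handles monitoring and resource allocation):

```python
from enum import Enum

class TaskState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    SUCCEEDED = "succeeded"
    FAILED = "failed"

class Task:
    """One executable unit compiled from a logical step."""
    def __init__(self, task_id, op, params):
        self.task_id = task_id
        self.op = op
        self.params = params
        self.state = TaskState.PENDING

def compile_model(steps):
    """Turn each logical step into a task, preserving step order."""
    return [
        Task(f"task_{i}", s["op"], {k: v for k, v in s.items() if k != "op"})
        for i, s in enumerate(steps)
    ]

def run(tasks, executor):
    """Execute tasks in order; stop at the first failure and report states."""
    for t in tasks:
        t.state = TaskState.RUNNING
        try:
            executor(t)
            t.state = TaskState.SUCCEEDED
        except Exception:
            t.state = TaskState.FAILED
            break
    return [(t.task_id, t.state.value) for t in tasks]
```

Separating `compile_model` from `run` mirrors the subsystem split described in the talk: the same compiled plan can be re-run, monitored, or dispatched to different executors.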

Enhanced Preparation – Offers intelligent features such as type inference, relationship inference, cleaning suggestions, and performance tuning.
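To make the type-inference idea concrete, here is a simple voting-based sketch: sample values are matched against patterns, and a type is assigned only if enough samples agree. The patterns and threshold are assumptions for illustration, not the production inference rules:

```python
import re
from collections import Counter

# Ordered from most to least specific; first match wins per value.
_PATTERNS = [
    ("integer", re.compile(r"^-?\d+$")),
    ("float",   re.compile(r"^-?\d+\.\d+$")),
    ("date",    re.compile(r"^\d{4}-\d{2}-\d{2}$")),
    ("boolean", re.compile(r"^(true|false)$", re.IGNORECASE)),
]

def infer_type(values, threshold=0.9):
    """Infer a column type from sample values.

    Returns the type matched by at least `threshold` of the non-empty
    samples; falls back to "string" when no type is dominant.
    """
    samples = [v.strip() for v in values if v and v.strip()]
    if not samples:
        return "string"
    votes = Counter()
    for v in samples:
        for name, pattern in _PATTERNS:
            if pattern.match(v):
                votes[name] += 1
                break
    if votes:
        best, count = votes.most_common(1)[0]
        if count / len(samples) >= threshold:
            return best
    return "string"
```

The same sampling-and-voting pattern extends naturally to the other features mentioned, such as inferring join keys from overlapping value distributions.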

System Management – Handles permissions, resource governance, and overall system health.

Practical Scenarios

BI Scenario – Combines ingestion, modeling, and dataset modules to build data marts for organization‑wide reporting, with measures like task isolation, dynamic parameter tuning, multi‑path dispatch, rule‑based diagnostics, and real‑time monitoring to ensure high throughput and stability.
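The rule-based diagnostics mentioned above can be sketched as a table of condition/advice pairs evaluated against task metrics. The specific thresholds and metric names here are hypothetical:

```python
# Each rule: (predicate over a metrics dict, human-readable suggestion).
RULES = [
    (lambda m: m["runtime_s"] > 2 * m["avg_runtime_s"],
     "Task ran more than 2x slower than its historical average; "
     "consider tuning parallelism or checking data skew."),
    (lambda m: m["shuffle_bytes"] > 10 * 2**30,
     "Heavy shuffle (>10 GiB); consider pre-aggregating or repartitioning."),
    (lambda m: m["failed_attempts"] > 0,
     "Task needed retries; check upstream data availability."),
]

def diagnose(metrics):
    """Return every suggestion whose rule fires on the given metrics."""
    return [advice for predicate, advice in RULES if predicate(metrics)]
```

Keeping rules as data rather than code branches makes it cheap to grow the rule set as new failure modes are observed in production.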

CDP Scenario – Focuses on data integration to break data silos, supporting both private (database plugins) and public (cloud‑based) data sources, enabling multi‑enterprise deployment with low coupling.

Custom Scenario – Provides ingestion, dataset, and output capabilities, allowing users to build bespoke data flows via open APIs and visual pipelines.

Volcano Engine Data Preparation – Serves as the underlying data foundation for five SaaS products (DataTester, GMP, DataFinder, VeCDP, DataWind) within the Volcano Engine marketing suite, delivering modular data capabilities without being exposed as a standalone product.

Conclusion – Data preparation unifies multi‑source data, offers low‑code modeling, supplies rich data marts, and enables end‑to‑end data pipelines that power diverse analytics use cases across Douyin and Volcano Engine.

Q&A – Addresses task diagnostics, type inference in enhanced preparation, and the adoption of visual modeling within Douyin's internal workflows.

Tags: Big Data, cloud computing, data pipeline, BI, CDP, data preparation
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
