Upgrading Data Warehouse Dependency Model: From Project-Level to Task-Level and External Dependency Integration
This article explains how a data warehouse dependency model was transformed from coarse project-level dependencies to fine-grained task-level DAGs, introduces virtual tasks for external dependencies, describes offset handling, and outlines the technical implementation and future automation plans for large‑scale scheduling systems.
Data warehouse construction relies on data models, where analysts use tables representing different dimensions, creating upstream‑downstream dependencies equivalent to task dependencies. The scheduling system triggers downstream tasks only after upstream tasks complete at the configured time.
The core service of the scheduling system is data development, organized as projects containing multiple jobs that form a DAG of table computations. Currently, dependencies are modeled at the project level, which does not support cross‑project task dependencies, reducing accuracy and timeliness of data production.
Both batch (offline) and real‑time (stream) tasks produce data, and external non‑platform tasks also need to be integrated to avoid idle runs. Data has time attributes (hourly, daily, weekly, monthly), and dependencies may involve offsets such as T‑1, requiring flexible description of upstream data usage.
Terminology

| Term | English | Description |
|------|---------|-------------|
| Task (任务) | job | The smallest unit of development and execution, containing executable code, parameters, lineage, and scheduling dependencies. |
| Dependency (依赖) | DAG | Data dependencies between tasks forming a DAG that represents production relationships. |
| Self-dependency (自依赖) | -- | Data of a later period depends on that of the previous period. |
| Project (项目) | project | A canvas managing a group of tasks. |
| Instance (实例) | instance | An execution record generated each time a task runs. |
| Business time (业务时间) | bizDate | Instance attribute indicating the partition time of the output table. |
| Trigger time (触发时间) | -- | Instance attribute marking when the task is triggered. |
| T-1 | -- | One-period offset between trigger time and business time. |
| Offset (偏移) | offset | Describes the deviation between business time and trigger time. |
Dependency Model Upgrade
1. From Project Dependency to Task Dependency
To transition, root and end nodes are introduced: the root node marks the start of a project, and all user‑created tasks depend on it; the end node marks the project's termination, and it depends on all tasks in the project. This converts project‑level dependencies into equivalent task‑level dependencies, enabling a risk‑free migration and shortening end‑to‑end production chains by up to four hours.
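The conversion above can be sketched as a small graph transformation. This is an illustrative sketch, not the platform's actual API; the task and node names (`A_end`, `B_root`) are assumptions:

```python
# Sketch: expand one project-level dependency ("project B depends on
# project A") into equivalent task-level edges via root/end nodes.

def expand_project_dependency(upstream_tasks, downstream_tasks):
    """Return task-level edges equivalent to 'project B depends on project A'.

    upstream_tasks:   tasks belonging to project A
    downstream_tasks: tasks belonging to project B
    """
    edges = []
    # Every task in A feeds A's end node, which marks A's termination.
    for t in upstream_tasks:
        edges.append((t, "A_end"))
    # B's root node waits for A's end node.
    edges.append(("A_end", "B_root"))
    # Every user-created task in B depends on B's root node.
    for t in downstream_tasks:
        edges.append(("B_root", t))
    return edges
```

Because the root/end nodes reproduce exactly the semantics of the old project-level edge, existing projects can be migrated without behavioral change.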
2. Bridging External Dependencies
Goal: provide a universal solution for all systems without reducing developer efficiency.
Solution 1: Table partition readiness – downstream tasks trigger when upstream partitions are ready. Advantages: low implementation cost. Disadvantages: requires all business partitions to be ready, causing over‑broad dependencies; relies on HDFS success files, leading to performance bottlenecks and limited applicability.
Solution 2: Virtual tasks – a virtual task represents a specific data slice (e.g., hourly table at 3 am). Downstream tasks depend on the virtual task, which in turn depends on the actual upstream task. Advantages: fine‑grained, accurate dependencies; unified task‑level model; no external performance bottlenecks. Disadvantage: additional maintenance overhead.
We chose virtual tasks and integrated them with the DQC quality service, ensuring upstream data quality is automatically validated.
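A minimal sketch of the virtual-task pattern follows. The class and method names are hypothetical, and the DQC check is reduced to a boolean flag; the real integration is richer:

```python
# Sketch: a virtual task pins one data slice of a real upstream task
# (e.g. "the 3 am partition of an hourly table") and releases its
# downstream tasks only when that slice is produced and passes DQC.

from dataclasses import dataclass, field

@dataclass
class VirtualTask:
    name: str
    upstream_task: str            # the real producing task
    data_slice: str               # e.g. "hour=03"
    downstream: list = field(default_factory=list)

    def on_upstream_done(self, finished_slice, dqc_passed):
        # Fire only for the slice this virtual task represents,
        # and only if the data-quality check succeeded.
        if finished_slice == self.data_slice and dqc_passed:
            return list(self.downstream)   # tasks now eligible to run
        return []
```

The key property is that downstream tasks never read the external system directly; they depend on the virtual task, keeping everything inside the unified task-level model.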
3. Dependency Offsets
Business time offset describes the difference between the data’s business date and the scheduling date (e.g., T‑1 for yesterday’s data). Dependency offset specifies which business times a downstream task should depend on, expressed as collections or ranges.
| Dependency scenario (business/data period) | Offset type | Downstream offset configuration |
|--------------------------------------------|-------------|---------------------------------|
| Daily downstream depends on daily upstream (same day) | Set | 0 |
| Daily downstream depends on daily upstream (previous day) | Set | -1 |
| Daily downstream depends on hourly upstream (all hours) | Range | 0–23 |
| Weekly downstream depends on the same week's daily upstream | Set or Range | 0,1,2,3,4,5,6 or 0–6 |
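Resolving an offset configuration into concrete upstream business times can be sketched as follows. This is a simplified illustration at daily granularity only (the real system also handles hourly, weekly, and monthly periods), and the function name is an assumption:

```python
# Sketch: expand a dependency-offset configuration (Set or Range)
# into the concrete upstream business dates a downstream instance
# must wait for. Daily granularity only.

from datetime import date, timedelta

def resolve_dependency_bizdates(biz_date, offset_type, offsets):
    if offset_type == "set":
        days = offsets                  # explicit offsets, e.g. [0] or [-1]
    elif offset_type == "range":
        lo, hi = offsets                # inclusive bounds, e.g. (0, 6)
        days = range(lo, hi + 1)
    else:
        raise ValueError(f"unknown offset type: {offset_type}")
    return [biz_date + timedelta(days=d) for d in days]
```

For example, a downstream task with bizDate 2022-01-02 and a Set offset of -1 resolves to the upstream instance with bizDate 2022-01-01.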
Technical Implementation
Bilibili runs about 80 k tasks, 150 k daily instances, and 120 k dependency edges. Key goals: low latency (seconds) and high throughput under peak load.
1. Abstract Dependency Model
Supports project, job, and table data dependencies, both scheduled and manual backfill. Introduces a DependencySubject object identified by a DependencySubjectId composed of conditions (e.g., taskId=1234&&bizDate=20220101). This flexible model can represent single dates, date ranges, or table partitions.
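A minimal sketch of how such a condition-based identifier could be encoded and decoded, following the `taskId=1234&&bizDate=20220101` example from the text (the helper names are assumptions, not the platform's real API):

```python
# Sketch: encode/decode a DependencySubjectId as '&&'-joined
# key=value conditions. Keys are sorted so equal condition sets
# always produce the same identifier.

def build_subject_id(**conditions):
    return "&&".join(f"{k}={v}" for k, v in sorted(conditions.items()))

def parse_subject_id(subject_id):
    return dict(pair.split("=", 1) for pair in subject_id.split("&&"))
```

Because the identifier is just a bag of conditions, the same scheme can describe a single bizDate, a date range, or a table partition without changing the dependency model.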
2. Asynchronous Dependency Callback
The core component DependencyCenter calculates dependencies, performs inspections, and handles callbacks. When a task triggers, the dependency submission component queries DependencyCenter; if upstream tasks are complete, the downstream proceeds, otherwise it waits. DependencyCenter listens for upstream completions and triggers callbacks via various detectors (inspection, message subscription, API calls).
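The wait/callback flow can be sketched as a simple in-memory registry. This is a hypothetical single-process simplification (the real DependencyCenter is a distributed service with multiple detector types); all names are illustrative:

```python
# Sketch: a downstream instance registers the upstream subjects it
# needs; when a detector reports an upstream completion, any waiter
# whose pending set becomes empty is released (the "callback").

from collections import defaultdict

class DependencyCenter:
    def __init__(self):
        self.waiters = {}              # instance -> pending subject ids
        self.index = defaultdict(set)  # subject id -> waiting instances
        self.ready = []                # instances released to run

    def submit(self, instance, subjects):
        pending = set(subjects)
        if not pending:                # nothing to wait for
            self.ready.append(instance)
            return
        self.waiters[instance] = pending
        for s in pending:
            self.index[s].add(instance)

    def on_upstream_complete(self, subject):
        # Invoked by detectors (inspection, message subscription, API).
        for instance in self.index.pop(subject, set()):
            pending = self.waiters[instance]
            pending.discard(subject)
            if not pending:
                del self.waiters[instance]
                self.ready.append(instance)   # callback: trigger downstream
```

The asynchronous shape is what keeps latency low: downstream instances never poll; they are woken exactly once, when their last pending upstream completes.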
3. Performance
The dependency module achieves 100% accuracy, with a 90th‑percentile latency of 1.6 seconds and a maximum of 6.3 seconds, and has operated stably for three years.
Future Plans
1. Dependency Automation
When data lineage is accurate, automatic dependency generation can replace most manual task links, with manual task dependencies serving only exceptional cases.
2. Enriching Dependency Rules
Currently only strong dependencies with offsets are supported. Future work includes weak dependencies (e.g., proceed after a timeout or when any instance in a range succeeds) to handle more scenarios.
3. Supporting Dependency Ecosystem
Richer rules increase complexity for operation tools and baseline systems, requiring efficient algorithms to compute repair times across multi‑level dependency chains.
That concludes today's sharing.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.