Upgrading Data Warehouse Dependency Model: From Project-Level to Task-Level and External Dependency Integration
This article explains how a data warehouse dependency model was transformed from coarse project-level dependencies to fine-grained task-level DAGs, introduces virtual tasks for external dependencies, describes offset handling, and outlines the technical implementation and future automation plans for large‑scale scheduling systems.
Data warehouse construction relies on data models, where analysts use tables representing different dimensions, creating upstream‑downstream dependencies equivalent to task dependencies. The scheduling system triggers downstream tasks only after upstream tasks complete at the configured time.
The core service of the scheduling system is data development, organized as projects containing multiple jobs that form a DAG of table computations. Currently, dependencies are modeled at the project level, which does not support cross‑project task dependencies, reducing accuracy and timeliness of data production.
Both batch (offline) and real‑time (stream) tasks produce data, and external non‑platform tasks also need to be integrated to avoid idle runs. Data has time attributes (hourly, daily, weekly, monthly), and dependencies may involve offsets such as T‑1, requiring flexible description of upstream data usage.
Terminology

| Term | English | Description |
|------|---------|-------------|
| Task (任务) | job | The smallest unit of development and execution, containing executable code, parameters, lineage, and scheduling dependencies. |
| Dependency (依赖) | DAG | Data dependencies between tasks forming a DAG that represents production relationships. |
| Self-dependency (自依赖) | -- | Data of a later period depends on that of the previous period. |
| Project (项目) | project | A canvas managing a group of tasks. |
| Instance (实例) | instance | An execution record generated each time a task runs. |
| Business time (业务时间) | bizDate | Instance attribute indicating the partition time of the output table. |
| Trigger time (触发时间) | -- | Instance attribute marking when the task is triggered. |
| T-1 | -- | One-period offset between trigger time and business time. |
| Offset (偏移) | offset | Describes the deviation between business time and trigger time. |
Dependency Model Upgrade
1. From Project Dependency to Task Dependency
To transition, root and end nodes are introduced: the root node marks the start of a project, and all user‑created tasks depend on it; the end node marks the project's termination, and it depends on all tasks in the project. This converts project‑level dependencies into equivalent task‑level dependencies, enabling a risk‑free migration and shortening end‑to‑end production chains by up to four hours.
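The conversion above can be sketched as a small graph transformation. This is an illustrative sketch, not the platform's actual API; the task and node names (`A_end`, `B_root`) are assumptions:

```python
# Sketch: expand one project-level dependency ("project B depends on
# project A") into equivalent task-level edges via root/end nodes.

def expand_project_dependency(upstream_tasks, downstream_tasks):
    """Return task-level edges equivalent to 'project B depends on project A'.

    upstream_tasks:   tasks belonging to project A
    downstream_tasks: tasks belonging to project B
    """
    edges = []
    # Every task in A feeds A's end node, which marks A's termination.
    for t in upstream_tasks:
        edges.append((t, "A_end"))
    # B's root node waits for A's end node.
    edges.append(("A_end", "B_root"))
    # Every user-created task in B depends on B's root node.
    for t in downstream_tasks:
        edges.append(("B_root", t))
    return edges
```

Because the root/end nodes reproduce exactly the semantics of the old project-level edge, existing projects can be migrated without behavioral change.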
2. Bridging External Dependencies
Goal: provide a universal solution for all systems without reducing developer efficiency.
Solution 1: Table partition readiness – downstream tasks trigger when upstream partitions are ready. Advantages: low implementation cost. Disadvantages: requires all business partitions to be ready, causing over‑broad dependencies; relies on HDFS success files, leading to performance bottlenecks and limited applicability.
Solution 2: Virtual tasks – a virtual task represents a specific data slice (e.g., hourly table at 3 am). Downstream tasks depend on the virtual task, which in turn depends on the actual upstream task. Advantages: fine‑grained, accurate dependencies; unified task‑level model; no external performance bottlenecks. Disadvantage: additional maintenance overhead.
We chose virtual tasks and integrated them with the DQC quality service, ensuring upstream data quality is automatically validated.
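A minimal sketch of the virtual-task pattern follows. The class and method names are hypothetical, and the DQC check is reduced to a boolean flag; the real integration is richer:

```python
# Sketch: a virtual task pins one data slice of a real upstream task
# (e.g. "the 3 am partition of an hourly table") and releases its
# downstream tasks only when that slice is produced and passes DQC.

from dataclasses import dataclass, field

@dataclass
class VirtualTask:
    name: str
    upstream_task: str            # the real producing task
    data_slice: str               # e.g. "hour=03"
    downstream: list = field(default_factory=list)

    def on_upstream_done(self, finished_slice, dqc_passed):
        # Fire only for the slice this virtual task represents,
        # and only if the data-quality check succeeded.
        if finished_slice == self.data_slice and dqc_passed:
            return list(self.downstream)   # tasks now eligible to run
        return []
```

The key property is that downstream tasks never read the external system directly; they depend on the virtual task, keeping everything inside the unified task-level model.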
3. Dependency Offsets
Business time offset describes the difference between the data’s business date and the scheduling date (e.g., T‑1 for yesterday’s data). Dependency offset specifies which business times a downstream task should depend on, expressed as collections or ranges.
| Dependency scenario (business/data period) | Offset type | Downstream offset configuration |
|--------------------------------------------|-------------|---------------------------------|
| Daily downstream depends on daily upstream (same day) | Set | 0 |
| Daily downstream depends on daily upstream (previous day) | Set | -1 |
| Daily downstream depends on hourly upstream (all hours) | Range | 0–23 |
| Weekly downstream depends on the same week's daily upstream | Set or Range | 0,1,2,3,4,5,6 or 0–6 |
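Resolving an offset configuration into concrete upstream business times can be sketched as follows. This is a simplified illustration at daily granularity only (the real system also handles hourly, weekly, and monthly periods), and the function name is an assumption:

```python
# Sketch: expand a dependency-offset configuration (Set or Range)
# into the concrete upstream business dates a downstream instance
# must wait for. Daily granularity only.

from datetime import date, timedelta

def resolve_dependency_bizdates(biz_date, offset_type, offsets):
    if offset_type == "set":
        days = offsets                  # explicit offsets, e.g. [0] or [-1]
    elif offset_type == "range":
        lo, hi = offsets                # inclusive bounds, e.g. (0, 6)
        days = range(lo, hi + 1)
    else:
        raise ValueError(f"unknown offset type: {offset_type}")
    return [biz_date + timedelta(days=d) for d in days]
```

For example, a downstream task with bizDate 2022-01-02 and a Set offset of -1 resolves to the upstream instance with bizDate 2022-01-01.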
Technical Implementation
Bilibili runs about 80 k tasks, 150 k daily instances, and 120 k dependency edges. Key goals: low latency (seconds) and high throughput under peak load.
1. Abstract Dependency Model
Supports project, job, and table data dependencies, both scheduled and manual backfill. Introduces a DependencySubject object identified by a DependencySubjectId composed of conditions (e.g., taskId=1234&&bizDate=20220101). This flexible model can represent single dates, date ranges, or table partitions.
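A minimal sketch of how such a condition-based identifier could be encoded and decoded, following the `taskId=1234&&bizDate=20220101` example from the text (the helper names are assumptions, not the platform's real API):

```python
# Sketch: encode/decode a DependencySubjectId as '&&'-joined
# key=value conditions. Keys are sorted so equal condition sets
# always produce the same identifier.

def build_subject_id(**conditions):
    return "&&".join(f"{k}={v}" for k, v in sorted(conditions.items()))

def parse_subject_id(subject_id):
    return dict(pair.split("=", 1) for pair in subject_id.split("&&"))
```

Because the identifier is just a bag of conditions, the same scheme can describe a single bizDate, a date range, or a table partition without changing the dependency model.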
2. Asynchronous Dependency Callback
The core component DependencyCenter calculates dependencies, performs inspections, and handles callbacks. When a task triggers, the dependency submission component queries DependencyCenter; if upstream tasks are complete, the downstream proceeds, otherwise it waits. DependencyCenter listens for upstream completions and triggers callbacks via various detectors (inspection, message subscription, API calls).
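The wait/callback flow can be sketched as a simple in-memory registry. This is a hypothetical single-process simplification (the real DependencyCenter is a distributed service with multiple detector types); all names are illustrative:

```python
# Sketch: a downstream instance registers the upstream subjects it
# needs; when a detector reports an upstream completion, any waiter
# whose pending set becomes empty is released (the "callback").

from collections import defaultdict

class DependencyCenter:
    def __init__(self):
        self.waiters = {}              # instance -> pending subject ids
        self.index = defaultdict(set)  # subject id -> waiting instances
        self.ready = []                # instances released to run

    def submit(self, instance, subjects):
        pending = set(subjects)
        if not pending:                # nothing to wait for
            self.ready.append(instance)
            return
        self.waiters[instance] = pending
        for s in pending:
            self.index[s].add(instance)

    def on_upstream_complete(self, subject):
        # Invoked by detectors (inspection, message subscription, API).
        for instance in self.index.pop(subject, set()):
            pending = self.waiters[instance]
            pending.discard(subject)
            if not pending:
                del self.waiters[instance]
                self.ready.append(instance)   # callback: trigger downstream
```

The asynchronous shape is what keeps latency low: downstream instances never poll; they are woken exactly once, when their last pending upstream completes.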
3. Performance
The dependency module achieves 100% accuracy, with a 90th‑percentile latency of 1.6 seconds and a maximum of 6.3 seconds, and has operated stably for three years.
Future Plans
1. Dependency Automation
When data lineage is accurate, automatic dependency generation can replace most manual task links, with manual task dependencies serving only exceptional cases.
2. Enriching Dependency Rules
Currently only strong dependencies with offsets are supported. Future work includes weak dependencies (e.g., proceed after a timeout or when any instance in a range succeeds) to handle more scenarios.
3. Supporting Dependency Ecosystem
Richer rules increase complexity for operation tools and baseline systems, requiring efficient algorithms to compute repair times across multi‑level dependency chains.
That concludes today's sharing.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.