
How Baseline Monitoring Transforms Data Pipeline Reliability at ByteDance

This article explains ByteDance's baseline monitoring system for data pipelines, detailing its motivation, core concepts, architecture, instance generation, alert types, and handling of complex task dependencies to reduce operational costs and improve SLA compliance across hundreds of projects.

ByteDance's Data Platform Development Suite team built a dependency‑based full‑link intelligent monitoring and alerting system called Baseline Monitoring, now widely used internally across more than 100 projects such as Douyin, e‑commerce, and advertising, covering over 80% of SLA tasks.

Author: Zhensheng – Data Platform Development Suite Team

As ByteDance’s business grows rapidly, the number of operational tasks in big‑data development keeps climbing, and traditional monitoring systems that rely on manually configured rules can no longer keep up. Developers face three main problems:

Task volume and complex dependencies: It is hard to find every upstream task of an important job, and monitoring every task generates a flood of low‑value alerts, so the important ones get ignored.

High operational cost of configuration: Each task has different runtime characteristics and promised completion times, so configuring individual alerts for every task is labor‑intensive.

Diverse alert timing requirements: Hour‑level tasks need different alert timeliness in different periods, which ordinary monitoring cannot satisfy.

To manage daily tasks effectively and ensure data quality, the team developed Baseline Monitoring, which intelligently decides whether, when, how, and to whom to alert based on task execution status, protecting the entire production chain.

Baseline Monitoring is now widely adopted within ByteDance, covering over 100 projects and achieving more than 80% coverage of SLA task baselines.

Real‑world Example

User Xiao Ming has an SLA task that must finish before 10:00. The upstream dependency graph includes tasks from Project A and Project B. Xiao Ming only has operational permission for Project B.

Before baseline monitoring, Xiao Ming would configure multiple basic alerts on the SLA task and its upstream tasks (e.g., three alerts on each), leading to at least nine rules, high manual effort, and missed upstream tasks outside his permission.

With baseline monitoring, Xiao Ming only needs to add the SLA task as a "guarantee task". All upstream nodes are automatically covered, eliminating the need for numerous individual alerts and allowing instant detection of any upstream delay.

Concept Overview

Baseline Monitoring decides alerting logic (whether, when, how, and to whom) based on monitoring rules and task runtime, protecting the whole output chain. Its core goals are:

Cover all tasks in the chain.

Reduce monitoring configuration cost.

Avoid ineffective alerts.

Guarantee Task: Typically a task with SLA requirements. The system automatically monitors all of its upstream tasks.

Time Definitions:

Commit Time: The latest completion time (the SLA).

Warning Buffer: Baseline SLA buffer; consuming the buffer triggers a warning.

Warning Time: Commit time minus warning buffer.

Predicted Runtime: Estimated runtime based on historical executions.

Latest Start Time (Commit): Commit time minus predicted runtime.

Latest Start Time (Warning): Warning time minus predicted runtime.

These times are illustrated in the following diagram:
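The relationships among these times can also be sketched in code. This is a minimal illustration of the arithmetic defined above; the class and field names are mine, not the system's:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class BaselineTimes:
    """Derived monitoring times for one task (illustrative names)."""
    commit_time: datetime        # latest completion time (the SLA)
    warning_buffer: timedelta    # buffer consumed before a warning fires
    predicted_runtime: timedelta # estimated from historical executions

    @property
    def warning_time(self) -> datetime:
        # commit time minus warning buffer
        return self.commit_time - self.warning_buffer

    @property
    def latest_start_commit(self) -> datetime:
        # commit time minus predicted runtime
        return self.commit_time - self.predicted_runtime

    @property
    def latest_start_warning(self) -> datetime:
        # warning time minus predicted runtime
        return self.warning_time - self.predicted_runtime

# A task promising completion by 09:00, with a 0.5 h buffer and 1.5 h runtime:
t = BaselineTimes(datetime(2024, 1, 1, 9, 0),
                  timedelta(minutes=30),
                  timedelta(hours=1, minutes=30))
print(t.warning_time.time(), t.latest_start_warning.time(),
      t.latest_start_commit.time())
# → 08:30:00 07:00:00 07:30:00
```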

Monitoring Scope

Baseline monitoring by default covers the guarantee task and all its upstream tasks. Users can restrict coverage to specific projects if needed.

Baseline Instance

Similar to tasks, a baseline has a business time that generates a baseline instance, which monitors the guarantee task instance and all upstream task instances for that time. Daily ("day baseline") and hourly ("hour baseline") instances are generated as follows:

Day Baseline: One instance per day, aligned with the guarantee task’s business time.

Hour Baseline: Either a unified commitment (24 instances) or a time‑slot commitment (N instances, where N∈[1,24]), based on the configured monitoring window.
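How the two hour‑baseline commitment styles differ in instance count can be sketched as follows (the function name and the window representation are my assumptions):

```python
from datetime import datetime

def hour_baseline_instances(day: datetime, window_hours: range) -> list:
    """One baseline instance per hour in the configured monitoring window."""
    return [day.replace(hour=h) for h in window_hours]

day = datetime(2024, 1, 1)
unified = hour_baseline_instances(day, range(0, 24))  # unified commitment
slotted = hour_baseline_instances(day, range(8, 12))  # time-slot commitment
print(len(unified), len(slotted))  # → 24 4
```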

Baseline Instance Status

Safe: The task finishes before the warning time.

Warning: A task in the chain has not started by its warning start time, but the commit time has not yet passed.

Broken: The guarantee task has not finished by the commit time.

Other: The instance is closed or has no associated tasks.
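One plausible reading of these states, simplified to judge an instance only by the guarantee task's finish time (the function and its signature are hypothetical, not the system's logic):

```python
from datetime import datetime
from typing import Optional

def instance_status(finish_time: Optional[datetime], now: datetime,
                    warning_time: datetime, commit_time: datetime) -> str:
    """Classify a baseline instance, reduced to a single task's timeline."""
    if finish_time is not None:
        # finished: within the SLA is Safe, past the commit time is Broken
        return "Safe" if finish_time <= commit_time else "Broken"
    if now >= commit_time:
        return "Broken"   # still unfinished past the commit time
    if now >= warning_time:
        return "Warning"  # unfinished, warning buffer being consumed
    return "Safe"

wt, ct = datetime(2024, 1, 1, 8, 30), datetime(2024, 1, 1, 9, 0)
print(instance_status(None, datetime(2024, 1, 1, 8, 45), wt, ct))  # → Warning
```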

Alert Types

Baseline Warning: The first node in the chain that has not started by its latest start time for warning.

Baseline Broken: A node that has not started by its latest start time for commit and has no broken upstream nodes.

Broken Intensify: An execution slowdown that pushes the task beyond its predicted runtime.

Guarantee Task Warning Time Not Completed: The guarantee task has not finished by the warning time and no prior alerts exist.

Guarantee Task Commit Time Not Completed: The guarantee task has not finished by the commit time.

Task Failure Event: Any task in the chain that fails after exhausting retries.

Baseline Event Types

Slowdown Event: Runtime exceeds the predicted runtime by a configured percentage.

Failure Event: The task encounters a failure during execution.

Event states are "New" and "Recovered"; an event transitions to Recovered when the task eventually succeeds.
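The slowdown check reduces to a simple comparison. A minimal sketch, where the 20% threshold is an assumed example rather than a documented default:

```python
def is_slowdown(elapsed_minutes: float, predicted_minutes: float,
                threshold_pct: float = 20.0) -> bool:
    """Slowdown event: runtime exceeds predicted runtime by threshold_pct%."""
    return elapsed_minutes > predicted_minutes * (1 + threshold_pct / 100)

print(is_slowdown(130, 100))  # → True  (30% over the prediction)
print(is_slowdown(115, 100))  # → False (within the 20% threshold)
```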

System Implementation

Overall Architecture

Baseline Management Module: Handles creation, update, deletion, and metadata of baselines.

Baseline Instance Generation: A daily scheduled job creates baseline instances and traverses upstream tasks (BFS) to generate monitoring points, calculating predicted runtime, warning time, commit time, and start‑time windows.

Monitoring Point Validation: Maintains a delay queue; at each validation point (warning start, commit start, etc.) the system checks task status and triggers the appropriate alert.

Baseline Instance Generation

At a fixed time (e.g., 22:00), the system creates baseline instances for the day or hour. For each instance, it walks the task DAG from the guarantee task upward (BFS) and computes the following for every task:

predicted runtime, warning time, commit time, latest start time for warning, latest start time for commit

These timestamps are stored as monitoring points.

Task node numbers indicate predicted runtime (e.g., A(1.5h) means 1.5 hours).

Thus, for task A with a commit time of 9:00 and a warning buffer of 0.5 h, the warning time is 8:30, the latest start time for warning is 7:00, and the latest start time for commit is 7:30.
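The upward BFS can be sketched as below. This is an illustration, not the production code: the propagation rule (an upstream's commit deadline equals its downstream's latest start time) and all names are my assumptions, and the sketch assumes an acyclic chain:

```python
from collections import deque
from datetime import datetime, timedelta

def generate_monitoring_points(guarantee_task, upstreams, predicted,
                               commit_time, warning_buffer):
    """BFS from the guarantee task upward, computing monitoring points.

    upstreams: task -> list of direct upstream tasks (lineage metadata)
    predicted: task -> predicted runtime as a timedelta
    """
    points = {}
    queue = deque([(guarantee_task, commit_time, commit_time - warning_buffer)])
    while queue:
        task, commit, warning = queue.popleft()
        runtime = predicted[task]
        pts = {"commit_time": commit,
               "warning_time": warning,
               "latest_start_commit": commit - runtime,
               "latest_start_warning": warning - runtime}
        # Cross-layer rule: if a task is reachable through several downstream
        # paths, keep the earliest timestamps.
        if task not in points or commit < points[task]["commit_time"]:
            points[task] = pts
            for up in upstreams.get(task, []):
                # an upstream must finish before this task can start
                queue.append((up, pts["latest_start_commit"],
                              pts["latest_start_warning"]))
    return points

# B (1 h) depends on A (1.5 h); B must commit by 09:00 with a 0.5 h buffer.
pts = generate_monitoring_points(
    "B", {"B": ["A"]},
    {"B": timedelta(hours=1), "A": timedelta(hours=1, minutes=30)},
    datetime(2024, 1, 1, 9, 0), timedelta(minutes=30))
print(pts["A"]["latest_start_commit"].time())  # → 06:30:00
```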

Baseline Point Validation

Generated monitoring points are placed into a `BaselineTimeQueue`, which drives the following validation stages:

Baseline Warning Check (`CHECK_START_WARNING_TIME`): Triggered at the earliest warning start time. If the task has not started and is the first such task in the chain, the instance state changes from Safe to Baseline Warning and an alert is sent.

Baseline Broken Check (`CHECK_START_COMMIT_TIME`): Triggered at the earliest commit start time. If the task still has not started, the state changes to Baseline Broken and an alert is sent.

After the broken check, the system may enter a Broken Intensify Check (`CHECK_OVERTIME_INTENSIFY`) or a Wait Intensify Check (`CHECK_WAIT_OVERTIME_INTENSIFY`), depending on whether the task has started and whether upstream tasks are already broken.

When the intensify time arrives, if the task has still not succeeded, a Baseline Broken Intensify alert is emitted and the instance ends with `FINISH_WITH_UNSAFE`; otherwise it ends with `FINISH_WITH_SAFE`.
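A toy version of this machinery, using a heap as the delay queue. Only the check‑type constants come from the text; the class, function, and callback names are illustrative:

```python
import heapq
from datetime import datetime

class BaselineTimeQueue:
    """Delay queue of (check_time, check_type, task) monitoring points."""
    def __init__(self):
        self._heap = []

    def push(self, when, check_type, task):
        heapq.heappush(self._heap, (when, check_type, task))

    def pop_due(self, now):
        """Pop every monitoring point whose check time has arrived."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap))
        return due

def validate(queue, now, task_started, task_finished):
    """Run the due validation stages and collect the alerts they raise."""
    alerts = []
    for _when, check, task in queue.pop_due(now):
        if check == "CHECK_START_WARNING_TIME" and not task_started(task):
            alerts.append((task, "Baseline Warning"))
        elif check == "CHECK_START_COMMIT_TIME" and not task_started(task):
            alerts.append((task, "Baseline Broken"))
        elif check == "CHECK_OVERTIME_INTENSIFY" and not task_finished(task):
            alerts.append((task, "Baseline Broken Intensify"))
    return alerts

q = BaselineTimeQueue()
q.push(datetime(2024, 1, 1, 7, 0), "CHECK_START_WARNING_TIME", "A")
q.push(datetime(2024, 1, 1, 7, 30), "CHECK_START_COMMIT_TIME", "A")
alerts = validate(q, datetime(2024, 1, 1, 7, 15),
                  task_started=lambda t: False,
                  task_finished=lambda t: False)
print(alerts)  # → [('A', 'Baseline Warning')]
```

Only the 07:00 point is due at 07:15, so just the warning fires; the commit‑time check stays queued for a later pass.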

Handling Complex Task Chains

Task‑chain changes: Baseline instances are snapshots of the DAG at generation time. If the chain changes later, the new snapshot takes effect only in the next generation cycle.

Cross‑layer dependencies: When a downstream task depends on an upstream task from a different layer, the monitoring points of the shared upstream task are updated to the earliest timestamps to keep the chain consistent.

Cyclic dependencies: Only the latest business‑time instance of each task is kept; older instances are discarded because their alerts would already have been raised. `latest_task_time` identifies the instance to retain.
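The retention rule amounts to keeping, per task, the instance with the greatest business time. A tiny sketch (names hypothetical; business times shown as integers for brevity):

```python
def retain_latest(instances):
    """Keep only the latest business-time instance of each task."""
    latest = {}
    for task, business_time in instances:
        if task not in latest or business_time > latest[task]:
            latest[task] = business_time
    return latest

# Task A appears twice via a cyclic dependency; only its latest survives.
print(retain_latest([("A", 1), ("A", 3), ("B", 2)]))  # → {'A': 3, 'B': 2}
```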

Future Work

The team plans to enhance baseline monitoring with key‑path analysis, improve instance generation efficiency, and further optimize algorithm performance to provide stronger full‑link monitoring capabilities for DataLeap users.

Tags: Big Data, operations, alerting, data pipelines, baseline monitoring, task dependency
Written by ByteDance Data Platform

The ByteDance Data Platform team empowers all ByteDance business lines by lowering data‑application barriers, aiming to build data‑driven intelligent enterprises, enable digital transformation across industries, and create greater social value. Internally it supports most ByteDance units; externally it delivers data‑intelligence products under the Volcano Engine brand to enterprise customers.