
Design and Optimization of Real-Time Data Quality Control (DQC) Platform on Bilibili's Big Data System

Bilibili redesigned its real-time data-quality control platform, replacing per-rule Flink jobs with a unified, dynamically configured architecture. The new design classifies Kafka topics by traffic, aggregates via InfluxDB full tables and continuous queries, mitigates data inflation, adds a high-performance InfluxDB proxy, and implements monitoring and recovery safeguards to keep data quality scalable and reliable across Bilibili's big-data services.

Bilibili Tech

Data quality is a prerequisite for the effectiveness of big‑data‑derived applications. Bilibili’s rapidly growing business and its vision of building deeper, more competitive services on big data require a data platform that can provide real‑time, accurate, and trustworthy data to all business parties. Consequently, a data‑quality platform (DQC) has become an indispensable component of Bilibili’s big‑data platform.

The DQC platform consists of offline and real‑time components, similar to traditional monitoring systems such as Prometheus. Offline DQC processes data produced by batch jobs, while real‑time DQC focuses on Kafka data streams, which are critical for the real‑time data warehouse.

First-generation real-time DQC design: for each new collection object or rule, a new Flink job was launched, writing its results to MySQL. This simple architecture had three major drawbacks: low resource utilization (many idle Flink tasks), high network bandwidth consumption (multiple tasks consuming the same topic), and poor stability (rule changes required task restarts, which often failed due to YARN/K8s resource shortages).

New design goals were defined: (1) start once and never restart, (2) a single task can serve multiple topics, and (3) a single consumption can perform multiple rule checks.

New solution architecture includes:

Classification of topics into large, medium, and small based on QPS, with different processing paths.
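This routing can be sketched as a simple threshold check. The QPS cutoffs below are illustrative assumptions; the source does not give Bilibili's actual values.

```python
# Hypothetical sketch: classify Kafka topics by QPS into large/medium/small.
# The numeric thresholds are illustrative, not Bilibili's actual cutoffs.

def classify_topic(qps: float) -> str:
    """Route a topic to a processing path based on its average QPS."""
    if qps >= 100_000:   # high-traffic: a dedicated Flink task per topic
        return "large"
    if qps >= 5_000:     # medium: shared task, InfluxDB full-table path
        return "medium"
    return "small"       # small: shared task, InfluxDB full-table path

# Example routing for some made-up topics:
topics = {"app-click-log": 250_000, "danmaku-events": 20_000, "ops-audit": 300}
routing = {name: classify_topic(qps) for name, qps in topics.items()}
print(routing)  # {'app-click-log': 'large', 'danmaku-events': 'medium', 'ops-audit': 'small'}
```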

Medium/small-topic scheme: data is fully ingested into an InfluxDB "full table", then aggregated via InfluxDB Continuous Queries (CQ) into a CQ table. Dynamic topic management is achieved by a configuration-center-driven Kafka consumer and mapper classes whose bytecode is generated at runtime and stored as Base64 strings in the configuration center.
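An InfluxDB continuous query of this kind might look as follows. The database, retention-policy, and measurement names are hypothetical; only the CQ mechanism itself is taken from the source.

```sql
-- Illustrative CQ (names are assumed): every minute, roll raw records in
-- the full table up into per-rule counts in the CQ table.
CREATE CONTINUOUS QUERY "cq_rule_count_1m" ON "dqc"
BEGIN
  SELECT count("value") AS "value"
  INTO "dqc"."rp_14d"."cq_table"
  FROM "dqc"."rp_1h"."full_table"
  GROUP BY time(1m), "rule_id"
END
```

InfluxDB runs such a query automatically on each interval, so the raw full table only needs a short TTL while the compact CQ table keeps a longer history.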

Large‑topic scheme: each Flink task consumes a single high‑traffic topic. Rules are dynamically loaded from the configuration center; a FlatMap tags records with rule IDs, and aggregation operators compute results per rule.
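The tagging step can be sketched without Flink as follows. The rule model (`id` plus a match predicate) is a hypothetical simplification of rules loaded from the configuration center.

```python
# Sketch of the rule-tagging FlatMap, assuming a minimal rule model:
# each record is emitted once per matching rule, tagged with that rule's ID.

RULES = [  # illustrative rules, stand-ins for configuration-center entries
    {"id": 101, "match": lambda r: r.get("event") == "play"},
    {"id": 102, "match": lambda r: r.get("buvid") is not None},
]

def flat_map(record: dict):
    """Yield one (rule_id, record) pair for every rule the record matches."""
    for rule in RULES:
        if rule["match"](record):
            yield (rule["id"], record)

tagged = list(flat_map({"event": "play", "buvid": "abc"}))
# both rules match, so the record is emitted twice with different rule IDs
```

Downstream aggregation operators can then key by rule ID and compute each rule's result independently.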

During optimization, a data‑inflation problem was identified: a record matching multiple rules produced multiple output records, causing up to 4× data expansion and back‑pressure. Two mitigations were applied: (1) output format changed to <groupKey, RuleIdList, Data> to merge rule IDs, and (2) a map‑reduce‑reduce pattern introduced a local aggregation step before global reduction.
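Both mitigations can be sketched together. The helper names and rule model below are hypothetical; the output shape <groupKey, RuleIdList, Data> and the local pre-aggregation step follow the source.

```python
# Sketch of the two inflation mitigations, assuming a minimal rule model:
# (1) emit one <groupKey, ruleIdList, data> record instead of one per rule;
# (2) pre-aggregate locally ("map-reduce-reduce") so only partial sums
#     cross the network to the global reducer.
from collections import defaultdict

def tag_merged(record: dict, rules: list) -> tuple:
    """Mitigation 1: collect all matching rule IDs onto a single record."""
    rule_ids = [r["id"] for r in rules if r["match"](record)]
    return (record["group_key"], rule_ids, record)

def local_aggregate(tagged_records) -> dict:
    """Mitigation 2: partial counts per (group_key, rule_id) inside one task."""
    partial = defaultdict(int)
    for group_key, rule_ids, _data in tagged_records:
        for rule_id in rule_ids:
            partial[(group_key, rule_id)] += 1
    return partial  # only these partial sums are sent to the global reducer
```

With four rules matching the same record, the merged format emits one record instead of four, and the local step further collapses records sharing a group key before shuffling.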

InfluxDB proxy solution was introduced to improve read/write performance. The proxy performs dual‑write for consistency, retries failed writes, selects the optimal backend node for queries, compresses data with gzip, and batches writes. After refactoring the write protocol to send one DB per request and eliminate redundant tags, network I/O decreased dramatically.
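The proxy's write path (batching, gzip compression, dual-write with retry) can be sketched as below. The function names and retry policy are assumptions; the send callback stands in for the actual HTTP client.

```python
# Sketch of the proxy's write path under stated assumptions: batch
# line-protocol points for one DB, gzip the payload, and dual-write it
# to two InfluxDB backends, retrying each backend on failure.
import gzip

def encode_batch(points: list[str]) -> bytes:
    """Join points destined for one database into a single gzipped payload."""
    return gzip.compress("\n".join(points).encode("utf-8"))

def dual_write(payload: bytes, backends, send, max_retries: int = 3) -> bool:
    """Write the same payload to every backend, retrying on failure."""
    all_ok = True
    for backend in backends:
        for _ in range(max_retries):
            if send(backend, payload):
                break
        else:
            all_ok = False  # a persistently failing backend would be
                            # queued for later replay to restore consistency
    return all_ok
```

Sending one database per request, as the refactored protocol does, keeps each payload homogeneous and lets redundant tags be stripped before encoding.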

Key InfluxDB concepts:

Full‑table stores raw data for one hour (TTL) with tags such as subtask and sinknum to avoid record overwrites.

CQ table stores aggregated results (time, rule_id, value) with a 14‑day TTL to support day‑/week‑over‑day comparisons.

Tag selection follows two rules: a column becomes a tag only if it appears in WHERE clauses and has a low-cardinality domain; high-cardinality columns are stored as fields instead.
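In InfluxDB line protocol this split looks as follows (measurement and column names other than subtask/sinknum are illustrative):

```text
# Low-cardinality, WHERE-filtered columns become indexed tags
# (rule_id, subtask, sinknum); high-cardinality values such as a
# device ID stay unindexed fields.
full_table,rule_id=101,subtask=3,sinknum=7 value=1i,buvid="x9f3a2" 1700000000000000000
```

Keeping high-cardinality values out of the tag set bounds the series count, which is what the sequence-count monitoring mentioned below watches.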

Operational safeguards include monitoring of InfluxDB write/read QPS, sequence count, and resource usage; handling traffic spikes (predictable events like esports finals vs unpredictable viral videos) by pre‑allocating resources or degrading service; checkpoint‑based crash recovery for Flink tasks; preventing duplicate consumption by binding tasks to topics in a registry; and tiered protection for P0/P1 topics.

Future work focuses on engineering automation (automated rule migration, priority‑based task scheduling), finer‑grained Flink task management (deciding whether to add a topic to an existing task or launch a new one based on load), and ensuring that quality‑monitoring workloads do not interfere with normal production jobs.

Tags: big data, Flink, streaming, Resource Optimization, InfluxDB, DQC, Real-time Data Quality
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.