
Optimizing Full Partition Tables with Zipper Tables, Hudi+Flink CDC, and Data Warehouse Strategies

Facing server‑hardware constraints, Bilibili’s data platform replaced wasteful full‑partition tables with a zipper‑table approach—preserving change history while cutting storage from petabytes to terabytes—and complemented it with Hudi + Flink CDC for near‑real‑time updates, dramatically lowering I/O, compute usage and latency.

Bilibili Tech

Since 2020, the global semiconductor shortage has constrained server‑hardware procurement across the internet industry. Bilibili relies heavily on enterprise‑grade servers for data computation and storage, and as user activity grows, demand for both compute and storage keeps rising, leaving capacity increasingly insufficient.

The data platform team classified the main data types into four categories: log data, business data, manually produced data, and metadata. Log data (client and server logs) is append‑only and never updated after it is written, while business data originates from transactional databases and requires frequent updates. Manually produced data and metadata have relatively low volume and modest resource demands.

To handle business data, the team traditionally uses a full‑partition table pattern (T‑1), where each partition stores the complete dataset as of the previous day. When resources are abundant, this approach works well, but under tight resource conditions it causes severe storage waste due to duplicate copies of unchanged data and risks data loss when early partitions are deleted.
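Back‑of‑envelope arithmetic makes the waste concrete. The numbers below are illustrative assumptions (not figures from the article): a 10 TB table, a one‑year retention window, and 1% of rows changing per day.

```python
# Illustrative (made-up) numbers: a 10 TB table retained for 365 daily
# full partitions, with only 1% of rows changing per day.
full_table_tb = 10.0
retention_days = 365
daily_change_ratio = 0.01

# Full-partition approach: every daily partition stores a complete copy.
full_partition_storage = full_table_tb * retention_days

# Change-history approach (e.g. a zipper table): one base copy plus only
# the changed row versions accumulated over the retention window.
zipper_storage = full_table_tb * (1 + daily_change_ratio * retention_days)

print(f"full partitions: {full_partition_storage:.0f} TB")  # 3650 TB
print(f"change history:  {zipper_storage:.1f} TB")          # 46.5 TB
```

Under these assumptions the full‑partition layout stores roughly 80× more data, which is the same order‑of‑magnitude gap as the PB‑to‑TB reduction reported later in the article.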

Solution 1 – Traditional Full Partition Table (offline mode)

Data is refreshed in an offline pipeline: the T‑2 full partition is loaded, T‑1 incremental data is applied, and the result is written as the new T‑1 full partition. Timeliness is low (data is always at least a day old), but performance and throughput are high.

SQL pseudo‑code:

INSERT OVERWRITE TABLE target_table PARTITION (partition = yesterday)
SELECT
  column_list
FROM (
  SELECT
    column_list,
    ROW_NUMBER() OVER (
      PARTITION BY primary_key
      ORDER BY update_field DESC
    ) AS rn
  FROM (
    SELECT column_list
    FROM incremental_table
    WHERE partition = yesterday
    UNION ALL
    SELECT column_list
    FROM full_table
    WHERE partition = day_before
  ) n
) t
WHERE rn = 1
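The same dedup‑by‑latest‑version merge can be run end to end in SQLite, which supports the same window function. The table and column names below (`full_table`, `incremental_table`, `update_time`) are hypothetical stand‑ins for the pseudo‑code's placeholders, not Bilibili's actual schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE full_table (id INTEGER, nickname TEXT, update_time TEXT);
CREATE TABLE incremental_table (id INTEGER, nickname TEXT, update_time TEXT);
-- T-2 full snapshot
INSERT INTO full_table VALUES (1, 'alice', '2023-01-01'), (2, 'bob', '2023-01-01');
-- T-1 incremental changes: bob renamed himself, carol registered
INSERT INTO incremental_table VALUES (2, 'bobby', '2023-01-02'), (3, 'carol', '2023-01-02');
""")

# Union the old full snapshot with the new increments, then keep only the
# newest version per primary key (the ROW_NUMBER() dedup from the article).
rows = conn.execute("""
SELECT id, nickname FROM (
  SELECT id, nickname,
         ROW_NUMBER() OVER (PARTITION BY id ORDER BY update_time DESC) AS rn
  FROM (SELECT * FROM incremental_table
        UNION ALL
        SELECT * FROM full_table) n
) t
WHERE rn = 1
ORDER BY id
""").fetchall()

print(rows)  # [(1, 'alice'), (2, 'bobby'), (3, 'carol')]
```

Note that the full T‑2 snapshot is read and rewritten even though only two rows changed, which is exactly the I/O cost the later solutions attack.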

Solution 2 – Real‑time Data Lake with Hudi + Flink CDC

By ingesting the initial full snapshot offline into a Hudi table and then continuously applying MySQL binlog changes via Flink CDC, the system achieves near‑real‑time consistency (minute‑level latency). Benchmarks show that for daily new/changed records below 10 million, Hudi + Flink CDC can merge data efficiently, though full historical versioning still needs community improvements.
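A minimal sketch of the per‑key semantics a Hudi sink applies to a CDC stream, assuming (hypothetically) that each binlog event arrives as an `(operation, key, row)` tuple; this models only the upsert/delete merge, not Hudi's file layout or Flink's checkpointing.

```python
# Apply a stream of binlog-style change events to an in-memory "table",
# mimicking the per-primary-key upsert/delete semantics of a Hudi sink.
def apply_cdc_events(table: dict, events: list) -> dict:
    for op, key, row in events:
        if op in ("insert", "update"):   # both are upserts keyed on the PK
            table[key] = row
        elif op == "delete":
            table.pop(key, None)
    return table

# Initial full snapshot (bootstrapped offline), then streaming changes.
snapshot = {1: {"nickname": "alice"}, 2: {"nickname": "bob"}}
events = [
    ("update", 2, {"nickname": "bobby"}),
    ("insert", 3, {"nickname": "carol"}),
    ("delete", 1, None),
]
state = apply_cdc_events(snapshot, events)
print(state)  # {2: {'nickname': 'bobby'}, 3: {'nickname': 'carol'}}
```

Because each event touches only its own key, the cost per day scales with the change volume rather than the table size, which is why this approach stays efficient below the ~10‑million‑changes‑per‑day threshold mentioned above.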

Solution 3 – Zipper Table (link table) for Full Partition Optimization

The zipper table records the full history of each record, enabling seamless migration from full partition tables and supporting low‑timeliness offline workloads while preserving all change versions. This approach aligns with the data‑warehouse concept of a subject‑oriented, integrated, time‑variant collection.

Definition: a zipper table stores a start date and an end date for each record version; an end date of '9999-12-31' denotes the currently valid version.

Example SQL to generate a zipper table:

INSERT OVERWRITE TABLE zipper_table
SELECT
  n1.id,
  n1.nickname,
  n1.start_date,
  CASE
    WHEN n1.end_date = '9999-12-31'
      AND n2.id IS NOT NULL THEN 'business_date-1'
    ELSE n1.end_date
  END AS end_date
FROM zipper_table n1
LEFT OUTER JOIN
  (SELECT id FROM user_table
   WHERE yesterday_new_register OR yesterday_nickname_change) n2
ON n1.id = n2.id
UNION ALL
SELECT id, nickname, 'business_date' as start_date, '9999-12-31' as end_date
FROM user_table
WHERE
  yesterday_new_register OR yesterday_nickname_change

Further optimization is achieved via a MapReduce workflow that reads the existing zipper table and the incremental data once, avoiding extra sorting steps and reducing both storage and compute consumption.
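The daily update can be sketched as a single pass over both inputs: close the open version of every changed key, carry everything else over, and append the new versions. This mirrors the LEFT JOIN + UNION ALL SQL above; the function and variable names are illustrative, not the team's actual job code.

```python
from datetime import date, timedelta

OPEN = "9999-12-31"

def update_zipper(zipper, changes, business_date):
    """One daily zipper update. `changes` maps primary key -> new value for
    every row that was inserted or modified on `business_date`."""
    changed = set(changes)
    prev_day = (date.fromisoformat(business_date) - timedelta(days=1)).isoformat()
    out = []
    for key, value, start, end in zipper:
        if end == OPEN and key in changed:
            out.append((key, value, start, prev_day))  # close the old version
        else:
            out.append((key, value, start, end))       # carry over unchanged
    for key, value in changes.items():                 # open the new versions
        out.append((key, value, business_date, OPEN))
    return out

zipper = [(1, "alice", "2023-01-01", OPEN), (2, "bob", "2023-01-01", OPEN)]
changes = {2: "bobby", 3: "carol"}  # a rename and a new registration on T-1
print(update_zipper(zipper, changes, "2023-01-05"))
```

Each input row is read exactly once and no sort is required, which is the property the MapReduce workflow exploits to cut compute cost.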

In summary, converting full partition tables to zipper tables significantly cuts storage (e.g., from 1.20 PB to 5.06 TB in a test case) and reduces I/O and compute resource usage by about 17%. Future work includes making zipper tables idempotent, automating coverage and duplication testing, and selectively applying the optimization.

Tags: Big Data, SQL, data warehouse, Flink CDC, Hudi, Partition Table, Zipper Table
Written by Bilibili Tech

Provides introductions and tutorials on Bilibili-related technologies.