Bilibili Offline Platform: Migration from Hive to Spark and Large‑Scale Optimizations
This article details Bilibili's evolution of its offline computing platform from Hadoop‑based Hive to Spark, describing the migration process, automated SQL conversion, result verification, stability and performance enhancements, meta‑store optimizations, and future work on remote shuffle and vectorized execution.
Background : Since 2018, Bilibili has built an offline computing service on Hadoop, scaling from a few hundred to nearly ten thousand nodes across multiple data centers. It uses Hive, Spark, and Presto as batch engines and runs about 200k batch jobs daily.
From Hive to Spark : In early 2021, Hive handled more than 80% of jobs and Spark 2.4 about 20%. After the Spark 3.1 release, Bilibili migrated Hive SQL to Spark SQL using a custom migration tool that rewrites SQL, replaces input/output tables, and performs dual‑run result comparison.
SQL Conversion : Re‑implemented SparkSqlParser to replace tables, handle DAG‑level replacements, convert SELECT to CTAS for output, encode column names, and skip DDL that does not need comparison.
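One conversion step, wrapping a top-level SELECT in CTAS so the dual-run output lands in a shadow table, can be sketched as a string-level rewrite. This is only an illustration: Bilibili's tool re-implements SparkSqlParser and works on the parse tree, and the shadow table name below is hypothetical.

```python
import re

def select_to_ctas(sql, shadow_table):
    """Illustrative sketch only: wrap a top-level SELECT in
    CREATE TABLE ... AS so its output can be captured for comparison.
    The real tool rewrites the parse tree, not strings."""
    stripped = sql.strip().rstrip(";")
    if not re.match(r"(?is)^\s*select\b", stripped):
        return sql  # DDL and other non-query statements pass through unchanged
    return f"CREATE TABLE {shadow_table} AS\n{stripped}"

print(select_to_ctas("SELECT uid, count(*) FROM logs GROUP BY uid;", "tmp.logs_check"))
print(select_to_ctas("DROP TABLE old_logs", "tmp.logs_check"))
```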
Result Comparison : Compared schemas via DESC, then performed full‑data comparison using a GROUP BY and UNION ALL strategy to detect mismatched rows, while noting limitations with complex types and nondeterministic queries.
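The GROUP BY plus UNION ALL comparison strategy can be sketched in plain SQL. The example below runs on SQLite with hypothetical table and column names (`result_hive`, `result_spark`), not Bilibili's actual tool: tag each source's rows, group by every column, and flag groups whose per-source counts differ.

```python
import sqlite3

# Hypothetical tables standing in for the Hive-run and Spark-run outputs.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE result_hive  (id INTEGER, val TEXT);
CREATE TABLE result_spark (id INTEGER, val TEXT);
INSERT INTO result_hive  VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO result_spark VALUES (1, 'a'), (2, 'b'), (3, 'x');
""")

# UNION ALL both outputs with a per-source tag, then GROUP BY every column:
# any group whose per-source counts differ is a mismatched (or duplicated) row.
mismatches = conn.execute("""
SELECT id, val, SUM(from_hive) AS n_hive, SUM(from_spark) AS n_spark
FROM (
    SELECT id, val, 1 AS from_hive, 0 AS from_spark FROM result_hive
    UNION ALL
    SELECT id, val, 0, 1 FROM result_spark
)
GROUP BY id, val
HAVING SUM(from_hive) <> SUM(from_spark)
""").fetchall()

for row in mismatches:
    print(row)
```

Because counts are compared per group, the check also catches duplicated rows, which a plain EXCEPT would miss; it does not handle the complex-type and nondeterministic-query cases noted above.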
Migration & Rollback : Each task undergoes at least three dual‑run comparisons; post‑migration monitoring compares the first three executions against the previous seven‑run average, rolling back tasks that show negative optimization.
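A minimal sketch of the post-migration check, assuming a simple runtime average with a hypothetical 10% tolerance (the article does not state the exact threshold):

```python
def should_roll_back(post_runs, prior_runs, tolerance=0.10):
    """Flag a migrated task as a negative optimization: compare the average
    runtime of its first post-migration executions against the average of
    its prior runs. The tolerance value is an invented placeholder."""
    post_avg = sum(post_runs) / len(post_runs)
    prior_avg = sum(prior_runs) / len(prior_runs)
    return post_avg > prior_avg * (1 + tolerance)

# First three Spark runs vs. the previous seven-run average:
print(should_roll_back([900, 950, 920], [600, 610, 590, 605, 600, 595, 600]))  # True
print(should_roll_back([580, 590, 600], [600, 610, 590, 605, 600, 595, 600]))  # False
```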
Spark Practice – Stability Improvements :
Small file handling: added a fallback merge step and a repartition‑based merge using AQE's rebalance hint.
Shuffle stability: prioritized SSD for DiskBlockManager, introduced a remote shuffle service (RSS) with a shuffle‑service master for node selection, achieving ~25% execution‑time reduction for large jobs.
Large result spill: driver memory monitoring writes excess results to disk and streams them back in batches.
Task parallelism limits: dynamic limits on total tasks and per‑SQL tasks based on executor usage.
Dangerous join detection and join‑inflation monitoring using execution metrics.
Data skipping: leveraged ORC/Parquet statistics, added ordering on hot columns, and demonstrated order‑by‑state reducing scan size dramatically.
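Why ordering on a hot column shrinks scans can be illustrated without Spark: ORC/Parquet keep min/max statistics per stripe or row group, and a reader skips any group whose range cannot contain the predicate value. The row-group size and value distribution below are made up for the sketch.

```python
def build_row_groups(values, group_size):
    """Split values into fixed-size 'row groups' and record min/max stats,
    mimicking the footer statistics ORC/Parquet keep per stripe/row group."""
    groups = [values[i:i + group_size] for i in range(0, len(values), group_size)]
    return [(min(g), max(g), g) for g in groups]

def groups_to_scan(row_groups, point):
    """Count groups whose [min, max] range can contain the predicate value;
    every other group is skipped without being read."""
    return sum(1 for lo, hi, _ in row_groups if lo <= point <= hi)

# Hypothetical hot column: 10k rows over 100 distinct "state" values.
values = [i % 100 for i in range(10_000)]

unsorted_groups = build_row_groups(values, 1_000)
sorted_groups = build_row_groups(sorted(values), 1_000)

# Unsorted layout: every group spans the full value range, nothing is skipped.
print(groups_to_scan(unsorted_groups, 42))  # 10
# Sorted on the hot column: the predicate value lands in a single group.
print(groups_to_scan(sorted_groups, 42))    # 1
```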
Performance Optimizations :
DPP and AQE compatibility fixed in Spark 3.2.
AQE‑enabled ShuffledHashJoin when partition size permits.
Runtime filter implementation using DynamicBloomFilterPruning to prune large tables before shuffle, reducing rows from billions to tens of thousands.
Data skipping enhancements with column‑level statistics.
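The Bloom-filter pruning idea can be re-created as a toy: build a filter from the small (build) side of a join, then drop non-matching rows from the large side before any shuffle. This is a self-contained sketch with invented sizes, not Spark's DynamicBloomFilterPruning implementation.

```python
import hashlib

class BloomFilter:
    """Tiny Bloom filter standing in for the runtime filter Spark would
    build from the small side of a join."""
    def __init__(self, size_bits=1 << 16, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, key):
        digest = hashlib.sha256(str(key).encode()).digest()
        for i in range(self.num_hashes):
            chunk = int.from_bytes(digest[4 * i:4 * i + 4], "big")
            yield chunk % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(key))

# Build the filter from the small dimension side...
dim_keys = range(100)
bf = BloomFilter()
for k in dim_keys:
    bf.add(k)

# ...then prune the large fact side at scan time, before it is shuffled.
fact_rows = [(k, "payload") for k in range(200_000)]
pruned = [row for row in fact_rows if bf.might_contain(row[0])]
print(len(pruned))  # close to 100; rare false positives may add a few rows
```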
Functional Improvements :
ZSTD compression support added to Spark 3.1 with bug fixes in ORC.
Multi‑format reading compatibility for tables with mixed file formats.
Table conversion and small‑file merge syntax (grammar rules):
    CONVERT TABLE target=tableIdentifier (convertFormat | compressType) partitionClause? #convertTable
    MERGE TABLE target=tableIdentifier partitionClause? #mergeTable
Field‑level lineage extraction via a custom LineageQueryListener.
HBO – Automatic Parameter Optimization : Collected execution fingerprints and metrics to recommend memory, parallelism, shuffle, and small‑file settings, improving memory utilization to ~50% and enabling dynamic resource allocation.
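An HBO-style memory recommendation can be sketched as a rule over the per-fingerprint execution history; the headroom factor, rounding step, and floor below are invented for illustration, not Bilibili's actual policy.

```python
def recommend_executor_memory(peak_mb_history, headroom=1.3,
                              step_mb=1024, floor_mb=2048):
    """Hypothetical HBO-style rule: size executor memory from the observed
    peak usage across recent runs of the same job fingerprint, add headroom,
    and round up to a scheduler-friendly step."""
    peak = max(peak_mb_history)
    target = max(int(peak * headroom), floor_mb)
    # Round up to the next multiple of step_mb.
    return ((target + step_mb - 1) // step_mb) * step_mb

# A job that requested 8 GB but never used more than ~2.5 GB is tuned down:
print(recommend_executor_memory([2300, 2450, 2500]))  # 4096
```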
Smart Data Manager (SDM) : Provides asynchronous table format conversion, data re‑organization (order/Z‑order), statistics collection, small‑file merging, Hive index creation, and lineage parsing, with lock management to keep operations transparent to users.
Hive Metastore Optimizations :
MetaStore federation to handle multi‑data‑center deployments, supporting both WaggleDance and native HMS federation, with a StateStore router.
Request tracing and traffic control: propagate CallerContext via EnvironmentContext, limit getPartitions calls, and implement memory‑based throttling and connection termination.
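The two traffic-control guards above can be sketched as an admission check; the limits here are hypothetical, and the real mechanism lives inside HMS request handling rather than a standalone function.

```python
def admit_request(heap_used_frac, partitions_requested,
                  heap_limit=0.85, partition_limit=10_000):
    """Illustrative admission check with invented thresholds:
    reject oversized getPartitions calls outright, and shed load
    when metastore heap pressure is high."""
    if partitions_requested > partition_limit:
        return False  # oversized getPartitions call rejected
    if heap_used_frac > heap_limit:
        return False  # memory-based throttling under heap pressure
    return True

print(admit_request(0.50, 200))     # True: small call, healthy heap
print(admit_request(0.92, 200))     # False: heap pressure
print(admit_request(0.50, 50_000))  # False: too many partitions requested
```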
Future Work : Investigate remote shuffle services for K8s, apply vectorized execution to Spark, and enhance automatic fault‑diagnosis systems.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.