Bilibili Presto on Hadoop: Architecture, Scaling, and Performance Enhancements
Bilibili’s Presto on Hadoop pairs a multi‑engine offline platform (Presto, Spark, Hive on HDFS/YARN, all managed through Kubernetes) with Ranger security and a custom SQL dispatcher. The deployment scales past 400 nodes in its largest cluster and handles roughly 160 k daily queries over about 10 PB of data. Key additions include coordinator HA, resource‑group punishment, query limits, Alluxio caching, dynamic filtering, and numerous SQL‑level enhancements, with auto‑scaling and materialized‑view automation planned.
This document describes the overall architecture and continuous improvements of the Presto SQL engine deployed at Bilibili (B站) on a Hadoop ecosystem.
1. Overall Architecture – The offline platform consists of three compute engines (Presto, Spark, Hive), HDFS storage, and YARN scheduling. A custom Dispatcher routes SQL queries to the most suitable engine based on syntax, data size, and engine load. The system integrates Ranger for fine‑grained permission control (table/column masking, row filtering) and all components are managed via the company’s Kubernetes platform.
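The routing decision the Dispatcher makes can be sketched as follows. This is a minimal illustration, not Bilibili’s actual dispatcher: the engine set, thresholds, and input signals (`syntaxSupportedByPresto`, `estimatedInputBytes`, `prestoClusterLoad`) are all assumptions standing in for the richer syntax/size/load signals described above.

```java
// Hypothetical sketch of a dispatcher that routes SQL to an engine
// based on syntax support, estimated input size, and engine load.
public class Dispatcher {

    enum Engine { PRESTO, SPARK, HIVE }

    static final long PRESTO_MAX_INPUT_BYTES = 1L << 40; // 1 TiB, assumed cutoff

    static Engine route(boolean syntaxSupportedByPresto,
                        long estimatedInputBytes,
                        double prestoClusterLoad) {
        if (!syntaxSupportedByPresto) {
            return Engine.SPARK;              // fall back for unsupported syntax
        }
        if (estimatedInputBytes > PRESTO_MAX_INPUT_BYTES) {
            return Engine.SPARK;              // heavy ETL goes to a batch engine
        }
        if (prestoClusterLoad > 0.9) {
            return Engine.SPARK;              // shed load when Presto is saturated
        }
        return Engine.PRESTO;                 // interactive/ad-hoc default
    }

    public static void main(String[] args) {
        System.out.println(route(true, 1L << 30, 0.5)); // small query, idle cluster
        System.out.println(route(true, 1L << 41, 0.5)); // too large for Presto
    }
}
```

In practice the decision also folds in engine health and historical runtimes, but the shape of the rule chain stays the same.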
2. Presto Application – Presto (originally from Facebook, now Trino) is a distributed MPP engine that processes data in‑memory with a streaming pipeline. Advantages include pipelined execution without landing shuffle data to disk, thread‑level split scheduling, and pluggable data sources. It is used for ADHOC, BI, DQC, AI‑ETL, and data‑exploration workloads.
3. Cluster Scale & Business Growth – Currently there are 7 Presto clusters across two data centers; the largest exceeds 400 nodes, and the clusters total more than 1,000 nodes. Daily query volume is ~160 k queries processing ~10 PB of HDFS data, a ten‑fold increase since early 2020.
4. Stability Improvements
4.1 Coordinator HA – Multi‑coordinator deployment with a Redis mapping to keep query‑to‑coordinator affinity. Coordinators acquire a global lock on start‑up and register with a State Store service for discovery.
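The affinity mapping can be sketched as below. A `ConcurrentHashMap` stands in for the Redis mapping described above, and the assignment-by-hash policy is an assumption; the point is only that follow‑up requests for a query must land on the coordinator that owns it.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of query-to-coordinator affinity; names are illustrative.
public class CoordinatorAffinity {

    private final Map<String, String> queryToCoordinator = new ConcurrentHashMap<>();
    private final List<String> liveCoordinators;

    CoordinatorAffinity(List<String> liveCoordinators) {
        this.liveCoordinators = liveCoordinators;
    }

    // On submit: pick a coordinator and record the mapping so that later
    // status requests for the same query reach the same coordinator.
    String assign(String queryId) {
        String coordinator =
                liveCoordinators.get(Math.abs(queryId.hashCode()) % liveCoordinators.size());
        queryToCoordinator.put(queryId, coordinator);
        return coordinator;
    }

    // On follow-up requests: look up the sticky mapping.
    String lookup(String queryId) {
        return queryToCoordinator.get(queryId);
    }

    public static void main(String[] args) {
        CoordinatorAffinity affinity =
                new CoordinatorAffinity(List.of("coord-1", "coord-2"));
        String c = affinity.assign("20220301_114217_08911_b5gjq");
        System.out.println(c.equals(affinity.lookup("20220301_114217_08911_b5gjq")));
    }
}
```

With the shared store in Redis rather than process memory, any coordinator (or a proxy in front of them) can answer the lookup, which is what makes multi‑coordinator failover transparent to clients.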
4.2 Label Isolation – Labels are loaded into memory and attached to nodes; the NodeSelector uses label information to route queries to the appropriate workers.
4.3 Real‑time Punishment – A runtime service monitors CPU usage per resource group and issues punish messages to workers when quotas are exceeded. The core algorithm is:
long cSum = lastCSum + usagePerSecond;           // accumulated CPU usage for the group
if (cSum <= punishCpuLimit) {
    cSum = 0;                                    // under quota: reset the accumulator
} else if (cSum >= 2 * punishCpuLimit) {
    // record that the current resource group needs to be punished
    cSum = cSum - punishCpuLimit;
} else {                                         // punishCpuLimit < cSum < 2 * punishCpuLimit
    cSum = cSum - punishCpuLimit;
}
4.4 Query Limits – Feature extraction, bad‑SQL detection, OOM‑killer handling, and rate‑limiting of identical query signatures.
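One of these limits, rate‑limiting identical query signatures, can be sketched as follows. The normalization rules (strip literals, collapse whitespace) and the fixed counting window are assumptions; the real feature extraction is richer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of rate-limiting identical query signatures.
public class SignatureRateLimiter {

    private final int maxPerWindow;
    private final Map<String, AtomicInteger> counts = new ConcurrentHashMap<>();

    SignatureRateLimiter(int maxPerWindow) { this.maxPerWindow = maxPerWindow; }

    // Two queries differing only in literals share one signature.
    static String signature(String sql) {
        return sql.toLowerCase()
                  .replaceAll("'[^']*'", "?")     // string literals -> ?
                  .replaceAll("\\b\\d+\\b", "?")  // numeric literals -> ?
                  .replaceAll("\\s+", " ")
                  .trim();
    }

    boolean tryAcquire(String sql) {
        AtomicInteger n = counts.computeIfAbsent(signature(sql), k -> new AtomicInteger());
        return n.incrementAndGet() <= maxPerWindow; // window reset omitted for brevity
    }

    public static void main(String[] args) {
        SignatureRateLimiter limiter = new SignatureRateLimiter(2);
        System.out.println(limiter.tryAcquire("SELECT * FROM t WHERE d = '20211019'"));
        System.out.println(limiter.tryAcquire("SELECT * FROM t WHERE d = '20211020'"));
        System.out.println(limiter.tryAcquire("SELECT * FROM t WHERE d = '20211021'"));
    }
}
```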
4.5 Worker OOM Killer – Periodic memory‑usage checks trigger task‑level kills when usage exceeds thresholds.
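The kill decision can be sketched like this. The victim‑selection policy shown (kill the single largest task when the worker is over its limit) and the threshold are assumptions for illustration.

```java
import java.util.Map;

// Sketch of a worker-side OOM killer: when total memory use crosses a
// threshold, pick the single largest task as the victim.
public class OomKiller {

    // Returns the task id to kill, or null if usage is under the threshold.
    static String pickVictim(Map<String, Long> taskMemoryBytes, long limitBytes) {
        long used = taskMemoryBytes.values().stream().mapToLong(Long::longValue).sum();
        if (used <= limitBytes) {
            return null; // healthy: nothing to kill
        }
        return taskMemoryBytes.entrySet().stream()
                .max(Map.Entry.comparingByValue())
                .map(Map.Entry::getKey)
                .orElse(null);
    }

    public static void main(String[] args) {
        Map<String, Long> tasks = Map.of("task-a", 6L << 30, "task-b", 2L << 30);
        System.out.println(pickVictim(tasks, 7L << 30)); // 8 GiB used, over limit
        System.out.println(pickVictim(tasks, 9L << 30)); // under limit -> null
    }
}
```

Killing a single task fails one query but keeps the worker (and every other query on it) alive, which is the trade the OOM killer is making.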
4.6 Monitoring & Alerts – JMX metrics are exported to Grafana with custom alerts.
5. Availability Improvements
5.1 Implicit Type Conversion – Added session‑level support to allow comparisons like 1 = '1' :
hive> select 1 = '1';
true
Time taken: 3.1 seconds, Fetched: 1 row(s)
presto> select 1 = '1';
Query 20220301_114217_08911_b5gjq failed: line 1:10: '=' cannot be applied to integer, varchar(1)
presto> set session implicit_conversion=true;
SET SESSION
presto> select 1 = '1';
_col0
-------
true
(1 row)
5.2 Hive UDF Compatibility – GenericUDFs are loaded via a custom classloader; arguments are converted between Hive and Presto types.
presto> select b_security_mask_email('[email protected]',0);
_col0
------------------
1*[email protected]
(1 row)
5.3 INSERT OVERWRITE Support – Added syntax parsing to map INSERT OVERWRITE statements to the underlying Insert node.
presto> insert overwrite table tmp_db.tmp_table select '1' as a, '2' as b;
INSERT OVERWRITE: 1 row
presto> insert overwrite directory "/tmp/xxx/insert1" select value.features from ai.xxxTable limit 10;
rows
------
10
(1 row)
5.4 Ranger Plugin – Integrated Presto with Ranger 1.2, enabling Hive‑style policies (row‑level filtering, column masking) for Presto.
5.5 Hint Syntax – Session parameters can be expressed as SQL hints.
/*+ query_max_execution_time= '1h', scale_writers=true*/
SELECT col_1, col_2 FROM xxxx WHERE log_data='20211019'
5.6 HAVING / GROUP BY Alias – Alias resolution added to avoid errors when using aliases in HAVING.
presto> select log_date, sum(job_percent) as percent from test.test_report group by log_date having percent > 0;
log_date | percent
----------+-----------------------
20211231 | 0.03625346663870051
5.7 Additional Enhancements – Coral for Hive view support, dynamic Resource‑Group loading, multi‑source connectors (Kafka, JDBC, TiDB, ClickHouse, Iceberg, Hudi, ES), catalog hot‑swap, query‑queue reason UI, job‑history persistence, zero‑downtime releases, fine‑grained lineage, etc.
6. Performance Improvements
6.1 Presto + Alluxio – Hot tables are tagged and cached in Alluxio, reducing HDFS RPC latency and improving TPC‑DS benchmark by 20‑30%.
6.2 Multi‑Data‑Center Architecture – Queries are routed to the data‑local data center based on table/partition location and estimated cross‑DC traffic.
6.3 Query Result Cache – Query MD5 + partition metadata forms a cache key stored in Redis; a cache‑invalidator watches for partition updates.
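The key construction can be sketched as below. Hashing the query text together with per‑partition version metadata means any partition update yields a different key, so stale entries are simply never hit; the exact metadata fields folded into the hash are an assumption here.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

// Sketch of the result-cache key: MD5 over the query text plus partition metadata.
public class ResultCacheKey {

    static String cacheKey(String sql, List<String> partitionVersions) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(sql.getBytes(StandardCharsets.UTF_8));
            for (String v : partitionVersions) {
                // e.g. "dt=20211019@lastModified=1634601600" (illustrative format)
                md5.update(v.getBytes(StandardCharsets.UTF_8));
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String k1 = cacheKey("select count(*) from t", List.of("dt=20211019@v1"));
        String k2 = cacheKey("select count(*) from t", List.of("dt=20211019@v2"));
        System.out.println(k1.equals(k2)); // false: partition update changes the key
    }
}
```

The separate cache‑invalidator mentioned above can then delete Redis entries eagerly on partition updates, rather than waiting for keys to go cold.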
6.4 RaptorX Local Cache – Page‑level cache (≈1 MB) on workers, with consistent hashing for high hit‑rate and minimal impact during scaling.
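The consistent‑hashing assignment can be sketched as a hash ring: a file path always maps to the same worker, so cached pages stay hot, and adding or removing a worker remaps only a small share of keys. The virtual‑node count and the hash function here are stand‑ins, not the production choices.

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of consistent-hashing split-to-worker assignment for the local cache.
public class CacheRing {

    private final TreeMap<Integer, String> ring = new TreeMap<>();
    private static final int VIRTUAL_NODES = 16; // assumed; smooths the distribution

    void addWorker(String worker) {
        for (int i = 0; i < VIRTUAL_NODES; i++) {
            ring.put(hash(worker + "#" + i), worker);
        }
    }

    // Walk clockwise from the key's hash to the first virtual node.
    String workerFor(String filePath) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(filePath));
        return tail.isEmpty() ? ring.firstEntry().getValue() : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode() & 0x7fffffff; // cheap stand-in for a real hash (e.g. murmur)
    }

    public static void main(String[] args) {
        CacheRing ring = new CacheRing();
        ring.addWorker("worker-1");
        ring.addWorker("worker-2");
        String w = ring.workerFor("/warehouse/t/dt=20211019/part-0.orc");
        System.out.println(w.equals(ring.workerFor("/warehouse/t/dt=20211019/part-0.orc")));
    }
}
```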
6.5 Struct Push‑Down – Back‑ported Trino’s struct‑field push‑down; example query:
SELECT
A.ip,
B.info.mid
FROM
tmp_bdp.tmp_struct_test A
JOIN tmp_bdp.tmp_struct_test B on A.ip = B.ip;6.6 JDK Upgrade – Migrated from JDK 8 to Zulu JDK 11 (G1 GC) with a 2‑point throughput gain; ZGC was evaluated but not adopted due to class‑unloading limitations.
6.7 Dynamic Filtering – Implemented Broadcast Join, Dynamic Partition Pruning, and Partitioned Join with a new PredicatePushDown optimizer and a DynamicFilterService.
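The dynamic‑partition‑pruning part of this can be illustrated with a toy example: the distinct join keys seen while building the (small) build‑side hash table are collected at runtime and used to drop probe‑side partitions that cannot possibly match. This is a conceptual sketch, not the DynamicFilterService implementation.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Sketch of dynamic partition pruning with runtime build-side keys.
public class DynamicPartitionPruning {

    // buildSideKeys: join-column values seen while building the hash table.
    // probePartitions: partition values of the large fact table.
    static List<String> prunePartitions(List<String> probePartitions,
                                        List<String> buildSideKeys) {
        Set<String> keep = new HashSet<>(buildSideKeys);
        return probePartitions.stream()
                .filter(keep::contains)   // scan only partitions that can match
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> partitions = List.of("20211017", "20211018", "20211019");
        List<String> buildKeys = List.of("20211019");
        System.out.println(prunePartitions(partitions, buildKeys)); // [20211019]
    }
}
```

The win is that the filter is derived at runtime from actual data, so it prunes even when the predicate on the dimension table cannot be expressed statically against the fact table's partition column.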
6.8 Other Optimizations – Split merging for small files, Observer Namenode usage, FileStatus cache, spill to disk, plan cache, phased multi‑stage execution, improved CBO statistics handling.
7. Future Plans
• Auto‑scaling (HPA) for Presto workers in mixed YARN clusters.
• Heuristic indexes for pre‑filtering splits.
• Automatic materialized‑view creation and refresh.
• Further optimization for Array/Map types.
• Heuristic‑based routing of heavy ETL queries to Presto‑on‑Spark.
Bilibili Tech
Provides introductions and tutorials on Bilibili-related technologies.