Big Data | BestHub

Bilibili Tech

Jan 3, 2025 · Big Data

Evolution and Production Practices of Apache Celeborn Remote Shuffle Service at Bilibili

Bilibili replaced Spark’s unstable External Shuffle Service with a push‑based approach, then deployed Apache Celeborn’s remote shuffle on Kubernetes using HA masters, tiered workers, extensive monitoring, history‑based routing, chaos testing, and seamless Spark, Flink, and MapReduce integration, while planning self‑healing, elastic scaling, and priority‑aware I/O enhancements.

Apache Celeborn · Big Data · Kubernetes

0 likes · 28 min read


Bilibili Tech

Dec 17, 2024 · Big Data

Apache Gravitino: Metadata Management Practices and Production Experience at Bilibili

Bilibili adopted Apache Gravitino as a unified metadata platform that decouples consumers, consolidates schemas and Fileset‑based unstructured data across heterogeneous sources, cuts metadata and storage costs, resolves inconsistencies, boosts Hive Metastore performance, and enables features such as Iceberg branching and future AI‑centric governance.

Apache Gravitino · Big Data · Fileset

0 likes · 20 min read


iQIYI Technical Product Team

Nov 21, 2024 · Big Data

Alluxio Integration and Optimization for Multi‑AZ Big Data Analytics at iQIYI

iQIYI integrates Alluxio with its QBFS multi‑AZ unified scheduling system, automatically caching hot tables, applying table‑level policies, page‑level storage and AZ‑aware worker selection, which together cut cross‑zone traffic, halve query latency, achieve up to 20× I/O speedup and a three‑fold overall performance boost.

Alluxio · Big Data · Cache Optimization

0 likes · 23 min read


Bilibili Tech

Oct 25, 2024 · Big Data

DataFunSummit2024: Next-Generation Data Architecture Technology Summit

DataFunSummit2024, co-hosted by Bilibili, convenes industry experts, scholars, and enterprise leaders across six forums to discuss next‑generation data architecture, showcasing Bilibili’s Iceberg‑based stream‑batch innovations, AI‑BI analytics, NoETL practices, and emerging alternatives to Lambda architecture.

AI+BI · Big Data · Data Architecture

0 likes · 3 min read


iQIYI Technical Product Team

Jul 26, 2024 · Artificial Intelligence

Optimizing Advertising Feature Evaluation Process with the Opal Machine Learning Platform

By migrating iQIYI’s advertising feature‑evaluation workflow to the Opal machine‑learning platform, the team replaced a manual, engineer‑heavy process with a unified, automated pipeline that cut evaluation cycles from five days to 1.5 days, tripling iteration speed while lowering barriers and improving consistency for future feature optimization.

Big Data · Feature Evaluation · Opal Platform

0 likes · 6 min read


iQIYI Technical Product Team

Jun 28, 2024 · Artificial Intelligence

Feature Center Overview in iQIYI's Opal Machine Learning Platform

The Feature Center in iQIYI’s Opal platform centralizes feature creation, storage, and real‑time access through a drag‑and‑drop DAG workflow and DSL‑driven transformations, handling massive QPS and low‑latency demands while enabling fast business iteration, cross‑team reuse, and monitoring for advertising, recommendation, and risk‑control applications.

Big Data · Feature Engineering · Opal

0 likes · 13 min read


vivo Internet Technology

Dec 13, 2023 · Big Data

Hudi Data Lake Implementation and Optimization Practice at vivo

vivo’s big‑data team deployed Apache Hudi to create a lakehouse that unifies streaming and batch workloads, leverages copy‑on‑write (COW) and merge‑on‑read (MOR) table types, automates small‑file clustering and compaction, and applies extensive version, streaming, batch, and lifecycle optimizations, delivering minute‑level latency, hundred‑million‑records‑per‑minute ingestion, and query speeds up to 20 % faster than Hive.

Apache Hudi · Big Data · Data Lakehouse

0 likes · 11 min read


vivo Internet Technology

Sep 27, 2023 · Big Data

Horizontal Scaling of Hive Metastore Service at vivo: Evaluation, TiDB Migration, and Lessons Learned

vivo’s big‑data team horizontally scaled its Hive Metastore by evaluating MySQL sharding (Waggle‑Dance) against a TiDB migration, ultimately adopting TiDB, which after a synchronized cut‑over delivered ~15% faster queries, 80% DDL latency reduction, linear scaling, low resource use, and valuable operational lessons.

Big Data · Hive Metastore · Performance Optimization

0 likes · 19 min read


iQIYI Technical Product Team

Sep 22, 2023 · Big Data

Data Lake: Concepts, Architecture, and Application in iQIYI's Data Platform

iQIYI’s data‑middle‑platform team built a four‑zone data lake—raw, product, work, and sensitive—integrated with unified ODS/DWD/MID layers, a metadata catalog, and self‑service tools, leveraging HDFS, Hive/Iceberg, Spark/Trino, and Flink, migrated to Apache Iceberg for real‑time freshness, and now aims to further streamline modules and adopt new technologies.

Apache Iceberg · Big Data · Data Lake

0 likes · 13 min read


iQIYI Technical Product Team

Sep 15, 2023 · Big Data

Apache Spark at iQIYI: Current Status and Optimization

iQIYI now relies on Apache Spark as its main offline engine, processing over 200 000 daily tasks for ETL, data synchronization and analytics, while recent optimizations—dynamic resource allocation, adaptive query execution, compression, rebalance, Z‑order and resource‑governance—have cut compute usage by ~27 %, storage by up to 76 % and improved query speed, completing a large‑scale migration from Hive and paving the way for Spark 3.4 and Iceberg support.

Apache Spark · Big Data · Data Lake

0 likes · 21 min read


Didi Tech

Aug 10, 2023 · Big Data

Implementing ZSTD Compression in Didi's Elasticsearch for High‑Performance Log Ingestion

By integrating ZSTD compression into Didi’s Elasticsearch 7.6, the team cut CPU usage by about 15 %, reduced index storage roughly 30 %, boosted write throughput up to 25 %, and eliminated over 20 servers, demonstrating a faster, more storage‑efficient solution for petabyte‑scale log ingestion.

Big Data · Lucene · ZSTD

0 likes · 10 min read


iQIYI Technical Product Team

Jun 9, 2023 · Big Data

Accelerating iQIYI Big Data Platform: Migrating from Hive to Spark SQL

iQIYI accelerated its big‑data platform by migrating the OLAP layer from Hive to Spark SQL, achieving a 67 % speedup, 50 % CPU reduction and 44 % memory savings, while automating the conversion of tens of thousands of tasks and delivering faster analytics for advertising, BI, membership and user‑growth services.

Automation · Big Data · Hive

0 likes · 18 min read


Bilibili Tech

Mar 14, 2023 · Big Data

Bilibili HDFS Erasure Coding Strategy and Implementation

Bilibili reduced petabyte‑scale storage costs by back‑porting erasure‑coding patches to its HDFS 2.8.4 cluster, deploying a parallel EC‑enabled cluster, adding a data‑proxy service, intelligent routing and block‑checking, and automating cold‑data migration, while noting write overhead and planning native acceleration.

Big Data · Data Reliability · Erasure Coding

0 likes · 14 min read


Alimama Tech

Mar 8, 2023 · Artificial Intelligence

Secure Data Hub: Alibaba's Marketing Privacy Computing Platform

Alibaba’s Secure Data Hub (SDH) is a privacy‑preserving data clean‑room platform that uses secure multi‑party computation and privacy‑enhancing machine learning to let advertisers, ad platforms, and auditors jointly analyze marketing data via a simple SQL API while keeping raw data encrypted, column‑level protected, and confined to each party’s private domain.

Big Data · Data Clean Room · Federated Learning

0 likes · 13 min read


Alimama Tech

Feb 15, 2023 · Big Data

Dolphin: Alibaba's Hyper‑Converged Multi‑Modal Big Data Engine Overview

Dolphin, Alibaba’s hyper‑converged multi‑modal big‑data engine, unifies OLAP, AI, streaming, and batch workloads on a decoupled compute‑storage MPP foundation, offering a Dolphin SQL layer, advanced bitmap/GroupTable/AFile indexes, intelligent materialization, and one‑write‑multiple‑read storage that cuts costs over 70% while delivering sub‑millisecond queries on trillion‑row datasets.

AI · Big Data · Indexing

0 likes · 14 min read


iQIYI Technical Product Team

Feb 3, 2023 · Big Data

Data Lake Concepts, Benefits, and Iceberg‑Based Implementations at iQIYI

iQIYI’s data lake combines public‑cloud and private storage with Apache Iceberg’s snapshot‑based table format to enable near‑real‑time, unified batch‑and‑stream analytics, reducing costs, simplifying architecture, and improving data freshness across use cases such as log collection, audit, pingback, and member order processing.

Apache Iceberg · Big Data · Data Architecture

0 likes · 25 min read


vivo Internet Technology

Jan 11, 2023 · Cloud Native

Practices of Distributed Message Middleware at vivo: From RocketMQ to Kafka and Pulsar

vivo’s Internet Storage team details how it operates RocketMQ for low‑latency online services and Kafka for massive big‑data pipelines, outlines resource isolation, traffic balancing, intelligent throttling, and governance practices, and describes its migration from RabbitMQ and planned shift from Kafka to cloud‑native Pulsar.

Big Data · Kafka · Message Middleware

0 likes · 22 min read


Bilibili Tech

Dec 23, 2022 · Big Data

Data Service Platform Architecture and Design

The article outlines a standardized data‑service platform built atop a warehouse, detailing its construction, query, and gateway layers—supporting model definition, acceleration, reusable APIs, unified DSL/SQL interfaces, and observability—to solve ingestion, definition, and lineage issues, achieving 500+ APIs, sub‑day creation, and 18% cost reduction.

API Gateway · Big Data · Data Service

0 likes · 22 min read


Bilibili Tech

Nov 22, 2022 · Big Data

Overview of the Berserker Big Data Platform and Its Data Development Architecture

The Berserker big‑data platform provides a one‑stop data development and governance solution built on over 40 micro‑services, featuring the Archer scheduler with CN and EN nodes, Raft‑based state management, Docker‑isolated task execution, smart routing, and plans to make EN stateless, migrate to Kubernetes, and unify batch and streaming services.

Archer · Big Data · Docker

0 likes · 17 min read


Bilibili Tech

Oct 21, 2022 · Big Data

Kyuubi at Bilibili: Architecture, Enhancements, and Production Practices for Large‑Scale Data Processing

Bilibili adopted the open‑source Kyuubi proxy to replace its unstable STS layer, enabling multi‑tenant, multi‑engine (Spark, Presto, Flink) SQL/Scala processing with Hive Thrift compatibility, fine‑grained queue isolation, UI monitoring, stability safeguards, and Kubernetes/YARN deployment, while planning further cloud‑native extensions.

Big Data · Kubernetes · Kyuubi

0 likes · 20 min read
