How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray
The article details ByteDance's use of Ray and Ray Data to construct scalable audio and video data processing pipelines for multimodal AI models, addressing the challenges of massive data volume, resource constraints, and fault tolerance through careful pipeline design, Ray Data enhancements, and custom scheduling optimizations.
Ray Summit, the annual global event of the Ray community, was held from September 30 to October 2, 2024 in San Francisco with the theme "Where Builders Create the AI Future" and featured speakers from leading AI companies such as OpenAI, Meta, Google, Nvidia, and ByteDance.
ByteDance engineers Xiaohong Dong, Zhibei Ma, and Liguang Xie presented a talk titled "How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray".
Their team, part of ByteDance Seed (audio, vision) and Data Infra, aims to build high‑performance, scalable distributed data processing platforms to improve multimodal model capabilities.
They identified three major challenges: exponential data growth to petabyte scale, limited GPU/CPU resources, and increasingly complex processing tasks.
Ray was chosen as the solution because it can handle massive data, optimize heterogeneous resource allocation, and provide flexible orchestration.
The audio data processing pipeline is organized into three layers: infrastructure (storage, compute, scheduling), a custom data‑processing pipeline for audio, and an application layer that uses the processed data for downstream tasks such as music generation.
Key concepts in the pipeline include node (a task or operator that may require CPU or GPU) and flow (directed connections between nodes), assembled into a DAG via YAML.
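The node/flow abstraction can be sketched as a small DAG loader. The configuration below mirrors what a YAML pipeline spec might decode to; the operator names, resource fields, and the `execution_order` helper are illustrative, not ByteDance's actual API.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Illustrative config: "nodes" are operators with resource requirements,
# "flows" are directed edges between them. All names are made up here.
config = {
    "nodes": {
        "load_audio":   {"resources": {"CPU": 1}},
        "denoise":      {"resources": {"GPU": 1}},
        "transcribe":   {"resources": {"GPU": 1}},
        "write_output": {"resources": {"CPU": 1}},
    },
    "flows": [
        ("load_audio", "denoise"),
        ("denoise", "transcribe"),
        ("transcribe", "write_output"),
    ],
}

def execution_order(cfg):
    """Return the nodes in a valid topological execution order."""
    ts = TopologicalSorter()
    for node in cfg["nodes"]:
        ts.add(node)
    for upstream, downstream in cfg["flows"]:
        ts.add(downstream, upstream)  # downstream depends on upstream
    return list(ts.static_order())

print(execution_order(config))
# ['load_audio', 'denoise', 'transcribe', 'write_output']
```

A real scheduler would additionally use the per-node resource requirements to place CPU operators and GPU operators on appropriate workers.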
Initial issues included limited scalability, manual task scheduling, lack of high‑availability and fault‑tolerance, and cumbersome data transfer.
After initially building on Ray Core, they moved to Ray Data, which offers ready‑made operators, multimodal data source support, automatic data sharding, and auto‑scaling, reducing development effort.
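The automatic data sharding mentioned above can be illustrated with a toy round‑robin splitter; this is a stand‑in for what a framework like Ray Data does internally when it distributes input files across workers, not the actual implementation.

```python
def shard(files, num_shards):
    """Round-robin assignment of input files to shards — a toy stand-in
    for automatic data sharding across parallel workers."""
    shards = [[] for _ in range(num_shards)]
    for i, f in enumerate(files):
        shards[i % num_shards].append(f)
    return shards

print(shard([f"clip_{i}.wav" for i in range(5)], 2))
# [['clip_0.wav', 'clip_2.wav', 'clip_4.wav'], ['clip_1.wav', 'clip_3.wav']]
```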
Ray Serve was also employed for efficient model deployment, providing built‑in high availability and fault tolerance.
Advantages of Ray for audio pipelines include excellent scalability, flexible APIs, a rich data ecosystem, high‑performance distributed computing, and compatibility with existing tools like Pandas and Spark.
The video pipeline faces additional challenges due to the large size of video data and the need for intensive processing (e.g., ffmpeg encoding/decoding). The workflow involves segmenting videos, processing clips, storing metadata, uploading segments, and packaging them into Parquet files for training.
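The segment → process → package workflow can be sketched in miniature. The clip size, the fake "video" bytes, and all helper names below are invented for illustration; in the real pipeline, segmenting is done by FFmpeg and packaging produces Parquet files.

```python
CLIP_SIZE = 4  # bytes per clip in this toy example (real clips are seconds of video)

def segment(video: bytes, clip_size: int = CLIP_SIZE):
    """Split raw bytes into fixed-size clips (stand-in for FFmpeg segmenting)."""
    return [video[i:i + clip_size] for i in range(0, len(video), clip_size)]

def process(clip: bytes) -> dict:
    """Per-clip processing: record length and a checksum as toy 'metadata'."""
    return {"data": clip, "length": len(clip), "checksum": sum(clip) % 256}

def package(records: list) -> dict:
    """Pack processed clips into one training file (stand-in for Parquet packaging)."""
    return {
        "num_clips": len(records),
        "total_bytes": sum(r["length"] for r in records),
        "clips": [r["data"] for r in records],
    }

result = package([process(c) for c in segment(b"0123456789")])
print(result["num_clips"], result["total_bytes"])  # 3 10
```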
Direct use of Ray Data for video packaging revealed two problems: costly binary serialization and deserialization of large objects, and performance degradation when the object store spills to disk.
To overcome these problems, they fused all operations into a single actor running multiple threads, achieving high throughput and linear scalability as CPU resources are added.
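The fused-actor idea can be sketched without Ray: one worker object owns a thread pool and runs the whole chain per input, so intermediate results stay in-process instead of being serialized through an object store between stages. The class name and the toy stages are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

class FusedWorker:
    """One 'actor' that fuses all pipeline stages; intermediates never
    leave the process, avoiding cross-stage serialization costs."""

    def __init__(self, num_threads: int = 4):
        self.pool = ThreadPoolExecutor(max_workers=num_threads)

    def _process_one(self, item: int) -> int:
        # All stages fused into a single call: decode -> transform -> encode.
        decoded = item * 2          # stand-in for decoding
        transformed = decoded + 1   # stand-in for the transform stage
        return transformed          # stand-in for encoding/packaging

    def process_batch(self, items):
        # Threads provide intra-actor parallelism; map preserves input order.
        return list(self.pool.map(self._process_one, items))

worker = FusedWorker(num_threads=4)
print(worker.process_batch([1, 2, 3]))  # [3, 5, 7]
```

In the Ray setting, scaling out then means launching more such actors rather than adding more pipeline stages.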
Further enhancements include a task‑reallocation strategy in the Ray Data scheduler that redistributes failed tasks to other actors without relying on Ray Core's automatic actor restarts, and a lineage‑tracking mechanism that records input‑output relationships so that lost objects can be recomputed.
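The lineage-tracking idea can be sketched as a small bookkeeping class: every output records which function and inputs produced it, so a lost object can be recomputed on demand. This is a minimal illustration of the mechanism described in the talk, not ByteDance's implementation.

```python
class LineageTracker:
    """Records which inputs produced each output so lost objects
    can be recomputed instead of failing the whole job."""

    def __init__(self):
        self.lineage = {}  # output_id -> (func, input_ids)
        self.store = {}    # object_id -> value (stand-in for the object store)

    def run(self, func, input_ids, output_id):
        """Execute a task and remember its input-output relationship."""
        self.lineage[output_id] = (func, tuple(input_ids))
        self.store[output_id] = func(*(self.store[i] for i in input_ids))
        return output_id

    def get(self, object_id):
        """Fetch an object, transparently recomputing it from lineage if lost."""
        if object_id not in self.store:  # lost, e.g. pod eviction
            func, input_ids = self.lineage[object_id]
            self.store[object_id] = func(*(self.get(i) for i in input_ids))
        return self.store[object_id]

tracker = LineageTracker()
tracker.store["raw"] = 10
tracker.run(lambda x: x + 1, ["raw"], "step1")
tracker.run(lambda x: x * 2, ["step1"], "step2")
del tracker.store["step2"]        # simulate losing an object
print(tracker.get("step2"))  # 22, recomputed from lineage
```

Because `get` recurses through lineage, even a chain of lost objects can be rebuilt as long as the original inputs survive.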
These solutions improve resilience on unstable Kubernetes pods, ensuring that partial GPU loss does not halt the entire job.
In summary, the presentation covered building scalable audio/video data pipelines with Ray, handling unstable resources, and proposing Ray Data improvements to enhance fault tolerance and resource efficiency.
Rare Earth Juejin Tech Community