How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray
The article details ByteDance's use of Ray and Ray Data to construct scalable audio and video data processing pipelines for multimodal AI models, addressing the challenges of massive data volume, resource constraints, and fault tolerance through careful pipeline design, Ray Data enhancements, and custom scheduling optimizations.
Ray Summit, the annual global event of the Ray community, was held from September 30 to October 2, 2024 in San Francisco with the theme "Where Builders Create the AI Future" and featured speakers from leading AI companies such as OpenAI, Meta, Google, Nvidia, and ByteDance.
ByteDance engineers Xiaohong Dong, Zhibei Ma, and Liguang Xie presented a talk titled "How ByteDance Builds Large-Scale Data Processing Pipelines for Multimodal Models with Ray".
Their team, part of ByteDance Seed (audio, vision) and Data Infra, aims to build high‑performance, scalable distributed data processing platforms to improve multimodal model capabilities.
They identified three major challenges: exponential data growth to petabyte scale, limited GPU/CPU resources, and increasingly complex processing tasks.
Ray was chosen as the solution because it can handle massive data, optimize heterogeneous resource allocation, and provide flexible orchestration.
The audio data processing pipeline is organized into three layers: infrastructure (storage, compute, scheduling), a custom data‑processing pipeline for audio, and an application layer that uses the processed data for downstream tasks such as music generation.
Key concepts in the pipeline include node (a task or operator that may require CPU or GPU) and flow (directed connections between nodes), assembled into a DAG via YAML.
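The node/flow abstraction can be sketched as a small DAG loader. The configuration below mirrors what a YAML pipeline spec might decode to; the operator names, resource fields, and the `execution_order` helper are illustrative, not ByteDance's actual API.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Illustrative config: "nodes" are operators with resource requirements,
# "flows" are directed edges between them. All names are made up here.
config = {
    "nodes": {
        "load_audio":   {"resources": {"CPU": 1}},
        "denoise":      {"resources": {"GPU": 1}},
        "transcribe":   {"resources": {"GPU": 1}},
        "write_output": {"resources": {"CPU": 1}},
    },
    "flows": [
        ("load_audio", "denoise"),
        ("denoise", "transcribe"),
        ("transcribe", "write_output"),
    ],
}

def execution_order(cfg):
    """Return the nodes in a valid topological execution order."""
    ts = TopologicalSorter()
    for node in cfg["nodes"]:
        ts.add(node)
    for upstream, downstream in cfg["flows"]:
        ts.add(downstream, upstream)  # downstream depends on upstream
    return list(ts.static_order())

print(execution_order(config))
# ['load_audio', 'denoise', 'transcribe', 'write_output']
```

A real scheduler would additionally use the per-node resource requirements to place CPU operators and GPU operators on appropriate workers.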
Initial issues included limited scalability, manual task scheduling, lack of high‑availability and fault‑tolerance, and cumbersome data transfer.
After initially building on Ray Core, they moved to Ray Data, which offers ready‑made operators, multimodal data source support, automatic data sharding, and auto‑scaling, reducing development effort.
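The automatic data sharding mentioned above can be illustrated with a toy round‑robin splitter; this is a stand‑in for what a framework like Ray Data does internally when it distributes input files across workers, not the actual implementation.

```python
def shard(files, num_shards):
    """Round-robin assignment of input files to shards — a toy stand-in
    for automatic data sharding across parallel workers."""
    shards = [[] for _ in range(num_shards)]
    for i, f in enumerate(files):
        shards[i % num_shards].append(f)
    return shards

print(shard([f"clip_{i}.wav" for i in range(5)], 2))
# [['clip_0.wav', 'clip_2.wav', 'clip_4.wav'], ['clip_1.wav', 'clip_3.wav']]
```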
Ray Serve was also employed for efficient model deployment, providing built‑in high availability and fault tolerance.
Advantages of Ray for audio pipelines include excellent scalability, flexible APIs, a rich data ecosystem, high‑performance distributed computing, and compatibility with existing tools like Pandas and Spark.
The video pipeline faces additional challenges due to the large size of video data and the need for intensive processing (e.g., ffmpeg encoding/decoding). The workflow involves segmenting videos, processing clips, storing metadata, uploading segments, and packaging them into Parquet files for training.
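The segment → process → package workflow can be sketched in miniature. The clip size, the fake "video" bytes, and all helper names below are invented for illustration; in the real pipeline, segmenting is done by FFmpeg and packaging produces Parquet files.

```python
CLIP_SIZE = 4  # bytes per clip in this toy example (real clips are seconds of video)

def segment(video: bytes, clip_size: int = CLIP_SIZE):
    """Split raw bytes into fixed-size clips (stand-in for FFmpeg segmenting)."""
    return [video[i:i + clip_size] for i in range(0, len(video), clip_size)]

def process(clip: bytes) -> dict:
    """Per-clip processing: record length and a checksum as toy 'metadata'."""
    return {"data": clip, "length": len(clip), "checksum": sum(clip) % 256}

def package(records: list) -> dict:
    """Pack processed clips into one training file (stand-in for Parquet packaging)."""
    return {
        "num_clips": len(records),
        "total_bytes": sum(r["length"] for r in records),
        "clips": [r["data"] for r in records],
    }

result = package([process(c) for c in segment(b"0123456789")])
print(result["num_clips"], result["total_bytes"])  # 3 10
```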
Direct use of Ray Data for video packaging revealed two problems: costly binary serialization and deserialization of large objects, and performance degradation when the object store spills to disk.
To overcome these problems, they fused all operations into a single actor running multiple threads, achieving high throughput and linear scalability as CPU resources are added.
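The fused-actor idea can be sketched without Ray: one worker object owns a thread pool and runs the whole chain per input, so intermediate results stay in-process instead of being serialized through an object store between stages. The class name and the toy stages are illustrative only.

```python
from concurrent.futures import ThreadPoolExecutor

class FusedWorker:
    """One 'actor' that fuses all pipeline stages; intermediates never
    leave the process, avoiding cross-stage serialization costs."""

    def __init__(self, num_threads: int = 4):
        self.pool = ThreadPoolExecutor(max_workers=num_threads)

    def _process_one(self, item: int) -> int:
        # All stages fused into a single call: decode -> transform -> encode.
        decoded = item * 2          # stand-in for decoding
        transformed = decoded + 1   # stand-in for the transform stage
        return transformed          # stand-in for encoding/packaging

    def process_batch(self, items):
        # Threads provide intra-actor parallelism; map preserves input order.
        return list(self.pool.map(self._process_one, items))

worker = FusedWorker(num_threads=4)
print(worker.process_batch([1, 2, 3]))  # [3, 5, 7]
```

In the Ray setting, scaling out then means launching more such actors rather than adding more pipeline stages.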
Further enhancements include a task‑reallocation strategy in the Ray Data scheduler that redistributes failed tasks to other actors without relying on Ray Core's automatic actor restarts, and a lineage‑tracking mechanism that records input‑output relationships so that lost objects can be recomputed.
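The lineage-tracking idea can be sketched as a small bookkeeping class: every output records which function and inputs produced it, so a lost object can be recomputed on demand. This is a minimal illustration of the mechanism described in the talk, not ByteDance's implementation.

```python
class LineageTracker:
    """Records which inputs produced each output so lost objects
    can be recomputed instead of failing the whole job."""

    def __init__(self):
        self.lineage = {}  # output_id -> (func, input_ids)
        self.store = {}    # object_id -> value (stand-in for the object store)

    def run(self, func, input_ids, output_id):
        """Execute a task and remember its input-output relationship."""
        self.lineage[output_id] = (func, tuple(input_ids))
        self.store[output_id] = func(*(self.store[i] for i in input_ids))
        return output_id

    def get(self, object_id):
        """Fetch an object, transparently recomputing it from lineage if lost."""
        if object_id not in self.store:  # lost, e.g. pod eviction
            func, input_ids = self.lineage[object_id]
            self.store[object_id] = func(*(self.get(i) for i in input_ids))
        return self.store[object_id]

tracker = LineageTracker()
tracker.store["raw"] = 10
tracker.run(lambda x: x + 1, ["raw"], "step1")
tracker.run(lambda x: x * 2, ["step1"], "step2")
del tracker.store["step2"]        # simulate losing an object
print(tracker.get("step2"))  # 22, recomputed from lineage
```

Because `get` recurses through lineage, even a chain of lost objects can be rebuilt as long as the original inputs survive.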
These solutions improve resilience on unstable Kubernetes pods, ensuring that partial GPU loss does not halt the entire job.
In summary, the presentation covered building scalable audio/video data pipelines with Ray, handling unstable resources, and proposing Ray Data improvements to enhance fault tolerance and resource efficiency.
Rare Earth Juejin Tech Community