
How BYD and Alibaba Cloud Use Argo Workflows to Efficiently Schedule Millions of Autonomous Driving Tasks

Facing more than 1 PB of sensor data per day, BYD replaced Airflow with a multi-cluster architecture built on Argo Workflows and Argo CD and integrated Ray for GPU workloads. The platform now runs 20 k-40 k concurrent workflows, with an 11-fold gain in task execution efficiency, roughly 30% lower compute cost, and workflow success rates approaching 99%.

Alibaba Cloud Infrastructure

May 26, 2026


At KubeCon + CloudNativeCon Europe 2026, BYD and Alibaba Cloud presented a talk titled “Empowering Autonomy: BYD's Journey Taming Million‑Task Scale With Argo Workflows,” describing how they rebuilt their autonomous‑driving data‑processing platform using Argo Workflows.

The platform must ingest at least 1 PB of multi-sensor data each day for auto-labeling. The existing Airflow solution hit bottlenecks in scalability, state synchronization, and version management, which caused scheduling blockages and made GitOps-style continuous delivery difficult to support.

To address these issues, the team designed a multi-cluster Argo Workflows architecture. Each Kubernetes cluster runs an identical Argo Workflows environment, and Argo CD manages all of them centrally for unified deployment and GitOps control. A Ray cluster handles GPU-intensive tasks, while CPU workloads run on Alibaba Cloud ECS instances with elastic scaling. This layered design provides high throughput, low latency, strong fault tolerance, and the ability to resume from any interruption point.
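One common way to keep an identical Argo Workflows installation on every cluster is an Argo CD ApplicationSet with the cluster generator, which stamps out one Application per registered cluster. The talk summary does not say whether BYD uses ApplicationSet, so the following is only a minimal sketch under that assumption. The repository URL, path, and resource names are hypothetical placeholders.

```python
import json


def build_applicationset(repo_url: str, path: str, namespace: str = "argo") -> dict:
    """Return an Argo CD ApplicationSet manifest as a plain dict.

    The cluster generator creates one Application per cluster registered
    in Argo CD and fills in the {{name}} and {{server}} placeholders.
    """
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "ApplicationSet",
        # Hypothetical name; "argocd" is the usual Argo CD namespace.
        "metadata": {"name": "argo-workflows-fleet", "namespace": "argocd"},
        "spec": {
            # An empty cluster generator selects every registered cluster.
            "generators": [{"clusters": {}}],
            "template": {
                "metadata": {"name": "{{name}}-argo-workflows"},
                "spec": {
                    "project": "default",
                    "source": {
                        "repoURL": repo_url,
                        "targetRevision": "HEAD",
                        "path": path,
                    },
                    "destination": {
                        "server": "{{server}}",
                        "namespace": namespace,
                    },
                    # Keep each cluster converged on what is in Git.
                    "syncPolicy": {"automated": {"prune": True, "selfHeal": True}},
                },
            },
        },
    }


if __name__ == "__main__":
    manifest = build_applicationset(
        repo_url="https://example.com/platform/argo-workflows-config.git",  # hypothetical
        path="deploy/argo-workflows",  # hypothetical
    )
    # JSON is valid YAML, so the output can be piped to `kubectl apply -f -`.
    print(json.dumps(manifest, indent=2))
```

Because every cluster is rendered from the same Git path, a version bump is a single commit, which is the GitOps property the Airflow setup reportedly lacked.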

After migration, the system supports 20 k–40 k concurrent workflows, delivering an 11× increase in task execution efficiency and reducing compute costs by roughly 30%. Workflow success rates approach 99%.

Resource management is refined through namespace‑level concurrency limits, Semaphore‑based fine‑grained control, and elastic quota mechanisms that prioritize high‑priority jobs while preventing resource starvation. The team also introduced a cache‑compare mechanism for informer updates, reduced API‑server load by cutting create/patch calls, and off‑loaded heavy operations (e.g., pod list queries, deletions) to asynchronous workers, cutting CPU usage by about 50%.
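To make the semaphore-based control concrete, here is a minimal sketch of how a workflow-level semaphore is typically declared in Argo Workflows: a ConfigMap key holds the limit, and each Workflow references that key under `spec.synchronization`. The names, namespace, limit, and container image are hypothetical and are not taken from BYD's setup. The sketch uses the singular `semaphore` field; newer Argo Workflows releases also offer a plural `semaphores` list, so check the documentation for your version.

```python
import json
from typing import Dict


def semaphore_configmap(name: str, namespace: str, limits: Dict[str, int]) -> dict:
    """ConfigMap whose keys are semaphore names and whose values are limits."""
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {"name": name, "namespace": namespace},
        # ConfigMap values must be strings.
        "data": {key: str(limit) for key, limit in limits.items()},
    }


def workflow_with_semaphore(
    generate_name: str,
    namespace: str,
    configmap_name: str,
    key: str,
    image: str = "busybox",  # hypothetical placeholder image
) -> dict:
    """Workflow that must acquire the named semaphore before it runs."""
    return {
        "apiVersion": "argoproj.io/v1alpha1",
        "kind": "Workflow",
        "metadata": {"generateName": generate_name, "namespace": namespace},
        "spec": {
            "entrypoint": "main",
            # At most <limit> workflows referencing this key run at once;
            # the rest stay pending until a holder finishes.
            "synchronization": {
                "semaphore": {
                    "configMapKeyRef": {"name": configmap_name, "key": key}
                }
            },
            "templates": [
                {
                    "name": "main",
                    "container": {"image": image, "command": ["echo", "labeling"]},
                }
            ],
        },
    }


if __name__ == "__main__":
    cm = semaphore_configmap("labeling-limits", "autolabel", {"gpu-jobs": 8})
    wf = workflow_with_semaphore("autolabel-", "autolabel", "labeling-limits", "gpu-jobs")
    print(json.dumps([cm, wf], indent=2))
```

Namespace-level limits are configured separately, in the workflow controller's ConfigMap; the `namespaceParallelism` setting there caps how many workflows run concurrently in each namespace.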

Monitoring data show a pending-workflow queue that can hold up to 200 k items, typical queue latency of about 50 ms, and 20 k-40 k workflows running simultaneously, which confirms the platform's parallel processing capacity.

Beyond internal use, the team contributed back to the open‑source Argo Workflows project, submitting pull requests that address informer/cache performance and controller OOM issues observed in million‑task production environments.

Overall, the migration to a cloud‑native, multi‑cluster Argo Workflows system enabled BYD to overcome Airflow’s limits, scale to millions of autonomous‑driving tasks, improve efficiency, lower costs, and showcase the practical value of cloud‑native workflow orchestration for large‑scale AI data pipelines.


Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.

