Building Complex Distributed Systems with Ray: An AutoML Case Study and Cloud‑Native Deployment
This article explains how the Ray distributed computing engine simplifies the design, deployment, and operation of complex cloud‑native distributed systems. Using an AutoML service as a running example, it covers the sources of distributed‑system complexity, Ray's core concepts, resource customization, runtime environments, monitoring, and ecosystem integrations.
Introduction – Ray is a distributed computing engine that can serve as foundational infrastructure for big data, AI, and any workload requiring a distributed system. The article uses an AutoML case to show the challenges of building such a system without Ray and how Ray resolves them.
1. Distributed System Complexity
The AutoML service consists of proxy, trainer, and worker roles, each with distinct responsibilities and its own lifecycle.
Without Ray, developers must manually handle scheduling (asyncio), communication (protobuf/gRPC), storage (RocksDB, HDFS), deployment (Docker, Kubernetes operators), and monitoring (Grafana + Prometheus).
2. Ray Overview
Open‑source project from UC Berkeley RISELab, rapidly gaining stars and adoption in AI (e.g., OpenAI, Google Pathways).
Provides a generic distributed programming model that abstracts functions as Tasks and classes as Actors, backed by a distributed object store.
3. Ray Core APIs
Task – stateless remote function (declared with the @ray.remote decorator; results retrieved with ray.get).
Object Store – distributed in‑memory store with zero‑copy transfer, automatic garbage collection, and spilling to disk.
Actor – stateful remote class instance, enabling persistent services.
4. Building the AutoML Service with Ray
Cluster deployment – one‑click ray up on any cloud/Kubernetes or manual ray start for head and worker nodes.
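As a rough illustration of the two deployment paths (the flags shown are the common ones; `<head-ip>` is a placeholder for the head node's address):

```shell
# Manual path: start a head node, then join workers to it.
ray start --head --port=6379 --dashboard-host=0.0.0.0
ray start --address=<head-ip>:6379

# One-click path: launch a whole cluster from a config file.
ray up cluster.yaml
```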
Components – Proxy and Trainer are implemented as Actors (stateful services); Workers are stateless Tasks that train and evaluate models; the client uses Ray Client to obtain the proxy handle and submit jobs.
Resource customization – CPU, GPU, memory, and custom resources can be specified directly in the Ray decorator, eliminating separate YAML specifications.
Runtime environment – Ray supports plug‑in environments (process, virtualenv, Conda, container) that can be attached to each Task/Actor at runtime.
Operations & monitoring – Ray Dashboard shows nodes, actors, tasks, logs, and profiling; Ray State Client enables programmatic queries; metrics can be exported to Grafana.
5. Ray Architecture
Head node runs the Global Control Service (GCS) for scheduling and node management.
Worker nodes run Raylet processes that host the object store and execute Tasks/Actors.
6. Ecosystem & Open‑Source Projects
Native libraries: RLlib, Ray Serve, Ray Tune, Ray AIR (AI Runtime) for end‑to‑end pipelines.
Third‑party integrations: Spark, Dask, Mars, Alpa, Colossal‑AI, trlX, GeaFlow, Yinyu (privacy computing), etc.
Enterprise adoption across many global and Chinese companies for AI services, large‑model training, and data processing.
7. Q&A Highlights
Ray supports real‑time streaming, function‑as‑a‑service, and can coexist with Spark.
Object store provides temporary, distributed storage; persistence must be handled by the user.
Fault tolerance is managed at the process level by Ray, while fine‑grained state recovery is left to the application.
Overall, Ray abstracts away the boilerplate of building distributed systems, allowing developers to focus on business logic while leveraging cloud‑native deployment, flexible resource management, and a rich AI‑oriented ecosystem.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.