Understanding Apache Spark Architecture: RDD, Computation Model, Cluster Modes, RPC, and Core Components
This article provides a comprehensive overview of Apache Spark's architecture, covering its RDD abstraction, computation model, various cluster deployment modes, RPC communication layer, startup procedures, core components, interaction flows, and block management for broadcast variables.
Apache Spark is an open‑source, general‑purpose cluster computing system that offers high‑level programming APIs for Scala, Java, and Python; its core is written in Scala, leveraging functional programming for efficient abstraction across computation layers.
RDD Abstraction
Resilient Distributed Datasets (RDDs) are an in‑memory abstraction of distributed data that provides fault tolerance through a restricted form of shared memory, offering higher efficiency than traditional data‑flow models. An RDD has five key properties:
A set of partitions
A function to compute each partition
A set of dependencies on other RDDs
Optionally, a Partitioner for key‑value RDDs (usually HashPartitioner)
Optionally, preferred location information (e.g., HDFS block locations)
These properties enable RDDs to express distributed datasets and serve as the foundation for building DAGs.
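The five properties above can be sketched as a toy class. This is an illustrative model only, not Spark's actual API: `ToyRDD` and `ToyPartition` are invented names, and real RDDs are lazily evaluated Scala objects.

```python
class ToyPartition:
    def __init__(self, index, data):
        self.index = index
        self.data = data

class ToyRDD:
    def __init__(self, partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.partitions = partitions              # 1. a set of partitions
        self.compute = compute                    # 2. a function to compute each partition
        self.dependencies = dependencies          # 3. dependencies on parent RDDs
        self.partitioner = partitioner            # 4. optional Partitioner (key-value RDDs)
        self.preferred_locations = preferred_locations or {}  # 5. optional locality hints

    def map(self, f):
        # A narrow transformation: each child partition depends on exactly
        # one parent partition, so no shuffle is needed.
        return ToyRDD(
            partitions=self.partitions,
            compute=lambda p: [f(x) for x in self.compute(p)],
            dependencies=(self,),
        )

    def collect(self):
        # An action: evaluate every partition and concatenate the results.
        out = []
        for p in self.partitions:
            out.extend(self.compute(p))
        return out

base = ToyRDD(
    partitions=[ToyPartition(0, [1, 2]), ToyPartition(1, [3, 4])],
    compute=lambda p: p.data,
)
doubled = base.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8]
```

Note how the dependency chain (`doubled` → `base`) is exactly the lineage information Spark uses to rebuild lost partitions and to construct the DAG.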
Computation Abstraction
Key concepts in Spark’s computation model include:
Application: the user‑written Spark program consisting of a Driver and a set of Executors.
Job: generated each time an Action is called; a Job contains multiple Stages.
Stage: either ShuffleMapStage or ResultStage, created at shuffle boundaries.
TaskSet: a collection of Tasks with the same execution logic, scheduled as a unit.
Task: the basic unit running on a physical node, either ShuffleMapTask or ResultTask.
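The Job-to-Stage decomposition above can be sketched as follows. This is a simplified model, not Spark's DAGScheduler: the operator chain is linear here, `Op` and `split_into_stages` are invented names, and the rule shown is just the core idea of cutting the chain at each wide (shuffle) dependency.

```python
from collections import namedtuple

# An operator chain, oldest first; is_shuffle marks a wide dependency.
Op = namedtuple("Op", ["name", "is_shuffle"])

def split_into_stages(ops):
    """Group a linear operator chain into stages, cutting before each shuffle."""
    stages, current = [], []
    for op in ops:
        if op.is_shuffle and current:
            stages.append(current)   # close the preceding ShuffleMapStage
            current = []
        current.append(op.name)
    if current:
        stages.append(current)       # the trailing ResultStage
    return stages

job = [Op("textFile", False), Op("map", False),
       Op("reduceByKey", True), Op("mapValues", False),
       Op("sortByKey", True), Op("collect", False)]

print(split_into_stages(job))
# [['textFile', 'map'], ['reduceByKey', 'mapValues'], ['sortByKey', 'collect']]
```

Each resulting group would be submitted as one TaskSet, with one Task per partition.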
Cluster Modes
Spark separates resource management into a pluggable layer, supporting three cluster managers:
Standalone mode: Spark's built‑in cluster manager, with a Master‑Worker architecture.
YARN mode: integrates with Hadoop YARN for resource negotiation.
Mesos mode: runs on Apache Mesos, allowing fine‑grained or coarse‑grained scheduling.
The design enables third‑party resource managers to be easily integrated via abstract SchedulerBackend interfaces.
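The pluggable design can be sketched as a single abstract interface with one implementation per cluster mode. The class and method names below are illustrative; Spark's real abstraction is the `SchedulerBackend` trait in Scala, and the actual implementations do far more than return a string.

```python
from abc import ABC, abstractmethod

class SchedulerBackend(ABC):
    @abstractmethod
    def start(self) -> str:
        """Connect to the cluster manager and begin acquiring resources."""

class StandaloneBackend(SchedulerBackend):
    def start(self):
        return "registered with standalone Master"

class YarnBackend(SchedulerBackend):
    def start(self):
        return "negotiating containers with the YARN ResourceManager"

class MesosBackend(SchedulerBackend):
    def start(self):
        return "accepting resource offers from the Mesos master"

def launch(backend: SchedulerBackend) -> str:
    # The higher-level task scheduler only ever sees the abstract interface,
    # so a third-party resource manager just supplies another subclass.
    return backend.start()

for b in (StandaloneBackend(), YarnBackend(), MesosBackend()):
    print(launch(b))
```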
RPC Network Communication Abstraction
Spark’s RPC layer is built on Netty but abstracts the underlying details, exposing RpcEndpoint and RpcEndpointRef objects managed by a central RpcEnv . This design allows alternative RPC frameworks to be plugged in without affecting higher‑level components.
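A minimal in-process sketch of this pattern is shown below. Real Spark routes messages over Netty and the API is asynchronous Scala; here a dictionary lookup stands in for the network, and the class names mirror (but are not) the actual API.

```python
class RpcEndpoint:
    """Receives messages; subclasses implement the reply logic."""
    def receive_and_reply(self, message):
        raise NotImplementedError

class RpcEndpointRef:
    """A handle used to send messages to a named endpoint via its RpcEnv."""
    def __init__(self, env, name):
        self.env, self.name = env, name

    def ask(self, message):
        return self.env._deliver(self.name, message)

class RpcEnv:
    """Central registry that owns endpoints and routes messages to them."""
    def __init__(self):
        self._endpoints = {}

    def setup_endpoint(self, name, endpoint):
        self._endpoints[name] = endpoint
        return RpcEndpointRef(self, name)

    def _deliver(self, name, message):
        # In real Spark this hop would cross the network via Netty.
        return self._endpoints[name].receive_and_reply(message)

class EchoEndpoint(RpcEndpoint):
    def receive_and_reply(self, message):
        return f"echo: {message}"

env = RpcEnv()
ref = env.setup_endpoint("echo", EchoEndpoint())
print(ref.ask("ping"))  # echo: ping
```

Because callers only hold an `RpcEndpointRef`, swapping the transport underneath `RpcEnv` leaves every higher-level component untouched, which is exactly the pluggability the design aims for.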
Starting a Standalone Cluster
The startup flow in Standalone mode includes:
Master creates an RpcEnv and registers a Master endpoint.
Workers start their own RpcEnv and register Worker endpoints.
Workers connect to the Master, register host, port, CPU, and memory.
Master acknowledges registration and begins heartbeat monitoring.
When a user submits a Spark application, the Master coordinates Driver launch.
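The Worker-to-Master registration step above can be sketched with plain objects. In real Spark, Master and Worker are RpcEndpoints exchanging `RegisterWorker`/`RegisteredWorker` messages over RpcEnv; the classes and values below are simplified stand-ins.

```python
class Master:
    def __init__(self):
        self.workers = {}   # worker_id -> registered resources

    def register_worker(self, worker_id, host, port, cores, memory_mb):
        # Record the worker's resources and acknowledge registration;
        # heartbeat monitoring of this worker would start here.
        self.workers[worker_id] = dict(host=host, port=port,
                                       cores=cores, memory_mb=memory_mb)
        return "RegisteredWorker"

class Worker:
    def __init__(self, worker_id, host, port, cores, memory_mb):
        self.worker_id, self.host, self.port = worker_id, host, port
        self.cores, self.memory_mb = cores, memory_mb

    def register_with(self, master):
        # Report host, port, CPU, and memory to the Master.
        return master.register_worker(self.worker_id, self.host, self.port,
                                      self.cores, self.memory_mb)

master = Master()
worker = Worker("worker-1", "10.0.0.5", 7078, cores=8, memory_mb=16384)
print(worker.register_with(master))  # RegisteredWorker
print(master.workers["worker-1"]["cores"])  # 8
```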
Core Components
The runtime core consists of Driver and Executor processes, each hosting a SparkEnv that contains most of the necessary subsystems (scheduler, block manager, RPC environment, etc.), so the two processes share a common component structure.
Core Component Interaction Process
Key interaction points:
An Application launches a Driver.
The Driver tracks resource and task status.
The Driver manages a set of Executors.
Each Executor runs Tasks belonging to its Driver.
Detailed flows (submission, Driver start, Application registration, Executor launch, Task execution, and Task completion) are illustrated with colored arrows in the diagrams in the original article.
Block Management
Block management underlies Spark’s broadcast mechanism. Broadcast variables are split into 4 MB blocks; Executors fetch needed blocks from peers, reducing network load and enabling local reuse. The process involves the Driver maintaining block locations, Executors requesting missing blocks, and caching received blocks for subsequent tasks.
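The block-splitting idea can be sketched as follows. A tiny block size is used so the example stays readable (Spark's TorrentBroadcast default is 4 MB); `ExecutorCache` and the other names are illustrative, not Spark's BlockManager API.

```python
BLOCK_SIZE = 4  # bytes, for this toy example only; Spark's default is 4 MB

def split_into_blocks(payload: bytes, block_size: int = BLOCK_SIZE):
    """Cut a broadcast payload into fixed-size blocks."""
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]

class ExecutorCache:
    def __init__(self):
        self.blocks = {}   # block index -> bytes

    def fetch_missing(self, num_blocks, peer_lookup):
        # Ask peers (or the driver) only for blocks not already cached,
        # which is what spreads the network load across the cluster.
        for idx in range(num_blocks):
            if idx not in self.blocks:
                self.blocks[idx] = peer_lookup(idx)

    def assemble(self):
        # Reassemble the full broadcast value for local reuse by later tasks.
        return b"".join(self.blocks[i] for i in sorted(self.blocks))

payload = b"broadcast-variable-data"
blocks = split_into_blocks(payload)

cache = ExecutorCache()
cache.blocks[0] = blocks[0]                                   # already held locally
cache.fetch_missing(len(blocks), lambda i: blocks[i])         # fetch the rest from peers
print(len(blocks), cache.assemble() == payload)
```

Once assembled, the blocks stay in the executor's cache, so every subsequent task on that executor reads the broadcast value without any further network traffic.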
Source: 简单之美 (http://shiyanjun.cn/archives/1545.html)