Understanding Apache Spark Architecture: RDD, Computation Model, Cluster Modes, RPC, and Core Components
This article provides a comprehensive overview of Apache Spark's architecture, covering its RDD abstraction, computation model, various cluster deployment modes, RPC communication layer, startup procedures, core components, interaction flows, and block management for broadcast variables.
Apache Spark is an open‑source, general‑purpose cluster computing system that offers high‑level programming APIs for Scala, Java, and Python; its core is written in Scala, leveraging functional programming for efficient abstraction across computation layers.
RDD Abstraction
Resilient Distributed Datasets (RDDs) are an in‑memory abstraction of distributed data that provides fault tolerance through a restricted form of shared memory, offering higher efficiency than traditional data‑flow models. An RDD has five key properties:
A set of partitions
A function to compute each partition
A set of dependencies on other RDDs
Optionally, a Partitioner for key‑value RDDs (usually HashPartitioner)
Optionally, preferred location information (e.g., HDFS block locations)
These properties enable RDDs to express distributed datasets and serve as the foundation for building DAGs.
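The five properties above can be sketched as a toy class. This is an illustrative model only, not Spark's actual API: `ToyRDD` and `ToyPartition` are invented names, and real RDDs are lazily evaluated Scala objects.

```python
class ToyPartition:
    def __init__(self, index, data):
        self.index = index
        self.data = data

class ToyRDD:
    def __init__(self, partitions, compute, dependencies=(),
                 partitioner=None, preferred_locations=None):
        self.partitions = partitions              # 1. a set of partitions
        self.compute = compute                    # 2. a function to compute each partition
        self.dependencies = dependencies          # 3. dependencies on parent RDDs
        self.partitioner = partitioner            # 4. optional Partitioner (key-value RDDs)
        self.preferred_locations = preferred_locations or {}  # 5. optional locality hints

    def map(self, f):
        # A narrow transformation: each child partition depends on exactly
        # one parent partition, so no shuffle is needed.
        return ToyRDD(
            partitions=self.partitions,
            compute=lambda p: [f(x) for x in self.compute(p)],
            dependencies=(self,),
        )

    def collect(self):
        # An action: evaluate every partition and concatenate the results.
        out = []
        for p in self.partitions:
            out.extend(self.compute(p))
        return out

base = ToyRDD(
    partitions=[ToyPartition(0, [1, 2]), ToyPartition(1, [3, 4])],
    compute=lambda p: p.data,
)
doubled = base.map(lambda x: x * 2)
print(doubled.collect())  # [2, 4, 6, 8]
```

Note how the dependency chain (`doubled` → `base`) is exactly the lineage information Spark uses to rebuild lost partitions and to construct the DAG.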
Computation Abstraction
Key concepts in Spark’s computation model include:
Application: the user‑written Spark program consisting of a Driver and a set of Executors.
Job: generated each time an Action is called; a Job contains multiple Stages.
Stage: either ShuffleMapStage or ResultStage, created at shuffle boundaries.
TaskSet: a collection of Tasks with the same execution logic, scheduled as a unit.
Task: the basic unit running on a physical node, either ShuffleMapTask or ResultTask.
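The Job-to-Stage decomposition above can be sketched as follows. This is a simplified model, not Spark's DAGScheduler: the operator chain is linear here, `Op` and `split_into_stages` are invented names, and the rule shown is just the core idea of cutting the chain at each wide (shuffle) dependency.

```python
from collections import namedtuple

# An operator chain, oldest first; is_shuffle marks a wide dependency.
Op = namedtuple("Op", ["name", "is_shuffle"])

def split_into_stages(ops):
    """Group a linear operator chain into stages, cutting before each shuffle."""
    stages, current = [], []
    for op in ops:
        if op.is_shuffle and current:
            stages.append(current)   # close the preceding ShuffleMapStage
            current = []
        current.append(op.name)
    if current:
        stages.append(current)       # the trailing ResultStage
    return stages

job = [Op("textFile", False), Op("map", False),
       Op("reduceByKey", True), Op("mapValues", False),
       Op("sortByKey", True), Op("collect", False)]

print(split_into_stages(job))
# [['textFile', 'map'], ['reduceByKey', 'mapValues'], ['sortByKey', 'collect']]
```

Each resulting group would be submitted as one TaskSet, with one Task per partition.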
Cluster Modes
Spark separates resource management into a pluggable layer, supporting three cluster managers:
Standalone mode: Spark's built‑in cluster manager, with a Master‑Worker architecture.
YARN mode: integrates with Hadoop YARN for resource negotiation.
Mesos mode: runs on Apache Mesos, allowing fine‑grained or coarse‑grained scheduling.
The design enables third‑party resource managers to be easily integrated via abstract SchedulerBackend interfaces.
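The pluggable design can be sketched as a single abstract interface with one implementation per cluster mode. The class and method names below are illustrative; Spark's real abstraction is the `SchedulerBackend` trait in Scala, and the actual implementations do far more than return a string.

```python
from abc import ABC, abstractmethod

class SchedulerBackend(ABC):
    @abstractmethod
    def start(self) -> str:
        """Connect to the cluster manager and begin acquiring resources."""

class StandaloneBackend(SchedulerBackend):
    def start(self):
        return "registered with standalone Master"

class YarnBackend(SchedulerBackend):
    def start(self):
        return "negotiating containers with the YARN ResourceManager"

class MesosBackend(SchedulerBackend):
    def start(self):
        return "accepting resource offers from the Mesos master"

def launch(backend: SchedulerBackend) -> str:
    # The higher-level task scheduler only ever sees the abstract interface,
    # so a third-party resource manager just supplies another subclass.
    return backend.start()

for b in (StandaloneBackend(), YarnBackend(), MesosBackend()):
    print(launch(b))
```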
RPC Network Communication Abstraction
Spark’s RPC layer is built on Netty but abstracts the underlying details, exposing RpcEndpoint and RpcEndpointRef objects managed by a central RpcEnv . This design allows alternative RPC frameworks to be plugged in without affecting higher‑level components.
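A minimal in-process sketch of this pattern is shown below. Real Spark routes messages over Netty and the API is asynchronous Scala; here a dictionary lookup stands in for the network, and the class names mirror (but are not) the actual API.

```python
class RpcEndpoint:
    """Receives messages; subclasses implement the reply logic."""
    def receive_and_reply(self, message):
        raise NotImplementedError

class RpcEndpointRef:
    """A handle used to send messages to a named endpoint via its RpcEnv."""
    def __init__(self, env, name):
        self.env, self.name = env, name

    def ask(self, message):
        return self.env._deliver(self.name, message)

class RpcEnv:
    """Central registry that owns endpoints and routes messages to them."""
    def __init__(self):
        self._endpoints = {}

    def setup_endpoint(self, name, endpoint):
        self._endpoints[name] = endpoint
        return RpcEndpointRef(self, name)

    def _deliver(self, name, message):
        # In real Spark this hop would cross the network via Netty.
        return self._endpoints[name].receive_and_reply(message)

class EchoEndpoint(RpcEndpoint):
    def receive_and_reply(self, message):
        return f"echo: {message}"

env = RpcEnv()
ref = env.setup_endpoint("echo", EchoEndpoint())
print(ref.ask("ping"))  # echo: ping
```

Because callers only hold an `RpcEndpointRef`, swapping the transport underneath `RpcEnv` leaves every higher-level component untouched, which is exactly the pluggability the design aims for.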
Starting a Standalone Cluster
The startup flow in Standalone mode includes:
Master creates an RpcEnv and registers a Master endpoint.
Workers start their own RpcEnv and register Worker endpoints.
Workers connect to the Master, register host, port, CPU, and memory.
Master acknowledges registration and begins heartbeat monitoring.
When a user submits a Spark application, the Master coordinates Driver launch.
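The Worker-to-Master registration step above can be sketched with plain objects. In real Spark, Master and Worker are RpcEndpoints exchanging `RegisterWorker`/`RegisteredWorker` messages over RpcEnv; the classes and values below are simplified stand-ins.

```python
class Master:
    def __init__(self):
        self.workers = {}   # worker_id -> registered resources

    def register_worker(self, worker_id, host, port, cores, memory_mb):
        # Record the worker's resources and acknowledge registration;
        # heartbeat monitoring of this worker would start here.
        self.workers[worker_id] = dict(host=host, port=port,
                                       cores=cores, memory_mb=memory_mb)
        return "RegisteredWorker"

class Worker:
    def __init__(self, worker_id, host, port, cores, memory_mb):
        self.worker_id, self.host, self.port = worker_id, host, port
        self.cores, self.memory_mb = cores, memory_mb

    def register_with(self, master):
        # Report host, port, CPU, and memory to the Master.
        return master.register_worker(self.worker_id, self.host, self.port,
                                      self.cores, self.memory_mb)

master = Master()
worker = Worker("worker-1", "10.0.0.5", 7078, cores=8, memory_mb=16384)
print(worker.register_with(master))  # RegisteredWorker
print(master.workers["worker-1"]["cores"])  # 8
```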
Core Components
The runtime core consists of Driver and Executor processes, each hosting a SparkEnv that contains most of the necessary subsystems (scheduler, block manager, RPC environment, etc.), so the two processes share a common component structure.
Core Component Interaction Process
Key interaction points:
An Application launches a Driver.
The Driver tracks resource and task status.
The Driver manages a set of Executors.
Each Executor runs Tasks belonging to its Driver.
Detailed flows (submission, Driver start, Application registration, Executor launch, Task execution, and Task completion) are illustrated with colored arrows in the diagrams in the original article.
Block Management
Block management underlies Spark’s broadcast mechanism. Broadcast variables are split into 4 MB blocks; Executors fetch needed blocks from peers, reducing network load and enabling local reuse. The process involves the Driver maintaining block locations, Executors requesting missing blocks, and caching received blocks for subsequent tasks.
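The block-splitting idea can be sketched as follows. A tiny block size is used so the example stays readable (Spark's TorrentBroadcast default is 4 MB); `ExecutorCache` and the other names are illustrative, not Spark's BlockManager API.

```python
BLOCK_SIZE = 4  # bytes, for this toy example only; Spark's default is 4 MB

def split_into_blocks(payload: bytes, block_size: int = BLOCK_SIZE):
    """Cut a broadcast payload into fixed-size blocks."""
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]

class ExecutorCache:
    def __init__(self):
        self.blocks = {}   # block index -> bytes

    def fetch_missing(self, num_blocks, peer_lookup):
        # Ask peers (or the driver) only for blocks not already cached,
        # which is what spreads the network load across the cluster.
        for idx in range(num_blocks):
            if idx not in self.blocks:
                self.blocks[idx] = peer_lookup(idx)

    def assemble(self):
        # Reassemble the full broadcast value for local reuse by later tasks.
        return b"".join(self.blocks[i] for i in sorted(self.blocks))

payload = b"broadcast-variable-data"
blocks = split_into_blocks(payload)

cache = ExecutorCache()
cache.blocks[0] = blocks[0]                                   # already held locally
cache.fetch_missing(len(blocks), lambda i: blocks[i])         # fetch the rest from peers
print(len(blocks), cache.assemble() == payload)
```

Once assembled, the blocks stay in the executor's cache, so every subsequent task on that executor reads the broadcast value without any further network traffic.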
Source: 简单之美 (http://shiyanjun.cn/archives/1545.html)