
Design Principles of the Spark Core – DataFun Introduction to Apache Spark (Part 1)

This article provides a comprehensive overview of Apache Spark, covering its origins, key characteristics, core concepts such as RDD, DAG, partitioning and dependencies, the internal architecture including SparkConf, SparkContext, SparkEnv, storage and scheduling systems, as well as deployment models and the company behind the product.


Introduction

The session is the first part of DataFun's "Deep Dive into Apache Spark" series. The speaker, Geng Jia'an, a senior architect at Shuxin Network and Spark Committer, introduces his background and the two Spark‑related products CyberEngine and CyberData.

Topic: Design Principles of the Spark Core

The presentation is organized into the following sections:

Getting Started with Spark

Key Features of Spark

Basic Concepts

Core Functions

Model Design

Deployment Architecture

Company Overview

01 – Getting Started with Spark

Apache Spark is a general‑purpose parallel computing framework created by the AMP Lab at UC Berkeley in 2009 and open‑sourced in 2010; it has since become one of the most active Apache projects in the big‑data ecosystem.

02 – What Are Spark’s Characteristics?

Spark’s five main characteristics are flexible memory management, flexible parallelism control, optional shuffle sorting, avoidance of recomputation, and reduced disk I/O.

Flexible Memory Management

Memory is divided into four quadrants along two axes (on‑heap vs. off‑heap, storage vs. execution), and execution and storage memory can borrow from each other to improve utilization. Since Spark 1.4, the Tungsten project provides off‑heap data structures and direct OS memory allocation, reducing JVM overhead. Spark also supports task‑level memory management.
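The borrowing rule between the two pools can be sketched as follows. This is an illustrative model in plain Python, not Spark's actual `UnifiedMemoryManager` API; the class and method names are hypothetical. The key asymmetry it shows: execution may evict cached storage blocks to reclaim space, while storage never evicts running tasks.

```python
# Illustrative sketch of unified memory borrowing (hypothetical names, not
# Spark's internal classes): execution and storage share one pool; execution
# can force eviction of cached blocks, storage cannot preempt execution.
class UnifiedMemoryPool:
    def __init__(self, total):
        self.total = total
        self.storage_used = 0
        self.execution_used = 0

    def acquire_execution(self, amount):
        free = self.total - self.storage_used - self.execution_used
        if free < amount:
            # Execution may evict cached blocks to reclaim borrowed memory.
            evict = min(self.storage_used, amount - free)
            self.storage_used -= evict
        free = self.total - self.storage_used - self.execution_used
        granted = min(amount, free)
        self.execution_used += granted
        return granted

    def acquire_storage(self, amount):
        # Storage only uses what execution has not claimed; it never preempts.
        free = self.total - self.storage_used - self.execution_used
        granted = amount if amount <= free else 0
        self.storage_used += granted
        return granted
```

With a 100‑unit pool, caching 60 units and then requesting 70 for execution succeeds because 30 cached units are evicted back.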

Flexible Parallelism Control

Spark abstracts stages via shuffle dependencies, allowing stages to run serially or in parallel. Within a stage, RDDs let users set custom parallelism to handle massive data workloads.

Optional Shuffle Sorting

Unlike early Hadoop MapReduce, which always sorts map output, Spark treats sorting as optional and can perform it on either the map side or the reduce side, depending on the scenario.

Avoiding Recomputation

The DAG lineage enables Spark to recompute failed stages, and checkpointing further prevents repeated work, which is critical for large‑scale stability.
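The recovery idea can be sketched in a few lines. This is a conceptual model in plain Python, not Spark's RDD implementation; the `LineageRDD` class and its fields are hypothetical stand-ins. It shows the two mechanisms the paragraph names: recomputing a lost partition from its lineage, and checkpointing to cut the lineage short.

```python
# Conceptual sketch of lineage-based recovery (hypothetical class, not
# Spark's internals): each dataset remembers how it was derived, so lost
# data can be recomputed from the nearest materialized ancestor.
class LineageRDD:
    def __init__(self, data=None, parent=None, fn=None):
        self.parent = parent      # upstream dataset in the lineage chain
        self.fn = fn              # transformation applied to the parent
        self.data = data          # materialized data; None means "lost"

    def map(self, fn):
        # Transformations only record lineage; nothing is computed yet.
        return LineageRDD(parent=self, fn=fn)

    def compute(self):
        if self.data is not None:
            return self.data
        # Recompute by walking up the lineage to materialized data.
        return [self.fn(x) for x in self.parent.compute()]

    def checkpoint(self):
        # Materialize and cut the lineage so future recoveries stop here.
        self.data = self.compute()
        self.parent, self.fn = None, None
```

After `checkpoint()`, recomputation no longer walks past this dataset, which is what keeps long lineages from becoming a stability risk.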

Reducing Disk I/O

Early Spark generated many small intermediate files per reducer; later versions introduced partition indexing to write sequentially and read sequentially, dramatically cutting random I/O. Spark also caches submitted JARs in memory for faster executor access.
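The partition‑indexed layout can be illustrated with a small sketch. This is not Spark's shuffle writer; the function names are hypothetical. The point is structural: one data file per map task in partition order, plus an index of offsets, so each reducer performs a single sequential read instead of opening many small files.

```python
# Illustrative sketch of a partition-indexed shuffle file (hypothetical
# functions): records are bucketed by reduce partition, concatenated in
# partition order, and an offset index marks each partition's byte range.
def write_shuffle_file(records, num_partitions, partitioner):
    buckets = [[] for _ in range(num_partitions)]
    for rec in records:
        buckets[partitioner(rec)].append(rec)
    data, index = [], [0]
    for bucket in buckets:
        data.extend(bucket)           # write partitions back to back
        index.append(len(data))       # record where this partition ends
    return data, index

def read_partition(data, index, pid):
    # A reducer fetches exactly its contiguous slice: one sequential read.
    return data[index[pid]:index[pid + 1]]
```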

03 – Basic Concepts

1. Resilient Distributed Dataset (RDD)

Transformations build a DAG of RDDs.

Actions trigger the DAG execution via DAGScheduler.

Each RDD originates from a data source that provides the initial partitioned data.

2. Directed Acyclic Graph (DAG)

A DAG represents the dependency graph of RDDs; it has no cycles, ensuring deterministic execution order.

3. Partition

Partitions are the basic units of parallelism; a custom partitioner can control how data is split across executors.
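A minimal sketch of hash partitioning, in the spirit of Spark's `HashPartitioner` (this plain‑Python version is illustrative; the real one uses the key's `hashCode` with a non‑negative modulus):

```python
# Illustrative hash partitioner: all records with the same key land in the
# same partition, which is what makes shuffle-based aggregation possible.
def hash_partition(key, num_partitions):
    return hash(key) % num_partitions

def partition_records(records, num_partitions):
    parts = [[] for _ in range(num_partitions)]
    for key, value in records:
        parts[hash_partition(key, num_partitions)].append((key, value))
    return parts
```

A custom partitioner simply replaces `hash_partition` with any key‑to‑index function, e.g. a range partitioner for sorted output.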

4. Dependency Types

Narrow Dependency : each child partition depends on a fixed, known set of parent partitions (OneToOne or Range).

Shuffle (Wide) Dependency : a child partition may depend on many parent partitions, determined by the partitioner.
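The contrast can be stated as two tiny functions (illustrative, hypothetical names): for a one‑to‑one narrow dependency the child partition reads exactly one parent partition with the same index, while for a shuffle dependency any parent may hold keys destined for the child, so the child must read from all of them.

```python
# Illustrative contrast between the two dependency types.
def narrow_parents(child_partition):
    # OneToOneDependency: the child reads the same-indexed parent partition.
    return [child_partition]

def shuffle_parents(child_partition, num_parent_partitions):
    # Shuffle dependency: every parent may hold keys for this child,
    # as determined by the partitioner, so all parents must be fetched.
    return list(range(num_parent_partitions))
```

This is why narrow dependencies can be pipelined inside a stage, while a shuffle dependency forces a stage boundary.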

5. Job / Stage / Task

An action creates a Job; DAGScheduler divides the DAG into Stages, which are further split into Tasks. Tasks are scheduled by TaskScheduler onto executors.
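The stage‑cutting rule can be sketched for a linear chain of operations (a simplification: Spark's DAGScheduler walks a general DAG, not a list, and these names are illustrative). A new stage begins at every shuffle dependency:

```python
# Sketch of stage splitting at shuffle boundaries (illustrative, not
# Spark's actual algorithm): ops is a list of (name, is_shuffle) pairs.
def split_into_stages(ops):
    stages, current = [], []
    for name, is_shuffle in ops:
        if is_shuffle and current:
            stages.append(current)    # a shuffle boundary closes the stage
            current = []
        current.append(name)
    if current:
        stages.append(current)
    return stages
```

Each resulting stage then becomes a TaskSet, with one task per partition, handed to the TaskScheduler.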

6. Why Use Scala vs. Java?

Scala offers functional programming, richer type inference, and syntactic sugar, making Spark APIs more concise, though it has a steeper learning curve than Java.

04 – Core Functions

1. Infrastructure

SparkConf : configuration management for Spark applications.

Built‑in RPC framework : Netty‑based communication between components.

Event Bus : internal communication within SparkContext.

Metrics System : monitors the health of Spark components.
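SparkConf is essentially a chainable key‑value store with defaults. The sketch below mimics that shape in plain Python (the real SparkConf is Scala, also reads system properties, and validates keys; this `Conf` class is a hypothetical stand‑in):

```python
# Minimal sketch of a SparkConf-style configuration object (illustrative).
class Conf:
    def __init__(self):
        self._settings = {}

    def set(self, key, value):
        self._settings[key] = value
        return self                   # chainable, like SparkConf.set

    def get(self, key, default=None):
        return self._settings.get(key, default)
```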

2. SparkContext

Encapsulates network communication, cluster deployment, storage, computation engine, metrics, and UI, exposing a simple API to developers.

3. SparkEnv

Provides task‑level components such as RPCEnv, serializer, broadcast manager, map output tracker, storage, metrics, and output commit coordinator.

4. Storage System

Prioritizes memory, falling back to disk when necessary. Tungsten optimizes off‑heap memory usage. Early versions also supported Alluxio (Tachyon) as an in‑memory distributed file system.
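The memory‑first, disk‑fallback policy can be sketched with a two‑tier store (a conceptual stand‑in for the MemoryStore/DiskStore split inside BlockManager; the `TieredStore` class is hypothetical):

```python
# Illustrative sketch of memory-first block storage: blocks go to an
# in-memory map until a budget is exceeded, then fall back to a disk map.
class TieredStore:
    def __init__(self, memory_budget):
        self.memory_budget = memory_budget
        self.memory, self.disk = {}, {}
        self.memory_used = 0

    def put(self, block_id, data, size):
        if self.memory_used + size <= self.memory_budget:
            self.memory[block_id] = data
            self.memory_used += size
        else:
            self.disk[block_id] = data    # fall back to disk

    def get(self, block_id):
        # Reads check memory first, then disk.
        return self.memory.get(block_id, self.disk.get(block_id))
```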

5. Scheduling System

DAGScheduler : creates Jobs, divides DAG into Stages, and generates TaskSets.

TaskScheduler : assigns Tasks to executors using FIFO, FAIR, etc., and interacts with external cluster managers (Standalone, YARN, Kubernetes).
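The two scheduling modes differ in ordering, which a toy sketch makes concrete (illustrative only; Spark's FAIR scheduler uses a weighted Schedulable tree, not simple round‑robin): FIFO drains jobs in submission order, while FAIR interleaves jobs across pools so a long‑running pool cannot starve the others.

```python
from collections import deque

# Toy contrast of FIFO vs. FAIR ordering (illustrative, hypothetical names).
def fifo_order(jobs):
    # FIFO: jobs run strictly in submission order.
    return list(jobs)

def fair_order(pools):
    # FAIR: pools is {pool_name: [job, ...]}; take one job per pool per
    # round so every pool makes progress.
    queues = {name: deque(jobs) for name, jobs in pools.items()}
    order = []
    while any(queues.values()):
        for name in pools:
            if queues[name]:
                order.append(queues[name].popleft())
    return order
```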

6. Execution Engine

Memory manager (heap/off‑heap, execution/storage).

Task‑level memory manager.

Tungsten off‑heap structures.

External sorter for map/reduce side sorting.

Shuffle manager for persisting intermediate data and fetching it on the reduce side.

05 – Model Design

1. Programming Model

A typical word‑count example shows how an action triggers job submission on the driver, which coordinates DAGScheduler, TaskScheduler, BlockManager, and RPCEnv, and interacts with the cluster manager to launch executors.
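The word‑count pipeline itself is the familiar flatMap → map → reduceByKey chain. Simulated here in plain Python for self‑containment (real Spark runs each step in parallel across partitions, and the reduce step happens after a shuffle):

```python
# Word count as the flatMap -> map -> reduceByKey pipeline, simulated
# locally; in Spark each stage runs distributed across executors.
def word_count(lines):
    words = [w for line in lines for w in line.split()]   # flatMap
    pairs = [(w, 1) for w in words]                       # map
    counts = {}                                           # reduceByKey
    for word, n in pairs:
        counts[word] = counts.get(word, 0) + n
    return counts
```

Only the final collection of `counts` corresponds to an action; everything before it would be a lazily recorded transformation in real Spark.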

2. RDD Computation Model

Partitions are processed in parallel across multiple executors; a partitioner determines how records are assigned to partitions.

3. Why Use Scala (vs. Java)

Scala’s functional features and concise syntax make Spark development more expressive.

06 – Deployment Architecture

1. Cluster Manager

Manages resources for Spark clusters (Standalone, YARN, Kubernetes) and provides fault tolerance.

2. Worker

Corresponds to Standalone Worker, YARN NodeManager, or Kubernetes Node, depending on the chosen manager.

3. Executor

Runs on cluster nodes to execute tasks.

4. Driver

Can run inside or outside the cluster; it orchestrates the application.

5. Application

A user‑written main class packaged as a JAR is submitted via spark-submit. If the driver runs on the client, it shares the JVM with the application; otherwise it runs as a separate process managed by the cluster manager.

07 – Company Overview

Zhejiang Shuxin Network Co., Ltd. focuses on multi‑cloud data intelligence platforms and data value circulation. Headquartered in Hangzhou with branches in Shanghai, Beijing, Shenzhen, it serves customers in over 50 cities worldwide. Its products include CyberEngine, CyberData, and CyberAI, with Spark and Flink as core compute engines, offering superior performance, stability, and cloud‑native capabilities compared to upstream open‑source versions.

Thank you for reading.

Tags: Big Data · Data Processing · Distributed Computing · Apache Spark · RDD · Spark Architecture
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
