
Understanding Spark Memory Management: On‑Heap and Off‑Heap Planning and Allocation

This article explains Spark's memory management architecture, covering on‑heap and off‑heap memory planning, the MemoryManager interface, static versus unified memory allocation strategies, and how dynamic borrowing improves resource utilization for Spark executors.


Spark, as an in‑memory distributed computing engine, relies heavily on its memory management module; understanding its principles helps developers write efficient Spark applications and tune performance.

On‑heap memory is set with the --executor-memory flag or the spark.executor.memory property. The executor's JVM heap is divided into storage memory (for cached RDD partitions and broadcast variables), execution memory (for shuffles, joins, sorts, and aggregations), and the remaining space for user code and Spark's internal objects. Spark tracks allocations and releases through its own bookkeeping, but actual reclamation is left to the JVM garbage collector.

Off‑heap memory is optional: it is enabled with spark.memory.offHeap.enabled and sized with spark.memory.offHeap.size. Spark stores serialized binary data directly in process memory outside the JVM heap using the JDK Unsafe API, which reduces GC overhead and lets Spark allocate and free memory precisely rather than waiting for garbage collection.
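As an illustrative sketch only: Spark's off‑heap mode uses the JDK Unsafe API internally, but a direct ByteBuffer demonstrates the same idea of storing bytes outside the GC‑managed heap. OffHeapSketch and roundTrip are hypothetical names, not Spark API.

```scala
import java.nio.ByteBuffer

// Illustrative analogy for off-heap storage: a direct buffer lives outside
// the JVM heap, so its contents are not moved or scanned by the GC.
object OffHeapSketch {
  // Writes a long into off-heap (direct) memory and reads it back.
  def roundTrip(value: Long): Long = {
    val buf = ByteBuffer.allocateDirect(java.lang.Long.BYTES) // allocated off-heap
    buf.putLong(0, value)
    buf.getLong(0)
  }
}
```

The buffer's backing memory is released natively when the buffer is reclaimed, which is the same property Spark exploits to manage serialized blocks without GC pressure.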

The MemoryManager interface provides methods for acquiring and releasing storage, execution, and unroll memory, with a MemoryMode argument indicating on‑heap or off‑heap usage:

def acquireStorageMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean

def acquireUnrollMemory(blockId: BlockId, numBytes: Long, memoryMode: MemoryMode): Boolean

def acquireExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Long

def releaseStorageMemory(numBytes: Long, memoryMode: MemoryMode): Unit

def releaseExecutionMemory(numBytes: Long, taskAttemptId: Long, memoryMode: MemoryMode): Unit

def releaseUnrollMemory(numBytes: Long, memoryMode: MemoryMode): Unit

Two allocation strategies exist:

Static memory management (pre‑Spark 1.6) fixes the sizes of storage, execution, and other memory from configuration fractions; formulas such as availableStorage = systemMaxMemory * spark.storage.memoryFraction * spark.storage.safetyFraction (0.6 and 0.9 by default) are used, with an analogous formula for execution memory based on spark.shuffle.memoryFraction.
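The static formulas above can be worked through numerically. This is a minimal sketch, assuming the legacy defaults (spark.storage.memoryFraction = 0.6 with safety 0.9, and spark.shuffle.memoryFraction = 0.2 with safety 0.8); StaticLayout is a hypothetical helper, not Spark code.

```scala
// Worked example of the legacy (pre-1.6) static sizing formulas.
object StaticLayout {
  // availableStorage = systemMaxMemory * memoryFraction * safetyFraction
  def availableStorageMb(systemMaxMb: Double,
                         memoryFraction: Double = 0.6,
                         safetyFraction: Double = 0.9): Double =
    systemMaxMb * memoryFraction * safetyFraction

  // The execution-side analog, driven by spark.shuffle.memoryFraction.
  def availableExecutionMb(systemMaxMb: Double,
                           shuffleFraction: Double = 0.2,
                           safetyFraction: Double = 0.8): Double =
    systemMaxMb * shuffleFraction * safetyFraction
}
```

For an 8192 MB executor this yields about 4424 MB of storage memory (8192 × 0.54) and about 1311 MB of execution memory (8192 × 0.16), with the boundaries fixed regardless of the actual workload.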

Unified memory management (default since Spark 1.6) lets storage and execution share a common pool, dynamically borrowing idle space from each other, which improves utilization and reduces the risk of one region sitting idle while the other is starved.
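The size of this shared pool can be sketched from Spark's documented defaults: roughly 300 MB is reserved for the system, spark.memory.fraction (0.6 in recent versions) determines the unified pool, and spark.memory.storageFraction (0.5) sets the base storage region within it. HeapLayout is a hypothetical helper for the arithmetic, not Spark API.

```scala
// Approximate unified on-heap layout for an executor, using documented defaults.
object HeapLayout {
  val ReservedMb = 300.0 // memory reserved for Spark's internal objects

  // Unified pool shared by storage and execution.
  def unifiedPoolMb(executorMemoryMb: Double, memoryFraction: Double = 0.6): Double =
    (executorMemoryMb - ReservedMb) * memoryFraction

  // Base storage region inside the unified pool (spark.memory.storageFraction).
  def storageRegionMb(executorMemoryMb: Double, storageFraction: Double = 0.5): Double =
    unifiedPoolMb(executorMemoryMb) * storageFraction
}
```

For an 8192 MB executor this gives a unified pool of about 4735 MB, of which about 2368 MB is the base storage region; the rest of the heap serves user code and internal metadata.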

The dynamic borrowing mechanism sets a base storage region (spark.memory.storageFraction, 0.5 of the unified pool by default) and lets either side borrow the other's idle space. The rules are asymmetric: when execution needs its space back, it can evict cached blocks (spilling them to disk if necessary) to reclaim storage memory down to the base region; when storage needs its space back, it cannot forcibly reclaim execution memory and must wait until the tasks holding it finish and release it.
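The asymmetric borrowing rules above can be modeled with a toy pool. This is a simplified sketch, not Spark's UnifiedMemoryManager: ToyUnifiedPool and its methods are invented names, and real eviction works on whole blocks rather than arbitrary byte counts.

```scala
// Toy model of unified-memory borrowing: execution may evict borrowed storage
// down to the base storage region, but storage can never evict execution.
class ToyUnifiedPool(totalBytes: Long, storageFraction: Double = 0.5) {
  private val baseStorage = (totalBytes * storageFraction).toLong
  private var storageUsed = 0L
  private var executionUsed = 0L

  // Storage may borrow idle execution space, but fails rather than evicting.
  def acquireStorage(bytes: Long): Boolean =
    if (storageUsed + executionUsed + bytes <= totalBytes) {
      storageUsed += bytes; true
    } else false

  // Execution may evict storage that has grown past the base region
  // (modeling "spill cached blocks to disk"), then takes what is free.
  def acquireExecution(bytes: Long): Long = {
    var free = totalBytes - storageUsed - executionUsed
    if (free < bytes && storageUsed > baseStorage) {
      val evictable = math.min(storageUsed - baseStorage, bytes - free)
      storageUsed -= evictable
      free += evictable
    }
    val granted = math.min(bytes, free)
    executionUsed += granted
    granted
  }

  def storageBytes: Long = storageUsed
  def executionBytes: Long = executionUsed
}
```

With a 1000-byte pool, caching 800 bytes succeeds by borrowing from the idle execution side; a later 400-byte execution request evicts the 200 borrowed bytes above what it needs the pool to free, while a further storage request simply fails because execution memory cannot be evicted.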

In summary, unified memory management increases the effective use of both on‑heap and off‑heap resources, but developers must still monitor storage and execution memory usage to avoid excessive garbage collection or out‑of‑memory errors.

Tags: big data, memory management, Spark, Off-heap, On-Heap, Unified Memory Manager
Written by Big Data Technology Architecture

Exploring Open Source Big Data and AI Technologies