Understanding Multi-Core Processor Architectures: SMP, UMA, NUMA & Cache Hierarchies
This article outlines the main server hardware architectures (SMP, NUMA, and MPP), explains the shared-memory models UMA and NUMA, details multi-core cache structures from private L1 to shared L3, compares access latencies, and discusses inter-core communication mechanisms and cache-coherency protocols.
Architecture Overview
From a system architecture perspective, current commercial servers can be divided into three main categories:
Symmetric Multi-Processor (SMP)
Non-Uniform Memory Access (NUMA)
Massively Parallel Processing (MPP)
Shared-memory multiprocessors have two models:
Uniform-Memory-Access (UMA) model
Non-Uniform-Memory-Access (NUMA) model
Cache Structures in Multi-Core Systems
Based on the cache configuration inside a multi-core processor, the organization can be divided into four types:
On-die private L1 cache : Simple multi-core CPUs use a two‑level cache (L1 and L2). Each core has its own private L1 cache, split into instruction (L1‑I) and data (L1‑D) caches, while a shared L2 cache sits off‑chip.
On-die private L2 cache : Each core still has private L1 caches, but the L2 cache is moved on‑die and is also private to each core. Main memory remains outside the chip.
On-die shared L2 cache : Similar to the private‑L2 structure, but the L2 cache is shared among all cores on the chip while main memory stays outside. A shared L2 lets cores draw on otherwise idle cache capacity, improving overall system performance despite slightly higher latency for each core.
On-die shared L3 cache : With larger on‑die memory resources, high‑performance processors move the L3 cache onto the chip and make it shared among all cores, further boosting performance.
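On a Linux machine you can see which of these structures your CPU actually uses, because the kernel exposes the cache hierarchy under sysfs. The sketch below (the helper name `describe_caches` is ours; the paths are the kernel's cacheinfo ABI, so this is Linux-specific) prints each cache level, its size, and which cores share it:

```python
# Sketch: read the cache hierarchy Linux reports under sysfs.
# Linux-specific; returns an empty list where sysfs is unavailable.
from pathlib import Path

def describe_caches(cpu: int = 0) -> list:
    """Return (level, type, size, shared_cpu_list) tuples for one CPU."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    caches = []
    for idx in sorted(base.glob("index*")) if base.exists() else []:
        read = lambda name: (idx / name).read_text().strip()
        caches.append((read("level"), read("type"), read("size"),
                       read("shared_cpu_list")))
    return caches

if __name__ == "__main__":
    for level, typ, size, shared in describe_caches():
        # shared_cpu_list shows which cores share this cache
        # (private L1/L2 caches list one core, a shared L3 lists many).
        print(f"L{level} {typ:<12} {size:>6}  shared by CPUs {shared}")
```

A private L1 entry typically shows a single CPU in `shared_cpu_list`, while a shared L3 lists every core in the package.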
Access speed comparison: L1 cache (instruction and data) is fastest, followed by L2, L3, then RAM. Typical latencies (exact values vary by microarchitecture) are:
L1: 4 CPU cycles
L2: 11 CPU cycles
L3: 39 CPU cycles
RAM: 107 CPU cycles
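To make these cycle counts concrete, the short sketch below converts them to nanoseconds assuming a hypothetical 3 GHz clock (the clock speed and the resulting nanosecond figures are our assumption, not measurements):

```python
# Sketch: convert the cycle counts above into nanoseconds, assuming a
# hypothetical 3 GHz clock (3 cycles per ns). Real values vary by CPU.
CLOCK_GHZ = 3.0
LATENCY_CYCLES = {"L1": 4, "L2": 11, "L3": 39, "RAM": 107}

def to_ns(cycles: int, ghz: float = CLOCK_GHZ) -> float:
    return cycles / ghz  # cycles divided by cycles-per-nanosecond

for tier, cyc in LATENCY_CYCLES.items():
    print(f"{tier:>3}: {cyc:>3} cycles ~ {to_ns(cyc):5.1f} ns "
          f"({cyc / LATENCY_CYCLES['L1']:.1f}x L1)")
```

The takeaway is the ratio, not the absolute numbers: a RAM access costs roughly 27 times an L1 hit, which is why keeping working sets cache-resident matters so much.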
Inter-Core Communication Mechanisms
Multi-core processors need efficient communication for data sharing and synchronization. Three common mechanisms are:
Bus‑shared cache : Cores share L2 or L3 cache via a common bus. Simple and fast but limited scalability.
Crossbar switch : Provides high‑bandwidth point‑to‑point connections, avoiding bus contention.
On‑chip network (NoC) : Integrates many cores and resources on a single chip using a network‑on‑chip, employing message passing, routing, and packet switching to overcome bus bottlenecks.
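The difference between shared-memory and message-passing communication can be illustrated in software. The sketch below is only an analogy (threads standing in for cores, queues for NoC links, both our invention): instead of two "cores" racing on a shared variable, each sends the other a packet through a queue.

```python
# Sketch: a software analogy for NoC-style message passing. Two threads
# ("cores") exchange data through queues (the "links") rather than
# contending for a shared bus or shared variable.
import queue
import threading

def core(inbox: queue.Queue, outbox: queue.Queue) -> None:
    msg = inbox.get()      # receive a packet from the other core
    outbox.put(msg + 1)    # reply with a transformed payload

a_to_b, b_to_a = queue.Queue(), queue.Queue()
t = threading.Thread(target=core, args=(a_to_b, b_to_a))
t.start()
a_to_b.put(41)             # "core A" sends a packet
result = b_to_a.get()      # ...and blocks until the reply arrives
t.join()
print(result)              # → 42
```

Because ownership of the data transfers with the message, no lock or coherence traffic is needed for this exchange, which is exactly the property NoC designs exploit at the hardware level.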
Cache Coherency in Multi-Core Processors
Maintaining coherence across multiple cache levels is essential. In multi-core systems, each core may hold private copies of the same data, leading to potential inconsistencies during writes. Coherency protocols track the state of each cache line and update other caches via mechanisms such as write‑through, write‑back, and bus‑snooping.
UMA (Uniform Memory Access) and SMP
SMP (Symmetric Multi‑Processor) systems have multiple CPUs with equal status, sharing a single physical memory where access latency is uniform—this is also called UMA. Scaling SMP involves adding memory, faster CPUs, more CPUs, or I/O slots. While UMA offers simple design and good load balancing, its scalability is limited because all cores share the same memory bus.
NUMA (Non‑Uniform Memory Access)
NUMA overcomes SMP's scalability limits by partitioning memory into nodes, each attached to a subset of CPUs via high‑speed interconnects. Access latency increases with distance: memory local to the CPU is fastest, memory attached to another CPU in the same node is slower, and memory on a remote node is slowest. Applications should therefore aim to keep memory accesses within the same NUMA node, often by setting thread affinity.
NUMA retains the simplicity of a single OS and programming model while providing better scalability and performance compared to UMA.
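The affinity pinning mentioned above can be done from ordinary code. A minimal sketch, assuming Linux (these `os` calls are Linux-only; production setups would usually reach for `numactl` or libnuma to pick the NUMA node explicitly rather than a raw CPU):

```python
# Sketch: pin the current process to a single CPU so its threads, and
# the memory the kernel allocates for them, tend to stay on that CPU's
# NUMA node. Linux-only API.
import os

available = os.sched_getaffinity(0)   # CPUs this process may run on
target = {min(available)}             # pick one, e.g. CPU 0
os.sched_setaffinity(0, target)       # pin the process to it
print("pinned to CPUs:", os.sched_getaffinity(0))
```

On the command line the equivalent is `numactl --cpunodebind=0 --membind=0 <cmd>`, which binds both execution and allocation to node 0.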
CC‑NUMA (Cache‑Coherent NUMA)
CC‑NUMA systems maintain cache coherence across nodes using specialized hardware, eliminating the need for software‑based coherence. Two common protocols are:
Directory protocol : A centralized controller stores the global state of each cache line in main memory. When a CPU issues a read/write, the controller coordinates data synchronization.
Snoopy protocol : Caches monitor a shared bus and broadcast coherence messages. While simpler, it consumes more bus bandwidth and scales less well than directory‑based approaches.
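The states a snoopy protocol tracks can be made concrete with the classic MESI scheme (Modified, Exclusive, Shared, Invalid). The sketch below is a deliberately simplified state table for a single cache line; real protocols also model bus transactions, write-backs, and the E-versus-S decision on a fill, which we collapse here:

```python
# Sketch: a simplified MESI state machine for one cache line, as a
# snooping cache would apply it. Only the state label is tracked; the
# comments note the bus traffic each transition would cause.
MESI = {
    ("I", "local_read"):   "S",  # fill from memory; assume other sharers
    ("I", "local_write"):  "M",
    ("S", "local_write"):  "M",  # upgrade: invalidate other sharers
    ("S", "remote_write"): "I",
    ("E", "local_write"):  "M",  # silent upgrade, no bus traffic
    ("E", "remote_read"):  "S",
    ("E", "remote_write"): "I",
    ("M", "remote_read"):  "S",  # dirty data is written back first
    ("M", "remote_write"): "I",  # write back, then invalidate
}

def next_state(state: str, event: str) -> str:
    # Events not listed leave the state unchanged (e.g. S + local_read).
    return MESI.get((state, event), state)

# One core writes a line, then another core reads it:
s = next_state("I", "local_write")  # I -> M: we own the only dirty copy
s = next_state(s, "remote_read")    # M -> S: write back, now shared
print(s)                            # → S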
Preview
In the next section we will discuss practical NUMA considerations on Linux.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.