Understanding Multi-Core Processor Architectures: SMP, UMA, NUMA & Cache Hierarchies
This article outlines the main server hardware architectures (SMP, NUMA, and MPP), explains the shared-memory models UMA and NUMA, details multi-core cache structures from private L1 to shared L3, compares access latencies, and discusses inter-core communication mechanisms and cache-coherency protocols.
Architecture Overview
From a system architecture perspective, current commercial servers can be divided into three main categories:
Symmetric Multi-Processor (SMP)
Non-Uniform Memory Access (NUMA)
Massively Parallel Processing (MPP)
Shared-memory multiprocessors have two models:
Uniform-Memory-Access (UMA) model
Non-Uniform-Memory-Access (NUMA) model
Cache Structures in Multi-Core Systems
Based on the cache configuration inside a multi-core processor, the organization can be divided into four types:
On-die private L1 cache : Simple multi-core CPUs use a two‑level cache (L1 and L2). Each core has its own private L1 cache, split into instruction (L1‑I) and data (L1‑D) caches, while a shared L2 cache sits off‑chip.
On-die private L2 cache : Each core still has private L1 caches, but the L2 cache is moved on‑die and is also private to each core. Main memory remains outside the chip.
On-die shared L2 cache : Similar to the private‑L2 structure, but the L2 cache is shared among all cores on the chip while main memory stays outside. A shared L2 lets cores draw on otherwise idle cache capacity, improving overall system performance despite slightly higher latency for each core.
On-die shared L3 cache : With larger on‑die memory resources, high‑performance processors move the L3 cache onto the chip and make it shared among all cores, further boosting performance.
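On a Linux machine you can see which of these structures your CPU actually uses, because the kernel exposes the cache hierarchy under sysfs. The sketch below (the helper name `describe_caches` is ours; the paths are the kernel's cacheinfo ABI, so this is Linux-specific) prints each cache level, its size, and which cores share it:

```python
# Sketch: read the cache hierarchy Linux reports under sysfs.
# Linux-specific; returns an empty list where sysfs is unavailable.
from pathlib import Path

def describe_caches(cpu: int = 0) -> list:
    """Return (level, type, size, shared_cpu_list) tuples for one CPU."""
    base = Path(f"/sys/devices/system/cpu/cpu{cpu}/cache")
    caches = []
    for idx in sorted(base.glob("index*")) if base.exists() else []:
        read = lambda name: (idx / name).read_text().strip()
        caches.append((read("level"), read("type"), read("size"),
                       read("shared_cpu_list")))
    return caches

if __name__ == "__main__":
    for level, typ, size, shared in describe_caches():
        # shared_cpu_list shows which cores share this cache
        # (private L1/L2 caches list one core, a shared L3 lists many).
        print(f"L{level} {typ:<12} {size:>6}  shared by CPUs {shared}")
```

A private L1 entry typically shows a single CPU in `shared_cpu_list`, while a shared L3 lists every core in the package.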
Access speed comparison: L1 cache (instruction and data) is fastest, followed by L2, L3, then RAM. Typical latencies (exact values vary by microarchitecture) are:
L1: 4 CPU cycles
L2: 11 CPU cycles
L3: 39 CPU cycles
RAM: 107 CPU cycles
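To make these cycle counts concrete, the short sketch below converts them to nanoseconds assuming a hypothetical 3 GHz clock (the clock speed and the resulting nanosecond figures are our assumption, not measurements):

```python
# Sketch: convert the cycle counts above into nanoseconds, assuming a
# hypothetical 3 GHz clock (3 cycles per ns). Real values vary by CPU.
CLOCK_GHZ = 3.0
LATENCY_CYCLES = {"L1": 4, "L2": 11, "L3": 39, "RAM": 107}

def to_ns(cycles: int, ghz: float = CLOCK_GHZ) -> float:
    return cycles / ghz  # cycles divided by cycles-per-nanosecond

for tier, cyc in LATENCY_CYCLES.items():
    print(f"{tier:>3}: {cyc:>3} cycles ~ {to_ns(cyc):5.1f} ns "
          f"({cyc / LATENCY_CYCLES['L1']:.1f}x L1)")
```

The takeaway is the ratio, not the absolute numbers: a RAM access costs roughly 27 times an L1 hit, which is why keeping working sets cache-resident matters so much.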
Inter-Core Communication Mechanisms
Multi-core processors need efficient communication for data sharing and synchronization. Three common mechanisms are:
Bus‑shared cache : Cores share L2 or L3 cache via a common bus. Simple and fast but limited scalability.
Crossbar switch : Provides high‑bandwidth point‑to‑point connections, avoiding bus contention.
On‑chip network (NoC) : Integrates many cores and resources on a single chip using a network‑on‑chip, employing message passing, routing, and packet switching to overcome bus bottlenecks.
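The difference between shared-memory and message-passing communication can be illustrated in software. The sketch below is only an analogy (threads standing in for cores, queues for NoC links, both our invention): instead of two "cores" racing on a shared variable, each sends the other a packet through a queue.

```python
# Sketch: a software analogy for NoC-style message passing. Two threads
# ("cores") exchange data through queues (the "links") rather than
# contending for a shared bus or shared variable.
import queue
import threading

def core(inbox: queue.Queue, outbox: queue.Queue) -> None:
    msg = inbox.get()      # receive a packet from the other core
    outbox.put(msg + 1)    # reply with a transformed payload

a_to_b, b_to_a = queue.Queue(), queue.Queue()
t = threading.Thread(target=core, args=(a_to_b, b_to_a))
t.start()
a_to_b.put(41)             # "core A" sends a packet
result = b_to_a.get()      # ...and blocks until the reply arrives
t.join()
print(result)              # → 42
```

Because ownership of the data transfers with the message, no lock or coherence traffic is needed for this exchange, which is exactly the property NoC designs exploit at the hardware level.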
Cache Coherency in Multi-Core Processors
Maintaining coherence across multiple cache levels is essential. In multi-core systems, each core may hold private copies of the same data, leading to potential inconsistencies during writes. Coherency protocols track the state of each cache line and update other caches via mechanisms such as write‑through, write‑back, and bus‑snooping.
UMA (Uniform Memory Access) and SMP
SMP (Symmetric Multi‑Processor) systems have multiple CPUs with equal status, sharing a single physical memory where access latency is uniform—this is also called UMA. Scaling SMP involves adding memory, faster CPUs, more CPUs, or I/O slots. While UMA offers simple design and good load balancing, its scalability is limited because all cores share the same memory bus.
NUMA (Non‑Uniform Memory Access)
NUMA overcomes SMP's scalability limits by partitioning memory into nodes, each attached to a subset of CPUs via high‑speed interconnects. Access latency increases with distance: memory local to the CPU is fastest, memory attached to another CPU in the same node is slower, and memory on a remote node is slowest. Applications should therefore aim to keep memory accesses within the same NUMA node, often by setting thread affinity.
NUMA retains the simplicity of a single OS and programming model while providing better scalability and performance compared to UMA.
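The affinity pinning mentioned above can be done from ordinary code. A minimal sketch, assuming Linux (these `os` calls are Linux-only; production setups would usually reach for `numactl` or libnuma to pick the NUMA node explicitly rather than a raw CPU):

```python
# Sketch: pin the current process to a single CPU so its threads, and
# the memory the kernel allocates for them, tend to stay on that CPU's
# NUMA node. Linux-only API.
import os

available = os.sched_getaffinity(0)   # CPUs this process may run on
target = {min(available)}             # pick one, e.g. CPU 0
os.sched_setaffinity(0, target)       # pin the process to it
print("pinned to CPUs:", os.sched_getaffinity(0))
```

On the command line the equivalent is `numactl --cpunodebind=0 --membind=0 <cmd>`, which binds both execution and allocation to node 0.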
CC‑NUMA (Cache‑Coherent NUMA)
CC‑NUMA systems maintain cache coherence across nodes using specialized hardware, eliminating the need for software‑based coherence. Two common protocols are:
Directory protocol : A centralized controller stores the global state of each cache line in main memory. When a CPU issues a read/write, the controller coordinates data synchronization.
Snoopy protocol : Caches monitor a shared bus and broadcast coherence messages. While simpler, it consumes more bus bandwidth and scales less well than directory‑based approaches.
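The states a snoopy protocol tracks can be made concrete with the classic MESI scheme (Modified, Exclusive, Shared, Invalid). The sketch below is a deliberately simplified state table for a single cache line; real protocols also model bus transactions, write-backs, and the E-versus-S decision on a fill, which we collapse here:

```python
# Sketch: a simplified MESI state machine for one cache line, as a
# snooping cache would apply it. Only the state label is tracked; the
# comments note the bus traffic each transition would cause.
MESI = {
    ("I", "local_read"):   "S",  # fill from memory; assume other sharers
    ("I", "local_write"):  "M",
    ("S", "local_write"):  "M",  # upgrade: invalidate other sharers
    ("S", "remote_write"): "I",
    ("E", "local_write"):  "M",  # silent upgrade, no bus traffic
    ("E", "remote_read"):  "S",
    ("E", "remote_write"): "I",
    ("M", "remote_read"):  "S",  # dirty data is written back first
    ("M", "remote_write"): "I",  # write back, then invalidate
}

def next_state(state: str, event: str) -> str:
    # Events not listed leave the state unchanged (e.g. S + local_read).
    return MESI.get((state, event), state)

# One core writes a line, then another core reads it:
s = next_state("I", "local_write")  # I -> M: we own the only dirty copy
s = next_state(s, "remote_read")    # M -> S: write back, now shared
print(s)                            # → S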
Preview
In the next section we will discuss practical NUMA considerations on Linux.
Ops Development Stories
Maintained by a like‑minded team, covering both operations and development. Topics span Linux ops, DevOps toolchain, Kubernetes containerization, monitoring, log collection, network security, and Python or Go development. Team members: Qiao Ke, wanger, Dong Ge, Su Xin, Hua Zai, Zheng Ge, Teacher Xia.