Understanding CPU Cache: Architecture, Hierarchy, and Optimization Techniques
This article explains the fundamental role of CPU cache in bridging the speed gap between processors and memory, covering cache hierarchy, locality principles, write policies, coherence protocols, and practical code optimizations such as data alignment and loop restructuring to improve performance.
In modern computers the CPU operates at gigahertz frequencies while memory access remains orders of magnitude slower, creating a performance bottleneck that is mitigated by the introduction of a high‑speed CPU cache.
1. Overview of CPU Cache
Because processor performance has historically improved much faster than DRAM latency (a trend often tied to Moore's law), the gap between the two keeps widening, prompting the use of multi-level caches (L1‑L3) to keep frequently accessed data and instructions close to the core.
1.1 Definition and Role
CPU cache is a small, fast storage situated between the processor and main memory, acting as a data "relay station" that reduces the time the CPU spends waiting for memory.
1.2 Cache Hierarchy
Typical CPUs have three cache levels: L1 (closest to the core, ~32‑64 KB, split into data and instruction caches), L2 (larger, ~1 MB, slightly slower), and L3 (shared among cores, up to tens of megabytes). Each level offers increasing capacity but decreasing speed.
2. Core Principles of CPU Cache
2.1 Principle of Locality
Cache performance relies on temporal locality (recently accessed data is likely to be reused) and spatial locality (data near recently accessed addresses is likely to be accessed soon). For example:
int sum = 0;
int arr[100] = {1, 2, 3, ..., 100};
for (int i = 0; i < 100; i++) {
    sum += arr[i];
}
The loop repeatedly accesses sum and sequential elements of arr, allowing the cache to keep them resident.
2.2 Cache Hit and Miss
A cache hit occurs when the requested data is already in the cache, enabling near‑instant access; a miss forces the CPU to fetch data from slower main memory, incurring many clock cycles.
2.3 Cache Line
Data is transferred between memory and cache in fixed‑size blocks called cache lines (commonly 64 bytes). Loading a line brings a contiguous chunk of memory into the cache, improving the likelihood of subsequent hits.
3. Cache Write Policies
3.1 Write‑Through
Every write updates both the cache and main memory simultaneously, guaranteeing data consistency but incurring higher latency and bus traffic.
3.2 Write‑Back
Writes modify only the cache; the modified line (marked "dirty") is written back to memory only when it is evicted. This reduces memory traffic and improves performance, at the cost of added complexity and potential data loss on sudden power failure.
4. Cache Coherence in Multi‑core Systems
4.1 Coherence Problem
When multiple cores have private caches, a write by one core can leave other cores with stale copies, leading to incorrect program behavior.
4.2 Solutions
Two common mechanisms are:
Bus Snooping: each core monitors the bus for write requests and invalidates or updates its own cache lines accordingly.
MESI Protocol: caches track each line's state (Modified, Exclusive, Shared, or Invalid) to coordinate coherence with fewer bus transactions.
5. Optimizing Code for CPU Cache
5.1 Data Alignment
Aligning data structures to their natural word boundaries (e.g., 4‑byte alignment for int) ensures that they fit neatly within cache lines, reducing the chance of a single object spanning two lines.
#include <stdio.h>
#include <stdlib.h>
#include <stddef.h>
#include <time.h>
// Unaligned (packed) structure: members placed back-to-back, so b sits at offset 1
struct __attribute__((packed)) UnalignedStruct {
    char a;
    int b;
    char c;
};
// Aligned structure (the default layout already pads b to its natural 4‑byte boundary)
struct __attribute__((aligned(4))) AlignedStruct {
    char a;
    int b;
    char c;
};
void testAccessTime(void *arr, size_t numElements, size_t structSize, size_t offsetB) {
    clock_t start, end;
    double cpu_time_used;
    volatile int sink = 0;  // volatile so the compiler cannot discard the reads
    size_t i;
    start = clock();
    for (i = 0; i < numElements; i++) {
        char *ptr = (char *)arr + i * structSize;
        // read the int member at its real offset; an unaligned int load
        // is fine on x86 but may trap on some architectures
        sink += *(int *)(ptr + offsetB);
    }
    end = clock();
    cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC;
    printf("Access time: %f seconds\n", cpu_time_used);
}
int main() {
    size_t numElements = 1000000;
    struct UnalignedStruct *unalignedArr = (struct UnalignedStruct *)malloc(numElements * sizeof(struct UnalignedStruct));
    struct AlignedStruct *alignedArr = (struct AlignedStruct *)malloc(numElements * sizeof(struct AlignedStruct));
    if (!unalignedArr || !alignedArr) { perror("malloc failed"); return 1; }
    printf("Testing unaligned struct:\n");
    testAccessTime(unalignedArr, numElements, sizeof(struct UnalignedStruct), offsetof(struct UnalignedStruct, b));
    printf("Testing aligned struct:\n");
    testAccessTime(alignedArr, numElements, sizeof(struct AlignedStruct), offsetof(struct AlignedStruct, b));
    free(unalignedArr);
    free(alignedArr);
    return 0;
}
Running this program typically shows faster access for the aligned version, illustrating the impact of alignment on cache efficiency.
5.2 Loop Optimization
Loop structure influences cache behavior. Reducing nesting depth, ordering loops to follow memory layout (row‑major for C arrays), and minimizing stride can significantly raise hit rates.
for (int i = 0; i < N; i++) {
    for (int j = 0; j < M; j++) {
        data[i][j] = data[i][j] * 2;  // row-major traversal is cache-friendly, but any per-i work placed here is redone M times
    }
}
// Improved version – move invariant work outside the inner loop
int constant = someComplexCalculation();
for (int i = 0; i < N; i++) {
    int temp = someCalculationBasedOnI(i, constant);
    for (int j = 0; j < M; j++) {
        data[i][j] = temp * data[i][j];
    }
}
// Row‑major access (preferred: follows the memory layout of C arrays)
for (int i = 0; i < ROWS; i++) {
    for (int j = 0; j < COLS; j++) {
        sum += matrix[i][j];
    }
}
// Column‑major access (each step jumps a full row ahead, causing more misses)
for (int j = 0; j < COLS; j++) {
    for (int i = 0; i < ROWS; i++) {
        sum += matrix[i][j];
    }
}
Applying these techniques helps programs achieve higher performance, especially in data‑intensive or real‑time scenarios.
Deepin Linux
Research areas: Windows & Linux platforms, C/C++ backend development, embedded systems and Linux kernel, etc.