
Performance Optimization: Register Access, Assembly Basics, and CPU Pipeline Techniques

This article explains how performance can be improved dramatically by keeping frequently used data in CPU registers instead of memory, understanding basic assembly syntax and instruction types, using branch‑prediction hints, and exploiting the CPU pipeline to reduce stalls and wasted cycles.

OPPO Kernel Craftsman
Performance optimization has been a concern since the birth of computers. This article introduces several principles and methods of optimization from a system perspective.

2. Register Access Instead of Memory References

An example compares two programs (Prog1 and Prog2) that sum elements of an array. Prog2 replaces a memory reference (*dest) with a local variable (DEST) that resides in a register, resulting in noticeably faster execution.

The performance gain is demonstrated with benchmark results (no compiler optimizations applied).
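The original listings appear only as images, but the two variants can be sketched as follows (function names are assumed; the original uses Prog1/Prog2 with a DEST variable):

```c
#include <stddef.h>

/* Prog1-style: accumulates through the pointer. Without optimization,
 * every iteration re-reads and re-writes *dest in memory. */
void sum1(const long *a, size_t n, long *dest) {
    *dest = 0;
    for (size_t i = 0; i < n; i++)
        *dest += a[i];
}

/* Prog2-style: accumulates in a local variable, which the compiler can
 * keep in a register; memory is written only once, at the end. */
void sum2(const long *a, size_t n, long *dest) {
    long acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += a[i];
    *dest = acc;
}
```

Both functions compute the same result; the difference is purely in how often the running sum touches memory.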

2.1 Registers

Registers are small storage units inside the CPU. Although faster than L1 cache, they are limited in number. The memory hierarchy is: registers → L1 cache → L2 cache → L3 cache → main memory → disk. A miss at any level costs roughly an order of magnitude more latency at the next (typical figures: a register access takes ~1 cycle, L1 ~4 cycles, L2 ~10+ cycles, L3 ~40+ cycles, and main memory hundreds of cycles; the original article shows the exact timing tables as images).

For a 3.3 GHz CPU, a single cycle lasts about 0.3 ns, during which light travels only about 9 cm.
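The cycle-time figure can be sanity-checked with a quick calculation:

```c
/* Distance light travels during one clock cycle, in metres. */
double light_per_cycle_m(double freq_hz) {
    const double c = 2.998e8;   /* speed of light, m/s */
    return c / freq_hz;         /* distance = c * (1 / f) */
}
```

For `freq_hz = 3.3e9` this gives roughly 0.09 m, i.e. about 9 cm per cycle.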

Thus, Prog2 runs faster because it accesses a register instead of a memory location.

2.2 Assembly Language Overview

Assembly bridges machine code and human readability. Early computers used raw machine code (binary), which was hard to write and debug, leading to the creation of assembly language.

Example source code and its corresponding assembly/machine code are shown in the images.

Assembly provides mnemonic instructions such as mov %rbx, %rax (register‑to‑register move) and push %rbp (push onto stack).
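As a minimal illustration (the function is made up for this sketch, and the assembly shown in the comment is one plausible compiler output, not taken from the original images):

```c
/* A trivial function: return the argument unchanged. */
long copy_val(long x) {
    return x;
}

/* One plausible x86-64 (AT&T syntax) translation at -O1:
 *
 *   copy_val:
 *     movq %rdi, %rax   # move the argument register into the return register
 *     ret               # return to the caller
 */
```

Each assembly line maps to one machine instruction, which is what makes assembly readable where raw machine code is not.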

2.3 Assembly Instruction Types

Operand specifiers: immediate, register, memory reference.

Data transfer: mov family.

Stack operations: push, pop.

Arithmetic/logic operations: add, sub, and, xor, shifts (the full table appears as an image in the original article).

Jump instructions: unconditional jmp and conditional jumps.

Compare/test instructions: cmp, test.
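A small function shows how these instruction classes combine in practice (the function is invented for this sketch, and the mnemonics in the comments are representative rather than actual compiler output):

```c
/* abs_val exercises several of the instruction classes above:
 * a compare, a conditional jump, and data-transfer/arithmetic moves. */
long abs_val(long x) {
    if (x < 0)        /* cmpq $0, %rdi ; jge .done   -- compare + conditional jump */
        x = -x;       /* negq %rdi                   -- arithmetic */
    return x;         /* movq %rdi, %rax ; ret       -- data transfer + return */
}
```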

3. Improving CPU Branch Prediction Accuracy

An example shows two functions that iterate over a zero‑filled array. The second version adds the likely hint to the branch, resulting in shorter execution time.
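The hint in question is the Linux-kernel-style likely()/unlikely() macro, built on GCC's __builtin_expect. A minimal sketch, assuming the original example counts nonzero elements in a zero-filled array (the function names here are assumptions):

```c
#include <stddef.h>

/* likely/unlikely as defined in the Linux kernel (requires GCC or Clang). */
#define likely(x)   __builtin_expect(!!(x), 1)
#define unlikely(x) __builtin_expect(!!(x), 0)

/* Version 1: no hint; the compiler has no information about which
 * branch direction dominates. */
long count_nonzero(const int *a, size_t n) {
    long cnt = 0;
    for (size_t i = 0; i < n; i++)
        if (a[i] != 0)
            cnt++;
    return cnt;
}

/* Version 2: tells the compiler the element is almost never nonzero,
 * so the common (all-zero) path becomes the straight-line fall-through. */
long count_nonzero_hinted(const int *a, size_t n) {
    long cnt = 0;
    for (size_t i = 0; i < n; i++)
        if (unlikely(a[i] != 0))
            cnt++;
    return cnt;
}
```

The hint does not change the result; it only lets the compiler lay out code so the predicted-common path avoids taken branches.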

Understanding CPU pipelines helps explain the benefit.

3.1 CPU Pipeline Overview

A pipeline splits instruction execution into stages: fetch, decode, execute, memory access, write‑back, and PC update. Without pipelining, only one stage's hardware is busy at a time while the rest sits idle, wasting resources.

Pipelining allows overlapping execution of multiple instructions, improving throughput but introducing hazards such as data and control dependencies. Mispredicted branches can cost ~19 cycles.

Assembly of the branch example shows a jne instruction whose condition is always false: since the branch is never taken, the predictor is essentially always right, and no misprediction penalty is paid.

4. Summary

From a system viewpoint, performance optimization can focus on:

Obtaining data as efficiently as possible.

Minimizing useless CPU work.

Doing more useful work within limited time.

Reducing the amount of work the CPU must perform for the same task.

References: Computer Systems: A Programmer's Perspective by Randal E. Bryant & David R. O'Hallaron; Systems Performance by Brendan Gregg; CSDN blog by Song Baohua.
