
Master Linux Performance: Boost Throughput, Cut Latency, and Optimize CPU & Memory

This guide explains how high concurrency and fast response are measured through throughput and latency, defines the key performance metrics, shows how to interpret average load, CPU context switches, and memory usage, and provides practical Linux tools and command-line examples for diagnosing and tuning system performance.


Linux Performance Optimization

Performance Metrics

High concurrency and fast response are measured by two core indicators: throughput and latency.

Performance problems arise when a system resource hits a bottleneck and request handling can no longer keep up with traffic. Performance analysis aims to locate these bottlenecks and mitigate them.

From the application perspective: metrics such as throughput and response time, which directly shape end-user experience.

From the system resource perspective: metrics such as resource utilization and saturation.

Key Steps

Select metrics to evaluate application and system performance.

Set performance targets for applications and the system.

Conduct performance baseline testing.

Analyze performance to locate bottlenecks.

Implement performance monitoring and alerts.

Understanding "Average Load"

Average load is the average number of runnable and uninterruptible processes over a time interval; it is not directly comparable to CPU utilization.

Uninterruptible processes are those in kernel‑mode critical paths (e.g., waiting for I/O). They act as a protection mechanism for processes and hardware.

When Is Average Load Reasonable?

Monitor average load in production and compare it with historical trends. If the load rises sharply, investigate promptly. A common rule of thumb is to keep average load below the number of CPU cores (or around 70% of that value).
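As a quick self-check, that rule of thumb can be scripted by reading the 1-minute load average from `/proc/loadavg` and comparing it against 70% of the core count from `nproc` (a minimal sketch; the 0.7 factor and the messages are illustrative, not standard tooling):

```shell
# Read core count and the 1-minute load average.
cores=$(nproc)
load1=$(cut -d ' ' -f 1 /proc/loadavg)

# Rule of thumb: investigate when load exceeds ~70% of the core count.
threshold=$(awk -v c="$cores" 'BEGIN { printf "%.2f", c * 0.7 }')
echo "load1=$load1 cores=$cores threshold=$threshold"

if awk -v l="$load1" -v t="$threshold" 'BEGIN { exit !(l > t) }'; then
    echo "WARN: load average above threshold, investigate"
else
    echo "OK: load average within threshold"
fi
```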

Average load is often confused with CPU utilization, but they are not equivalent:

CPU‑intensive workloads raise both load and CPU usage.

I/O‑intensive workloads raise load while CPU usage may stay low.

Heavy process scheduling raises both load and CPU usage.

CPU

CPU Context Switches (Part 1)

A CPU context switch saves the previous task’s registers and program counter, then loads the new task’s context before jumping to its entry point. The saved context resides in the kernel until the task is scheduled again.

Context switches are categorized by task type:

Process context switch

Thread context switch

Interrupt context switch

Process Context Switch

Linux separates kernel space and user space. A system call triggers two context switches: user → kernel (saving user registers, loading kernel registers) and kernel → user (restoring user registers).

System calls are technically privilege‑mode switches, not full process switches.

Process switches occur only when the scheduler runs a process on the CPU, e.g., time‑slice rotation, blocked processes, explicit sleep, preemption by higher‑priority tasks, or hardware interrupts.

Thread Context Switch

Two cases exist:

Threads belong to the same process – only thread‑local data and registers change; virtual memory stays the same.

Threads belong to different processes – same cost as a process switch.

Intra‑process thread switches consume fewer resources, which is why multithreading can be advantageous.

Interrupt Context Switch

Interrupt switches involve only kernel‑mode state (CPU registers, kernel stack, hardware parameters) and never occur simultaneously with process switches because interrupt priority exceeds process priority.
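Interrupt activity behind these switches can be inspected directly in procfs: hardware interrupt counts live in `/proc/interrupts` and soft interrupt counts in `/proc/softirqs` (a read-only sketch; each column is one CPU):

```shell
# Hardware interrupts: one row per IRQ source, one column per CPU.
head -n 5 /proc/interrupts

# Soft interrupts (TIMER, NET_RX, RCU, ...), also counted per CPU.
head -n 5 /proc/softirqs
```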

CPU Context Switches (Part 2)

Use `vmstat` to view overall context-switch statistics:

```shell
vmstat 5            # output every 5 seconds
procs -----------memory---------- ---swap-- -----io---- -system-- ------cpu-----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 1  0      0 103388 145412 511056    0    0    18    60    1    1  2  1 96  0  0
```

Key columns:

cs : context switches per second.

in : interrupts per second.

r : length of the run queue (processes ready or running).

b : processes in uninterruptible sleep.

To inspect per-process switches, use `pidstat -w`:

```shell
pidstat -w 5
14:51:16   UID   PID   cswch/s  nvcswch/s  Command
... (sample output) ...
```

`cswch/s` counts voluntary context switches (the process blocks because a resource it needs is unavailable); `nvcswch/s` counts involuntary switches (the scheduler preempts the process, e.g., when its time slice expires).
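The same voluntary/involuntary split is exposed per process in `/proc/<pid>/status`, which is a handy cross-check when `pidstat` is not installed (a sketch using the current shell as the target process):

```shell
# Cumulative context-switch counters for the current shell ($$).
# voluntary_ctxt_switches    - switches while waiting for a resource
# nonvoluntary_ctxt_switches - switches forced by the scheduler
grep ctxt_switches /proc/$$/status
```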

CPU Performance Indicators

CPU usage (user, system, iowait, soft/hard IRQ, steal/guest).

Average load (ideal ≈ number of logical CPUs).

Process context switches (voluntary vs. involuntary).

CPU cache hit rate (L1/L2/L3).

Performance Tools

- Use `uptime` to view the average load.
- Combine `mpstat` and `pidstat` to pinpoint high-load processes.
- Use `top` for a quick CPU usage overview.
- Apply `perf top` / `perf record` / `perf report` to drill into hot functions.
- For softirq and memory pressure issues, examine `/proc/softirqs` and `sar -r -S`.

Memory

How Linux Memory Works

Only the kernel can access physical RAM directly. Each process receives an isolated virtual address space that appears contiguous; the kernel maps its pages to physical memory via page tables, which the CPU's MMU walks on each access (with the TLB caching recent translations).

When a virtual address is not present in the page table, a page‑fault occurs; the kernel allocates a physical page, updates the page table, and resumes the process.

Linux uses multi-level page tables to keep page-table memory manageable and HugePages to reduce TLB pressure.
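The resulting page-fault traffic is tracked per process: fields 10 and 12 of `/proc/<pid>/stat` are the cumulative minor and major fault counts (a sketch reading the current process; the field numbering assumes the comm field contains no spaces):

```shell
# Minor faults: resolved without disk I/O (page already in memory).
# Major faults: the page had to be read from disk.
minflt=$(awk '{ print $10 }' /proc/self/stat)
majflt=$(awk '{ print $12 }' /proc/self/stat)
echo "minor_faults=$minflt major_faults=$majflt"
```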

Virtual Memory Layout

- Read-only segment: code and constants.
- Data segment: global variables.
- Heap: dynamically allocated memory; grows upward.
- Memory-mapped segment: shared libraries and mmap'ed files; grows downward.
- Stack: local variables and call frames; typically 8 MiB.
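This layout is visible for any live process in `/proc/<pid>/maps` (a sketch inspecting the current shell; the `[heap]` entry may be absent if the process has not extended its heap yet):

```shell
# Show the heap and stack mappings of the current shell.
grep -E '\[(heap|stack)\]' /proc/$$/maps || true

# Stack size limit in KiB (8192 corresponds to the typical 8 MiB).
ulimit -s
```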

Allocation & Reclamation

Allocation

- `brk()` for small allocations (< 128 KiB): moves the program break at the top of the heap.
- `mmap()` for large allocations (≥ 128 KiB): reserves address space in the memory-mapped region.

Both allocate virtual memory only; physical pages are committed on first access (a minor page fault).

Reclamation

- Cache reclamation via the LRU algorithm.
- Swapping rarely used pages out to disk.
- The OOM killer terminates memory-hogging processes; a process's score can be adjusted via /proc/<pid>/oom_adj.

Example to lower a process's OOM score:

```shell
echo -16 > /proc/$(pidof myapp)/oom_adj
```
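On current kernels the preferred interface is `/proc/<pid>/oom_score_adj` (range -1000 to 1000, where -1000 exempts the process entirely); the kernel's resulting badness score is readable in `oom_score`. A sketch on the current process; `myapp` in the comment is a placeholder name:

```shell
# Current OOM badness score (higher means killed first).
cat /proc/self/oom_score

# Adjustment added to the score; writable with appropriate privileges.
cat /proc/self/oom_score_adj

# e.g., protect a service:  echo -500 > /proc/$(pidof myapp)/oom_score_adj
```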

Viewing Memory Usage

- `free` – overall system memory.
- `top` / `ps` – per-process memory (VIRT, RES, SHR, %MEM).

Buffers vs. Cache

Buffers cache raw disk blocks; cache stores file data. Both accelerate I/O but consume RAM that can be reclaimed when needed.
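Both values come from `/proc/meminfo`, which `free` summarizes (a read-only sketch; `SReclaimable` is included because newer `free` versions count it into the buff/cache column):

```shell
# Buffers: cached raw block-device data; Cached: page cache for file data.
grep -E '^(Buffers|Cached|SReclaimable):' /proc/meminfo
```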

Cache Hit Rate

Higher cache hit rates mean more requests are served from RAM, improving performance. Tools such as `cachestat` and `cachetop` (both from the BCC toolkit) and `pcstat` can measure hit rates.

Direct I/O (O_DIRECT)

When a program opens a file with `O_DIRECT`, the kernel bypasses the page cache, so every read goes to disk rather than being served from memory, which slows reads when the data would otherwise have been cached. To confirm a process is using direct I/O, attach `strace` and look for the `O_DIRECT` flag in its `open`/`openat` calls:

```shell
strace -p $(pgrep app)
```

Memory Leaks

Leaks occur when allocated heap memory is never freed; related bugs such as out-of-bounds accesses can corrupt memory or crash the process. Use BCC's `memleak` tool to trace outstanding allocations and identify the leaking call stacks:

```shell
/usr/share/bcc/tools/memleak -a -p $(pidof app)
```

Swap

When RAM is scarce, Linux swaps anonymous pages out to disk. Swap aggressiveness is tuned via /proc/sys/vm/swappiness (0-100). On NUMA systems, a single node can run out of memory and start swapping even while the machine as a whole still appears to have free memory.
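The current value can be read straight from procfs; changing it requires root (a minimal sketch; the value 10 in the comment is just an example of a common low setting):

```shell
# 0-100: lower values make the kernel prefer reclaiming page cache
# over swapping out anonymous pages.
cat /proc/sys/vm/swappiness

# Lower it at runtime (root required); persist it in /etc/sysctl.conf:
# sysctl -w vm.swappiness=10
```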

Check swap usage with `free`, and monitor it over time with `sar -r -S` or `vmstat`.

Quick Memory Bottleneck Analysis

- Start with `free` and `top` for a high-level view.
- Use `vmstat` and `pidstat` over time to spot trends.
- Drill down with allocation analysis, cache inspection, and per-process diagnostics.

Common recommendations: avoid swap when possible, lower `swappiness`, use memory pools or HugePages, leverage caches, apply cgroup limits, and adjust OOM scores for critical services.

Source: https://www.ctq6.cn/linux%E6%80%A7%E8%83%BD%E4%BC%98%E5%8C%96/

Tags: Memory Management, CPU Optimization, System Monitoring, Performance Tools, Linux Performance

Written by Efficient Ops. This public account is maintained by Xiaotianguo and friends, regularly publishing widely read original technical articles. We focus on operations transformation and accompany you throughout your operations career, growing together.
