Cache, Prefetching, False Sharing, Pipeline, and Data Dependency: Benchmarks and Optimizations in Rust
The article shows how row-major vs. column-major traversal, random access, cache-set conflicts, false sharing, branch-prediction failures, and loop-carried data dependencies each degrade performance. It ties these Rust (and C++) code patterns to CPU cache behavior, prefetching, pipeline stalls, and vectorization limits, and demonstrates every claim with runnable benchmarks.
This article explains why certain code patterns are slower by linking them to hardware concepts such as CPU caches, cache lines, prefetchers, cache associativity, false sharing, pipeline stalls, and data dependencies. It uses a series of runnable benchmarks written in Rust (and some C++) to illustrate each point and to build a simple mental model of how hardware affects performance.
Cache basics: Two functions, row_major_traversal and column_major_traversal, iterate over a square 2-D vector either row-wise or column-wise. Benchmarks show that row-wise traversal is about ten times faster because the CPU reads a whole cache line (typically 64 bytes) and the next elements of the same row are already in cache, while column-wise accesses cause frequent cache misses.
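A minimal sketch of the two traversal orders (the matrix size, the summation body, and the `main` harness are illustrative assumptions, not the article's exact benchmark code):

```rust
// Illustrative sketch: same work, different access order. The matrix
// size and the summation are assumptions, not the article's benchmark.
const N: usize = 1024;

fn row_major_traversal(m: &Vec<Vec<u64>>) -> u64 {
    let mut sum = 0;
    for row in 0..N {
        for col in 0..N {
            // Consecutive elements share a 64-byte cache line, so most
            // accesses after the first hit in cache.
            sum += m[row][col];
        }
    }
    sum
}

fn column_major_traversal(m: &Vec<Vec<u64>>) -> u64 {
    let mut sum = 0;
    for col in 0..N {
        for row in 0..N {
            // Each access jumps to a different row, so it likely lands
            // on a cache line that has not been loaded yet.
            sum += m[row][col];
        }
    }
    sum
}

fn main() {
    let m = vec![vec![1u64; N]; N];
    // Both compute the same result; only the memory access order differs.
    assert_eq!(row_major_traversal(&m), column_major_traversal(&m));
    println!("sum = {}", row_major_traversal(&m));
}
```

Timing the two functions with a benchmark harness such as criterion should reproduce the roughly 10x gap described above.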
Prefetcher: When the access pattern is random, the hardware prefetcher cannot predict which data will be needed, leading to many cache misses. A benchmark comparing random_access with column_major confirms that random access is roughly twice as slow. The article cites Wikipedia's distinction between hardware-based and software-based prefetching.
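One way to sketch the random-access case without external crates is to pre-shuffle an index vector with a small linear congruential generator standing in for a real RNG (the shuffle helper and constants are assumptions, not the article's code):

```rust
// Sketch: a shuffled index vector defeats the hardware prefetcher,
// because the next load address is unpredictable. The LCG-based
// Fisher-Yates shuffle is an assumption to avoid the `rand` crate.
fn shuffled_indices(n: usize) -> Vec<usize> {
    let mut idx: Vec<usize> = (0..n).collect();
    let mut state: u64 = 0x2545_F491_4F6C_DD1D;
    for i in (1..n).rev() {
        state = state.wrapping_mul(6364136223846793005).wrapping_add(1);
        let j = (state >> 33) as usize % (i + 1);
        idx.swap(i, j);
    }
    idx
}

fn random_access(data: &[u64], idx: &[usize]) -> u64 {
    // Unpredictable addresses: the prefetcher cannot run ahead.
    idx.iter().map(|&i| data[i]).sum()
}

fn sequential_access(data: &[u64]) -> u64 {
    // Predictable stride-1 addresses: prefetched well in advance.
    data.iter().sum()
}

fn main() {
    let data = vec![1u64; 1 << 20];
    let idx = shuffled_indices(data.len());
    // Identical result, very different cache-miss profile.
    assert_eq!(random_access(&data, &idx), sequential_access(&data));
}
```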
Cache associativity: Iterating an array with step sizes that are powers of two can cause cache-set conflicts. A benchmark that varies the step size shows performance spikes when the step aligns with the cache's set-associative mapping. The article explains the three common mapping schemes, fully associative, direct-mapped, and n-way set-associative, and gives concrete numbers for an AMD Ryzen 7 4700U (8-way L1 cache, 64-byte lines, 2048 lines total).
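The varying-step benchmark can be sketched as follows (the array size and the particular step values are assumptions; a real run would time each call and look for spikes at power-of-two strides):

```rust
// Sketch: touch every `step`-th element of a large array. When `step`
// is a power of two that matches the cache's set mapping, successive
// addresses map to the same cache set and evict each other.
fn strided_sum(data: &[u64], step: usize) -> u64 {
    let mut sum = 0;
    let mut i = 0;
    while i < data.len() {
        sum += data[i];
        i += step;
    }
    sum
}

fn main() {
    let data = vec![1u64; 1 << 22];
    // In a real benchmark, each stride would be timed; power-of-two
    // strides like 512 tend to show the conflict-miss spike.
    for step in [1, 7, 64, 512] {
        println!("step {:4}: touched {} elements", step, strided_sum(&data, step));
    }
}
```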
False sharing: Four threads increment separate atomic variables (v) in share, but because the variables reside on the same cache line they still contend, producing a false-sharing penalty. Aligning each variable to a cache-line boundary with #[repr(align(64))] (or the language-specific equivalent) eliminates the contention. Benchmarks show that the aligned version (non_share) is almost twice as fast as the unaligned version.
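The aligned variant can be sketched like this (the counter struct, thread count, and iteration count are assumptions; only the #[repr(align(64))] technique and the non_share name come from the article):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// #[repr(align(64))] pads each counter to its own 64-byte cache line,
// so the four threads no longer ping-pong a shared line between cores.
#[repr(align(64))]
struct PaddedCounter(AtomicU64);

fn non_share(iters: u64) -> u64 {
    let counters: Vec<PaddedCounter> =
        (0..4).map(|_| PaddedCounter(AtomicU64::new(0))).collect();
    thread::scope(|s| {
        for c in &counters {
            s.spawn(move || {
                for _ in 0..iters {
                    // Each thread owns its line exclusively: no false
                    // sharing, no cross-core coherence traffic.
                    c.0.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    counters.iter().map(|c| c.0.load(Ordering::Relaxed)).sum()
}

fn main() {
    println!("total = {}", non_share(100_000));
}
```

Removing the #[repr(align(64))] attribute packs the four atomics onto one or two cache lines and restores the contention the benchmark measures.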
Pipeline and branch prediction: The article demonstrates that iterating over a vector of trait objects in a sorted order is about 2.67x faster than iterating over a shuffled (unsorted) order. The slowdown is attributed to frequent branch-prediction failures, which flush the CPU pipeline and force re-fetching of instructions.
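A minimal sketch of the trait-object setup (the trait, the two concrete types, and the fold are assumptions; the article's actual types and workload may differ): when the vector is grouped by concrete type, the indirect call through the vtable always jumps to the same target for long runs, so the branch predictor rarely misses.

```rust
// Sketch: dynamic dispatch through Box<dyn Op> is an indirect branch.
// Grouping the vector by concrete type makes the branch target
// predictable; shuffling it causes constant mispredictions.
trait Op {
    fn apply(&self, x: u64) -> u64;
}

struct Add1;
struct Mul2;

impl Op for Add1 {
    fn apply(&self, x: u64) -> u64 { x + 1 }
}
impl Op for Mul2 {
    fn apply(&self, x: u64) -> u64 { x * 2 }
}

fn run(ops: &[Box<dyn Op>]) -> u64 {
    // Each iteration dispatches through the vtable.
    ops.iter().fold(1u64, |acc, op| op.apply(acc))
}

fn main() {
    // "Sorted": all Add1 first, then all Mul2 -> long predictable runs.
    let mut sorted: Vec<Box<dyn Op>> = Vec::new();
    for _ in 0..8 { sorted.push(Box::new(Add1)); }
    for _ in 0..8 { sorted.push(Box::new(Mul2)); }
    println!("result = {}", run(&sorted));
}
```

Shuffling `sorted` before calling `run` gives the same result but, at benchmark scale, exercises the misprediction penalty the article measures.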
Data dependency: Two functions, dependent and independent, perform similar arithmetic on three vectors but differ in the order of operations. The dependent version creates loop-carried dependencies, preventing the compiler from vectorizing the loop, while the independent version can be auto-vectorized. Benchmarks show a three-fold speed difference, and the article includes the generated assembly to illustrate the vectorized vs. scalar code paths.
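The two shapes can be sketched as follows (the function names match the article, but the exact arithmetic here is an illustrative assumption): in the dependent loop each iteration reads a value the previous iteration wrote, so the compiler must keep it scalar; in the independent loop every iteration is self-contained and SIMD-friendly.

```rust
// Sketch: loop-carried dependency vs. independent iterations. The
// arithmetic is illustrative, not the article's exact code.
fn dependent(a: &mut [u64], b: &[u64]) {
    for i in 1..a.len() {
        // a[i] needs a[i - 1] from the previous iteration: the loop
        // cannot be vectorized, each step waits for the last.
        a[i] = a[i - 1] + b[i];
    }
}

fn independent(a: &mut [u64], b: &[u64], c: &[u64]) {
    for i in 0..a.len() {
        // No cross-iteration dependency: the compiler can process
        // several lanes at once with SIMD instructions.
        a[i] = b[i] + c[i];
    }
}

fn main() {
    let b = vec![1u64; 8];
    let c = vec![2u64; 8];

    let mut a1 = vec![0u64; 8];
    dependent(&mut a1, &b); // running sums: 0, 1, 2, ..., 7

    let mut a2 = vec![0u64; 8];
    independent(&mut a2, &b, &c); // elementwise: all 3s

    println!("dependent tail = {}, independent tail = {}", a1[7], a2[7]);
}
```

Compiling with optimizations and inspecting the assembly (for example on the Compiler Explorer) shows packed SIMD instructions only for the independent version.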
Throughout the article, the author provides complete source listings for each benchmark (Rust code for traversal, random access, false sharing, pipeline, and dependency tests, as well as equivalent C++ snippets). The code is wrapped in ... tags to preserve formatting.
Tencent Technical Engineering
Official account of Tencent Technology. A platform for publishing and analyzing Tencent's technological innovations and cutting-edge developments.