
Performance Optimization Techniques for High‑Throughput Backend Systems

This article outlines seven practical performance-optimization techniques for high-throughput backends, including replacing Protobuf messages with native C++ classes, adopting cache-friendly data structures, using jemalloc/tcmalloc, implementing lock-free double buffering, simplifying structs for specific scenarios, and leveraging profiling tools, while stressing balanced, incremental improvements.

Tencent Cloud Developer

Performance optimization is an essential means of reducing cost and increasing efficiency in backend services. With the right timing and sound methods, it can improve system performance while also cleaning up legacy code. Performance-related questions also come up in almost every technical interview.

The article presents a practical guide consisting of seven parts:

01. Replace Protobuf with C++ Class

Protobuf's allocation pattern creates many small objects, leading to memory fragmentation and slower destructors. By defining a native C++ class (e.g., class ParamHitInfo) and providing explicit getters, setters, and clear methods, the author achieved up to a three-fold speedup. A sample Protobuf definition and the equivalent C++ class are shown, followed by a benchmark comparing the copy-construction costs of ParamHit (Protobuf) and ParamHitInfo (plain class).
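The article's actual ParamHitInfo schema is not reproduced here, but the shape of such a replacement class can be sketched as follows; field names (expt_id, param_name) are illustrative assumptions, not the original definition:

```cpp
#include <cstdint>
#include <string>

// Hypothetical sketch of a plain C++ class standing in for a protobuf
// message. All members live inline in the object (no per-field heap
// allocations beyond the string), so copy and destruction are cheap.
class ParamHitInfo {
 public:
  uint64_t expt_id() const { return expt_id_; }
  void set_expt_id(uint64_t v) { expt_id_ = v; }

  const std::string& param_name() const { return param_name_; }
  void set_param_name(const std::string& v) { param_name_ = v; }

  // Resets the object for reuse, mirroring protobuf's Clear().
  void clear() {
    expt_id_ = 0;
    param_name_.clear();
  }

 private:
  uint64_t expt_id_ = 0;
  std::string param_name_;
};
```

Because the class is trivially copyable apart from the string, the compiler-generated copy constructor avoids the per-field bookkeeping a generated protobuf message performs.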

02. Cache‑Friendly Data Structures

The author questions the common belief that hash tables are always faster than arrays. Because arrays have better cache locality, they can outperform hash tables in many scenarios. The original hash-map-based HitContext implementation is compared with a cache-friendly version that stores key-value pairs in a vector and uses a custom lookup. Benchmarks show a significant latency reduction for the cache-friendly version.
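The article's exact HitContext layout is not shown here, so the following is a minimal sketch of the idea under assumed names (Insert, Find): contiguous key-value storage scanned linearly, which for small collections often beats an unordered_map because every probe stays in cache:

```cpp
#include <cstdint>
#include <string>
#include <utility>
#include <vector>

// Cache-friendly lookup table: key-value pairs stored contiguously,
// searched with a linear scan instead of hashing.
class HitContext {
 public:
  void Insert(uint64_t key, std::string value) {
    items_.emplace_back(key, std::move(value));
  }

  // Linear scan over contiguous memory; returns nullptr when absent.
  const std::string* Find(uint64_t key) const {
    for (const auto& kv : items_) {
      if (kv.first == key) return &kv.second;
    }
    return nullptr;
  }

  std::size_t size() const { return items_.size(); }

 private:
  std::vector<std::pair<uint64_t, std::string>> items_;
};
```

The trade-off is O(n) lookup versus O(1) amortized for a hash map, so this only wins while n stays small enough that the scan fits in a few cache lines; the break-even point should be measured with a benchmark, as the article does.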

03. Use jemalloc/tcmalloc Instead of Default malloc

The default malloc behind the standard C++ allocators can cause memory fragmentation, poor cache friendliness, and global-lock contention. Adding jemalloc (or tcmalloc) as a dependency of the build target improves performance by about 20% with minimal development effort. The article includes a cc_library snippet showing how to link jemalloc.
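The article's actual build target is not reproduced here; a minimal Bazel sketch of the idea, with a placeholder dependency label, might look like:

```
cc_library(
    name = "my_service",
    srcs = ["my_service.cc"],
    deps = [
        # Hypothetical label; the real jemalloc target path
        # depends on how the workspace vendors it.
        "//third_party/jemalloc",
    ],
)
```

As an alternative that requires no rebuild, jemalloc can often be injected at launch time via LD_PRELOAD pointing at the installed libjemalloc shared library (the exact path varies by distribution).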

04. Lock‑Free Data Structures

For extremely high request rates (e.g., 2.6 billion API calls per second), a double-buffer lock-free design is used. The structure expt_api_new_shm holds two buffers; one is read while the other is written, and the switch between them is performed atomically. Functions such as SwitchNewShmMemToWrite, SwitchNewShmMemToWriteDone, and SwitchNewShmMemToRead illustrate the workflow. This design eliminates lock contention at the cost of doubled memory usage.
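The article's shared-memory implementation is not reproduced here; the core single-writer double-buffer mechanism can be sketched in-process as follows (class and method names are illustrative, not the article's API):

```cpp
#include <array>
#include <atomic>

// Double buffer: readers always see a fully written snapshot, while a
// single writer fills the inactive buffer and then atomically publishes
// it by flipping the read index. No locks are taken on either path.
template <typename T>
class DoubleBuffer {
 public:
  // Writer side: the buffer currently hidden from readers.
  T& WriteSlot() {
    return buf_[1 - read_idx_.load(std::memory_order_acquire)];
  }

  // Writer side: publish the freshly written buffer (release pairs with
  // the acquire in Read, so readers see the completed writes).
  void SwitchToRead() {
    read_idx_.store(1 - read_idx_.load(std::memory_order_relaxed),
                    std::memory_order_release);
  }

  // Reader side: the currently published snapshot.
  const T& Read() const {
    return buf_[read_idx_.load(std::memory_order_acquire)];
  }

 private:
  std::array<T, 2> buf_{};
  std::atomic<int> read_idx_{0};
};
```

One caveat this sketch glosses over: before reusing the retired buffer, a production implementation must ensure no reader still holds a reference into it (e.g., via a grace period or reader counters), which is part of what the article's Switch...Done step handles.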

05. Scenario‑Specific Handling

In a “dye” scenario where only experiment ID, group ID, and bucket information are needed, the original expt_param_item struct (with many unused fields) is replaced by a lightweight DyeHitInfo struct. The simplified format reduces memory footprint and improves processing speed, as shown by benchmark graphs.
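The fields of the article's DyeHitInfo are described but not listed here; a hypothetical sketch of such a trimmed-down struct, keeping only the three pieces of data the dye path needs, could be:

```cpp
#include <cstdint>

// Illustrative compact struct replacing a wide expt_param_item for the
// dye path. Field names and widths are assumptions, not the original.
struct DyeHitInfo {
  uint32_t expt_id;   // experiment ID
  uint32_t group_id;  // group ID
  uint32_t bucket;    // bucket information
};

// Three 4-byte fields pack with no padding, so an array of these is
// dense and cache-friendly.
static_assert(sizeof(DyeHitInfo) == 12, "compact layout expected");
```

Shrinking the per-item footprint both reduces total memory and lets more items fit per cache line, which is where the benchmarked speedup comes from.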

06. Use Performance Testing Tools

The author recommends tools such as Linux perf, gprof, Valgrind, strace, Godbolt for assembly inspection, and the FlameGraph repository for visualizing hotspots. These tools help identify bottlenecks before and after optimization.
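As one concrete workflow from that toolbox, perf samples can be turned into a flame graph with the scripts from the FlameGraph repository (script paths below assume a local checkout; the target PID is a placeholder):

```shell
# Sample the running service with call stacks for 30 seconds
# (requires perf and appropriate permissions).
perf record -g -p <pid> -- sleep 30

# Fold the raw stacks and render an interactive SVG flame graph.
perf script | ./stackcollapse-perf.pl | ./flamegraph.pl > hotspots.svg
```

Capturing one flame graph before and one after a change makes it easy to confirm that the intended hotspot actually shrank.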

07. Summary

Additional optimization ideas include choosing proper algorithms, avoiding large object copies, separating I/O from computation, and careful branch prediction. Over‑optimization should be avoided; maintainability must be balanced against performance gains. Continuous monitoring and incremental improvements are key to long‑term success.

Tags: Performance Optimization, backend development, C++, Protobuf, cache-friendly, jemalloc, lock-free
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
