
Performance Optimization Techniques for Backend Systems: Replacing Protobuf with C++ Classes, Cache‑Friendly Structures, jemalloc, and Lock‑Free Data Structures

The article presents practical backend performance optimization methods—including substituting Protobuf with native C++ classes, employing cache‑friendly data structures, integrating jemalloc/tcmalloc, using lock‑free double‑buffer designs, and tailoring data formats—to achieve up to three‑fold speed improvements and significant latency reductions.

FunTester

Performance optimization is essential for reducing costs and improving efficiency in high‑traffic backend services; the techniques below are also frequent topics in backend interviews.

1. Replace Protobuf with C++ classes – Protobuf’s arena allocator can cause memory fragmentation and slower destructors. Native C++ classes avoid these issues. The article provides a Protobuf definition and an equivalent C++ class implementation, then benchmarks copying 1,000 objects, showing a 3× speedup for class‑based structures.

message Param {
    optional string name = 1;
    optional string value = 2;
}

class ParamHitInfo {
public:
    ParamHitInfo();
    ~ParamHitInfo() = default;
    /* getters/setters */
};

2. Cache‑friendly data structures – Hash tables are not always faster than arrays, because arrays have better cache locality. The article compares std::unordered_map against a vector of std::pair, measuring lookup times and showing that array‑based access can outperform hash‑based lookup in certain scenarios.

class HitContext {
public:
    inline void update_hash_key(const std::string& key, const std::string& val) {
        hash_keys_[key] = val;
    }
};

3. Use jemalloc/tcmalloc – Replacing the default allocator with jemalloc reduces memory fragmentation, improves cache friendliness, and cuts allocator lock contention through per‑thread arenas. Adding the dependency in the build system (e.g., Bazel) is usually sufficient. Benchmarks in the article indicate a performance gain of over 20%.

cc_library(
    name = "mmexpt_dye_api",
    srcs = ["mmexpt_dye_api.cc"],
    deps = ["//mm3rd/jemalloc:jemalloc"],
    copts = ["-O3", "-std=c++11"],
)

4. Lock‑free double‑buffer design – For extremely high QPS (e.g., 2.6 billion calls/s), a double‑buffer structure separates read and write buffers, eliminating locks. The article outlines the struct definition, switch functions, and discusses memory trade‑offs.

struct expt_api_new_shm {
    volatile int* p_mem_switch;  /* 0: uninit, 1: mem1, 2: mem2 */
};
void SwitchNewShmMemToWrite(expt_api_new_shm* pstShmData) { /* select write buffer */ }
void SwitchNewShmMemToRead(expt_api_new_shm* pstShmData)  { /* select read buffer */ }

5. Tailor data formats for specific scenarios – By stripping unnecessary fields from experiment parameter structs, the article reduces payload size and improves processing speed. A simplified DyeHitInfo struct keeps only expt_id, group_id, and bucket_src, yielding measurable performance gains.

struct DyeHitInfo { int expt_id, group_id; uint64_t bucket_src; };

6. Performance testing tools – Recommended tools include Linux perf, gprof, Valgrind, strace, Godbolt for assembly inspection, and FlameGraph for visualizing profiling data.

In summary, continuous monitoring and incremental optimization—while avoiding over‑optimization that harms maintainability—are key to sustaining high‑performance backend systems.

Tags: Performance Optimization, Backend Development, C++, Protobuf, cache-friendly, jemalloc, lock-free