Performance Optimization Strategies for High‑Throughput Backend Services
The article outlines practical, continuous performance‑optimization tactics for high‑throughput back‑end services—replacing Protobuf with lightweight C++ classes, using cache‑friendly data structures, adopting jemalloc/tcmalloc, employing lock‑free double buffers, tailoring data formats, and leveraging profiling tools—to achieve multi‑fold speedups while balancing maintainability.
Performance optimization is an essential means to reduce cost and increase efficiency. Applying the right techniques at the right time can improve system performance and also provide an opportunity to clean up legacy code. In interviews, performance‑related questions are almost always asked.
Optimization should be continuous: guard against premature or excessive optimization, and always evaluate its ROI. This article summarizes common performance problems and proposes several optimization solutions, ranging from a 3× speedup to handling 26 billion API calls per second. Three useful performance-testing tools are also recommended.
1. Replace Protobuf with C++ Classes
When an API is called extremely frequently, Protobuf messages can be slower than plain C++ classes: each message incurs many small heap allocations (unless arena allocation is used), contributes to memory fragmentation, and runs nontrivial destructors. The article provides a Protobuf definition and a corresponding hand-written C++ class, followed by a GoogleTest benchmark showing the class-based version is about three times faster.
message Param {
optional string name = 1;
optional string value = 2;
}
message ParamHit {
enum Type { Unknown = 0; WhiteList = 1; LaunchLayer = 2; BaseAB = 3; DefaultParam = 4; }
optional Param param = 1;
optional uint64 group_id = 2;
optional uint64 expt_id = 3;
optional uint64 launch_layer_id = 4;
optional string hash_key_used = 5;
optional string hash_key_val_used = 6;
optional Type type = 7;
optional bool is_hit_mbox = 8;
}

class ParamHitInfo {
public:
class Param {
public:
Param() = default;
~Param() = default;
const std::string& name() const { return name_; }
void set_name(const std::string& name) { name_ = name; }
void clear_name() { name_.clear(); }
const std::string& value() const { return value_; }
void set_value(const std::string& value) { value_ = value; }
void clear_value() { value_.clear(); }
void Clear() { clear_name(); clear_value(); }
private:
std::string name_, value_;
};
// ... other members and methods ...
};

The benchmark creates 1,000 hits, copies them repeatedly, and measures the elapsed time. The results show a clear advantage for the class-based representation.
2. Cache‑Friendly Data Structures
Hash table look‑ups are not always faster than array scans because arrays have better cache locality. The article presents a naive unordered_map implementation and an optimized version that stores SNS hash keys in a vector of pair<uint32_t, uint32_t> for fast sequential access. Micro‑benchmarks demonstrate the vector‑based version is significantly faster.
class HitContext {
public:
inline void update_hash_key(const std::string& key, const std::string& val) {
hash_keys_[key] = val;
}
const std::string* search_hash_key(const std::string& key) const {
auto it = hash_keys_.find(key);
return it != hash_keys_.end() ? &it->second : nullptr;
}
private:
std::unordered_map<std::string, std::string> hash_keys_;
};

class HitContext {
public:
inline void update_hash_key(const std::string& key, const std::string& val) {
if (Misc::IsSnsHashKey(key)) {
auto sns_id = Misc::FastAtoi(key.c_str() + Misc::SnsHashKeyPrefix().size());
sns_hash_keys_.emplace_back(sns_id, Misc::LittleEndianBytesToUInt32(val));
return;
}
hash_keys_[key] = val;
}
std::string search_hash_key(const std::string& key, bool& find) const {
if (Misc::IsSnsHashKey(key)) {
auto sns_id = Misc::FastAtoi(key.c_str() + Misc::SnsHashKeyPrefix().size());
auto it = std::find_if(sns_hash_keys_.rbegin(), sns_hash_keys_.rend(),
[sns_id](const auto& v){ return v.first == sns_id; });
find = it != sns_hash_keys_.rend();
return find ? Misc::UInt32ToLittleEndianBytes(it->second) : "";
}
auto it = hash_keys_.find(key);
find = it != hash_keys_.end();
return find ? it->second : "";
}
private:
std::unordered_map<std::string, std::string> hash_keys_;
std::vector<std::pair<uint32_t, uint32_t>> sns_hash_keys_;
};

3. Use jemalloc/tcmalloc Instead of Default malloc
The default malloc (sitting behind the standard STL allocators) can suffer from memory fragmentation, poor cache behavior, and global lock contention under multi-threaded load. Adding jemalloc as a dependency (e.g., via a Bazel cc_library rule) speeds up allocation and reduces fragmentation. Benchmarks on a real production workload showed a performance gain of over 20% with minimal code changes.
cc_library(
name = "mmexpt_dye_api",
srcs = ["mmexpt_dye_api.cc"],
hdrs = ["mmexpt_dye_api.h"],
deps = ["//mm3rd/jemalloc:jemalloc"],
copts = ["-O3", "-std=c++11"],
visibility = ["//visibility:public"],
)

4. Lock‑Free Data Structures (Double Buffer)
For extremely high‑throughput scenarios (e.g., 26 billion API calls per second), a double‑buffer design separates read and write paths, eliminating locks. The article defines a shared memory structure with a switch flag and shows functions to initialize, reset, and switch buffers safely.
struct expt_api_new_shm {
void* p_shm_data;
volatile int* p_mem_switch; // 0: uninit, 1: mem1, 2: mem2
uint32_t* p_crc_sum;
expt_new_context* p_new_context;
parameter2business* p_param2business;
char* p_business_cache;
HashTableWithCache hash_table;
};
int InitExptNewShmData(expt_api_new_shm* pstShmData, void* pData) { /* ... */ }
void SwitchNewShmMemToWrite(expt_api_new_shm* pstShmData) { /* ... */ }
void SwitchNewShmMemToWriteDone(expt_api_new_shm* pstShmData) { /* ... */ }
void SwitchNewShmMemToRead(expt_api_new_shm* pstShmData) { /* ... */ }

The double‑buffer approach provides lock‑free reads while writes occur on the alternate buffer, at the cost of doubled memory usage.
5. Tailor Solutions for Specific Scenarios
Sometimes the data format itself is the bottleneck. In a “dye” scenario, the original expt_param_item struct carried many fields that the task never used. Redesigning the struct to keep only expt_id, group_id, and bucket_src shrank the memory footprint and sped up processing dramatically, as the benchmark graphs show.
struct expt_param_item { /* many fields ... */ };
struct DyeHitInfo {
int expt_id, group_id;
uint64_t bucket_src;
DyeHitInfo() {}
DyeHitInfo(int e, int g, uint64_t b) : expt_id(e), group_id(g), bucket_src(b) {}
bool operator<(const DyeHitInfo& hit) const { /* compare logic */ }
bool operator==(const DyeHitInfo& hit) const { return expt_id==hit.expt_id && group_id==hit.group_id && bucket_src==hit.bucket_src; }
std::string ToString() const { char buf[128]; snprintf(buf, sizeof(buf), "expt_id: %d, group_id: %d, bucket_src: %lu", expt_id, group_id, static_cast<unsigned long>(bucket_src)); return std::string(buf); }
};

6. Effective Performance Testing Tools
Common Linux tools such as perf, gprof, Valgrind, and strace help locate bottlenecks. The article also recommends online resources such as Compiler Explorer for inspecting generated assembly, and the FlameGraph project for visualizing stack traces.
7. Summary
Beyond the techniques discussed, further gains can be achieved by choosing appropriate algorithms, avoiding unnecessary copies, separating I/O from computation, and leveraging branch prediction. Optimization is an ongoing process; avoid over‑optimizing at the expense of code maintainability, and always weigh the performance benefit against the effort required. Continuous monitoring and incremental improvements are key to sustaining high‑performance backend services.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.