Performance Optimization Strategies for High‑Throughput Backend Services
The article outlines practical, continuous performance‑optimization tactics for high‑throughput back‑end services—replacing Protobuf with lightweight C++ classes, using cache‑friendly data structures, adopting jemalloc/tcmalloc, employing lock‑free double buffers, tailoring data formats, and leveraging profiling tools—to achieve multi‑fold speedups while balancing maintainability.
Performance optimization is an essential means to reduce cost and increase efficiency. Applying the right techniques at the right time can improve system performance and also provide an opportunity to clean up legacy code. In interviews, performance‑related questions are almost always asked.
Optimization should be continuous: guard against premature or excessive optimization, and always evaluate its ROI. This article summarizes common performance problems and proposes several optimization solutions, ranging from a 3× speedup to handling 26 billion API calls per second. Three useful performance-testing tools are also recommended.
1. Replace Protobuf with C++ Classes
When an API is called extremely frequently, Protobuf messages can be slower than plain C++ classes: each message incurs many small heap allocations (unless arena allocation is used), contributes to memory fragmentation, and runs nontrivial destructors. The article provides a Protobuf definition and a corresponding hand-written C++ class, followed by a GoogleTest benchmark showing the class-based version is about three times faster.
message Param {
optional string name = 1;
optional string value = 2;
}
message ParamHit {
enum Type { Unknown = 0; WhiteList = 1; LaunchLayer = 2; BaseAB = 3; DefaultParam = 4; }
optional Param param = 1;
optional uint64 group_id = 2;
optional uint64 expt_id = 3;
optional uint64 launch_layer_id = 4;
optional string hash_key_used = 5;
optional string hash_key_val_used = 6;
optional Type type = 7;
optional bool is_hit_mbox = 8;
}

class ParamHitInfo {
public:
class Param {
public:
Param() = default;
~Param() = default;
const std::string& name() const { return name_; }
void set_name(const std::string& name) { name_ = name; }
void clear_name() { name_.clear(); }
const std::string& value() const { return value_; }
void set_value(const std::string& value) { value_ = value; }
void clear_value() { value_.clear(); }
void Clear() { clear_name(); clear_value(); }
private:
std::string name_, value_;
};
// ... other members and methods ...
};

The benchmark creates 1,000 hits, copies them repeatedly, and measures the elapsed time. The results show a clear advantage for the class-based representation.
2. Cache‑Friendly Data Structures
Hash table look‑ups are not always faster than array scans because arrays have better cache locality. The article presents a naive unordered_map implementation and an optimized version that stores SNS hash keys in a vector of pair<uint32_t, uint32_t> for fast sequential access. Micro‑benchmarks demonstrate the vector‑based version is significantly faster.
class HitContext {
public:
inline void update_hash_key(const std::string& key, const std::string& val) {
hash_keys_[key] = val;
}
const std::string* search_hash_key(const std::string& key) const {
auto it = hash_keys_.find(key);
return it != hash_keys_.end() ? &it->second : nullptr;
}
private:
std::unordered_map<std::string, std::string> hash_keys_;
};

class HitContext {
public:
inline void update_hash_key(const std::string& key, const std::string& val) {
if (Misc::IsSnsHashKey(key)) {
auto sns_id = Misc::FastAtoi(key.c_str() + Misc::SnsHashKeyPrefix().size());
sns_hash_keys_.emplace_back(sns_id, Misc::LittleEndianBytesToUInt32(val));
return;
}
hash_keys_[key] = val;
}
std::string search_hash_key(const std::string& key, bool& find) const {
if (Misc::IsSnsHashKey(key)) {
auto sns_id = Misc::FastAtoi(key.c_str() + Misc::SnsHashKeyPrefix().size());
auto it = std::find_if(sns_hash_keys_.rbegin(), sns_hash_keys_.rend(),
[sns_id](const auto& v){ return v.first == sns_id; });
find = it != sns_hash_keys_.rend();
return find ? Misc::UInt32ToLittleEndianBytes(it->second) : "";
}
auto it = hash_keys_.find(key);
find = it != hash_keys_.end();
return find ? it->second : "";
}
private:
std::unordered_map<std::string, std::string> hash_keys_;
std::vector<std::pair<uint32_t, uint32_t>> sns_hash_keys_;
};

3. Use jemalloc/tcmalloc Instead of Default malloc
The default malloc (sitting behind the standard STL allocators) can suffer from memory fragmentation, poor cache behavior, and global lock contention under multi-threaded load. Adding jemalloc as a dependency (e.g., via a Bazel cc_library rule) speeds up allocation and reduces fragmentation. Benchmarks on a real production workload showed a performance gain of over 20% with minimal code changes.
cc_library(
name = "mmexpt_dye_api",
srcs = ["mmexpt_dye_api.cc"],
hdrs = ["mmexpt_dye_api.h"],
deps = ["//mm3rd/jemalloc:jemalloc"],
copts = ["-O3", "-std=c++11"],
visibility = ["//visibility:public"],
)

4. Lock‑Free Data Structures (Double Buffer)
For extremely high‑throughput scenarios (e.g., 26 billion API calls per second), a double‑buffer design separates read and write paths, eliminating locks. The article defines a shared memory structure with a switch flag and shows functions to initialize, reset, and switch buffers safely.
struct expt_api_new_shm {
void* p_shm_data;
volatile int* p_mem_switch; // 0: uninit, 1: mem1, 2: mem2
uint32_t* p_crc_sum;
expt_new_context* p_new_context;
parameter2business* p_param2business;
char* p_business_cache;
HashTableWithCache hash_table;
};
int InitExptNewShmData(expt_api_new_shm* pstShmData, void* pData) { /* ... */ }
void SwitchNewShmMemToWrite(expt_api_new_shm* pstShmData) { /* ... */ }
void SwitchNewShmMemToWriteDone(expt_api_new_shm* pstShmData) { /* ... */ }
void SwitchNewShmMemToRead(expt_api_new_shm* pstShmData) { /* ... */ }

The double‑buffer approach provides lock‑free reads while writes occur on the alternate buffer, at the cost of doubled memory usage.
5. Tailor Solutions for Specific Scenarios
Sometimes the data format itself is the bottleneck. In a “dye” scenario, the original expt_param_item struct carried many fields that the task never used. Redesigning the struct to keep only expt_id, group_id, and bucket_src shrank the memory footprint and sped up processing dramatically, as the benchmark graphs show.
struct expt_param_item { /* many fields ... */ };
struct DyeHitInfo {
int expt_id, group_id;
uint64_t bucket_src;
DyeHitInfo() {}
DyeHitInfo(int e, int g, uint64_t b) : expt_id(e), group_id(g), bucket_src(b) {}
bool operator<(const DyeHitInfo& hit) const { /* compare logic */ }
bool operator==(const DyeHitInfo& hit) const { return expt_id==hit.expt_id && group_id==hit.group_id && bucket_src==hit.bucket_src; }
std::string ToString() const { char buf[128]; snprintf(buf, sizeof(buf), "expt_id: %d, group_id: %d, bucket_src: %lu", expt_id, group_id, static_cast<unsigned long>(bucket_src)); return std::string(buf); }
};

6. Effective Performance Testing Tools
Common Linux tools such as perf, gprof, Valgrind, and strace help locate bottlenecks. The article also recommends online resources such as Compiler Explorer for inspecting generated assembly, and the FlameGraph project for visualizing stack traces.
7. Summary
Beyond the techniques discussed, further gains can be achieved by choosing appropriate algorithms, avoiding unnecessary copies, separating I/O from computation, and leveraging branch prediction. Optimization is an ongoing process; avoid over‑optimizing at the expense of code maintainability, and always weigh the performance benefit against the effort required. Continuous monitoring and incremental improvements are key to sustaining high‑performance backend services.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.