How ByteDance Optimized Its Metrics Agent for 70% CPU Savings
This article details how ByteDance's cloud‑native observability team tackled performance bottlenecks in their metricserver2 Agent—reducing memory copies, merging tiny packets, applying SIMD for tag parsing, and switching compression libraries—to cut CPU usage by over 10% and memory usage by nearly 20% while handling petabyte‑scale metric data.
Observability applications let enterprises turn the characteristics of their data into competitive advantage. At ByteDance, metricserver2 (hereafter "the Agent") works with the time‑series database ByteTSD to collect user metric data at physical‑machine granularity.
Rapid business growth created two major technical challenges for the Agent: it is deployed on more than one million service nodes, and it must parse, aggregate, compress, and transmit metric data, making it a CPU‑ and memory‑intensive service that accounts for over 70% of the monitoring cost.
Agent Basic Architecture
The Agent consists of the following components (see the architecture diagram):
Receiver – listens on UDP sockets and receives metric data from SDKs.
Msg‑Parser – deserializes packets, discards malformed data, and stores points in Storage.
Storage – supports seven metric types.
Flusher – snapshots Storage each send interval, aggregates metrics, and encodes them.
Compress – compresses the encoded packets.
Sender – transmits data via HTTP or TCP.
Data Reception
Agent uses MsgPack for serialization. By switching from string copies to std::string_view during deserialization, the two copy steps (MsgPack object → vector) are eliminated: the data stays in the original receive buffer, reducing memory traffic.
<code>// Merge a small MsgPack fixarray packet (recv_buf) into a larger array16
// packet (merge_buf: an 0xdc marker followed by a big-endian uint16 count).
static inline bool tryMerge(std::string& merge_buf, std::string& recv_buf,
                            int msg_size, int merge_buf_cap) {
    uint16_t big_endian_len, host_endian_len, cur_msg_len;
    // Current element count of the merged packet (bytes 1..2, big-endian).
    memcpy(&big_endian_len, &merge_buf[1], sizeof(big_endian_len));
    host_endian_len = ntohs(big_endian_len);
    // The incoming packet must be a fixarray (0x90 | count).
    cur_msg_len = recv_buf[0] & 0x0f;
    if ((recv_buf[0] & 0xf0) != 0x90 ||
        merge_buf.size() + msg_size > merge_buf_cap ||
        host_endian_len + cur_msg_len > 0xffff) {  // array16 count limit
        return false;
    }
    host_endian_len += cur_msg_len;
    // Append the incoming elements, skipping the 1-byte fixarray header.
    merge_buf.append(recv_buf.begin() + 1, recv_buf.begin() + msg_size);
    // Write the updated element count back in network byte order.
    big_endian_len = htons(host_endian_len);
    memcpy(&merge_buf[1], &big_endian_len, sizeof(big_endian_len));
    return true;
}</code>
Merging many tiny packets into one larger packet reduces the number of tasks submitted to the asynchronous thread pool, cutting context‑switch overhead.
Data Processing
Tag parsing was a hotspot because each metric’s tags are concatenated into a string and then split by ‘|’ and ‘=’. A SIMD‑based parser dramatically speeds up key/value detection.
<code>// These intrinsics require SSE2 and <immintrin.h>.
#if defined(__SSE2__)
// Scan str[idx, end) for the '=' that terminates the current tag key.
// Returns the index of '=', or 0 if a separator ('|' or ' ') or the end
// of the input is reached first.
static size_t find_key_simd(const char *str, size_t end, size_t idx) {
    if (idx >= end) return 0;
    // Process 16 bytes per iteration.
    for (; idx + 16 <= end; idx += 16) {
        __m128i v = _mm_loadu_si128((const __m128i*)(str + idx));
        // Bitmask of tag-separator positions ('|' or ' ') ...
        __m128i is_tag = _mm_or_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8('|')),
                                      _mm_cmpeq_epi8(v, _mm_set1_epi8(' ')));
        // ... and of key/value delimiter positions ('=').
        __m128i is_kv = _mm_cmpeq_epi8(v, _mm_set1_epi8('='));
        int tag_bits = _mm_movemask_epi8(is_tag);
        int kv_bits = _mm_movemask_epi8(is_kv);
        // (kv_bits - 1) masks every position below the first '='; if a
        // separator lands there, the key ended without '='.
        bool has_tag_first = ((kv_bits - 1) & tag_bits) != 0;
        if (has_tag_first) return 0;
        if (kv_bits) return idx + __builtin_ctz(kv_bits);
    }
    // Scalar tail for the final (< 16-byte) chunk.
    for (; idx < end; ++idx) {
        if (str[idx] == '=') return idx;
        if (str[idx] == '|' || str[idx] == ' ') return 0;
    }
    return 0;
}
#endif</code>
To avoid repeated per-tag comparisons during map look‑ups, the team replaced the original TagSet (a vector of std::string) with a TagViewSet that stores all tag data in a single contiguous buffer and compares buffers directly.
<code>struct TagViewSet {
    std::vector<TagView> tags;   // per-tag views into tags_buffer
    std::string tags_buffer;     // all tag bytes stored contiguously
    bool operator==(const TagViewSet &other) const {
        if (tags.size() != other.tags.size()) return false;
        // One contiguous comparison instead of N per-tag comparisons.
        return tags_buffer == other.tags_buffer;
    }
};
</code>
Data Sending
Compression became a bottleneck as metric volume grew. The team benchmarked several algorithms (zlib‑cloudflare, zlib‑1.2.11, zstd, snappy, bzip2) and found that cloudflare‑optimized zlib reduced CPU cost to 37.5% of the official branch, while zstd offered better compression with lower CPU, and snappy delivered the lowest CPU when compression ratio is less critical.
In the short term the team switched to cloudflare‑zlib; in the longer term they plan to adopt zstd.
Conclusion
After deploying these optimizations, peak CPU usage dropped by 10.26% (average down 6.27%) and peak memory usage fell by 19.67% (average down 19.81%). Ongoing work includes profile‑guided optimization (PGO) and Clang ThinLTO to squeeze out further performance.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.