
How ByteDance Optimized Its Metrics Agent for 70% CPU Savings

This article details how ByteDance's cloud‑native observability team tackled performance bottlenecks in their metricserver2 Agent—reducing memory copies, merging tiny packets, applying SIMD for tag parsing, and switching compression libraries—to cut CPU usage by over 10% and memory usage by nearly 20% while handling petabyte‑scale metric data.

ByteDance Cloud Native

Observability applications enable enterprises to turn data characteristics into competitive advantage. At ByteDance, metricserver2 (referred to below as the Agent) works with the time‑series database ByteTSD to collect user metric data at physical‑machine granularity.

Rapid business growth created two major technical challenges for the Agent: it is deployed on more than one million service nodes, and it must parse, aggregate, compress, and transmit metric data, making it a CPU‑ and memory‑intensive service that accounts for over 70% of the monitoring cost.

Agent Basic Architecture

The Agent consists of the following components (see the architecture diagram):

Receiver – listens on UDP sockets and receives metric data from SDKs.

Msg‑Parser – deserializes packets, discards malformed data, and stores points in Storage.

Storage – supports seven metric types.

Flusher – snapshots Storage each send interval, aggregates metrics, and encodes them.

Compress – compresses the encoded packets.

Sender – transmits data via HTTP or TCP.
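The stages above form a straightforward pipeline: receive, parse and discard malformed data, store, then flush and encode. A minimal sketch of that flow, where all names and the trivial text encoding are illustrative assumptions rather than metricserver2 code:

```cpp
#include <string>
#include <vector>

// Illustrative pipeline stages (names are ours, not from metricserver2).
struct Point { std::string name; double value; };

// Msg-Parser: deserialize packets, dropping malformed entries.
std::vector<Point> parsePackets(const std::vector<std::string>& pkts) {
    std::vector<Point> out;
    for (const auto& p : pkts) {
        auto eq = p.find('=');
        if (eq == std::string::npos) continue;  // discard malformed data
        out.push_back({p.substr(0, eq), std::stod(p.substr(eq + 1))});
    }
    return out;
}

// Flusher: snapshot Storage and encode (here, a trivial text encoding).
std::string flushAndEncode(const std::vector<Point>& storage) {
    std::string buf;
    for (const auto& pt : storage)
        buf += pt.name + ':' + std::to_string(pt.value) + '\n';
    return buf;
}
```

The real Agent inserts a compression stage between encoding and sending, which the later sections optimize.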

Data Reception

The Agent uses MsgPack for serialization. By switching from string copies to std::string_view during deserialization, the two copy steps (MsgPack object → vector) are eliminated; data stays in the original receive buffer, reducing memory traffic.
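The zero-copy idea can be shown in isolation: instead of materializing each field as a std::string (one allocation and copy per field), keep std::string_view slices pointing into the receive buffer. This sketch uses our own helper name and a simple separator split rather than the actual MsgPack code:

```cpp
#include <string_view>
#include <vector>

// Split a buffer into fields without copying: each element is a view into
// `buf`, so no per-field allocation occurs. Illustrative helper, not the
// Agent's MsgPack deserializer.
std::vector<std::string_view> splitFields(std::string_view buf, char sep) {
    std::vector<std::string_view> fields;
    size_t start = 0;
    while (start <= buf.size()) {
        size_t pos = buf.find(sep, start);
        if (pos == std::string_view::npos) pos = buf.size();
        fields.push_back(buf.substr(start, pos - start));  // no copy
        start = pos + 1;
    }
    return fields;
}
```

The trade-off is lifetime: the views are only valid while the original receive buffer is alive, so the buffer must outlive every stage that reads the parsed fields.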

<code>// Merge a msgpack fixarray packet (recv_buf) into a growing array16
// packet (merge_buf); merge_buf[1..2] hold the big-endian element count.
static inline bool tryMerge(std::string& merge_buf, const std::string& recv_buf, int msg_size, int merge_buf_cap) {
    uint16_t big_endian_len, host_endian_len, cur_msg_len;
    memcpy(&big_endian_len, (void*)&merge_buf[1], sizeof(big_endian_len));
    host_endian_len = ntohs(big_endian_len);
    cur_msg_len = recv_buf[0] & 0x0f;  // fixarray element count
    if ((recv_buf[0] & 0xf0) != 0x90 ||                          // not a msgpack fixarray
        merge_buf.size() + msg_size > (size_t)merge_buf_cap ||   // would overflow buffer
        host_endian_len + cur_msg_len > 0xffff) {                // would overflow array16 count
        return false;
    }
    host_endian_len += cur_msg_len;
    // Append the payload, skipping the 1-byte fixarray header.
    merge_buf.append(recv_buf.begin() + 1, recv_buf.begin() + msg_size);
    big_endian_len = htons(host_endian_len);
    memcpy((void*)&merge_buf[1], &big_endian_len, sizeof(big_endian_len));
    return true;
}</code>

Merging many tiny packets into a larger one reduces the number of tasks submitted to the asynchronous thread pool, cutting context‑switch overhead.
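The effect of batching on task count can be modeled directly. The toy function below (our own construction, not the Agent's receive loop) appends datagrams into a merge buffer and counts one "task submission" per flush, using the same capacity-overflow check that tryMerge performs:

```cpp
#include <string>
#include <vector>

// Toy model of the batching effect: N tiny datagrams produce far fewer
// task submissions than N once they are merged up to a capacity limit.
size_t submittedTasks(const std::vector<std::string>& datagrams, size_t cap) {
    size_t tasks = 0;
    std::string merge_buf;
    for (const auto& d : datagrams) {
        if (merge_buf.size() + d.size() > cap) {  // same overflow check as tryMerge
            ++tasks;                              // flush: one task for many datagrams
            merge_buf.clear();
        }
        merge_buf += d;
    }
    if (!merge_buf.empty()) ++tasks;              // final flush
    return tasks;
}
```

For 100 datagrams of 50 bytes each, a 1400-byte merge buffer turns 100 potential thread-pool submissions into 4, which is where the context-switch savings come from.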

Data Processing

Tag parsing was a hotspot because each metric’s tags are concatenated into a string and then split by ‘|’ and ‘=’. A SIMD‑based parser dramatically speeds up key/value detection.

<code>#if defined(__SSE2__)
#include <emmintrin.h>  // the intrinsics below are SSE2

// Return the index of the next '=' (end of a tag key), or 0 if a
// separator ('|' or ' ') appears first, i.e. the tag is malformed.
static size_t find_key_simd(const char *str, size_t end, size_t idx) {
    if (idx >= end) return 0;
    for (; idx + 16 <= end; idx += 16) {  // scan 16 bytes per iteration
        __m128i v = _mm_loadu_si128((const __m128i*)(str + idx));
        __m128i is_tag = _mm_or_si128(_mm_cmpeq_epi8(v, _mm_set1_epi8('|')),
                                      _mm_cmpeq_epi8(v, _mm_set1_epi8(' ')));
        __m128i is_kv = _mm_cmpeq_epi8(v, _mm_set1_epi8('='));
        int tag_bits = _mm_movemask_epi8(is_tag);
        int kv_bits = _mm_movemask_epi8(is_kv);
        // Does a separator appear strictly before the first '='?
        bool has_tag_first = ((kv_bits - 1) & tag_bits) != 0;
        if (has_tag_first) return 0;
        if (kv_bits) return idx + __builtin_ctz(kv_bits);
    }
    for (; idx < end; ++idx) {  // scalar tail for the last < 16 bytes
        if (str[idx] == '=') return idx;
        if (str[idx] == '|' || str[idx] == ' ') return 0;
    }
    return 0;
}
#endif</code>
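A driver loop turns that key finder into key/value pairs. The sketch below uses the scalar-tail logic as a portable stand-in for find_key_simd and our own helper names; it follows the same convention where a return of 0 signals a malformed tag:

```cpp
#include <string_view>
#include <utility>
#include <vector>

// Portable equivalent of find_key_simd's scalar tail (stand-in for the
// SIMD path on non-SSE2 targets).
static size_t find_key_scalar(const char* str, size_t end, size_t idx) {
    for (; idx < end; ++idx) {
        if (str[idx] == '=') return idx;                   // end of key
        if (str[idx] == '|' || str[idx] == ' ') return 0;  // separator before '='
    }
    return 0;
}

// Walk a tag string like "host=a|dc=b" and yield key/value views.
// Illustrative driver, not the Agent's parser.
std::vector<std::pair<std::string_view, std::string_view>>
parseTags(std::string_view tags) {
    std::vector<std::pair<std::string_view, std::string_view>> out;
    size_t pos = 0;
    while (pos < tags.size()) {
        size_t eq = find_key_scalar(tags.data(), tags.size(), pos);
        if (eq == 0) break;                                // malformed tail
        size_t bar = tags.find('|', eq + 1);
        if (bar == std::string_view::npos) bar = tags.size();
        out.emplace_back(tags.substr(pos, eq - pos),
                         tags.substr(eq + 1, bar - eq - 1));
        pos = bar + 1;
    }
    return out;
}
```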

To avoid repeated tag comparisons during map look‑ups, the team replaced the original TagSet (a vector of std::string) with a TagViewSet that stores all tag data in a single contiguous buffer and compares buffers directly.

<code>struct TagViewSet {
    std::vector<TagView> tags;   // views (offset/length) into tags_buffer
    std::string tags_buffer;     // all tag bytes in one contiguous buffer
    bool operator==(const TagViewSet &other) const {
        if (tags.size() != other.tags.size()) return false;
        return tags_buffer == other.tags_buffer;  // one buffer compare, not N string compares
    }
};</code>
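To use TagViewSet as a hash-map key, the whole buffer can be hashed once as well. The article does not show TagView's definition or the hash, so the offset/length layout and TagViewSetHash below are our assumptions:

```cpp
#include <cstdint>
#include <functional>
#include <string>
#include <unordered_map>
#include <vector>

// Assumed layout: a TagView is an offset/length pair into tags_buffer.
struct TagView { uint32_t off; uint32_t len; };

struct TagViewSet {
    std::vector<TagView> tags;
    std::string tags_buffer;  // "key=value|key=value|..."
    void add(const std::string& k, const std::string& v) {
        tags.push_back({(uint32_t)tags_buffer.size(),
                        (uint32_t)(k.size() + 1 + v.size())});
        tags_buffer += k + '=' + v + '|';
    }
    bool operator==(const TagViewSet& other) const {
        if (tags.size() != other.tags.size()) return false;
        return tags_buffer == other.tags_buffer;  // single buffer compare
    }
};

// Hashing the contiguous buffer once makes TagViewSet a cheap map key.
struct TagViewSetHash {
    size_t operator()(const TagViewSet& t) const {
        return std::hash<std::string>{}(t.tags_buffer);
    }
};
```

Because equality and hashing both touch one contiguous allocation, look-ups avoid the pointer chasing and per-tag comparisons of the vector-of-strings design.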

Data Sending

Compression became a bottleneck as metric volume grew. The team benchmarked several algorithms (zlib‑cloudflare, zlib‑1.2.11, zstd, snappy, bzip2) and found that cloudflare‑optimized zlib reduced CPU cost to 37.5% of the official branch, while zstd offered better compression with lower CPU, and snappy delivered the lowest CPU when compression ratio is less critical.

In the short term, the team switched to cloudflare‑zlib; in the long term, they plan to adopt zstd.
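The benchmark methodology is easy to reproduce with a small harness that measures per-call latency and compression ratio for any candidate codec. The harness below is our own sketch; the real candidates (cloudflare-zlib's compress2, ZSTD_compress, snappy) would plug in as `codec`, and the run-length stand-in exists only to keep the example self-contained:

```cpp
#include <chrono>
#include <functional>
#include <string>

struct BenchResult { double micros; double ratio; };

// Time `codec` over `iters` runs on a representative metrics payload and
// report average microseconds per call plus compressed/original ratio.
BenchResult benchCodec(const std::function<std::string(const std::string&)>& codec,
                       const std::string& payload, int iters) {
    std::string out;
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < iters; ++i) out = codec(payload);
    auto t1 = std::chrono::steady_clock::now();
    double us = std::chrono::duration<double, std::micro>(t1 - t0).count() / iters;
    return {us, (double)out.size() / payload.size()};
}

// Stand-in codec: byte-level run-length encoding (count, byte) pairs.
std::string rleCompress(const std::string& in) {
    std::string out;
    for (size_t i = 0; i < in.size();) {
        size_t j = i;
        while (j < in.size() && in[j] == in[i] && j - i < 255) ++j;
        out.push_back((char)(j - i));
        out.push_back(in[i]);
        i = j;
    }
    return out;
}
```

Running each real codec through the same harness on captured Agent payloads is what surfaced the CPU-versus-ratio trade-off described above.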

Conclusion

After deploying the optimizations, CPU peak usage dropped by 10.26% (average down 6.27%) and memory peak usage fell by 19.67% (average down 19.81%). Ongoing work includes profile‑guided optimization (PGO) and clang ThinLTO to squeeze out further performance.

Tags: performance optimization, observability, C++, SIMD, compression, msgpack
Written by

ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.
