
Applying eBPF for Cloud‑Native Observability and Continuous Profiling

By deploying eBPF agents as DaemonSets that hook kernel network and performance events, the Xiaohongshu observability team extended cloud-native monitoring from the application layer down to the kernel, delivering real-time traffic analysis and low-overhead continuous profiling for C++ services. The collected data is aggregated into centralized collectors that power dashboards, flame graphs, and rapid root-cause diagnosis.

Xiaohongshu Tech REDtech

In the cloud-native era, the widespread adoption of microservice architectures has sparked extensive discussion of observability. Observability capabilities help teams track, understand, and diagnose production issues, supporting risk tracing, experience accumulation, and fault warning, and thereby improving system reliability. Beyond traditional metrics, logging, and tracing, new requirements have emerged for real-time traffic analysis and low-overhead profiling.

The Xiaohongshu observability team explored eBPF technology for these challenges. By leveraging eBPF, they extended observability from the application layer to the kernel, enabling generic traffic analysis and continuous profiling without modifying application code.

Typical production problems include sudden traffic spikes that exhaust CPU and memory, often with no clear picture of which upstream callers are responsible. Traditional observability offers no universal method for real-time traffic analysis, and Linux perf-based profiling suffers from high overhead and long processing times, especially for C++ services.

eBPF runs sandboxed programs at kernel hook points, where they can monitor network packets, performance metrics, and security events. The team deployed eBPF agents as DaemonSets, loading eBPF programs that attach kprobes to kernel functions on the TCP send/receive path (e.g., tcp_sendmsg, tcp_cleanup_rbuf) and to other relevant kernel events. The agents collect raw traffic data, aggregate it into metrics, and forward them to a centralized eBPF-Collector.
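The agent-side aggregation step can be sketched as follows. This is an illustrative sketch only: the event fields (pid, remote, direction, nbytes) are assumptions standing in for whatever wire format the real agent reads from the kernel-side eBPF program, not Xiaohongshu's actual schema.

```python
from collections import defaultdict

def aggregate_flows(events):
    """Sum bytes per (pid, remote, direction) flow key, turning raw
    per-socket events into the per-flow counters an agent would export."""
    totals = defaultdict(int)
    for ev in events:
        key = (ev["pid"], ev["remote"], ev["direction"])
        totals[key] += ev["nbytes"]
    return dict(totals)

# Hypothetical events, as if read from a perf buffer fed by tcp_sendmsg
# (tx) and tcp_cleanup_rbuf (rx) kprobes.
events = [
    {"pid": 42, "remote": "10.0.0.5:6379", "direction": "tx", "nbytes": 128},
    {"pid": 42, "remote": "10.0.0.5:6379", "direction": "tx", "nbytes": 512},
    {"pid": 42, "remote": "10.0.0.5:6379", "direction": "rx", "nbytes": 96},
]
print(aggregate_flows(events))
# {(42, '10.0.0.5:6379', 'tx'): 640, (42, '10.0.0.5:6379', 'rx'): 96}
```

In practice the same fold runs continuously over a ring of kernel events, and the resulting counters are what get shipped to the eBPF-Collector.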

In the profiling scenario, eBPF agents capture CPU-cycle performance events, perform stack unwinding in kernel context (using frame pointers or DWARF .eh_frame information), aggregate stack samples in eBPF maps, and expose them to user space. The user-space component periodically reads the maps, converts the data to pprof format, and sends it to a collector service that resolves symbols (using both .symtab and DWARF), generates flame graphs, and stores the results in ClickHouse for further analysis.
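The aggregation-to-flame-graph step boils down to collapsing repeated stacks into counted "folded" lines, the input format consumed by common flame-graph tooling (e.g., flamegraph.pl). A minimal sketch, with made-up frame names:

```python
from collections import Counter

def fold_stacks(samples):
    """Collapse (stack, count) samples into folded lines of the form
    'root;child;leaf count', merging identical stacks."""
    folded = Counter()
    for stack, count in samples:
        folded[";".join(stack)] += count
    return [f"{key} {n}" for key, n in sorted(folded.items())]

# Hypothetical samples as they might look after reading the eBPF map:
# each entry is a leaf-to-root-resolved stack plus its sample count.
samples = [
    (["main", "handle_request", "json_parse"], 30),
    (["main", "handle_request", "json_parse"], 12),
    (["main", "gc_cycle"], 7),
]
for line in fold_stacks(samples):
    print(line)
# main;gc_cycle 7
# main;handle_request;json_parse 42
```

The real pipeline emits pprof protobufs rather than folded text, but the merge-identical-stacks logic is the same.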

The architecture consists of three layers:

Kernel‑side: eBPF programs attached to tracepoints, kprobes, and socket syscalls capture traffic and profiling data, write to eBPF maps, and optionally use BTF/CO‑RE for portable binaries.

User‑side (Agent): Loads eBPF bytecode, passes process IDs and executable mappings, reads maps, performs aggregation, and pushes metrics or profiling samples to the collector.

Collector: Central service that merges metrics, enriches them with CMDB metadata, performs cache‑based symbol lookup, generates Prometheus metrics, flame graphs, and stores raw samples.
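The collector's cache-based symbol lookup can be sketched as a binary search over sorted symbol start addresses, fronted by a cache. The build ID, addresses, and names below are invented for illustration; the real collector reads them from .symtab and DWARF.

```python
import bisect
from functools import lru_cache

# Hypothetical symbol table for one binary, keyed by build ID:
# parallel sorted lists of symbol start addresses and names.
ADDRS = {"abc123": [0x1000, 0x1400, 0x1900]}
NAMES = {"abc123": ["main", "handle_request", "json_parse"]}

@lru_cache(maxsize=65536)
def resolve(build_id, addr):
    """Map an address to the nearest preceding symbol start; lru_cache
    stands in for the collector's symbol cache."""
    addrs = ADDRS.get(build_id)
    if not addrs:
        return "[unknown]"
    i = bisect.bisect_right(addrs, addr) - 1
    return NAMES[build_id][i] if i >= 0 else "[unknown]"

print(resolve("abc123", 0x1450))  # → handle_request
```

Caching matters because the same hot frames recur in nearly every sample, so most lookups never touch the symbol tables.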

Productization includes real‑time traffic dashboards (L4/L7 flow size, QPS, RPC method), service topology graphs derived from eBPF‑collected flow data, and continuous profiling dashboards with near‑real‑time flame graphs for C++ services. Case studies demonstrate rapid identification of unknown upstream traffic sources and pinpointing performance regressions in Redis and recommendation services.
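Deriving a service topology graph from flow data reduces to collapsing flow records into service-to-service edges. A sketch under assumed field names; in the real system the service labels come from enriching eBPF flow data with CMDB metadata:

```python
from collections import defaultdict

def build_topology(flows):
    """Collapse flow records into (src, dst) edges with summed bytes,
    i.e., the weighted edges of a service topology graph."""
    edges = defaultdict(int)
    for f in flows:
        edges[(f["src_service"], f["dst_service"])] += f["nbytes"]
    return dict(edges)

# Hypothetical enriched flow records.
flows = [
    {"src_service": "feed", "dst_service": "redis", "nbytes": 1024},
    {"src_service": "feed", "dst_service": "redis", "nbytes": 2048},
    {"src_service": "feed", "dst_service": "rank", "nbytes": 512},
]
print(build_topology(flows))
# {('feed', 'redis'): 3072, ('feed', 'rank'): 512}
```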

Future work aims to extend traffic analysis to multi‑language topologies, add off‑CPU and memory‑leak profiling events, and enable on‑demand flame‑graph queries over arbitrary time ranges.

Tags: cloud-native · observability · Kubernetes · performance-monitoring · eBPF · profiling
Written by Xiaohongshu Tech REDtech

Official account of the Xiaohongshu tech team, sharing tech innovations and problem insights, advancing together.