
Performance Bottleneck Analysis and Optimization of an Erlang Service with High CPU Usage

The article details a performance bottleneck investigation of an Erlang‑based service experiencing high CPU usage, describing the use of recon tools, pressure testing, analysis of Kafka and Nginx impacts, and the subsequent optimizations that doubled throughput to meet business requirements.

360 Tech Engineering

The service faced a sudden increase in request volume, with peak QPS reaching 5000. It was deployed on nine virtual machines across three data centers; the three machines in one data center each handled about 1350 QPS, pushing CPU utilization above 90% while memory usage stayed low.

Although the Erlang‑based service is not compute‑intensive and showed no signs of infinite loops, the high CPU usage prompted an investigation, using the recon toolset, into what the schedulers were actually executing.

Key inspection commands included checking scheduler usage, the top memory‑consuming processes, and the processes with the most reductions:

recon:scheduler_usage(1000).
recon:proc_count(memory, 20).
recon:proc_count(reductions, 20).
recon:proc_window(reductions, 5, 1000).
[rp({recon:info(P), Num}) || {P, Num, _} <- recon:proc_window(reductions, 20, 5000)].
recon:node_stats_print(5, 500).
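The window‑based inspection above can be wrapped in a small shell helper so it is easy to re‑run during an incident. This is a sketch only: it assumes recon is in the code path, and the module and function names (diag, top_reductions/2) are chosen here for illustration, not part of recon.

```erlang
%% Sketch of a diagnostic helper around recon:proc_window/3.
%% Module/function names are illustrative; recon must be loadable.
-module(diag).
-export([top_reductions/2]).

%% Print the N processes that accumulated the most reductions over a
%% sampling window of WindowMs milliseconds, along with recon's
%% per-process info (memory, message queue, current stacktrace, ...).
top_reductions(N, WindowMs) ->
    Top = recon:proc_window(reductions, N, WindowMs),
    lists:foreach(
      fun({Pid, Reds, _Extra}) ->
              io:format("~p reductions: ~p~n~p~n~n",
                        [Reds, Pid, recon:info(Pid)])
      end, Top),
    ok.
```

Sampling over a window (rather than reading lifetime counters with recon:proc_count/2) is what makes it possible to see which processes are busy right now, as the article's investigation required.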

After fixing load‑balancing issues, each machine’s peak QPS dropped below 500, and the CPU load was traced to specific processes, notably the TCP acceptor and Kafka writer processes. Sample process information was captured with recon, showing details such as memory usage, reductions, and stack traces for both the TCP acceptor and Kafka writer.

To reproduce the issue, a pressure‑test environment was prepared, highlighting four major differences from production: machine specifications, the presence of an Nginx proxy, additional logging for troubleshooting, and a Kafka write plugin. The analysis identified Kafka as the most likely performance bottleneck.

Pressure testing confirmed that the production environment’s Kafka integration caused a three‑fold performance degradation compared to the test setup, while Nginx and extra logging also contributed noticeable slowdowns.

Data analysis of the test runs revealed that log‑writing processes consumed significant CPU resources, and Kafka‑related processes showed high reduction counts. Example recon output for a log‑writing process was included to illustrate the load.

The conclusions were: (1) excessive logging adds considerable CPU overhead; (2) Kafka’s synchronous writes are a major bottleneck; (3) Nginx proxy settings also affect performance, though to a lesser extent.
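Conclusion (1) can often be mitigated without touching call sites. As a sketch, assuming the service runs on OTP 21+ with the standard logger application (the module name my_handler is illustrative):

```erlang
%% Raise the primary log level so debug/info records are dropped
%% early, before formatting or I/O costs are paid.
logger:set_primary_config(level, warning).

%% Optionally keep verbose logging for one module under investigation
%% without paying the cost everywhere (module name is illustrative).
logger:set_module_level(my_handler, debug).
```

If the service uses an older logging library such as lager, the equivalent knob is that library's per‑handler level setting.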

Optimization steps focused on three areas: removing non‑essential logs, reviewing and tuning Nginx parameters, and exploring asynchronous or batched Kafka writes to smooth CPU spikes.
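The third optimization, asynchronous or batched Kafka writes, can be sketched as a buffering gen_server that flushes on a size or time threshold. This is a minimal illustration under stated assumptions: kafka_client:send_batch/1 stands in for whatever producer API the service actually uses (e.g. a brod producer) and is not a real module here; the thresholds are placeholders to be tuned.

```erlang
%% Sketch: smooth CPU spikes from synchronous Kafka writes by
%% buffering messages and flushing them in batches.
-module(kafka_batcher).
-behaviour(gen_server).
-export([start_link/0, write/1]).
-export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

-define(FLUSH_MS, 100).   %% flush at least every 100 ms
-define(MAX_BATCH, 500).  %% ...or as soon as 500 messages buffer up

start_link() ->
    gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

%% Callers return immediately; the actual write happens later,
%% off the request path.
write(Msg) ->
    gen_server:cast(?MODULE, {write, Msg}).

init([]) ->
    erlang:send_after(?FLUSH_MS, self(), flush),
    {ok, []}.

handle_cast({write, Msg}, Buf) when length(Buf) + 1 >= ?MAX_BATCH ->
    flush([Msg | Buf]),
    {noreply, []};
handle_cast({write, Msg}, Buf) ->
    {noreply, [Msg | Buf]}.

handle_info(flush, Buf) ->
    flush(Buf),
    erlang:send_after(?FLUSH_MS, self(), flush),
    {noreply, []}.

handle_call(_Req, _From, Buf) ->
    {reply, ok, Buf}.

flush([]) -> ok;
flush(Buf) ->
    %% kafka_client:send_batch/1 is a stand-in for the real producer.
    kafka_client:send_batch(lists:reverse(Buf)).
```

The trade‑off is bounded by the two thresholds: up to ?FLUSH_MS of added latency and up to ?MAX_BATCH messages lost if the node crashes before a flush, in exchange for far fewer synchronous round trips to Kafka.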

After eliminating unnecessary logs and re‑running the pressure test, performance more than doubled, satisfying current business demands, with further refinements ongoing.

In summary, locating the exact source of a bottleneck—whether logging, Kafka, or Nginx—and applying targeted optimizations can quickly restore service performance to acceptable levels.

Tags: monitoring, performance, operations, Kafka, load testing, CPU, Erlang
Written by

360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.
