
Troubleshooting Golang GC Performance Issues Causing Request Timeout Spikes

The article details how a Go service’s default GOGC setting caused overly frequent garbage‑collection pauses that spiked request timeouts, and how adjusting GOGC dynamically with debug.SetGCPercent and setting memory limits reduced GC CPU usage, extended pause intervals, and eliminated timeout spikes.


This article documents the complete troubleshooting process for a Golang GC (Garbage Collection) problem that caused request timeout spikes in a production service.

Problem Phenomenon: After callers had been using the interface for a while, the timeout rate spiked frequently, with the success rate occasionally dropping below 99.5% and triggering business alerts. The client had set a strict 150ms timeout based on the low average latency (18ms) observed during testing.

Investigation Process: Through monitoring analysis, the team discovered high GC CPU usage (averaging over 2%, with some machines exceeding 4%) and GC pause times occasionally exceeding 50ms or even 100ms. Using pprof for CPU and memory profiling, they identified that runtime.mallocgc occupied 15% of CPU time. The runtime trace revealed a sawtooth pattern in heap memory, with GC running roughly every 550ms (nearly twice per second), even though actual heap usage was only around 60MB.

Root Cause: The default GOGC value of 100 caused excessive GC frequency. With only ~30MB of live data, the formula NextGC = LiveData × (1 + GOGC/100) meant a collection was triggered every time the heap doubled. Combined with the 2GB pod memory limit, this resulted in GC running every 550ms, with STW (Stop-The-World) pauses that occasionally exceeded 50ms and caused the timeout spikes.

Solution: The team adjusted GOGC dynamically using debug.SetGCPercent() and added a soft memory limit using debug.SetMemoryLimit() (a Go 1.19+ feature). Gradually raising GOGC from 100 to 16000 extended the GC interval from 550ms to 12.8s and reduced GC CPU usage from over 2% to approximately 0.12%.

Results: After optimization, client disconnection events due to timeouts decreased significantly, and the overall curve became more stable with fewer spikes.

Tags: backend development, golang, pprof, gc optimization, Go memory management, GOGC, performance troubleshooting, runtime trace, STW pause
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
