Performance Optimization and Profiling of Go Services Using pprof and trace
This article outlines why high‑load Go services need performance tuning and presents a systematic workflow: preparation, analysis with Linux tools and Go's pprof/trace, and targeted optimizations such as goroutine pooling, Redis MSET batching, efficient JSON handling, and right‑sized slice allocation. It demonstrates how these changes raise throughput, lower latency, and stabilize memory usage, and closes with broader Go‑specific best‑practice recommendations.
This article explains why performance optimization is necessary for high‑load Go services and outlines a systematic approach to identify and resolve bottlenecks.
Why Optimize
Two typical scenarios trigger optimization:
Continuous high load that requires frequent scaling.
Architectural limitations that prevent further business growth, requiring refactoring and performance tuning.
General Optimization Steps
Preparation – discover performance problems and define optimization goals.
Analysis – use tools to locate bottlenecks.
Tuning – apply fixes based on the identified bottlenecks.
Testing – verify the effect of the changes; repeat if necessary.
Linux Performance‑Analysis Tools
Common tools include vmstat, iostat, mpstat, netstat, sar, top, gprof, perf, strace, ltrace, pstack, pstree, pmap, and dmesg. For Go programs, perf top and pprof are especially useful.
perf top Example
The system shows a low load average (~2.5) but many soft interrupts. Functions such as runtime.scanobject and runtime.mallocgc near the top of the profile indicate frequent small‑object allocations.
Go Program Profiling with pprof
Collecting CPU profiles (on‑cpu) works by registering a timer that fires 100 times per second and records stack traces. The following code enables the built‑in pprof HTTP handlers:
import _ "net/http/pprof"

go func() {
    http.ListenAndServe("0.0.0.0:8080", nil)
}()

The init function in net/http/pprof (pprof.go) registers several handlers:
func init() {
http.HandleFunc("/debug/pprof/", Index)
http.HandleFunc("/debug/pprof/cmdline", Cmdline)
http.HandleFunc("/debug/pprof/profile", Profile)
http.HandleFunc("/debug/pprof/symbol", Symbol)
http.HandleFunc("/debug/pprof/trace", Trace)
}

During profiling, a profileBuilder object aggregates the data:
type profileBuilder struct {
start time.Time
end time.Time
havePeriod bool
period int64
m profMap
// encoding state
w io.Writer
zw *gzip.Writer
pb protobuf
strings []string
stringMap map[string]int
locs map[uintptr]locInfo
funcs map[string]int
mem []memMap
deck pcDeck
}
// Related types: runtime.MemProfileRecord (per-record goroutine stack info)
// and runtime.MemStats (memory statistics).

Typical ways to view profiles:
// view heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
// view 30‑second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// view blocking profile (after runtime.SetBlockProfileRate)
go tool pprof http://localhost:6060/debug/pprof/block
// view mutex contention profile (after runtime.SetMutexProfileFraction)
go tool pprof http://localhost:6060/debug/pprof/mutex
// collect a 5‑second execution trace
wget -O trace.out http://localhost:6060/debug/pprof/trace?seconds=5
go tool trace trace.out

A simple heap profile example:
go tool pprof http://localhost:6060/debug/pprof/heap?debug=1
// runtime.MemProfileRecord
heap profile: 71: 35532256 [15150: 492894072] @ heap/1048576
1: 31203328 [1: 31203328] @ 0xc2c83a 0x49ed0a ...
# runtime.MemStats
# Alloc = 83374072
# TotalAlloc = 8261199880
# Sys = 216980496
# ...

Load Testing and Issue Diagnosis
Initial load tests on the ad‑delivery service revealed high CPU usage, excessive goroutine creation (≈10 per request), and memory pressure. The following observations were made:
Many goroutines were parked (mcall → park_m), indicating scheduler overload.
GC accounted for ~6% of CPU.
Further profiling showed that most CPU time was spent in Redis‑related asynchronous tasks, and that slice allocations were oversized.
Optimizations Applied
Introduce a goroutine pool (e.g., ants) to limit the number of concurrent goroutines.
Merge multiple Redis SET operations using MSET to reduce request count.
Replace inefficient JSON library with github.com/json-iterator/go and eliminate reflection‑based reporting.
Resize oversized slices and avoid unnecessary allocations.
After these changes, the service sustained 200 qps with p90 latency ≈ 30‑40 ms and stable memory usage.
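The MSET batching above can be illustrated by collecting pending writes and emitting a single command. buildMSET is a hypothetical helper showing only the argument layout; a real client such as go-redis would send this in one round trip instead of N separate SETs:

```go
package main

import (
	"fmt"
	"sort"
)

// buildMSET flattens pending key/value pairs into one MSET command,
// replacing N separate SET round trips with a single request.
func buildMSET(pending map[string]string) []string {
	keys := make([]string, 0, len(pending))
	for k := range pending {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the example

	args := make([]string, 0, 1+2*len(pending))
	args = append(args, "MSET")
	for _, k := range keys {
		args = append(args, k, pending[k])
	}
	return args
}

func main() {
	cmd := buildMSET(map[string]string{"ad:1": "a", "ad:2": "b"})
	fmt.Println(cmd) // [MSET ad:1 a ad:2 b]
}
```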
Further Real‑World Testing
In production‑like tests (500 qps), initial runs still showed CPU spikes and memory growth. Additional steps were taken:
Switch to configuration‑file based Redis addressing to avoid L5 lookup overhead.
Refactor key construction to use simple integer IDs instead of long strings.
Reduce reflection usage in reporting paths.
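The key refactor can be sketched as replacing long formatted strings with compact integer‑based keys. The key formats below are illustrative, not the service's real schema:

```go
package main

import (
	"fmt"
	"strconv"
)

// Before: a long descriptive key built with fmt.Sprintf on every request.
func oldKey(campaign, ad string) string {
	return fmt.Sprintf("delivery:campaign:%s:ad:%s:status", campaign, ad)
}

// After: a short key built from integer IDs with strconv, which avoids
// fmt's reflection-based formatting and produces far smaller keys.
func newKey(campaignID, adID int64) string {
	return "d:" + strconv.FormatInt(campaignID, 10) + ":" +
		strconv.FormatInt(adID, 10)
}

func main() {
	fmt.Println(oldKey("summer", "banner01")) // delivery:campaign:summer:ad:banner01:status
	fmt.Println(newKey(42, 7))                // d:42:7
}
```

Shorter keys also reduce Redis memory usage and network payload per request, compounding the MSET savings.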
Post‑optimisation results:
CPU remained high but within acceptable limits.
Memory usage stabilized; no OOM events.
Goroutine count dropped dramatically, as confirmed by trace analysis.
General Go Optimization Recommendations
Combine many small objects into larger structs to reduce allocation overhead.
Avoid unnecessary pointer indirection; prefer value types when possible.
When local variables escape, aggregate them into a single struct to cut the number of heap objects.
Pre‑allocate []byte buffers when the final size is known.
Use the smallest suitable integer type (e.g., int8) for counters.
Prefer sync.Pool for reusable objects.
Replace maps with slices when the key space is dense and predictable.
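Several of these recommendations can be shown together: pre‑sizing a buffer and reusing it through sync.Pool. The buffer size and pooled type are illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses byte buffers across requests instead of allocating
// a fresh one each time (the sync.Pool recommendation above).
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(payload []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // return a clean buffer to the pool
		bufPool.Put(buf)
	}()

	// Pre-size the buffer when the final length is known,
	// avoiding repeated grow-and-copy allocations.
	buf.Grow(len(payload) + 2)
	buf.WriteByte('[')
	buf.Write(payload)
	buf.WriteByte(']')
	return buf.String()
}

func main() {
	fmt.Println(render([]byte("ok"))) // [ok]
}
```

Under sustained load this pattern keeps allocation counts flat, which directly reduces the GC pressure observed earlier in the profiles.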
Tencent Music Tech Team
Public account of Tencent Music's development team, focusing on technology sharing and communication.