Performance Optimization and Profiling of Go Services Using pprof and trace
This article outlines why high‑load Go services need performance tuning and presents a systematic workflow: preparation, analysis with Linux tools and Go's pprof/trace, and targeted optimizations such as goroutine pooling, Redis MSET batching, efficient JSON handling, and right‑sized slice allocation. It demonstrates how these changes raise throughput, lower latency, and stabilize memory usage, and closes with broader Go‑specific best‑practice recommendations.
This article explains why performance optimization is necessary for high‑load Go services and outlines a systematic approach to identify and resolve bottlenecks.
Why Optimize
Two typical scenarios trigger optimization:
Continuous high load that requires frequent scaling.
Architectural limitations that prevent further business growth, requiring refactoring and performance tuning.
General Optimization Steps
Preparation – discover performance problems and define optimization goals.
Analysis – use tools to locate bottlenecks.
Tuning – apply fixes based on the identified bottlenecks.
Testing – verify the effect of the changes; repeat if necessary.
Linux Performance‑Analysis Tools
Common tools include vmstat, iostat, mpstat, netstat, sar, top, gprof, perf, strace, ltrace, pstack, pstree, pmap, and dmesg. For Go programs, perf top and pprof are especially useful.
perf top Example
The system shows a low load average (~2.5) but many soft interrupts. Functions such as runtime.scanobject and runtime.mallocgc near the top of the profile indicate frequent small‑object allocations.
Go Program Profiling with pprof
Collecting CPU profiles (on‑cpu) works by registering a timer that fires 100 times per second and records stack traces. The following code enables the built‑in pprof HTTP handlers:
import _ "net/http/pprof"

go func() {
    http.ListenAndServe("0.0.0.0:8080", nil)
}()

The init function in net/http/pprof (pprof.go) registers several handlers:
func init() {
http.HandleFunc("/debug/pprof/", Index)
http.HandleFunc("/debug/pprof/cmdline", Cmdline)
http.HandleFunc("/debug/pprof/profile", Profile)
http.HandleFunc("/debug/pprof/symbol", Symbol)
http.HandleFunc("/debug/pprof/trace", Trace)
}

During profiling, a profileBuilder object aggregates the data:
type profileBuilder struct {
start time.Time
end time.Time
havePeriod bool
period int64
m profMap
// encoding state
w io.Writer
zw *gzip.Writer
pb protobuf
strings []string
stringMap map[string]int
locs map[uintptr]locInfo
funcs map[string]int
mem []memMap
deck pcDeck
}
// Related types: runtime.MemProfileRecord (per-record goroutine stack info)
// and runtime.MemStats (memory statistics).

Typical ways to view profiles:
// view heap profile
go tool pprof http://localhost:6060/debug/pprof/heap
// view 30‑second CPU profile
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
// view blocking profile (after runtime.SetBlockProfileRate)
go tool pprof http://localhost:6060/debug/pprof/block
// view mutex contention profile (after runtime.SetMutexProfileFraction)
go tool pprof http://localhost:6060/debug/pprof/mutex
// collect a 5‑second execution trace
wget -O trace.out http://localhost:6060/debug/pprof/trace?seconds=5
go tool trace trace.out

A simple heap profile example:
go tool pprof http://localhost:6060/debug/pprof/heap?debug=1
// runtime.MemProfileRecord
heap profile: 71: 35532256 [15150: 492894072] @ heap/1048576
1: 31203328 [1: 31203328] @ 0xc2c83a 0x49ed0a ...
# runtime.MemStats
# Alloc = 83374072
# TotalAlloc = 8261199880
# Sys = 216980496
# ...

Load Testing and Issue Diagnosis
Initial load tests on the ad‑delivery service revealed high CPU usage, excessive goroutine creation (≈10 per request), and memory pressure. The following observations were made:
Many goroutines were parked (mcall → park_m), indicating scheduler overload.
GC accounted for ~6% of CPU.
Further profiling showed that most CPU time was spent in Redis‑related asynchronous tasks, and that slice allocations were oversized.
Optimizations Applied
Introduce a goroutine pool (e.g., ants) to limit the number of concurrent goroutines.
Merge multiple Redis SET operations using MSET to reduce request count.
Replace inefficient JSON library with github.com/json-iterator/go and eliminate reflection‑based reporting.
Resize oversized slices and avoid unnecessary allocations.
After these changes, the service sustained 200 qps with p90 latency ≈ 30‑40 ms and stable memory usage.
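The MSET batching above can be illustrated by collecting pending writes and emitting a single command. buildMSET is a hypothetical helper showing only the argument layout; a real client such as go-redis would send this in one round trip instead of N separate SETs:

```go
package main

import (
	"fmt"
	"sort"
)

// buildMSET flattens pending key/value pairs into one MSET command,
// replacing N separate SET round trips with a single request.
func buildMSET(pending map[string]string) []string {
	keys := make([]string, 0, len(pending))
	for k := range pending {
		keys = append(keys, k)
	}
	sort.Strings(keys) // deterministic order for the example

	args := make([]string, 0, 1+2*len(pending))
	args = append(args, "MSET")
	for _, k := range keys {
		args = append(args, k, pending[k])
	}
	return args
}

func main() {
	cmd := buildMSET(map[string]string{"ad:1": "a", "ad:2": "b"})
	fmt.Println(cmd) // [MSET ad:1 a ad:2 b]
}
```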
Further Real‑World Testing
In production‑like tests (500 qps), initial runs still showed CPU spikes and memory growth. Additional steps were taken:
Switch to configuration‑file based Redis addressing to avoid L5 lookup overhead.
Refactor key construction to use simple integer IDs instead of long strings.
Reduce reflection usage in reporting paths.
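The key refactor can be sketched as replacing long formatted strings with compact integer‑based keys. The key formats below are illustrative, not the service's real schema:

```go
package main

import (
	"fmt"
	"strconv"
)

// Before: a long descriptive key built with fmt.Sprintf on every request.
func oldKey(campaign, ad string) string {
	return fmt.Sprintf("delivery:campaign:%s:ad:%s:status", campaign, ad)
}

// After: a short key built from integer IDs with strconv, which avoids
// fmt's reflection-based formatting and produces far smaller keys.
func newKey(campaignID, adID int64) string {
	return "d:" + strconv.FormatInt(campaignID, 10) + ":" +
		strconv.FormatInt(adID, 10)
}

func main() {
	fmt.Println(oldKey("summer", "banner01")) // delivery:campaign:summer:ad:banner01:status
	fmt.Println(newKey(42, 7))                // d:42:7
}
```

Shorter keys also reduce Redis memory usage and network payload per request, compounding the MSET savings.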
Post‑optimisation results:
CPU remained high but within acceptable limits.
Memory usage stabilized; no OOM events.
Goroutine count dropped dramatically, as confirmed by trace analysis.
General Go Optimization Recommendations
Combine many small objects into larger structs to reduce allocation overhead.
Avoid unnecessary pointer indirection; prefer value types when possible.
When local variables escape, aggregate them into a single struct to cut the number of heap objects.
Pre‑allocate []byte buffers when the final size is known.
Use the smallest suitable integer type (e.g., int8) for counters.
Prefer sync.Pool for reusable objects.
Replace maps with slices when the key space is dense and predictable.
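Several of these recommendations can be shown together: pre‑sizing a buffer and reusing it through sync.Pool. The buffer size and pooled type are illustrative:

```go
package main

import (
	"bytes"
	"fmt"
	"sync"
)

// bufPool reuses byte buffers across requests instead of allocating
// a fresh one each time (the sync.Pool recommendation above).
var bufPool = sync.Pool{
	New: func() any { return new(bytes.Buffer) },
}

func render(payload []byte) string {
	buf := bufPool.Get().(*bytes.Buffer)
	defer func() {
		buf.Reset() // return a clean buffer to the pool
		bufPool.Put(buf)
	}()

	// Pre-size the buffer when the final length is known,
	// avoiding repeated grow-and-copy allocations.
	buf.Grow(len(payload) + 2)
	buf.WriteByte('[')
	buf.Write(payload)
	buf.WriteByte(']')
	return buf.String()
}

func main() {
	fmt.Println(render([]byte("ok"))) // [ok]
}
```

Under sustained load this pattern keeps allocation counts flat, which directly reduces the GC pressure observed earlier in the profiles.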
Tencent Music Tech Team
Public account of Tencent Music's development team, focusing on technology sharing and communication.