
Investigation and Optimization of High‑Frequency Go Service Timeout in Containers

The study traced severe Redis latency spikes in an 8‑core container to the Go runtime mistakenly reading the host’s CPU count, which created hundreds of threads and scheduler contention; limiting GOMAXPROCS to the container’s quota eliminated the jitter, while further gains were suggested by reducing HMGET calls, tuning GC, and upgrading Go.

Didi Tech

Background: A Go service running in an 8‑core, 16 GB container exhibited severe Redis latency spikes (reads/writes >100 ms) after deployment. The service performs intensive I/O, scheduled tasks, and in‑memory caching, issuing thousands of Redis HMGET/GET calls per request.

Solution Approach: A step‑by‑step elimination method was applied:

Check network latency and the Redis SLOWLOG.

Capture packets and correlate Redis keys with the TCP dump.

Analyze host and container CPU/memory load.

Inspect the Redis SDK for possible blocking (connection‑pool logic).

Monitor the Go runtime for goroutine spikes and GC stop‑the‑world (STW) pauses.

Collect a trace to examine scheduler behavior.

On‑site Analysis

Network analysis: SLOWLOG showed GET latency <10 ms, confirming Redis itself was fast. Packet captures revealed that the reported 126 ms latency was actually <2 ms on the wire.

Load analysis: Host‑level CPU usage was normal, but the container process showed large CPU jitter, especially during scheduled tasks. Memory usage was high due to a local cache, potentially increasing GC pressure.

Redis SDK investigation: No blocking was found in the SDK source or runtime stacks. However, the high‑frequency HMGET calls were identified as a major cost.
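One way to cut that call count, sketched here with a hypothetical chunk helper (name and batch size are illustrative, not from the service's code), is to batch per‑field lookups into fewer multi‑field HMGET commands:

```go
package main

import "fmt"

// chunk splits a list of hash fields into batches of at most n,
// so thousands of single-field HMGET calls can be collapsed into
// a handful of multi-field ones.
func chunk(fields []string, n int) [][]string {
	var batches [][]string
	for len(fields) > n {
		batches = append(batches, fields[:n])
		fields = fields[n:]
	}
	if len(fields) > 0 {
		batches = append(batches, fields)
	}
	return batches
}

func main() {
	fields := []string{"f1", "f2", "f3", "f4", "f5"}
	for _, b := range chunk(fields, 2) {
		// each batch would become a single HMGET key f1 f2 ... call
		fmt.Println(b)
	}
}
```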

Runtime investigation: The service ran on Go 1.8 without runtime monitoring. A demo program was used to reproduce the timeouts under 50 concurrent requests:

package main

import (
    "flag"
    "fmt"
    "net/http"
    _ "net/http/pprof"
    "sync"
    "time"

    "github.com/gomodule/redigo/redis"
)

var redisAddr string

func main() {
    flag.StringVar(&redisAddr, "redis", "127.0.0.1:6379", "redis address")
    flag.Parse()
    // expose pprof endpoints for heap/trace inspection
    go func() { http.ListenAndServe("0.0.0.0:8003", nil) }()
    var wg sync.WaitGroup
    for i := 0; i < 200; i++ {
        wg.Add(1)
        go goGet(&wg) // pass a pointer: copying a WaitGroup is a bug
    }
    wg.Wait()
}

func goGet(wg *sync.WaitGroup) {
    defer wg.Done()
    client, err := redis.Dial("tcp", redisAddr)
    if err != nil {
        fmt.Println("connect redis error:", err)
        return
    }
    defer client.Close()
    for {
        start := time.Now()
        if _, err := client.Do("GET", "test1234"); err != nil {
            fmt.Println("GET error:", err)
            return
        }
        if cost := time.Since(start); cost > 10*time.Millisecond {
            fmt.Printf("time cost %v\n", cost)
        }
    }
}

GC heap profiling (via curl http://localhost:8003/debug/pprof/heap?debug=1) showed frequent STW pauses, but their impact did not fully explain the latency spikes.

Trace analysis: A 5‑minute trace highlighted scheduler wait times and an unexpectedly high number of OS threads (200+), far exceeding the container's 8‑core limit. The Go runtime had incorrectly detected the host CPU count via runtime.NumCPU(), which uses sched_getaffinity to read the host's CPU set.

func getproccount() int32 {
    // Ask the kernel for the process affinity mask, which inside a
    // container still reflects the host's CPU set, not the cgroup quota.
    const maxCPUs = 64 * 1024
    var buf [maxCPUs / 8]byte
    r := sched_getaffinity(0, unsafe.Sizeof(buf), &buf[0])
    if r < 0 {
        return 1
    }
    // Count the set bits: one per CPU the process may run on.
    n := int32(0)
    for _, v := range buf[:r] {
        for v != 0 {
            n += int32(v & 1)
            v >>= 1
        }
    }
    if n == 0 {
        n = 1
    }
    return n
}

Setting GOMAXPROCS to the container’s CPU limit (8) reduced the thread count to one‑third of the original and eliminated most jitter, as shown by subsequent monitoring graphs.

Principle Analysis: An incorrect CPU core count inflates scheduler contention for latency‑sensitive services. Go determines the CPU count via runtime.NumCPU(), which reads the host's affinity mask. In container environments, the correct limit should be derived from cgroup quotas (e.g., /sys/fs/cgroup/cpu/cpu.cfs_quota_us divided by cpu.cfs_period_us), as implemented by Uber's automaxprocs library.

func (cg CGroups) CPUQuota() (float64, bool, error) {
    cpuCGroup, exists := cg[_cgroupSubsysCPU]
    if !exists {
        return -1, false, nil
    }
    cfsQuotaUs, err := cpuCGroup.readInt(_cgroupCPUCFSQuotaUsParam)
    if err != nil || cfsQuotaUs <= 0 {
        return -1, false, err
    }
    cfsPeriodUs, err := cpuCGroup.readInt(_cgroupCPUCFSPeriodUsParam)
    if err != nil {
        return -1, false, err
    }
    return float64(cfsQuotaUs) / float64(cfsPeriodUs), true, nil
}
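The quota‑to‑GOMAXPROCS arithmetic reduces to a small pure function (the name maxProcsFromQuota is illustrative, not part of automaxprocs):

```go
package main

import "fmt"

// maxProcsFromQuota mirrors the cgroup math above: GOMAXPROCS is the
// CFS quota divided by the period, floored, with a minimum of 1.
func maxProcsFromQuota(quotaUs, periodUs int64) int {
	if quotaUs <= 0 || periodUs <= 0 {
		return 1 // no quota set: caller should fall back to NumCPU
	}
	procs := int(quotaUs / periodUs)
	if procs < 1 {
		procs = 1
	}
	return procs
}

func main() {
	// e.g. cpu.cfs_quota_us=800000, cpu.cfs_period_us=100000 -> 8 procs
	fmt.Println(maxProcsFromQuota(800000, 100000)) // prints 8
}
```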

By deriving GOMAXPROCS from the cgroup limits (optionally via automaxprocs), the service aligns its scheduler with the actual CPU resources, eliminating the timeout spikes.

Summary: The investigation identified three root causes: an excessive thread count caused by wrong CPU detection, heavy HMGET usage, and sub‑optimal GC settings. The primary fix was setting GOMAXPROCS to the container's quota, which immediately reduced latency jitter. Additional recommendations include adding runtime monitoring, optimizing Redis SDK usage (pipelining, fewer allocations), and upgrading to Go ≥1.9 to benefit from improved scheduler behavior.

Written by Didi Tech, the official Didi technology account.