Debugging Service Discovery Failures in a Go RPCX Microservice Using Zookeeper
This article analyzes a microservice client’s connection‑timeout errors in a Kubernetes cluster, investigates Zookeeper‑based service discovery, uncovers one‑time watch semantics and a concurrency bug in rpcx’s channel handling, and presents a fix that keeps the IP list up to date.
On 2020‑12‑25 a microservice client in a Kubernetes cluster reported connection timeouts when dialing a service, indicating a service discovery update failure.
The environment uses rpcx with Zookeeper as the registry. Initial investigation checks the IP address of failing pods, discovers the IP is not present in Zookeeper, and suspects stale IP lists.
Network traces with netstat -anp | grep 2181 and tcpdump show two established TCP connections to Zookeeper, one for registration and one for discovery, both exchanging regular heartbeat packets.
Further analysis reveals that Zookeeper watches are one‑time; after a change the watch is removed, so subsequent changes are not notified.
Review of the rpcx discovery code shows the use of ChildrenW (watching) and Children (no watch). The watch‑once behavior can cause missed updates during rapid rolling upgrades.
A concurrency bug is identified: the discovery component creates a channel for each watcher and appends it to a slice without synchronization, leading to lost channels when multiple goroutines register concurrently.
func (d *ZookeeperDiscovery) WatchService() chan []*KVPair {
	ch := make(chan []*KVPair, 10)
	d.chans = append(d.chans, ch)
	return ch
}

An example Go program reproduces the race condition, showing that concurrent appends to a slice can lose elements.
package main

import (
	"fmt"
	"sync"
)

func main() {
	ok := true
	for i := 0; i < 1000; i++ {
		var arr []int
		wg := sync.WaitGroup{}
		for j := 0; j < 2; j++ {
			wg.Add(1)
			go func() {
				defer wg.Done()
				arr = append(arr, i)
			}()
		}
		wg.Wait()
		if len(arr) < 2 {
			fmt.Printf("error:%d \n", i)
			ok = false
			break
		}
	}
	if ok {
		fmt.Println("ok")
	}
}
//error:261

Removing the disaster‑recovery logic that relied on the unreliable watch resolves the issue, and proper synchronization (e.g., a mutex) should be used when modifying shared slices.
The article concludes with recommendations to handle Zookeeper watch semantics correctly and to guard shared slice modifications with proper synchronization to avoid similar concurrency bugs.
Xueersi Online School Tech Team
The Xueersi Online School Tech Team, dedicated to innovating and promoting internet education technology.