Design and Go Implementation of a Service Circuit Breaker
This article explains the design and Go implementation of a microservice circuit breaker, covering fault‑tolerance mechanisms, state transitions, configurable trip strategies, metrics collection, testing, and deployment patterns such as centralized gateways and service mesh.
He Peng, currently working at an internet finance company, focuses on architecture and development management, especially in distributed systems and risk control.
I. Summary
In microservice architectures, service timeouts or communication failures often lead to cascading failures (avalanche effect). Rate limiting and circuit breaking are essential solutions. A previous article discussed various rate‑limiting implementations.
II. Microservice Fault‑Tolerance Mechanism
Microservice dependencies can cause complex failure cascades. When Service C fails, Service B may repeatedly retry, exhausting resources and causing Service A to become unavailable—a classic avalanche scenario.
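To prevent this, a robust fault-tolerance mechanism is needed. The starting point is redundancy through clustering, combined with load balancing and a strategy for handling failed calls. Common strategies are:
Failover – redirect the call to a healthy instance.
Failback – record the failure and send a notification so it can be handled later.
Failsafe – absorb the error and degrade safely.
Failfast – abort immediately on error, without retrying.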
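The difference between failfast and failover can be sketched in a few lines of Go. This is a minimal illustration, not part of the article's library: callFailfast, callFailover, and errUnreachable are hypothetical names, and the cluster is simulated by a closure.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnreachable is a hypothetical error used by the simulated cluster.
var errUnreachable = errors.New("instance unreachable")

// callFailfast tries a single instance and returns its error immediately.
func callFailfast(instance string, call func(string) error) error {
	return call(instance)
}

// callFailover tries each instance in order and returns nil on the first
// success. If every instance fails, it wraps the last error.
func callFailover(instances []string, call func(string) error) error {
	if len(instances) == 0 {
		return errors.New("no instances available")
	}
	var lastErr error
	for _, inst := range instances {
		err := call(inst)
		if err == nil {
			return nil
		}
		lastErr = fmt.Errorf("%s: %w", inst, err)
	}
	return fmt.Errorf("all instances failed, last error: %w", lastErr)
}

func main() {
	// Simulated cluster in which only instance "b" is healthy.
	call := func(inst string) error {
		if inst == "b" {
			return nil
		}
		return errUnreachable
	}
	fmt.Println("failfast:", callFailfast("a", call))
	fmt.Println("failover:", callFailover([]string{"a", "b"}, call))
}
```

Failover trades latency for availability, since each failed attempt adds to the response time. This is one reason unbounded retries against a failing service contribute to the avalanche described above.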
Besides clustering, both circuit breaking and rate limiting are required. Rate limiting protects a service from being overloaded by its callers, while circuit breaking stops a caller from sending requests to a dependency that is failing.
III. Circuit Breaker Design and Implementation
Design Idea
The circuit breaker concept originates from electrical fuses and has been applied to financial markets. In microservices, the idea is similar: automatically stop calls when a service is unhealthy.
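type ServiceBreaker struct {
	mu               sync.RWMutex     // guards all fields below
	name             string           // breaker name, used in logs and hooks
	state            State            // current state
	windowInterval   time.Duration    // length of a metrics window while closed
	metrics          Metrics          // counters for the current window
	tripStrategyFunc TripStrategyFunc // decides when a closed breaker trips
	halfMaxCalls     uint64           // call limit while half-open
	stateOpenTime    time.Time        // when the breaker last opened
	sleepTimeout     time.Duration    // cooldown before open becomes half-open
	// stateChangeHook, if set, is invoked on every state change.
	stateChangeHook func(name string, fromState State, toState State)
}

The struct holds a read-write lock, the breaker name, the current state, the metrics window interval, the metrics themselves, a configurable trip strategy, the half-open call limit, the time the breaker last opened, the cooldown (sleep) timeout, and an optional state-change hook.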
type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

func (s State) String() string {
	switch s {
	case StateClosed:
		return "closed"
	case StateHalfOpen:
		return "half-open"
	case StateOpen:
		return "open"
	default:
		return fmt.Sprintf("unknown state: %d", s)
	}
}

The Call method wraps the execution with beforeCall and afterCall hooks, handling panics and updating metrics.
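func (breaker *ServiceBreaker) Call(exec func() (interface{}, error)) (interface{}, error) {
	// Reject the call up front if the current state does not allow it.
	if err := breaker.beforeCall(); err != nil {
		return nil, err
	}
	// If exec panics, record a failure and re-raise the panic.
	defer func() {
		if r := recover(); r != nil {
			breaker.afterCall(false)
			panic(r)
		}
	}()
	// Count the call under the lock: metrics is shared between goroutines.
	breaker.mu.Lock()
	breaker.metrics.OnCall()
	breaker.mu.Unlock()
	result, err := exec()
	breaker.afterCall(err == nil)
	return result, err
}

Before Call Check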
func (breaker *ServiceBreaker) beforeCall() error {
	breaker.mu.Lock()
	defer breaker.mu.Unlock()
	now := time.Now()
	switch breaker.state {
	case StateOpen:
		// Once the cooldown has elapsed, let this call through as a half-open probe.
		if breaker.stateOpenTime.Add(breaker.sleepTimeout).Before(now) {
			log.Printf("%s cooldown passed, trying half-open", breaker.name)
			breaker.changeState(StateHalfOpen, now)
			return nil
		}
		log.Printf("%s is open, request blocked", breaker.name)
		return ErrStateOpen
	case StateHalfOpen:
		// Admit at most halfMaxCalls trial calls while half-open.
		if breaker.metrics.CountAll >= breaker.halfMaxCalls {
			log.Printf("%s is half-open, too many calls, request blocked", breaker.name)
			return ErrTooManyCalls
		}
	default: // StateClosed
		// Roll over to a fresh metrics window once the current one has expired.
		if !breaker.metrics.WindowTimeStart.IsZero() && breaker.metrics.WindowTimeStart.Before(now) {
			breaker.nextWindow(now)
		}
	}
	return nil
}

beforeCall decides whether a request may proceed. An open breaker rejects calls until the cooldown has passed, a half-open breaker admits a limited number of trial calls, and a closed breaker admits every call while rolling the metrics window forward.
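The listing refers to ErrStateOpen and ErrTooManyCalls without showing their definitions. The sketch below assumes they are plain sentinel errors and shows how a caller might tell a rejection by the breaker apart from a failure of the wrapped service. The error messages, errBackend, and callWithFallback are illustrative assumptions, and the call parameter stands in for a closure around breaker.Call.

```go
package main

import (
	"errors"
	"fmt"
)

// Assumed definitions: the article uses these names but does not list them.
var (
	ErrStateOpen    = errors.New("service breaker is open")
	ErrTooManyCalls = errors.New("service breaker is half-open: too many calls")
)

// errBackend is a hypothetical error returned by the wrapped service.
var errBackend = errors.New("backend failure")

// callWithFallback runs call and returns fallback when the breaker rejected
// the request. Errors from the wrapped service pass through unchanged.
func callWithFallback(call func() (interface{}, error), fallback interface{}) (interface{}, error) {
	result, err := call()
	if errors.Is(err, ErrStateOpen) || errors.Is(err, ErrTooManyCalls) {
		return fallback, nil
	}
	return result, err
}

func main() {
	rejected := func() (interface{}, error) { return nil, ErrStateOpen }
	v, err := callWithFallback(rejected, "cached value")
	fmt.Println(v, err) // prints: cached value <nil>

	failed := func() (interface{}, error) { return nil, errBackend }
	_, err = callWithFallback(failed, "cached value")
	fmt.Println(err) // prints: backend failure
}
```

Keeping the two cases separate lets a caller serve a degraded response while the breaker is open, without hiding real backend errors.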
After Call Processing
func (breaker *ServiceBreaker) afterCall(success bool) {
	breaker.mu.Lock()
	defer breaker.mu.Unlock()
	now := time.Now()
	if success {
		breaker.onSuccess(now)
	} else {
		breaker.onFail(now)
	}
}

onSuccess updates the success counters; onFail updates the failure counters and, according to the trip strategy, may trigger a state transition.
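The article calls onSuccess and onFail but does not list them. The sketch below shows one plausible shape under stated assumptions: while half-open, halfMaxCalls consecutive successes close the breaker and a single failure re-opens it; while closed, the trip strategy decides. The miniBreaker and counters types are trimmed stand-ins invented for this example, not the library's types.

```go
package main

import "fmt"

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

// counters is a trimmed stand-in for the article's Metrics struct.
type counters struct {
	ConsecutiveSuccess uint64
	ConsecutiveFail    uint64
}

// miniBreaker keeps only the fields this sketch needs.
type miniBreaker struct {
	state        State
	metrics      counters
	halfMaxCalls uint64
	trip         func(counters) bool
}

// onSuccess (assumed behaviour): while half-open, enough consecutive
// successes close the breaker again.
func (b *miniBreaker) onSuccess() {
	b.metrics.ConsecutiveSuccess++
	b.metrics.ConsecutiveFail = 0
	if b.state == StateHalfOpen && b.metrics.ConsecutiveSuccess >= b.halfMaxCalls {
		b.setState(StateClosed)
	}
}

// onFail (assumed behaviour): while closed, the trip strategy decides;
// while half-open, a single failure re-opens the breaker.
func (b *miniBreaker) onFail() {
	b.metrics.ConsecutiveFail++
	b.metrics.ConsecutiveSuccess = 0
	switch b.state {
	case StateClosed:
		if b.trip(b.metrics) {
			b.setState(StateOpen)
		}
	case StateHalfOpen:
		b.setState(StateOpen)
	}
}

// setState switches state and resets the counters, as a new window would.
func (b *miniBreaker) setState(s State) {
	b.state = s
	b.metrics = counters{}
}

func main() {
	b := &miniBreaker{
		halfMaxCalls: 2,
		trip:         func(c counters) bool { return c.ConsecutiveFail >= 3 },
	}
	b.onFail()
	b.onFail()
	b.onFail()
	fmt.Println(b.state == StateOpen) // true: three consecutive failures

	b.state = StateHalfOpen // in the article this is done by beforeCall after the cooldown
	b.onSuccess()
	b.onSuccess()
	fmt.Println(b.state == StateClosed) // true: two successful trial calls
}
```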
Metrics and Sliding Window
type Metrics struct {
	WindowBatch        uint64    // sequence number of the current window
	WindowTimeStart    time.Time // when the next window begins; zero disables time-based rollover
	CountAll           uint64    // calls in the current window
	CountSuccess       uint64
	CountFail          uint64
	ConsecutiveSuccess uint64
	ConsecutiveFail    uint64
}

func (m *Metrics) NewBatch() { m.WindowBatch++ }
func (m *Metrics) OnCall()   { m.CountAll++ }

func (m *Metrics) OnSuccess() {
	m.CountSuccess++
	m.ConsecutiveSuccess++
	m.ConsecutiveFail = 0
}

func (m *Metrics) OnFail() {
	m.CountFail++
	m.ConsecutiveFail++
	m.ConsecutiveSuccess = 0
}

func (m *Metrics) OnReset() {
	m.CountAll, m.CountSuccess, m.CountFail = 0, 0, 0
	m.ConsecutiveSuccess, m.ConsecutiveFail = 0, 0
}

The sliding window groups metrics into batches. Starting a new window resets the counters and sets WindowTimeStart which, despite its name, holds the time at which the current window expires and the next one begins. Its value depends on the breaker state.
func (breaker *ServiceBreaker) nextWindow(now time.Time) {
	breaker.metrics.NewBatch()
	breaker.metrics.OnReset()
	var zero time.Time
	switch breaker.state {
	case StateClosed:
		if breaker.windowInterval == 0 {
			breaker.metrics.WindowTimeStart = zero
		} else {
			breaker.metrics.WindowTimeStart = now.Add(breaker.windowInterval)
		}
	case StateOpen:
		breaker.metrics.WindowTimeStart = now.Add(breaker.sleepTimeout)
	default: // StateHalfOpen
		breaker.metrics.WindowTimeStart = zero
	}
}

State Transition Logic
func (breaker *ServiceBreaker) changeState(state State, now time.Time) {
	if breaker.state == state {
		return
	}
	prev := breaker.state
	breaker.state = state
	// Use the caller's timestamp so the window and stateOpenTime agree.
	breaker.nextWindow(now)
	if state == StateOpen {
		breaker.stateOpenTime = now
	}
	if breaker.stateChangeHook != nil {
		breaker.stateChangeHook(breaker.name, prev, state)
	}
}

When the state changes, a new metrics window is started and the optional hook is invoked.
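As an illustration of the hook, here is a sketch of a recorder whose method matches the stateChangeHook signature. The transitionLog type is hypothetical. Because changeState runs while the breaker's mutex is held (beforeCall takes the lock before calling it), a hook should return quickly and must not call back into the same breaker.

```go
package main

import (
	"fmt"
	"sync"
)

type State int

const (
	StateClosed State = iota
	StateOpen
	StateHalfOpen
)

func (s State) String() string {
	switch s {
	case StateClosed:
		return "closed"
	case StateHalfOpen:
		return "half-open"
	case StateOpen:
		return "open"
	default:
		return fmt.Sprintf("unknown state: %d", s)
	}
}

// transitionLog is a hypothetical recorder of state changes. It has its own
// mutex so that hooks from several breakers can share one log.
type transitionLog struct {
	mu      sync.Mutex
	entries []string
}

// Hook has the same signature as the article's stateChangeHook field.
func (l *transitionLog) Hook(name string, fromState State, toState State) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.entries = append(l.entries, fmt.Sprintf("%s: %s -> %s", name, fromState, toState))
}

// Entries returns a copy of the recorded transitions.
func (l *transitionLog) Entries() []string {
	l.mu.Lock()
	defer l.mu.Unlock()
	return append([]string(nil), l.entries...)
}

func main() {
	rec := &transitionLog{}
	// In real use, rec.Hook would be passed as Option.StateChangeHook.
	var hook func(name string, fromState State, toState State) = rec.Hook
	hook("breaker1", StateClosed, StateOpen)
	hook("breaker1", StateOpen, StateHalfOpen)
	for _, e := range rec.Entries() {
		fmt.Println(e)
	}
}
```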
Trip Strategies
type TripStrategyFunc func(Metrics) bool

// ConsecutiveFailTripFunc trips after threshold failures in a row.
func ConsecutiveFailTripFunc(threshold uint64) TripStrategyFunc {
	return func(m Metrics) bool { return m.ConsecutiveFail >= threshold }
}

// FailTripFunc trips after threshold failures within the current window.
func FailTripFunc(threshold uint64) TripStrategyFunc {
	return func(m Metrics) bool { return m.CountFail >= threshold }
}

// FailRateTripFunc trips when at least minCalls calls were made in the
// current window and the failure rate has reached rate.
func FailRateTripFunc(rate float64, minCalls uint64) TripStrategyFunc {
	return func(m Metrics) bool {
		if m.CountAll == 0 {
			return false
		}
		currRate := float64(m.CountFail) / float64(m.CountAll)
		return m.CountAll >= minCalls && currRate >= rate
	}
}

// ChooseTrip maps the configured strategy to its function.
// The failure-rate strategy is the default.
func ChooseTrip(op *TripStrategyOption) TripStrategyFunc {
	switch op.Strategy {
	case ConsecutiveFailTrip:
		return ConsecutiveFailTripFunc(op.ConsecutiveFailThreshold)
	case FailTrip:
		return FailTripFunc(op.FailThreshold)
	case FailRateTrip:
		fallthrough
	default:
		return FailRateTripFunc(op.FailRate, op.MinCall)
	}
}

Three strategies are supported: consecutive failures, total failures, and failure rate with a minimum call threshold.
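To make the difference between the three strategies concrete, the following sketch evaluates each of them against the same window. The strategy functions are copied from the listing; the Metrics struct is trimmed to the fields they read, and the sample numbers are made up for illustration.

```go
package main

import "fmt"

// Metrics is trimmed to the fields the strategies read.
type Metrics struct {
	CountAll        uint64
	CountFail       uint64
	ConsecutiveFail uint64
}

type TripStrategyFunc func(Metrics) bool

func ConsecutiveFailTripFunc(threshold uint64) TripStrategyFunc {
	return func(m Metrics) bool { return m.ConsecutiveFail >= threshold }
}

func FailTripFunc(threshold uint64) TripStrategyFunc {
	return func(m Metrics) bool { return m.CountFail >= threshold }
}

func FailRateTripFunc(rate float64, minCalls uint64) TripStrategyFunc {
	return func(m Metrics) bool {
		if m.CountAll == 0 {
			return false
		}
		currRate := float64(m.CountFail) / float64(m.CountAll)
		return m.CountAll >= minCalls && currRate >= rate
	}
}

func main() {
	// Sample window: 5 calls, 4 of them failed, the last 2 in a row.
	m := Metrics{CountAll: 5, CountFail: 4, ConsecutiveFail: 2}

	fmt.Println(ConsecutiveFailTripFunc(3)(m)) // false: only 2 failures in a row
	fmt.Println(FailTripFunc(4)(m))            // true: 4 failures in the window
	fmt.Println(FailRateTripFunc(0.6, 3)(m))   // true: 4/5 = 0.8 >= 0.6 over 5 >= 3 calls

	// The minimum-call guard keeps a tiny sample from tripping the breaker.
	small := Metrics{CountAll: 2, CountFail: 2, ConsecutiveFail: 2}
	fmt.Println(FailRateTripFunc(0.6, 3)(small)) // false: fewer than 3 calls
}
```

The consecutive-failure strategy reacts to a hard outage, while the failure-rate strategy also catches a service that fails intermittently.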
Configuration Options
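type TripStrategyOption struct {
	Strategy                 uint    // which trip strategy to use
	ConsecutiveFailThreshold uint64  // for the consecutive-failure strategy
	FailThreshold            uint64  // for the total-failure strategy
	FailRate                 float64 // for the failure-rate strategy
	MinCall                  uint64  // minimum calls before the failure rate applies
}

type Option struct {
	Name            string
	WindowInterval  time.Duration // metrics window length while closed
	HalfMaxCalls    uint64        // call limit while half-open
	SleepTimeout    time.Duration // cooldown while open
	StateChangeHook func(name string, fromState State, toState State)
	TripStrategy    TripStrategyOption
}

These options allow fine-grained control over window size, half-open call limits, cooldown periods, and the chosen trip strategy.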
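Two of these options have degenerate zero values. With HalfMaxCalls set to 0, the half-open check CountAll >= halfMaxCalls rejects every call that arrives while the breaker is half-open. With SleepTimeout set to 0, an open breaker moves to half-open on the very next call. The sketch below shows how a constructor could guard against this. The normalize helper and both default values are hypothetical, since the article does not show how NewServiceBreaker treats its options.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// Option is trimmed to the fields this sketch checks.
type Option struct {
	Name           string
	WindowInterval time.Duration
	HalfMaxCalls   uint64
	SleepTimeout   time.Duration
}

// Hypothetical defaults, chosen only for this example.
const (
	defaultHalfMaxCalls uint64        = 1
	defaultSleepTimeout time.Duration = 5 * time.Second
)

// normalize validates opt and fills in defaults for zero values.
func normalize(opt Option) (Option, error) {
	if opt.Name == "" {
		return opt, errors.New("breaker name must not be empty")
	}
	if opt.WindowInterval < 0 || opt.SleepTimeout < 0 {
		return opt, errors.New("durations must not be negative")
	}
	if opt.HalfMaxCalls == 0 {
		opt.HalfMaxCalls = defaultHalfMaxCalls
	}
	if opt.SleepTimeout == 0 {
		opt.SleepTimeout = defaultSleepTimeout
	}
	return opt, nil
}

func main() {
	opt, err := normalize(Option{Name: "breaker1", WindowInterval: 5 * time.Second})
	fmt.Println(opt.HalfMaxCalls, opt.SleepTimeout, err) // prints: 1 5s <nil>

	_, err = normalize(Option{})
	fmt.Println(err) // prints: breaker name must not be empty
}
```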
Testing the Breaker
func initBreaker() *ServiceBreaker {
	opt := Option{
		Name:            "breaker1",
		WindowInterval:  5 * time.Second,
		HalfMaxCalls:    3,
		SleepTimeout:    6 * time.Second,
		TripStrategy:    TripStrategyOption{Strategy: FailRateTrip, FailRate: 0.6, MinCall: 3},
		StateChangeHook: stateChangeHook,
	}
	breaker, _ := NewServiceBreaker(opt)
	return breaker
}

With this configuration the breaker trips once at least 3 calls have been made in a 5-second window and 60% or more of them failed. It then stays open for 6 seconds and admits up to 3 trial calls while half-open. Unit tests simulate successful calls, a burst of failures, and recovery, both sequentially and with five concurrent goroutines, demonstrating the state transitions and the effect of the configured thresholds.
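The concurrent part of such a test can be driven by a small harness like the one below. This is a sketch rather than the article's test code: runConcurrent and errStub are hypothetical, and the call closure stands in for a wrapper around breaker.Call.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
)

// errStub is a hypothetical failure returned by the simulated service.
var errStub = errors.New("stub failure")

// runConcurrent starts the given number of worker goroutines, each invoking
// call perWorker times, and reports how many invocations succeeded and failed.
func runConcurrent(workers, perWorker int, call func() error) (ok, failed int64) {
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < perWorker; j++ {
				if err := call(); err != nil {
					atomic.AddInt64(&failed, 1)
				} else {
					atomic.AddInt64(&ok, 1)
				}
			}
		}()
	}
	wg.Wait()
	return ok, failed
}

func main() {
	var n int64
	// Simulated service: every third invocation fails, whichever goroutine makes it.
	call := func() error {
		if atomic.AddInt64(&n, 1)%3 == 0 {
			return errStub
		}
		return nil
	}
	ok, failed := runConcurrent(5, 6, call)
	fmt.Println(ok, failed) // prints: 20 10
}
```

Running a test built on this harness with the race detector (go test -race) helps confirm that the breaker's counters are only touched under its lock.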
Deployment Patterns
Services can call each other in three patterns:
Direct calls between services.
Centralized gateway (proxy) where all traffic passes through a gateway that can enforce rate limiting and circuit breaking.
Service‑mesh (side‑car) architecture, where a lightweight proxy runs alongside each service instance, providing transparent fault tolerance without code intrusion.
Both gateway‑based and mesh‑based approaches can offload metrics collection and decision making to asynchronous components, reducing latency impact on the request path.
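For the centralized gateway pattern, a breaker can be applied as HTTP middleware in front of each backend. The sketch below is one possible arrangement, not the article's code. CallFunc mirrors the signature of ServiceBreaker.Call, while withBreaker, statusRecorder, probe, and the two stand-in breakers are hypothetical. A backend response with a 5xx status is reported to the breaker as a failure.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// CallFunc has the same shape as ServiceBreaker.Call in the listings above.
type CallFunc func(exec func() (interface{}, error)) (interface{}, error)

var errUpstream = errors.New("upstream returned a 5xx status")

// statusRecorder remembers the status code written by the wrapped handler.
type statusRecorder struct {
	http.ResponseWriter
	status int
}

func (r *statusRecorder) WriteHeader(code int) {
	r.status = code
	r.ResponseWriter.WriteHeader(code)
}

// withBreaker routes every request through call. If the breaker rejects the
// request before the backend runs, the gateway answers 503 itself.
func withBreaker(call CallFunc, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, req *http.Request) {
		rec := &statusRecorder{ResponseWriter: w, status: http.StatusOK}
		ran := false
		_, err := call(func() (interface{}, error) {
			ran = true
			next.ServeHTTP(rec, req)
			if rec.status >= http.StatusInternalServerError {
				return nil, errUpstream
			}
			return nil, nil
		})
		if err != nil && !ran {
			http.Error(w, "service unavailable", http.StatusServiceUnavailable)
		}
	})
}

// Stand-ins for a closed and an open breaker.
func passThrough(exec func() (interface{}, error)) (interface{}, error) { return exec() }
func alwaysOpen(func() (interface{}, error)) (interface{}, error) {
	return nil, errors.New("breaker open")
}

// probe sends one request through the middleware to a backend that answers
// with backendStatus, and returns the status the client sees.
func probe(call CallFunc, backendStatus int) int {
	backend := http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(backendStatus)
	})
	rr := httptest.NewRecorder()
	withBreaker(call, backend).ServeHTTP(rr, httptest.NewRequest(http.MethodGet, "/", nil))
	return rr.Code
}

func main() {
	fmt.Println(probe(passThrough, http.StatusOK)) // prints: 200
	fmt.Println(probe(alwaysOpen, http.StatusOK))  // prints: 503
}
```

In a service mesh the same logic lives in the sidecar proxy, so the application code needs no breaker at all.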
Conclusion
The circuit breaker design presented combines state management, configurable trip strategies, and metrics windows to protect microservices from cascading failures. The implementation is available at https://github.com/skyhackvip/service_breaker.
High Availability Architecture
Official account for High Availability Architecture.