
Five Patterns to Make Your Microservice Fault‑Tolerant

This article explains essential fault-tolerance patterns for microservices: timeouts, retries, circuit breakers, distributed deadlines, and rate limiting. It covers the basic form of each pattern, its drawbacks, and practical implementation strategies that improve reliability and prevent cascading failures.

Architects Research Society

In this article we introduce fault tolerance in microservices, defining it as the ability of a system to continue operating when some components fail.

Timeouts

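Timeouts specify a maximum waiting period for an event. The article discusses the shortcomings of the socket-level SO_TIMEOUT option, which bounds only the wait for each individual read and not the duration of the whole request, and recommends end-to-end request timeouts instead. Its examples include the HTTP client introduced in JDK 11, OkHttp, and Go's standard library.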
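
The difference between a per-read timeout and an end-to-end timeout can be sketched in a few lines. This is a minimal illustration in Python's asyncio, not code from the article: read_chunk, download, and download_with_deadline are hypothetical names, and the "server" is simulated with sleeps. A response that trickles in slowly never trips the per-read timeout, so only the outer timeout bounds the total duration.

```python
import asyncio
from typing import Optional


async def read_chunk(gap_s: float) -> bytes:
    """Hypothetical stand-in for one socket read that takes gap_s seconds."""
    await asyncio.sleep(gap_s)
    return b"x"


async def download(chunks: int, gap_s: float, per_read_timeout_s: float) -> bytes:
    """Read all chunks, applying a timeout to each read only (SO_TIMEOUT-like)."""
    data = b""
    for _ in range(chunks):
        data += await asyncio.wait_for(read_chunk(gap_s), per_read_timeout_s)
    return data


async def download_with_deadline(
    chunks: int, gap_s: float, per_read_timeout_s: float, total_timeout_s: float
) -> Optional[bytes]:
    """Bound the whole request, however many reads it needs."""
    try:
        return await asyncio.wait_for(
            download(chunks, gap_s, per_read_timeout_s), total_timeout_s
        )
    except asyncio.TimeoutError:
        return None  # the caller decides how to degrade
```

With ten chunks arriving 20 ms apart, a 500 ms per-read timeout never fires even though the request takes about 200 ms in total; a 50 ms end-to-end timeout cancels it.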

Retries

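Retries are useful when transient failures occur. The article warns about retry storms in a chain of services, where retries multiply at every hop: with three calling layers that each make up to three attempts, the bottom service can receive up to 27 requests for a single user request. It suggests distinguishing retryable from non-retryable errors and using an error budget to limit retries.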
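
One way to combine both suggestions is sketched below. The exception classes, RetryBudget, and call_with_retries are hypothetical names, and modelling the budget as a maximum ratio of retries to requests is an assumption, not necessarily the article's scheme. Backoff and jitter between attempts are left out to keep the sketch short.

```python
class RetryableError(Exception):
    """Transient failure, e.g. a timeout or an overloaded dependency."""


class NonRetryableError(Exception):
    """Permanent failure, e.g. a validation error; retrying cannot help."""


class RetryBudget:
    """Permit a retry only while retries / requests stays within max_ratio."""

    def __init__(self, max_ratio: float = 0.1) -> None:
        self.max_ratio = max_ratio
        self.requests = 0
        self.retries = 0

    def record_request(self) -> None:
        self.requests += 1

    def try_acquire(self) -> bool:
        if self.retries + 1 > self.requests * self.max_ratio:
            return False  # budget exhausted: fail fast instead of piling on
        self.retries += 1
        return True


def call_with_retries(fn, budget: RetryBudget, max_attempts: int = 3):
    """Call fn, retrying only RetryableError and only while the budget allows."""
    budget.record_request()
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except RetryableError:
            # Any other exception, including NonRetryableError, propagates at once.
            if attempt == max_attempts or not budget.try_acquire():
                raise
    raise ValueError("max_attempts must be at least 1")
```

Because the budget is shared across requests, a dependency that is failing for everyone quickly exhausts it, and callers stop retrying instead of amplifying the load.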

Circuit Breaker

Circuit breakers act as a stricter form of error budgeting: when the error rate exceeds a threshold, calls are short-circuited and a fallback is returned. The article mentions Hystrix, which is now in maintenance mode, and Resilience4j, the library the Hystrix project recommends for new projects.
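
The state machine behind a circuit breaker can be sketched as follows. This is a simplified, single-threaded illustration and not the Hystrix or Resilience4j implementation: the class, its parameters, and the count-based sliding window are assumptions made for the example. The clock is injectable so the behaviour can be exercised without waiting.

```python
import time
from collections import deque


class CircuitBreaker:
    """Open when the failure rate over the last window_size calls exceeds the threshold."""

    def __init__(self, failure_rate_threshold=0.5, window_size=10,
                 open_seconds=30.0, clock=time.monotonic):
        self.threshold = failure_rate_threshold
        self.window = deque(maxlen=window_size)  # True = success, False = failure
        self.open_seconds = open_seconds
        self.clock = clock
        self.opened_at = None  # None means the breaker is closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.open_seconds:
                return fallback()  # open: short-circuit without calling fn
            # Half-open: let a single trial call through.
            try:
                result = fn()
            except Exception:
                self.opened_at = self.clock()  # still failing: re-open
                return fallback()
            self.opened_at = None  # recovered: close and start a fresh window
            self.window.clear()
            return result
        try:
            result = fn()
        except Exception:
            self._record(False)
            return fallback()
        self._record(True)
        return result

    def _record(self, ok):
        self.window.append(ok)
        window_full = len(self.window) == self.window.maxlen
        failure_rate = self.window.count(False) / len(self.window)
        if window_full and failure_rate > self.threshold:
            self.opened_at = self.clock()
```

While the breaker is open, the failing dependency receives no traffic at all, which gives it room to recover and keeps callers from waiting on calls that are likely to fail.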

Deadlines / Distributed Timeouts

Distributed deadlines propagate a deadline timestamp or remaining timeout through downstream services, allowing each service to stop processing when the overall deadline is reached. The article explains how to calculate remaining time and the challenges of clock skew.
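
The remaining-timeout variant can be sketched as below. The header name X-Request-Timeout-Ms, the Deadline class, and its methods are hypothetical, chosen only for illustration. Each service keeps a local deadline on its own monotonic clock and sends the time left, not an absolute timestamp, which sidesteps clock skew between hosts at the cost of ignoring network transit time.

```python
import time

TIMEOUT_HEADER = "X-Request-Timeout-Ms"  # hypothetical header name


class DeadlineExceeded(Exception):
    """Raised when work continues past the propagated deadline."""


class Deadline:
    def __init__(self, timeout_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + timeout_s  # local clock only

    def remaining_s(self):
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        """Call before each expensive step to stop work nobody is waiting for."""
        if self.remaining_s() <= 0:
            raise DeadlineExceeded()

    def to_headers(self):
        """Headers for an outgoing call: whatever time is left right now."""
        return {TIMEOUT_HEADER: str(int(self.remaining_s() * 1000))}

    @classmethod
    def from_headers(cls, headers, default_timeout_s, clock=time.monotonic):
        """Rebuild a local deadline from an incoming request's headers."""
        raw = headers.get(TIMEOUT_HEADER)
        timeout_s = int(raw) / 1000 if raw is not None else default_timeout_s
        return cls(timeout_s, clock)
```

A service would call from_headers when a request arrives, call check between steps, and attach to_headers to every downstream request so the budget shrinks along the chain.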

Rate Limiter

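Rate limiting protects services from overload by limiting either the rate of inbound requests or the number of concurrent executions. Both static and dynamic limiters are described. Dynamic limiters adjust the limit based on metrics such as latency percentiles, using an additive-increase/multiplicative-decrease (AIMD) algorithm that raises the limit by a fixed step while the service is healthy and multiplies it by a ratio below 1 otherwise: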

    if healthy {
        limit = limit + increase;       // additive increase
    } else {
        limit = limit * decreaseRatio;  // multiplicative decrease, 0 < decreaseRatio < 1.0
    }
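
The AIMD rule above can be wrapped in a small concurrency limiter, sketched here under some loud assumptions: the class and parameter names are hypothetical, health is judged from the latency of each completed call against a fixed threshold (a production limiter would more likely use a latency percentile over a window), and the code is not thread-safe.

```python
class AimdLimiter:
    """Concurrency limiter whose limit follows additive increase, multiplicative decrease."""

    def __init__(self, initial_limit=10, min_limit=1, max_limit=100,
                 increase=1, decrease_ratio=0.5, latency_threshold_s=0.2):
        self.limit = initial_limit
        self.min_limit = min_limit
        self.max_limit = max_limit
        self.increase = increase
        self.decrease_ratio = decrease_ratio  # 0 < decrease_ratio < 1.0
        self.latency_threshold_s = latency_threshold_s
        self.in_flight = 0

    def try_acquire(self):
        """Admit the request only if a concurrency slot is free."""
        if self.in_flight >= int(self.limit):
            return False  # shed load instead of queueing
        self.in_flight += 1
        return True

    def release(self, latency_s):
        """Free the slot and adapt the limit from the observed latency."""
        self.in_flight -= 1
        healthy = latency_s <= self.latency_threshold_s
        if healthy:
            self.limit = min(self.max_limit, self.limit + self.increase)
        else:
            self.limit = max(self.min_limit, self.limit * self.decrease_ratio)
```

The asymmetry is deliberate: the limit creeps up slowly while calls are fast and drops sharply as soon as latency degrades, so the service backs off quickly under overload.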

The article concludes that applying these patterns together with good observability can greatly improve service reliability.
