Foundations of High Availability: Defining and Managing Strong and Weak Service Dependencies
This article defines strong versus weak service dependencies, outlines governance through dependency discovery, fault injection, and refactoring, recommends front-end and back-end fault-tolerance measures such as timeouts and circuit breakers, describes isolation and manual degradation switches, explains how classifications are verified, and closes with current gaps (middleware, start-up/shutdown) and hiring information.
1. Definition of Strong and Weak Dependencies
As the company’s business expands, the system becomes increasingly complex, with front‑end reliance on back‑end services and inter‑service dependencies. Without clear strong/weak dependency definitions, it is difficult to perform circuit breaking, degradation, or rate limiting, and to continuously improve system stability.
Services are tiered by business impact:
S1 (Core): affects core business processes and user experience.
S2 (Secondary Core): not core, but a service outage causes widespread user impact.
S3 (Non-core): negligible user impact (e.g., avatar, profile edit).
S4 (Others): almost no impact on online services (e.g., internal operation back-ends).
Strong dependency: when an exception affects core business processes or system availability. Weak dependency: when an exception does not affect core processes or overall availability.
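As a rough sketch, the tiering and strong/weak notions above could be encoded as a simple lookup. The service names and the tier-to-strength mapping below are illustrative assumptions, not the article's actual registry:

```python
from enum import Enum

class Tier(Enum):
    S1 = "core"            # affects core flows and user experience
    S2 = "secondary core"  # outage causes widespread user impact
    S3 = "non-core"        # negligible user impact (avatar, profile edit)
    S4 = "others"          # internal back-ends, no online impact

# Hypothetical service-to-tier registry
SERVICE_TIERS = {
    "create_order": Tier.S1,
    "pricing_rules": Tier.S3,
    "ops_backoffice": Tier.S4,
}

def is_strong_dependency(service: str) -> bool:
    """One plausible mapping: treat S1/S2 as strong, S3/S4 as weak."""
    return SERVICE_TIERS.get(service, Tier.S3) in (Tier.S1, Tier.S2)
```

In practice strength is a property of each caller-to-dependency edge rather than of the tier alone, but a registry like this gives circuit-breaking and degradation logic a single place to consult.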
2. Governance of Strong and Weak Dependencies
Governance means continuously obtaining dependency relationships, traffic, and strength data, detecting potential failure points early, and preventing dependency‑related incidents from degrading user experience.
2.1 Discovery
2.1.1 Manual Review
Initially, a large amount of manpower was invested to read code and list all dependencies in the core ride‑service chain.
Identify the main business and evaluate whether each dependent service impacts it. Example: For the user‑scan QR code flow, many pre‑checks (user eligibility, vehicle status, etc.) are involved.
Through manual analysis, it was found that only a few APIs (create order, start order, end order, query order) are core, while dozens of other services (Redis, DB, MQ, etc.) are weak dependencies that unnecessarily increase the failure surface.
2.1.2 Fault Injection
Service configuration files are used to list dependencies. Fault injection is performed offline by injecting exceptions into each dependency to verify whether the main business remains functional, thereby distinguishing strong from weak dependencies.
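The offline fault-injection step could look like the following minimal sketch: each dependency call is wrapped so an exception can be forced, and the main flow is then exercised to see whether it survives. All names (`unlock_bike`, `create_order`, `pricing_rules`) are illustrative, not the actual service interfaces:

```python
import contextlib

# Dependencies currently forced to fail (in practice, driven by the
# dependency list read from the service configuration file)
FAULTY = set()

@contextlib.contextmanager
def inject_fault(dependency: str):
    """Force calls to `dependency` to raise within the with-block."""
    FAULTY.add(dependency)
    try:
        yield
    finally:
        FAULTY.discard(dependency)

def call(dependency: str) -> str:
    """Simulated RPC: raises if a fault is injected for this dependency."""
    if dependency in FAULTY:
        raise ConnectionError(f"{dependency} unavailable (injected)")
    return f"{dependency}: ok"

def unlock_bike() -> bool:
    """Main flow: order creation is required, pricing rules are optional."""
    call("create_order")          # strong dependency: let failures propagate
    try:
        call("pricing_rules")     # weak dependency: degrade gracefully
    except ConnectionError:
        pass
    return True

# Classification rule: if the main flow survives the injected fault,
# the dependency is weak; if it breaks, the dependency is strong.
with inject_fault("pricing_rules"):
    pricing_is_weak = unlock_bike()
```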
Fault injection is currently being run against two non-S1 services to confirm their strong/weak classification and keep unnecessary strong dependencies out of the core chain.
2.2 Refactoring & Contingency Plans
2.2.1 Front‑end Fault Tolerance
Decouple non‑core backend calls so that failures do not block core flows. Example: The “Confirm Unlock” page fetches pricing rules; if this call is a strong dependency, a failure blocks the button and the core flow. If treated as a weak dependency, the page can still proceed with partial data.
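A minimal sketch of treating the pricing call as a weak dependency, so the page renders with partial data instead of blocking the button. The function names and default payload are hypothetical:

```python
def fetch_pricing_rules() -> dict:
    """Simulated non-core backend call that happens to be failing."""
    raise TimeoutError("pricing service slow")

# Placeholder shown when pricing cannot be fetched
DEFAULT_PRICING = {"note": "Standard rates apply; see in-app pricing page."}

def render_confirm_unlock_page() -> dict:
    """Render the page even when the non-core pricing call fails."""
    try:
        pricing = fetch_pricing_rules()
    except Exception:
        pricing = DEFAULT_PRICING   # degrade: keep the core flow usable
    return {"unlock_button_enabled": True, "pricing": pricing}
```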
2.2.2 Back‑end Fault Tolerance
For weak dependencies, implement proper timeout, circuit breaking, and rate limiting.
Timeout: configure based on 95th/99th-percentile response times, allowing for serialization and network latency.
Circuit Breaker: when a dependency keeps failing, open the circuit to avoid cascading timeouts and return a fallback value.
Rate Limiting: protect core service interfaces from traffic spikes to maintain stability.
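The circuit-breaker idea above can be sketched as a small failure-counting breaker: after a threshold of consecutive errors it fails fast with a fallback instead of letting timeouts cascade, and it retries one trial call after a cool-down. Thresholds and the reset window are illustrative; in practice the per-call timeout would be derived from p95/p99 latencies as described:

```python
import time

class CircuitBreaker:
    """Open after `max_failures` consecutive errors; retry after `reset_after` s."""

    def __init__(self, max_failures: int = 3, reset_after: float = 30.0):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None   # time the circuit opened, or None if closed

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.reset_after:
                return fallback          # open: fail fast, no cascading waits
            self.opened_at = None        # half-open: allow one trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            return fallback
        self.failures = 0                # success closes the circuit again
        return result
```

Production frameworks add failure-rate windows and concurrency-aware half-open probes, but the state machine is the same.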
2.3 Isolation of Core and Non‑core Business
Techniques include thread‑level isolation (semaphores, thread pools), process‑level isolation, business splitting, and group deployment.
In the ride‑service case, thread‑level isolation was infeasible due to the SOA framework, and group deployment could not avoid instability caused by code changes. Therefore, business splitting was chosen: core business was extracted from the service while non‑core logic remained.
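For teams where thread-level isolation is viable, the semaphore approach mentioned above amounts to a bulkhead that caps how many concurrent calls a non-core dependency may consume. This is a generic sketch, not the ride-service solution (which used business splitting):

```python
import threading

class Bulkhead:
    """Cap concurrent calls to a dependency; shed load instead of queueing."""

    def __init__(self, max_concurrent: int):
        self._sem = threading.Semaphore(max_concurrent)

    def call(self, fn, fallback):
        if not self._sem.acquire(blocking=False):
            return fallback              # saturated: reject immediately so
        try:                             # core-business threads are not tied up
            return fn()
        finally:
            self._sem.release()
```

Because rejected calls return the fallback at once, a slow non-core dependency can never exhaust the threads needed by core business.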
2.4 Artificial Degradation Switch
For each scenario, a controllable fallback is configured via dynamic switches, allowing a one‑click switch to fallback logic when exceptions occur.
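A degradation switch of this kind can be sketched as a thread-safe flag consulted on every call; flipping it (in practice via a dynamic-configuration push) routes traffic to the fallback logic. Names are illustrative:

```python
import threading

class DegradationSwitch:
    """One-click, thread-safe switch between normal and fallback logic."""

    def __init__(self):
        self._degraded = threading.Event()

    def flip(self, on: bool):
        if on:
            self._degraded.set()     # e.g. triggered from a config console
        else:
            self._degraded.clear()

    def run(self, normal, fallback):
        return fallback() if self._degraded.is_set() else normal()
```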
3. Verification
Service configuration files are used again to inject failures offline and verify that the main business remains available, confirming the classification of dependencies.
4. Current Status & Issues
Too much focus on service‑to‑service dependencies, neglecting middleware (Redis/HBase/MQ) dependencies.
Emphasis is placed on runtime, overlooking start‑up and shutdown phases.
Recruitment Information
We are the “Two‑Round Technical Risk” team at HelloBike, focusing on high‑traffic, high‑concurrency system stability. We are hiring; interested candidates can send resumes to [email protected].
HelloTech
Official Hello technology account, sharing tech insights and developments.