Comprehensive Dependency Governance for High‑Availability Backend Systems
This article outlines a systematic approach to dependency governance in high‑traffic backend services, covering service classification, rate limiting, Dubbo, HTTP, database, and message‑queue management to enhance availability, reduce failure impact, and improve overall system stability.
Background
The authors previously shared their cache governance practice. This article extends that stability work to system‑level dependencies: external components, interfaces, and the services they expose (Dubbo, HTTP, DB, MQ, etc.).
Governance Plan
Service Classification and Dependency Governance
1) Applications are graded P1, P2, or P3 according to how core they are to the business and how large the impact of an outage would be, and their dependencies are mapped accordingly.
2) P1 services must be deployed across multiple data centers, ensuring that no single data center holds more than half of the online instances, thereby reducing the impact of a single‑site failure.
3) Strong dependencies are weakened to enable degradation; weak dependencies are made asynchronous to allow circuit‑breaking. Critical‑to‑critical calls receive pre‑planned fallback strategies, while non‑critical calls are isolated to prevent cascading failures.
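The multi‑data‑center rule in item 2 can be checked mechanically. The following is a minimal sketch, not the team's actual tooling: the class name, method name, and data‑center names are hypothetical, and the instance counts are assumed to come from whatever service registry is in use.

```java
import java.util.Map;

/** Checks the P1 rule: no single data center may hold more than half of the online instances. */
final class DataCenterSpread {
    private DataCenterSpread() {}

    /**
     * @param instancesPerDc online instance count keyed by data-center name (hypothetical names)
     * @return true if every data center holds at most half of all online instances
     */
    static boolean satisfiesP1Rule(Map<String, Integer> instancesPerDc) {
        int total = 0;
        for (int count : instancesPerDc.values()) {
            total += count;
        }
        if (total == 0) {
            return false; // nothing online, so the service is not validly deployed
        }
        for (int count : instancesPerDc.values()) {
            // "more than half" written as 2 * count > total to avoid integer-division rounding
            if (2 * count > total) {
                return false;
            }
        }
        return true;
    }
}
```

Under this reading of the rule, a service deployed in only one data center always fails the check, since that site holds all instances.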
Rate Limiting
The team adopts Sentinel as a unified traffic‑control component. It provides dynamic rate limiting for Dubbo and HTTP interfaces, business‑level throttling based on request parameters, and optional cluster‑wide limits. Thresholds are set with care so that normal traffic spikes are not throttled and user experience is not degraded.
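To make the idea of throttling by request parameter concrete, here is a small standard‑library sketch of a fixed‑window limiter keyed by a parameter value. It is an illustration only and does not use or reproduce Sentinel's API; the class name, the window algorithm, and the injected clock are all assumptions made for the example.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.LongSupplier;

/** Fixed-window limiter keyed by a request parameter (for example a user or channel ID). */
final class ParamRateLimiter {
    private static final class Window {
        long startMillis;
        int count;
    }

    private final int limitPerWindow;
    private final long windowMillis;
    private final LongSupplier clock; // injected so the limiter can be tested without sleeping
    private final ConcurrentHashMap<String, Window> windows = new ConcurrentHashMap<>();

    ParamRateLimiter(int limitPerWindow, long windowMillis, LongSupplier clock) {
        this.limitPerWindow = limitPerWindow;
        this.windowMillis = windowMillis;
        this.clock = clock;
    }

    /** Returns true if the request for this parameter value is allowed in the current window. */
    boolean tryAcquire(String paramValue) {
        long now = clock.getAsLong();
        boolean[] allowed = new boolean[1];
        // compute() runs atomically per key, so concurrent callers cannot exceed the limit
        windows.compute(paramValue, (key, w) -> {
            if (w == null || now - w.startMillis >= windowMillis) {
                w = new Window(); // start a fresh window for this key
                w.startMillis = now;
            }
            if (w.count < limitPerWindow) {
                w.count++;
                allowed[0] = true;
            }
            return w;
        });
        return allowed[0];
    }
}
```

In production the clock would typically be `System::currentTimeMillis`. Each parameter value gets its own budget, so one noisy caller is throttled without affecting others.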
Dubbo Governance
Key measures include monitoring Dubbo thread pools, isolating core and non‑core interfaces into separate thread pools, and configuring reasonable timeout values on both provider and consumer sides.
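The isolation and timeout ideas above can be sketched with plain `java.util.concurrent` executors. This is not Dubbo configuration; the pool sizes, the class name `IsolatedPools`, and the fallback‑on‑failure behavior are assumptions chosen for illustration.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.RejectedExecutionException;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

/** Separate bounded pools so that slow non-core calls cannot starve core calls. */
final class IsolatedPools {
    final ThreadPoolExecutor core = newPool(4, 16);   // sizes are illustrative only
    final ThreadPoolExecutor nonCore = newPool(1, 1);

    private static ThreadPoolExecutor newPool(int threads, int queueSize) {
        return new ThreadPoolExecutor(
                threads, threads, 0L, TimeUnit.MILLISECONDS,
                new ArrayBlockingQueue<>(queueSize),
                runnable -> {
                    Thread t = new Thread(runnable);
                    t.setDaemon(true); // do not keep the JVM alive for idle workers
                    return t;
                },
                new ThreadPoolExecutor.AbortPolicy()); // reject instead of queueing without bound
    }

    /** Runs the task on the given pool; returns the fallback on rejection, timeout, or failure. */
    <T> T call(ThreadPoolExecutor pool, Callable<T> task, long timeoutMillis, T fallback) {
        Future<T> future;
        try {
            future = pool.submit(task);
        } catch (RejectedExecutionException e) {
            return fallback; // pool saturated: fail fast
        }
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true);
            return fallback;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            future.cancel(true);
            return fallback;
        } catch (ExecutionException e) {
            return fallback;
        }
    }

    void shutdown() {
        core.shutdownNow();
        nonCore.shutdownNow();
    }
}
```

When the non‑core pool is saturated, its callers receive the fallback immediately while the core pool continues to serve requests.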
HTTP Governance
Practices involve setting appropriate timeout thresholds, encouraging asynchronous calls, implementing controlled retries, and isolating thread pools and clients to prevent cross‑interference.
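A controlled retry can be sketched as a helper that caps the number of attempts and backs off between them. The helper below is hypothetical and independent of any particular HTTP client; it assumes the caller wraps I/O failures such as timeouts in `UncheckedIOException`, and it deliberately does not retry other exceptions.

```java
import java.io.UncheckedIOException;
import java.util.function.Supplier;

/** Retries a call a bounded number of times, only on I/O failures, with linear backoff. */
final class BoundedRetry {
    private BoundedRetry() {}

    static <T> T call(Supplier<T> request, int maxAttempts, long backoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        UncheckedIOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return request.get();
            } catch (UncheckedIOException e) {
                last = e; // transient I/O failure: eligible for retry
            }
            if (attempt < maxAttempts) {
                try {
                    Thread.sleep(backoffMillis * attempt); // wait longer after each failure
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    break; // stop retrying if the calling thread is interrupted
                }
            }
        }
        throw last; // all attempts failed: surface the last I/O error
    }
}
```

Capping attempts matters because unbounded retries multiply load on a dependency that is already struggling. Retries are also only safe for requests that are idempotent on the server side.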
Database Governance
High availability is ensured through multi‑replica storage, rapid recovery mechanisms, and removal of unnecessary data. Query‑performance monitoring and MyBatis interceptors are used to detect problems early.
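The interceptor idea amounts to timing each statement and recording the ones that exceed a threshold. The sketch below shows that logic with the standard library only. It is not the MyBatis `Interceptor` API; the class name, the statement IDs, and the injected nanosecond clock are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.LongSupplier;
import java.util.function.Supplier;

/** Times each query and records the statements that run longer than a threshold. */
final class SlowQueryMonitor {
    private final long thresholdMillis;
    private final LongSupplier nanoClock; // injected so tests do not depend on wall-clock time
    private final List<String> slowStatements = new CopyOnWriteArrayList<>();

    SlowQueryMonitor(long thresholdMillis, LongSupplier nanoClock) {
        this.thresholdMillis = thresholdMillis;
        this.nanoClock = nanoClock;
    }

    /** Runs the query and records its statement ID if it took at least thresholdMillis. */
    <T> T timed(String statementId, Supplier<T> query) {
        long start = nanoClock.getAsLong();
        try {
            return query.get();
        } finally {
            long elapsedMillis = (nanoClock.getAsLong() - start) / 1_000_000L;
            if (elapsedMillis >= thresholdMillis) {
                slowStatements.add(statementId + " took " + elapsedMillis + " ms");
            }
        }
    }

    /** Returns a snapshot of the slow statements recorded so far. */
    List<String> slowStatements() {
        return new ArrayList<>(slowStatements);
    }
}
```

In a real service the clock would be `System::nanoTime`, and the recorded entries would more likely feed a metrics or alerting system than an in‑memory list.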
MQ Governance
The approach handles a single MQ failure or a message backlog by failing over quickly to alternative channels, using multiple topics or MQ clusters, and by keeping consumption idempotent so that messages can be redelivered after a switch without being lost or applied twice.
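Idempotent consumption can be sketched as deduplication on a message ID. The example below keeps processed IDs in memory purely for illustration. The class name is hypothetical, and a real deployment would presumably keep the IDs in shared storage so that every consumer instance sees them.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/** Wraps a message handler so that a redelivered message ID is processed only once. */
final class IdempotentConsumer {
    private final Set<String> processedIds = ConcurrentHashMap.newKeySet();
    private final Consumer<String> handler;

    IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    /** @return true if the message was handled now, false if it was a duplicate and skipped */
    boolean onMessage(String messageId, String payload) {
        if (!processedIds.add(messageId)) {
            return false; // already seen, for example delivered by both the primary and backup channel
        }
        try {
            handler.accept(payload);
            return true;
        } catch (RuntimeException e) {
            processedIds.remove(messageId); // forget the ID so a later redelivery can retry
            throw e;
        }
    }
}
```

With this wrapper, sending the same message through a backup topic or cluster during failover is safe, because the second delivery is skipped.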
Additional Practices
Monitoring is enhanced for Dubbo, HTTP, and DB operations; dashboards include app‑code dimensions for quick inspection; and timeout configurations are regularly reviewed for optimal values.
Governance Process
The workflow mirrors the earlier cache governance work: identify scenarios, define solutions, develop and test, deploy, and conduct online drills with iterative improvements. Deployment is staged: rate‑limiting components and monitoring are added first, and settings are then optimized based on the observed metrics.
Summary
Post‑incident reviews drive proactive measures that reduce failure frequency, duration, and impact. Dependency governance is an ongoing effort, with future plans to automate dependency tagging and integrate with dedicated service‑governance platforms for dynamic detection and rapid response.
Qunar Tech Salon
Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.