Application Monitoring Principles and Non‑Intrusive Data Collection at Huya
This article explains the fundamentals of distributed application monitoring, describes Huya's non‑intrusive data‑collection techniques using SDKs and plugins, outlines the design and correlation of observable metrics, and demonstrates practical results and troubleshooting scenarios for backend services.
The article begins with an overview of monitoring types, focusing on application monitoring and its importance for observing service calls, process execution, and business‑related metrics.
It then analyzes the principles of distributed application monitoring, highlighting the main cross‑process challenge: request metrics collected in each service remain isolated unless upstream and downstream request information is correlated.
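As a rough illustration of cross‑process correlation, the sketch below passes a trace ID from an upstream caller to a downstream service through a request header. It is a minimal sketch, not Huya's implementation: the class `TracePropagation`, the header name `X-Trace-Id`, and both method names are assumptions made for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical helper; the header name is an assumption, not taken from the article.
final class TracePropagation {
    static final String TRACE_HEADER = "X-Trace-Id";

    // Upstream side: copy the outgoing headers and attach the current trace ID.
    static Map<String, String> inject(Map<String, String> outgoingHeaders, String traceId) {
        Map<String, String> copy = new HashMap<>(outgoingHeaders);
        copy.put(TRACE_HEADER, traceId);
        return copy;
    }

    // Downstream side: reuse the caller's trace ID, or start a new trace
    // when the request arrives without one (for example, at the entry service).
    static String extractOrCreateTraceId(Map<String, String> incomingHeaders) {
        String traceId = incomingHeaders.get(TRACE_HEADER);
        return (traceId == null || traceId.isEmpty())
                ? UUID.randomUUID().toString()
                : traceId;
    }
}
```

Because every service on the call path records its metrics under the same ID, the upstream and downstream views of one request can be joined later instead of remaining isolated.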
Next, it discusses non‑intrusive data‑collection approaches, evaluating options like log collection, port probing, network‑packet monitoring, and finally selecting SDK/plugin‑based instrumentation for its zero‑code‑change capability and extensibility.
The implementation details describe how plugins intercept various frameworks (e.g., Spring MVC, OkHttp) and use ThreadLocal to propagate context across synchronous and asynchronous threads, enabling request correlation without modifying business code.
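The ThreadLocal propagation described above can be sketched as follows. This is an illustrative example under stated assumptions, not the actual plugin code: `TraceContext`, `wrap`, and `traceIdSeenByWorker` are hypothetical names. A plain `ThreadLocal` is not visible to pool threads, so the sketch captures the value when the task is created and restores it on the thread that runs the task.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical holder for per-request trace context.
final class TraceContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    // Capture the submitting thread's trace ID now and install it
    // on whichever thread eventually executes the task.
    static <T> Callable<T> wrap(Callable<T> task) {
        final String captured = TRACE_ID.get();
        return () -> {
            String previous = TRACE_ID.get();
            TRACE_ID.set(captured);
            try {
                return task.call();
            } finally {
                // Restore the worker's earlier state so pooled threads do not leak context.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }

    // Demo helper: returns the trace ID that a pool thread observes.
    static String traceIdSeenByWorker(boolean wrapped) {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        try {
            Callable<String> task = TraceContext::get;
            Callable<String> toRun = wrapped ? wrap(task) : task;
            return pool.submit(toRun).get();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        } finally {
            pool.shutdown();
        }
    }
}
```

In a plugin‑based setup, the wrapping step would presumably be applied by the instrumentation when a task is handed to an executor, which is what keeps business code unchanged; in this sketch it is called by hand.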
Metric design is covered, distinguishing basic call metrics (QPS, latency, success rate) from process‑load indicators (thread pool capacity, active threads, waiting threads) and introducing a thread‑load‑rate calculation to reflect CPU usage per request.
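The summary does not give the exact thread‑load‑rate formula, so the sketch below assumes the simplest reading: active threads divided by pool capacity. The class `ThreadLoad` and its method names are hypothetical; the `ThreadPoolExecutor` accessors are standard JDK calls.

```java
import java.util.concurrent.ThreadPoolExecutor;

// Hypothetical metric helper; the formula is an assumption, not the article's definition.
final class ThreadLoad {
    // Fraction of the pool that is busy, in the range 0.0 to 1.0.
    static double loadRate(int activeThreads, int poolCapacity) {
        if (poolCapacity <= 0) {
            return 0.0; // avoid division by zero for an unconfigured pool
        }
        return (double) activeThreads / poolCapacity;
    }

    // Same calculation, sampled from a live JDK thread pool.
    static double loadRate(ThreadPoolExecutor pool) {
        return loadRate(pool.getActiveCount(), pool.getMaximumPoolSize());
    }

    // Tasks queued behind the busy threads; used here as a stand-in
    // for the "waiting" indicator mentioned in the text.
    static int waitingTasks(ThreadPoolExecutor pool) {
        return pool.getQueue().size();
    }
}
```

For example, 50 active threads in a pool of 200 gives a load rate of 0.25. Sampling this value alongside QPS and latency shows whether a slowdown coincides with the pool filling up.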
Alert aggregation and metric‑correlation techniques are presented to consolidate alarms from multiple instances and to identify root causes by linking related metric trends.
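One way to picture alert aggregation is to group alarms that share a service and metric, so that many instance‑level alarms collapse into a single notification. The sketch below assumes that grouping key; `AlertAggregator`, `Alert`, and the field names are hypothetical and do not come from the article.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical aggregator; grouping by service and metric is an assumption.
final class AlertAggregator {
    static final class Alert {
        final String service;
        final String metric;
        final String instance;

        Alert(String service, String metric, String instance) {
            this.service = service;
            this.metric = metric;
            this.instance = instance;
        }
    }

    // Returns one entry per "service/metric" key, listing the affected instances
    // in the order their alarms arrived.
    static Map<String, List<String>> aggregate(List<Alert> alerts) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (Alert alert : alerts) {
            String key = alert.service + "/" + alert.metric;
            grouped.computeIfAbsent(key, k -> new ArrayList<>()).add(alert.instance);
        }
        return grouped;
    }
}
```

A latency alarm firing on several instances of the same service then becomes one grouped entry, and the instance list indicates whether the problem is local to one host or service‑wide.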
Practical results showcase dashboards displaying request metrics, thread‑load distribution, and examples of abnormal scenarios such as sudden traffic spikes, high‑latency requests, and database connection pool bottlenecks.
A Q&A section addresses SDK applicability to front‑end, comparisons with SkyWalking and Jaeger, and how thread‑pool data is obtained from containers like Tomcat.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.