Application Monitoring Principles and Non‑Intrusive Data Collection at Huya

This article explains the fundamentals of distributed application monitoring, describes Huya's non‑intrusive data‑collection techniques using SDKs and plugins, outlines the design and correlation of observable metrics, and demonstrates practical results and troubleshooting scenarios for backend services.

DataFunSummit

The article begins with an overview of monitoring types, focusing on application monitoring and its importance for observing service calls, process execution, and business‑related metrics.

It then analyzes the principles of distributed application monitoring. The central challenge is cross-process: unless the request metrics recorded by each service are linked, every service's data stays isolated, so upstream and downstream request information must be correlated.
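One common way to correlate upstream and downstream requests is to pass a shared trace identifier in request headers. The sketch below illustrates that general idea only; the class `TraceContext`, its fields, and the header names `X-Trace-Id` and `X-Parent-Span-Id` are hypothetical and are not taken from Huya's implementation.

```java
import java.util.Map;
import java.util.UUID;

/** Hypothetical trace context; class, field, and header names are illustrative assumptions. */
final class TraceContext {
    static final String TRACE_HEADER = "X-Trace-Id";        // assumed header name
    static final String PARENT_HEADER = "X-Parent-Span-Id"; // assumed header name

    final String traceId;      // shared by every hop of one request
    final String spanId;       // identifies this hop
    final String parentSpanId; // null at the entry service

    TraceContext(String traceId, String spanId, String parentSpanId) {
        this.traceId = traceId;
        this.spanId = spanId;
        this.parentSpanId = parentSpanId;
    }

    /** Starts a new trace at the entry service. */
    static TraceContext newRoot() {
        return new TraceContext(newId(), newId(), null);
    }

    /** Upstream side: copy the identifiers into the outgoing request headers. */
    void inject(Map<String, String> headers) {
        headers.put(TRACE_HEADER, traceId);
        headers.put(PARENT_HEADER, spanId);
    }

    /** Downstream side: continue the caller's trace, or start a new one if no header is present. */
    static TraceContext extract(Map<String, String> headers) {
        String incomingTraceId = headers.get(TRACE_HEADER);
        if (incomingTraceId == null) {
            return newRoot();
        }
        return new TraceContext(incomingTraceId, newId(), headers.get(PARENT_HEADER));
    }

    private static String newId() {
        // 16 hex characters taken from a random UUID
        return UUID.randomUUID().toString().replace("-", "").substring(0, 16);
    }
}
```

With this shape, metrics reported by two services can be joined on `traceId`, and the `parentSpanId` of the downstream record points at the `spanId` of the upstream call.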

Next, it discusses non-intrusive data-collection approaches. It evaluates log collection, port probing, and network-packet monitoring before settling on SDK/plugin-based instrumentation, chosen because it requires no changes to business code and is extensible.

The implementation details describe how plugins intercept various frameworks (e.g., Spring MVC, OkHttp) and use ThreadLocal to propagate context across synchronous and asynchronous threads, enabling request correlation without modifying business code.
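A `ThreadLocal` value is visible only on the thread that set it, so a task handed to a thread pool loses the context unless it is captured at submission time. The following is a minimal sketch of that capture-and-restore pattern, assuming a single trace ID as the context; `MonitorContext` and its methods are hypothetical names, and in a plugin-based setup the wrapping would be applied by the instrumentation rather than by business code.

```java
/** Hypothetical context holder; names are illustrative, not Huya's SDK API. */
final class MonitorContext {
    private static final ThreadLocal<String> TRACE_ID = new ThreadLocal<>();

    static void set(String traceId) { TRACE_ID.set(traceId); }
    static String get() { return TRACE_ID.get(); }
    static void clear() { TRACE_ID.remove(); }

    /**
     * Captures the submitting thread's context now and installs it on the
     * worker thread for the duration of the task.
     */
    static Runnable wrap(Runnable task) {
        final String captured = TRACE_ID.get(); // read on the submitting thread
        return () -> {
            String previous = TRACE_ID.get();   // whatever the worker held before
            TRACE_ID.set(captured);
            try {
                task.run();
            } finally {
                // Restore the worker's earlier state so pooled threads do not leak context.
                if (previous == null) {
                    TRACE_ID.remove();
                } else {
                    TRACE_ID.set(previous);
                }
            }
        };
    }
}
```

A task submitted as `executor.submit(MonitorContext.wrap(task))` sees the caller's trace ID through `MonitorContext.get()`, while an unwrapped task on the same pool sees `null`.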

Metric design is covered, distinguishing basic call metrics (QPS, latency, success rate) from process‑load indicators (thread pool capacity, active threads, waiting threads) and introducing a thread‑load‑rate calculation to reflect CPU usage per request.
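The summary does not give the formula behind the thread-load rate. As a rough illustration only, the sketch below assumes it is the ratio of active threads to pool capacity and reads the three process-load indicators from a standard `java.util.concurrent.ThreadPoolExecutor`. The class `ThreadLoadSnapshot` is hypothetical, and mapping "waiting" to the number of queued tasks is likewise an assumption.

```java
import java.util.concurrent.ThreadPoolExecutor;

/** Hypothetical point-in-time view of a thread pool; the load-rate formula is an assumption. */
final class ThreadLoadSnapshot {
    final int capacity; // maximum number of threads the pool may run
    final int active;   // threads currently executing tasks (approximate)
    final int waiting;  // tasks queued for a free thread (assumed meaning of "waiting")

    ThreadLoadSnapshot(int capacity, int active, int waiting) {
        this.capacity = capacity;
        this.active = active;
        this.waiting = waiting;
    }

    /** Reads the indicators from a live pool. */
    static ThreadLoadSnapshot of(ThreadPoolExecutor pool) {
        return new ThreadLoadSnapshot(
                pool.getMaximumPoolSize(),
                pool.getActiveCount(),
                pool.getQueue().size());
    }

    /** Assumed definition: fraction of the pool's capacity that is busy, from 0.0 to 1.0. */
    double loadRate() {
        return capacity == 0 ? 0.0 : (double) active / capacity;
    }
}
```

Under this assumed definition, a pool with a capacity of 200 and 50 active threads has a load rate of 0.25.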

Alert aggregation and metric‑correlation techniques are presented to consolidate alarms from multiple instances and to identify root causes by linking related metric trends.
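To make the aggregation idea concrete, here is a minimal sketch that groups per-instance alarms by service and metric, so that many instances failing in the same way produce one notification. The types `InstanceAlert` and `AlertAggregator`, and the summary format, are hypothetical; the talk's actual grouping keys and time windows are not given in this summary.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical alarm raised by one instance of a service. */
final class InstanceAlert {
    final String service;
    final String instance;
    final String metric;

    InstanceAlert(String service, String instance, String metric) {
        this.service = service;
        this.instance = instance;
        this.metric = metric;
    }
}

/** Hypothetical aggregator: one summary line per (service, metric) pair. */
final class AlertAggregator {
    static List<String> aggregate(List<InstanceAlert> alerts) {
        // LinkedHashMap keeps groups in first-seen order.
        Map<String, List<String>> instancesByGroup = new LinkedHashMap<>();
        for (InstanceAlert alert : alerts) {
            String groupKey = alert.service + "/" + alert.metric;
            instancesByGroup.computeIfAbsent(groupKey, k -> new ArrayList<>()).add(alert.instance);
        }
        List<String> summaries = new ArrayList<>();
        for (Map.Entry<String, List<String>> group : instancesByGroup.entrySet()) {
            summaries.add(group.getKey() + " firing on " + group.getValue().size()
                    + " instance(s): " + group.getValue());
        }
        return summaries;
    }
}
```

Grouping on the pair rather than on the instance is the design choice that collapses an incident affecting a whole service into a single message.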

Practical results showcase dashboards displaying request metrics, thread‑load distribution, and examples of abnormal scenarios such as sudden traffic spikes, high‑latency requests, and database connection pool bottlenecks.

A Q&A section addresses SDK applicability to front‑end, comparisons with SkyWalking and Jaeger, and how thread‑pool data is obtained from containers like Tomcat.

Tags: Observability, SRE, Distributed Tracing, application monitoring, Metrics Design, non-intrusive data collection
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
