How a Unified Metadata Platform Boosts SRE Efficiency and Cuts Costs
This article describes how Huya built a unified metadata platform to break data silos across its SRE systems, enabling standardized data ingestion, correlation, and analysis that improve resource governance, root‑cause diagnosis, and overall cost‑efficiency for large‑scale live streaming services.
Kuang Lingxuan, head of the SRE observability platform at Huya Live, leads the design and implementation of a unified metadata platform that integrates resource delivery, containerization, build and release, monitoring, and alerting systems.
Project Background
Pain Points
Separate SRE systems created severe data silos with no unified metadata model, hindering data understanding and usage.
Business cost control became difficult due to lack of insight into resource and cross‑region traffic usage.
Root‑cause analysis was hampered by missing correlations among monitoring metrics, traces, and alerts.
Key Insight
Horizontal linkage: connect application‑to‑application call relationships.
Vertical linkage: connect applications to the resources they consume.
Combined, they form a comprehensive metadata association network.
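The two kinds of linkage can be pictured as typed edges in a single directed graph. The following is a minimal sketch, not the platform's actual schema: the class `MetadataGraph` and the edge names `calls` and `runs_on` are assumptions made for illustration, and the service names are borrowed from the examples used later in this article.

```python
from collections import defaultdict

# Edge kinds are illustrative names, not the platform's real schema.
CALLS = "calls"      # horizontal: application -> application
RUNS_ON = "runs_on"  # vertical: application -> resource it consumes


class MetadataGraph:
    """Tiny in-memory directed graph with typed edges."""

    def __init__(self):
        # (source vertex, edge kind) -> list of target vertices
        self._out = defaultdict(list)

    def add_edge(self, src, kind, dst):
        self._out[(src, kind)].append(dst)

    def neighbors(self, src, kind):
        # .get() avoids creating empty entries for unknown vertices.
        return list(self._out.get((src, kind), []))


graph = MetadataGraph()
graph.add_edge("GiftServer", CALLS, "AuthServer")
graph.add_edge("GiftServer", CALLS, "MoneyServer")
graph.add_edge("GiftServer", RUNS_ON, "container-192.168.1.1")

print(graph.neighbors("GiftServer", CALLS))
print(graph.neighbors("GiftServer", RUNS_ON))
```

Keeping both edge kinds in one structure is what lets a single query move sideways along call relationships and then down to the resources behind a service.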
Metadata Types
Application services: service name, IP/Port, API, dependencies, frameworks, code repo.
Monitoring metrics: CPU, memory, network utilization, request volume, latency, error rates.
Infrastructure: containers, data centers, domains, network types, resource usage.
Middleware: databases, caches, message queues, real‑time and batch compute.
Solution Practice
Design Thinking
Use application services as the core of metadata association and build a unified metadata network.
Metadata Network Overview
a) Trace analysis generates client‑to‑service call chains, e.g. Huya App → GiftServer → AuthServer / MoneyServer.
b) Deployment data links services to resources, e.g. GiftServer → container(192.168.1.1) → physical machine → Guangzhou data center.
c) Monitoring metrics are correlated across business, application, and infrastructure layers.
d) Service‑to‑middleware links, e.g. MoneyServer → MySQL/Redis/Kafka and their host machines.
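The example links in points a), b), and d) can be written down as edge triples and walked programmatically. This is a sketch under stated assumptions: the relation names (`calls`, `deployed_on`, `hosted_on`, `located_in`, `uses`) and the identifier `physical-machine-1` are hypothetical, and the real platform stores these relationships in a graph database rather than a Python list.

```python
from collections import deque

# (source, relation, target) triples taken from the examples in the text.
# Relation names and "physical-machine-1" are hypothetical.
EDGES = [
    ("Huya App", "calls", "GiftServer"),
    ("GiftServer", "calls", "AuthServer"),
    ("GiftServer", "calls", "MoneyServer"),
    ("GiftServer", "deployed_on", "container(192.168.1.1)"),
    ("container(192.168.1.1)", "hosted_on", "physical-machine-1"),
    ("physical-machine-1", "located_in", "Guangzhou data center"),
    ("MoneyServer", "uses", "MySQL"),
    ("MoneyServer", "uses", "Redis"),
    ("MoneyServer", "uses", "Kafka"),
]


def targets(node, relation):
    """Return all targets reachable from node over one edge of this relation."""
    return [dst for src, rel, dst in EDGES if src == node and rel == relation]


def call_chain(entry):
    """Breadth-first walk over 'calls' edges, in visit order."""
    seen, order, queue = {entry}, [entry], deque([entry])
    while queue:
        for callee in targets(queue.popleft(), "calls"):
            if callee not in seen:
                seen.add(callee)
                order.append(callee)
                queue.append(callee)
    return order


def resource_path(service):
    """Follow the vertical links from a service down to its data center."""
    path = [service]
    for relation in ("deployed_on", "hosted_on", "located_in"):
        nxt = targets(path[-1], relation)
        if not nxt:
            break  # no further deployment data recorded for this node
        path.append(nxt[0])
    return path


print(call_chain("Huya App"))
print(resource_path("GiftServer"))
print(targets("MoneyServer", "uses"))
```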
Design Summary
Define metadata ingestion standards and association models.
Connect applications, resources, and middleware across the entire network.
Provide visualization, search, and analysis capabilities.
Metadata Architecture
a) Output: web UI for visualizing metadata and an open platform for data access.
b) Coverage: the Meta Hub platform ingests metadata from all SRE systems.
c) Core modules include data conversion, association storage in a graph DB, SDK/OpenAPI/Gremlin for queries, and resource replay for usage statistics.
Graph DB stores the vertex/edge model of the metadata network; OLAP DB keeps multi‑dimensional snapshots for large‑scale analysis.
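To make the data conversion step concrete, the sketch below turns deployment records into vertices and edges and deduplicates them before they would be written to a graph store. The record fields (`service`, `container_ip`, `host`, `idc`), the labels, and the relation names are all assumptions for illustration; the article does not describe Meta Hub's actual ingestion format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    label: str  # e.g. "service", "container", "host", "data_center"
    key: str    # identifier that is unique within the label


@dataclass(frozen=True)
class Edge:
    src: Vertex
    relation: str
    dst: Vertex


def convert_deployment(record):
    """Map one deployment record (hypothetical shape) to vertices and edges."""
    service = Vertex("service", record["service"])
    container = Vertex("container", record["container_ip"])
    host = Vertex("host", record["host"])
    idc = Vertex("data_center", record["idc"])
    edges = [
        Edge(service, "deployed_on", container),
        Edge(container, "hosted_on", host),
        Edge(host, "located_in", idc),
    ]
    return [service, container, host, idc], edges


def merge(batches):
    """Deduplicate across batches; frozen dataclasses are hashable."""
    vertices, edges = set(), set()
    for batch_vertices, batch_edges in batches:
        vertices.update(batch_vertices)
        edges.update(batch_edges)
    return vertices, edges


records = [
    {"service": "GiftServer", "container_ip": "192.168.1.1",
     "host": "host-1", "idc": "Guangzhou"},
    {"service": "MoneyServer", "container_ip": "192.168.1.2",
     "host": "host-1", "idc": "Guangzhou"},
]
vertices, edges = merge(convert_deployment(r) for r in records)
print(len(vertices), "vertices,", len(edges), "edges")
```

Because both records share a host and data center, those vertices and the edge between them appear only once after merging, which is the property a graph model needs when many source systems describe the same resource.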
Application Scenarios
Multi‑Dimensional Resource Analysis
Shows historical resource usage and utilization trends for each application service, making it possible to check whether allocations are reasonable and to govern them accordingly.
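As a rough illustration of this kind of analysis, the sketch below averages CPU utilization per service from usage snapshots and flags services that fall under a threshold. The sample numbers, the snapshot layout, and the 30% threshold are invented for the example and do not come from Huya's data.

```python
from statistics import mean

# (service, day, cpu_cores_used, cpu_cores_allocated); made-up sample data.
SNAPSHOTS = [
    ("GiftServer", "day-1", 6.0, 8.0),
    ("GiftServer", "day-2", 7.0, 8.0),
    ("AuthServer", "day-1", 1.0, 8.0),
    ("AuthServer", "day-2", 2.0, 8.0),
]


def average_utilization(snapshots):
    """Average used/allocated ratio per service across all snapshots."""
    ratios = {}
    for service, _day, used, allocated in snapshots:
        ratios.setdefault(service, []).append(used / allocated)
    return {service: mean(values) for service, values in ratios.items()}


def over_provisioned(snapshots, threshold=0.3):
    """Services whose average utilization is below the threshold."""
    utilization = average_utilization(snapshots)
    return sorted(s for s, u in utilization.items() if u < threshold)


print(average_utilization(SNAPSHOTS))
print(over_provisioned(SNAPSHOTS))
```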
Cross‑Data‑Center Traffic Governance
Detects and visualizes cross‑region calls, pinpointing which services, instances, and interfaces cause inter‑data‑center traffic.
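One way to find such traffic, assuming each call record can be joined to the data center of its caller and callee instances, is sketched below. The instance names, the second data center ("Shanghai"), the interfaces, and the byte counts are hypothetical sample values.

```python
from collections import Counter

# instance -> (service, data center); hypothetical sample data.
INSTANCES = {
    "gift-1": ("GiftServer", "Guangzhou"),
    "money-1": ("MoneyServer", "Guangzhou"),
    "money-2": ("MoneyServer", "Shanghai"),
}

# (caller instance, callee instance, interface, bytes transferred)
CALLS = [
    ("gift-1", "money-1", "/pay", 100),
    ("gift-1", "money-2", "/pay", 300),
    ("gift-1", "money-2", "/balance", 50),
]


def cross_dc_traffic(calls, instances):
    """Sum bytes per (caller service, callee service, interface, src DC, dst DC)
    for calls whose two ends sit in different data centers."""
    traffic = Counter()
    for caller, callee, interface, size in calls:
        src_service, src_dc = instances[caller]
        dst_service, dst_dc = instances[callee]
        if src_dc != dst_dc:
            traffic[(src_service, dst_service, interface, src_dc, dst_dc)] += size
    return traffic


for key, size in cross_dc_traffic(CALLS, INSTANCES).most_common():
    print(key, size)
```

The call that stays inside Guangzhou is ignored; only the two interfaces that cross data centers show up, ranked by volume.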
Multi‑Tag Classification
Implements hierarchical tags stored in a graph model, generated from trace links and AIOps‑derived application portraits, enabling flexible queries.
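A hierarchical tag query can be sketched as a walk up a parent map: a service matches a tag if it carries that tag or any of its descendants. The tag names and the service-to-tag assignments below are invented for illustration; the article does not list the actual tag hierarchy.

```python
# child tag -> parent tag; hypothetical hierarchy.
TAG_PARENT = {
    "payment": "gift",
    "gift": "live-business",
    "auth": "platform",
}

# service -> tags attached directly to it; hypothetical assignments.
SERVICE_TAGS = {
    "MoneyServer": {"payment"},
    "GiftServer": {"gift"},
    "AuthServer": {"auth"},
}


def tag_lineage(tag):
    """Return the tag followed by all of its ancestors, nearest first."""
    lineage = []
    # The membership check also guards against accidental cycles.
    while tag is not None and tag not in lineage:
        lineage.append(tag)
        tag = TAG_PARENT.get(tag)
    return lineage


def services_with_tag(tag):
    """Services tagged with this tag or with any tag beneath it."""
    return sorted(
        service
        for service, tags in SERVICE_TAGS.items()
        if any(tag in tag_lineage(t) for t in tags)
    )


print(services_with_tag("gift"))
print(services_with_tag("platform"))
```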
Full‑Link Root‑Cause Localization
Combines business, application, and infrastructure metrics with resource relationships to locate root causes, e.g., diagnosing low gift‑sending success rates.
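The sketch below shows one simple heuristic for this kind of localization, not necessarily the method Huya uses: starting from the service that shows the symptom, follow only anomalous dependencies and report the anomalous nodes that have no anomalous dependency of their own. The dependency map reuses the article's example services; the set of anomalous nodes is assumed to come from monitoring.

```python
# service -> things it depends on (other services and middleware).
DEPENDS_ON = {
    "GiftServer": ["AuthServer", "MoneyServer"],
    "MoneyServer": ["MySQL", "Redis", "Kafka"],
}


def root_cause_candidates(entry, anomalous):
    """Anomalous nodes reachable from entry whose own dependencies look healthy."""
    candidates, seen, stack = [], set(), [entry]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        bad_deps = [d for d in DEPENDS_ON.get(node, []) if d in anomalous]
        if node in anomalous and not bad_deps:
            # Nothing further down explains the anomaly, so stop here.
            candidates.append(node)
        stack.extend(bad_deps)  # only descend along anomalous dependencies
    return sorted(candidates)


# Example: a low gift-sending success rate surfaces at GiftServer.
print(root_cause_candidates("GiftServer", {"GiftServer", "MoneyServer", "MySQL"}))
```

In the example, GiftServer and MoneyServer are both anomalous, but each has an anomalous dependency beneath it, so the walk ends at MySQL as the only candidate.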
Future Outlook
The next step is to extend the platform across the entire DevOps lifecycle, from code repository through build, release, and runtime, so that metadata can assist with security patch tracking (e.g., Log4j) and change‑impact analysis.
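Both use cases reduce to graph lookups once build-time metadata is linked to runtime metadata. The sketch below assumes a hypothetical mapping from services to the libraries they are built with, plus the call relationships from the earlier example; the library assignments are invented and do not describe Huya's services.

```python
# service -> libraries it is built with; hypothetical dependency data.
SERVICE_LIBS = {
    "GiftServer": {"log4j-core"},
    "AuthServer": {"logback-classic"},
    "MoneyServer": {"log4j-core"},
}

# caller -> callees, taken from the article's example call chain.
CALLS = {
    "Huya App": ["GiftServer"],
    "GiftServer": ["AuthServer", "MoneyServer"],
}


def services_using(library):
    """Patch tracking: which services ship with this library?"""
    return sorted(s for s, libs in SERVICE_LIBS.items() if library in libs)


def upstream_callers(service):
    """Change impact: every caller that reaches this service, directly or not."""
    impacted, frontier = set(), [service]
    while frontier:
        current = frontier.pop()
        for caller, callees in CALLS.items():
            if current in callees and caller not in impacted:
                impacted.add(caller)
                frontier.append(caller)
    return sorted(impacted)


print(services_using("log4j-core"))
print(upstream_callers("MoneyServer"))
```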
Huya Tech Engineering
Official Huya Tech account. Technical insights, engineering practice, and frontier innovation all in one place.