Building an Observability System: Practices and Solutions from Yanhuang Data
This article explains how to build a robust observability system for cloud‑native microservice architectures. It covers the three core signals (metrics, traces, and logs), common challenges such as complexity and data silos, and Yanhuang Data’s integrated platform, which unifies data collection, storage, analysis, and visualization.
Observability has become a hot topic because a highly observable system can improve production efficiency, product quality, and user satisfaction. It matters all the more as container‑based, cloud‑native microservice architectures make systems more complex and faults harder to diagnose.
The article is organized into four parts: (1) how to build an observability system, (2) pain points enterprises face when constructing such systems, (3) an introduction to Yanhuang Data and its products, and (4) the practice of building observability for the YHP platform itself.
Observability relies on three primary signals—Metrics, Traces, and Logs—often called the “three pillars.” While all three are useful, the choice of signals should be driven by actual user needs; for some scenarios, Metrics plus Logs may be sufficient.
In container‑based, cloud‑native microservice environments, all three signals are necessary: Metrics monitor overall health, Traces locate the exact failure point, and Logs record what happened. The article also mentions additional signals such as dumps, profiles, and events; dumps, for example, provide core‑dump information useful for deep debugging.
The article outlines a three‑step approach to building observability: clarify requirements, locate data sources, and select appropriate tools. Prometheus is recommended for Metrics, OpenTelemetry for Traces, and log tooling such as Fluentd (collection) and Elasticsearch (storage and search) for Logs. Tool selection should consider the specific data type and usage scenario.
Common enterprise pain points include high system complexity, data silos, and poor user experience. Complexity arises from needing multiple data types and tools, leading to higher development and operational costs. Data silos prevent seamless correlation of Metrics, Traces, and Logs, reducing the value of observability data.
To address these challenges, Yanhuang Data proposes an integrated observability platform that unifies data collection, storage, and analysis. The platform uses OpenTelemetry for unified data ingestion, supports both the DaemonSet and Deployment models in Kubernetes, and stores each signal type in its own dataset to optimize performance and lifecycle management.
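The idea of keeping each signal type in its own dataset can be sketched as a small routing step. This is only an illustration: the dataset names, retention periods, and the `signal` field are hypothetical and are not taken from the YHP platform.

```python
from collections import defaultdict

# Hypothetical dataset names and retention periods, chosen only for illustration.
DATASETS = {
    "metric": {"dataset": "obs_metrics", "retention_days": 30},
    "trace": {"dataset": "obs_traces", "retention_days": 7},
    "log": {"dataset": "obs_logs", "retention_days": 14},
}


def route(records):
    """Group incoming records by the dataset that matches their signal type."""
    batches = defaultdict(list)
    for record in records:
        signal = record.get("signal")
        if signal not in DATASETS:
            # Unknown signal types are set aside rather than silently dropped.
            batches["obs_unrouted"].append(record)
            continue
        batches[DATASETS[signal]["dataset"]].append(record)
    return dict(batches)


incoming = [
    {"signal": "metric", "name": "http_requests_total", "value": 1027},
    {"signal": "trace", "trace_id": "abc123", "span": "GET /orders"},
    {"signal": "log", "trace_id": "abc123", "message": "order lookup failed"},
    {"signal": "profile", "name": "cpu"},
]
routed = route(incoming)
for dataset, batch in routed.items():
    print(dataset, len(batch))
```

Separating the datasets this way lets each signal type carry its own retention setting, which is one plausible reading of the lifecycle management the article mentions.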
The YHP platform’s architecture includes a cloud‑native microservice stack orchestrated by Kubernetes, a decoupled compute‑storage engine, RESTful APIs, and rich visualization components such as dashboards, alerts, and ad‑hoc queries. It supports mixed modeling (read‑time and write‑time), standard SQL queries across all signal types, and provides a unified user interface for querying, visualizing, and alerting.
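To make the value of standard SQL across signal types concrete, here is a minimal sketch that joins trace and log records on a shared trace ID. SQLite from the Python standard library stands in for the query engine, and the table names, column names, and sample rows are all hypothetical; none of this reflects YHP's actual schema or SQL dialect.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in query engine.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE traces (trace_id TEXT, service TEXT, duration_ms REAL, status TEXT);
CREATE TABLE logs (trace_id TEXT, level TEXT, message TEXT);
INSERT INTO traces VALUES
  ('t1', 'checkout', 120.0, 'OK'),
  ('t2', 'checkout', 2300.0, 'ERROR'),
  ('t3', 'inventory', 45.0, 'OK');
INSERT INTO logs VALUES
  ('t1', 'INFO', 'order placed'),
  ('t2', 'ERROR', 'payment gateway timeout'),
  ('t2', 'WARN', 'retrying payment'),
  ('t3', 'INFO', 'stock checked');
""")

# Find failed traces and pull the log lines written during each of them.
QUERY = """
SELECT t.trace_id, t.service, t.duration_ms, l.level, l.message
FROM traces AS t
JOIN logs AS l ON l.trace_id = t.trace_id
WHERE t.status = 'ERROR'
ORDER BY l.level
"""
rows = conn.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

A single join like this is exactly what data silos make difficult: when traces and logs live in separate tools, the same question requires copying an ID from one interface into another.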
Specific features highlighted are: a self‑developed data collector (DataScale) that gathers Metrics, Traces, and Logs; metadata enrichment at the collection stage; unified storage that enables fast cross‑signal queries; and visualization dashboards that combine metrics, traces, and logs for comprehensive health monitoring.
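Metadata enrichment at the collection stage can be pictured as attaching the same context fields to every record before it is stored. The sketch below is not DataScale's implementation: the environment variable names (`POD_NAMESPACE`, `POD_NAME`, `NODE_NAME`) and the `k8s.*` field names are assumptions, of the kind a pod might receive through the Kubernetes Downward API.

```python
import os


def k8s_metadata(environ=os.environ):
    """Read pod metadata from environment variables (hypothetical names)."""
    return {
        "k8s.namespace": environ.get("POD_NAMESPACE", "unknown"),
        "k8s.pod": environ.get("POD_NAME", "unknown"),
        "k8s.node": environ.get("NODE_NAME", "unknown"),
    }


def enrich(record, metadata):
    """Return a copy of the record with metadata added; record fields win on conflict."""
    enriched = dict(metadata)
    enriched.update(record)
    return enriched


# A fake environment keeps the example reproducible outside a cluster.
fake_env = {"POD_NAMESPACE": "shop", "POD_NAME": "checkout-7d9f", "NODE_NAME": "node-2"}
meta = k8s_metadata(fake_env)

log_record = enrich({"signal": "log", "message": "payment gateway timeout"}, meta)
metric_record = enrich({"signal": "metric", "name": "http_requests_total", "value": 1027}, meta)
print(log_record)
print(metric_record)
```

Because every signal carries the same fields, a later query can filter or join Metrics, Traces, and Logs by pod or namespace without extra lookups.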
The platform also handles dump data by storing core‑dump files on a shared Kubernetes PV, enriching them with metadata, and visualizing dump statistics and backtraces in dashboards.
Kubernetes health monitoring is achieved by integrating the Kuberhealthy tool, exporting health check results via Prometheus, and visualizing them in the platform’s dashboards.
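As a rough illustration of consuming health check results exported in the Prometheus text format, the sketch below parses a few sample lines and lists the checks that report failure. The sample payload is invented, the metric name `kuberhealthy_check` and its labels are assumptions about what Kuberhealthy exposes, and the parser is deliberately simplified (it does not unescape label values).

```python
import re

# One sample line: metric name, optional {labels}, value, optional timestamp.
LINE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'
    r'(?:\{(?P<labels>.*)\})?'
    r'\s+(?P<value>\S+)(?:\s+-?\d+)?$'
)
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def parse_samples(text):
    """Parse exposition text into (name, labels, value) tuples, skipping comments."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = LINE_RE.match(line)
        if match is None:
            continue  # Ignore lines this simplified parser cannot read.
        labels = dict(LABEL_RE.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


def failing_checks(samples, metric="kuberhealthy_check"):
    """Return the names of checks whose sample value is 0 (assumed to mean failing)."""
    return [labels.get("check", "?") for name, labels, value in samples
            if name == metric and value == 0]


# Invented sample payload; real output may differ in names and labels.
SAMPLE = '''
# TYPE kuberhealthy_check gauge
kuberhealthy_check{check="kuberhealthy/deployment",namespace="kuberhealthy"} 1
kuberhealthy_check{check="kuberhealthy/dns-status-internal",namespace="kuberhealthy"} 0
'''
samples = parse_samples(SAMPLE)
failing = failing_checks(samples)
print(failing)
```

In practice Prometheus scrapes and stores these samples itself; the point here is only to show what the exported health data looks like before a dashboard renders it.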
Overall, the integrated observability platform reduces system complexity, lowers operational costs, and improves user experience by providing a single pane of glass for all observability data.
The latest version of the Yanhuang Data platform (YHP) is 2.11, released on August 5, 2023, with both community (free) and enterprise editions available.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.