
Full‑Chain Monitoring and Trace System at Huolala: Evolution, Architecture, and Visualization

This article details how Huolala built a comprehensive full‑chain monitoring and tracing platform, covering the historical evolution of observability tools, the company’s multi‑stage monitoring architecture, bytecode‑enhanced instrumentation, trace sampling strategies, and a "what‑you‑see‑is‑what‑you‑get" visualization approach.

DataFunSummit

As business complexity and transaction volume increase, early detection of production issues and rapid troubleshooting become critical; this talk shares how Huolala implements a full‑chain observability solution.

The evolution of monitoring in the internet industry is traced from eBay's CAL (2002) and Google's Dapper (2010) to open-source projects such as CAT, Zipkin, EagleEye, Pinpoint, SkyWalking, and Uber's Jaeger, highlighting key milestones and lessons.

Huolala’s own monitoring journey is divided into four phases:

- 1.0: isolated Prometheus instances per team, with low efficiency;
- 2.0: standardized metrics, zero-code bytecode-enhanced instrumentation, and the introduction of a unified trace service;
- 3.0: deep iteration on metrics, trace, and log integration, reducing storage costs by 60%;
- 3.x: replacement of HBase with a custom KV store, cutting storage and compute costs by 90% and enabling complete trace sampling.

The current architecture combines Prometheus for metric collection, VictoriaMetrics as the time-series store, and a customized SkyWalking-based trace service; bytecode-enhanced instrumentation covers client, server, infrastructure, and database components, while Grafana dashboards provide real-time visualization.

For instrumentation, three bytecode‑enhancement frameworks (ASM, Javassist, ByteBuddy) are compared; Huolala selected ByteBuddy for its ease of use, and the instrumentation is applied via a Java Agent that transforms class bytes before they are loaded into the JVM.
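The agent mechanism described here can be sketched with the JDK's built-in `java.lang.instrument` API alone; ByteBuddy's `AgentBuilder` is a convenience layer over exactly this hook. The package filter below is a hypothetical stand-in, not Huolala's actual matching rule:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Minimal sketch of a Java Agent entry point. The JVM calls premain()
// before main(); the registered transformer then sees every class's
// bytes before that class is loaded.
public class TraceAgent {
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new TimingTransformer());
    }

    static class TimingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> redefined, ProtectionDomain domain,
                                byte[] classBytes) {
            // Only touch classes under a (hypothetical) business package;
            // returning null tells the JVM to keep the original bytes.
            if (className == null || !className.startsWith("com/example/biz/")) {
                return null;
            }
            // A real agent would rewrite classBytes here (via ByteBuddy
            // or ASM), wrapping target methods with span start/stop calls.
            return classBytes;
        }
    }
}
```

The agent is attached with `-javaagent:trace-agent.jar`, which is what makes the instrumentation "zero-code" for business teams: no application source changes are required.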

Trace architecture evolved through three designs: 1.0 used native SkyWalking with Elasticsearch storage; 2.0 decoupled collection from consumption via Kafka and stored traces in HBase, with ES as an index; 3.0 stores traces in a custom KV store and introduces refined sampling (conventional ID-based sampling, span-level threshold sampling, and complete sampling using delayed Kafka consumption combined with a Redis-backed Bloom filter).
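The first two sampling strategies are cheap, local decisions. An illustrative sketch (not Huolala's actual code; the CRC32 hash, rate, and threshold are assumptions) might look like this, with the complete-sampling path additionally consulting the Redis-backed Bloom filter during delayed Kafka consumption:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sampler combining ID-based and span-level threshold
// sampling, as described above.
public class Sampler {
    private final int rate;        // keep roughly 1 out of `rate` traces
    private final long slowMillis; // span-level latency threshold

    public Sampler(int rate, long slowMillis) {
        this.rate = rate;
        this.slowMillis = slowMillis;
    }

    // Conventional ID-based sampling: hash the trace ID so every
    // service in the chain reaches the same keep/drop decision.
    public boolean sampleById(String traceId) {
        CRC32 crc = new CRC32();
        crc.update(traceId.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % rate == 0;
    }

    // Span-level threshold sampling: always keep slow spans,
    // even when the ID-based decision would drop the trace.
    public boolean sampleSpan(String traceId, long durationMillis) {
        return durationMillis >= slowMillis || sampleById(traceId);
    }
}
```

Hashing the trace ID (rather than sampling randomly per service) matters because it keeps decisions consistent across the whole call chain, so a sampled trace is never missing intermediate hops.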

The visualization layer follows a "what‑you‑see‑is‑what‑you‑get" philosophy: metric dashboards allow clicking on QPS/RT curves to open detailed trace pages, and trace IDs link to associated logs; business identifiers (order ID, user ID, etc.) are embedded as trace tags, enabling end‑to‑end correlation across metrics, traces, and logs.
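The correlation convention can be illustrated with a small hypothetical sketch (the class and method names are illustrative, not Huolala's API): business identifiers become span tags, and the trace ID is stamped into every log line, so metric-to-trace-to-log navigation reduces to a key lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical trace context: carries the trace ID plus business
// identifiers (order ID, user ID, ...) attached as span tags.
public class TraceContext {
    final String traceId;
    final Map<String, String> tags = new LinkedHashMap<>();

    TraceContext(String traceId) {
        this.traceId = traceId;
    }

    // Attach a business identifier as a span tag.
    TraceContext tag(String key, String value) {
        tags.put(key, value);
        return this;
    }

    // Render the log prefix shared by all lines of this trace; a trace
    // page can link to logs by searching for the same tid value.
    String logPrefix() {
        String kv = tags.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return "[tid=" + traceId + (kv.isEmpty() ? "" : " " + kv) + "]";
    }
}
```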

In summary, the integrated monitoring system dramatically improves platform stability, accelerates fault isolation, and supports micro‑service governance; future work includes further custom KV and time‑series storage development, root‑cause analysis automation, and tighter metric‑alert integration, while continuing to give back to the open‑source community.

Tags: monitoring, microservices, observability, Prometheus, tracing, bytecode-instrumentation, SkyWalking
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
