
Full‑Chain Monitoring and Trace System at Huolala: Evolution, Architecture, and Visualization

This article details how Huolala built a comprehensive full‑chain monitoring and tracing platform, covering the historical evolution of observability tools, the company’s multi‑stage monitoring architecture, bytecode‑enhanced instrumentation, trace sampling strategies, and a "what‑you‑see‑is‑what‑you‑get" visualization approach.

DataFunSummit

As business complexity and transaction volume increase, early detection of production issues and rapid troubleshooting become critical; this talk shares how Huolala implements a full‑chain observability solution.

The evolution of monitoring in the internet industry is traced from eBay's CAL (2002) and Google's Dapper (2010) to open-source projects such as CAT, Zipkin, EagleEye, Pinpoint, SkyWalking, and Uber's Jaeger, highlighting key milestones and lessons.

Huolala’s own monitoring journey is divided into four phases:

- 1.0: isolated Prometheus instances per team, with low efficiency;
- 2.0: standardized metrics, zero-code bytecode-enhanced instrumentation, and the introduction of a unified trace service;
- 3.0: deep iteration on metrics, trace, and log integration, reducing storage costs by 60%;
- 3.x: replacement of HBase with a custom KV store, cutting storage and compute costs by 90% and enabling complete trace sampling.

The current architecture combines Prometheus for metric collection, VictoriaMetrics as the time-series store, and a customized SkyWalking-based trace service; bytecode-enhanced instrumentation covers client, server, infrastructure, and database components, while Grafana dashboards provide real-time visualization.

For instrumentation, three bytecode‑enhancement frameworks (ASM, Javassist, ByteBuddy) are compared; Huolala selected ByteBuddy for its ease of use, and the instrumentation is applied via a Java Agent that transforms class bytes before they are loaded into the JVM.
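The agent mechanism described here can be sketched with the JDK's built-in `java.lang.instrument` API alone; ByteBuddy's `AgentBuilder` is a convenience layer over exactly this hook. The package filter below is a hypothetical stand-in, not Huolala's actual matching rule:

```java
import java.lang.instrument.ClassFileTransformer;
import java.lang.instrument.Instrumentation;
import java.security.ProtectionDomain;

// Minimal sketch of a Java Agent entry point. The JVM calls premain()
// before main(); the registered transformer then sees every class's
// bytes before that class is loaded.
public class TraceAgent {
    public static void premain(String args, Instrumentation inst) {
        inst.addTransformer(new TimingTransformer());
    }

    static class TimingTransformer implements ClassFileTransformer {
        @Override
        public byte[] transform(ClassLoader loader, String className,
                                Class<?> redefined, ProtectionDomain domain,
                                byte[] classBytes) {
            // Only touch classes under a (hypothetical) business package;
            // returning null tells the JVM to keep the original bytes.
            if (className == null || !className.startsWith("com/example/biz/")) {
                return null;
            }
            // A real agent would rewrite classBytes here (via ByteBuddy
            // or ASM), wrapping target methods with span start/stop calls.
            return classBytes;
        }
    }
}
```

The agent is attached with `-javaagent:trace-agent.jar`, which is what makes the instrumentation "zero-code" for business teams: no application source changes are required.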

Trace architecture evolved through three designs: 1.0 used native SkyWalking with Elasticsearch storage; 2.0 decoupled collection from consumption via Kafka and stored traces in HBase, with ES as an index; 3.0 stores traces in a custom KV store and introduces refined sampling (conventional ID-based sampling, span-level threshold sampling, and complete sampling using delayed Kafka consumption combined with a Redis-backed Bloom filter).
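The first two sampling strategies are cheap, local decisions. An illustrative sketch (not Huolala's actual code; the CRC32 hash, rate, and threshold are assumptions) might look like this, with the complete-sampling path additionally consulting the Redis-backed Bloom filter during delayed Kafka consumption:

```java
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

// Illustrative sampler combining ID-based and span-level threshold
// sampling, as described above.
public class Sampler {
    private final int rate;        // keep roughly 1 out of `rate` traces
    private final long slowMillis; // span-level latency threshold

    public Sampler(int rate, long slowMillis) {
        this.rate = rate;
        this.slowMillis = slowMillis;
    }

    // Conventional ID-based sampling: hash the trace ID so every
    // service in the chain reaches the same keep/drop decision.
    public boolean sampleById(String traceId) {
        CRC32 crc = new CRC32();
        crc.update(traceId.getBytes(StandardCharsets.UTF_8));
        return crc.getValue() % rate == 0;
    }

    // Span-level threshold sampling: always keep slow spans,
    // even when the ID-based decision would drop the trace.
    public boolean sampleSpan(String traceId, long durationMillis) {
        return durationMillis >= slowMillis || sampleById(traceId);
    }
}
```

Hashing the trace ID (rather than sampling randomly per service) matters because it keeps decisions consistent across the whole call chain, so a sampled trace is never missing intermediate hops.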

The visualization layer follows a "what‑you‑see‑is‑what‑you‑get" philosophy: metric dashboards allow clicking on QPS/RT curves to open detailed trace pages, and trace IDs link to associated logs; business identifiers (order ID, user ID, etc.) are embedded as trace tags, enabling end‑to‑end correlation across metrics, traces, and logs.
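The correlation convention can be illustrated with a small hypothetical sketch (the class and method names are illustrative, not Huolala's API): business identifiers become span tags, and the trace ID is stamped into every log line, so metric-to-trace-to-log navigation reduces to a key lookup.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical trace context: carries the trace ID plus business
// identifiers (order ID, user ID, ...) attached as span tags.
public class TraceContext {
    final String traceId;
    final Map<String, String> tags = new LinkedHashMap<>();

    TraceContext(String traceId) {
        this.traceId = traceId;
    }

    // Attach a business identifier as a span tag.
    TraceContext tag(String key, String value) {
        tags.put(key, value);
        return this;
    }

    // Render the log prefix shared by all lines of this trace; a trace
    // page can link to logs by searching for the same tid value.
    String logPrefix() {
        String kv = tags.entrySet().stream()
                .map(e -> e.getKey() + "=" + e.getValue())
                .collect(Collectors.joining(","));
        return "[tid=" + traceId + (kv.isEmpty() ? "" : " " + kv) + "]";
    }
}
```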

In summary, the integrated monitoring system dramatically improves platform stability, accelerates fault isolation, and supports micro‑service governance; future work includes further custom KV and time‑series storage development, root‑cause analysis automation, and tighter metric‑alert integration, while continuing to give back to the open‑source community.

Tags: monitoring, microservices, observability, Prometheus, tracing, bytecode-instrumentation, SkyWalking
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
