
Heimdall Exception Statistics System: Architecture, Implementation, and Practice

This article describes the design, implementation, and evolution of Heimdall, an exception‑statistics platform built on Kafka, Flink, and HBase that provides minute‑level anomaly aggregation, stack trace querying, and integration with release and alerting workflows to improve service reliability across thousands of micro‑services.


Background: As micro‑service granularity increased at Qunar, the complexity of inter‑service calls made fault isolation difficult. Existing monitoring (Watcher) could alert but could not pinpoint root causes quickly. To shorten MTTR, an automated exception‑statistics system, Heimdall, was created to collect, aggregate, and present exception data in near real time.

Stage 1 – Real‑time Log Collection: The initial architecture used Kafka, the open‑source CAT tool, and Flink. Agents ship logs to Kafka; the clog module consumes them, parses ERROR‑level entries, stores metadata in Elasticsearch for indexing, keeps raw blocks on local disk, and persists cold data to HDFS. Queries hit ES first, then local disk or HDFS, and finally remote nodes via HTTP if needed.
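The tiered lookup just described can be sketched as a fallback chain: the index is consulted first, then warmer-to-colder storage, stopping at the first tier that holds the block. The function and tier names below are illustrative, not clog's actual API.

```python
# Hypothetical sketch of clog's tiered query path: each tier is tried
# in order until one returns the requested log block.
def query_block(block_id, tiers):
    """tiers: ordered list of (name, lookup_fn); lookup_fn returns the
    block content, or None if this tier does not hold the block."""
    for name, lookup in tiers:
        result = lookup(block_id)
        if result is not None:
            return name, result
    raise KeyError(f"block {block_id} not found in any tier")

# Toy storage tiers standing in for ES / local disk / HDFS.
es_index = {}                       # hot index: misses for this block
local_disk = {"b1": "raw block"}    # warm tier: holds the block
hdfs = {}                           # cold tier
tiers = [
    ("es", es_index.get),
    ("local_disk", local_disk.get),
    ("hdfs", hdfs.get),
]

print(query_block("b1", tiers))     # falls through ES to local disk
```

A real deployment would add a fourth step, an HTTP call to remote nodes, as the article notes; it is omitted here to keep the sketch self-contained.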

Core Modules – clog: Handles receipt, storage, and retrieval of exception stacks. Storage hierarchy: local disk (temporary) → Elasticsearch (index) → HDFS (cold). Block names follow the pattern ip‑retention‑timestamp‑offset. Each block holds up to 8 MB and is uploaded to HDFS when disk usage exceeds 75%.
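As a rough illustration of this block bookkeeping, the naming scheme and the upload trigger might look like the following; the function names and field values are assumptions, not the real module's code.

```python
# Sketch of clog's block bookkeeping: blocks are named
# ip-retention-timestamp-offset, capped at 8 MB, and flushed to HDFS
# once local disk usage crosses 75%.
BLOCK_CAP_BYTES = 8 * 1024 * 1024   # 8 MB per block
DISK_THRESHOLD = 0.75               # upload to HDFS above 75% usage

def block_name(ip, retention_days, timestamp_ms, offset):
    """Compose the block file name from its four components."""
    return f"{ip}-{retention_days}-{timestamp_ms}-{offset}"

def should_upload(disk_used_bytes, disk_total_bytes):
    """True when local disk usage exceeds the 75% threshold."""
    return disk_used_bytes / disk_total_bytes > DISK_THRESHOLD

name = block_name("10.0.0.1", 7, 1634802724460, 128)
print(name)                        # 10.0.0.1-7-1634802724460-128
print(should_upload(80, 100))      # True: 80% > 75% threshold
```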

Core Modules – flink task: Parses exception logs, extracts the exception type, application, and machine, aggregates counts per minute, and emits metrics to the monitoring system.
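The per-minute aggregation this job performs can be sketched as bucketing each parsed record by minute and by (type, application, machine); the record fields here are illustrative, not the actual Flink job's schema.

```python
from collections import Counter

# Sketch of the per-minute aggregation: bucket each parsed exception
# record into a minute window keyed by (type, app, machine), then
# count occurrences per key.
def aggregate_minute(records):
    counts = Counter()
    for r in records:
        minute = r["timestamp_ms"] // 60_000          # minute bucket
        key = (minute, r["ex_type"], r["app"], r["host"])
        counts[key] += 1
    return counts

records = [
    {"timestamp_ms": 60_000,  "ex_type": "NPE", "app": "a", "host": "h1"},
    {"timestamp_ms": 90_000,  "ex_type": "NPE", "app": "a", "host": "h1"},
    {"timestamp_ms": 120_000, "ex_type": "IOE", "app": "a", "host": "h2"},
]
print(aggregate_minute(records))   # two NPEs land in the same minute
```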

Challenges: Container migration broke log collection; real‑time logs suffered up to an hour of latency, causing roughly 10% count error; massive volumes of non‑ERROR logs wasted resources; only a subset of services had real‑time collection; and the lack of environment isolation mixed test data with production.

Stage 2 – Base‑Component Refactor: Shifted the data source from generic log collection to a lightweight agent (logger‑spi) that intercepts logs on the application side, filters only exception‑stack logs, performs early aggregation (per minute, per type), and reports via Kafka. This reduced Kafka partitions from 60 to 14, cut the message rate from 486 K/s to 106 K/s, and eliminated the Flink parsing job.

Core Modules – logger‑spi: Instrumented via an agent, it captures logs, distinguishes BusinessError from SystemError, aggregates identical exceptions per minute, samples details, and throttles when memory limits are reached. Sample payload:

{
  "ip": "xx.xx.xx.xx",
  "sendTime": 1634802724460,
  "host": "l-xxxxxxx",
  "appCode": "xxx",
  "envName": "proda",
  "exType": "com.xx.xx.xx.xx.xxException",
  "count": 100,
  "details": [
    {
      "level": "ERROR",
      "fileName": "/home/xxx/logs/xx.log",
      "content": "2021-10-21.15:52:04.040 INFO ...",
      "timestamp": 1634802724458,
      "traceId": "xxx_211021.155203.xx.xx.xx.xx.6786.xxx_1",
      "stackTrace": "com.qunar.xxx.QProcessException: api|BookingRule Error: ...\n    at com.qunar.xxx.xxx(xxx.java:197)\n    ..."
    }
  ]
}
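The agent-side behaviour behind this payload can be sketched as: count identical exceptions within the current minute, keep only a few sampled detail entries, and stop collecting details once a memory budget is spent. The class name, limits, and fields below are illustrative assumptions, not logger‑spi's real implementation.

```python
# Hedged sketch of logger-spi-style client-side aggregation: counts
# per exception type, bounded detail sampling, memory-based throttle.
class MinuteAggregator:
    def __init__(self, sample_limit=3, max_detail_bytes=1024):
        self.sample_limit = sample_limit      # max sampled details/type
        self.max_detail_bytes = max_detail_bytes  # memory budget
        self.counts = {}        # ex_type -> occurrence count
        self.details = {}       # ex_type -> sampled detail strings
        self.detail_bytes = 0   # detail memory currently held

    def record(self, ex_type, detail):
        self.counts[ex_type] = self.counts.get(ex_type, 0) + 1
        samples = self.details.setdefault(ex_type, [])
        # Sample a few details; throttle once the budget is exhausted.
        if (len(samples) < self.sample_limit
                and self.detail_bytes + len(detail) <= self.max_detail_bytes):
            samples.append(detail)
            self.detail_bytes += len(detail)

    def flush(self):
        """Emit one payload per type (mirroring the JSON above), reset."""
        payloads = [{"exType": t, "count": c, "details": self.details[t]}
                    for t, c in self.counts.items()]
        self.counts, self.details, self.detail_bytes = {}, {}, 0
        return payloads

agg = MinuteAggregator(sample_limit=2)
for _ in range(100):
    agg.record("com.x.FooException", "stack trace ...")
print(agg.flush())   # count is 100, but only 2 sampled details kept
```

The key property, visible in the payload above, is that `count` keeps the true total while `details` stays bounded, which is what shrinks the Kafka message rate.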

Core Modules – heimdall‑statistic: Consumes the aggregated data, maintains in‑memory minute‑level counters, and periodically merges results into HBase, ensuring accurate, order‑independent statistics across four dimensions: total per minute, per type, per machine, and per machine‑type.
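Order independence falls out of the update model: every write is a commutative counter increment, so batches can be merged in any order and still converge to the same totals. A minimal sketch across the four dimensions, with illustrative dimension keys:

```python
from collections import defaultdict

# Sketch of an order-independent merge: each batch increments counters
# in all four dimensions; because addition commutes, batch order does
# not affect the final totals.
def merge(store, minute, app, ex_type, host, count):
    store[("total", app, minute)] += count
    store[("by_type", app, minute, ex_type)] += count
    store[("by_host", app, minute, host)] += count
    store[("by_host_type", app, minute, host, ex_type)] += count

batches = [("m1", "app", "NPE", "h1", 3), ("m1", "app", "IOE", "h1", 2)]
a, b = defaultdict(int), defaultdict(int)
for batch in batches:              # apply in original order
    merge(a, *batch)
for batch in reversed(batches):    # apply in reverse order
    merge(b, *batch)
print(a == b)                      # True: same totals either way
```

In HBase terms, the same property would let each merge be an atomic increment rather than a read-modify-write, so late or out-of-order batches cannot corrupt the counts.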

Effect Showcase: Provides time‑range exception type counts with trend comparison, detailed stack trace and trace ID views, and supports custom queries.

Application Scenarios: Used for service governance (weekly per‑owner exception reports), release‑time health checks, automated anomaly alerts, and root‑cause analysis, contributing to a reduction in the fault‑handling timeout rate from 60.9% to 38.8%.

Conclusion & Outlook: Heimdall now serves over 1,300 applications and has become a key quality metric. Future plans include leveraging exception baselines for test‑environment filtering and proactive issue detection during integration testing.

Backend · Flink · Observability · Kafka · exception monitoring · log aggregation
Written by

Qunar Tech Salon

Qunar Tech Salon is a learning and exchange platform for Qunar engineers and industry peers. We share cutting-edge technology trends and topics, providing a free platform for mid-to-senior technical professionals to exchange and learn.
