
Design and Evolution of a Unified Log Platform

NetEase Yanxuan built a unified log platform that consolidates collection, delivery, processing, storage, analysis, and alerting for near‑real‑time and offline logs. Built on Flume, a Golang‑based Loggie agent, Kafka, Flink, HBase, and Elasticsearch, it delivers high performance, data‑quality assurance, container‑native isolation, full‑link traceability, and automated scaling; Loggie has since been open‑sourced.

NetEase Yanxuan Technology Product Team

Log data records the state and behavior of devices and programs. Its application scenarios fall into two categories, near‑real‑time and offline, and a log platform must address log collection, delivery, processing, storage, and monitoring/alerting.

At NetEase Yanxuan, developers interact with logs daily for business monitoring, online issue diagnosis, user‑behavior analysis, and audit. Consequently, how to efficiently use log data and build long‑term log‑related infrastructure has become a critical technical topic.

In the early stage, Yanxuan relied on open‑source solutions (Flume for VM environments, Filebeat for containers) to quickly build log use cases. However, as the team grew and business complexity increased, low‑value repetitive work (e.g., comparing log completeness, modifying collection scripts) consumed substantial development and operations resources, prompting the move to a self‑built, unified log platform.

The platform’s design goals are:

Convenient and efficient one‑stop solution for log collection, delivery, processing, storage, analysis, and alerting.

Low resource consumption with high performance to handle PB‑level daily logs.

High data‑quality assurance, ensuring no loss and fast completeness checks.

Full support for containerized log collection in hybrid‑cloud scenarios.

Key capabilities include unified log‑metadata management, data‑quality management, log retrieval, log monitoring, and business monitoring. Log metadata is abstracted into four parts: ownership (tenant, product, service), file information (directory, name), data model (fields, types, constraints, indexing), and extensions (sensitivity, audit flag, retention).
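The four-part metadata abstraction above can be sketched as a set of Go types. The field names below are illustrative assumptions, not the platform's actual schema:

```go
package main

import "fmt"

// Ownership: which tenant/product/service the log belongs to.
type Ownership struct {
	Tenant, Product, Service string
}

// FileInfo: where the log file lives on disk.
type FileInfo struct {
	Directory, Name string
}

// FieldDef: one field of the log's data model.
type FieldDef struct {
	Name, Type string
	Required   bool // constraint: must be present
	Indexed    bool // whether the field is indexed for search
}

// Extensions: sensitivity, audit, and retention attributes.
type Extensions struct {
	Sensitive     bool
	AuditFlag     bool
	RetentionDays int
}

// LogMetadata groups the four parts described in the article.
type LogMetadata struct {
	Owner  Ownership
	File   FileInfo
	Model  []FieldDef
	Extras Extensions
}

func main() {
	meta := LogMetadata{
		Owner: Ownership{Tenant: "yanxuan", Product: "order", Service: "order-api"},
		File:  FileInfo{Directory: "/home/logs/order", Name: "access.log"},
		Model: []FieldDef{
			{Name: "trace_id", Type: "string", Required: true, Indexed: true},
			{Name: "latency_ms", Type: "int"},
		},
		Extras: Extensions{AuditFlag: true, RetentionDays: 30},
	}
	fmt.Println(meta.Owner.Service, len(meta.Model))
}
```

Registering metadata in this shape lets downstream components (validation, indexing, retention cleanup) all read from one source of truth.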

Data‑quality management enforces log standards, validates logs against defined models, and generates alerts for anomalies. It also integrates with data‑lake concepts to separate access logs, business logs, and application logs, applying stricter retention and protection for lake‑ingested logs.
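A minimal sketch of validating a parsed log record against its declared model, in the spirit of the data-quality checks above (the `FieldDef`/`Validate` names are assumptions for illustration):

```go
package main

import (
	"fmt"
	"strconv"
)

// FieldDef describes one field of a log's data model.
type FieldDef struct {
	Name     string
	Type     string // "string" or "int" in this sketch
	Required bool
}

// Validate checks a parsed record against its model and returns
// all violations, which the platform could turn into alerts.
func Validate(record map[string]string, model []FieldDef) []string {
	var violations []string
	for _, f := range model {
		v, ok := record[f.Name]
		if !ok {
			if f.Required {
				violations = append(violations, "missing required field: "+f.Name)
			}
			continue
		}
		if f.Type == "int" {
			if _, err := strconv.Atoi(v); err != nil {
				violations = append(violations, "field "+f.Name+" is not an int: "+v)
			}
		}
	}
	return violations
}

func main() {
	model := []FieldDef{
		{Name: "trace_id", Type: "string", Required: true},
		{Name: "latency_ms", Type: "int"},
	}
	rec := map[string]string{"latency_ms": "fast"} // missing trace_id, bad int
	fmt.Println(Validate(rec, model))
}
```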

Log retrieval offers three modes: interactive search (similar to grep with regex and multi‑condition support), real‑time log streaming, and full‑link trace using TraceID to reconstruct end‑to‑end call chains.
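The grep-like interactive search can be pictured as an AND of regex conditions over log lines. The real platform queries Elasticsearch; this is an in-memory sketch with illustrative names:

```go
package main

import (
	"fmt"
	"regexp"
)

// Search returns the lines matching every pattern, mimicking
// multi-condition interactive search.
func Search(lines []string, patterns []string) ([]string, error) {
	res := make([]*regexp.Regexp, 0, len(patterns))
	for _, p := range patterns {
		re, err := regexp.Compile(p)
		if err != nil {
			return nil, err
		}
		res = append(res, re)
	}
	var out []string
	for _, line := range lines {
		ok := true
		for _, re := range res {
			if !re.MatchString(line) {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, line)
		}
	}
	return out, nil
}

func main() {
	logs := []string{
		"INFO order created trace=abc",
		"ERROR payment failed trace=abc",
		"ERROR payment failed trace=def",
	}
	// Both conditions must hold, like stacking grep filters.
	hits, _ := Search(logs, []string{`^ERROR`, `trace=abc`})
	fmt.Println(hits)
}
```

The same filter shape, applied with a single `trace=<TraceID>` condition across all services' logs, is essentially what full-link trace reconstruction does.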

Log monitoring is based on text‑event monitoring, supporting log rotation detection, regex matching, and error‑level alerts, with visual dashboards for real‑time status.

Beyond monitoring, the platform supports extensive business scenarios such as order volume statistics, payment success rates, QPS monitoring, and both server‑side and client‑side event tracking (embedding logs into data‑lake for offline analysis).

Architecturally, the platform consists of:

Log collection: Flume agents for VMs and the custom Loggie agent (Golang‑based) for containers, offering multi‑pipeline isolation and low CPU usage.

Message queue layer: Kafka for decoupling collection and processing, providing high throughput, scalability, fault tolerance, and data retention for replay.

Data routing: a gateway layer that isolates agents from Kafka, performs traffic shaping, protocol conversion, and resource isolation.

Real‑time processing: Flink jobs for format validation, text matching, and business metric aggregation, sharing the same cluster as other real‑time workloads.

Storage: HBase (data lake) for offline analysis, Elasticsearch for search and monitoring, with careful shard and refresh tuning.

Key challenges addressed include:

Rate limiting and resource isolation at the agent level (using interceptors, back‑pressure, and container CPU/memory limits).

Full‑link observability: standardized metrics, anomaly codes, and health dashboards for agents, routers, Kafka, Flink, and ES.

Automation: elastic scaling of data‑router and Flink resources, automatic large‑log detection, and safe log file cleanup after successful ingestion.
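Agent-side rate limiting of the kind listed above is commonly built on a token bucket. A deliberately simplified sketch, not Loggie's actual interceptor API:

```go
package main

import (
	"fmt"
	"time"
)

// TokenBucket allows a burst up to capacity, then a steady rate.
type TokenBucket struct {
	capacity int
	tokens   float64
	rate     float64 // tokens added per second
	last     time.Time
}

func NewTokenBucket(capacity int, rate float64) *TokenBucket {
	return &TokenBucket{
		capacity: capacity,
		tokens:   float64(capacity),
		rate:     rate,
		last:     time.Now(),
	}
}

// Allow reports whether one event may pass, refilling tokens by elapsed time.
func (b *TokenBucket) Allow() bool {
	now := time.Now()
	b.tokens += now.Sub(b.last).Seconds() * b.rate
	if b.tokens > float64(b.capacity) {
		b.tokens = float64(b.capacity)
	}
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false // caller applies back-pressure: buffer and retry later
}

func main() {
	bucket := NewTokenBucket(3, 1) // burst of 3, then 1 event/second
	passed := 0
	for i := 0; i < 10; i++ {
		if bucket.Allow() {
			passed++
		}
	}
	fmt.Println(passed) // 3: the burst is consumed, refill is too slow in-loop
}
```

Returning `false` instead of dropping is what turns the limiter into back-pressure: events queue upstream of the bucket rather than being lost.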

Looking forward, the underlying log‑collection engine Loggie has been open‑sourced (https://github.com/loggie-io/loggie). Compared with existing open‑source collectors, Loggie offers strong isolation via multi‑pipeline design, lightweight high‑performance Golang implementation, pluggable architecture for sources/interceptors/sinks, native Kubernetes support, and comprehensive observability with Prometheus metrics and Grafana dashboards.
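A pluggable source → interceptor → sink chain can be expressed with small Go interfaces. The interfaces below are assumptions for illustration, not Loggie's real plugin API:

```go
package main

import (
	"fmt"
	"strings"
)

// The three plugin points of the pipeline.
type Source interface{ Read() []string }
type Interceptor interface{ Intercept(line string) (string, bool) }
type Sink interface{ Write(line string) }

// sliceSource replays an in-memory slice of lines.
type sliceSource struct{ lines []string }

func (s sliceSource) Read() []string { return s.lines }

// upperInterceptor uppercases lines and drops empty ones.
type upperInterceptor struct{}

func (upperInterceptor) Intercept(line string) (string, bool) {
	if line == "" {
		return "", false // drop
	}
	return strings.ToUpper(line), true
}

// collectSink appends processed lines to a slice.
type collectSink struct{ out *[]string }

func (s collectSink) Write(line string) { *s.out = append(*s.out, line) }

// Run wires one pipeline end to end through its interceptor chain.
func Run(src Source, ics []Interceptor, sink Sink) {
	for _, line := range src.Read() {
		keep := true
		for _, ic := range ics {
			if line, keep = ic.Intercept(line); !keep {
				break
			}
		}
		if keep {
			sink.Write(line)
		}
	}
}

func main() {
	var collected []string
	Run(sliceSource{[]string{"hello", "", "world"}},
		[]Interceptor{upperInterceptor{}},
		collectSink{&collected})
	fmt.Println(collected) // [HELLO WORLD]
}
```

Because each stage is an interface, new file sources, rate-limiting interceptors, or Kafka/ES sinks can be swapped in without touching the pipeline runner.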

Tags: cloud-native, big data, observability, log management, log collection, log platform
Written by

NetEase Yanxuan Technology Product Team

The NetEase Yanxuan Technology Product Team shares practical tech insights for the e‑commerce ecosystem. This official channel periodically publishes technical articles, team events, recruitment information, and more.
