
Log Classification and Real-Time Aggregation Architecture Using Flink and Kafka

This article describes a real‑time log‑classification pipeline built on Flink and Kafka that pre‑filters, structures, classifies, and aggregates heterogeneous logs, enabling efficient frequency‑based alerts and statistical analysis without storing raw log data at scale.

NetEase Game Operations Platform

Introduction

In many operational scenarios we need to monitor the frequency of specific log events and trigger alerts when a threshold is reached. Simple cron scripts work for small, isolated cases, but they become hard to manage, especially in distributed services.

The ELK stack addresses this by ingesting raw logs into Elasticsearch and querying them there, but storing massive volumes of raw logs is wasteful, and Elasticsearch's write throughput becomes a bottleneck at scale.

A better approach is to aggregate logs before storage, persisting only the statistical results needed for alerting.

Pattern Overview

Overall Architecture

The pipeline is built on Flink. Logs are collected by a unified agent, sent to Kafka, and processed in real time through four stages: log pre‑filtering, log structuring, log classification, and aggregation.

Log Pre‑Filtering

Since a Kafka topic may contain logs from many modules with different formats, the first step separates logs into streams based on lightweight keyword filters, allowing downstream components to use a unified parsing method.
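As a sketch of this step, a lightweight keyword filter can route each raw line to a module-specific stream. The stream names and keywords below are illustrative assumptions, not the platform's actual rules:

```python
# Hypothetical mapping from stream name to a keyword that identifies a
# module's logs inside the shared Kafka topic.
STREAM_FILTERS = {
    "mongodb": "] command ",
    "webapp": "[ERROR]",
}

def pre_filter(line):
    """Return the name of the first stream whose keyword occurs in the
    line, or None when no known module matches."""
    for stream, keyword in STREAM_FILTERS.items():
        if keyword in line:
            return stream
    return None
```

Each named stream then flows into its own parser, so the structuring stage only ever sees one log format at a time.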

Log Structuring

After pre‑filtering, logs have relatively fixed patterns and can be parsed using JSON, delimiter splitting, or regular expressions. Regular expressions are the most flexible but can cause catastrophic backtracking if poorly written.

Example log line:

[conn4210183] command game-star.post command: findAndModify { findAndModify: "post", query: { _id: ObjectId('5c92494bab6dcc6481046944') }, update: { $inc: { browse_num: 1 } }, new: true } ... 391ms

Corresponding parsing regex:

\[conn(?<connectionId>\d+)\] command (?<database>\S+)\.(?<collection>\S+).* command: (?<command>\S+) (?<commandInfo>\{.*?\}) [\w:]+ .* (?<queryTime>\d+)ms

The regex extracts structured fields such as connection ID, database, collection, command, commandInfo, and query time.
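In Python, where named groups are written `(?P<name>...)` rather than `(?<name>...)`, the same pattern can be applied like this. The sample line below is shortened, with a hypothetical `planSummary: IXSCAN` fragment standing in for the elided middle of the original log:

```python
import re

# Python spelling of the parsing regex above; group names follow the
# fields listed in the article.
LOG_RE = re.compile(
    r"\[conn(?P<connectionId>\d+)\] command "
    r"(?P<database>\S+)\.(?P<collection>\S+).* command: "
    r"(?P<command>\S+) (?P<commandInfo>\{.*?\}) [\w:]+ .* "
    r"(?P<queryTime>\d+)ms"
)

line = ('[conn4210183] command game-star.post command: findAndModify '
        '{ findAndModify: "post" } planSummary: IXSCAN 391ms')

fields = LOG_RE.match(line).groupdict()
```

Note the lazy `\{.*?\}` for `commandInfo`: a greedy `.*` there would overrun nested braces and invite the catastrophic backtracking mentioned earlier.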

Log Classification

Structured logs can be classified statically (keyword, threshold, regex matching) or dynamically (field‑value or regex‑field classification). Static classification examples include counting ERROR‑level logs. Dynamic classification can extract error types and modules from stack traces using regexes.

[2019-05-17 12:41:40 +0000] [22] [ERROR] Error handling request /api/v1/loghub/rules
Traceback (most recent call last):
  ...
  File "code/web/rest.py", line 99, in post_request
    "%s" % (args, body, (time.time() - request.start_time))
AttributeError: 'Request' object has no attribute 'start_time'

From this we can parse AttributeError as the ErrorType and code/web/rest.py as the Module, then combine the two into a classification key.
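A sketch of this extraction in Python (the two regexes are illustrative assumptions, not the platform's actual rules):

```python
import re

TRACEBACK = """[2019-05-17 12:41:40 +0000] [22] [ERROR] Error handling request /api/v1/loghub/rules
Traceback (most recent call last):
  File "code/web/rest.py", line 99, in post_request
    "%s" % (args, body, (time.time() - request.start_time))
AttributeError: 'Request' object has no attribute 'start_time'
"""

# The innermost "File ..." frame names the failing module.
module = re.findall(r'File "([^"]+)", line \d+', TRACEBACK)[-1]
# The final exception line names the error type.
error_type = re.search(r"^(\w+(?:Error|Exception)):", TRACEBACK,
                       re.MULTILINE).group(1)

classification_key = f"{module}:{error_type}"
```

The resulting key, such as `code/web/rest.py:AttributeError`, becomes the class that downstream aggregation counts against.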

Dynamic classification requires logs to follow a consistent format; otherwise, similarity‑based methods (edit distance, Jaccard, TF) or stripping dynamic parts can be used, though they are less suitable for high‑volume real‑time streams.
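For illustration, a minimal Jaccard-based grouping over token sets might look like the sketch below; the 0.6 threshold is an arbitrary choice, and the log lines are hypothetical:

```python
def jaccard(a, b):
    """Jaccard similarity between the token sets of two log lines."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb)

def classify(line, templates, threshold=0.6):
    """Assign the line to the first sufficiently similar template,
    or register it as a new template when nothing matches."""
    for t in templates:
        if jaccard(line, t) >= threshold:
            return t
    templates.append(line)
    return line
```

Every incoming line requires a comparison against each known template, which is why such methods strain under high-volume real-time streams.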

Aggregation

Aggregations compute metrics such as occurrence counts, averages, min/max, and quantiles within time windows. The results can feed alerting logic or be persisted for visualization, eliminating the need to store raw logs.

Overall flow: Log Pre‑Filtering → Log Structuring → Log Classification → Aggregation. After these steps, alerts can be generated by comparing aggregated metrics against thresholds.
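The final two steps can be sketched as a tumbling-window count followed by a threshold check; the window size and threshold below are hypothetical:

```python
from collections import Counter

WINDOW = 60     # tumbling window size in seconds (hypothetical)
THRESHOLD = 3   # alert when a class reaches this count in one window

def aggregate(events):
    """events: iterable of (timestamp, class_key) pairs. Returns counts
    keyed by (window_start, class_key) -- the only data persisted."""
    counts = Counter()
    for ts, key in events:
        counts[(ts - ts % WINDOW, key)] += 1
    return counts

def alerts(counts):
    """Classes whose per-window count reached the alert threshold."""
    return [k for k, n in counts.items() if n >= THRESHOLD]
```

In the real pipeline Flink's windowing performs this aggregation continuously; the point is that only the small `(window, class) → count` table ever needs to be stored, not the raw logs.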

Conclusion

The presented solution demonstrates how a simple log‑alerting scenario can be addressed with a log‑classification pipeline that scales to large data volumes, supports both static and dynamic classification, and integrates seamlessly with real‑time aggregation for effective monitoring.

Tags: Flink, real-time analytics, Kafka, log processing, aggregation, log classification
Written by

NetEase Game Operations Platform

The NetEase Game Automated Operations Platform delivers stable services for thousands of NetEase titles, focusing on efficient ops workflows, intelligent monitoring, and virtualization.
