Real-Time Log Clustering Architecture and Continuous Clustering Algorithm
This article presents an overview of a log clustering system: the background that motivated it, an architecture built on Filebeat, Kafka, Flink, Elasticsearch, and Grafana, and a continuous clustering algorithm that uses SimHash fingerprints and Hamming distance for real‑time log governance and anomaly detection.
Background – In early 2023 the daily log volume grew from 9 TB to 20 TB, more than doubling. Existing storage optimizations had reached their limits, prompting the need for governance tools, with log clustering as the most critical component.
Log System Overview – A scalable solution combines Filebeat for log collection, Kafka as a high‑throughput message queue, Flink for real‑time parsing, Elasticsearch for storage and search, and Grafana for visualization. This pipeline enables efficient ingestion, processing, and monitoring of massive log data.
Log Clustering Role – Clustering provides metrics for log standardization and helps quickly identify issues, reducing manual analysis. Future extensions include intelligent alerts and root‑cause analysis, forming a core part of AIOps.
Algorithm Comparison – Common unsupervised clustering methods were evaluated:
| Algorithm | Pros | Cons |
| --- | --- | --- |
| K-Means | Simple, low time complexity | Requires predefined K; sensitive to outliers and initialization |
| Hierarchical | No need to predefine K; discovers hierarchy | High computational cost; sensitive to outliers |
| Density‑based | Insensitive to outliers; stable initialization | Poor performance on uneven density; slow convergence; complex tuning |
Given the real‑time, large‑scale nature of log data, a custom continuous clustering algorithm was designed to achieve low latency, incremental updates, and automation.
Continuous Clustering Design – The algorithm consists of three steps:
SimHash feature extraction: logs are tokenized, weighted, hashed, and reduced to a compact binary fingerprint.
Similarity matching: Hamming distance is used to compare new log fingerprints with existing cluster centroids.
Cluster‑center update & pattern extraction: centroids are updated via mean aggregation, and the longest common substring extracts representative log patterns.
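The SimHash step above can be sketched in plain Java. This is a minimal illustration rather than the production implementation: the 64‑bit fingerprint width, the FNV‑1a token hash, the whitespace tokenizer, and the uniform token weight of 1.0 are all assumptions made for the sketch.

```java
class SimHashSketch {
    static final int BITS = 64;   // assumed fingerprint width

    // Hash each token, then for every bit position add +weight when the
    // token hash has a 1 bit and -weight when it has a 0 bit; the sign of
    // each accumulator becomes one bit of the final fingerprint.
    static long simHash(String log) {
        double[] acc = new double[BITS];
        for (String token : log.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            long h = fnv1a64(token);
            double weight = 1.0;   // uniform weight; TF-IDF could replace this
            for (int i = 0; i < BITS; i++) {
                acc[i] += ((h >>> i) & 1L) == 1L ? weight : -weight;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < BITS; i++) {
            if (acc[i] > 0) fingerprint |= 1L << i;
        }
        return fingerprint;
    }

    // FNV-1a: a simple, stable 64-bit string hash
    // (String.hashCode is only 32-bit).
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

Logs that share most tokens then yield fingerprints whose Hamming distance (`Long.bitCount(a ^ b)`) is small, which is exactly what the similarity‑matching step exploits.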
Code for Hamming similarity:
private static double hammingSimilarity(double[] v1, double[] v2) {
    int hammingDistance = 0;                        // differing bit positions
    int maxLength = Math.max(v1.length, v2.length);
    for (int i = 0; i < maxLength; i++) {
        double b1 = i < v1.length ? v1[i] : 0.0, b2 = i < v2.length ? v2[i] : 0.0;
        if (b1 != b2) hammingDistance++;
    }
    return 1.0 - ((double) hammingDistance / maxLength);  // 1.0 means identical
}

Flink Real‑Time Clustering – In the parsing layer, Flink processes logs in windows. When a window fills, the cluster centroids are updated, patterns are extracted, and results are stored in MySQL and visualized in Grafana.
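Putting the matching and centroid‑update steps together, the per‑window logic can be sketched as plain Java. The 0.8 similarity threshold, the 0/1 fingerprint encoding as double[], and the incremental‑mean centroid update are illustrative assumptions; the real job runs inside Flink operators.

```java
import java.util.ArrayList;
import java.util.List;

class ContinuousClustering {
    static final double THRESHOLD = 0.8;   // assumed similarity threshold

    static class Cluster {
        double[] centroid;   // per-bit running mean of member fingerprints
        int size;
        Cluster(double[] first) { centroid = first.clone(); size = 1; }
    }

    final List<Cluster> clusters = new ArrayList<>();

    // Assign a fingerprint (bits encoded as 0.0/1.0) to the most similar
    // cluster, or open a new cluster when no centroid is similar enough.
    int assign(double[] fp) {
        int best = -1;
        double bestSim = -1;
        for (int i = 0; i < clusters.size(); i++) {
            double sim = hammingSimilarity(fp, clusters.get(i).centroid);
            if (sim > bestSim) { bestSim = sim; best = i; }
        }
        if (best >= 0 && bestSim >= THRESHOLD) {
            Cluster c = clusters.get(best);
            c.size++;
            for (int i = 0; i < c.centroid.length; i++) {
                // incremental mean: centroid drifts toward new members
                c.centroid[i] += (fp[i] - c.centroid[i]) / c.size;
            }
            return best;
        }
        clusters.add(new Cluster(fp));
        return clusters.size() - 1;
    }

    // Round fractional centroid bits back to 0/1 before comparing.
    static double hammingSimilarity(double[] v1, double[] v2) {
        int distance = 0, maxLength = Math.max(v1.length, v2.length);
        for (int i = 0; i < maxLength; i++) {
            double b1 = i < v1.length ? Math.round(v1[i]) : 0;
            double b2 = i < v2.length ? Math.round(v2[i]) : 0;
            if (b1 != b2) distance++;
        }
        return 1.0 - (double) distance / maxLength;
    }
}
```

Because each arriving log either joins an existing cluster or opens a new one, the cluster set grows incrementally without re-clustering history, which is what makes the algorithm suitable for unbounded streams.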
Practical Issues & Optimizations – Optimizations include feature weighting (e.g., TF‑IDF), parameter tuning (similarity threshold, window size), more efficient pattern extraction, and leveraging large language models offline to generate regex‑based cluster signatures.
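The longest‑common‑substring extraction that these optimizations aim to speed up can be sketched as the classic dynamic program, shown here for a pair of logs; a cluster's representative pattern would fold this over all members.

```java
class LogPatternExtractor {
    // Longest common substring via dynamic programming: dp[i][j] is the
    // length of the longest common suffix of a[0..i) and b[0..j).
    static String longestCommonSubstring(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        int best = 0, endA = 0;   // length and end index (in a) of the best match
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    if (dp[i][j] > best) { best = dp[i][j]; endA = i; }
                }
            }
        }
        return a.substring(endA - best, endA);
    }
}
```

The O(n·m) time and space cost per pair is why pattern extraction is listed above as an optimization target.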
Applications – The clustered logs support two main use cases: (1) real‑time anomaly detection and alerting, and (2) predictive maintenance by analyzing historical patterns to forecast potential failures.
Conclusion – Continuous clustering effectively reduces log complexity, improves monitoring and alerting, and enhances system stability and reliability in the face of ever‑growing, heterogeneous log data.
ZhongAn Tech Team
China's first online-only insurer. Through technological innovation, ZhongAn makes insurance simpler, warmer, and more valuable: its technology underwrites 50 billion RMB of policies and serves 600 million users with smart, personalized solutions. This account shares the team's engineering articles.