Real-Time Log Clustering Architecture and Continuous Clustering Algorithm
This article presents an overview of a log clustering system: the background that motivated it, an architecture built on Filebeat, Kafka, Flink, Elasticsearch, and Grafana, and a continuous clustering algorithm that uses SimHash fingerprints and Hamming distance for real‑time log governance and anomaly detection.
Background – In early 2023 the daily log volume grew from 9 TB to 20 TB, more than doubling. Existing storage optimizations had reached their limits, prompting the need for governance tools, with log clustering as the most critical component.
Log System Overview – A scalable solution combines Filebeat for log collection, Kafka as a high‑throughput message queue, Flink for real‑time parsing, Elasticsearch for storage and search, and Grafana for visualization. This pipeline enables efficient ingestion, processing, and monitoring of massive log data.
Log Clustering Role – Clustering provides metrics for log standardization and helps quickly identify issues, reducing manual analysis. Future extensions include intelligent alerts and root‑cause analysis, forming a core part of AIOps.
Algorithm Comparison – Common unsupervised clustering methods were evaluated:
| Algorithm | Pros | Cons |
| --- | --- | --- |
| K-Means | Simple, low time complexity | Requires predefined K; sensitive to outliers and initialization |
| Hierarchical | No need to predefine K; discovers hierarchy | High computational cost; sensitive to outliers |
| Density‑based | Insensitive to outliers; stable initialization | Poor performance on uneven density; slow convergence; complex tuning |
Given the real‑time, large‑scale nature of log data, a custom continuous clustering algorithm was designed to achieve low latency, incremental updates, and automation.
Continuous Clustering Design – The algorithm consists of three steps:
SimHash feature extraction: logs are tokenized, weighted, hashed, and reduced to a compact binary fingerprint.
Similarity matching: Hamming distance is used to compare new log fingerprints with existing cluster centroids.
Cluster‑center update & pattern extraction: centroids are updated via mean aggregation, and the longest common substring extracts representative log patterns.
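The SimHash step above can be sketched in plain Java. This is a minimal illustration rather than the production implementation: the 64‑bit fingerprint width, the FNV‑1a token hash, the whitespace tokenizer, and the uniform token weight of 1.0 are all assumptions made for the sketch.

```java
class SimHashSketch {
    static final int BITS = 64;   // assumed fingerprint width

    // Hash each token, then for every bit position add +weight when the
    // token hash has a 1 bit and -weight when it has a 0 bit; the sign of
    // each accumulator becomes one bit of the final fingerprint.
    static long simHash(String log) {
        double[] acc = new double[BITS];
        for (String token : log.toLowerCase().split("\\W+")) {
            if (token.isEmpty()) continue;
            long h = fnv1a64(token);
            double weight = 1.0;   // uniform weight; TF-IDF could replace this
            for (int i = 0; i < BITS; i++) {
                acc[i] += ((h >>> i) & 1L) == 1L ? weight : -weight;
            }
        }
        long fingerprint = 0L;
        for (int i = 0; i < BITS; i++) {
            if (acc[i] > 0) fingerprint |= 1L << i;
        }
        return fingerprint;
    }

    // FNV-1a: a simple, stable 64-bit string hash
    // (String.hashCode is only 32-bit).
    static long fnv1a64(String s) {
        long hash = 0xcbf29ce484222325L;
        for (int i = 0; i < s.length(); i++) {
            hash ^= s.charAt(i);
            hash *= 0x100000001b3L;
        }
        return hash;
    }
}
```

Logs that share most tokens then yield fingerprints whose Hamming distance (`Long.bitCount(a ^ b)`) is small, which is exactly what the similarity‑matching step exploits.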
Code for Hamming similarity:
private static double hammingSimilarity(double[] v1, double[] v2) {
    int hammingDistance = 0;                        // differing bit positions
    int maxLength = Math.max(v1.length, v2.length);
    for (int i = 0; i < maxLength; i++) {
        double b1 = i < v1.length ? v1[i] : 0.0, b2 = i < v2.length ? v2[i] : 0.0;
        if (b1 != b2) hammingDistance++;
    }
    return 1.0 - ((double) hammingDistance / maxLength);  // 1.0 means identical
}

Flink Real‑Time Clustering – In the parsing layer, Flink processes logs in windows. When a window fills, the cluster centroids are updated, patterns are extracted, and results are stored in MySQL and visualized in Grafana.
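Putting the matching and centroid‑update steps together, the per‑window logic can be sketched as plain Java. The 0.8 similarity threshold, the 0/1 fingerprint encoding as double[], and the incremental‑mean centroid update are illustrative assumptions; the real job runs inside Flink operators.

```java
import java.util.ArrayList;
import java.util.List;

class ContinuousClustering {
    static final double THRESHOLD = 0.8;   // assumed similarity threshold

    static class Cluster {
        double[] centroid;   // per-bit running mean of member fingerprints
        int size;
        Cluster(double[] first) { centroid = first.clone(); size = 1; }
    }

    final List<Cluster> clusters = new ArrayList<>();

    // Assign a fingerprint (bits encoded as 0.0/1.0) to the most similar
    // cluster, or open a new cluster when no centroid is similar enough.
    int assign(double[] fp) {
        int best = -1;
        double bestSim = -1;
        for (int i = 0; i < clusters.size(); i++) {
            double sim = hammingSimilarity(fp, clusters.get(i).centroid);
            if (sim > bestSim) { bestSim = sim; best = i; }
        }
        if (best >= 0 && bestSim >= THRESHOLD) {
            Cluster c = clusters.get(best);
            c.size++;
            for (int i = 0; i < c.centroid.length; i++) {
                // incremental mean: centroid drifts toward new members
                c.centroid[i] += (fp[i] - c.centroid[i]) / c.size;
            }
            return best;
        }
        clusters.add(new Cluster(fp));
        return clusters.size() - 1;
    }

    // Round fractional centroid bits back to 0/1 before comparing.
    static double hammingSimilarity(double[] v1, double[] v2) {
        int distance = 0, maxLength = Math.max(v1.length, v2.length);
        for (int i = 0; i < maxLength; i++) {
            double b1 = i < v1.length ? Math.round(v1[i]) : 0;
            double b2 = i < v2.length ? Math.round(v2[i]) : 0;
            if (b1 != b2) distance++;
        }
        return 1.0 - (double) distance / maxLength;
    }
}
```

Because each arriving log either joins an existing cluster or opens a new one, the cluster set grows incrementally without re-clustering history, which is what makes the algorithm suitable for unbounded streams.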
Practical Issues & Optimizations – Optimizations include feature weighting (e.g., TF‑IDF), parameter tuning (similarity threshold, window size), more efficient pattern extraction, and leveraging large language models offline to generate regex‑based cluster signatures.
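The longest‑common‑substring extraction that these optimizations aim to speed up can be sketched as the classic dynamic program, shown here for a pair of logs; a cluster's representative pattern would fold this over all members.

```java
class LogPatternExtractor {
    // Longest common substring via dynamic programming: dp[i][j] is the
    // length of the longest common suffix of a[0..i) and b[0..j).
    static String longestCommonSubstring(String a, String b) {
        int[][] dp = new int[a.length() + 1][b.length() + 1];
        int best = 0, endA = 0;   // length and end index (in a) of the best match
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                if (a.charAt(i - 1) == b.charAt(j - 1)) {
                    dp[i][j] = dp[i - 1][j - 1] + 1;
                    if (dp[i][j] > best) { best = dp[i][j]; endA = i; }
                }
            }
        }
        return a.substring(endA - best, endA);
    }
}
```

The O(n·m) time and space cost per pair is why pattern extraction is listed above as an optimization target.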
Applications – The clustered logs support two main use cases: (1) real‑time anomaly detection and alerting, and (2) predictive maintenance by analyzing historical patterns to forecast potential failures.
Conclusion – Continuous clustering effectively reduces log complexity, improves monitoring and alerting, and enhances system stability and reliability in the face of ever‑growing, heterogeneous log data.
ZhongAn Tech Team
China's first online-only insurer. Through technological innovation, ZhongAn makes insurance simpler, warmer, and more valuable: its technology underwrites 50 billion RMB of policies and serves 600 million users with smart, personalized solutions. This account shares the team's engineering articles.