Tencent Eagle Eye Distributed Logging System Cloud Migration Practice
Tencent's Eagle Eye distributed real-time monitoring and log analysis platform was migrated to the cloud by rebuilding its LogSender and Kafka-to-ES components and switching to cloud CKafka and Elasticsearch. The migration raised throughput about fourfold, cut resource usage by about half, saved roughly 20 million RMB annually, and set the stage for further enhancements such as comprehensive monitoring and exactly-once delivery.
This article introduces the cloud migration solution for Tencent's Eagle Eye (鹰眼) distributed real-time monitoring and log analysis system, developed by the PCG Technical Operations Department.
1. Eagle Eye Platform Overview
Eagle Eye is a massive-scale distributed real-time monitoring and log analysis system that supports data reporting from multiple languages. It pulls data from ATTA (which supports reporting from Java, Python, and C++) and writes it to Elasticsearch (ES). Using ES's inverted index mechanism, it answers queries over billions of records within seconds. Core features include: a real-time log query service; data analysis capabilities exposed through an API for OLAP; error log alerting, with minute-level notifications based on different error codes; and Grafana-based real-time analysis and alerting.
2. Cloud Migration Background
Following the company's strategic adjustment to establish a new Cloud Business Group and launch the "Open Source Collaboration" and "Business Cloud Migration" initiatives, the Eagle Eye team identified benefits on three fronts. Business value: focus on the business, improve R&D efficiency, accelerate technology upgrades, use better cloud and open-source components, reuse resources through elastic scaling and cost optimization, and standardize CI/CD. Engineer value: broaden technical vision, build more valuable skills, and gain influence by contributing to the cloud. Tencent Cloud value: export business cloud migration experience and help refine cloud components.
3. Architecture Selection for Cloud Migration
The main data pipeline remained largely unchanged: the self-built Kafka was replaced by cloud CKafka, and the self-built ES by cloud ES. The other components had to be rebuilt.
LogSender Reconstruction: The producer program had severe performance bottlenecks during peak hours, which caused data loss. Two issues were addressed: (1) IP resolution bottleneck: the C++ version suffered lock contention during IP parsing; this was solved by implementing a binary search algorithm and removing the locks. (2) Kafka performance bottleneck: a single producer handling many topics caused queue locking; this was solved by assigning an independent Kafka client to each BOSSID. After optimization, single-node processing improved from 130k to 550k records per minute (roughly a 4x improvement).
Kafka Selection: Newer Kafka versions support more features, such as transactions and data transfer between disks, without performance degradation. Cloud CKafka achieves 400 MB/s per machine versus 100 MB/s for the self-built cluster (a 4x improvement).
Hangout Reconstruction: For writing to ES, the team developed a custom Kafka-to-ES component because Logstash did not perform well enough. Its core optimizations significantly reduced disk I/O, yielding a performance improvement of more than 2x and a 6x improvement in overall ES write performance.
ES Selection: Cloud ES 6.8.2 was adopted. Writing over TCP is faster, but HTTP is more stable under load balancing. After migration, writing 1 TB of data requires 3 nodes (16 cores, 64 GB memory, 5 TB disk each) on the cloud, versus 80 cores, 256 GB memory, and 12 TB disk (BX1) on-premises, a saving of approximately 50% in resources.
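The IP resolution fix can be illustrated with a minimal sketch. The article says only that a binary search replaced the locked lookup in the C++ LogSender; the class name IpRangeTable, the range data, and the labels below are hypothetical, and Python is used purely for illustration. The idea is that a table sorted once at build time and never mutated can be searched concurrently without any lock.

```python
import bisect
import ipaddress


class IpRangeTable:
    """Immutable table of non-overlapping IPv4 ranges, searched by bisection.

    Hypothetical illustration: the table is built once and never modified,
    so concurrent readers need no lock.
    """

    def __init__(self, ranges):
        # ranges: iterable of (first_ip, last_ip, label); sorted once here.
        rows = sorted(
            (int(ipaddress.IPv4Address(lo)), int(ipaddress.IPv4Address(hi)), label)
            for lo, hi, label in ranges
        )
        self._starts = tuple(r[0] for r in rows)
        self._ends = tuple(r[1] for r in rows)
        self._labels = tuple(r[2] for r in rows)

    def lookup(self, ip):
        value = int(ipaddress.IPv4Address(ip))
        # Index of the last range whose start is <= value (O(log n)).
        i = bisect.bisect_right(self._starts, value) - 1
        if i >= 0 and value <= self._ends[i]:
            return self._labels[i]
        return None  # the address falls in a gap between ranges


# Made-up sample data, for illustration only.
table = IpRangeTable([
    ("10.0.0.0", "10.0.255.255", "region-a"),
    ("10.1.0.0", "10.1.255.255", "region-b"),
    ("192.168.0.0", "192.168.0.255", "lab"),
])
```

A call such as table.lookup("10.1.2.3") walks at most log2(n) entries and touches only read-only tuples, which is why no mutex is needed on the hot path.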
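The article does not describe the internals of the custom Kafka-to-ES component. As a rough sketch of what such a writer commonly does, the code below groups records into batches and renders each batch in the newline-delimited format accepted by the Elasticsearch _bulk endpoint. Batching is an assumption here, not the team's confirmed optimization; the function names, the index name, and the "_doc" type (a common convention on ES 6.x) are hypothetical.

```python
import json


def batched(records, max_docs):
    """Yield lists of at most max_docs records, preserving order."""
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) >= max_docs:
            yield batch
            batch = []
    if batch:
        yield batch  # flush the final partial batch


def build_bulk_body(index, docs):
    """Render docs as an NDJSON body for the ES _bulk API.

    Each document becomes two lines: an action line and a source line.
    The body must end with a newline.
    """
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": "_doc"}}))
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"


# Example: five records split into batches of two, then one body rendered.
batches = list(batched(range(5), 2))
body = build_bulk_body("logs-demo", [{"msg": "a"}, {"msg": "b"}])
```

Sending one request per batch rather than one per record amortizes network and indexing overhead; the right batch size depends on document size and cluster capacity and would need to be measured.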
4. Post-Migration Results
After the cloud migration, Eagle Eye runs on 50+ ES clusters and 12 Kafka clusters. The benefits include: reduced workload (building 50+ clusters on-premises would have required 200+ person-days); reduced costs (overall performance improved 2-3x, saving at least 20 million RMB annually); and more focused work (the team can concentrate on write performance optimization, on monitoring systems with data reconciliation, and on backup index mechanisms for handling traffic surges).
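Data reconciliation, in its simplest form, compares how many records entered the pipeline with how many were indexed. The sketch below assumes per-topic, per-minute counters are available on both sides; the function name reconcile, the key layout, and the sample numbers are hypothetical and not taken from the Eagle Eye implementation.

```python
def reconcile(produced, indexed, tolerance=0.0):
    """Return {key: (produced, indexed, missing)} where the loss ratio exceeds tolerance.

    produced / indexed: dicts mapping a (topic, minute) key to a record count.
    Keys where indexed >= produced are not reported; with at-least-once
    delivery, duplicates can make the indexed count exceed the produced one.
    """
    gaps = {}
    for key, sent in produced.items():
        stored = indexed.get(key, 0)
        missing = sent - stored
        if sent > 0 and missing / sent > tolerance:
            gaps[key] = (sent, stored, missing)
    return gaps


# Made-up counters for two one-minute windows of one topic.
produced = {("app-log", "12:00"): 1000, ("app-log", "12:01"): 1200}
indexed = {("app-log", "12:00"): 1000, ("app-log", "12:01"): 1140}

gaps = reconcile(produced, indexed, tolerance=0.01)
```

Here the 12:01 window is flagged because 60 of 1200 records (5%) are missing, which exceeds the 1% tolerance; such a result could feed an alert or trigger a replay.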
5. Future Architecture Evolution
Future plans include: building comprehensive monitoring, with logs and metrics for each module, so that underlying data (CPU and memory), metrics, and logs are quickly accessible when anomalies occur; and continuing to upgrade the architecture to achieve exactly-once delivery using Flink's checkpoint mechanism (currently only at-least-once delivery is guaranteed).
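To see why at-least-once delivery falls short, consider what happens when a batch is replayed after a failure. The sketch below is not Flink checkpointing; it uses a different and simpler technique, deterministic document IDs, only to show how a replay creates duplicates in an append-style write and how an idempotent write absorbs them. The in-memory FakeIndex class and the ID format are hypothetical.

```python
def doc_id(topic, partition, offset):
    """Derive a stable ID from the record's position in Kafka."""
    return f"{topic}-{partition}-{offset}"


class FakeIndex:
    """In-memory stand-in for an index, for illustration only."""

    def __init__(self):
        self.appended = []  # append-style writes: every delivery adds a row
        self.docs = {}      # keyed writes: same ID overwrites the same row

    def append(self, body):
        self.appended.append(body)

    def put(self, _id, body):
        self.docs[_id] = body


def deliver(records, index):
    """Write each (topic, partition, offset, body) record both ways."""
    for topic, partition, offset, body in records:
        index.append(body)
        index.put(doc_id(topic, partition, offset), body)


records = [("app-log", 0, 0, "a"), ("app-log", 0, 1, "b"), ("app-log", 0, 2, "c")]
index = FakeIndex()
deliver(records, index)      # first attempt writes all three records
deliver(records[1:], index)  # simulated replay: offsets 1 and 2 are redelivered
```

After the replay the append-style store holds five rows for three records, while the keyed store still holds exactly three. A checkpoint-based design aims for the same end state by committing offsets and output together rather than by deduplicating on write.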
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.