
Design and Implementation of the Yunji Logging System Using Flink and ClickHouse

The article presents the Yunji logging system, a Flink+ClickHouse-based cloud-native platform for real-time ingestion, storage, querying, analysis, and monitoring of massive heterogeneous logs, covering its architecture, configuration center, storage design, processing flow, monitoring features, and future enhancements.

Yiche Technology

The Yunji logging system is a core component of the Ark (方舟) cloud-native DevOps platform, designed to handle the massive volume of heterogeneous logs generated by Easycar’s internal services. It addresses requirements such as automated collection of application and non‑application logs, log merging, extraction, splitting, real‑time ingestion of 400‑500 billion log entries per day, and providing query, analysis, and monitoring capabilities.

The authors evaluated a traditional ELK stack but chose Flink plus ClickHouse instead. ClickHouse offers columnar storage with high compression (3-5× smaller than the equivalent data in Elasticsearch), high-throughput writes suited to log ingestion, TTL-based lifecycle management, per-log-file table isolation, strong single-table query performance, and vectorized execution. Flink provides exactly-once semantics, low latency, state management, and fault tolerance, and replaces Logstash for stream processing.

The Logging Configuration Center manages all log configurations and permissions. It integrates with the custom bitauto‑agent, which dynamically senses configuration changes and collects logs from containers, VMs, and physical machines. Supported log types include application logs, network logs (NAT), system logs (messages), middleware logs (MySQL), and access layer logs (Nginx, CDN, Traefik). The center stores configurations in a centralized service and pushes updates to agents.
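To make the configuration flow concrete, here is a minimal Python sketch of a config-center entry and its push mechanism. The field names and class shapes are illustrative assumptions, not the actual bitauto-agent schema; in the real system the center is a service and agents sense changes over the network.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical config-center entry; field names are illustrative only.
@dataclass
class LogConfig:
    log_key: str            # unique key carried by every log message
    log_type: str           # e.g. "application", "nginx", "mysql"
    paths: List[str]        # file globs the agent should tail
    extraction: Dict[str, str] = field(default_factory=dict)  # field -> pattern

class ConfigRegistry:
    """In-memory stand-in for the configuration center's push mechanism."""

    def __init__(self) -> None:
        self._configs: Dict[str, LogConfig] = {}
        self._listeners: List[Callable[[LogConfig], None]] = []

    def subscribe(self, listener: Callable[[LogConfig], None]) -> None:
        # An agent registers to be notified of configuration changes.
        self._listeners.append(listener)

    def push(self, config: LogConfig) -> None:
        # Agents "dynamically sense" changes: every push notifies subscribers.
        self._configs[config.log_key] = config
        for listener in self._listeners:
            listener(config)
```

A subscriber here is just a callback; the production system distributes the same updates to agents on containers, VMs, and physical machines.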

In the storage architecture, Flink jobs continuously consume log messages from Kafka. Each message carries a unique key; the first time a key is seen, the job pulls the latest extraction and storage policies for it from the configuration center, caches them in Flink memory, and registers an ETCD listener for subsequent policy updates. Using the key's policy, Flink transforms each log line into a parameterized INSERT statement (example below) and batches writes (every 50 000 rows or 30 seconds, whichever comes first) into ClickHouse via an Nginx proxy that distributes them across shards and replicas. The target table engine is ReplicatedMergeTree, which provides primary-key sorting, optional partitioning, replication, atomic block inserts, automatic deduplication of identical blocks, and sampling support.

INSERT INTO ark_log.ck_json_xxxx_local
    (`id_key`, `date`, `ip`, `Timestamp`, `ID`, `Content`)
VALUES (?, ?, ?, ?, ?, ?)
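The batch-flush policy described above (50 000 rows or 30 seconds, whichever comes first) can be sketched as follows. This is a simplified Python illustration, not the actual Flink sink; the real job checks age via timers and writes through the Nginx proxy to ClickHouse.

```python
import time
from typing import Callable, List


class BatchingWriter:
    """Buffers rows and flushes when a row-count or age threshold is hit.

    Sketch only: the real system implements this inside a Flink sink;
    `flush_fn` stands in for executing the parameterized INSERT.
    """

    def __init__(self, flush_fn: Callable[[List[tuple]], None],
                 max_rows: int = 50_000, max_age_s: float = 30.0) -> None:
        self._flush_fn = flush_fn
        self._max_rows = max_rows
        self._max_age_s = max_age_s
        self._buffer: List[tuple] = []
        self._opened_at = time.monotonic()

    def add(self, row: tuple) -> None:
        if not self._buffer:
            self._opened_at = time.monotonic()  # batch age starts at first row
        self._buffer.append(row)
        # Flush on whichever threshold trips first; a real sink would also
        # use a timer so an idle batch still flushes after max_age_s.
        if (len(self._buffer) >= self._max_rows
                or time.monotonic() - self._opened_at >= self._max_age_s):
            self.flush()

    def flush(self) -> None:
        if self._buffer:
            self._flush_fn(self._buffer)
            self._buffer = []
```

Batching this way trades a few seconds of latency for far fewer, larger inserts, which matches ClickHouse's preference for large atomic block writes.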

The data processing flow consists of three steps: (1) read raw log data from Kafka into Flink; (2) apply the intermediate-data conversion principles (group fields pass through unchanged, count-style factors become 1 or 0 per line, sum-style factors are aggregated with sum(), time fields are truncated to the minute, and so on) and batch-write the transformed data to ClickHouse; (3) a periodic task generates metric-aggregation SQL from the monitoring rules, runs the aggregation, builds metric objects, and pushes them to the Tianyan monitoring system.
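Step (2), the intermediate-data conversion, can be sketched as a per-line transform. The field and factor names below are illustrative assumptions, not the system's actual schema:

```python
from datetime import datetime


def to_intermediate(raw: dict) -> dict:
    """Map one raw access-log record to an intermediate row (sketch only).

    Group fields pass through, count-style factors become 1/0 per line,
    sum-style factors keep their numeric value, and the timestamp is
    truncated to the minute for per-minute rollups.
    """
    ts = datetime.fromisoformat(raw["timestamp"])
    return {
        # group field: passes through unchanged
        "http_host": raw["http_host"],
        # time field: truncated to minute granularity
        "date": ts.replace(second=0, microsecond=0).isoformat(sep=" "),
        # count factor: 1 if this line matches the condition, else 0
        "log_factor_5xx": 1 if raw["status"] in (500, 502) else 0,
        # count factor for PV: every request line counts once
        "log_factor_pv": 1,
        # sum factor: keeps the raw value; sum() aggregates it later
        "log_factor_bytes": int(raw["body_bytes"]),
    }
```

Because every line reduces to numeric factors at minute granularity, the later aggregation step is a plain GROUP BY plus sum() in ClickHouse.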

Log querying follows a simple flow: the frontend sends a query request, the backend retrieves data from ClickHouse, and the result is displayed via a Kibana‑inspired UI that emphasizes usability. The system provides dashboards for access‑layer log analysis, showing PV, UV, status‑code distribution, top URLs, response‑time trends, client‑IP rankings, bandwidth, and source‑traffic statistics.
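As a rough illustration of the query side, a backend might assemble a per-minute PV/UV query like the one below. Table and column names are assumptions, not the system's actual schema; uniq() is ClickHouse's approximate distinct-count aggregate, and a production backend would bind values as parameters rather than interpolating strings.

```python
def pv_uv_sql(table: str, host: str, start: str, end: str) -> str:
    """Build a per-minute PV/UV query for an access-layer dashboard (sketch).

    Illustrative only: real code should use parameterized queries to avoid
    SQL injection; names here are hypothetical.
    """
    return (
        "SELECT toStartOfMinute(Timestamp) AS minute, "
        "count() AS pv, uniq(client_ip) AS uv "
        f"FROM {table} "
        f"WHERE http_host = '{host}' "
        f"AND Timestamp BETWEEN '{start}' AND '{end}' "
        "GROUP BY minute ORDER BY minute"
    )
```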

For log monitoring, the platform defines factor expressions (e.g., a 5XX rate computed as the count of responses with status 500 or 502, divided by PV, multiplied by 100) that are compiled at runtime using Janino. The intermediate-data conversion principle maps raw logs to numeric factors, which are then aggregated per minute. An example SQL snippet for metric calculation:

SELECT http_host,
       log_factor_70 / log_factor_77 * 100 AS log_metric_63
FROM (
    SELECT http_host,
           sum(log_factor_70) AS log_factor_70,
           sum(log_factor_77) AS log_factor_77
    FROM log_monitor.monitor_filter_XXXXXXXXXX_local_all
    WHERE date = '2022-02-21 19:50:00'
    GROUP BY http_host
)
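Generating such per-minute aggregation SQL from a monitoring rule can be sketched as a template function. The rule structure and parameter names here are assumptions for illustration; the real periodic task derives them from the stored monitoring rules.

```python
def metric_sql(table: str, group_field: str, minute: str,
               numerator: str, denominator: str, metric: str) -> str:
    """Render a ratio-metric aggregation query for one minute (sketch only)."""
    # Inner query: sum each factor per group for the given minute bucket.
    inner = (
        f"SELECT {group_field}, "
        f"sum({numerator}) AS {numerator}, "
        f"sum({denominator}) AS {denominator} "
        f"FROM {table} WHERE date = '{minute}' GROUP BY {group_field}"
    )
    # Outer query: apply the factor expression to the aggregated sums.
    return (
        f"SELECT {group_field}, {numerator} / {denominator} * 100 "
        f"AS {metric} FROM ( {inner} )"
    )
```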

The resulting metric values are pushed to Tianyan for alerting and visualization. Janino enables dynamic compilation of Java expressions without restarting the Flink job, allowing real‑time updates to monitoring rules.
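Janino compiles Java expression source at runtime; as a rough Python analog (not the system's actual Janino code), the same idea looks like compiling a rule's factor expression once and then evaluating it against per-minute factor values, so rules can change without redeploying the job:

```python
class FactorExpression:
    """Python analog of a Janino-compiled factor expression (sketch only)."""

    def __init__(self, source: str) -> None:
        # e.g. source = "log_factor_70 / log_factor_77 * 100"
        # Compile once; re-evaluate cheaply for every minute bucket.
        self._code = compile(source, "<factor-expr>", "eval")

    def evaluate(self, factors: dict) -> float:
        # `factors` maps factor names to their aggregated per-minute values.
        # Builtins are stripped as a minimal guard; Janino's type-checked
        # Java compilation is a much stronger sandbox than Python eval.
        return eval(self._code, {"__builtins__": {}}, dict(factors))
```

When a monitoring rule is edited, only the expression object is rebuilt; the surrounding stream job keeps running.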

Performance metrics show end‑to‑end latency of 5‑10 seconds from log collection to query‑able state, and 90 % of multi‑condition aggregation queries finish within 1‑10 seconds on a 400 billion‑row dataset. The system’s concurrent load remains stable due to separate ClickHouse clusters per log type.

Future work includes applying machine learning for log‑based anomaly detection, enhancing Flink and ClickHouse monitoring and fault tolerance, integrating Yunji with Ark Tianyan and APM for full‑stack observability, and continuing to optimize user experience.

Tags: Cloud Native · Flink · ClickHouse · Log Monitoring · Janino · Yunji
Written by Yiche Technology

Official account of Yiche Technology, regularly sharing the team's technical practices and insights.