Risk Insight Platform Architecture and ClickHouse Implementation for Real-Time Risk Monitoring
The article presents a comprehensive risk insight platform built on ClickHouse, Flink, and intelligent algorithms, detailing its architecture, technical challenges, solutions, real-time data modeling, practical applications in fraud detection and user behavior analysis, and future optimization directions.
The risk insight platform, presented by Li Danfeng of the JD Technology Risk Management Center, leverages ClickHouse, Flink, and intelligent algorithms to establish a multi-layered, real-time risk monitoring system. It supports more than ten core risk scenarios, including fraud, credit, and anti-money-laundering, and handles up to 34 million messages per minute during peak periods.
Technical Challenges
High‑throughput real‑time data ingestion, with peak traffic reaching 120 million events per minute.
High‑performance real‑time querying over massive datasets.
Complex aggregation capabilities beyond simple single‑table operations.
Maintaining computational performance at massive data scales.
Solution Overview
After evaluating major OLAP engines (Presto, Impala, Druid, Kylin), ClickHouse was selected for its MPP architecture, LSM‑based in‑memory sorting, high compression (up to 20:1), columnar storage, sparse indexes, and vectorized execution, delivering sub‑second query responses on billions of rows. Combined with Flink for pre‑aggregation, the architecture achieves high‑throughput ingestion, fast queries, complex aggregations, and reduced storage costs.
Overall Architecture
The platform consists of modular data sources (ClickHouse, MySQL, Presto, etc.), an event bus with source‑transform‑sink plugins supporting Groovy/Python scripts, a three‑tier ClickHouse storage model (RODS, RDWM, RDWS), unified data modeling via ANSI‑SQL, algorithm services for anomaly detection and attribution, and upper‑layer risk insight applications (alerts, reports, strategy analysis).
ClickHouse Real‑Time Data Model Design
Shortened hierarchy: direct Flink‑driven pipelines from raw (RODS) to summarized (RDWM) and aggregated (RDWS) layers.
Dimension flattening: frequently used dimension fields are denormalized into wide fact tables to avoid costly joins.
Flink pre‑aggregation: real‑time metric aggregation reduces downstream query load, especially during high‑traffic events.
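The pre-aggregation idea above can be sketched without a Flink cluster. The snippet below is a minimal stand-in, assuming tumbling one-minute windows keyed by risk scenario; the function name, event shape, and window size are illustrative, not taken from the article's actual pipeline.

```python
from collections import defaultdict

def preaggregate(events, window_seconds=60):
    """Roll raw events up into tumbling windows keyed by (window_start, scenario),
    summing their counts -- a toy analogue of the Flink pre-aggregation that
    feeds the summarized (RDWM) and aggregated (RDWS) layers."""
    buckets = defaultdict(int)
    for ts, scenario, count in events:
        window_start = ts - (ts % window_seconds)  # truncate to window boundary
        buckets[(window_start, scenario)] += count
    return dict(buckets)

events = [
    (0, "fraud", 1), (30, "fraud", 2), (59, "credit", 1),
    (60, "fraud", 5), (95, "credit", 3),
]
rolled_up = preaggregate(events)
# Five raw events collapse into four aggregated rows, so downstream
# queries scan the small rollup instead of the raw stream.
```

Downstream ClickHouse queries then read the minute-level rollup rather than the raw event stream, which is what makes the high-traffic windows tractable.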
Practical Applications
1. Marketing Anti‑Fraud during Large‑Scale Promotions
Challenges included massive, complex MQ messages (≈17 KB each) and traffic spikes up to 60 times normal volume. Solutions involved splitting consumption and ClickHouse clusters by business domain, implementing dynamic batch‑write strategies, and applying Flink‑based minute‑level pre‑aggregation, which reduced raw data size by 95 % and improved query efficiency.
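ClickHouse favors a small number of large inserts over many small ones, which is the rationale behind the dynamic batch-write strategy mentioned above. The sketch below buffers rows and flushes when either a row-count or an age threshold is hit; the class name and threshold values are assumptions for illustration, not the article's actual configuration.

```python
import time

class BatchWriter:
    """Buffer rows and flush them in large batches.

    Flushes when the buffer reaches max_rows or the oldest buffered row
    is older than max_age_s -- the knobs a dynamic strategy would tune
    up or down as traffic spikes.
    """
    def __init__(self, sink, max_rows=10000, max_age_s=5.0):
        self.sink = sink            # callable that receives a list of rows
        self.max_rows = max_rows
        self.max_age_s = max_age_s
        self.buffer = []
        self.first_ts = None        # arrival time of the oldest buffered row

    def write(self, row, now=None):
        now = time.monotonic() if now is None else now
        if self.first_ts is None:
            self.first_ts = now
        self.buffer.append(row)
        if len(self.buffer) >= self.max_rows or now - self.first_ts >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink(self.buffer)  # one large insert instead of many small ones
            self.buffer = []
            self.first_ts = None

# Example: with max_rows=3, seven writes produce two full batches and a remainder.
batches = []
writer = BatchWriter(batches.append, max_rows=3, max_age_s=999.0)
for i in range(7):
    writer.write(i, now=0.0)
writer.flush()
```

Raising max_rows and max_age_s during a promotion trades a little freshness for far fewer merge-heavy small inserts on the ClickHouse side.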
2. User Behavior Path Analysis
To handle massive event logs:
A common table is generated on the fly to narrow the query scope.
ClickHouse's sparse primary index (e.g., on the user PIN) is leveraged to avoid full table scans.
Bitmap functions (bitmapCardinality, bitmapAndCardinality, etc.) provide efficient deduplication.
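The bitmap deduplication above can be illustrated in miniature: encode each set of integer user IDs as a bitmap (one bit per ID), and distinct counts and intersections become cheap bit operations. This is a conceptual sketch of what bitmapCardinality and bitmapAndCardinality compute, using plain Python integers rather than ClickHouse's roaring-bitmap implementation.

```python
def to_bitmap(ids):
    """Encode a collection of integer IDs as a bitmap; duplicates collapse
    automatically because each ID maps to a single bit."""
    bm = 0
    for i in ids:
        bm |= 1 << i
    return bm

def bitmap_cardinality(bm):
    # Analogue of bitmapCardinality: count of distinct IDs in the set.
    return bin(bm).count("1")

def bitmap_and_cardinality(a, b):
    # Analogue of bitmapAndCardinality: distinct IDs present in BOTH sets,
    # e.g. users who performed step 1 AND step 2 of a behavior path.
    return bin(a & b).count("1")

step1_users = to_bitmap([1, 2, 3, 5, 2])   # duplicate 2 counted once
step2_users = to_bitmap([2, 3, 8])
distinct_step1 = bitmap_cardinality(step1_users)
converted = bitmap_and_cardinality(step1_users, step2_users)
```

Because the per-step ID sets are pre-materialized as bitmaps, funnel intersections reduce to bitwise ANDs instead of expensive DISTINCT joins over raw logs.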
3. ClickHouse Production Operations
Adjusted Zookeeper log and snapshot retention to mitigate disk alerts.
Used local tables with VIP writes to balance disk I/O and avoid merge bottlenecks.
Explored RaftKeeper as a replacement for Zookeeper to improve cluster stability.
Future Outlook
ClickHouse demonstrates significant performance advantages for large‑scale read/write workloads in real‑time risk analysis. Ongoing efforts focus on addressing high concurrency and Zookeeper instability through cluster segmentation, SQL optimizations, and adopting newer features such as RaftKeeper to further enhance reliability and query throughput.
JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.