
Performance Optimization of WeChat's Multi‑Dimensional Monitoring Platform

After finding that most queries were time-series requests for data older than a day, the WeChat monitoring team split large Druid queries into per-day and per-hour sub-queries, backed by a multi-granularity Redis cache and sub-dimension tables. These changes pushed the cache hit rate above 85 %, cut average latency from over 1000 ms to about 140 ms, and reduced Druid load to roughly 10 % of its original volume.

Tencent Cloud Developer

WeChat's multi‑dimensional monitoring platform processes massive volumes of data (up to 45 billion events per minute and 4 trillion records per day). The original data‑layer query latency exceeded 1000 ms with a high failure rate. By analyzing user query patterns and the Druid‑based storage architecture, the team reduced average query time to the 100 ms range.

Background: The platform provides flexible data ingestion, custom dimensions, and metric aggregation for real-time monitoring. It supports two API types, dimension enumeration and time-series queries, and aggregates over 45 billion events per minute.

Data-layer architecture: Apache Druid is used as the OLAP engine. Key components include Overlord (ingestion control), Coordinator (segment distribution), MiddleManager/Peon (real-time segment creation), Historical (segment storage), DeepStorage, MetaDataStorage, and ZooKeeper.
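To make the time-series query type concrete, the sketch below builds a native Druid timeseries query body of the kind a Broker accepts. The dataSource and metric names here are invented for illustration and are not from the article.

```python
import json

def build_timeseries_query(datasource, metric, interval, granularity="hour"):
    """Build a native Druid timeseries query body (illustrative names)."""
    return {
        "queryType": "timeseries",
        "dataSource": datasource,
        "granularity": granularity,
        # Druid intervals use ISO-8601 "start/end" strings
        "intervals": [interval],
        "aggregations": [
            {"type": "doubleSum", "name": metric, "fieldName": metric}
        ],
    }

# The resulting JSON is POSTed to the Broker's /druid/v2 endpoint.
query = build_timeseries_query("monitor_events", "cost_time",
                               "2020-04-15/2020-04-16")
body = json.dumps(query)
```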

Analysis of query behavior:

Time‑series queries account for >99 % of traffic.

~90 % of queries target data older than one day.

Large segment I/O and oversized segments cause high latency.

These findings indicated that most queries could be served from cache, and that reducing broker‑wide time‑range scans and segment I/O would yield the biggest gains.

Optimization design:

Split large queries into finer‑grained sub‑queries (e.g., per‑day or per‑hour).

Introduce a Redis cache for sub‑query results, storing both the data and a cache_update_time to assess freshness.

Implement multi‑level cache granularity (day, 4 h, 1 h) to match different query resolutions.

Create sub‑dimension tables for high‑cardinality dimensions, reducing segment size.
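The Redis caching step above can be sketched as follows. The key layout, freshness window, and helper names are assumptions for illustration, not the team's actual implementation; a dict-backed stand-in replaces a real Redis client so the sketch is self-contained.

```python
import json
import time

FRESH_WINDOW_SECONDS = 300  # assumed freshness window for recent data

class DictCache:
    """Dict-backed stand-in for a Redis client (get/set only)."""
    def __init__(self):
        self._store = {}
    def get(self, key):
        return self._store.get(key)
    def set(self, key, value):
        self._store[key] = value

def cache_key(biz_id, formula, keys, start_time, end_time):
    """Derive a deterministic cache key from the sub-query parameters."""
    filters = json.dumps(keys, sort_keys=True)
    return f"subq:{biz_id}:{formula}:{filters}:{start_time}:{end_time}"

def put_cached(client, key, data, now=None):
    """Store sub-query data together with its cache_update_time."""
    entry = {"data": data,
             "cache_update_time": time.time() if now is None else now}
    client.set(key, json.dumps(entry))

def get_cached(client, key, now=None):
    """Return cached data if present and fresh enough, else None."""
    raw = client.get(key)
    if raw is None:
        return None
    entry = json.loads(raw)
    now = time.time() if now is None else now
    if now - entry["cache_update_time"] > FRESH_WINDOW_SECONDS:
        return None  # stale: re-query Druid and refresh the entry
    return entry["data"]
```

Storing `cache_update_time` alongside the data is what lets the query layer decide whether a cached sub-result is still fresh enough to serve, or whether that sub-range must fall through to Druid.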

Sub‑query request example:

{
    "biz_id": 1, // query protocol table ID
    "formula": "avg_cost_time", // metric to aggregate
    "keys": [
        {"field": "xxx_id", "relation": "eq", "value": "3"}
    ],
    "start_time": "2020-04-15 13:23",
    "end_time": "2020-04-17 12:00"
}
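A request spanning multiple days, like the one above, is first cut into day-aligned sub-ranges so that each sub-query maps onto one cache entry. This is a minimal sketch of that splitting step; the function name and the choice of day granularity are assumptions (the article also mentions 4-hour and 1-hour levels).

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"  # timestamp format used in the request example

def split_by_day(start_time: str, end_time: str):
    """Split [start_time, end_time) into day-aligned sub-ranges."""
    start = datetime.strptime(start_time, FMT)
    end = datetime.strptime(end_time, FMT)
    ranges = []
    cur = start
    while cur < end:
        # cut at the next midnight, or at the requested end, whichever first
        next_day = (cur + timedelta(days=1)).replace(hour=0, minute=0)
        seg_end = min(next_day, end)
        ranges.append((cur.strftime(FMT), seg_end.strftime(FMT)))
        cur = seg_end
    return ranges
```

For the example request, this yields three sub-queries: a partial first day, one full day, and a partial last day; only the sub-ranges missing from Redis need to reach Druid.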

Results:

Overall cache hit rate above 85 % (86 % of requests fully served from cache; 98.8 % at least partially).

Average query latency reduced from >1000 ms to ~140 ms; P95 from >5000 ms to ~220 ms.

Requests to Druid dropped to ~10 % of the original volume.

Conclusion: By decomposing queries, adding a Redis-backed sub-query cache, and employing sub-dimension tables, the platform achieved a dramatic performance improvement while maintaining accurate real-time monitoring.

Tags: monitoring, performance optimization, Big Data, caching, WeChat, Druid
Written by

Tencent Cloud Developer

Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.
