Evolution and Architecture of JD VOP Message Warehouse: From V1.0 to V3.0
This article details the design, scaling challenges, and successive upgrades of JD's VOP message warehouse—from early V1.0 bottlenecks through V2.0 sharding to the current V3.0 MongoDB‑based architecture—highlighting performance improvements, traffic governance, cost reduction, and future outlook for handling billions of daily messages.
Introduction
VOP is JD's external API platform for enterprise procurement, aiming to digitize purchasing and leverage JD's intelligent supply‑chain capabilities. The message (warehouse) system is a core component that has grown to serve over 200 internal message sources, 80+ external message types, and 3,000+ enterprise customers across product, address, invoice, order, after‑sale, and logistics domains.
Message Consumption Modes
The system supports two consumption patterns: server‑push and client‑pull. This article focuses on the client‑pull architecture and shares practical experiences from its evolution.
Message Warehouse V1.0
Early versions faced database bottlenecks under high read/write concurrency, suffering from master‑slave replication lag, limited slave capacity, and TPS spikes. Promotional events (e.g., 618, Double 11) caused message surges, forcing throttling and caching workarounds and aggravating delayed‑synchronization problems.
Message Warehouse V2.0
To overcome V1.0's limits, the architecture adopted database and table sharding. A configuration switch (managed in DUCC, JD's configuration center) keyed on clientId determines whether data is written to the new or the old database, with read fallback and deletion logic driven by ID thresholds. This multi‑master design eliminated master‑slave replication lag and allowed effectively unlimited horizontal scaling.
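The dual‑write switch with read fallback described above can be sketched as follows. This is a minimal illustration, not VOP's actual code: the class name, the in‑memory stand‑in for the DUCC config, and the threshold value are all assumptions.

```java
import java.util.Set;

/** Sketch of V2.0 routing: a config switch plus clientId decides whether a
 *  message is written to the new sharded database or the legacy one.
 *  All names and values here are illustrative, not JD's real code. */
public class ShardRouter {
    // Stand-in for a DUCC-style switch: clientIds already migrated to the new DB.
    private final Set<String> migratedClients;
    // Messages with an ID below this threshold still live in the old DB,
    // so reads for migrated clients fall back to it until that data is deleted.
    private final long migrationIdThreshold;

    public ShardRouter(Set<String> migratedClients, long migrationIdThreshold) {
        this.migratedClients = migratedClients;
        this.migrationIdThreshold = migrationIdThreshold;
    }

    public String writeTarget(String clientId) {
        return migratedClients.contains(clientId) ? "NEW_DB" : "OLD_DB";
    }

    public String readTarget(String clientId, long messageId) {
        if (!migratedClients.contains(clientId)) return "OLD_DB";
        // Migrated client: new IDs live in the new DB, older IDs fall back.
        return messageId >= migrationIdThreshold ? "NEW_DB" : "OLD_DB";
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter(Set.of("client-A"), 1_000_000L);
        System.out.println(router.writeTarget("client-A"));            // NEW_DB
        System.out.println(router.readTarget("client-A", 999L));       // OLD_DB (fallback)
        System.out.println(router.readTarget("client-B", 5_000_000L)); // OLD_DB (not migrated)
    }
}
```

Because old data is deleted once it ages past the threshold, the fallback path disappears on its own and the old database can eventually be retired without a customer‑visible cutover.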
Pain Points After V2.0
Massive data growth and extended retention (from 2‑3 days to 7 days, later to 45 days).
Field expansion leading to large JSON payloads and schema‑change difficulties.
High availability and scalability challenges due to hotspot writes.
High operational cost and lack of audit data for developers.
Goal
Build a reusable, extensible enterprise message center that ensures data integrity, low cost, high throughput, strong scalability, and zero‑impact migration for customers.
Solution Analysis
Two storage options were evaluated: MySQL + Elasticsearch vs. MongoDB. The comparison considered storage cost, development/ops effort, and performance.
1. Storage Cost
MongoDB offers built‑in data compression and avoids keeping a duplicate copy of every message, whereas the MySQL + ES option stores the data in both systems; in the team's evaluation this reduced total storage by over 50% compared to MySQL + ES.
2. Development & Operations Cost
MongoDB requires no cross‑store data synchronization, simplifying both development and operations, and its flexible schema makes field changes far less risky than MySQL DDL. MongoDB sharded clusters also scale out transparently, whereas scaling MySQL means consistent‑hash resharding and a lengthy data migration.
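To see why resharding a hash‑partitioned MySQL fleet is so costly, the sketch below (purely illustrative, not VOP code) counts how many keys change shards when a simple key‑modulo scheme grows from 4 to 5 shards: most keys move, which is exactly what forces the lengthy migration the comparison refers to.

```java
public class ReshardCost {
    /** Fraction of keys whose shard assignment changes when the shard
     *  count grows, under simple modulo hashing (key % shardCount). */
    static double remappedFraction(int keys, int oldShards, int newShards) {
        int moved = 0;
        for (int key = 0; key < keys; key++) {
            if (key % oldShards != key % newShards) moved++;
        }
        return (double) moved / keys;
    }

    public static void main(String[] args) {
        // Growing from 4 to 5 shards remaps 80% of keys under modulo hashing.
        System.out.printf("remapped: %.0f%%%n",
                remappedFraction(1_000_000, 4, 5) * 100);
    }
}
```

Consistent hashing reduces the moved fraction but still requires copying data between nodes; MongoDB's balancer does the equivalent chunk migration automatically and online.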
3. Performance
Benchmarking on identical 4C8G machines shows comparable write performance for both solutions at large data volumes. MySQL reads achieve ~6,000 QPS per shard, ES ~800 QPS, while MongoDB reads reach ~30,000 QPS per shard.
Message Warehouse V3.0
MongoDB sharded cluster was selected as the storage backbone. The system was re‑architected into four stages:
Message Reception (vop‑worker): ingests messages from ~100 internal sources, then filters, cleans, and packages them for downstream processing.
Message Transit (JMQ cluster): classifies messages into four priority levels and allocates consumer threads per level so that low‑priority traffic cannot starve critical messages.
Message Storage (vop‑msg‑store): performs dual writes to MongoDB and Elasticsearch, handling over 5 billion writes per day at roughly 60,000 write TPS and 10,000 read QPS, with tenant‑aware sharding and a 45‑day TTL.
Message Visualization (vop‑support‑platform): provides dashboards for operational insight, improving troubleshooting and customer support.
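The four‑level prioritization in the transit stage can be sketched as one isolated consumer thread pool per priority level, so a backlog of low‑priority messages never exhausts the threads serving critical ones. The level names and thread counts below are illustrative assumptions, not VOP's actual configuration.

```java
import java.util.EnumMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Illustrative sketch: dedicated consumer threads per message priority,
 *  so low-priority traffic cannot starve critical messages. */
public class PriorityConsumers {
    enum Level { CRITICAL, HIGH, NORMAL, LOW }

    // Hypothetical per-level thread budget; a real deployment would tune these.
    static final Map<Level, Integer> THREADS = new EnumMap<>(Map.of(
            Level.CRITICAL, 16,
            Level.HIGH, 8,
            Level.NORMAL, 4,
            Level.LOW, 2));

    private final Map<Level, ExecutorService> pools = new EnumMap<>(Level.class);

    public PriorityConsumers() {
        // One isolated pool per level: a flood in LOW never blocks CRITICAL.
        THREADS.forEach((level, n) -> pools.put(level, Executors.newFixedThreadPool(n)));
    }

    public void consume(Level level, Runnable handleMessage) {
        pools.get(level).submit(handleMessage);
    }

    public void shutdown() {
        pools.values().forEach(ExecutorService::shutdown);
    }
}
```

The key property is isolation, not absolute thread counts: each level degrades independently, which is what keeps order and after‑sale messages flowing during a low‑priority surge.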
Key outcomes include a daily write volume of 5 billion messages at roughly 60,000 write TPS and 10,000 read QPS, TP99 latency reduced from 100 ms to 40 ms, data retention extended from 7 to 45 days, and no increase in IT cost.
Traffic Governance, Stability, and Cost Reduction
Three focus areas were addressed:
Traffic Governance: optimized upstream calls, filtered out inactive clients, and introduced deduplication filters to cut redundant data.
System Stability: refactored hot‑path code (e.g., replacing List.contains() with Set.contains()), batched writes, processed work asynchronously, and introduced proactive downgrade queues to mitigate shard hotspots.
Cost Reduction: implemented serverless auto‑scaling driven by message rate and CPU thresholds, cutting cost by 52% during off‑peak periods.
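The hot‑path fix mentioned above matters because ArrayList.contains() is a linear scan while HashSet.contains() is an expected constant‑time hash lookup, so membership checks inside a per‑message loop go from O(n) to O(1). The sketch below shows the two behaving identically; the whitelist size and client IDs are made up for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class HotPathLookup {
    public static void main(String[] args) {
        // Membership checks against a large whitelist, e.g. active clientIds.
        List<String> listWhitelist = new ArrayList<>();
        for (int i = 0; i < 100_000; i++) listWhitelist.add("client-" + i);

        // ArrayList.contains scans up to n elements: O(n) per call.
        // HashSet.contains is a hash lookup: O(1) expected per call.
        Set<String> setWhitelist = new HashSet<>(listWhitelist);

        String probe = "client-99999";
        System.out.println(listWhitelist.contains(probe)); // true, but O(n)
        System.out.println(setWhitelist.contains(probe));  // true, O(1) expected
    }
}
```

On a hot path handling billions of messages a day, that asymptotic difference alone can account for a large share of CPU time.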
Summary & Outlook
After multiple iterations, the VOP message warehouse now reliably absorbs peak promotional traffic, with MongoDB sustaining more than 20,000 writes and 43,000 reads per second. Future work will continue to refine data governance, standardize push‑based delivery, and explore memory‑first processing for ultra‑short‑lifecycle messages.
JD Tech
Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.