Real-Time Computing at Dianping: Architecture, Use Cases, and Best Practices
During a detailed live session, senior Dianping engineer Wang Xinchun explains the company's real‑time computing platform built on Apache Storm, covering use cases such as dashboards, search and recommendation, system architecture, data ingestion tools like Blackhole and Puma, performance tuning, monitoring, and practical best‑practice recommendations.
Wang Xinchun, a senior engineer on Dianping's data platform team, introduces the real‑time computing platform that powers the company's online services, with a focus on stream processing and distributed systems.
Key business scenarios at Dianping include dashboards (e.g., Beidou reporting, WeChat public account, CloudMap), real‑time DAU across multiple apps, new user activation counts, and real‑time transaction volume for flash sales and group buying.
Industry examples are also presented: Alibaba's JStorm for Double‑11 transaction data; 360's Storm‑based solutions for captcha recognition, image thumbnail generation, intrusion detection, and hot‑word recommendation; Tencent's TDProcess, built on Storm, for KV storage and stream processing; and JD's use of Samza for order‑status analytics.
Dianping's end‑to‑end real‑time platform consists of four layers: data sources (logs, MySQL binlog, MQ); ingestion tools (Blackhole, a Kafka‑like log collector; Puma, a MySQL binlog spout; Swallow, the internal MQ); a unified Spout abstraction over those tools; and a data service that hides storage details (Redis, HBase) behind RPC interfaces.
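The idea of a data service that hides the storage engine can be sketched as follows. This is only an illustration under stated assumptions: the names KeyValueStore, InMemoryStore, DataService, incr_counter, and read_counter are hypothetical and do not come from the talk, and the in-memory store merely stands in for Redis or HBase.

```python
from abc import ABC, abstractmethod


class KeyValueStore(ABC):
    """Hypothetical storage interface; a real backend could be Redis or HBase."""

    @abstractmethod
    def get(self, key: str):
        ...

    @abstractmethod
    def put(self, key: str, value) -> None:
        ...


class InMemoryStore(KeyValueStore):
    """Stand-in backend so the sketch runs without external services."""

    def __init__(self):
        self._data = {}

    def get(self, key: str):
        return self._data.get(key)

    def put(self, key: str, value) -> None:
        self._data[key] = value


class DataService:
    """Callers see counters, not the storage engine behind them."""

    def __init__(self, store: KeyValueStore):
        self._store = store

    def incr_counter(self, name: str, delta: int = 1) -> int:
        current = self._store.get(name) or 0
        updated = current + delta
        self._store.put(name, updated)
        return updated

    def read_counter(self, name: str) -> int:
        return self._store.get(name) or 0


service = DataService(InMemoryStore())
service.incr_counter("dau:app-a", 3)
service.incr_counter("dau:app-a", 2)
```

Because topologies depend only on the service interface, the backend could in principle be swapped without touching Bolt code, which is the decoupling the architecture aims for.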
Storm fundamentals are explained: Nimbus and the Supervisors coordinate through ZooKeeper, a Spout emits tuples, Bolts process them, and the Ack mechanism guarantees at‑least‑once processing. A topology runs continuously until it is killed, unlike a batch‑oriented Hadoop job, which ends once its input has been processed.
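The Ack mechanism can be illustrated with a toy tracker. Storm's acker keeps one running XOR value per spout tuple: every tuple id is XOR-ed in once when the tuple is created and once when it is acked, so the value returns to zero exactly when the whole tuple tree has been processed. The sketch below is a simplified model, not Storm's code: the class name AckTracker and the fixed ids are assumptions made for illustration (Storm itself uses random 64-bit ids and folds emit and ack into a single update).

```python
class AckTracker:
    """Toy model of XOR-based ack tracking, one entry per spout tuple."""

    def __init__(self):
        self.pending = {}  # root id -> running XOR of tuple ids

    def _xor(self, root_id: str, tuple_id: int) -> None:
        self.pending[root_id] = self.pending.get(root_id, 0) ^ tuple_id

    def emit(self, root_id: str, tuple_id: int) -> None:
        # A tuple was created in the tree rooted at root_id.
        self._xor(root_id, tuple_id)

    def ack(self, root_id: str, tuple_id: int) -> None:
        # A bolt finished the tuple; XOR-ing the same id cancels it out.
        self._xor(root_id, tuple_id)

    def is_complete(self, root_id: str) -> bool:
        # Zero means every emitted id has been matched by an ack.
        return self.pending.get(root_id) == 0


tracker = AckTracker()
root = "msg-1"
tuple_ids = [0x1A, 0x2B, 0x3C]  # illustrative fixed ids

for tid in tuple_ids:
    tracker.emit(root, tid)
for tid in tuple_ids[:2]:
    tracker.ack(root, tid)
partially_acked = tracker.is_complete(root)  # one tuple still outstanding

tracker.ack(root, tuple_ids[2])
fully_acked = tracker.is_complete(root)
```

If the value does not reach zero before the message timeout, the spout replays the tuple, which is why delivery is at least once rather than exactly once.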
Storm’s advantages are highlighted: ease of development with clear Topology/Spout/Bolt patterns, linear scalability by increasing component parallelism, automatic worker replacement for fault tolerance, and accurate processing through the Acker and transactional mechanisms.
Performance tuning tips cover worker count (around 12 workers was reported as optimal), Netty thread and buffer settings, grouping strategies (fieldsGrouping, localOrShuffleGrouping), MaxSpoutPending sizing, and resource isolation via cgroups to prevent a single component from exhausting CPU or memory.
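Why the grouping strategy matters for performance can be seen in a small simulation. fieldsGrouping sends every tuple with the same key to the same task, so one hot key loads one task, while shuffle‑style grouping spreads tuples evenly. The sketch below is an assumption‑laden model: it uses CRC32 as the hash and round‑robin as a stand‑in for shuffle, neither of which is Storm's actual implementation, and the function names and the skewed event list are invented for illustration.

```python
import itertools
import zlib
from collections import Counter

NUM_TASKS = 4


def fields_grouping_task(key: str, num_tasks: int = NUM_TASKS) -> int:
    # Same key always maps to the same task (stable hash, modulo task count).
    return zlib.crc32(key.encode("utf-8")) % num_tasks


def shuffle_grouping_tasks(num_tasks: int = NUM_TASKS):
    # Round-robin stand-in for shuffle grouping: even spread, no key affinity.
    return itertools.cycle(range(num_tasks))


# 100 events in total, 90 of them for a single hot key.
events = ["shop-1"] * 90 + [f"shop-{i}" for i in range(2, 12)]

fields_load = Counter(fields_grouping_task(key) for key in events)

task_cycle = shuffle_grouping_tasks()
shuffle_load = Counter(next(task_cycle) for _ in events)
```

In this run the busiest task under the fields‑style router handles at least 90 of the 100 events, while the round‑robin router gives each task 25. Key affinity is still required whenever a Bolt keeps per‑key state, so the choice is a trade‑off rather than a default.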
Reliability is ensured through extensive monitoring: integration with Dianping’s Cat system, metrics such as Execute latency, Process latency, and Capacity, and alert rules for TPS drops or resource saturation.
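The Capacity metric shown in the Storm UI is the fraction of the measurement window a Bolt executor spent executing tuples: executed count times average execute latency, divided by the window length. A value near 1.0 suggests the Bolt is saturated. The sketch below computes it and adds two alert checks; the thresholds and the function names bolt_capacity, capacity_alert, and tps_dropped are illustrative assumptions, not Dianping's actual alert rules.

```python
WINDOW_MS = 10 * 60 * 1000  # 10-minute window, in milliseconds


def bolt_capacity(executed: int, execute_latency_ms: float,
                  window_ms: float = WINDOW_MS) -> float:
    """Fraction of the window spent executing tuples."""
    if window_ms <= 0:
        raise ValueError("window_ms must be positive")
    return executed * execute_latency_ms / window_ms


def capacity_alert(capacity: float, threshold: float = 0.9) -> bool:
    # Hypothetical rule: alert when the bolt is close to saturation.
    return capacity >= threshold


def tps_dropped(current_tps: float, baseline_tps: float,
                max_drop_ratio: float = 0.5) -> bool:
    # Hypothetical rule: alert when TPS falls more than max_drop_ratio
    # below the baseline.
    return baseline_tps > 0 and current_tps < baseline_tps * (1 - max_drop_ratio)


# 1,200,000 tuples at 0.4 ms each is 480,000 ms busy out of 600,000 ms,
# i.e. a capacity of about 0.8.
capacity = bolt_capacity(executed=1_200_000, execute_latency_ms=0.4)
```

A Bolt with high capacity is a candidate for more parallelism or for optimizing its execute path, which ties this metric back to the tuning advice above.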
Practical best‑practice recommendations include decoupling modules, designing stateless components, balancing throughput versus latency, focusing on hotspot optimization, and always measuring before tuning.
The Q&A session addresses common concerns: differences between Blackhole and Swallow, log format conventions, handling of transactions and idempotency, client‑side latency, MQ implementation (modified ActiveMQ with MongoDB persistence), cluster sizing (1 Nimbus + 9 Supervisors handling ~400k TPS), and storage back‑ends (primarily Redis, with HBase and MySQL as supplements).
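Idempotency matters because at‑least‑once delivery means a tuple may be replayed. One common way to keep aggregates correct is to make the update idempotent, for example by remembering which event ids have already been applied. The sketch below illustrates that general technique only; the talk summary does not say how Dianping implements it, and the class name IdempotentVolumeCounter and the order ids are hypothetical.

```python
class IdempotentVolumeCounter:
    """Sums transaction amounts, ignoring replays of an order already seen."""

    def __init__(self):
        self.seen_order_ids = set()
        self.total_cents = 0

    def apply(self, order_id: str, amount_cents: int) -> bool:
        # Returns True if the event changed the total, False if it was a replay.
        if order_id in self.seen_order_ids:
            return False
        self.seen_order_ids.add(order_id)
        self.total_cents += amount_cents
        return True


counter = IdempotentVolumeCounter()
# "order-1" is delivered twice, as can happen after a replay.
deliveries = [("order-1", 1000), ("order-2", 2500), ("order-1", 1000)]
applied = [counter.apply(order_id, amount) for order_id, amount in deliveries]
```

In a real deployment the set of seen ids would need to be bounded and stored outside the worker process, since workers can be restarted at any time.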
High Availability Architecture
Official account for High Availability Architecture.