58.com Big Data Application Practice: Architecture, Challenges, and Solutions
This article presents 58.com’s large‑scale big data platform, detailing its business scope, the WMDA one‑stop analytics system, the Wanxiang user‑portrait service, the technical challenges of massive daily data ingestion, multi‑dimensional analysis, OLAP engine selection (Kylin, Druid), bitmap‑based user‑group processing, scheduling, and overall data service architecture.
58.com, the leading classified‑information site in China, handles tens of billions of daily events, requiring a robust big‑data infrastructure to support strategic, investment, and operational decisions.
Business Scope
Multiple verticals such as real‑estate, recruitment, classifieds, social, second‑hand goods, pets, vehicles, finance, and community.
Daily traffic reaches tens of millions of unique visitors (UV), and hundreds of billions of new records are generated each day.
WMDA – One‑Stop User Behavior Analysis Platform
Provides intelligent data collection (zero‑code, manual, cross‑platform) and scenario‑driven analysis (real‑time flow, multi‑dimensional reports, retention, conversion, ad monitoring, channel operations, behavior trace).
Supports both approximate and precise UV counting, with sketch structures (count‑min sketch, HyperLogLog) backing the approximate path.
Adopted Druid as the core OLAP engine after evaluating Kylin, Pinot, and ClickHouse; Druid offers roll‑up, pre‑aggregation, columnar storage, and sub‑second multi‑dimensional queries.
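The roll‑up and pre‑aggregation idea mentioned above can be illustrated with a minimal Python sketch. It is not Druid's implementation or API: the function `roll_up`, the event fields (`ts`, `platform`, `city`, `duration`), and the hourly granularity are all assumptions chosen for illustration. The sketch truncates each timestamp to a time bucket and keeps one aggregated row per combination of bucket and dimension values.

```python
from collections import defaultdict


def roll_up(events, dimensions, granularity_seconds=3600):
    """Pre-aggregate raw events into one row per (time bucket, dimension values)."""
    rolled = defaultdict(lambda: {"count": 0, "duration_sum": 0})
    for event in events:
        # Truncate the timestamp down to the start of its time bucket.
        bucket = event["ts"] - event["ts"] % granularity_seconds
        key = (bucket,) + tuple(event[d] for d in dimensions)
        rolled[key]["count"] += 1
        rolled[key]["duration_sum"] += event["duration"]
    return dict(rolled)


# Hypothetical raw behavior events (field names are illustrative only).
raw_events = [
    {"ts": 1_700_000_000, "platform": "app", "city": "Beijing", "duration": 12},
    {"ts": 1_700_000_100, "platform": "app", "city": "Beijing", "duration": 8},
    {"ts": 1_700_000_200, "platform": "web", "city": "Beijing", "duration": 5},
    {"ts": 1_700_004_000, "platform": "app", "city": "Beijing", "duration": 20},
]

rolled = roll_up(raw_events, dimensions=["platform", "city"])

# A multi-dimensional query now scans the small rolled-up table, not raw events.
app_events = sum(
    row["count"] for key, row in rolled.items() if key[1] == "app"
)
```

Counts and sums stay correct when rolled‑up rows are combined further, but distinct‑user counts are not additive in the same way, which is one reason UV metrics rely on sketches or bitmaps rather than plain sums.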
Technical Challenges and Optimizations
Massive daily data volume (hundreds of billions of records) and hundreds of analysis dimensions.
Real‑time and offline data consistency.
Long cube construction times and high storage overhead in Kylin, which motivated the shift toward Druid.
Bitmap‑based user‑group processing using RoaringBitmap and count‑sketch to enable fast set operations and ID reverse lookup.
Segment merging, cache tuning, hot‑cold data separation, and file‑size control to improve query performance.
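The bitmap‑based user‑group processing listed above can be sketched as follows. This is a simplified stand‑in, not the platform's code: it uses a plain Python integer as an uncompressed bitmap where a production system would use a compressed structure such as RoaringBitmap, and the class `UserIndex`, the helper functions, and the sample groups are hypothetical. It shows the two pieces the text names: set operations on user groups, and reverse lookup from bit positions back to user IDs.

```python
class UserIndex:
    """Maps arbitrary user IDs to dense integer offsets and back."""

    def __init__(self):
        self._offset_of = {}  # user ID -> bit offset
        self._id_at = []      # bit offset -> user ID (reverse lookup)

    def offset(self, user_id):
        # Assign the next free offset the first time a user ID is seen.
        if user_id not in self._offset_of:
            self._offset_of[user_id] = len(self._id_at)
            self._id_at.append(user_id)
        return self._offset_of[user_id]

    def user_id(self, offset):
        return self._id_at[offset]


def to_bitmap(index, user_ids):
    """Build a bitmap (a Python int) with one bit set per user."""
    bitmap = 0
    for uid in user_ids:
        bitmap |= 1 << index.offset(uid)
    return bitmap


def to_user_ids(index, bitmap):
    """Reverse lookup: turn set bits back into user IDs, in offset order."""
    ids = []
    offset = 0
    while bitmap:
        if bitmap & 1:
            ids.append(index.user_id(offset))
        bitmap >>= 1
        offset += 1
    return ids


index = UserIndex()
viewed_housing = to_bitmap(index, ["u1", "u2", "u3"])
viewed_jobs = to_bitmap(index, ["u2", "u3", "u4"])
contacted_seller = to_bitmap(index, ["u3"])

both = viewed_housing & viewed_jobs                       # intersection
either = viewed_housing | viewed_jobs                     # union
viewed_not_contacted = viewed_housing & ~contacted_seller  # difference

group_size = bin(either).count("1")  # cardinality = number of set bits
```

Because each group is a bitmap, combining groups reduces to bitwise AND, OR, and NOT, and the dense offset table is what makes it possible to recover the original IDs afterwards.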
Wanxiang – Intelligent User‑Portrait Platform
Offers DMP + UDP capabilities: tagging, analysis, insight, and outreach, with APIs for online and offline usage.
Architecture includes data ingestion, computation, and service layers, supporting multi‑tenant isolation.
Handles high‑throughput user‑group extraction via bitmap, Elasticsearch, and Parquet + Spark engines.
Provides scheduling (periodic and trigger‑based) using 58DP, Kettle, and TaskServer.
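The difference between periodic and trigger‑based scheduling can be shown with a small Python sketch. It does not reflect the interfaces of 58DP, Kettle, or TaskServer; the `Task` class, the `due_tasks` function, and the task names are hypothetical. A periodic task becomes due when its interval has elapsed, while a triggered task becomes due once all of its upstream tasks have finished.

```python
class Task:
    def __init__(self, name, interval=None, upstream=()):
        self.name = name
        self.interval = interval          # seconds between periodic runs, or None
        self.upstream = tuple(upstream)   # names of tasks that trigger this one
        self.last_run = None              # time of the last periodic run


def due_tasks(tasks, now, finished):
    """Return the names of tasks that should start at time `now`.

    `finished` is the set of task names already completed in this cycle.
    """
    due = []
    for task in tasks:
        if task.name in finished:
            continue
        if task.interval is not None:
            # Periodic: due on first run or once the interval has elapsed.
            if task.last_run is None or now - task.last_run >= task.interval:
                due.append(task.name)
        elif task.upstream and all(u in finished for u in task.upstream):
            # Trigger-based: due once every upstream task has finished.
            due.append(task.name)
    return due


# Hypothetical pipeline: import logs hourly, then build tags, then export a group.
tasks = [
    Task("import_logs", interval=3600),
    Task("build_tags", upstream=["import_logs"]),
    Task("export_group", upstream=["build_tags"]),
]

step1 = due_tasks(tasks, now=0, finished=set())
step2 = due_tasks(tasks, now=60, finished={"import_logs"})
step3 = due_tasks(tasks, now=120, finished={"import_logs", "build_tags"})
```

In this sketch only the first task depends on the clock; the downstream tasks start as soon as their inputs are ready, which avoids guessing how long an upstream job will take.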
Data Service Layer
Defines hundreds of unified APIs for detail queries, distribution analysis, file download, and batch traversal, ensuring accurate, timely, and performant data delivery.
Conclusion
The evolution from monolithic to modular big‑data architectures at 58.com demonstrates the necessity of scalable storage, real‑time‑offline hybrid processing, and flexible service interfaces to meet diverse analytical needs.
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.