Building Big Data Infrastructure at Baidu Aifanfan: Architecture Practices and Lessons Learned
At Baidu Aifanfan, the data team built a unified real‑time and offline big‑data platform—leveraging Watt, Bigpipe, Fengge, AFS and Palo within Lambda/Kappa patterns and a fast‑slow parallel rollout—that cut OLAP query latency from 18 minutes to under 15 seconds, enabled self‑service analytics, and standardized metrics across 15 agile teams.
This article describes how Baidu Aifanfan's data team built real-time and offline big data infrastructure platforms to empower business operations with valuable data products and services. It covers the journey of addressing challenges in business, technology, and organization dimensions.
Key Terminology:
Watt: Data flow open platform for integrating MySQL and BaikalDB Binlog logs, supporting multi-table Join and UDF extensions.
Fengge Platform: Data asset center and R&D governance platform built by Baidu's Commercial Platform R&D department.
Bigpipe (BP): Distributed data transmission system for real-time messaging and log transfer with decoupling capabilities.
AFS: Big data file storage system similar to HDFS.
Palo: MPP data warehouse based on Apache Doris, supporting high-concurrency low-latency queries for PB-level datasets.
Challenges Faced:
Business challenges include providing valuable data products with timeliness and richness. Technical challenges involve handling large-scale business data across sharded databases, scattered metadata, redundant data storage, limited development resources, and ensuring stability of hundreds of scheduled tasks. Organizational challenges include supporting 15 agile teams with varying OKRs and inconsistent metric definitions.
Architecture Evolution:
The team adopted Lambda and Kappa architectures as foundational patterns. They implemented a "fast and slow parallel" approach: Version 1.0 addressed urgent business needs using Watt for CDC, while Version 2.0 integrated with the Fengge platform for comprehensive data warehouse management, metadata governance, and self-service query capabilities.
Key Technical Implementations:
1) Real-time processing improvements using Spark Streaming/Flink with Bigpipe for disaster recovery
2) Data warehouse modeling using Kimball dimensional methodology with star schema design
3) Data governance covering asset management, quality monitoring, lineage tracking, and permission control
4) Marketing effectiveness analysis migrated from Impala+Kudu to Palo (Doris) for better performance
5) Real-time capabilities enhanced with Flink to Palo and Kafka to Palo data pipelines using Stream Load and Routine Load methods
Business Impact:
The architecture reduced OLAP query latency from 18 minutes to 10-15 seconds, enabled self-service data queries for product and operations teams, unified metric definitions across the organization, and significantly improved data product value for customers.
Baidu Geek Talk
Follow us to discover more Baidu tech insights.
How this landed with the community
Was this worth your time?
0 Comments
Thoughtful readers leave field notes, pushback, and hard-won operational detail here.