Building and Optimizing the Offline Computing Platform at Autohome: Challenges, Solutions, and Future Plans
This article details the evolution of Autohome's offline computing platform from a 50‑node cluster in 2013 to a multi‑thousand‑node Hadoop ecosystem, describing performance and stability challenges, multi‑tenant operational issues, low resource utilization, and the comprehensive technical solutions and future roadmap implemented to address them.
The presentation introduces the development history of Autohome's offline computing platform, which started in 2013 with about 50 nodes for advertising data analysis and grew to over 3,000 nodes by 2021, handling up to 350,000 daily jobs and several petabytes of data.
Current operational status shows stable performance: core data-warehouse tasks now complete about an hour earlier than before, the small-file ratio is trending downward, and average MapReduce job duration hovers around 100 seconds.
Key problems encountered during platform construction include:
Excessive Hive partitions causing slow queries.
Massive small files and non‑standard user behavior leading to Namenode overload.
Rapid business growth outpacing compute resources, resulting in SLA violations.
Low overall utilization of offline compute resources due to tidal usage patterns.
Solutions implemented:
Metastore routing layer: a unified routing service forwards Hive DB requests to appropriate Metastore instances, supporting horizontal scaling and multi‑cluster queries.
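The routing idea can be reduced to a lookup from Hive database name to a Metastore endpoint. A minimal sketch, assuming a hypothetical routing table (instance names and addresses are invented for illustration; a real routing service would load this mapping from a config store and forward Thrift requests):

```python
# Hypothetical routing table: Hive database name -> Metastore instance.
# A production router would fetch this from a central config service and
# proxy Thrift traffic; here we model only the lookup logic.
METASTORE_INSTANCES = {
    "ads": "metastore-a:9083",
    "dw": "metastore-b:9083",
}
DEFAULT_INSTANCE = "metastore-default:9083"

def route_request(db_name: str) -> str:
    """Return the Metastore endpoint responsible for a Hive database."""
    return METASTORE_INSTANCES.get(db_name, DEFAULT_INSTANCE)
```

Because the mapping is external to any single Metastore process, new instances can be added (horizontal scaling) by extending the table, and databases living on different clusters can be queried through one entry point.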
Migration from ViewFS to RBF (Router‑Based Federation): centralized configuration, hot‑reload without service restarts, multi‑mount support, and cross‑cluster data moves.
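Conceptually, an HDFS Router resolves a client path against a centrally stored mount table by longest-prefix match, which is what makes hot-reload and multi-mount possible without touching client configs. A toy illustration of that resolution step (the mount entries and nameservice names are invented):

```python
# Toy RBF-style mount table: mount point -> (nameservice, target path).
# In real RBF this lives in the Router's State Store, not client-side config.
MOUNT_TABLE = {
    "/user": ("ns1", "/user"),
    "/user/logs": ("ns2", "/data/logs"),
}

def resolve(path: str):
    """Longest-prefix match of `path` against the mount table, mimicking
    how a Router picks the backing nameservice for a client request."""
    best = None
    for mount, (ns, target) in MOUNT_TABLE.items():
        if path == mount or path.startswith(mount + "/"):
            if best is None or len(mount) > len(best[0]):
                best = (mount, ns, target)
    if best is None:
        raise FileNotFoundError(path)
    mount, ns, target = best
    return ns, target + path[len(mount):]
```

Since clients only see the router-side table, remounting a subtree onto another nameservice (a cross-cluster data move) is a table update rather than a client-side ViewFS config rollout.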
Big‑data governance: defining storage, queue, and table standards; scoring teams on storage, compute, and task usage; providing self‑service tools for small‑file cleanup and data compression.
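A team score of this kind is typically a weighted blend of per-dimension metrics. A minimal sketch, assuming three inputs normalized to [0, 1] and illustrative weights (the talk does not disclose Autohome's actual formula):

```python
def governance_score(storage_util: float, compute_util: float,
                     task_health: float,
                     weights=(0.4, 0.3, 0.3)) -> float:
    """Weighted governance score on a 0-100 scale.

    storage_util, compute_util, task_health are assumed to be normalized
    to [0, 1]; the weights are hypothetical, not Autohome's.
    """
    w_s, w_c, w_t = weights
    return round(100 * (w_s * storage_util
                        + w_c * compute_util
                        + w_t * task_health), 1)
```

Publishing such a score per team turns governance from a central cleanup chore into a ranking that teams can improve themselves with the self-service tools.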
Resource optimization: Cgroup mode switched from strict to share, enabling 30 % over‑provisioning without CPU throttling; offline‑online mixed deployment to improve utilization.
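In share mode, CPU is arbitrated by relative cgroup shares instead of hard quotas, so a node can advertise more vcores than it has physical cores without jobs being hard-throttled. A sketch of the over-provisioning arithmetic, using the 30% ratio cited above (the formula itself is our illustration, not a Hadoop API):

```python
def advertised_vcores(physical_cores: int, overcommit: float = 0.30) -> int:
    """Vcores a NodeManager could advertise under share-mode cgroups.

    With hard quotas (strict mode), overcommit would cause throttling;
    with cpu.shares, contention only slows containers proportionally.
    The 30% default matches the ratio cited in the talk.
    """
    return int(physical_cores * (1 + overcommit))
```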
Automatic resource scaling: nightly expansion of offline resources and graceful container reclamation.
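The nightly expansion boils down to a time-window check that wraps past midnight. A minimal sketch, assuming a hypothetical 22:00-06:00 window (the talk gives no exact hours):

```python
from datetime import time

# Illustrative overnight window during which offline jobs borrow extra
# nodes; the actual hours are not stated in the talk.
EXPAND_START, EXPAND_END = time(22, 0), time(6, 0)

def in_expansion_window(now: time) -> bool:
    """True during the overnight expansion window (wraps past midnight)."""
    return now >= EXPAND_START or now < EXPAND_END
```

Outside the window, borrowed nodes are reclaimed gracefully: no new containers are scheduled onto them, and they are returned only after running containers drain.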
Future plans include upgrading all clusters to Hadoop 3.2.1, implementing Namenode read/write separation, advanced offline‑online isolation features, and launching AI on Hadoop to migrate CPU‑intensive training jobs to the offline cluster.
The Q&A session addressed HiveServer2 OOM concerns, Cgroup impact on users, AI‑on‑Hadoop comparison, and rolling upgrade strategies.
Finally, the presenters thanked the audience and promoted the DataFunTalk community for big‑data and AI knowledge sharing.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.