Architecture and Implementation of Autohome's Machine Learning Platform
The article presents a comprehensive overview of Autohome's one‑stop machine learning platform, detailing its background, architecture, resource scheduling, data processing, model training (including distributed deep learning), deployment, real‑world applications such as purchase‑intent and recommendation models, and future development directions.
Autohome, a leading automotive internet service platform, built a one‑stop machine learning platform to support a full AI workflow—including data ingestion, preprocessing, model development, training, evaluation, and online serving—aiming to accelerate intelligent business capabilities.
Background: Early on, algorithm teams each operated their own isolated servers, leading to inefficient resource usage and duplicated effort. The platform was created to unify compute scheduling and provide a visual modeling environment for both engineers and product operators.
Key Goals:
• Generalized development – reusable components to avoid reinventing wheels.
• Simplified modeling – drag‑and‑drop pipelines for data import, preprocessing, modeling, and evaluation.
• Data visualization – visual support for datasets, analysis, computation graphs, training progress, and model performance.
Overall Architecture: The platform integrates high‑performance CPU clusters for traditional ML and cloud GPU clusters for deep learning. Resource scheduling uses YARN for Spark‑based ML jobs and Kubernetes for TensorFlow/PyTorch jobs. Storage layers (Hive, HDFS) host sample, feature, and model libraries. Over 100 algorithm components (preprocessing, feature engineering, classification, regression, etc.) are exposed via a unified API, and an interactive Notebook enhances developer productivity.
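Exposing 100+ components through one API implies a common component contract. The platform's real classes are not public; a minimal sketch of what such a contract could look like, with an illustrative toy component (all names here are assumptions):

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Hypothetical unified component interface: consume a dataset,
    produce a dataset, so components compose into a pipeline."""

    @abstractmethod
    def run(self, dataset):
        ...


class MeanCenter(Component):
    """Toy preprocessing component: shift each numeric column to zero mean."""

    def run(self, dataset):
        cols = list(zip(*dataset))                    # column-wise view
        means = [sum(c) / len(c) for c in cols]
        return [tuple(x - m for x, m in zip(row, means)) for row in dataset]


centered = MeanCenter().run([(1.0, 10.0), (3.0, 30.0)])
# each column of `centered` now sums to zero
```

Under a contract like this, chaining components is just function composition, which is what makes drag‑and‑drop pipeline assembly feasible.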
Machine Learning Modeling Process: Users configure pipelines via a web UI or CLI, which are serialized to JSON and submitted to the backend. Spark‑submit executes ML jobs on YARN, while the ML Engine orchestrates component execution across Spark executors.
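A UI‑configured pipeline might serialize to JSON along these lines; the field names and component identifiers below are illustrative assumptions, not the platform's actual schema:

```python
import json

# Hypothetical pipeline description as a graph of nodes and edges,
# roughly what a drag-and-drop frontend could emit.
pipeline = {
    "name": "purchase_intent_v1",
    "nodes": [
        {"id": "src",  "component": "hive_reader",     "params": {"table": "samples.purchase"}},
        {"id": "fe",   "component": "feature_hashing", "params": {"dim": 2 ** 20}},
        {"id": "gbdt", "component": "gbdt_classifier", "params": {"trees": 100}},
    ],
    "edges": [["src", "fe"], ["fe", "gbdt"]],
}

payload = json.dumps(pipeline)   # what the frontend submits
restored = json.loads(payload)   # what the backend hands to the ML Engine
assert restored == pipeline
```

The backend would then walk the edges in topological order and map each node to the corresponding Spark stage before launching the job via spark‑submit.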
Deep Learning Training: GPU resources are allocated via Kubernetes. The platform supports TensorFlow, PaddlePaddle, MXNet, Caffe, and others, enabling both single‑node and distributed training. Distributed training follows a parameter‑server architecture, with workers performing computation and PS nodes aggregating parameters.
Example of a TensorFlow cluster specification used by the platform:
tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"],
})
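In a parameter‑server job, every process typically receives the same cluster map and derives its own role (worker or ps) and task index from it. A minimal stdlib sketch of that lookup, using the host names from the spec above (the helper function is illustrative, not part of TensorFlow):

```python
# Cluster map matching the ClusterSpec shown above.
cluster = {
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"],
}


def role_of(addr, cluster):
    """Return (job_name, task_index) for this process's own address."""
    for job, hosts in cluster.items():
        if addr in hosts:
            return job, hosts.index(addr)
    raise ValueError(f"{addr} not in cluster spec")


assert role_of("ps1.example.com:2222", cluster) == ("ps", 1)
assert role_of("worker2.example.com:2222", cluster) == ("worker", 2)
```

In TensorFlow 1.x each process would then start a server with its resolved role, e.g. tf.train.Server(cluster, job_name="ps", task_index=1), with workers computing gradients and PS nodes holding and updating the shared parameters.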
Model Deployment and Management: Trained models are exported in PMML format to HDFS and can be deployed with a single click. Deep learning models are served via a ModelZoo using standard serving images or custom containers, supporting automatic scaling and dynamic resource release.
Platform Impact and Applications: The platform powers various business scenarios, including a GBDT‑based purchase‑intent model (165 million records, 85/15 train‑test split) and a recommendation ranking pipeline supporting traditional models (LR, GBDT, XGB) and deep models (FM, Wide&Deep, DeepFM, DIN). Real‑time training pipelines ingest logs via Flink, retrain every ten minutes, evaluate, and update models automatically, achieving minute‑level iteration.
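An automatic retrain‑evaluate‑update loop implies a promotion gate: the freshly trained model should replace the serving model only if its evaluation metric holds up. A sketch of such a gate, where both the thresholds and the function itself are invented for illustration rather than taken from Autohome's pipeline:

```python
def should_promote(new_auc, serving_auc, min_auc=0.70, max_drop=0.005):
    """Decide whether a freshly retrained model may replace the serving one.

    Rejects the new model if it falls below an absolute AUC floor, or if it
    regresses by more than max_drop versus the currently serving model.
    All threshold values here are illustrative.
    """
    if new_auc < min_auc:
        return False
    return new_auc >= serving_auc - max_drop


assert should_promote(0.761, 0.760) is True    # slight improvement: promote
assert should_promote(0.740, 0.760) is False   # clear regression: keep serving model
```

A gate like this is what makes ten‑minute retraining safe to fully automate: a bad batch of logs degrades one candidate model, not the live service.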
Future Outlook: Planned enhancements include adding more algorithm components (e.g., online learning) and implementing GPU sharing through virtualization or custom Kubernetes scheduling to improve utilization for both training and serving workloads.
Authors: Tian Dongtao, Wang Ruoyu, Fang Ju – senior algorithm engineers at Autohome. Contact: [email protected].
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.