Architecture and Implementation of Autohome's Machine Learning Platform
The article presents a comprehensive overview of Autohome's one‑stop machine learning platform, detailing its background, architecture, resource scheduling, data processing, model training (including distributed deep learning), deployment, real‑world applications such as purchase‑intent and recommendation models, and future development directions.
Autohome, a leading automotive internet service platform, built a one‑stop machine learning platform to support a full AI workflow—including data ingestion, preprocessing, model development, training, evaluation, and online serving—aiming to accelerate intelligent business capabilities.
Background: Early on, algorithm teams each operated their own isolated servers, leading to inefficient resource usage and duplicated effort. The platform was created to unify compute scheduling and provide a visual modeling environment for both engineers and product operators.
Key Goals:
• Generalized development – reusable components to avoid reinventing wheels.
• Simplified modeling – drag‑and‑drop pipelines for data import, preprocessing, modeling, and evaluation.
• Data visualization – visual support for datasets, analysis, computation graphs, training progress, and model performance.
Overall Architecture: The platform integrates high‑performance CPU clusters for traditional ML and cloud GPU clusters for deep learning. Resource scheduling uses YARN for Spark‑based ML jobs and Kubernetes for TensorFlow/PyTorch jobs. Storage layers (Hive, HDFS) host sample, feature, and model libraries. Over 100 algorithm components (preprocessing, feature engineering, classification, regression, etc.) are exposed via a unified API, and an interactive Notebook enhances developer productivity.
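Exposing 100+ components through one API implies a common component contract. The platform's real classes are not public; a minimal sketch of what such a contract could look like, with an illustrative toy component (all names here are assumptions):

```python
from abc import ABC, abstractmethod


class Component(ABC):
    """Hypothetical unified component interface: consume a dataset,
    produce a dataset, so components compose into a pipeline."""

    @abstractmethod
    def run(self, dataset):
        ...


class MeanCenter(Component):
    """Toy preprocessing component: shift each numeric column to zero mean."""

    def run(self, dataset):
        cols = list(zip(*dataset))                    # column-wise view
        means = [sum(c) / len(c) for c in cols]
        return [tuple(x - m for x, m in zip(row, means)) for row in dataset]


centered = MeanCenter().run([(1.0, 10.0), (3.0, 30.0)])
# each column of `centered` now sums to zero
```

Under a contract like this, chaining components is just function composition, which is what makes drag‑and‑drop pipeline assembly feasible.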
Machine Learning Modeling Process: Users configure pipelines via a web UI or CLI, which are serialized to JSON and submitted to the backend. Spark‑submit executes ML jobs on YARN, while the ML Engine orchestrates component execution across Spark executors.
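A UI‑configured pipeline might serialize to JSON along these lines; the field names and component identifiers below are illustrative assumptions, not the platform's actual schema:

```python
import json

# Hypothetical pipeline description as a graph of nodes and edges,
# roughly what a drag-and-drop frontend could emit.
pipeline = {
    "name": "purchase_intent_v1",
    "nodes": [
        {"id": "src",  "component": "hive_reader",     "params": {"table": "samples.purchase"}},
        {"id": "fe",   "component": "feature_hashing", "params": {"dim": 2 ** 20}},
        {"id": "gbdt", "component": "gbdt_classifier", "params": {"trees": 100}},
    ],
    "edges": [["src", "fe"], ["fe", "gbdt"]],
}

payload = json.dumps(pipeline)   # what the frontend submits
restored = json.loads(payload)   # what the backend hands to the ML Engine
assert restored == pipeline
```

The backend would then walk the edges in topological order and map each node to the corresponding Spark stage before launching the job via spark‑submit.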
Deep Learning Training: GPU resources are allocated via Kubernetes. The platform supports TensorFlow, PaddlePaddle, MXNet, Caffe, and others, enabling both single‑node and distributed training. Distributed training follows a parameter‑server architecture, with workers performing computation and PS nodes aggregating parameters.
Example of a TensorFlow cluster specification used by the platform:
tf.train.ClusterSpec({
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"],
})
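In a parameter‑server job, every process typically receives the same cluster map and derives its own role (worker or ps) and task index from it. A minimal stdlib sketch of that lookup, using the host names from the spec above (the helper function is illustrative, not part of TensorFlow):

```python
# Cluster map matching the ClusterSpec shown above.
cluster = {
    "worker": ["worker0.example.com:2222",
               "worker1.example.com:2222",
               "worker2.example.com:2222"],
    "ps": ["ps0.example.com:2222",
           "ps1.example.com:2222"],
}


def role_of(addr, cluster):
    """Return (job_name, task_index) for this process's own address."""
    for job, hosts in cluster.items():
        if addr in hosts:
            return job, hosts.index(addr)
    raise ValueError(f"{addr} not in cluster spec")


assert role_of("ps1.example.com:2222", cluster) == ("ps", 1)
assert role_of("worker2.example.com:2222", cluster) == ("worker", 2)
```

In TensorFlow 1.x each process would then start a server with its resolved role, e.g. tf.train.Server(cluster, job_name="ps", task_index=1), with workers computing gradients and PS nodes holding and updating the shared parameters.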
Model Deployment and Management: Trained models are exported in PMML format to HDFS and can be deployed with a single click. Deep learning models are served via a ModelZoo using standard serving images or custom containers, supporting automatic scaling and dynamic resource release.
Platform Impact and Applications: The platform powers various business scenarios, including a GBDT‑based purchase‑intent model (165 million records, 85/15 train‑test split) and a recommendation ranking pipeline supporting traditional models (LR, GBDT, XGB) and deep models (FM, Wide&Deep, DeepFM, DIN). Real‑time training pipelines ingest logs via Flink, retrain every ten minutes, evaluate, and update models automatically, achieving minute‑level iteration.
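An automatic retrain‑evaluate‑update loop implies a promotion gate: the freshly trained model should replace the serving model only if its evaluation metric holds up. A sketch of such a gate, where both the thresholds and the function itself are invented for illustration rather than taken from Autohome's pipeline:

```python
def should_promote(new_auc, serving_auc, min_auc=0.70, max_drop=0.005):
    """Decide whether a freshly retrained model may replace the serving one.

    Rejects the new model if it falls below an absolute AUC floor, or if it
    regresses by more than max_drop versus the currently serving model.
    All threshold values here are illustrative.
    """
    if new_auc < min_auc:
        return False
    return new_auc >= serving_auc - max_drop


assert should_promote(0.761, 0.760) is True    # slight improvement: promote
assert should_promote(0.740, 0.760) is False   # clear regression: keep serving model
```

A gate like this is what makes ten‑minute retraining safe to fully automate: a bad batch of logs degrades one candidate model, not the live service.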
Future Outlook: Planned enhancements include adding more algorithm components (e.g., online learning) and implementing GPU sharing through virtualization or custom Kubernetes scheduling to improve utilization for both training and serving workloads.
Authors: Tian Dongtao, Wang Ruoyu, Fang Ju – senior algorithm engineers at Autohome. Contact: [email protected].
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.