EasyRec Recommendation Algorithm Training and Inference Optimization
This article presents a comprehensive overview of EasyRec's recommendation system architecture, detailing training and inference optimizations, distributed deployment strategies, operator fusion techniques, online learning pipelines, and network-level improvements to enhance performance and scalability.
Before introducing EasyRec's training and inference architecture, the article reviews recent trends in recommendation models: exploding feature counts, larger embeddings, longer sequences, and increasingly complex dense layers, all of which drive up compute demand and raise training and inference costs.
The EasyRec training and inference framework consists of data, embedding, dense, and output layers, and can run on platforms like MaxCompute, EMR, and DLC. It offers configurable, componentized design with deep Keras support, distributed training, online learning (ODL), NNI auto‑tuning, multi‑optimizer settings, feature hot‑start, work‑queue checkpoint recovery, and a distributed evaluator for large‑scale model evaluation.
The inference side is powered by the PAI‑REC engine, a Go‑based modular system that links stages such as recall, ranking, re‑ranking, and shuffling, providing user‑friendly interfaces for A/B experiments, feature‑consistency diagnostics, and feature/experiment analysis.
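The PAI‑REC engine itself is written in Go, but the stage-chaining idea it describes can be sketched in a few lines. The following is a minimal illustration, not PAI‑REC's actual API: the stage names, signatures, and candidate representation are all assumptions made for the example.

```python
from typing import Callable, List

# A stage takes a candidate list and returns a (filtered/reordered) list.
Stage = Callable[[List[str]], List[str]]

def run_pipeline(candidates: List[str], stages: List[Stage]) -> List[str]:
    """Pass the candidate set through each stage in order, as a
    recall -> ranking -> re-ranking chain would."""
    for stage in stages:
        candidates = stage(candidates)
    return candidates

recall = lambda items: items[:100]   # narrow the pool to top candidates
rank = lambda items: sorted(items)   # placeholder scoring: order by ID
rerank = lambda items: items[:10]    # final diversity / business cut

result = run_pipeline([f"item{i:03d}" for i in range(500)],
                      [recall, rank, rerank])
```

Modularity here means each stage is swappable in configuration, which is what makes A/B experiments on individual stages straightforward.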
EasyRecProcessor handles online inference for recall and ranking models, comprising an item feature cache, feature generator, and TensorFlow model. It applies extensive CPU and GPU optimizations, including feature caching to reduce network traffic, incremental model updates, and optimized feature generation pipelines.
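The traffic saving from the item feature cache comes from fetching item features remotely only on a miss, so repeated requests for hot items never touch the network. A minimal sketch of that pattern follows; the class name, `fetch_fn` callback, and feature layout are assumptions for illustration, not EasyRecProcessor's real interface.

```python
class ItemFeatureCache:
    """Cache item features locally so that, per request, only user
    features and item IDs need to travel over the network."""

    def __init__(self, fetch_fn):
        self._cache = {}
        self._fetch = fetch_fn  # remote lookup, called only on cache misses

    def get(self, item_ids):
        missing = [i for i in item_ids if i not in self._cache]
        if missing:
            self._cache.update(self._fetch(missing))
        return {i: self._cache[i] for i in item_ids}

# Usage: the remote fetch is hit only for IDs not yet cached.
calls = []
def fake_fetch(ids):
    calls.append(list(ids))
    return {i: {"price": i * 10} for i in ids}

cache = ItemFeatureCache(fake_fetch)
cache.get([1, 2, 3])
cache.get([2, 3, 4])   # only item 4 triggers a remote fetch
```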
Training optimizations include sequence‑feature deduplication (reducing batch size by 90‑95%), embedding parallelism that moves from PS‑Worker to All‑Reduce or hybrid schemes, lock‑free hash tables on CPU, GPU‑cached sparse embeddings via HugeCTR, and AMX‑based BF16 acceleration achieving up to 16× speedup on matrix operations.
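The intuition behind sequence-feature deduplication is that a ranking batch often scores many items for the same user, so the user's behavior sequence is repeated across rows. Processing each distinct sequence once and keeping an index back to the rows removes most of the duplicated work. A toy sketch of that idea, with function and variable names invented for the example:

```python
def dedup_sequences(sequences):
    """Keep one copy of each distinct sequence plus an index that maps
    every original row back to its unique sequence."""
    uniques, index, seen = [], [], {}
    for seq in sequences:
        key = tuple(seq)
        if key not in seen:
            seen[key] = len(uniques)
            uniques.append(seq)
        index.append(seen[key])
    return uniques, index

# Scoring 4 items for one user repeats that user's click sequence 4 times;
# after dedup the sequence tower runs on 2 rows instead of 5.
batch = [[101, 102, 103]] * 4 + [[200, 201]]
uniques, index = dedup_sequences(batch)
```

In the real system the per-unique-sequence embeddings would be gathered back via `index` to restore per-row outputs.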
Inference optimizations focus on operator fusion and AVX acceleration, BF16 quantization with minimal AUC impact, XLA and TensorRT (TRT) integration for kernel launch reduction and dynamic‑shape handling, and batch processing that aggregates small batches before GPU execution, dramatically improving QPS and latency.
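The batch-aggregation idea can be shown in miniature: merge several small request batches, run one forward pass, and split the results back per request, paying one kernel-launch overhead instead of one per request. The sketch below uses a stand-in `model_forward` and invented function names; it illustrates the shape of the technique, not the processor's implementation.

```python
def model_forward(inputs):
    # Stand-in for a single GPU forward pass over the merged batch.
    return [x * 2 for x in inputs]

def aggregate_batches(requests, forward):
    """Merge small request batches into one, run a single forward pass,
    then split the results back in request order."""
    merged, sizes = [], []
    for req in requests:
        merged.extend(req)
        sizes.append(len(req))
    results = forward(merged)      # one launch instead of len(requests)
    out, pos = [], 0
    for n in sizes:
        out.append(results[pos:pos + n])
        pos += n
    return out

# Three small requests become one forward pass.
replies = aggregate_batches([[1, 2], [3], [4, 5, 6]], model_forward)
```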
Network‑level improvements replace Nginx load balancing with direct pod‑IP connections to cut round‑trip time, while Snappy/ZSTD request compression cuts bandwidth usage by as much as 5× in high‑throughput scenarios.
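Request compression pays off because feature-heavy payloads are highly repetitive. The article names Snappy and ZSTD, which in Python need third-party bindings (`python-snappy`, `zstandard`); the sketch below uses stdlib `zlib` as a stand-in purely to show the compress-on-send, decompress-on-receive shape.

```python
import json
import zlib

def compress_request(payload: dict) -> bytes:
    """Serialize and compress a request body before sending."""
    raw = json.dumps(payload).encode("utf-8")
    return zlib.compress(raw)

def decompress_request(blob: bytes) -> dict:
    """Decompress and parse a request body on the receiving side."""
    return json.loads(zlib.decompress(blob).decode("utf-8"))

# Feature payloads with many repeated keys compress well.
payload = {"user": 42,
           "items": [{"id": i, "score": 0.5} for i in range(200)]}
blob = compress_request(payload)
```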
Online learning is realized through a real‑time pipeline: logs flow back via PAI‑REC to SLS, features are collected in Datahub, Flink aggregates samples and labels, and the training system periodically pulls data, updates incremental parameters, and syncs them to the processor, employing LZ4 compression, feature‑consistency checks, and delay‑correction mechanisms.
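The final hop of that pipeline, syncing incremental parameters to the processor, amounts to merging a sparse delta into the serving-side table while guarding against stale or duplicate pushes. A minimal sketch under those assumptions; the versioned-push format and function name are invented for illustration:

```python
def apply_incremental_update(table, push, last_version):
    """Merge a versioned incremental parameter push into the serving
    table, ignoring pushes at or below the already-applied version."""
    if push["version"] <= last_version:
        return last_version            # stale/duplicate push: skip
    table.update(push["params"])       # overwrite only changed entries
    return push["version"]

serving = {"emb/user/1": [0.10, 0.20], "emb/item/9": [0.30, 0.40]}
push = {"version": 7, "params": {"emb/user/1": [0.15, 0.18]}}

version = apply_incremental_update(serving, push, last_version=6)
# Replaying the same push is a no-op thanks to the version check.
version = apply_incremental_update(serving, push, last_version=version)
```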
The article concludes with references to EasyRec and Processor documentation, highlighting the end‑to‑end solution for large‑scale recommendation systems on Alibaba Cloud.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.