
EasyRec Recommendation Algorithm Training and Inference Optimization

This article presents a comprehensive overview of EasyRec’s recommendation system architecture, detailing training and inference optimizations, embedding parallelism, CPU/GPU placement strategies, online learning pipelines, and network compression techniques that together improve scalability, latency, and cost efficiency.

DataFunSummit

The presentation introduces the EasyRec recommendation algorithm training and inference architecture, outlining the growing complexity of modern recommendation models—more features, larger embeddings, longer sequences, and deeper dense layers—and the resulting challenges of compute scarcity and high training/inference costs.

EasyRec’s overall framework consists of a data layer, embedding layer, dense layer, and output layer, supporting configurable, component‑based deployment on platforms such as MaxCompute, EMR, and DLC, with features like multi‑optimizer, learning‑rate scheduling, feature hot‑start, large‑scale negative sampling, and work‑queue‑based checkpoint recovery.
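To make the component-based configuration concrete, here is a minimal, illustrative fragment in EasyRec's protobuf-text config style; the field values are hypothetical and the fields shown are a small subset of what the framework supports:

```protobuf
train_config {
  optimizer_config {
    adam_optimizer {
      learning_rate { constant_learning_rate { learning_rate: 0.001 } }
    }
  }
  num_steps: 100000
}
data_config {
  batch_size: 1024
}
feature_configs {
  input_names: "user_id"
  feature_type: IdFeature
  embedding_dim: 16
  hash_bucket_size: 1000000
}
model_config {
  model_class: "MultiTower"
}
```

Because the whole model is declared this way, swapping optimizers, features, or model classes is a config change rather than a code change, which is what enables the same definition to run on MaxCompute, EMR, or DLC.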

The PAI‑REC inference engine, written in Go, links stages like recall, ranking, re‑ranking, and shuffling, offering modularity, A/B testing UI, and feature‑consistency diagnostics.

Training optimizations include sequence‑feature deduplication (reducing batch size by 90‑95%), embedding parallelism that shards sparse parameters across workers while using All‑Reduce for dense parameters, lock‑free hash tables on CPU, and GPU‑accelerated sparse embeddings via HugeCTR, achieving up to 3.5 steps/s in PS mode and significant speedups in EmbeddingParallel mode.
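The deduplication idea can be sketched as follows: within a batch, many rows carry identical user behavior sequences, so the expensive embedding path runs only on unique rows and the results are scattered back. This is a minimal sketch, not EasyRec's implementation; the toy "embedding" function and names are hypothetical.

```python
import numpy as np

def dedup_lookup(seq_ids, embed_fn):
    """Run an expensive per-row computation only on unique rows of a
    batch, then gather the results back into the original batch order."""
    # np.unique over rows gives the unique rows plus an inverse index
    # mapping each original row to its unique representative.
    uniq, inverse = np.unique(seq_ids, axis=0, return_inverse=True)
    uniq_emb = embed_fn(uniq)              # expensive path: unique rows only
    return uniq_emb[inverse.ravel()]       # scatter back to full batch

# Toy stand-in for an embedding lookup: mean of the ids in each row.
batch = np.array([[1, 2, 3], [4, 5, 6], [1, 2, 3], [1, 2, 3]])
out = dedup_lookup(batch, lambda rows: rows.mean(axis=1, keepdims=True))
```

Here a batch of 4 sequences contains only 2 distinct ones, so the embedding function processes 2 rows instead of 4; at 90-95% duplication the savings dominate.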

CPU‑side enhancements leverage Intel AMX for BF16 matrix multiplication, delivering ~16× compute boost for MatMul‑heavy models.
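BF16 keeps float32's 8-bit exponent but truncates the mantissa to 7 bits, which is why it trades little model quality for large throughput gains. A quick way to see the precision cost is to round-trip float32 operands through a simulated BF16 before a matmul; this is a sketch of the numeric format only, not of Intel AMX itself:

```python
import numpy as np

def to_bf16(x):
    """Simulate BF16 by zeroing the low 16 bits of each float32 value
    (simple truncation; real hardware may round-to-nearest)."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

a = np.random.default_rng(0).standard_normal((64, 64)).astype(np.float32)
b = np.random.default_rng(1).standard_normal((64, 64)).astype(np.float32)
full = a @ b
approx = to_bf16(a) @ to_bf16(b)   # BF16 inputs, FP32 accumulation
rel_err = np.abs(full - approx).max() / np.abs(full).max()
```

The resulting relative error stays in the sub-percent range for well-scaled activations, consistent with the minimal AUC impact reported for BF16 deployment.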

Inference optimizations focus on operator fusion and AVX‑accelerated embedding lookups, BF16 quantization with minimal AUC impact, and feature‑layer improvements such as AVX‑based string split, CRC/Xor hashing, compact sequence‑feature storage (80% memory reduction), and automatic broadcast tiling that raises QPS by 30–50%.
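The CRC-style feature hashing above can be sketched with the Python standard library, with `zlib.crc32` standing in for the engine's vectorized CRC/Xor implementation; the bucket count and feature strings are illustrative:

```python
import zlib

def hash_bucket(feature_value: str, num_buckets: int = 1_000_000) -> int:
    """Map a raw string feature to an embedding-table row index via CRC32.

    Hashing avoids maintaining an explicit vocabulary: any string,
    including previously unseen ones, deterministically lands in a bucket.
    """
    return zlib.crc32(feature_value.encode("utf-8")) % num_buckets

# The same input always maps to the same bucket.
b1 = hash_bucket("user_42:item_7")
b2 = hash_bucket("user_42:item_7")
```

The trade-off is controlled collisions: distinct features can share a bucket, so `num_buckets` is sized to keep the collision rate acceptably low.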

GPU placement strategies move dense computations to GPU while keeping lightweight embedding ops on CPU, using Min‑Cut graph partitioning to minimize H2D transfers; XLA and TensorRT are applied for operator fusion and dynamic‑shape handling, further reducing latency and increasing throughput.
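The placement objective can be illustrated on a toy op graph: pick a CPU/GPU assignment for the unpinned ops that minimizes the bytes crossing the host-to-device boundary, with embedding lookups pinned to CPU and heavy dense ops pinned to GPU. This is a brute-force sketch of the objective, not the Min-Cut algorithm EasyRec uses; the graph, op names, and tensor sizes are hypothetical.

```python
from itertools import product

# Toy op graph: edge weights are tensor sizes in bytes.
edges = {("emb_a", "concat"): 4096, ("emb_b", "concat"): 4096,
         ("concat", "mlp"): 16384, ("mlp", "out"): 512}
pinned = {"emb_a": "cpu", "emb_b": "cpu", "mlp": "gpu", "out": "gpu"}
free_ops = ["concat"]

def boundary_cost(placement):
    """Total bytes crossing the CPU/GPU boundary under a placement."""
    return sum(size for (u, v), size in edges.items()
               if placement[u] != placement[v])

best = None
for choice in product(["cpu", "gpu"], repeat=len(free_ops)):
    placement = {**pinned, **dict(zip(free_ops, choice))}
    cost = boundary_cost(placement)
    if best is None or cost < best[0]:
        best = (cost, placement)
```

Placing `concat` on GPU moves two small 4 KB transfers across the boundary instead of one 16 KB transfer, halving H2D traffic; a Min-Cut solver finds the same kind of split on real graphs without enumeration.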

Network optimizations replace Nginx load‑balancing with direct pod connections, cutting round‑trip time by ~5 ms, and employ compression (Snappy, ZSTD) to shrink high‑throughput traffic on dedicated lines by as much as 5×.
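Why compression pays off so well here can be sketched with stdlib `zlib` standing in for Snappy/ZSTD: recommendation request payloads repeat feature keys and context values heavily, exactly the redundancy general-purpose compressors exploit. The payload below is synthetic.

```python
import zlib

# Synthetic request payload: repeated feature key/value pairs, the kind
# of redundancy typical of batched recommendation traffic.
payload = b"".join(b"user_id=12345&item_id=%d&ctx=app_home;" % i
                   for i in range(1000))
compressed = zlib.compress(payload, level=6)
ratio = len(payload) / len(compressed)
assert zlib.decompress(compressed) == payload   # lossless round-trip
```

On such payloads the compression ratio easily exceeds the fivefold figure cited for dedicated-line traffic; Snappy trades some of that ratio for much lower CPU cost, while ZSTD gets closer to zlib-or-better ratios at high speed.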

Online learning is realized through real‑time log back‑flow, Flink‑based sample aggregation, DataHub storage, and periodic incremental parameter updates to OSS and the EasyRec processor, with feature‑level compression and deduplication to maintain model freshness in fast‑changing scenarios.
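The incremental-update step can be sketched as delta collection with per-row deduplication: only embedding rows touched since the last export are shipped, and a row updated many times within the window is sent once with its latest value. This is a minimal sketch of the idea under those assumptions, not the EasyRec processor's actual protocol.

```python
def collect_delta(updates):
    """Coalesce a time-ordered stream of (row_id, vector) sparse updates.

    The last write per row wins, so each touched embedding row appears
    exactly once in the exported delta, however often it was updated.
    """
    delta = {}
    for row_id, vec in updates:   # updates arrive in time order
        delta[row_id] = vec       # later update overwrites earlier one
    return delta

stream = [(7, [0.1, 0.2]), (3, [0.5, 0.5]), (7, [0.3, 0.4])]
delta = collect_delta(stream)     # rows 3 and 7 only, latest values
```

Hot rows (popular items, active users) absorb most updates, so deduplicating before pushing to OSS cuts the payload far below the raw update count.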

The talk concludes with references to EasyRec and Processor documentation, the PAI‑REC recommendation pipeline, and feature‑engineering resources that underpin Alibaba Cloud’s end‑to‑end recommendation system.

Tags: distributed systems · recommendation · inference optimization · online learning · training optimization · EasyRec
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
