EasyRec Deep Dive: Training & Inference Architecture, Optimizations, and Online Learning
This article explains EasyRec's end‑to‑end recommendation system, covering its training‑inference architecture, a series of CPU/GPU and distributed optimizations, and a real‑time online‑learning pipeline that together improve throughput, latency, and model freshness.
01 EasyRec Training & Inference Architecture
Recommendation models now handle thousands of features, large embeddings, and deep dense layers, creating severe compute and latency challenges. EasyRec addresses these by providing a configurable, component‑based architecture consisting of a data layer, embedding layer, dense layer, and output layer. The framework runs on MaxCompute, EMR, and the DLC container platform, supports Keras components, distributed training, online‑learning (ODL), and NNI‑driven hyper‑parameter search. It also offers multi‑optimizer settings, feature hot‑start, large‑scale negative sampling, and a work‑queue mechanism for fault‑tolerant training resume.
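The layered design above can be sketched as composable stages: a data layer produces sparse feature ids, an embedding layer looks them up, a dense layer transforms them, and an output layer emits a score. This is an illustrative NumPy sketch, not EasyRec's actual component API; all class and variable names here are hypothetical.

```python
# Minimal sketch of the data -> embedding -> dense -> output pipeline.
# Names and shapes are illustrative, not EasyRec APIs.
import numpy as np

rng = np.random.default_rng(0)

class EmbeddingLayer:
    def __init__(self, vocab_size, dim):
        self.table = rng.standard_normal((vocab_size, dim))

    def __call__(self, ids):          # ids: (batch,) int feature ids
        return self.table[ids]        # -> (batch, dim)

class DenseLayer:
    def __init__(self, in_dim, out_dim):
        self.w = rng.standard_normal((in_dim, out_dim)) * 0.1

    def __call__(self, x):
        return np.maximum(x @ self.w, 0.0)   # one ReLU MLP block

def output_layer(x):
    # sigmoid head producing a CTR-style score per sample
    return 1.0 / (1.0 + np.exp(-x.sum(axis=1)))

# "data layer": a batch of 4 samples with one sparse id feature each
ids = np.array([3, 7, 3, 1])
emb = EmbeddingLayer(vocab_size=10, dim=8)
dense = DenseLayer(8, 4)
scores = output_layer(dense(emb(ids)))
print(scores.shape)  # (4,)
```

Swapping any stage (e.g. a different dense block) without touching the others is the point of the component-based design.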
02 EasyRec Training Optimization
Key optimizations include SequenceFeature deduplication (shrinking the sequence data in a batch to 5‑10% of its original size), embedding sharding (EmbeddingParallel) that moves sparse parameters to workers while keeping dense parameters synchronized via All‑Reduce, and lock‑free hash tables on CPU that outperform Google's dense hash table. On GPU, HugeCTR's SOK (Sparse Operation Kit) caches hot embeddings to cut host‑to‑device (H2D) transfer. Intel AMX BF16 acceleration boosts matrix‑multiply performance by ~16×, and further gains are expected from a C++ implementation of the deduplication logic.
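The deduplication idea can be shown in a few lines: identical behavior sequences within a batch (common when one user is scored against many candidate items) are looked up once, then gathered back to the full batch. This is an illustrative sketch of the technique, not EasyRec's implementation.

```python
# Sketch of SequenceFeature deduplication: embed only unique sequences,
# then gather results back to the original batch order.
import numpy as np

batch_seqs = np.array([
    [5, 9, 2],   # user A's behavior sequence
    [5, 9, 2],   # user A again, scored against another candidate item
    [7, 1, 0],   # user B
    [5, 9, 2],   # user A again
])

# Deduplicate rows; `inverse` maps each original row to its unique row.
unique_seqs, inverse = np.unique(batch_seqs, axis=0, return_inverse=True)
inverse = inverse.ravel()                # ensure a flat index vector

emb_table = np.random.default_rng(0).standard_normal((10, 4))
unique_emb = emb_table[unique_seqs]      # lookup only on unique sequences
full_emb = unique_emb[inverse]           # scatter back to the full batch

print(unique_seqs.shape[0], batch_seqs.shape[0])  # 2 4
```

Here the embedding lookup (and any sequence-model forward pass placed between lookup and gather) runs on 2 sequences instead of 4; in production batches the reduction is far larger.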
Feature‑layer improvements use AVX‑accelerated StringSplit, replace MurmurHash with the faster CrcHash/XorHash, and introduce a compact storage format for SequenceFeature that cuts memory usage by over 80%. TensorFlow ops are wrapped to enable parallel execution and to overlap feature generation with embedding lookup, reducing runtime by ~20%.
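The hash replacement is about bucketing cost: raw feature strings are hashed into embedding-bucket ids, so a cheaper hash directly speeds up the feature layer. Below, stdlib CRC32 stands in for the CrcHash mentioned above; the real kernels are AVX-accelerated C++ ops, and `NUM_BUCKETS` and the function name are illustrative.

```python
# Sketch of CRC-based feature hashing into embedding buckets.
# zlib.crc32 stands in for the article's CrcHash; only the bucketing
# logic is shown, not the vectorized production kernel.
import zlib

NUM_BUCKETS = 1_000_000  # illustrative embedding table size

def feature_to_bucket(value: str) -> int:
    # CRC has hardware support on modern CPUs, making it cheaper than
    # MurmurHash while still spreading ids across buckets.
    return zlib.crc32(value.encode("utf-8")) % NUM_BUCKETS

ids = [feature_to_bucket(v) for v in ["item_42", "item_42", "item_43"]]
print(ids[0] == ids[1])  # True — the same value always lands in one bucket
```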
03 EasyRec Inference Optimization
The PAI‑REC inference engine, written in Go, connects recall, ranking, re‑ranking, and shuffling stages and provides a user‑friendly UI for A/B testing and feature‑consistency diagnostics. EasyRecProcessor handles online inference through an item feature cache, a feature generator, and a TensorFlow model, applying CPU/GPU optimizations such as feature‑cache reduction, incremental model updates, and GPU‑side dense computation.
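The item feature cache in front of the remote feature store is a standard hot-key cache: hot item features are served from local memory and only misses hit the store. This is a minimal LRU sketch of that idea; the class, eviction policy details, and field names are hypothetical, not EasyRecProcessor internals.

```python
# Sketch of an item feature cache: serve hot item features locally,
# fall back to a remote feature store on miss, evict LRU entries.
from collections import OrderedDict

class ItemFeatureCache:
    def __init__(self, capacity: int, fetch_fn):
        self.capacity = capacity
        self.fetch_fn = fetch_fn                  # remote-store fallback
        self.cache: OrderedDict[int, dict] = OrderedDict()

    def get(self, item_id: int) -> dict:
        if item_id in self.cache:
            self.cache.move_to_end(item_id)       # mark as recently used
            return self.cache[item_id]
        feats = self.fetch_fn(item_id)            # miss: remote lookup
        self.cache[item_id] = feats
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)        # evict least recently used
        return feats

calls = []
def remote_fetch(item_id):
    calls.append(item_id)                         # count remote lookups
    return {"item_id": item_id, "ctr_7d": 0.01}   # illustrative features

cache = ItemFeatureCache(capacity=2, fetch_fn=remote_fetch)
cache.get(1); cache.get(1); cache.get(2)
print(len(calls))  # 2 — the repeated get(1) was served from cache
```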
Inference speed is further improved by fusing small embedding‑related ops, using AVX for parallel execution, and applying BF16 quantization with negligible AUC impact. XLA and TensorRT (TRT) are combined to fuse dense‑layer ops, handle dynamic shapes, and enable BF16 quantization, yielding 10‑30% QPS gains. Placement optimization runs embedding lookup on CPU and dense computation on GPU while minimizing H2D copies via a min‑cut graph partition.
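The "negligible AUC impact" of BF16 follows from its format: BF16 keeps float32's 8-bit exponent and truncates the mantissa to 7 bits, so values keep their range and lose only fine precision. The sketch below emulates BF16 by zeroing the low 16 bits of the float32 representation (truncation rather than round-to-nearest), which bounds the relative error at roughly 2^-7.

```python
# Emulate BF16 by truncating float32's low 16 mantissa bits, then
# measure the worst-case relative error on random activations.
import numpy as np

def to_bf16(x: np.ndarray) -> np.ndarray:
    bits = x.astype(np.float32).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

x = np.random.default_rng(0).standard_normal(10_000).astype(np.float32)
rel_err = np.abs(to_bf16(x) - x) / np.maximum(np.abs(x), 1e-30)
print(rel_err.max() < 1e-2)  # True — worst case stays under ~2**-7
```

A sub-1% per-value error is typically far below the noise floor of ranking-score differences, which is why AUC barely moves.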
04 Real‑Time Online Learning
Online learning is realized by streaming logs from PAI‑REC to SLS, then to Datahub, where Flink aggregates samples and labels. The pipeline supports configurable stream training, incremental parameter export to OSS, and automatic processor updates. Feature consistency is enhanced with LZ4‑compressed joins, and delayed or duplicate samples are filtered. The system has demonstrated significant effectiveness in new‑item and content‑driven scenarios.
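The core of the Flink aggregation step is a sample-label join: exposure samples wait inside a time window for their click labels, late labels are dropped, and duplicate labels are ignored. This is a simplified in-memory sketch of that logic; the window length, field names, and function are illustrative, not the production Flink job.

```python
# Sketch of a windowed sample/label join: exposures default to negative,
# a click within the window flips the label, duplicates and late clicks
# are filtered out.

JOIN_WINDOW = 600  # seconds an exposure waits for its label (illustrative)

def join_samples(exposures, clicks):
    # exposures: {request_id: (features, exposure_ts)}
    # clicks:    [(request_id, click_ts), ...]
    labeled, seen = {}, set()
    for rid, (feats, ts) in exposures.items():
        labeled[rid] = (feats, 0)                 # default: negative sample
    for rid, click_ts in clicks:
        if rid in seen:
            continue                              # drop duplicate labels
        seen.add(rid)
        if rid in exposures:
            feats, ts = exposures[rid]
            if click_ts - ts <= JOIN_WINDOW:      # drop labels past window
                labeled[rid] = (feats, 1)
    return labeled

out = join_samples(
    {"r1": ({"uid": 1}, 0), "r2": ({"uid": 2}, 0)},
    [("r1", 30), ("r1", 40), ("r2", 9999)],       # r1 duplicated, r2 late
)
print(out["r1"][1], out["r2"][1])  # 1 0
```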
Additional engineering improvements include direct pod‑IP connections that eliminate an extra Nginx hop (reducing RTT by ~5 ms) and request compression (snappy, zstd) that shrinks high‑throughput traffic by as much as 5×.
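The compression win comes from how repetitive feature payloads are: many items in one request share field names and categorical values. The sketch below uses stdlib zlib as a stand-in for snappy/zstd to show the size trade-off on a feature-like JSON payload; the payload shape is illustrative.

```python
# Demonstrate the compressibility of a repetitive, feature-like request
# body. zlib (stdlib) stands in for the snappy/zstd used in production.
import json
import zlib

payload = json.dumps(
    [{"item_id": i, "cat": "electronics", "price": 99.9} for i in range(500)]
).encode("utf-8")

compressed = zlib.compress(payload, level=1)  # fast level, snappy-like goal
ratio = len(payload) / len(compressed)
print(ratio > 3)  # True — repeated keys and values compress well
```

Fast codecs like snappy trade ratio for speed, which matters at high QPS: the CPU spent compressing must stay below the network time saved.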
DataFunSummit
Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.