
Dynamic Weight Averaging and Gradient Normalization for Multi‑Task Recommendation Models

To improve multi‑task recommendation in the “每平每屋” system, the team augments an MMoE ranking model with dynamic weight averaging, dynamic task prioritization, and GradNorm gradient normalization, stabilizing loss convergence across CTR, CVR, and fav tasks and delivering 3–4% online metric gains.

DaTaobao Tech

This article is the fourth in a series that shares practical experiences on recall, ranking, and cold‑start modules in the "每平每屋" recommendation system.

In the ranking stage, the team uses a Multi-Gate Mixture-of-Experts (MMoE) multi-task model to predict three click-through rates simultaneously: first click, detail-page click, and product-detail click. Unlike simple hard parameter sharing, MMoE uses per-task learned gates to decide how much each shared expert's parameters contribute to each task.
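The article does not show the gating computation, but the MMoE idea can be sketched in a few lines (shapes and the linear experts here are illustrative assumptions, not the team's actual network):

```python
import numpy as np

def mmoe_task_output(x, expert_ws, gate_w):
    """Minimal MMoE gating sketch.

    Each expert is a linear map x @ W; a per-task softmax gate mixes
    the expert outputs, so each task learns how much of each shared
    expert to use instead of sharing all parameters equally.
    """
    experts = np.stack([x @ W for W in expert_ws])   # (n_experts, d_out)
    logits = x @ gate_w                              # (n_experts,)
    gate = np.exp(logits - logits.max())
    gate = gate / gate.sum()                         # softmax over experts
    return np.tensordot(gate, experts, axes=1)       # (d_out,)
```

In the full model, each task has its own `gate_w` and its own tower on top of this mixed representation.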

Beyond model structure, the authors focus on model dynamics: the differences in training speed and loss convergence among tasks. They propose dynamic weighting strategies to balance these differences.

Dynamic Weight Averaging (DWA) (CVPR 2019) computes task weights from each task's relative loss-descent rate r_k(t-1) = L_k(t-1) / L_k(t-2). The weights are a temperature-scaled softmax over these rates, rescaled to sum to the number of tasks, so faster-converging tasks (smaller r_k) receive smaller weights.
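A minimal sketch of the DWA weight update (the temperature value here is the paper's default, not necessarily the team's setting):

```python
import math

def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Dynamic Weight Averaging (CVPR 2019).

    r_k = L_k(t-1) / L_k(t-2) measures each task's relative descent
    rate; the weights are a temperature-scaled softmax over r_k,
    rescaled so they sum to the number of tasks K. Fast-converging
    tasks (small r_k) get smaller weights.
    """
    K = len(prev_losses)
    ratios = [l1 / l2 for l1, l2 in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    return [K * e / total for e in exps]

# Task 0 converges faster (loss ratio 0.8) than task 1 (ratio 0.95),
# so task 1 gets the larger weight.
w = dwa_weights([0.40, 0.57], [0.50, 0.60])
```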

Dynamic Task Prioritization (DTP) (ECCV 2018) defines task weights from a per-task KPI κ (e.g., accuracy) at each step, using a focal-loss style term -(1 - κ)^γ log κ. Higher-KPI (easier) tasks receive smaller weights, concentrating capacity on the tasks that are still struggling.
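The DTP weight reduces to a one-liner; a sketch using the focal-loss form from the original paper (γ = 1 is an assumed default):

```python
import math

def dtp_weight(kpi, gamma=1.0):
    """Dynamic Task Prioritization (ECCV 2018).

    Focal-loss style priority: w = -(1 - kpi)^gamma * log(kpi).
    A task that already performs well (kpi near 1) gets a small
    weight; a struggling task (low kpi) gets a large one.
    """
    return -((1.0 - kpi) ** gamma) * math.log(kpi)

# The near-solved task (accuracy 0.95) is down-weighted relative to
# the harder task (accuracy 0.60).
easy, hard = dtp_weight(0.95), dtp_weight(0.60)
```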

Gradient Normalization (GradNorm) (ICML 2018) balances tasks by pulling the L2 norm of each task's weighted gradient toward a common target derived from the average gradient norm and each task's relative training rate. It thus jointly considers loss magnitude and training speed, defining a gradient loss that is added to the overall objective and used to update the task weights.
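The gradient loss can be sketched as follows (in practice the gradient norms come from autodiff on a shared layer; here they are passed in as plain numbers, and α = 0.12 is an illustrative hyperparameter):

```python
def gradnorm_loss(grad_norms, losses, initial_losses, weights, alpha=0.12):
    """GradNorm (ICML 2018) gradient loss.

    G_i      = w_i * ||grad of task i||          (weighted gradient norm)
    r_i      = (L_i / L_i(0)) / mean over tasks  (relative inverse training rate)
    target_i = mean(G) * r_i ** alpha
    L_grad   = sum_i | G_i - target_i |
    """
    K = len(losses)
    G = [w * g for w, g in zip(weights, grad_norms)]
    G_bar = sum(G) / K
    tilde = [l / l0 for l, l0 in zip(losses, initial_losses)]
    tilde_bar = sum(tilde) / K
    r = [t / tilde_bar for t in tilde]
    return sum(abs(g - G_bar * ri ** alpha) for g, ri in zip(G, r))

# Perfectly balanced tasks (equal gradient norms, equal training
# rates) yield zero gradient loss.
lg = gradnorm_loss([1.0, 1.0], [0.5, 0.5], [1.0, 1.0], [1.0, 1.0])
```

Minimizing L_grad with respect to the task weights w_i shrinks the gradients of fast tasks and boosts those of slow ones.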

The implementation follows the open-source gradnorm_tf.py repository. The authors found training unstable when the loss ratio L_i(t) / L_i(0) used the noisy first-step loss as its denominator; they replaced it with a more stable moving average of the loss magnitude.
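The article does not show the exact fix, but the idea can be sketched as follows (the decay constant is an assumption for illustration):

```python
def ema_loss_ratio(loss, ema, decay=0.99):
    """Sketch of the stabilization described in the article: replace the
    noisy first-step loss L_i(0) in the GradNorm loss ratio with an
    exponential moving average of the loss, so the ratio denominator
    tracks the loss scale instead of a single early measurement.
    Returns (ratio, updated ema); pass ema=None on the first step.
    """
    ema = decay * ema + (1.0 - decay) * loss if ema is not None else loss
    return loss / ema, ema
```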

Offline comparison shows that adding gradient loss slows down the CTR task (training speed) while accelerating CVR and fav tasks. AUC variance across three runs drops from 0.0055 (baseline) to 0.0016 (baseline + GradNorm), indicating more stable training, especially for sparse detail and fav tasks.

Online A/B test results (GradNorm vs. baseline) demonstrate significant improvements: pctcvr + 3.49%, avg_ipv + 4.04%, as well as gains in uctr, pctr, and click metrics.

Conclusion: By addressing model dynamics in addition to model structure (MMoE/PLE), the team achieves complementary gains in multi-task recommendation scenarios.

The work was carried out by the Taobao Intelligence team of the Taobao Technology Department (淘系技术部‑淘宝智能团队), which contributes both to product value and to academic publications at conferences such as KDD, ICCV, and ICML.

References include papers on Multi‑Task Learning optimization, MMoE (KDD 2018), DWA (CVPR 2019), DTP (ECCV 2018), and GradNorm (ICML 2018).

Tags: A/B testing, multi-task learning, recommendation systems, Dynamic Weight Averaging, Gradient Normalization, MMoE