
DEFUSE and Bi-DEFUSE: Unbiased Delayed‑Feedback Modeling for CVR Prediction

The paper introduces DEFUSE and its multi‑task extension Bi‑DEFUSE, unbiased delayed‑feedback CVR models that correct label bias via rigorous importance sampling and a latent fake‑negative variable, achieving superior offline performance and a 2% CVR lift in online deployment compared with existing industry baselines.

Alimama Tech

1. Overview

In classic search advertising scenarios such as Taobao search ads, conversion‑rate (CVR) estimation is a crucial foundation for GMV‑oriented optimization. CVR prediction not only drives ranking but also underlies various bidding strategies (CPC, oCPX), balancing platform efficiency and advertiser ROI while ensuring the health of the e‑commerce advertising ecosystem.

This article shares practical experiences on delayed‑feedback modeling and online learning for the "Direct Train" (直通车) main‑line scenario, covering:

Problem definition and significance of CVR & delayed‑feedback modeling

Understanding of mainstream industry solutions

DEFUSE / Bi‑DEFUSE algorithm ideas for Direct Train CVR

Comparison with industry baselines and deployment results

This work was published at TheWebConf (WWW) 2022.

Paper: Asymptotically Unbiased Estimation for Delayed Feedback Modeling via Label Correction

Download: https://arxiv.org/abs/2202.06472

1.1 Background

Real‑time online learning on streaming samples has achieved remarkable success in CTR/CVR modeling for search advertising. However, because a purchase decision incurs a much larger cost than a click, CVR modeling suffers from a pervasive and significant feedback delay: the time between a click and the eventual conversion can be long. In Taobao Direct Train, only about 60% of conversions happen within 30 minutes after a click (42% on the Criteo dataset).

In CVR estimation, a click at time t is labeled positive only if the conversion occurs before a predefined cutoff t+Δ. This delay makes it hard to use a short label window: the model must wait up to Δ before a reliable label can be observed, making real‑time sample construction and online learning extremely challenging.

Industry solutions typically introduce an observation window much shorter than the attribution period (e.g., 15 min / 30 min) to trade off sample freshness and label accuracy. Samples observed within the window are used as provisional labels, and delayed‑feedback models correct the bias between the observed and true distributions.

2. Industry Solutions

2.1 Traditional CVR Modeling

Traditional CVR models take an input x with feature vector f(x), a true conversion label y ∈ {0, 1}, a true distribution P(y|x), and a model p̂(y|x) optimized with binary cross‑entropy:

L = -E_{(x,y)~P}[ y log p̂(y=1|x) + (1-y) log(1 - p̂(y=1|x)) ]

In practice the expectation is approximated by the empirical mean over N training samples:

L ≈ -(1/N) Σᵢ [ yᵢ log p̂(xᵢ) + (1-yᵢ) log(1 - p̂(xᵢ)) ]
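The empirical loss above can be written down directly. A minimal pure‑Python sketch (function and variable names are illustrative, not from the paper):

```python
import math

def bce_loss(y_true, y_pred, eps=1e-12):
    """Empirical binary cross-entropy over a batch of (label, prediction) pairs."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)
```

For example, `bce_loss([1, 0], [0.9, 0.1])` evaluates both samples at confidence 0.9 toward the correct class, giving -log(0.9) ≈ 0.105.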

2.2 Mainstream Delayed‑Feedback Modeling

By introducing an observation window W and a conversion delay D, samples can be divided into four categories:

Immediate Positive (IP) : conversion occurs within W

Real Negative (RN) : no conversion within the full attribution period

Fake Negative (FN) : conversion occurs after W but before the attribution deadline

Delay Positive (DP) : the re‑issued positive for an FN sample, emitted once the delayed conversion is finally observed
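The taxonomy above amounts to a simple case analysis on the conversion delay. A hedged sketch of how a click might be classified at window expiry (times and names are illustrative; `conv_t=None` means the click never converts):

```python
def classify_sample(click_t, conv_t, window, attribution):
    """Classify a click by its conversion delay.

    IP: conversion within the observation window
    FN: observed negative at window expiry, converts later within attribution
        (such a sample is later re-issued as a Delay Positive, DP)
    RN: no conversion within the full attribution period
    """
    if conv_t is not None and conv_t - click_t <= window:
        return "IP"
    if conv_t is not None and conv_t - click_t <= attribution:
        return "FN"  # fake negative now; becomes DP once the conversion arrives
    return "RN"
```

With a 30‑minute window and 1‑day (1440‑minute) attribution, a conversion at minute 10 is IP, at minute 100 is FN/DP, and no conversion (or one past minute 1440) is RN.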

Two major families of delayed‑feedback methods exist:

Joint modeling of observed distribution and delayed feedback (early classic works [1, 2]) – model the probability that a conversion lies outside the observation window and correct the label bias.

Sample‑replenishment based joint modeling – use importance sampling (IS) to re‑weight samples after a replenishment mechanism.

Sample‑Replenishment Mechanisms

Three representative mechanisms are widely used:

FNC/FNW – treat every click as a negative at click time; when a conversion finally occurs, the sample is re‑injected as a positive.

ES‑DFM – window‑based replenishment: after the observation window expires, the observed label is fixed; delayed positives are re‑issued.

DEFER – similar to ES‑DFM, but re‑issues all samples with their true labels after the full attribution period.
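The simplest of these, FNC/FNW, can be sketched as an event‑stream transform: every click immediately yields a provisional negative, and a later conversion re‑injects the same sample as a positive. This is an illustrative sketch only (the event format and function name are assumptions, not the paper's interface):

```python
def fnw_stream(events):
    """FNC/FNW-style replenishment.

    events: list of (time, x, kind) with kind in {"click", "conversion"}.
    Returns time-ordered (time, x, label) training samples: a negative at
    click time, plus a positive re-injection when a conversion arrives.
    """
    samples = []
    for t, x, kind in sorted(events):
        if kind == "click":
            samples.append((t, x, 0))  # provisional negative at click time
        else:
            samples.append((t, x, 1))  # delayed positive re-injection
    return samples
```

Note that a converting click appears twice in the stream, once with each label; this duplication is exactly the bias the importance weights below must correct.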

Importance Sampling (IS)

Let P_g denote the true (ground‑truth) distribution and P_o the observed distribution after windowing and replenishment. By the importance‑sampling identity E_{P_g}[ℓ(x)] = E_{P_o}[w(x) ℓ(x)], the CVR loss can be rewritten as an expectation over P_o with importance weight w(x) = P_g(x)/P_o(x). Existing methods approximate w(x) according to the chosen replenishment design.

The derivation involves two approximations: (1) modeling the delay distribution, and (2) fitting the observed distribution. Differences among methods mainly lie in how they design the replenishment and derive the importance weight.
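Putting the identity into loss form, an IS‑reweighted cross‑entropy can be sketched as follows, with `weight` standing in for whatever w(x) the chosen replenishment design implies (all names are illustrative):

```python
import math

def is_weighted_bce(samples, predict, weight, eps=1e-12):
    """Importance-sampled CVR loss: an expectation over the observed
    distribution P_o, with each sample reweighted by w(x) = P_g(x)/P_o(x).

    samples: list of (x, observed_label); predict(x) -> p_hat in (0, 1);
    weight(x, y) -> importance weight for that observed sample.
    """
    total = 0.0
    for x, y in samples:
        p = min(max(predict(x), eps), 1.0 - eps)
        log_lik = y * math.log(p) + (1 - y) * math.log(1 - p)
        total += weight(x, y) * log_lik
    return -total / len(samples)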

2.3 Problems with Existing Delayed‑Feedback Methods

Although IS‑based methods improve performance, their derivations assume that the transformation from the true distribution P_g to the observed distribution P_o only changes probability density, not the sample values themselves. In CVR, the label of a sample changes from FN to DP after the delayed conversion, which violates this assumption. Consequently, existing works mistakenly treat fake negatives as real negatives in the importance‑weight formula.

3. Proposed Method

To address the issue identified in Section 2.3, we propose a more rigorous IS application and introduce the original algorithm DEFUSE (DElayed Feedback modeling with UnbiaSed Estimation). We also present Bi‑DEFUSE , a multi‑task extension that separates window‑inside and window‑outside modeling.

3.1 DEFUSE

We first refine the sample taxonomy (see Figure 2) and rewrite the loss as:

The new loss differs from traditional IS‑based losses in two ways:

It does not rely on the simplifying assumption that fake negatives equal real negatives.

It introduces a latent variable z(x) to explicitly model whether an observed negative is a fake negative, enabling a correct importance‑weight derivation.

Importance Weight of DEFUSE

By separating the four sample types ( IP, DP, RN, FN ) and introducing z(x) , the importance weight can be expressed analytically (see the following equation).

An auxiliary model g(x) predicts the probability that an observed negative is a fake negative, allowing us to estimate the importance weight without directly observing the true label.
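The role of g(x) can be illustrated on a single observed negative. The sketch below is not the paper's exact weight formula; it only shows the core idea that an observed negative is treated as a g(x)‑weighted mixture of a fake negative (true label 1) and a real negative (true label 0), marginalizing out the latent z(x):

```python
import math

def defuse_negative_loss(p_hat, g, eps=1e-12):
    """Illustrative DEFUSE-style loss for one observed negative.

    p_hat: model's CVR prediction; g: estimated probability that this
    observed negative is a fake negative (z(x) = 1). The latent label is
    marginalized: g weights the positive term, (1 - g) the negative term.
    """
    p = min(max(p_hat, eps), 1.0 - eps)
    return -(g * math.log(p) + (1.0 - g) * math.log(1.0 - p))
```

At the extremes this recovers the two certain cases: g = 0 gives the ordinary real‑negative loss -log(1 - p̂), and g = 1 gives the positive loss -log p̂.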

Optimization of DEFUSE

The final loss requires estimating g(x) . We can either train a binary classifier to distinguish RN from FN, or embed the estimation into the CVR model itself (the latter is more compact but less stable).

3.2 Bi‑DEFUSE

Bi‑DEFUSE splits CVR modeling into two sub‑tasks:

Window‑inside modeling (IP) : uses standard cross‑entropy because the observed and true distributions coincide.

Window‑outside modeling (DP) : applies DEFUSE to correct the distribution shift caused by delayed conversions.

The two tasks share a multi‑gate mixture‑of‑experts (MMoE) backbone, enabling joint training while keeping the variance of the window‑outside task bounded.

The overall CVR prediction is the weighted sum of the two sub‑models' outputs, combining the in‑window and delayed conversion probabilities.
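Since converting inside the window and converting after it are disjoint events, the simplest combination of the two heads is an additive one. A minimal sketch (the deployed model may weight the terms differently; names are illustrative):

```python
def bi_defuse_cvr(p_in, p_out):
    """Combine the two Bi-DEFUSE heads into one CVR estimate.

    p_in:  probability of converting inside the observation window (IP head)
    p_out: probability of a delayed conversion outside the window (DP head)
    The events are disjoint, so their probabilities add; clamp to 1.0 for
    numerical safety.
    """
    return min(p_in + p_out, 1.0)
```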

4. Experiments

We evaluate DEFUSE and Bi‑DEFUSE on two datasets:

Criteo public dataset (30‑day and 1‑day attribution)

Taobao Direct Train industrial dataset (1‑day attribution)

Data are split hourly to simulate online streaming; training uses the observed distribution (including fake negatives), while evaluation uses the true distribution without replenishment.

4.1 Baselines

We compare against:

Pre‑trained (offline pre‑training)

Oracle (perfect future label)

Vanilla (cross‑entropy on window‑inside data only)

Vanilla‑Win (window‑outside samples re‑issued without IS)

FNW, FNC, ES‑DFM, DEFER (industry‑standard replenishment + IS)

4.2 Results (RQ1)

DEFUSE and Bi‑DEFUSE achieve the best or second‑best performance on almost all datasets. In settings with short attribution periods, where IP samples dominate, Bi‑DEFUSE outperforms DEFUSE. ES‑DFM and DEFER consistently beat FNC/FNW, confirming the benefit of a well‑designed observation window.

4.3 DEFUSE under Different Replenishment Mechanisms (RQ2)

DEFUSE improves over all three baselines (FNW, ES‑DFM, DEFER) across the Criteo dataset, demonstrating the effectiveness of the refined IS formulation.

4.4 Ablation Studies (RQ3)

We investigate the impact of the latent variable z(x) , the MMoE vs. MLP backbone, and attribution period length. Results show that:

DEFUSE+ (simpler latent modeling) consistently outperforms the more complex variant.

Bi‑DEFUSE with MMoE outperforms the MLP version, confirming the benefit of shared experts.

Shorter attribution cycles favor Bi‑DEFUSE because IP samples dominate.

4.5 Online Deployment

Bi‑DEFUSE was deployed before the 2021 Double‑11 shopping festival. The online A/B test reported +2% CVR, +0.8% ROI, and a slight RPM increase, confirming the practical impact.

5. Conclusion

We introduced DEFUSE, an unbiased delayed‑feedback CVR modeling framework based on rigorous importance‑sampling, and extended it to Bi‑DEFUSE for scenarios with short attribution windows. Extensive offline experiments and a successful online rollout demonstrate the superiority of our methods over existing industry baselines.

6. References

[1] Olivier Chapelle. 2014. Modeling Delayed Feedback in Display Advertising.

[2] Yuya Yoshikawa and Yusaku Imai. 2018. A Nonparametric Delayed Feedback Model for Conversion Rate Prediction.

[3] Sofia Ira Ktena et al. 2019. Addressing Delayed Feedback for Continuous Training with Neural Networks in CTR Prediction.

[4] Siyu Gu et al. 2021. Real Negatives Matter: Continuous Training with Real Negatives for Delayed Feedback Modeling.

[5] Jia‑Qi Yang et al. 2021. Capturing Delayed Feedback in Conversion Rate Prediction via Elapsed‑Time Sampling.

[6] Kuang‑chih Lee et al. 2012. Estimating Conversion Rate in Display Advertising from Past Performance Data.

[7] Jiaqi Ma et al. 2018. Modeling Task Relationships in Multi‑task Learning with Multi‑gate Mixture‑of‑Experts.

[8] Junwei Pan et al. 2019. Predicting Different Types of Conversions with Multi‑Task Learning in Online Advertising.

[9] Zhuojian Xiao et al. 2021. Adversarial Mixture of Experts with Category Hierarchy Soft Constraint.

Written by

Alimama Tech

Official Alimama tech channel, showcasing all of Alimama's technical innovations.
