Understanding the One-Epoch Overfitting Phenomenon in Deep Click-Through Rate Models
The study reveals that industrial deep click-through-rate (CTR) models often overfit dramatically at the start of the second training epoch, a "one-epoch phenomenon" linked to the embedding-plus-MLP architecture, fast-converging optimizers, and highly sparse ID features. Performance drops sharply unless sparsity is reduced or training is limited to a single pass.
This work investigates a peculiar overfitting behavior observed in industrial deep click-through rate (CTR) models, where performance degrades sharply at the beginning of the second training epoch, the so-called "one-epoch phenomenon". Experiments on Alibaba's display-ad data and on public datasets (Amazon Book, Taobao) show that the model reaches its best AUC after a single epoch and then collapses.
Deep CTR models differ from typical CV/NLP models: they operate on billions of high‑dimensional, sparse features and usually adopt an Embedding + MLP architecture. Sparse ID features (e.g., item_ID, history_item_IDs) dominate the input space.
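The Embedding + MLP structure can be illustrated with a minimal sketch. The dimensions, feature names, and pooling choice below are hypothetical placeholders, not the authors' production model: sparse IDs are looked up in an embedding table (billions of rows in production), pooled, concatenated, and passed through a small MLP to produce a click probability.

```python
# Minimal Embedding + MLP sketch in NumPy (hypothetical sizes, not the
# production architecture): sparse ID features -> embedding lookup ->
# pooling -> MLP -> sigmoid click probability.
import numpy as np

rng = np.random.default_rng(0)

VOCAB_SIZE = 1000  # hypothetical; real systems use billions of IDs
EMB_DIM = 8
HIDDEN = 16

emb_table = rng.normal(0, 0.01, (VOCAB_SIZE, EMB_DIM))
W1 = rng.normal(0, 0.1, (2 * EMB_DIM, HIDDEN))
b1 = np.zeros(HIDDEN)
W2 = rng.normal(0, 0.1, (HIDDEN, 1))
b2 = np.zeros(1)

def forward(item_id, history_item_ids):
    """Look up item_ID and history_item_IDs, mean-pool the history,
    and run the concatenated vector through a 2-layer MLP."""
    item_vec = emb_table[item_id]
    hist_vec = emb_table[history_item_ids].mean(axis=0)  # mean pooling
    x = np.concatenate([item_vec, hist_vec])
    h = np.maximum(0.0, x @ W1 + b1)        # ReLU hidden layer
    logit = (h @ W2 + b2)[0]
    return 1.0 / (1.0 + np.exp(-logit))     # sigmoid -> CTR estimate

p = forward(item_id=42, history_item_ids=np.array([3, 17, 256]))
```

Only the embedding rows touched by a batch receive gradients, which is why sparse IDs behave very differently from dense CV/NLP inputs during training.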
Key factors identified as contributors to the one‑epoch phenomenon are (1) the Embedding + MLP structure, (2) fast‑converging optimizers such as Adam or RMSprop with large learning rates, and (3) the use of highly sparse features. Changing any of these factors mitigates the effect but often incurs accuracy loss.
Model‑related experiments reveal that the phenomenon persists across various embedding dimensions, hidden‑layer sizes, and numbers of MLP layers, indicating that it is not caused by parameter count alone. Optimizers that accelerate convergence exacerbate the issue, while batch size, activation functions, weight decay, and dropout show little impact.
Feature‑related studies demonstrate that reducing feature sparsity—by filtering low‑frequency IDs, applying hashing with smaller tables, or removing the most sparse domains—gradually weakens or eliminates the one‑epoch drop, confirming the strong link between sparsity and overfitting.
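The two most common sparsity-reduction tricks mentioned above can be sketched as follows. The thresholds and table size here are illustrative, not values from the paper:

```python
# Two sparsity-reduction tricks (hypothetical thresholds):
# (1) remap IDs seen fewer than `min_freq` times to one shared
#     out-of-vocabulary bucket, and
# (2) hash raw IDs into a smaller embedding table, trading
#     collisions for denser gradient updates per row.
from collections import Counter

def filter_low_freq(ids, min_freq=2, oov_id=0):
    counts = Counter(ids)
    return [i if counts[i] >= min_freq else oov_id for i in ids]

def hash_ids(ids, table_size=8):
    return [hash(i) % table_size for i in ids]

raw = [101, 101, 202, 303, 101, 202, 999]
print(filter_low_freq(raw))  # 303 and 999 appear once -> OOV bucket 0
print(hash_ids(raw))         # many raw IDs share rows of a small table
```

Both transformations make each embedding row correspond to more training examples, which is exactly the "reduced sparsity" regime in which the one-epoch drop weakens.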
To explain the behavior, the authors hypothesize a distribution shift: after the first epoch, the MLP rapidly adapts to a transformed embedding distribution (denoted \(\Delta\)), creating a mismatch between training and test data that triggers sudden overfitting. This is supported by measuring A‑distance between training and test embeddings, which spikes at epoch 2, and by observing a sudden increase in the norm of MLP parameter updates while embedding updates remain modest.
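The A-distance diagnostic can be approximated with a proxy: train a small classifier to distinguish the two embedding populations and convert its error into a distance, \(\hat{d}_A = 2(1 - 2\,\epsilon)\). The sketch below uses synthetic Gaussian data and a hand-rolled logistic regression as the domain classifier; it is an illustration of the measurement, not the authors' exact procedure:

```python
# Proxy A-distance sketch: d_A = 2 * (1 - 2 * err), where err is the
# error of a classifier separating the two embedding distributions.
# Easily separable distributions (low err) -> large distance, mirroring
# the spike the paper observes at the start of epoch 2.
import numpy as np

def proxy_a_distance(x_a, x_b, steps=500, lr=0.1):
    X = np.vstack([x_a, x_b])
    y = np.concatenate([np.zeros(len(x_a)), np.ones(len(x_b))])
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):  # plain gradient descent on logistic loss
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        g = p - y
        w -= lr * X.T @ g / len(y)
        b -= lr * g.mean()
    err = np.mean((p > 0.5) != y)
    return 2.0 * (1.0 - 2.0 * err)

rng = np.random.default_rng(0)
# Same distribution -> near-chance classifier -> distance near 0.
same = proxy_a_distance(rng.normal(0, 1, (200, 4)), rng.normal(0, 1, (200, 4)))
# Shifted distribution -> separable -> distance near the maximum of 2.
shifted = proxy_a_distance(rng.normal(0, 1, (200, 4)), rng.normal(3, 1, (200, 4)))
```

A sudden jump in this quantity between training and test embeddings is the signature of the hypothesized distribution shift.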
Further validation includes fine‑tuning experiments where freezing the embedding while updating the MLP reproduces the phenomenon, whereas freezing the MLP while updating the embedding does not.
In conclusion, the one‑epoch phenomenon is widespread in industrial CTR systems and is primarily driven by model architecture, optimizer aggressiveness, and feature sparsity. Training beyond a single epoch rarely yields gains, explaining why production pipelines often perform a single pass over data. The study suggests future work on mitigation strategies and extending the analysis to other user‑behavior prediction tasks.
Alimama Tech