AntM2C: A Large-Scale Multi‑Scenario Multi‑Modal CTR Prediction Dataset from Alipay

AntM2C is a publicly released, billion‑sample click‑through‑rate (CTR) dataset covering five distinct Alipay business scenarios, providing both ID and rich multi‑modal (text and image) features to enable comprehensive evaluation of multi‑scenario, cold‑start, and multi‑modal CTR models at industrial scale.

AntTech

Click‑through‑rate (CTR) prediction is critical for recommendation systems because it shapes both user experience and platform revenue. Existing public CTR datasets are limited in the number of scenarios they cover, in the modalities they include, and in scale. To address this, Ant Group released the AntM2C dataset, derived from real Alipay industrial data.

The full AntM2C corpus contains 1 billion CTR samples across five typical Alipay business scenarios (advertising, coupons, mini‑programs, content, and video). Each sample includes ID features and multi‑modal features such as text and images, with more than 200 feature columns in total. The authors describe it as the largest publicly available CTR dataset.

The first open‑source phase provides 10 million samples (29 ID features, 2 text features); the full billion‑sample corpus is planned for later release. The dataset has been de‑identified and encrypted to protect privacy and is intended solely for academic research.

Benchmark tasks include multi‑scenario CTR modeling, cold‑start CTR modeling (few‑shot and zero‑shot), and multi‑modal CTR modeling. AUC (area under the ROC curve) is the evaluation metric. Results show that multi‑task models outperform a simple DNN trained on mixed data from all scenarios, and that incorporating text modalities improves performance in sparse scenarios, which indicates that the dataset can differentiate model capabilities.
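As an illustration of how AUC might be reported per scenario in a multi‑scenario benchmark, the following is a minimal sketch in plain Python. It is not the benchmark's own evaluation code: the function names `auc` and `auc_by_scenario`, the scenario labels, and the `(scenario, label, score)` row layout are assumptions made for this example.

```python
from collections import defaultdict


def auc(labels, scores):
    """AUC via the rank-sum (Mann-Whitney U) formulation.

    Tied scores receive the average of the ranks they span.
    Returns None when only one class is present (AUC is undefined).
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        return None

    rank_sum_pos = 0.0
    i = 0
    while i < n:
        # Find the block [i, j) of samples sharing the same score.
        j = i
        while j < n and pairs[j][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of 1-based ranks i+1 .. j
        positives_in_block = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_block
        i = j

    u_statistic = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_statistic / (n_pos * n_neg)


def auc_by_scenario(rows):
    """Group (scenario, label, score) rows and compute one AUC per scenario."""
    grouped = defaultdict(lambda: ([], []))
    for scenario, label, score in rows:
        grouped[scenario][0].append(label)
        grouped[scenario][1].append(score)
    return {s: auc(ys, ps) for s, (ys, ps) in grouped.items()}


if __name__ == "__main__":
    # Hypothetical predictions; scenario names are placeholders.
    rows = [
        ("advertising", 0, 0.1), ("advertising", 1, 0.9),
        ("video", 1, 0.2), ("video", 0, 0.7),
    ]
    print(auc_by_scenario(rows))
```

Reporting one AUC per scenario, rather than a single pooled value, keeps a high‑traffic scenario from masking weak performance in a sparse one.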

Statistical analysis reveals long‑tail distributions for both users and items, with overlapping users across scenarios, reflecting realistic industrial conditions. The dataset also provides auxiliary features such as timestamps and scenario identifiers to facilitate flexible train/validation/test splits.
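The timestamp and scenario identifier fields make it possible to build splits for the different benchmark tasks. The sketch below shows one way this could look, assuming each sample is a dict with hypothetical `timestamp` and `scenario` keys. The actual column names and the cut‑off values used in the benchmark are not given here, so treat all of them as placeholders.

```python
def split_by_time(samples, valid_start, test_start):
    """Chronological split: train < valid_start <= valid < test_start <= test."""
    train, valid, test = [], [], []
    for sample in samples:
        ts = sample["timestamp"]  # hypothetical field name
        if ts < valid_start:
            train.append(sample)
        elif ts < test_start:
            valid.append(sample)
        else:
            test.append(sample)
    return train, valid, test


def hold_out_scenario(samples, scenario):
    """Zero-shot style split: one scenario is withheld entirely from training."""
    seen = [s for s in samples if s["scenario"] != scenario]
    unseen = [s for s in samples if s["scenario"] == scenario]
    return seen, unseen


if __name__ == "__main__":
    # Toy samples; timestamps are arbitrary integers for illustration.
    samples = [
        {"timestamp": 1, "scenario": "coupons"},
        {"timestamp": 5, "scenario": "video"},
        {"timestamp": 8, "scenario": "coupons"},
        {"timestamp": 9, "scenario": "content"},
    ]
    train, valid, test = split_by_time(samples, valid_start=5, test_start=9)
    print(len(train), len(valid), len(test))
    seen, unseen = hold_out_scenario(samples, "video")
    print(len(seen), len(unseen))
```

A chronological split avoids training on interactions that occur after the test period, while the scenario hold‑out approximates the zero‑shot cold‑start setting.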

AntM2C aims to fill the gap in multi‑scenario, multi‑modal CTR research and invites the community to contribute to its ongoing expansion and benchmark development.
