
Category-Specific CNN for Visual-Aware Click‑Through Rate Prediction at JD.com

The paper introduces a Category‑Specific Convolutional Neural Network (CSCNN) that jointly leverages product category information and visual features of e‑commerce images to improve click‑through‑rate (CTR) prediction, detailing its architecture, training strategy, large‑scale experiments, and significant performance gains in JD.com’s advertising system.

JD Retail Technology

Click‑through‑rate (CTR) prediction is a core challenge in AI‑driven recommendation, search, and advertising systems. JD.com’s advertising team presented a Category‑Specific CNN (CSCNN) that incorporates rich product category metadata together with product images as inputs to a CNN feature extractor, dramatically improving CTR estimation accuracy.

Their advertising platform evolved from shallow FM models (2015) to LR‑DNN (2016) and later migrated to TensorFlow‑based pipelines (2017). Since 2018, the focus has shifted to models that embed business understanding, addressing issues such as model scale, learning efficiency, and real‑time parameter updates.

Traditional visual‑aware CTR models use a post‑fusion approach: a generic CNN (e.g., Inception, ResNet) extracts image features, which are later combined with non‑visual features. This suffers from two problems: (1) CNN inference is computationally heavy, becoming a bottleneck for low‑latency online serving; (2) the CNN ignores explicit product‑category information that is readily available in e‑commerce, limiting its ability to extract category‑relevant visual cues.
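The post‑fusion baseline can be sketched as follows. This is a minimal numpy illustration, not any production implementation: the function names, shapes, and the random projections standing in for trained layers are all invented for clarity. The key point is that the CNN never sees the category, and fusion happens only at the end.

```python
import numpy as np

rng = np.random.default_rng(0)

def generic_cnn(image):
    """Stand-in for a generic pre-trained CNN (e.g. Inception/ResNet):
    image -> embedding. Here just a random projection, for illustration."""
    w = rng.standard_normal((image.size, 64))
    return image.reshape(-1) @ w

def post_fusion_ctr(image, non_visual):
    """Late fusion: visual features are extracted category-agnostically,
    then concatenated with non-visual features only before the CTR head."""
    visual = generic_cnn(image)                      # category never used here
    fused = np.concatenate([visual, non_visual])     # post-fusion
    logit = fused @ rng.standard_normal(fused.size)  # stand-in CTR head
    return 1.0 / (1.0 + np.exp(-logit))              # sigmoid -> CTR estimate

ctr = post_fusion_ctr(rng.standard_normal((32, 32, 3)), rng.standard_normal(16))
```

Because `generic_cnn` runs per request, this design puts the heavy CNN on the serving path, which is exactly the latency bottleneck the article describes.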

CSCNN solves these issues by feeding both the product image and its category label into each convolutional layer. Inspired by SE‑Net and CBAM, the network adds a category‑aware channel‑attention module (Mc) and a spatial‑attention module (Ms) after every convolution. Category vectors are concatenated with pooled features, transformed via fully‑connected layers, and used to re‑weight feature maps, enabling the network to learn visual patterns specific to each category.
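The category‑conditioned re‑weighting described above can be sketched in a few lines of numpy. This is a simplified illustration under stated assumptions, not JD's implementation: the embedding table, layer sizes, and the random matrices standing in for learned FC layers are invented. A category embedding is concatenated with the pooled feature map to produce per‑channel gates (Mc), and a category‑conditioned bias enters the per‑pixel gates (Ms).

```python
import numpy as np

rng = np.random.default_rng(42)

C, H, W, E = 8, 4, 4, 3                    # channels, height, width, embed dim
cat_table = rng.standard_normal((10, E))   # one vector per product category

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fmap, cat_vec):
    """Mc: global-pool the map, concat the category vector, FC -> channel gates."""
    pooled = fmap.mean(axis=(1, 2))              # (C,)
    z = np.concatenate([pooled, cat_vec])        # (C+E,)
    w = rng.standard_normal((C + E, C))          # stand-in learned FC layer
    return sigmoid(z @ w)                        # (C,) per-channel weights

def spatial_attention(fmap, cat_vec):
    """Ms: pool across channels, add a category-conditioned bias, gate pixels."""
    pooled = fmap.mean(axis=0)                   # (H, W)
    bias = cat_vec @ rng.standard_normal((E, 1)) # stand-in category projection
    return sigmoid(pooled + bias)                # (H, W) per-pixel weights

fmap = rng.standard_normal((C, H, W))            # output of one conv layer
cat_vec = cat_table[3]                           # e.g. category id 3
out = fmap * channel_attention(fmap, cat_vec)[:, None, None]
out = out * spatial_attention(out, cat_vec)[None, :, :]
```

In CSCNN this pair of modules is inserted after every convolution, so each layer's feature maps are re‑weighted with the category in view, rather than once at the end.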

The system architecture integrates CSCNN with a Deep & Cross Network CTR model. Offline training uses a sampling strategy that groups 25 ads sharing the same image into one batch, so a single CNN forward pass serves the whole group; joint training on 150 billion impressions and 1.77 billion images completes within a day. Visual embeddings are then pre‑computed into a 20 GB lookup table covering 90% of next‑day traffic. Online, the serving layer retrieves these embeddings and combines them with other features, achieving sub‑20 ms 99th‑percentile latency at 3 million requests per second.
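The amortization trick above can be sketched as follows. All names and shapes here are illustrative assumptions: `cnn_embed` stands in for the expensive CSCNN forward pass, and the dict stands in for the pre‑computed embedding table. Grouping impressions by image means one forward pass per unique image offline, and serving reduces to a cheap table lookup.

```python
import numpy as np

rng = np.random.default_rng(7)

def cnn_embed(image):
    """Stand-in for the CSCNN forward pass (the expensive part)."""
    return image.reshape(-1)[:8].copy()

# Offline: many ad impressions share the same creative image, so batching
# by image yields one CNN forward pass per unique image, not per impression.
images = {"img_a": rng.standard_normal((4, 4)), "img_b": rng.standard_normal((4, 4))}
impressions = ["img_a"] * 25 + ["img_b"] * 25    # 50 impressions, 2 unique images

lookup = {key: cnn_embed(img) for key, img in images.items()}  # 2 forward passes

# Online: serving never runs the CNN; it only reads the pre-computed table.
def serve(image_key, other_features):
    visual = lookup[image_key]                    # pre-computed visual embedding
    return np.concatenate([visual, other_features])

features = serve(impressions[0], np.ones(4))
```

Images missing from the table (the remaining ~10% of traffic) would need a fallback, e.g. a default embedding, which this sketch omits.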

Experiments on the Amazon benchmark (using a linear BPR CTR model) and on JD.com’s industrial dataset (150 billion samples) show that CSCNN consistently outperforms state‑of‑the‑art baselines, improving AUC and online CTR. The method also demonstrates robustness across different attention mechanisms and backbone CNNs.

In conclusion, the Category‑Specific CNN effectively fuses category priors with visual features, yielding significant CTR gains and now powers JD.com’s mainstream search advertising traffic for hundreds of millions of active users.

e-commerce, advertising, deep learning, CTR prediction, category-specific CNN, visual features
Written by

JD Retail Technology

Official platform of JD Retail Technology, delivering insightful R&D news and a deep look into the lives and work of technologists.
