CapOnImage: Context-driven Dense Captioning on Images
The paper presents CapOnImage, a novel image‑on‑image captioning task that generates location‑specific decorative text for product images. It introduces the 2.1‑million‑image CapOnImage2M dataset and proposes a mixed‑modality transformer with position‑aware pre‑training and progressive training. The model outperforms baselines in both accuracy and diversity, and it is already deployed in Alibaba’s advertising platforms, where the authors report business gains.
This paper introduces a new task called "image-on-image caption generation" (CapOnImage), which aims to generate decorative textual captions for specific locations on product images to enhance advertising effectiveness.
Existing captioning systems generate text that is not tied to specific image regions, which limits their use in advertising scenarios. To address this, the authors construct a large-scale dataset, CapOnImage2M, containing 2.1 million product images with titles, attributes, and location-specific captions.
The proposed model leverages multimodal context—including image content, product metadata, layout coordinates, and neighboring box information—through a mixed-modality transformer that generates captions autoregressively. Several position-aware pre‑training tasks (Level‑I, Level‑II, Level‑III) and a progressive training strategy are designed to help the model understand spatial relationships.
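To make the layout part of that context concrete, here is a minimal sketch of how a target text box and its neighboring boxes might be turned into location features before being passed to a model. It is an illustration under stated assumptions, not the authors' implementation: the `Box` class, the helper names, the choice of center-distance to pick neighbors, and the default `k=2` are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned text box in pixel coordinates (hypothetical structure)."""
    x1: float
    y1: float
    x2: float
    y2: float

    def center(self) -> Tuple[float, float]:
        return ((self.x1 + self.x2) / 2.0, (self.y1 + self.y2) / 2.0)


def normalize(box: Box, width: float, height: float) -> List[float]:
    """Scale coordinates to [0, 1] so layouts are comparable across image sizes."""
    return [box.x1 / width, box.y1 / height, box.x2 / width, box.y2 / height]


def nearest_neighbors(target: Box, others: List[Box], k: int = 2) -> List[Box]:
    """Return the k boxes whose centers lie closest to the target's center.

    Assumption: center distance is used only for illustration; the paper's
    actual neighbor definition may differ.
    """
    tx, ty = target.center()

    def squared_distance(box: Box) -> float:
        cx, cy = box.center()
        return (cx - tx) ** 2 + (cy - ty) ** 2

    return sorted(others, key=squared_distance)[:k]


def layout_context(target: Box, others: List[Box],
                   width: float, height: float, k: int = 2) -> Dict[str, list]:
    """Assemble the location features: the target box, then its neighbors."""
    neighbors = nearest_neighbors(target, others, k)
    return {
        "target": normalize(target, width, height),
        "neighbors": [normalize(b, width, height) for b in neighbors],
    }


if __name__ == "__main__":
    target = Box(100, 100, 300, 200)
    others = [Box(0, 0, 50, 50), Box(320, 100, 400, 200), Box(800, 400, 1000, 500)]
    print(layout_context(target, others, width=1000, height=500, k=1))
```

In a full model, these normalized coordinates would typically be embedded and combined with the image and product-text tokens; that step is omitted here because the summary does not describe it in detail.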
Experiments show that the model outperforms baseline image‑text description methods in both accuracy and diversity. Ablation studies confirm the effectiveness of each pre‑training task and the progressive training scheme. Visualizations demonstrate that generated captions align well with the intended image regions.
The work has already been deployed in Alibaba’s advertising platforms (e.g., homepage focus slots and recommendation feeds), yielding significant business gains. The authors anticipate future end‑to‑end text rendering without separate layout prediction modules.
Alimama Tech
Official Alimama tech channel, showcasing all of Alimama's technical innovations.