
GCN‑LSTM Image Captioning Model by JD AI Research Institute

JD AI Research Institute presented a GCN‑LSTM encoder‑decoder system that integrates semantic and spatial relationships between objects via graph convolutional networks, significantly improving image captioning performance and achieving state‑of‑the‑art results on the COCO benchmark.

JD Tech

Human‑level image description requires not only detecting objects but also understanding the relationships among them. While modern AI can recognize objects accurately, capturing inter‑object connections for comprehensive captions remains challenging.

At the ECCV 2018 conference, JD AI Research Institute introduced a novel approach that combines computer vision and natural language processing. Their GCN‑LSTM system encodes both semantic and spatial relationships between objects using a graph convolutional network (GCN) and then decodes the enriched features with a two‑layer long short‑term memory (LSTM) network to generate vivid, accurate captions.

The model consists of three modules: (1) an object detection module that extracts region‑level features for each detected object; (2) a GCN‑based image encoder that processes a semantic relationship graph and a spatial relationship graph, embedding pairwise object relations into the region features; (3) an LSTM decoder that transforms the enriched region features into natural‑language sentences.
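The encoding step at the heart of module (2) can be sketched as a single graph‑convolution pass over detected region features. The shapes, the single layer, and the toy adjacency below are illustrative assumptions, not the paper's exact configuration:

```python
# Minimal sketch of the GCN encoding step, assuming region features
# have already been extracted by an object detector. Dimensions and
# the single-layer GCN are illustrative, not the paper's exact setup.
import numpy as np

def gcn_layer(features, adjacency, weights):
    """One graph-convolution step: each region feature is updated by
    averaging its neighbours' features along the relation graph."""
    # Row-normalise the adjacency so aggregation is an average.
    deg = adjacency.sum(axis=1, keepdims=True)
    norm_adj = adjacency / np.maximum(deg, 1)
    return np.maximum(norm_adj @ features @ weights, 0)  # ReLU

rng = np.random.default_rng(0)
num_regions, feat_dim = 5, 8
features = rng.standard_normal((num_regions, feat_dim))
# Toy relation graph: self-loops plus one pairwise relation edge.
adjacency = np.eye(num_regions)
adjacency[0, 1] = adjacency[1, 0] = 1
weights = rng.standard_normal((feat_dim, feat_dim))

relation_aware = gcn_layer(features, adjacency, weights)
print(relation_aware.shape)  # (5, 8)
```

The relation‑aware features keep the same shape as the input region features, so they can be fed to the LSTM decoder in place of the raw detector outputs.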

Semantic relationship graphs are built by classifying object pairs with a learned semantic relation classifier, while spatial graphs are constructed from eleven predefined spatial relations (including containment, overlap, and eight angular relations). These graphs are then fed into the GCN to produce relationship‑aware feature representations.
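The spatial graph construction can be illustrated by assigning one of the eleven relations to an ordered pair of bounding boxes: containment in either direction, overlap, or one of eight angular bins. The relation names, the IoU threshold, and the 45° binning below are illustrative assumptions:

```python
# Hedged sketch of assigning one of eleven spatial relations to an
# ordered box pair (x0, y0, x1, y1): "cover", "inside", "overlap",
# or one of eight angular bins. Thresholds and names are assumptions.
import math

def iou(a, b):
    """Intersection-over-union of two boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix1 - ix0) * max(0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def spatial_relation(a, b, iou_thresh=0.5):
    if a[0] <= b[0] and a[1] <= b[1] and a[2] >= b[2] and a[3] >= b[3]:
        return "cover"    # a fully contains b
    if b[0] <= a[0] and b[1] <= a[1] and b[2] >= a[2] and b[3] >= a[3]:
        return "inside"   # a lies fully inside b
    if iou(a, b) >= iou_thresh:
        return "overlap"
    # Otherwise bin the centre-to-centre angle into 8 directions.
    cax, cay = (a[0] + a[2]) / 2, (a[1] + a[3]) / 2
    cbx, cby = (b[0] + b[2]) / 2, (b[1] + b[3]) / 2
    angle = math.degrees(math.atan2(cby - cay, cbx - cax)) % 360
    return f"angle_bin_{int(angle // 45)}"  # bins 0..7

print(spatial_relation((0, 0, 10, 10), (2, 2, 4, 4)))  # cover
```

Running the classifier over every ordered region pair yields the edges of the spatial graph; the semantic graph is built analogously, but with a learned classifier in place of these geometric rules.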

Experimental results on the COCO test set show that incorporating both semantic and spatial relations raises the CIDEr‑D score from 120.1% (achieved by the previous Up‑Down model) to 128.7%. Qualitative examples demonstrate that GCN‑LSTM can highlight specific interactions, such as "kids eating dessert", that baseline LSTM or Up‑Down models miss.

The technology not only advances high‑level semantic understanding of images but also enables applications like automatic generation of descriptive titles, advertising copy, or poetic narratives for e‑commerce, logistics, and finance scenarios. JD AI plans to integrate this capability across its full value‑chain services.

References: [1] Ting Yao, Yingwei Pan, Yehao Li, Tao Mei. "Exploring Visual Relationship for Image Captioning." ECCV 2018. [2] Anderson et al. "Bottom‑up and top‑down attention for image captioning and visual question answering." CVPR 2018.

Tags: multimodal AI, computer vision, image captioning, LSTM, graph convolutional network, COCO dataset
Written by

JD Tech

Official JD technology sharing platform. All the cutting‑edge JD tech, innovative insights, and open‑source solutions you’re looking for, all in one place.
