
Lenna: Language‑Enhanced Reasoning Detection Assistant and a Chain‑of‑Thought Image Editing Framework Using Multimodal Large Language Models

At ICASSP 2025, Gaode’s two accepted papers present Lenna, a language‑enhanced reasoning detection assistant that adds a DET token to multimodal LLMs and achieves state‑of‑the‑art accuracy on RefCOCO benchmarks, and a chain‑of‑thought image‑editing framework that converts complex prompts into segmented masks and repair prompts for diffusion‑based inpainting, surpassing existing methods.

Amap Tech

ICASSP (International Conference on Acoustics, Speech, and Signal Processing) is the flagship annual conference of the IEEE Signal Processing Society, covering the latest research in acoustics, speech, and signal processing. The 50th edition (2025), themed "Celebrating Signal Processing," covers topics such as speech recognition, speech synthesis, speech enhancement, natural language processing, and machine learning. Two papers from the Gaode (Amap) technology team were accepted into the conference proceedings.

Paper 1 – Lenna: Language Enhanced Reasoning Detection Assistant

Technical fields: multimodal large models, reasoning detection. The paper introduces Lenna, a language‑enhanced reasoning detection assistant that couples a multimodal large language model (MLLM) with an open‑set detector (Grounding‑DINO). A special token <DET> extends the LLM vocabulary to express detection intent, enabling end‑to‑end reasoning‑augmented detection. The work also contributes ReasonDet, a new benchmark dataset for quantitatively evaluating the logical inference and intent‑detection performance of MLLMs. Experiments show that Lenna achieves lower training cost and higher accuracy than previous MLLM approaches on both referring expression comprehension (REC) and ReasonDet.
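To make the vocabulary-extension step concrete, the toy sketch below mimics what adding a <DET> token involves: registering a new token id and growing the embedding table by one trainable row. The dict vocabulary and NumPy table are illustrative stand-ins, not Lenna's actual implementation (with Hugging Face transformers this would correspond to `tokenizer.add_tokens` plus `model.resize_token_embeddings`).

```python
# Toy sketch: extending an LLM vocabulary with a special <DET> token.
# A plain dict stands in for the tokenizer and a NumPy array for the
# model's embedding table; both are illustrative assumptions.
import numpy as np

vocab = {"a": 0, "cat": 1, "detect": 2}      # existing vocabulary
emb = np.random.randn(len(vocab), 8)         # (vocab_size, hidden_dim)

# 1) Register the new special token at the next free id.
vocab["<DET>"] = len(vocab)

# 2) Grow the embedding table by one randomly initialised row; during
#    training this row's output embedding (h_det) learns to carry the
#    detection semantics.
emb = np.vstack([emb, np.random.randn(1, 8)])

det_id = vocab["<DET>"]
assert emb.shape[0] == len(vocab)
```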

The architecture combines the LLaVA multimodal LLM with Grounding‑DINO. Given an image x_i and a textual instruction t_i, the MLLM generates a textual response y_i. The embedding of <DET>, denoted h_det, captures both semantic and positional information about the target. The image and object caption are fed to the detector encoder to obtain enhanced image features f_img and text features f_txt. These three representations are processed by the MSQ (MLLM‑guided query selection) module, which applies cross‑attention (with h_det as the keys and values) and a similarity‑based selection to align features across modalities. The merged h_det is then injected into each decoder cross‑attention layer to produce the final position predictions.
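The query-selection idea above can be sketched roughly as follows. This is a simplified single-head, NumPy-only illustration of the mechanism described (detector features attend to h_det, then the queries most similar to h_det are kept); the exact shapes, fusion rule, and selection criterion in the paper may differ.

```python
# Hedged sketch of MLLM-guided query selection: image features attend
# to the <DET> embedding h_det (used as key and value), and the fused
# queries most similar to h_det are selected for the decoder.
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def msq_select(f_img, h_det, num_queries=4):
    """f_img: (N, d) enhanced image features; h_det: (1, d) <DET> embedding."""
    d = f_img.shape[1]
    # Cross-attention with f_img as queries, h_det as key and value.
    attn = softmax(f_img @ h_det.T / np.sqrt(d), axis=0)   # (N, 1)
    fused = f_img + attn * h_det                           # inject h_det
    # Similarity-based selection: keep queries best aligned with h_det.
    sim = (fused @ h_det.T).squeeze(-1)                    # (N,)
    top = np.argsort(sim)[-num_queries:]
    return fused[top]                                      # (num_queries, d)

rng = np.random.default_rng(0)
queries = msq_select(rng.normal(size=(16, 32)), rng.normal(size=(1, 32)))
assert queries.shape == (4, 32)
```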

The training objectives comprise a language‑modeling loss L_tok and detection losses L_det (an L1 plus GIoU loss for bounding‑box regression and a contrastive loss for classification); the overall loss is a weighted combination of these terms (see equations 4‑7 in the original paper).
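Based on that description, the objective has roughly the shape below; the weighting coefficients λ are placeholders here, with the exact definitions given in equations 4‑7 of the paper:

```latex
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{tok}} \;+\; \lambda_{\mathrm{det}}\,\mathcal{L}_{\mathrm{det}},
\qquad
\mathcal{L}_{\mathrm{det}} \;=\; \lambda_{\mathrm{box}}\bigl(\mathcal{L}_{L1} + \mathcal{L}_{\mathrm{GIoU}}\bigr)
\;+\; \lambda_{\mathrm{cls}}\,\mathcal{L}_{\mathrm{contrastive}}
```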

Experimental results on RefCOCO, RefCOCO+, and RefCOCOg show that Lenna outperforms state‑of‑the‑art methods, beating the previous best model, MiniGPT‑v2, by as much as 47.37% and exceeding 85.50% overall accuracy.

Paper 2 – Enhancing Image Editing with Chain‑of‑Thought Reasoning and Multimodal Large Language Models

Technical fields: image editing. This work proposes a novel image‑editing framework that leverages Chain‑of‑Thought (CoT) reasoning and the localization capability of multimodal large language models (MLLMs) to guide diffusion models. The CoT process decomposes a complex user instruction into simpler sub‑prompts, each associated with a segmentation token [SEG] from which a mask M is generated, along with a corresponding repair prompt P. The original image, mask M, and repair prompt P are then fed to a powerful inpainting model to produce the final edit.

The pipeline consists of: (1) feeding the image and complex prompt into the MLLM; (2) obtaining a CoT sequence that breaks the prompt into simpler steps; (3) generating segmentation masks via the [SEG] token; (4) applying the masks and repair prompts to an inpainting model. This architecture enables precise understanding of intricate instructions and accurate region‑level editing.
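The four stages above can be sketched as the loop below. The MLLM, segmenter, and inpainter are stubbed as plain callables; all names (`CoTStep`, `edit_image`, etc.) are invented for illustration and do not correspond to the paper's code or any specific library.

```python
# Illustrative sketch of the CoT editing pipeline: decompose the complex
# prompt, derive a mask per [SEG] token, then inpaint with the repair
# prompt. Dependencies are injected as callables so the flow is testable.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class CoTStep:
    seg_token: str        # per-step [SEG] token emitted by the MLLM
    repair_prompt: str    # repair prompt P for that region

def edit_image(image,
               complex_prompt: str,
               decompose: Callable[[object, str], List[CoTStep]],
               segment: Callable[[object, str], object],
               inpaint: Callable[[object, object, str], object]):
    # (1)+(2): the MLLM breaks the complex prompt into simpler CoT steps.
    steps = decompose(image, complex_prompt)
    for step in steps:
        # (3): each [SEG] token yields a mask M over the region to edit.
        mask = segment(image, step.seg_token)
        # (4): the inpainting model repairs the masked region using P.
        image = inpaint(image, mask, step.repair_prompt)
    return image
```

Injecting the three stages as callables keeps the control flow (the part the paper describes) separate from the heavyweight models, so the loop can be exercised with stubs.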

Extensive experiments show that the proposed framework surpasses existing state‑of‑the‑art image‑editing models in both qualitative comparisons and quantitative metrics, particularly in compliance with complex instructions and in image consistency before and after editing.

Tags: computer vision, AI, Chain-of-Thought, ICASSP, multimodal LLM, image editing, reasoning detection
Written by

Amap Tech

Official Amap technology account showcasing all of Amap's technical innovations.
