Tag: multimodal LLM


Amap Tech
Apr 21, 2025 · Artificial Intelligence

Lenna: Language‑Enhanced Reasoning Detection Assistant and a Chain‑of‑Thought Image Editing Framework Using Multimodal Large Language Models

At ICASSP 2025, Amap (Gaode) presents two accepted papers: Lenna, a language‑enhanced reasoning detection assistant that adds a DET token to multimodal LLMs and achieves state‑of‑the‑art accuracy on the RefCOCO benchmarks; and a chain‑of‑thought image‑editing framework that converts complex editing prompts into segmentation masks and repair prompts for diffusion‑based inpainting, outperforming existing methods.

AI · Chain-of-Thought · ICASSP
0 likes · 10 min read
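The chain‑of‑thought editing framework in the summary above decomposes a complex prompt, grounds each edit to a mask, and hands mask plus repair prompt to a diffusion inpainter. A minimal sketch of that three‑stage flow, with all models stubbed and all function names illustrative rather than from the paper:

```python
# Hypothetical sketch of a chain-of-thought image-editing pipeline.
# The MLLM, segmentation model, and diffusion inpainter are all stubs.

def decompose_instruction(prompt: str) -> list[dict]:
    """Step 1: an MLLM splits a complex edit prompt into atomic edits.
    Stubbed with a trivial rule-based split on ' and '."""
    return [{"target": p.strip(), "op": "edit"} for p in prompt.split(" and ")]

def ground_mask(image: str, target: str) -> str:
    """Step 2: a segmentation model localizes the region to edit."""
    return f"mask({target})"

def inpaint(image: str, mask: str, repair_prompt: str) -> str:
    """Step 3: a diffusion model repaints only the masked region."""
    return f"{image} | {mask} <- {repair_prompt}"

def edit(image: str, prompt: str) -> str:
    # Each atomic edit is applied in sequence, chain-of-thought style.
    for step in decompose_instruction(prompt):
        mask = ground_mask(image, step["target"])
        image = inpaint(image, mask, step["target"])
    return image

print(edit("photo.jpg", "remove the car and brighten the sky"))
```

The key design point the summary highlights is that the MLLM never paints pixels itself: it only produces the intermediate mask/prompt pairs that a conventional inpainting model consumes.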
Snowball Engineer Team
Mar 31, 2025 · Frontend Development

Leveraging Multimodal Large Language Models for Frontend Automated Testing (NL2Test)

This article explores how multimodal large language models (MM‑LLMs) combined with structured prompt engineering can transform frontend regression testing by enabling natural‑language‑driven test case generation, visual verification, and script self‑healing, thereby reducing maintenance costs and improving coverage across dynamic UI scenarios.

AI Automation · Frontend Testing · NL2Test
0 likes · 17 min read
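The NL2Test summary above combines structured prompts with natural‑language test intent and then executes the model's reply. A small sketch of what that contract could look like, assuming the model is asked to return JSON steps; the template, field names, and validation are illustrative, not from the article:

```python
import json

# Hypothetical NL2Test-style prompt assembly and reply validation.
PROMPT_TEMPLATE = """You are a frontend test generator.
Screenshot: {screenshot}
DOM snippet: {dom}
Test intent: {intent}
Reply with JSON: [{{"action": ..., "selector": ..., "expect": ...}}]"""

def build_prompt(intent: str, dom: str, screenshot: str) -> str:
    """Pack intent plus UI context into one structured prompt."""
    return PROMPT_TEMPLATE.format(intent=intent, dom=dom, screenshot=screenshot)

def parse_steps(llm_reply: str) -> list[dict]:
    """Validate the model's JSON reply before handing it to a test runner,
    so a malformed generation fails fast instead of producing a flaky test."""
    steps = json.loads(llm_reply)
    for step in steps:
        if not {"action", "selector", "expect"} <= step.keys():
            raise ValueError(f"incomplete step: {step}")
    return steps

# Mocked model reply, standing in for a real MM-LLM call:
reply = '[{"action": "click", "selector": "#login", "expect": "url contains /home"}]'
print(parse_steps(reply)[0]["action"])
```

Self‑healing, as described in the summary, would slot in at the validation step: when a selector no longer matches, the current DOM and screenshot are fed back through `build_prompt` to regenerate the step.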
JD Tech
Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR Optimization · Reward Model · Advertising Image Generation
0 likes · 10 min read
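The CAIG summary above trains a reward model on CTR‑labeled image pairs. The standard pairwise (Bradley‑Terry) objective for such a model can be written in a few lines; this is a generic sketch of that loss, not code from the paper:

```python
import math

def pairwise_loss(score_win: float, score_lose: float) -> float:
    """-log sigmoid(r_win - r_lose): pushes the reward model to score
    the higher-CTR image of a pair above the lower-CTR one."""
    margin = score_win - score_lose
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs low loss, a mis-ordered pair high loss:
print(round(pairwise_loss(2.0, 0.5), 4))  # -> 0.2014
print(round(pairwise_loss(0.5, 2.0), 4))  # -> 1.7014
```

The preference‑optimization stage the summary mentions then fine‑tunes the generator against this reward, so the generation model is steered toward images the reward model predicts will earn more clicks.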
AntTech
Mar 14, 2025 · Artificial Intelligence

MP-GUI: Modality Perception with Multimodal Large Language Models for GUI Understanding

The CVPR 2025 paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding" presents a novel algorithm that enhances multimodal large language models' ability to perceive and reason about graphical user interfaces by integrating text, visual, and spatial signals through specialized perception modules and a dynamic fusion gate, achieving state‑of‑the‑art performance on multiple GUI benchmarks.

CVPR 2025 · GUI Understanding · MLLM
0 likes · 5 min read
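The dynamic fusion gate mentioned in the MP-GUI summary weighs the text, visual, and spatial perception signals per input. A toy sketch of that gating idea, with fixed gate logits and tiny feature vectors standing in for learned components (all shapes and values illustrative):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(text_feat, vis_feat, spa_feat, gate_logits):
    """The gate produces one weight per modality; the fused feature is
    their convex combination, so the model can lean on whichever signal
    is most informative for the current GUI element."""
    w = softmax(gate_logits)
    return [w[0] * t + w[1] * v + w[2] * s
            for t, v, s in zip(text_feat, vis_feat, spa_feat)]

fused = fuse([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], gate_logits=[2.0, 1.0, 0.0])
print([round(x, 3) for x in fused])
```

In the paper the gate logits would themselves be predicted from the input rather than fixed, which is what makes the fusion "dynamic".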
JD Retail Technology
Mar 14, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models

The paper presents CAIG, a CTR‑driven advertising image generation pipeline that pre‑trains a multimodal LLM on e‑commerce data, trains a reward model on CTR‑labeled image pairs, and fine‑tunes generation via product‑centric preference optimization, achieving state‑of‑the‑art online and offline performance.

AI · Ad Image Generation · CTR
0 likes · 11 min read
Xiaohongshu Tech REDtech
Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Medusa · PPO · PRM
0 likes · 21 min read
DataFunSummit
Nov 1, 2024 · Artificial Intelligence

Progress in Multimodal Large Language Models: Background, Architecture, Evolution, Team Work, and Future Outlook

This article reviews recent advances in multimodal large language models, covering their background, architectural components, training strategies, application scenarios, evaluation benchmarks, team research on hallucination mitigation and long‑video understanding, and outlines promising future research directions.

Large Language Models · Model Architecture · Vision-Language
0 likes · 15 min read
360 Tech Engineering
Jun 25, 2023 · Artificial Intelligence

Visual Capability as a Fundamental Requirement for AGI and the SEEChat Multimodal Dialogue Model

The article reviews why visual ability is essential for artificial general intelligence, compares native multimodal training with expert‑stitching integration approaches, and details the architectures of models such as KOSMOS‑1, PaLM‑E, Flamingo, BLIP‑2, LLaVA, and MiniGPT‑4. It then introduces the SEEChat project, which fuses a CLIP vision encoder with ChatGLM‑6B via a projection layer, and presents its training pipeline, experimental results, and future directions.

AGI · Model Fusion · SEEChat
0 likes · 13 min read
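The projection layer the SEEChat summary describes is, at its core, a single linear map that carries CLIP visual features into the LLM's token‑embedding space. A minimal sketch of that idea with tiny made‑up dimensions and weights (nothing here is taken from SEEChat's actual checkpoint):

```python
# Sketch of a vision-to-LLM projection layer: one linear map from the
# CLIP feature space (here 3-dim) to the LLM embedding space (here 2-dim).

def linear(x: list[float], W: list[list[float]], b: list[float]) -> list[float]:
    """y = Wx + b, written out explicitly for a row-per-output-dim W."""
    return [sum(xi * wij for xi, wij in zip(x, row)) + bj
            for row, bj in zip(W, b)]

clip_feat = [0.2, -0.1, 0.4]          # stand-in for a CLIP image feature
W = [[1.0, 0.0, 0.5],                  # illustrative 2x3 projection weights
     [0.0, 1.0, -0.5]]
b = [0.1, 0.0]

llm_token = linear(clip_feat, W, b)    # now lives in the LLM embedding space
print(llm_token)
```

Freezing both the vision encoder and the LLM and training only such a projection is what keeps this "expert‑stitching" style of integration cheap compared with native multimodal pretraining.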