Tag: multimodal LLM


Amap Tech
Apr 21, 2025 · Artificial Intelligence

Lenna: Language‑Enhanced Reasoning Detection Assistant and a Chain‑of‑Thought Image Editing Framework Using Multimodal Large Language Models

At ICASSP 2025, Amap (Gaode) presents two accepted papers: Lenna, a language‑enhanced reasoning detection assistant that adds a DET token to multimodal LLMs and achieves state‑of‑the‑art accuracy on the RefCOCO benchmarks; and a chain‑of‑thought image‑editing framework that converts complex editing prompts into segmentation masks and repair prompts for diffusion‑based inpainting, outperforming existing methods.

AI · Chain-of-Thought · ICASSP
0 likes · 10 min read
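The chain‑of‑thought editing framework in the summary above decomposes a complex prompt, grounds each edit to a mask, and hands mask plus repair prompt to a diffusion inpainter. A minimal sketch of that three‑stage flow, with all models stubbed and all function names illustrative rather than from the paper:

```python
# Hypothetical sketch of a chain-of-thought image-editing pipeline.
# The MLLM, segmentation model, and diffusion inpainter are all stubs.

def decompose_instruction(prompt: str) -> list[dict]:
    """Step 1: an MLLM splits a complex edit prompt into atomic edits.
    Stubbed with a trivial rule-based split on ' and '."""
    return [{"target": p.strip(), "op": "edit"} for p in prompt.split(" and ")]

def ground_mask(image: str, target: str) -> str:
    """Step 2: a segmentation model localizes the region to edit."""
    return f"mask({target})"

def inpaint(image: str, mask: str, repair_prompt: str) -> str:
    """Step 3: a diffusion model repaints only the masked region."""
    return f"{image} | {mask} <- {repair_prompt}"

def edit(image: str, prompt: str) -> str:
    # Each atomic edit is applied in sequence, chain-of-thought style.
    for step in decompose_instruction(prompt):
        mask = ground_mask(image, step["target"])
        image = inpaint(image, mask, step["target"])
    return image

print(edit("photo.jpg", "remove the car and brighten the sky"))
```

The key design point the summary highlights is that the MLLM never paints pixels itself: it only produces the intermediate mask/prompt pairs that a conventional inpainting model consumes.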
Snowball Engineer Team
Mar 31, 2025 · Frontend Development

Leveraging Multimodal Large Language Models for Frontend Automated Testing (NL2Test)

This article explores how multimodal large language models (MM‑LLMs) combined with structured prompt engineering can transform frontend regression testing by enabling natural‑language‑driven test case generation, visual verification, and script self‑healing, thereby reducing maintenance costs and improving coverage across dynamic UI scenarios.

AI Automation · Frontend Testing · NL2Test
0 likes · 17 min read
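The NL2Test summary above combines structured prompts with natural‑language test intent and then executes the model's reply. A small sketch of what that contract could look like, assuming the model is asked to return JSON steps; the template, field names, and validation are illustrative, not from the article:

```python
import json

# Hypothetical NL2Test-style prompt assembly and reply validation.
PROMPT_TEMPLATE = """You are a frontend test generator.
Screenshot: {screenshot}
DOM snippet: {dom}
Test intent: {intent}
Reply with JSON: [{{"action": ..., "selector": ..., "expect": ...}}]"""

def build_prompt(intent: str, dom: str, screenshot: str) -> str:
    """Pack intent plus UI context into one structured prompt."""
    return PROMPT_TEMPLATE.format(intent=intent, dom=dom, screenshot=screenshot)

def parse_steps(llm_reply: str) -> list[dict]:
    """Validate the model's JSON reply before handing it to a test runner,
    so a malformed generation fails fast instead of producing a flaky test."""
    steps = json.loads(llm_reply)
    for step in steps:
        if not {"action", "selector", "expect"} <= step.keys():
            raise ValueError(f"incomplete step: {step}")
    return steps

# Mocked model reply, standing in for a real MM-LLM call:
reply = '[{"action": "click", "selector": "#login", "expect": "url contains /home"}]'
print(parse_steps(reply)[0]["action"])
```

Self‑healing, as described in the summary, would slot in at the validation step: when a selector no longer matches, the current DOM and screenshot are fed back through `build_prompt` to regenerate the step.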
JD Tech
Mar 26, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models (CAIG)

The JD advertising team proposes a CTR‑driven advertising image generation framework (CAIG) that leverages multimodal large language models, a novel reward model, and product‑centric preference optimization to produce ad images with superior click‑through performance, validated by extensive offline and online experiments.

CTR Optimization · Reward Model · Advertising Image Generation
0 likes · 10 min read
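The CAIG summary above trains a reward model on CTR‑labeled image pairs. The standard pairwise (Bradley‑Terry) objective for such a model can be written in a few lines; this is a generic sketch of that loss, not code from the paper:

```python
import math

def pairwise_loss(score_win: float, score_lose: float) -> float:
    """-log sigmoid(r_win - r_lose): pushes the reward model to score
    the higher-CTR image of a pair above the lower-CTR one."""
    margin = score_win - score_lose
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A correctly ordered pair incurs low loss, a mis-ordered pair high loss:
print(round(pairwise_loss(2.0, 0.5), 4))  # -> 0.2014
print(round(pairwise_loss(0.5, 2.0), 4))  # -> 1.7014
```

The preference‑optimization stage the summary mentions then fine‑tunes the generator against this reward, so the generation model is steered toward images the reward model predicts will earn more clicks.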
AntTech
Mar 14, 2025 · Artificial Intelligence

MP-GUI: Modality Perception with Multimodal Large Language Models for GUI Understanding

The CVPR 2025 paper "MP-GUI: Modality Perception with MLLMs for GUI Understanding" presents a novel algorithm that enhances multimodal large language models' ability to perceive and reason about graphical user interfaces by integrating text, visual, and spatial signals through specialized perception modules and a dynamic fusion gate, achieving state‑of‑the‑art performance on multiple GUI benchmarks.

CVPR 2025 · GUI Understanding · MLLM
0 likes · 5 min read
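The dynamic fusion gate mentioned in the MP-GUI summary weighs the text, visual, and spatial perception signals per input. A toy sketch of that gating idea, with fixed gate logits and tiny feature vectors standing in for learned components (all shapes and values illustrative):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    """Numerically stable softmax over gate logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse(text_feat, vis_feat, spa_feat, gate_logits):
    """The gate produces one weight per modality; the fused feature is
    their convex combination, so the model can lean on whichever signal
    is most informative for the current GUI element."""
    w = softmax(gate_logits)
    return [w[0] * t + w[1] * v + w[2] * s
            for t, v, s in zip(text_feat, vis_feat, spa_feat)]

fused = fuse([1.0, 0.0], [0.0, 1.0], [0.5, 0.5], gate_logits=[2.0, 1.0, 0.0])
print([round(x, 3) for x in fused])
```

In the paper the gate logits would themselves be predicted from the input rather than fixed, which is what makes the fusion "dynamic".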
JD Retail Technology
Mar 14, 2025 · Artificial Intelligence

CTR-Driven Advertising Image Generation Using Multimodal Large Language Models

The paper presents CAIG, a CTR‑driven advertising image generation pipeline that pre‑trains a multimodal LLM on e‑commerce data, trains a reward model on CTR‑labeled image pairs, and fine‑tunes generation via product‑centric preference optimization, achieving state‑of‑the‑art online and offline performance.

AI · Ad Image Generation · CTR
0 likes · 11 min read
Xiaohongshu Tech REDtech
Jan 2, 2025 · Artificial Intelligence

Xiaohongshu's Self-developed RLHF System for Multimodal Large Language Models: Design, Optimization, and Performance

Xiaohongshu’s team unveiled a self‑developed RLHF system that trains multimodal large language models using heterogeneous and homogeneous network architectures, extensive PPO optimizations, and Medusa speculative sampling, achieving over 50% throughput gains, reduced hardware needs, and 5‑20% performance improvements on zero‑shot benchmarks.

Medusa · PPO · PRM
0 likes · 21 min read
DataFunSummit
Nov 1, 2024 · Artificial Intelligence

Progress in Multimodal Large Language Models: Background, Architecture, Evolution, Team Work, and Future Outlook

This article reviews recent advances in multimodal large language models, covering their background, architectural components, training strategies, application scenarios, evaluation benchmarks, team research on hallucination mitigation and long‑video understanding, and outlines promising future research directions.

Large Language Models · Model Architecture · Vision-Language
0 likes · 15 min read
360 Tech Engineering
Jun 25, 2023 · Artificial Intelligence

Visual Capability as a Fundamental Requirement for AGI and the SEEChat Multimodal Dialogue Model

The article reviews why visual ability is essential for artificial general intelligence, compares native multimodal training with expert‑stitching integration approaches, and details the architectures of models such as KOSMOS‑1, PaLM‑E, Flamingo, BLIP‑2, LLaVA, and MiniGPT‑4. It then introduces the SEEChat project, which fuses a CLIP vision encoder with ChatGLM‑6B via a projection layer, and presents its training pipeline, experimental results, and future directions.

AGI · Model Fusion · SEEChat
0 likes · 13 min read
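The projection layer the SEEChat summary describes is, at its core, a single linear map that carries CLIP visual features into the LLM's token‑embedding space. A minimal sketch of that idea with tiny made‑up dimensions and weights (nothing here is taken from SEEChat's actual checkpoint):

```python
# Sketch of a vision-to-LLM projection layer: one linear map from the
# CLIP feature space (here 3-dim) to the LLM embedding space (here 2-dim).

def linear(x: list[float], W: list[list[float]], b: list[float]) -> list[float]:
    """y = Wx + b, written out explicitly for a row-per-output-dim W."""
    return [sum(xi * wij for xi, wij in zip(x, row)) + bj
            for row, bj in zip(W, b)]

clip_feat = [0.2, -0.1, 0.4]          # stand-in for a CLIP image feature
W = [[1.0, 0.0, 0.5],                  # illustrative 2x3 projection weights
     [0.0, 1.0, -0.5]]
b = [0.1, 0.0]

llm_token = linear(clip_feat, W, b)    # now lives in the LLM embedding space
print(llm_token)
```

Freezing both the vision encoder and the LLM and training only such a projection is what keeps this "expert‑stitching" style of integration cheap compared with native multimodal pretraining.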