ICML 2026 Highlights: Five Taotian Group Papers Pushing Multimodal AI Boundaries

The article showcases five ICML 2026 papers from the Taotian Group that tackle core multimodal AI challenges—interactive video try‑on, high‑resolution vision, e‑commerce video reasoning, sparse‑reward reinforcement learning, and curriculum learning for large language models—detailing their problem statements, novel solutions, and strong experimental results.

Alimama Tech

Jun 4, 2026

ICML 2026 received 23,918 submissions and accepted 6,352 papers (a 26.6% acceptance rate). Five papers from the Taotian Group address multimodal AI challenges in interactive video try‑on, performance degradation on high‑resolution images, e‑commerce short‑video reasoning, sparse‑reward reinforcement learning, and reward hacking in multimodal large language models.

iTryOn: Interactive Video Virtual Try‑On Framework

Pain point: Existing video try‑on methods generate non‑interactive clips and cannot model actions such as pulling zippers or adjusting sleeves, which are essential for e‑commerce live streams.

Solution: The iTryOn framework treats video try‑on as a conditional generation task guided by clothing images and action semantics. It introduces three key contributions: (1) a 3D hand prior that provides precise hand pose and spatial location; (2) time‑stamped action class labels combined with an Action‑aware Rotational Positional Encoding (A‑RoPE) for accurate temporal alignment; (3) an Action‑aware Constraint (AC) loss that emphasizes supervision on interaction frames, preventing sparse interaction signals from being drowned by non‑interactive frames.
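
As a rough illustration of the third contribution, the sketch below up-weights per-frame losses on interaction frames so that a few of them are not averaged away by the many non-interactive frames. The function name `action_weighted_loss`, the `interaction_weight` value, and the weighted-mean normalization are assumptions made for illustration; the summary does not specify the paper's actual AC loss.

```python
from typing import Sequence


def action_weighted_loss(
    frame_losses: Sequence[float],
    is_interaction: Sequence[bool],
    interaction_weight: float = 4.0,  # hypothetical value, not from the paper
) -> float:
    """Weighted mean of per-frame losses that emphasizes interaction frames."""
    if len(frame_losses) != len(is_interaction):
        raise ValueError("frame_losses and is_interaction must have equal length")
    if not frame_losses:
        raise ValueError("at least one frame is required")
    weights = [interaction_weight if flag else 1.0 for flag in is_interaction]
    weighted_sum = sum(w * loss for w, loss in zip(weights, frame_losses))
    return weighted_sum / sum(weights)


# Eight frames, only the last of which shows a hand-garment interaction.
losses = [0.1] * 7 + [0.9]
flags = [False] * 7 + [True]

plain_mean = sum(losses) / len(losses)          # interaction frame counts 1/8
weighted = action_weighted_loss(losses, flags)  # interaction frame counts 4/11
```

With uniform averaging, the single hard interaction frame contributes one eighth of the loss; with the assumed weight of 4 it contributes four elevenths, so its error has a larger effect on the gradient.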

Results: iTryOn outperforms prior methods on both interactive and traditional video try‑on tasks, generating more natural and coherent human‑clothing interaction videos.

HiDe: Hierarchical Decoupling of the Zoom‑In Operation for High‑Resolution MLLMs

Pain point: Multimodal large language models (MLLMs) perform poorly on high‑resolution images. The prevailing “zoom‑in” strategy is assumed to address perception limits, yet the true cause of the degradation is unclear, and existing visual‑search pipelines are inefficient and prone to memory overflow.

Solution: HiDe performs a hierarchical decoupling of the zoom‑in operation into four sub‑operations: zoom‑and‑crop, foreground‑background separation, semantic‑vs‑non‑semantic token split, and appearance‑vs‑spatial‑layout split. Experiments show that pixel‑level zoom yields negligible benefit; performance loss stems from complex background semantics and token redundancy. HiDe introduces (1) Token‑level Attention Decoupling (TAD) to separate foreground semantics from noisy background; (2) Layout‑Preserving Decoupled Reconstruction (LPD) to extract compact, high‑information regions while preserving spatial relations; (3) a single‑tensor offloading optimization that reduces peak memory from 96 GB to 20 GB.
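
To make the idea of separating foreground tokens while preserving layout more concrete, here is a toy sketch. It thresholds a patch-level attention map to get a foreground mask, then drops rows and columns that contain no foreground patch, which shrinks the grid but keeps the relative left/right and above/below order of what remains. The thresholding rule, the function names, and the row/column dropping are all assumptions for illustration, not the TAD or LPD procedures from the paper.

```python
from typing import List, Tuple

Grid = List[List[float]]
Mask = List[List[bool]]


def foreground_mask(attention: Grid, threshold: float) -> Mask:
    """Mark patches whose attention score reaches the threshold."""
    return [[score >= threshold for score in row] for row in attention]


def compact_layout(mask: Mask) -> Tuple[List[int], List[int]]:
    """Return indices of rows and columns that contain a foreground patch.

    Dropping the other rows and columns shrinks the grid while keeping
    the relative spatial order of the surviving patches.
    """
    if not mask:
        return [], []
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows, cols


# Toy 4x5 attention map over image patches (values are made up).
attention = [
    [0.01, 0.02, 0.01, 0.01, 0.02],
    [0.02, 0.30, 0.01, 0.01, 0.25],
    [0.01, 0.01, 0.01, 0.01, 0.01],
    [0.01, 0.28, 0.01, 0.01, 0.02],
]
mask = foreground_mask(attention, threshold=0.2)
rows, cols = compact_layout(mask)
kept = [[attention[r][c] for c in cols] for r in rows]
```

In this toy case the 20-patch grid collapses to a 2x2 grid of 4 patches, and the patch that was to the right of another is still to its right after compaction.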

Results: HiDe achieves state‑of‑the‑art performance on V*Bench and HRBench, surpassing existing training‑free methods and even RL‑fine‑tuned baselines, while cutting memory usage by roughly 75% and halving inference latency.

E‑VAds: Benchmark for E‑commerce Short‑Video Understanding

Pain point: Existing multimodal video benchmarks focus on generic scenes and ignore the extreme information density and commercial intent of e‑commerce short videos, leading to poor MLLM performance in real business scenarios.

Solution: The authors propose a three‑stage framework: Quantitative Evaluation, Benchmark Construction, and Reinforcement Alignment. They define three density metrics (visual dynamics, audio speech rate, and text coverage) to quantify video complexity, build a dataset of 3,961 short videos with 19,785 high‑quality Q&A pairs, and develop the E‑VAds‑R1 model, trained with Multi‑Granular Group Relative Policy Optimization (MG‑GRPO) to address sparse reward signals.
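
GRPO-style training scores each sampled answer relative to the other answers drawn for the same prompt. The sketch below shows that group normalization and, as one guess at what a multi-granular reward could look like, blends an all-or-nothing answer reward with a finer partial-credit score. The `blended_rewards` helper, its `fine_weight`, and the partial-credit values are hypothetical and not taken from the paper.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-6
) -> List[float]:
    """Normalize rewards within one group of samples for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def blended_rewards(
    coarse: Sequence[float], fine: Sequence[float], fine_weight: float = 0.5
) -> List[float]:
    """Hypothetical blend of a coarse reward with a finer-grained one."""
    return [c + fine_weight * f for c, f in zip(coarse, fine)]


# Four sampled answers to one question; none is fully correct.
coarse = [0.0, 0.0, 0.0, 0.0]  # exact-match reward is zero for every sample
fine = [0.2, 0.8, 0.4, 0.6]    # made-up partial-credit scores

sparse_adv = group_relative_advantages(coarse)
dense_adv = group_relative_advantages(blended_rewards(coarse, fine))
```

When every sample gets the same coarse reward, all advantages are zero and the group contributes no gradient. Adding a finer-grained term restores a ranking among the samples, which is one way a sparse reward can be made informative.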

Results: With only a few hundred training examples, E‑VAds‑R1 improves performance by 109.2% over strong baselines, establishing a new state‑of‑the‑art for e‑commerce video reasoning.

TP‑GRPO: Modeling Step‑Wise and Long‑Term Interactions in Flow‑Based GRPO

Pain point: Flow‑based generative RL for text‑to‑image assigns a single terminal reward to all denoising steps, resulting in (1) overly sparse reward signals that ignore contributions of individual steps, and (2) neglect of intra‑trajectory dependencies, especially early actions that affect later generation quality.

Solution: TurningPoint‑GRPO (TP‑GRPO) replaces the terminal reward with step‑wise incremental rewards by comparing reward changes before and after each denoising step. It also introduces a “turning point” modeling mechanism that identifies steps capable of reversing local reward trends and assigns aggregated long‑term rewards to them, thereby capturing long‑range effects within the diffusion process.
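
The two mechanisms can be sketched on a single trajectory, assuming a reward can be evaluated at each intermediate state. Step-wise rewards are the differences between consecutive state rewards. A turning point is taken here to be a step whose reward change has the opposite sign to the previous step's, and it is credited with the total gain from that step to the end. Both the sign-flip test and the aggregation rule are assumptions for illustration; the paper's exact definitions may differ.

```python
from typing import List, Sequence


def incremental_rewards(state_rewards: Sequence[float]) -> List[float]:
    """deltas[t] = reward after step t minus reward before step t."""
    return [after - before
            for before, after in zip(state_rewards, state_rewards[1:])]


def turning_points(deltas: Sequence[float]) -> List[int]:
    """Steps whose reward change flips sign relative to the previous step."""
    return [t for t in range(1, len(deltas)) if deltas[t] * deltas[t - 1] < 0]


def step_rewards(state_rewards: Sequence[float]) -> List[float]:
    """Incremental rewards, with turning points given a long-term reward."""
    deltas = incremental_rewards(state_rewards)
    final = state_rewards[-1]
    rewards = list(deltas)
    for t in turning_points(deltas):
        # Assumed aggregation: everything gained from this step to the end.
        rewards[t] = final - state_rewards[t]
    return rewards


# Reward of the (decoded) sample before any step and after each of 5 steps.
state_rewards = [0.5, 0.4, 0.3, 0.5, 0.6, 0.7]
deltas = incremental_rewards(state_rewards)
turns = turning_points(deltas)
rewards = step_rewards(state_rewards)
```

Here the reward falls for two steps and then recovers. The third step (index 2) is the turning point, so instead of its local gain of 0.2 it receives 0.4, the full improvement from that point to the final sample.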

Results: TP‑GRPO consistently outperforms baselines on compositional generation, text rendering, and human‑preference alignment tasks, and shows superior performance when deployed on SD3.5 and FLUX.1‑dev diffusion models.

RuCL: Stratified Rubric‑Based Curriculum Learning for Multimodal LLM Reasoning

Pain point: MLLMs trained with outcome‑only reinforcement learning are prone to reward hacking, producing hallucinated intermediate reasoning steps. Existing rubric‑based supervision is sample‑wise, computationally expensive, and treats all rubrics equally, causing gradient noise before basic perception skills are learned.

Solution: RuCL shifts curriculum focus from data filtering to reward‑weight design. It offline constructs a universal set of multimodal reasoning rubrics, stratifies them into “basic perception” and “advanced reasoning” layers based on model pass rates, and dynamically adjusts their reward weights (λₜ) during training via a performance‑triggered scheduler, enabling a smooth transition from visual grounding to logical deduction.
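
A minimal sketch of the two pieces described above, under stated assumptions: rubrics are split by a pass-rate cutoff (high pass rate is treated as basic perception), and a scheduler raises the weight on advanced-reasoning rubrics only once the basic pass rate crosses a trigger. The rubric names, the cutoff, the trigger, the step size, and the convex combination used for the reward are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


def stratify(
    pass_rates: Dict[str, float], cutoff: float = 0.5
) -> Tuple[List[str], List[str]]:
    """Split rubrics into (basic, advanced) by the model's pass rate."""
    basic = sorted(name for name, p in pass_rates.items() if p >= cutoff)
    advanced = sorted(name for name, p in pass_rates.items() if p < cutoff)
    return basic, advanced


@dataclass
class PerformanceTriggeredScheduler:
    lam: float = 0.0      # current weight on advanced-reasoning rubrics
    trigger: float = 0.8  # basic pass rate that unlocks an increase
    step: float = 0.25    # size of each increase

    def update(self, basic_pass_rate: float) -> float:
        if basic_pass_rate >= self.trigger:
            self.lam = min(1.0, self.lam + self.step)
        return self.lam


def rubric_reward(basic_score: float, advanced_score: float, lam: float) -> float:
    """Convex combination of the two rubric layers."""
    return (1.0 - lam) * basic_score + lam * advanced_score


pass_rates = {
    "reads_chart_axes": 0.9,
    "identifies_objects": 0.7,
    "multi_step_deduction": 0.2,
    "verifies_answer": 0.3,
}
basic, advanced = stratify(pass_rates)

scheduler = PerformanceTriggeredScheduler()
lams = [scheduler.update(rate) for rate in [0.5, 0.85, 0.9]]
```

The weight stays at zero while the model is still failing basic perception checks and then moves toward the advanced rubrics in steps, so early rewards are not dominated by criteria the model cannot yet satisfy.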

Results: Across seven mainstream benchmarks (including math and visual reasoning), RuCL raises the average accuracy of Qwen2.5‑VL‑7B by 7.83% to a state‑of‑the‑art 60.06%. Ablation studies show that progressive reward weighting reduces early gradient variance and significantly curtails logical cheating and hallucination.

These five works collectively advance multimodal AI toward greater physical interaction awareness, higher commercial value, and improved reliability and controllability.

Paper links: https://arxiv.org/abs/2605.21431, https://arxiv.org/abs/2510.00054, https://arxiv.org/abs/2602.08355, https://arxiv.org/abs/2602.06422, https://arxiv.org/abs/2602.21628
