7 Kuaishou AI Papers Accepted at ACL 2025: Video Understanding & Safe LLM Decoding
Kuaishou’s foundational large-model team has secured seven papers at ACL 2025, spanning alignment bias in training, safety defenses during inference, decoding strategies, fine-grained video-temporal understanding, reward fairness in RLHF, multimodal captioning benchmarks, and methods to curb hallucinations in vision-language models.
The 63rd Annual Meeting of the Association for Computational Linguistics (ACL 2025) will take place from July 27 to August 1 in Vienna. The conference has announced its acceptance list, and seven papers from Kuaishou's foundational large-model team have been accepted.
The accepted works cover frontier topics in large models, including alignment bias during training, inference-time safety protection, decoding strategies and reliability, fine-grained video-temporal understanding, and evaluation benchmarks.
Paper 01: TUNA – Comprehensive Fine‑grained Temporal Understanding Evaluation on Dense Dynamic Videos
Type: ACL 25 Main
Link: https://friedrichor.github.io/projects/TUNA/
Abstract: Existing video-understanding benchmarks treat temporal elements such as shots, scenes, actions, and attributes separately or cover only a few of them, overlooking overall video coherence. TUNA introduces a temporal-focused benchmark for dense dynamic videos with two complementary tasks, video description and question answering, featuring diverse scenes, dynamic attributes, and interpretable, robust evaluation metrics. Evaluating leading models on TUNA reveals challenges such as limited action description, insufficient multi-entity understanding, and insensitivity to camera motion.
Paper 02: Root Defense Strategies – Ensuring Safety of LLM at the Decoding Level
Type: ACL 25 Main
Link: https://arxiv.org/pdf/2410.06809
Abstract: As large language models (LLMs) advance, the risk of harmful outputs triggered by erroneous or malicious prompts grows. Existing jailbreak defenses operate only at the prefill stage and do not exploit decoding-stage information, which limits their effectiveness and robustness and often sacrifices usefulness. This work examines and quantifies LLMs' ability to assess token-level danger, and proposes a decoding-oriented, step-wise defense that corrects harmful queries rather than rejecting them outright, using speculative decoding to preserve usability. Experiments show improved safety without slowing inference.
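For intuition, here is a minimal Python sketch of the general idea of decoding-stage safety checking: score the partial output at every step and steer away from risky tokens instead of refusing the whole query. The helpers `sample_next`, `harm_score`, and the banned-token re-sampling are hypothetical stand-ins, not the paper's actual components (which also rely on speculative decoding).

```python
# Minimal sketch of step-wise, decoding-stage safety checking.
# NOT the paper's algorithm: sample_next(), harm_score(), and the
# "ban and re-sample" correction are hypothetical placeholders.

def safe_decode(model, prompt_ids, max_new_tokens=256, threshold=0.5):
    """Generate token by token, scoring the partial output for harm at each step."""
    output_ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model.sample_next(output_ids)          # hypothetical: one decoding step
        risk = model.harm_score(output_ids + [next_id])  # hypothetical: per-step danger estimate
        if risk > threshold:
            # Steer toward a safer continuation instead of refusing outright,
            # here simplified to masking the risky token and re-sampling.
            next_id = model.sample_next(output_ids, ban=[next_id])
        output_ids.append(next_id)
        if next_id == model.eos_token_id:
            break
    return output_ids
```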
Paper 03: Towards Reward Fairness in RLHF – From a Resource Allocation Perspective
Type: ACL 25 Main
Link: https://arxiv.org/pdf/2505.23349
Abstract: Reward functions serve as proxies for human preferences in Reinforcement Learning from Human Feedback (RLHF), but imperfect rewards can introduce biases such as length preference that harm alignment. This paper treats reward as a resource to be allocated, balancing utility and fairness. Two fairness mechanisms, a regularization term and a coefficient, are introduced for the validation and RL stages, yielding fairer reward and policy models. Experiments demonstrate more equitable alignment of LLMs with human preferences.
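As a rough illustration of how a fairness-style term could enter reward-model training, the sketch below adds a length-debiasing penalty to a standard pairwise loss. The penalty form and the weight `lam` are assumptions made for illustration; the paper's actual regularization and coefficient mechanisms are not spelled out in this abstract and are not reproduced here.

```python
import torch
import torch.nn.functional as F

def reward_loss_with_length_fairness(r_chosen, r_rejected, len_chosen, len_rejected, lam=0.1):
    """Pairwise reward-model loss plus an illustrative length-fairness penalty.

    NOT the paper's method; this only shows one concrete way a fairness-style
    regularizer could discourage length bias in rewards.
    """
    # Standard Bradley-Terry preference loss on the reward margin
    margin = r_chosen - r_rejected
    pref_loss = -F.logsigmoid(margin).mean()

    # Penalize the covariance between reward margin and length margin,
    # so a response cannot earn extra reward simply by being longer.
    length_margin = (len_chosen - len_rejected).float()
    cov = ((margin - margin.mean()) * (length_margin - length_margin.mean())).mean()

    return pref_loss + lam * cov.abs()
```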
Paper 04: HAIC – Improving Human Action Understanding and Generation with Better Captions for Multimodal Large Language Models
Type: ACL 25 Main
Link: https://arxiv.org/abs/2502.20811
Abstract: Multimodal large language models have progressed in video understanding, yet lack high-quality data for human-action videos, limiting performance. The authors propose a two-stage annotation pipeline to collect videos with clear human actions and annotate them with standardized, attribute-rich, temporally ordered descriptions. The resulting HAICTrain (126K video-caption pairs) and HAICBench (500 manually annotated pairs plus 1,400 QA pairs) enable comprehensive evaluation. Training on HAICTrain significantly improves action understanding and video-to-text generation quality.
Paper 05: GODBench – A Benchmark for Multimodal Large Language Models in Video Comment Art
Type: ACL 25 Main
Link: https://stan-lei.github.io/KwaiMM-Dialogue/paper3-godbench.html
Abstract: Video comment art enriches user engagement through humor, satire, or emotional resonance, demanding deep cultural and contextual understanding. While multimodal LLMs excel at STEM tasks, they struggle with creative video comment generation, and existing benchmarks lack modality diversity and coverage. GODBench is a multimodal benchmark for evaluating MLLMs' ability to generate artistic video comments; the paper also proposes a Ripple of Thought (RoT) multi-step reasoning framework that markedly enhances creative generation, sketched generically below.
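The abstract does not detail RoT, but a generic staged-prompting loop conveys the flavor of multi-step creative reasoning. The stages and the `call_mllm` wrapper below are illustrative assumptions, not the paper's definition.

```python
# Generic staged-prompting loop in the spirit of multi-step reasoning
# frameworks. The stage prompts and call_mllm() wrapper are hypothetical,
# not the Ripple of Thought procedure itself.

STAGES = [
    "Describe the key visual events and mood of this video.",
    "List cultural references, wordplay, or emotional angles the scene suggests.",
    "Draft three candidate comments that use those angles.",
    "Pick the wittiest candidate and refine it into a single short comment.",
]

def staged_comment(call_mllm, video, context=""):
    """Run the model once per stage, feeding each stage's output into the next."""
    for prompt in STAGES:
        context = call_mllm(video=video, prompt=f"{context}\n\n{prompt}".strip())
    return context  # final refined comment
```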
Paper 06: Mixture of Decoding – An Attention‑Inspired Adaptive Decoding Strategy to Mitigate Hallucinations in Large Vision‑Language Models
Type: ACL 25 Findings
Link: https://arxiv.org/pdf/25
Abstract: Large vision-language models (LVLMs) achieve impressive results but still suffer from hallucinations. The proposed Mixture of Decoding (MoD) adapts decoding based on whether the model's attention is correct: when the tokens the model focuses on align with the image content, a complementary strategy amplifies the key information; when they are misaligned, a contrastive strategy suppresses the misleading cues. Experiments show MoD outperforms existing decoding methods across major benchmarks, effectively reducing hallucinations.
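A hedged sketch of what such an adaptive mixing step could look like: two logit distributions are combined additively or contrastively depending on an attention-alignment flag. Here `logits_focus` and `attention_aligned` are assumed inputs standing in for the paper's attention-derived signals, which are not specified in this abstract.

```python
import torch

def mod_style_step(logits_full, logits_focus, attention_aligned, alpha=0.5):
    """One decoding step illustrating adaptive complementary/contrastive mixing.

    logits_full       : logits from the normal forward pass.
    logits_focus      : hypothetical logits from a pass restricted to the image
                        regions the model attends to (assumed stand-in).
    attention_aligned : hypothetical boolean flag, True when the attended
                        regions match the question-relevant image content.
    """
    if attention_aligned:
        # Complementary strategy: reinforce the image-grounded evidence
        mixed = logits_full + alpha * logits_focus
    else:
        # Contrastive strategy: subtract it to suppress misleading cues
        mixed = logits_full - alpha * logits_focus
    return torch.log_softmax(mixed, dim=-1)
```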
Paper 07: VidCapBench – A Comprehensive Benchmark of Video Captioning for Controllable Text‑to‑Video Generation
Type: ACL 25 Findings
Link: https://arxiv.org/pdf/2502.12782
Abstract: Controllable text-to-video (T2V) generation relies on high-quality video-caption pairs, yet current evaluations treat caption quality separately from T2V generation. VidCapBench provides a caption-format-agnostic evaluation framework that annotates videos with aesthetics, content, motion, and physical-law attributes, split into automatically and manually assessable subsets. Extensive evaluation of state-of-the-art captioning models demonstrates VidCapBench's stability and comprehensiveness, and its scores correlate strongly with T2V quality metrics, offering valuable guidance for training T2V models.
These papers collectively showcase Kuaishou’s advances in AI research, spanning multimodal understanding, safety, fairness, and evaluation benchmarks.
Kuaishou Tech
Official Kuaishou tech account, providing real-time updates on the latest Kuaishou technology practices.