How Keye‑VL‑1.5 Redefines Video Understanding with Slow‑Fast Encoding

Keye‑VL‑1.5, an 8‑billion‑parameter multimodal large language model, introduces a Slow‑Fast video encoding strategy, a four‑stage progressive pre‑training pipeline with 128K context, and a sophisticated post‑training regime that together achieve state‑of‑the‑art performance on video and vision‑language benchmarks while maintaining strong general capabilities.

Data Party THU

Sep 26, 2025

Core Innovation 1: Slow‑Fast Video Encoding Strategy

The model tackles the spatio‑temporal trade‑off by classifying each frame as either Slow (a key frame encoded at high spatial resolution) or Fast (a transition frame encoded at low resolution to extend temporal coverage). The first frame is always Slow; each subsequent frame is compared with the most recent Slow frame using an image‑patch similarity function. If the similarity exceeds 95%, the frame is marked Fast; otherwise it becomes a new Slow frame. Tokens are allocated proportionally: each Fast frame receives 30% of the token budget of a Slow frame, and a binary search determines the token count per Slow frame so that the total stays within the visual token budget of 75K.
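
The frame classification and budget search described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `max_slow_tokens` cap, and the caller-supplied `similarity` function are hypothetical, and the 30% ratio is applied with integer arithmetic for simplicity.

```python
from typing import Callable, List, Sequence


def classify_frames(
    frames: Sequence[object],
    similarity: Callable[[object, object], float],
    threshold: float = 0.95,
) -> List[str]:
    """Label each frame 'slow' or 'fast' by comparing it with the latest Slow frame."""
    labels: List[str] = []
    last_slow = None
    for frame in frames:
        # The first frame is always Slow; later frames are Fast only if they
        # are more than `threshold` similar to the most recent Slow frame.
        if last_slow is not None and similarity(frame, last_slow) > threshold:
            labels.append("fast")
        else:
            labels.append("slow")
            last_slow = frame
    return labels


def slow_tokens_per_frame(
    labels: Sequence[str],
    total_budget: int = 75_000,
    fast_percent: int = 30,          # a Fast frame gets 30% of a Slow frame's tokens
    max_slow_tokens: int = 20_480,   # hypothetical per-frame cap, not from the source
) -> int:
    """Binary-search the largest per-Slow-frame token count that fits the budget."""
    n_slow = sum(1 for label in labels if label == "slow")
    n_fast = len(labels) - n_slow

    def cost(slow_tokens: int) -> int:
        fast_tokens = slow_tokens * fast_percent // 100
        return n_slow * slow_tokens + n_fast * fast_tokens

    lo, hi = 0, max_slow_tokens
    while lo < hi:
        mid = (lo + hi + 1) // 2     # round up so the loop always makes progress
        if cost(mid) <= total_budget:
            lo = mid
        else:
            hi = mid - 1
    return lo
```

The binary search works because the total cost never decreases as the per‑Slow‑frame token count grows, so the largest feasible value is well defined.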

Core Innovation 2: Progressive Four‑Stage Pre‑Training & Ultra‑Long Context

Stage 1 – Cross‑Modal Alignment

Freeze the vision transformer (ViT) and the language decoder (LLM), and train only the MLP projector on massive image‑text pairs so that visual features are mapped into the language model's semantic space.

Stage 2 – Multi‑Task Pre‑Training

Unfreeze all parameters and train end‑to‑end on tasks such as image captioning, OCR, VQA, grounding, and mixed image‑text data, using an 8K‑token context window.

Stage 3 – Annealing with High‑Quality Data

Refine the model on curated high‑quality data while expanding the context length from 8K to 128K tokens. This involves resetting the RoPE inverse frequency to 8,000,000, introducing context parallelism and pipeline parallelism, and adjusting the data mix to 24% video, 50% image, and 26% text.
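
To see why a larger RoPE setting helps with long contexts, the sketch below computes rotary inverse frequencies for two base values. It assumes that the 8,000,000 figure above refers to the RoPE base (often written θ); the head dimension of 128 and the comparison base of 1,000,000 are illustrative values, not taken from the article.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies: base^(-2i/d) for i = 0 .. d/2 - 1."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Wavelength (in token positions) of the slowest-rotating dimension pair."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


if __name__ == "__main__":
    for base in (1_000_000, 8_000_000):   # illustrative comparison, see lead-in
        print(base, round(longest_wavelength(128, base)))
```

A larger base stretches the slowest rotations over more positions, which is the usual motivation for raising it when extending the context window.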

Stage 4 – Model Fusion

Combine expert models trained on specialized data (e.g., OCR) with the base model to improve robustness and reduce bias.
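
The article does not say how the expert and base models are combined. One common approach is linear interpolation of parameters, sketched below purely as an assumption; the function name, the dictionary-of-lists parameter format, and the mixing coefficients are all hypothetical.

```python
from typing import Dict, List, Sequence

Params = Dict[str, List[float]]


def fuse_weights(
    base: Params,
    experts: Sequence[Params],
    expert_coeffs: Sequence[float],
) -> Params:
    """Linearly interpolate parameters; the base model gets the remaining weight."""
    if len(experts) != len(expert_coeffs):
        raise ValueError("one coefficient per expert is required")
    base_coeff = 1.0 - sum(expert_coeffs)
    fused: Params = {}
    for name, base_values in base.items():
        merged = [base_coeff * v for v in base_values]
        for expert, coeff in zip(experts, expert_coeffs):
            # Every expert is assumed to share the base model's architecture.
            for i, v in enumerate(expert[name]):
                merged[i] += coeff * v
        fused[name] = merged
    return fused
```

For example, `fuse_weights(base, [ocr_expert], [0.5])` would average the base model with a single hypothetical OCR expert in equal parts.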

Core Innovation 3: Post‑Training for Reasoning & Preference Alignment

SFT (Supervised Fine‑Tuning): Over 7.5 M high‑quality multimodal QA samples covering 70 K task types are used to fine‑tune the model.

MPO (Mixture of Preference Optimization): After SFT, MPO trains the model to distinguish good from bad responses using generated preference pairs.

Keye Reward Model: Evaluates response quality for RL, supporting two modes: no_think (direct judgment) and think (nine‑dimensional reasoning before judgment).

LongCoT Cold‑Start: A five‑step automated pipeline creates high‑quality chain‑of‑thought data. The steps are multi‑source collection, multi‑path reasoning generation with confidence scoring, two‑level quality assessment, human‑in‑the‑loop enhancement, and dynamic scoring for repeated training.

Iterative General RL (GSPO): Uses Group Sequence Policy Optimization with importance weighting to improve the model iteratively.

Progressive Hint Sampling: A five‑level hint system (L1–L5) provides increasingly detailed assistance for hard samples, so that each sample is solved with the minimum necessary information.

Alignment RL: Enforces instruction following, format adherence, and human‑preference alignment through rule‑based, generative, and model‑based rewards.
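
The progressive hint sampling idea listed above can be illustrated with a short sketch: hints are tried from the least to the most detailed level, and sampling stops at the first level that yields a correct response. The function signature, the `attempt` and `is_correct` callbacks, and the number of tries per level are assumptions for illustration, not details from the article.

```python
from typing import Callable, Optional, Sequence, Tuple


def sample_with_minimal_hint(
    question: str,
    hints: Sequence[str],                   # ordered L1 .. L5, least to most detailed
    attempt: Callable[[str, str], str],     # hypothetical: (question, hint) -> response
    is_correct: Callable[[str], bool],      # hypothetical verifier for the response
    tries_per_level: int = 4,
) -> Optional[Tuple[int, str]]:
    """Return (hint_level, response) for the lowest level that produces a correct answer."""
    for level, hint in enumerate(hints, start=1):
        for _ in range(tries_per_level):
            response = attempt(question, hint)
            if is_correct(response):
                # Stop at the first success so the sample carries
                # only the minimum necessary information.
                return level, response
    return None  # still unsolved at the most detailed hint level
```

Returning the level alongside the response lets a training loop record how much help each hard sample needed.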

Training Infrastructure Optimizations

Heterogeneous Hybrid Parallelism: Data parallelism for the ViT, and a mix of pipeline, tensor, and data parallelism for the LLM, to handle their differing compute characteristics.

Dynamic Load Balancing: Estimate the compute cost of each sample and greedily assign samples to GPUs to equalize step time.

Flexible Scalable Dataloader: Deep‑aware data loading, an I/O server that offloads video decoding, and sample‑level checkpoint recovery.
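
The dynamic load balancing item above can be sketched as a greedy assignment: samples are sorted by estimated cost, and each one goes to the GPU with the smallest load so far. This is one plausible reading of "greedily assign", not a confirmed description of the authors' scheduler; the function name and the integer cost estimates are hypothetical.

```python
import heapq
from typing import List, Sequence


def balance_samples(costs: Sequence[int], num_gpus: int) -> List[List[int]]:
    """Greedily assign sample indices to GPUs, heaviest sample first."""
    # Min-heap of (current_load, gpu_id): the least-loaded GPU is popped first.
    heap = [(0, gpu) for gpu in range(num_gpus)]
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(num_gpus)]

    order = sorted(range(len(costs)), key=lambda i: costs[i], reverse=True)
    for idx in order:
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + costs[idx], gpu))
    return assignment
```

Placing the heaviest samples first leaves the small ones to even out the remaining gaps, which tends to keep per‑GPU step times close to one another.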

Comprehensive Evaluation & Results

Zero‑shot ViT classification shows competitive performance on ImageNet‑A and ObjectNet after adding 2D RoPE. The Slow‑Fast strategy outperforms Qwen‑2.5‑VL on VideoMME, achieving higher scores with more frames (384 vs 128). On public benchmarks (OpenCompass, MMMU, AI2D, MMBench, MMStar) Keye‑VL‑1.5 attains best or near‑best results. For video tasks (Video‑MME, Video‑MMMU, TempCompass) it surpasses all open‑source models, delivering a 6.5 % absolute gain on Video‑MMMU. Internal evaluations with a 5‑point rubric across eight dimensions confirm superior correctness (+0.57) and reasoning (+1.00) over strong baselines.

Key Results & Ablation Studies

Each of SFT, MPO, and LongCoT cold‑start adds a consistent performance gain.

Expert model fusion (e.g., OCR specialist) markedly improves domain‑specific tasks.

Alignment RL boosts instruction‑following and mathematical reasoning.

Progressive hint sampling provides partial solutions that raise success rates in RL.

Rejection sampling during iterative RL yields significant improvements over vanilla RL.

Case Studies

Temporal localization: the model pinpoints a 2‑second handbag appearance within a 26‑second clip with 0.1 s precision. Complex behavior reasoning: it interprets “big dog lightly bites small dog’s ear” as the big dog correcting the small dog’s food‑stealing behavior and links this to the video title. Fine‑grained scene description: for a forest hailstorm, it produces a detailed description of the environment, although the description omits the word “hail”.

Conclusion

Keye‑VL‑1.5’s three major contributions are (1) the innovative Slow‑Fast encoding that resolves the fundamental spatio‑temporal trade‑off, (2) a systematic training pipeline—from progressive pre‑training to iterative post‑training—that provides a blueprint for building next‑generation multimodal LLMs, and (3) superior comprehensive performance that sets a new benchmark for video understanding while retaining strong general vision‑language abilities.
