How Keye‑VL‑1.5 Redefines Video Understanding with Slow‑Fast Encoding
Keye‑VL‑1.5, an 8‑billion‑parameter multimodal large language model, introduces a Slow‑Fast video encoding strategy, a four‑stage progressive pre‑training pipeline with 128K context, and a sophisticated post‑training regime that together achieve state‑of‑the‑art performance on video and vision‑language benchmarks while maintaining strong general capabilities.
Core Innovation 1: Slow‑Fast Video Encoding Strategy
The model tackles the spatio‑temporal trade‑off by classifying each frame as either Slow (a key frame encoded at high spatial resolution) or Fast (a transition frame encoded at low resolution to give high temporal coverage). The first frame is always Slow; each subsequent frame is compared with the most recent Slow frame using an image‑patch similarity function. If the similarity exceeds 95%, the frame is marked Fast; otherwise it becomes a new Slow frame. Token allocation follows the same split: each Fast frame receives 30% of the token budget of a Slow frame, and a binary search determines the token count per Slow frame so that the whole video fits within a total visual token budget of 75K.
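The classification rule and the budget search described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the toy `patch_similarity` measure (fraction of identical patches), and the `max_slow_tokens` upper bound are all assumptions introduced here; only the 95% threshold, the 30% ratio, and the 75K budget come from the text.

```python
from typing import Callable, List, Sequence


def patch_similarity(a: Sequence[int], b: Sequence[int]) -> float:
    """Toy stand-in for the similarity function: fraction of identical patches."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def classify_frames(
    frames: Sequence[Sequence[int]],
    similarity: Callable[[Sequence[int], Sequence[int]], float] = patch_similarity,
    threshold: float = 0.95,
) -> List[str]:
    """Label each frame 'slow' or 'fast' by comparing it with the latest Slow frame."""
    labels: List[str] = []
    last_slow = None
    for frame in frames:
        if last_slow is not None and similarity(frame, last_slow) > threshold:
            labels.append("fast")
        else:
            labels.append("slow")  # the first frame always lands here
            last_slow = frame
    return labels


def total_tokens(slow_tokens: int, n_slow: int, n_fast: int, fast_ratio: float = 0.3) -> int:
    """Visual tokens used when each Slow frame gets `slow_tokens` tokens."""
    return n_slow * slow_tokens + n_fast * int(slow_tokens * fast_ratio)


def slow_tokens_within_budget(
    labels: Sequence[str],
    total_budget: int = 75_000,
    fast_ratio: float = 0.3,
    max_slow_tokens: int = 20_480,  # hypothetical per-frame cap, not from the article
) -> int:
    """Binary-search the largest per-Slow-frame token count that fits the budget."""
    n_slow = labels.count("slow")
    n_fast = labels.count("fast")
    lo, hi = 0, max_slow_tokens
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if total_tokens(mid, n_slow, n_fast, fast_ratio) <= total_budget:
            lo = mid  # mid fits, try a larger allocation
        else:
            hi = mid - 1  # mid overshoots the budget
    return lo
```

The search works because the total token count never decreases as the per-Slow-frame allocation grows, so the largest allocation that fits can be found in a logarithmic number of steps.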
Core Innovation 2: Progressive Four‑Stage Pre‑Training & Ultra‑Long Context
Stage 1 – Cross‑Modal Alignment
Freeze the visual transformer (ViT) and language decoder (LLM) and train only the MLP projector on massive image‑text pairs to map visual features into the language semantic space.
Stage 2 – Multi‑Task Pre‑Training
Unfreeze all parameters and train end‑to‑end on tasks such as image captioning, OCR, VQA, grounding, and mixed image‑text data, using an 8 K token context window.
Stage 3 – Annealing with High‑Quality Data
Refine the model on curated high‑quality data while expanding the context length from 8 K to 128 K tokens. This involves resetting the RoPE inverse frequency to 8,000,000, introducing context parallelism and pipeline parallelism, and adjusting the data mixing ratio to 24% video, 50% image, and 26% text.
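To see why a larger RoPE value helps with long contexts, the sketch below computes the standard RoPE inverse frequencies and the wavelength of the slowest-rotating dimension pair. It assumes the 8,000,000 figure is the RoPE base (often written as theta) and uses a head dimension of 128; both are assumptions made for illustration, not details confirmed by the article.

```python
import math
from typing import List


def rope_inv_freq(head_dim: int, base: float) -> List[float]:
    """Standard RoPE inverse frequencies: base ** (-2i / head_dim) for each pair i."""
    return [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]


def longest_wavelength(head_dim: int, base: float) -> float:
    """Number of positions the slowest-rotating pair needs for one full turn."""
    return 2 * math.pi / rope_inv_freq(head_dim, base)[-1]


HEAD_DIM = 128        # hypothetical head dimension, not stated in the article
BASE = 8_000_000      # value quoted in the article, assumed to be the RoPE base
CONTEXT = 128 * 1024  # 128 K token context window

# True when the slowest pair does not wrap around inside the context window.
print(longest_wavelength(HEAD_DIM, BASE) > CONTEXT)
```

A larger base stretches every wavelength, so positions far apart in a 128 K window still receive distinguishable rotations in the low-frequency dimensions.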
Stage 4 – Model Fusion
Combine expert models trained on specialized data (e.g., OCR) with the base model to improve robustness and reduce bias.
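The article does not say how the expert and base models are combined. One common approach, shown here purely as an assumed illustration, is a weighted average of parameters that share the same names and shapes. The function name `merge_weights` and the dict-of-lists representation are hypothetical.

```python
from typing import Dict, List, Sequence

Weights = Dict[str, List[float]]


def merge_weights(base: Weights, experts: Sequence[Weights], expert_weights: Sequence[float]) -> Weights:
    """Linearly interpolate base and expert parameters (assumed fusion scheme).

    The base model receives whatever weight is left after the experts: 1 - sum(expert_weights).
    """
    base_weight = 1.0 - sum(expert_weights)
    if base_weight < 0:
        raise ValueError("expert weights must sum to at most 1")
    merged: Weights = {}
    for name, base_values in base.items():
        merged[name] = [
            base_weight * b + sum(w * e[name][i] for w, e in zip(expert_weights, experts))
            for i, b in enumerate(base_values)
        ]
    return merged


base_model = {"proj": [1.0, 2.0]}
ocr_expert = {"proj": [3.0, 4.0]}  # stand-in for an OCR specialist checkpoint
fused = merge_weights(base_model, [ocr_expert], [0.5])
```

With a single expert weighted at 0.5, each fused parameter is the midpoint of the base and expert values.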
Core Innovation 3: Post‑Training for Reasoning & Preference Alignment
SFT (Supervised Fine‑Tuning): Over 7.5 M high‑quality multimodal QA samples covering 70 K task types are used to fine‑tune the model.
MPO (Mixed Preference Optimization): After SFT, MPO trains the model to distinguish good from bad responses using generated preference pairs.
Keye Reward Model: Evaluates response quality for RL, supporting two modes – no_think (direct judgment) and think (nine‑dimensional reasoning before judgment).
LongCoT Cold‑Start: A five‑step automated pipeline creates high‑quality chain‑of‑thought data, involving multi‑source collection, multi‑path reasoning generation with confidence scoring, two‑level quality assessment, human‑in‑the‑loop enhancement, and dynamic scoring for repeated training.
Iterative General RL (GSPO): Uses group‑sequence policy optimization with importance weighting to iteratively improve the model.
Progressive Hint Sampling: A five‑level hint system (L1–L5) provides increasingly detailed assistance for hard samples, enabling minimal‑necessary‑information sampling.
Alignment RL: Enforces instruction following, format adherence, and human‑preference alignment through rule‑based, generative, and model‑based rewards.
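The idea behind progressive hint sampling, escalating help only as far as needed, can be sketched as a simple loop. This is an assumed illustration rather than the paper's procedure: `attempt` is a hypothetical callable that returns a verified-correct answer or `None`, and the hint strings and `n_samples` value are placeholders.

```python
from typing import Callable, Optional, Sequence, Tuple


def sample_with_minimal_hint(
    question: str,
    hints: Sequence[str],
    attempt: Callable[[str], Optional[str]],
    n_samples: int = 4,
) -> Optional[Tuple[int, str]]:
    """Return (hint_level, answer) for the lowest hint level that yields a correct answer.

    Level 0 means no hint; levels 1..len(hints) correspond to L1..L5 in the text.
    Returns None if every level fails.
    """
    prompts = [question] + [f"{question}\nHint: {h}" for h in hints]
    for level, prompt in enumerate(prompts):
        for _ in range(n_samples):
            answer = attempt(prompt)
            if answer is not None:
                return level, answer  # stop at the least informative hint that works
    return None


def toy_attempt(prompt: str) -> Optional[str]:
    """Toy policy that only succeeds once the third-level hint is present."""
    return "solved" if "hint L3" in prompt else None


placeholder_hints = [f"hint L{i}" for i in range(1, 6)]
result = sample_with_minimal_hint("What is the area?", placeholder_hints, toy_attempt)
```

Because the loop stops at the first level that produces a correct answer, the trajectory kept for RL carries no more guidance than the sample actually required.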
Training Infrastructure Optimizations
Heterogeneous Hybrid Parallelism: Data parallelism for the ViT, and a mix of pipeline, tensor, and data parallelism for the LLM, to handle their differing compute characteristics.
Dynamic Load Balancing: Estimate the per‑sample compute cost and greedily assign samples to GPUs to equalize step time.
Flexible Scalable Dataloader: Deep‑aware data loading, an I/O server that offloads video decoding, and sample‑level checkpoint recovery.
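The greedy assignment mentioned under dynamic load balancing can be sketched as follows. The article only says samples are assigned greedily from estimated costs; processing the most expensive samples first (the classic longest-processing-time heuristic) and the function name `balance` are assumptions made here.

```python
import heapq
from typing import List, Sequence


def balance(costs: Sequence[float], n_gpus: int) -> List[List[int]]:
    """Assign sample indices to GPUs, always giving the next sample to the least-loaded GPU."""
    heap = [(0.0, gpu) for gpu in range(n_gpus)]  # (current load, gpu id)
    heapq.heapify(heap)
    assignment: List[List[int]] = [[] for _ in range(n_gpus)]
    # Place expensive samples first so the cheap ones can even out the remainder.
    for idx in sorted(range(len(costs)), key=lambda i: costs[i], reverse=True):
        load, gpu = heapq.heappop(heap)
        assignment[gpu].append(idx)
        heapq.heappush(heap, (load + costs[idx], gpu))
    return assignment


estimated_costs = [8, 7, 6, 5, 4]  # hypothetical per-sample compute estimates
plan = balance(estimated_costs, n_gpus=2)
loads = [sum(estimated_costs[i] for i in gpu_samples) for gpu_samples in plan]
```

The heuristic does not guarantee a perfectly even split, but it is cheap enough to run every step and keeps the slowest GPU close to the average load.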
Comprehensive Evaluation & Results
Zero‑shot ViT classification shows competitive performance on ImageNet‑A and ObjectNet after adding 2D RoPE. The Slow‑Fast strategy outperforms Qwen2.5‑VL on Video‑MME, achieving higher scores with more frames (384 vs 128). On public benchmarks (OpenCompass, MMMU, AI2D, MMBench, MMStar), Keye‑VL‑1.5 attains the best or near‑best results. On video tasks (Video‑MME, Video‑MMMU, TempCompass) it surpasses all open‑source models, with a 6.5% absolute gain on Video‑MMMU. Internal evaluations using a 5‑point rubric across eight dimensions show higher correctness (+0.57) and reasoning (+1.00) scores than strong baselines.
Key Results & Ablation Studies
SFT, MPO, and LongCoT cold‑start each add a consistent performance gain.
Expert model fusion (e.g., OCR specialist) markedly improves domain‑specific tasks.
Alignment RL boosts instruction‑following and mathematical reasoning.
Progressive hint sampling provides partial solutions that raise success rates in RL.
Rejection sampling during iterative RL yields significant improvements over vanilla RL.
Case Studies
Temporal localization: The model pinpoints a 2‑second handbag appearance within a 26‑second clip with 0.1 s precision.
Complex behavior reasoning: It interprets “big dog lightly bites small dog’s ear” as the big dog correcting the small dog’s food‑stealing behavior, and links this to the video title.
Fine‑grained scene description: For a forest hailstorm, it describes the environment in detail, although it never uses the word “hail”.
Conclusion
Keye‑VL‑1.5’s three major contributions are (1) the Slow‑Fast encoding, which addresses the spatio‑temporal trade‑off in video input, (2) a systematic training pipeline, running from progressive pre‑training to iterative post‑training, that offers a blueprint for building multimodal LLMs, and (3) strong overall performance that sets a new benchmark for video understanding while retaining general vision‑language abilities.
Data Party THU
Official platform of Tsinghua Big Data Research Center, sharing the team's latest research, teaching updates, and big data news.
