MBench: Tsinghua and Tencent Define Long-Term Memory for Video World Models

MBench, a new benchmark from Tsinghua University and Tencent, systematically evaluates the long‑term memory ability of streaming video generation models across entity, environment, and causal consistency, introduces a trigger‑conditioned scoring scheme, and reveals that memory remains a major bottleneck for current SOTA models.

Machine Heart

Jun 11, 2026

As video generation advances from short clip synthesis to streaming long‑video creation, models must not only produce visually realistic frames but also maintain stable internal states over extended interactions, obeying physical laws and logical rules.

Introducing MBench

To quantify this capability, Tsinghua University and Tencent's WeChat Vision team released MBench, a benchmark that specifically targets the memory ability of video world models. Built on 1,040 cases, MBench decomposes memory into three complementary core dimensions, which are further divided into twelve measurable sub‑dimensions ranging from static attributes to dynamic causality.

Three Memory Dimensions

Entity Consistency: evaluates whether a model preserves the persistent identity and attributes of objects and humans (e.g., geometry, texture, clothing) when they re‑appear after occlusion.

Environment Consistency: measures scene stability, including spatial consistency (3‑D layout via epipolar geometry and reprojection error) and rendering consistency (stable lighting and style) during camera motion and return.

Causal Consistency: tests the model’s ability to remember causal logic, both self‑evolution and interaction, such as correctly rendering fragments after an object is broken and remembering new object positions after textual commands.
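
The reprojection error mentioned under Environment Consistency can be illustrated with a minimal sketch. This is not MBench's evaluation code, which the article does not describe in detail; it assumes a simple pinhole camera, and the function names, the intrinsics (`fx`, `fy`, `cx`, `cy`), and the choice of a mean over points are all hypothetical. The idea is that 3‑D points recovered from earlier frames are projected into a later view and compared with where those points actually appear; a scene that stays geometrically stable yields a small pixel error.

```python
import math

def project(point_cam, fx, fy, cx, cy):
    """Project a 3-D point (camera coordinates) to pixels with a pinhole model."""
    x, y, z = point_cam
    if z <= 0:
        raise ValueError("point must lie in front of the camera (z > 0)")
    return (fx * x / z + cx, fy * y / z + cy)

def mean_reprojection_error(points_cam, observed_px, fx, fy, cx, cy):
    """Mean pixel distance between projected 3-D points and their observed positions."""
    if not points_cam or len(points_cam) != len(observed_px):
        raise ValueError("need equally sized, non-empty point lists")
    errors = [
        math.dist(project(p, fx, fy, cx, cy), q)
        for p, q in zip(points_cam, observed_px)
    ]
    return sum(errors) / len(errors)

# Hypothetical example: two points, one re-observed 5 pixels away from its projection.
points = [(0.0, 0.0, 2.0), (1.0, 0.0, 2.0)]
observed = [(323.0, 244.0), (570.0, 240.0)]
error = mean_reprojection_error(points, observed, fx=500, fy=500, cx=320, cy=240)
```

A larger value here would indicate that the scene layout has drifted between the two views, which is the kind of failure the spatial consistency sub‑dimension is meant to expose.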

Trigger‑Conditioned Scoring

The authors identified a confounding factor: models may avoid challenging memory triggers, yielding inflated consistency scores. MBench therefore splits the score into:

Trigger Coverage (C_trig): checks whether the model successfully executes the memory challenge event (e.g., object re‑entry).

Memory Reliability (S_rel): computes consistency only on samples where the trigger succeeded.

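The final M‑Score is the harmonic mean of these two components. Because a harmonic mean is pulled toward the smaller of its inputs, a model cannot score well by being consistent on only the few samples where the trigger happened to occur; the score penalizes static or overly conservative generation and rewards dynamic, consistent modeling.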
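
The scoring scheme above can be sketched in a few lines. This is an illustration under stated assumptions rather than the benchmark's implementation: the article only says that S_rel is computed on triggered samples and that the M‑Score is the harmonic mean of C_trig and S_rel. The `Sample` type, the [0, 1] score range, and averaging per‑sample consistency to obtain S_rel are assumptions made here.

```python
from dataclasses import dataclass
from typing import Sequence

@dataclass
class Sample:
    triggered: bool     # did the memory-trigger event (e.g., object re-entry) occur?
    consistency: float  # assumed to lie in [0, 1]; ignored when triggered is False

def trigger_coverage(samples: Sequence[Sample]) -> float:
    """C_trig: fraction of samples in which the trigger event actually happened."""
    if not samples:
        raise ValueError("no samples")
    return sum(s.triggered for s in samples) / len(samples)

def memory_reliability(samples: Sequence[Sample]) -> float:
    """S_rel: mean consistency over triggered samples only (assumed aggregation)."""
    hits = [s.consistency for s in samples if s.triggered]
    return sum(hits) / len(hits) if hits else 0.0

def m_score(samples: Sequence[Sample]) -> float:
    """Harmonic mean of C_trig and S_rel; 0 when both components are 0."""
    c = trigger_coverage(samples)
    r = memory_reliability(samples)
    return 0.0 if c + r == 0 else 2 * c * r / (c + r)

# Hypothetical models: one avoids the trigger, one attempts it almost every time.
conservative = [Sample(True, 0.9)] * 2 + [Sample(False, 0.0)] * 8
dynamic = [Sample(True, 0.7)] * 9 + [Sample(False, 0.0)]
```

In this made‑up comparison the conservative model has the higher reliability (0.9 versus 0.7) but triggers the event in only 2 of 10 samples, so its M‑Score comes out well below that of the dynamic model. A plain average of consistency over triggered samples would have ranked the two the other way around.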

Evaluation of 14 SOTA Models

Using MBench, the team evaluated eight text‑driven and six action‑driven state‑of‑the‑art models. Results show no single model excels across all dimensions; long‑term memory remains a universal bottleneck. Key findings include:

Spatial and causal abilities are the primary weaknesses: most models fail to reconstruct geometry after long‑term viewpoint changes or to maintain causal logic.

Action‑driven models exhibit a “static‑bias” failure mode, often generating overly static scenes to avoid spatial collapse, which limits their ability to drive complex physical evolution.

High visual fidelity does not guarantee memory stability; models that produce photorealistic frames may still lose consistency over long sequences.

Conclusion

While video generation has progressed from single‑image synthesis to minute‑level video creation, achieving world‑understanding, prediction, and interaction requires robust memory. MBench openly provides the full dataset, evaluation code, a live leaderboard, and detailed reports, aiming to guide future research toward models that can truly “remember, understand, and predict” the world.
