Agent‑Driven Newton Toolbox: A New Paradigm for Grounded Video Generation
NEWTON is an agent‑centric framework that wraps an existing video generator with a planner, physics‑aware tools, and a verification loop. Multi‑round refinement improves physical consistency on the VideoPhy‑2 benchmark without retraining the underlying generator.
Recent text‑to‑video models such as Sora, Veo and Kling have achieved impressive visual fidelity, yet they still lack genuine understanding of physical laws. Evaluations on the VideoPhy‑2 benchmark reveal that even the best models attain only 32.6% joint accuracy, exposing a gap between visual realism and physical plausibility.
The authors trace the root cause to insufficient input information: a natural‑language prompt compresses the physical scene and omits crucial parameters (e.g., container shape, foam generation, liquid rise speed). The model must therefore hallucinate the missing dynamics, which produces artifacts such as static liquid surfaces, absent collision effects, or unrealistic object behavior.
To bridge this gap, Zhejiang University, Hong Kong Polytechnic University, TreeRoot Technology and Sany Group propose NEWTON (Neural Agentic World‑Aware Tool‑Orchestrated Navigation). NEWTON brings the agent paradigm into video generation through three roles. A Planner first determines which physical variables are missing and which tools to invoke. An Executor calls those tools and the frozen video generator. A Verifier scores the physical plausibility of the result and returns feedback to the Planner for the next round.
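The Planner, Executor and Verifier roles described above can be pictured as a small control loop. The following is a minimal sketch, not the authors' implementation: every name in it (Context, run_rounds, the toy planner, tools, generator and verifier, the 0.8 acceptance threshold and the three‑round cap) is a hypothetical stand‑in chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Context:
    """Everything the generator is conditioned on (hypothetical structure)."""
    prompt: str
    notes: List[str] = field(default_factory=list)     # tool outputs so far
    feedback: List[str] = field(default_factory=list)  # verifier comments


def run_rounds(
    prompt: str,
    planner: Callable[[Context], List[str]],
    tools: Dict[str, Callable[[Context], str]],
    generator: Callable[[Context], str],
    verifier: Callable[[str, Context], Tuple[float, str]],
    max_rounds: int = 3,       # assumed cap, not from the paper
    threshold: float = 0.8,    # assumed acceptance score, not from the paper
) -> Tuple[str, float]:
    ctx = Context(prompt)
    best_video, best_score = "", float("-inf")
    for _ in range(max_rounds):
        # Planner: decide which tools to call given prompt, notes and feedback.
        for name in planner(ctx):
            # Executor: run each chosen tool and keep its output as context.
            ctx.notes.append(tools[name](ctx))
        # Executor: call the frozen generator with the enriched context.
        video = generator(ctx)
        # Verifier: score physical plausibility and explain what is wrong.
        score, comment = verifier(video, ctx)
        if score > best_score:
            best_video, best_score = video, score
        if score >= threshold:
            break
        ctx.feedback.append(comment)  # feeds the next planning round
    return best_video, best_score


# Toy stand-ins so the loop can be run end to end.
def toy_planner(ctx: Context) -> List[str]:
    return ["physics"] if not ctx.feedback else ["keyframes"]


toy_tools = {
    "physics": lambda ctx: "apex at frame 12",
    "keyframes": lambda ctx: "keyframe: liquid level rises",
}


def toy_generator(ctx: Context) -> str:
    return "video(" + ctx.prompt + " | " + "; ".join(ctx.notes) + ")"


def toy_verifier(video: str, ctx: Context) -> Tuple[float, str]:
    score = 0.9 if len(ctx.notes) >= 2 else 0.5
    return score, "liquid surface is static"


video, score = run_rounds(
    "pour beer into a glass", toy_planner, toy_tools, toy_generator, toy_verifier
)
```

In this toy run the first round scores below the threshold, the verifier comment is handed back to the planner, and the second round adds a keyframe note before the loop stops. The generator is only ever called, never updated, which mirrors the frozen‑generator setup described in the text.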
The system’s tool library covers complementary physical dimensions:
The keyframe generation tool supplies temporal boundary conditions, such as placing the apex of a parabolic trajectory at a specific frame or gradually raising the liquid level during a pour.
The scientific computation tool runs sandboxed Python code to compute trajectories, momentum conservation, rotational dynamics, and similar quantities, inserting explicit physical reasoning into the generation context.
The prompt optimization tool restructures material properties, action phases and causal relations into a form that the generator can more readily “understand”.
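To make the scientific computation tool more concrete, here is the kind of small calculation it might run for the parabola example: given an upward launch speed and a frame rate, find the frame at which the object peaks. This is only a sketch under simple assumptions (no air resistance, constant gravity); the names ApexHint, apex_hint and to_prompt_note are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass

G = 9.81  # gravitational acceleration in m/s^2


@dataclass
class ApexHint:
    frame: int       # frame index at which the object is highest
    time_s: float    # time of the apex in seconds
    height_m: float  # rise above the release point in metres


def apex_hint(v0_up: float, fps: int, g: float = G) -> ApexHint:
    """Apex of an ideal projectile launched upward at v0_up (m/s)."""
    if v0_up <= 0 or fps <= 0 or g <= 0:
        raise ValueError("v0_up, fps and g must be positive")
    t_apex = v0_up / g               # vertical velocity reaches zero here
    height = v0_up ** 2 / (2 * g)    # from v^2 = v0^2 - 2*g*h with v = 0
    return ApexHint(frame=round(t_apex * fps), time_s=t_apex, height_m=height)


def to_prompt_note(hint: ApexHint) -> str:
    """Turn the numbers into text a keyframe or prompt tool could consume."""
    return (
        f"The object reaches its highest point, {hint.height_m:.2f} m above "
        f"release, at frame {hint.frame}."
    )


hint = apex_hint(v0_up=4.905, fps=24)
note = to_prompt_note(hint)
```

With a launch speed of 4.905 m/s the apex falls at 0.5 s, which is frame 12 at 24 frames per second. A result like this could then serve as the boundary condition that the keyframe tool enforces.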
Crucially, NEWTON does not modify the underlying video generator. Whether the backend is LTX‑Video or Veo‑3.1, the generator stays frozen. Only the Planner is trained, using Flow‑GRPO on‑policy optimization across multi‑round tool‑calling episodes, so that it learns when to compute physics, generate keyframes, rewrite scene descriptions, or trigger final video synthesis.
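The article does not spell out the training objective, so the following is only a sketch of the group‑relative advantage that GRPO‑style methods generally use: several episodes are sampled for the same prompt, and each episode's reward is normalized against the group mean and standard deviation. Broadcasting one episode‑level advantage to every planning turn is an assumption made here for illustration, as are the function names and the example rewards.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of episodes for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


def broadcast_to_turns(
    advantages: List[float], turns_per_episode: List[int]
) -> List[List[float]]:
    """Assumption: every planning turn in an episode shares that episode's advantage."""
    return [[a] * n for a, n in zip(advantages, turns_per_episode)]


# Hypothetical verifier scores for four episodes sampled from one prompt.
rewards = [0.2, 0.4, 0.6, 0.8]
adv = group_advantages(rewards)
per_turn = broadcast_to_turns(adv, turns_per_episode=[2, 3, 1, 2])
```

Episodes that score above the group average receive a positive advantage and their tool‑calling decisions are reinforced; those below the average are discouraged. No gradient reaches the video generator in this scheme, which is consistent with it remaining frozen.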
Experiments on VideoPhy‑2 show consistent gains without retraining the generator. With LTX‑Video, joint accuracy rises from 21.4% to 29.7% (+8.3 points); with Veo‑3.1, it improves from 30.7% to 37.4% (+6.7 points). Qualitative cases show NEWTON correctly rendering the rising liquid level when beer is poured, the groove and wood chips produced when wood is cut, and coherent bubble and LEGO‑rugby interactions, all scenarios where the baseline models produce static or implausible outcomes.
In summary, NEWTON’s contribution extends beyond a metric boost; it proposes a new paradigm where video models serve as callable modules within an Agent system, enabling systematic identification of missing information, tool‑driven physical reasoning, result verification, and iterative replanning to achieve more trustworthy, world‑consistent video synthesis.
Paper: "NEWTON: Agentic Planning for Physically Grounded Video Generation"
arXiv: https://arxiv.org/abs/2605.18396
Project page: https://newton026.github.io/newton/
