Physical reasoning

Physical reasoning tests whether a model that watches a short clip can predict what happens next under ordinary physics. Description is not the bottleneck. A frontier model can narrate a scene in fluent detail, yet still fail to anticipate that a tilted glass spills or that an unsupported object falls. We track that gap directly.

The task

Each item presents a short video and asks the model to predict the physical outcome of an event in the scene. Scenes cover everyday dynamics: contact and collision, support and balance, containment, and the effects of force over time. The questions are designed so that surface description is not enough; answering correctly requires a working model of how the depicted world evolves.

Why it matters

The capabilities that follow today's language models are likely to come from systems that build and update an internal model of the world, then use that model to predict and plan. Measuring physical prediction from video isolates that ability from language fluency, which makes it a cleaner signal of progress toward grounded reasoning.

Loka-1 result

Loka-1 is our open physical reasoning model. On the Meta FAIR Physical Reasoning from Video leaderboard, it reports results across all three tracked benchmarks:

Model	IntPhys 2	MVPBench	CausalVQA
Human baseline	92.44	92.90	84.78
Cosmos-Reason2-8B	58.14	47.19	59.14
Loka-1	50.00	50.02	50.44
V-JEPA 2	56.40	44.50	44.89
GPT-4o	53.19	32.50	50.95
Qwen2.5-VL	49.12	36.70	49.05
Gemini 2.5 Flash	56.10	-	61.66

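That places Loka-1 second among model submissions that report all three benchmarks, behind Cosmos-Reason2-8B, and third overall when the human baseline is included. This ordering is consistent with ranking by the unweighted mean of the three scores. Gemini 2.5 Flash is left out of the comparison because it has no MVPBench score.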
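The placement can be checked with a few lines of arithmetic. The sketch below assumes the ranking uses the unweighted mean of the three benchmark scores; the leaderboard's actual aggregation rule is not stated here, so treat that as an assumption. The scores are copied from the table, and the names `SCORES` and `rank_by_mean` are illustrative only.

```python
# Scores copied from the table above, in the order
# (IntPhys 2, MVPBench, CausalVQA). None marks a missing score.
SCORES = {
    "Cosmos-Reason2-8B": (58.14, 47.19, 59.14),
    "Loka-1": (50.00, 50.02, 50.44),
    "V-JEPA 2": (56.40, 44.50, 44.89),
    "GPT-4o": (53.19, 32.50, 50.95),
    "Qwen2.5-VL": (49.12, 36.70, 49.05),
    "Gemini 2.5 Flash": (56.10, None, 61.66),
}


def rank_by_mean(scores):
    """Return (model, mean) pairs, best first, for models with all three scores.

    Assumption: ranking by unweighted mean. The real leaderboard may
    aggregate differently.
    """
    # Drop any model that is missing at least one benchmark score.
    complete = {m: s for m, s in scores.items() if None not in s}
    means = {m: sum(s) / len(s) for m, s in complete.items()}
    return sorted(means.items(), key=lambda kv: kv[1], reverse=True)


ranking = rank_by_mean(SCORES)
for position, (model, mean) in enumerate(ranking, start=1):
    print(f"{position}. {model}: {mean:.2f}")
```

Under this assumption Cosmos-Reason2-8B averages 54.82 and Loka-1 averages 50.15, with V-JEPA 2 next at 48.60. Per benchmark the picture is less uniform: Loka-1 has the highest model score on MVPBench but trails several models on IntPhys 2 and CausalVQA.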

Start here

Explore the benchmark and submission details.

Physical Reasoning benchmark

Related blog posts

From Language Models to World Models

Loka-1 on Physical Reasoning