Physical reasoning

Physical reasoning studies whether a model that watches a short clip can predict what happens next under ordinary physics. Description is not the bottleneck. A frontier model can narrate a scene in fluent detail, yet still fail to anticipate that a tilted glass spills or that an unsupported object falls. We track that gap directly.
The task
Each item presents a short video and asks the model to predict the physical outcome of an event in the scene. Scenes cover everyday dynamics: contact and collision, support and balance, containment, and the effects of force over time. The questions are designed so that surface description is not enough; answering correctly requires a working model of how the depicted world evolves.
Why it matters
The next gains beyond today's language models are likely to come from systems that build and update an internal model of the world, then use that model to predict and plan. Measuring physical prediction from video isolates that ability from language fluency, which makes it a cleaner signal of progress toward grounded reasoning.
Loka-1 result
Loka-1 is our open physical reasoning model. On the Meta FAIR Physical Reasoning from Video leaderboard, it reports results across all three tracked benchmarks:
| Model | IntPhys 2 | MVPBench | CausalVQA |
|---|---|---|---|
| Human baseline | 92.44 | 92.90 | 84.78 |
| Cosmos-Reason2-8B | 58.14 | 47.19 | 59.14 |
| Loka-1 | 50.00 | 50.02 | 50.44 |
| V-JEPA 2 | 56.40 | 44.50 | 44.89 |
| GPT-4o | 53.19 | 32.50 | 50.95 |
| Qwen2.5-VL | 49.12 | 36.70 | 49.05 |
| Gemini 2.5 Flash | 56.10 | - | 61.66 |
Averaged across the three benchmarks (50.15), that places Loka-1 second among model submissions that report all three tasks, behind Cosmos-Reason2-8B (54.82), and third overall when the human baseline is included.
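The ranking can be checked directly from the table. The sketch below makes two assumptions that the page does not state. First, it uses an unweighted mean over IntPhys 2, MVPBench, and CausalVQA as the ranking rule. Second, it drops any submission with a missing score, such as Gemini 2.5 Flash. The function name `rank_complete_submissions` is hypothetical, and the sketch is not presented as the leaderboard's own scoring code.

```python
# Scores copied from the results table: (IntPhys 2, MVPBench, CausalVQA).
# None marks a benchmark the submission did not report.
SCORES = {
    "Cosmos-Reason2-8B": (58.14, 47.19, 59.14),
    "Loka-1": (50.00, 50.02, 50.44),
    "V-JEPA 2": (56.40, 44.50, 44.89),
    "GPT-4o": (53.19, 32.50, 50.95),
    "Qwen2.5-VL": (49.12, 36.70, 49.05),
    "Gemini 2.5 Flash": (56.10, None, 61.66),
}


def rank_complete_submissions(scores):
    """Hypothetical helper: rank models that report every benchmark.

    Assumes an unweighted mean is the ranking rule; returns
    (model, mean) pairs sorted from highest to lowest mean.
    """
    complete = {
        model: sum(vals) / len(vals)
        for model, vals in scores.items()
        if all(v is not None for v in vals)  # skip incomplete submissions
    }
    return sorted(complete.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    for rank, (model, mean) in enumerate(rank_complete_submissions(SCORES), start=1):
        print(f"{rank}. {model}: {mean:.2f}")
```

Under these assumptions, Cosmos-Reason2-8B (54.82) comes first and Loka-1 (50.15) second, followed by V-JEPA 2 (48.60). Loka-1 reaches second place without leading IntPhys 2 or CausalVQA. Its scores are nearly flat across the three tasks, and its MVPBench score (50.02) is the highest among the listed models.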