Visual Grounding

#1 on ScreenSpot-Pro
When an AI uses a computer for you, it has to do something people barely think about: look at the screen and know exactly where to click. If it chooses the wrong tiny icon, menu item, or button, the whole task can fail.

Visual grounding is the problem of making that click target reliable. It matters for everyday automation because real software is messy: screens are dense, labels repeat, icons are small, and the right answer often depends on what is visibly happening on the screen.

What the field studies

GUI visual grounding asks whether a model can connect language to interface geometry. The model has to understand the instruction, read the screen, distinguish similar controls, and return a location precise enough for an agent to act.
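As a rough illustration of what "precise enough" can mean, the sketch below scores a predicted click as correct when it falls inside the target element's bounding box. This is a common convention for click-style grounding; the function name `click_hits` and the box layout are assumptions made for this example, not something specified here.

```python
from typing import Tuple

Point = Tuple[float, float]              # (x, y) in screenshot pixels
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed layout


def click_hits(point: Point, target: Box) -> bool:
    """Return True when the predicted click lies inside the target box (edges included)."""
    x, y = point
    left, top, right, bottom = target
    return left <= x <= right and top <= y <= bottom


# A 16x16 icon: a click a few pixels past its right edge already misses.
icon = (100, 40, 116, 56)
inside = click_hits((105, 52), icon)
outside = click_hits((120, 52), icon)
```

On a small icon the margin for error is only a few pixels, which is why dense professional interfaces are hard.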

The field matters because most higher-level GUI agents eventually reduce a plan to a grounded action. If the action target is wrong, the planner can be right and the workflow still fails.

Previous approaches

Early systems leaned on structured representations such as HTML, accessibility trees, or UI metadata. Those are efficient and easy to inspect when they are available, but they can miss visual state, fail on canvas-heavy apps, and do not generalize cleanly to desktop software or remote screens.

Vision-language models made screenshot-first grounding possible. A model can look at the same pixels a person sees and output coordinates directly. The tradeoff is reliability: dense screens, tiny icons, scrolling regions, and ambiguous labels make raw coordinate prediction brittle.

GUI-specific training improved that picture. Work such as SeeClick trains models to connect screen instructions to actions, while parsing systems such as OmniParser detect and describe UI elements before the agent acts. These approaches add useful structure, but they also introduce data requirements, detector errors, and pipeline decisions.

Zoom-in pipelines are another practical response. They first make a coarse prediction on the full screenshot, crop around it, and ask again at higher effective resolution. The usual goal is better localization. The intermediate predictions are mostly treated as plumbing.
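The two-step flow described above can be sketched as follows. This is a minimal sketch under stated assumptions: `predict` is a hypothetical stand-in for any grounding model that is shown one region of the screenshot and returns a point relative to that region's top-left corner, and the fixed crop size and edge clamping are illustrative choices.

```python
from typing import Callable, Tuple

Point = Tuple[float, float]
Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in screenshot pixels


def crop_box_around(point: Point, screen_size: Tuple[int, int],
                    crop_size: Tuple[int, int]) -> Box:
    """Center a crop window on `point`, shifting it so it stays on screen."""
    (x, y), (sw, sh) = point, screen_size
    cw, ch = min(crop_size[0], sw), min(crop_size[1], sh)
    left = int(min(max(x - cw / 2, 0), sw - cw))
    top = int(min(max(y - ch / 2, 0), sh - ch))
    return (left, top, left + cw, top + ch)


def crop_to_screen(point_in_crop: Point, box: Box) -> Point:
    """Map a crop-local point back to full-screenshot coordinates."""
    return (box[0] + point_in_crop[0], box[1] + point_in_crop[1])


def two_step_zoom(predict: Callable[[Box], Point],
                  screen_size: Tuple[int, int],
                  crop_size: Tuple[int, int]):
    """Coarse prediction on the full screen, then a second prediction on a crop."""
    full = (0, 0, screen_size[0], screen_size[1])
    coarse = predict(full)                    # step 1: whole screenshot
    box = crop_box_around(coarse, screen_size, crop_size)
    refined_local = predict(box)              # step 2: zoomed-in crop
    refined = crop_to_screen(refined_local, box)
    return coarse, box, refined_local, refined
```

The intermediate values `coarse`, `box`, and `refined_local` are returned alongside the final point because they are exactly the "plumbing" that usually gets discarded.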

The current ScreenSpot-Pro leaderboard shows the same pattern in practice. Top systems include GUI specialists like KV-Ground that learn to read software screens, agentic models like Holo that reason more before choosing a target, multi-view methods like MVP that combine several visual guesses, and zoom/refine systems like AdaZoom that use a second pass on hard cases. These approaches all try to make small UI targets less ambiguous.

Our contribution

Zoom Consistency turns that existing plumbing into a confidence signal. In a two-step zoom pipeline, the second prediction should land near the center of the crop if the first prediction was already close to the target. If it lands far from center, the first step was probably off.

That distance is useful because it is geometric. It does not depend on token probabilities, model internals, or per-model calibration. Different models can be compared in the same coordinate space.
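One way to compute that distance is sketched below. The function name `zoom_consistency` and the normalization by crop width and height are assumptions for illustration; the text only specifies that the signal is the second prediction's distance from the crop center.

```python
import math
from typing import Tuple


def zoom_consistency(refined_local: Tuple[float, float],
                     crop_size: Tuple[int, int]) -> float:
    """Distance of the second-step prediction from the crop center.

    Offsets are divided by the crop width and height (an assumed
    normalization) so the score does not depend on crop resolution.
    0.0 means dead center; larger values mean the two steps disagree more.
    """
    cw, ch = crop_size
    dx = (refined_local[0] - cw / 2) / cw
    dy = (refined_local[1] - ch / 2) / ch
    return math.hypot(dx, dy)


centered = zoom_consistency((320, 320), (640, 640))   # second step agrees with the first
at_edge = zoom_consistency((640, 320), (640, 640))    # second step landed on the crop's right edge
```

If a pipeline shifts crops to keep them on screen, the first prediction is no longer at the crop center near screen edges; in that case, measuring the distance between the first and second predictions in screen coordinates would be the analogous quantity.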

We use that signal for training-free routing between a GUI grounding specialist and a general vision-language model. On ScreenSpot-Pro, zoom consistency is a weak but consistent predictor of correctness, and routing captures part of the gap between either model alone and an oracle that always picks the right one.
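The "part of the gap" framing can be made concrete with a small accounting sketch. The formula below, which uses the better single model as the baseline, is one assumed way to express it, and the correctness flags are invented toy values, not ScreenSpot-Pro results.

```python
from typing import Sequence


def accuracy(flags: Sequence[bool]) -> float:
    """Fraction of examples marked correct."""
    return sum(flags) / len(flags)


def gap_captured(spec_ok: Sequence[bool], gen_ok: Sequence[bool],
                 routed_ok: Sequence[bool]) -> float:
    """Share of the (oracle - best single model) gap recovered by routing."""
    best_single = max(accuracy(spec_ok), accuracy(gen_ok))
    # The oracle is correct whenever at least one of the two models is correct.
    oracle = accuracy([a or b for a, b in zip(spec_ok, gen_ok)])
    if oracle == best_single:
        return 0.0  # nothing for routing to gain
    return (accuracy(routed_ok) - best_single) / (oracle - best_single)


# Toy flags for eight examples (made up for illustration only).
spec_ok   = [True, True, False, False, True, True, False, True]
gen_ok    = [True, False, True, False, True, False, True, True]
routed_ok = [True, True, True, False, True, False, True, True]
share = gap_captured(spec_ok, gen_ok, routed_ok)
```

In this toy case each model alone gets 5 of 8 right, the oracle gets 7 of 8, and the router gets 6 of 8, so it recovers half of the available gap.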

The result is a simple pattern: keep the zoom pipeline, keep both model predictions, and use their consistency to decide which answer to trust.
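A training-free router along these lines might look like the sketch below. The decision rule (prefer the specialist unless the general model is more consistent by some margin), the `margin` parameter, and the `ZoomResult` type are all assumptions for illustration; the exact rule is not specified here.

```python
from typing import NamedTuple, Tuple


class ZoomResult(NamedTuple):
    name: str                    # which model produced this result
    point: Tuple[float, float]   # final prediction in screenshot coordinates
    consistency: float           # distance from crop center; lower is better


def route(specialist: ZoomResult, generalist: ZoomResult,
          margin: float = 0.0) -> ZoomResult:
    """Pick the answer whose two zoom steps agree more.

    The specialist is the default; the generalist wins only when its
    consistency score is lower by more than `margin` (an assumed knob).
    """
    if generalist.consistency + margin < specialist.consistency:
        return generalist
    return specialist


spec = ZoomResult("specialist", (1000.0, 500.0), 0.30)
gen = ZoomResult("generalist", (412.0, 96.0), 0.05)
chosen = route(spec, gen)               # generalist: its zoom steps agree more
kept = route(spec, gen, margin=0.5)     # specialist: the difference is within the margin
```

Because both scores live in the same coordinate space, no per-model calibration step is needed before comparing them.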

For the result write-up and a less technical walkthrough, read "Om Labs Tops ScreenSpot-Pro."
