Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract

Vision-language models (VLMs) excel at many tasks, yet continue to struggle with spatial reasoning, i.e., tasks that require information not directly observable in the input. Many spatial questions require simulating unseen viewpoints or integrating multiple partial observations into a unified spatial map, abilities that humans naturally support through imagination. Prior work introduces intermediate visual representations (e.g., visual thoughts, depth, or box tokens), but these largely refine structure that is already visible rather than predicting the missing spatial structure implied by the input. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under an alternative spatial configuration while remaining consistent with the observed evidence. To study this capability, we formulate three tasks that require imaginative perception: Perspective Taking, Path Tracing, and Multiview Counting. For each task, we construct a dataset of roughly 20K examples spanning real-world and simulated environments, paired with ground-truth intermediate imaginations and final answers, along with curated evaluation benchmarks. Using a unified VLM backbone as a baseline, we demonstrate that supervising imaginative intermediates provides a principled way to improve spatial reasoning over unobserved structure and enables more faithful, interpretable spatial inference.
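The supervision scheme described above, training the model to emit a ground-truth intermediate imagination before its final answer, can be made concrete with a small sketch. The example below is a hypothetical illustration under assumed names, not the paper's implementation: IPTExample, build_target, and the <imagine> delimiter tokens are all assumptions introduced here.

```python
# A minimal, hypothetical sketch of IPT-style supervision.
# All names and token formats here are illustrative assumptions.

from dataclasses import dataclass
from typing import List

# Assumed special markers delimiting the imaginative-perception span
# in the target sequence; the actual tokenization may differ.
IMAGINE_START = "<imagine>"
IMAGINE_END = "</imagine>"


@dataclass
class IPTExample:
    """One supervised example: observed input, ground-truth
    intermediate imagination, and the final answer."""
    question: str              # e.g., a Perspective Taking query
    observed_views: List[str]  # placeholder references to input images
    imagination: str           # ground-truth intermediate perception
    answer: str                # final answer


def build_target(example: IPTExample) -> str:
    """Compose the supervision target: the model first emits the
    imaginative perception span, then the answer, so loss is applied
    to both the intermediate and the final prediction."""
    return (
        f"{IMAGINE_START}{example.imagination}{IMAGINE_END}"
        f"{example.answer}"
    )


ex = IPTExample(
    question="From the chair's viewpoint, is the lamp left or right of the table?",
    observed_views=["view_0.png"],
    imagination="Rotated view: the lamp appears to the right of the table.",
    answer="right",
)
print(build_target(ex))
```

In this framing, the imagination span plays the role of an externalized intermediate that can be checked against ground truth, which is what makes the resulting inference more interpretable.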

Jaemin Cho
Young Investigator @ AI2
Incoming Assistant Professor @ JHU