Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models

Abstract

Vision-language models (VLMs) excel at many tasks, yet continue to struggle with spatial reasoning, i.e., tasks that require information not directly observable in the input. Many spatial questions require simulating unseen viewpoints or integrating multiple partial observations into a unified spatial map, abilities that humans naturally support through imagination. Prior work introduces intermediate visual representations (e.g., visual thoughts, depth, or box tokens), but these largely refine structure that is already visible rather than predicting the missing spatial structure implied by the input. We introduce Imaginative Perception Tokens (IPT), intermediate perceptual representations that externalize what a VLM would perceive under an alternative spatial configuration while remaining consistent with the observed evidence. To study this capability, we formulate three tasks that require imaginative perception: Perspective Taking, Path Tracing, and Multiview Counting. For each task, we construct a dataset of roughly 20K examples spanning real-world and simulated environments, paired with ground-truth intermediate imaginations and final answers, along with curated evaluation benchmarks. Using a unified VLM backbone as a baseline, we demonstrate that supervising imaginative intermediates provides a principled way to improve spatial reasoning over unobserved structure and enables more faithful, interpretable spatial inference.
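The supervision scheme described above, training the model to emit a ground-truth intermediate imagination before its final answer, can be made concrete with a small sketch. The example below is a hypothetical illustration under assumed names, not the paper's implementation: IPTExample, build_target, and the <imagine> delimiter tokens are all assumptions introduced here.

```python
# A minimal, hypothetical sketch of IPT-style supervision.
# All names and token formats here are illustrative assumptions.

from dataclasses import dataclass
from typing import List

# Assumed special markers delimiting the imaginative-perception span
# in the target sequence; the actual tokenization may differ.
IMAGINE_START = "<imagine>"
IMAGINE_END = "</imagine>"


@dataclass
class IPTExample:
    """One supervised example: observed input, ground-truth
    intermediate imagination, and the final answer."""
    question: str              # e.g., a Perspective Taking query
    observed_views: List[str]  # placeholder references to input images
    imagination: str           # ground-truth intermediate perception
    answer: str                # final answer


def build_target(example: IPTExample) -> str:
    """Compose the supervision target: the model first emits the
    imaginative perception span, then the answer, so loss is applied
    to both the intermediate and the final prediction."""
    return (
        f"{IMAGINE_START}{example.imagination}{IMAGINE_END}"
        f"{example.answer}"
    )


ex = IPTExample(
    question="From the chair's viewpoint, is the lamp left or right of the table?",
    observed_views=["view_0.png"],
    imagination="Rotated view: the lamp appears to the right of the table.",
    answer="right",
)
print(build_target(ex))
```

In this framing, the imagination span plays the role of an externalized intermediate that can be checked against ground truth, which is what makes the resulting inference more interpretable.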

Jaemin Cho
Young Investigator @ AI2
Incoming Assistant Professor @ JHU