Vision and Language

WildDet3D: Scaling Promptable 3D Detection in the Wild
Open-world promptable monocular 3D object detection from a single image
Imaginative Perception Tokens Enhance Spatial Reasoning in Multimodal Language Models
Imaginative Perception Tokens enhance spatial reasoning in multimodal language models - CVPR 2026 MUSI Workshop
DOCCI: Descriptions of Connected and Contrasting Images
High-quality, long, human-annotated descriptions of 15K images - ECCV 2024