Vision and Language

DOCCI: Descriptions of Connected and Contrasting Images
High-quality, long, human-annotated descriptions of 15K images
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Analyzing and patching action knowledge in video-language models - NeurIPS 2023 (Spotlight)