Vision and Language

Fine-grained Image Captioning with CLIP Reward
Findings of NAACL 2022
MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
A question answering benchmark on real-world news articles for multimedia and multi-hop reasoning - AAAI 2022