Vision and Language

TVLT: Textless Vision-Language Transformer

Vision-and-language modeling without text, using a transformer that takes only raw visual and audio inputs - *NeurIPS 2022 (Oral)*

Fine-grained Image Captioning with CLIP Reward

CLIP as a reward function for fine-grained image captioning - *Findings of NAACL 2022*

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

A question-answering benchmark on real-world news articles for multimedia, multi-hop reasoning - *AAAI 2022*