Vision and Language

Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention

Efficient VL modeling with Perceiver-based iterative cross-attentions - *[WACV 2023](https://wacv2023.thecvf.com)*

VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks

Adapter-based Parameter-Efficient Training for V&L tasks - *[CVPR 2022](https://cvpr2022.thecvf.com)*

DALL-Eval: Probing the Reasoning Skills and Social Biases of Text-to-Image Generative Transformers

Probing the Reasoning Skills and Social Biases of Text-to-Image Models

VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

Video-based grounding can improve diverse NLU tasks - *[NeurIPS 2021](https://nips.cc/Conferences/2021)*

Unifying Vision-and-Language Tasks via Text Generation

Tackle different V&L tasks via text generation with a single unified architecture - *[ICML 2021](https://icml.cc/Conferences/2021)*

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Generate images from text by predicting masked patches with multi-modal transformers - *[EMNLP 2020](https://2020.emnlp.org/)*