VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer

Video-based grounding can improve diverse NLU tasks - *NeurIPS 2021*

Unifying Vision-and-Language Tasks via Text Generation

Tackle different V&L tasks via text generation with a single unified architecture - *ICML 2021*

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

Generate images from text by predicting masked patches with multi-modal transformers - *EMNLP 2020*

Mixture Content Selection for Diverse Sequence Generation

Separate diversification from generation to improve both diversity and accuracy in sequence generation - *EMNLP 2019*

A Hierarchical Latent Structure for Variational Conversation Modeling

Propose a hierarchical VAE model and utterance drop regularization to mitigate the posterior collapse problem - *NAACL 2018* (Oral)