Vision-and-language modeling without text, using a transformer that takes only raw visual and audio inputs - *[NeurIPS 2022](https://nips.cc/) (Oral)*
A question-answering benchmark on real-world news articles for multimedia and multi-hop reasoning - *[AAAI 2022](https://aaai.org/Conferences/AAAI-22/)*