Jaemin Cho
Publications
Hierarchical Video-Moment Retrieval and Step-Captioning
HiREST is a holistic, hierarchical benchmark of multimodal retrieval and step-by-step summarization for a video corpus.
CVPR 2023
Abhay Zala*, Jaemin Cho*, Satwik Kottur, Xilun Chen, Barlas Oğuz, Yashar Mehdad, Mohit Bansal
Preprint
Cite
Code
Project
Perceiver-VL: Efficient Vision-and-Language Modeling with Iterative Latent Attention
Efficient VL modeling with Perceiver-based iterative cross-attentions.
WACV 2023
Zineng Tang*, Jaemin Cho*, Jie Lei, Mohit Bansal
Preprint
Cite
Code
TVLT: Textless Vision-Language Transformer
Vision-and-language modeling without text, using a transformer that takes only raw visual and audio inputs.
NeurIPS 2022 (Oral)
Zineng Tang*, Jaemin Cho*, Yixin Nie*, Mohit Bansal
Preprint
Cite
Code
LST: Ladder Side-Tuning for Parameter and Memory Efficient Transfer Learning
LST brings memory efficiency to parameter-efficient transfer learning.
NeurIPS 2022
Yi-Lin Sung, Jaemin Cho, Mohit Bansal
Preprint
Cite
Code
Fine-grained Image Captioning with CLIP Reward
Findings of NAACL 2022
Jaemin Cho, Seunghyun Yoon, Ajinkya Kale, Franck Dernoncourt, Trung Bui, Mohit Bansal
Preprint
Cite
Code
VL-Adapter: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
Adapter-based parameter-efficient training for V&L tasks.
CVPR 2022
Yi-Lin Sung, Jaemin Cho, Mohit Bansal
Preprint
Cite
Code
MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
A question answering benchmark on real-world news articles for multimedia and multi-hop reasoning.
AAAI 2022
Revanth Gangi Reddy, Xilin Rui, Manling Li, Xudong Lin, Haoyang Wen, Jaemin Cho, Lifu Huang, Mohit Bansal, Avi Sil, Shih-Fu Chang, Alexander Schwing, Heng Ji
Preprint
Cite
VidLanKD: Improving Language Understanding via Video-Distilled Knowledge Transfer
Video-based grounding can improve diverse NLU tasks.
NeurIPS 2021
Zineng Tang, Jaemin Cho, Hao Tan, Mohit Bansal
Preprint
Cite
Code
Unifying Vision-and-Language Tasks via Text Generation
Tackle different V&L tasks via text generation with a single unified architecture.
ICML 2021
Jaemin Cho, Jie Lei, Hao Tan, Mohit Bansal
Preprint
Cite
Code
X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers
Text-to-image generation via predicting vector-quantized image patches with multimodal LMs.
EMNLP 2020
Jaemin Cho, Jiasen Lu, Dustin Schwenk, Hannaneh Hajishirzi, Aniruddha Kembhavi
Preprint
Cite
Code
Project