Jaemin Cho

PhD candidate @ UNC Chapel Hill

🚨 I’m on the 2024-2025 academic job market! I’m also attending NeurIPS 2024!

I am a PhD candidate in Computer Science at UNC Chapel Hill (MURGe-Lab, UNC NLP group), advised by Prof. Mohit Bansal. While at UNC, I have spent summers at Bloomberg (2024), Google Research (2023), Microsoft Research (2022), and Adobe Research (2021). Prior to UNC, I did machine learning research at AI2, Naver, and SNU. My research has earned multiple spotlight/oral presentations (e.g., at NeurIPS, NAACL, and CVPR workshops) and has been recognized with the Bloomberg Data Science Ph.D. Fellowship and media coverage (e.g., MIT Technology Review, IEEE Spectrum, WIRED).

I have actively reviewed for conferences across natural language processing (ACL, EMNLP, NAACL, COLING, ARR), machine learning (NeurIPS, ICLR), and computer vision (CVPR, WACV), earning a Top Reviewer Award at NeurIPS 2024. I have also co-organized the “T4V: Transformers for Vision” workshops at CVPR 2023 and 2024.

I work on multimodal AI, making its reasoning more scalable and faithful across both understanding and generation tasks.

(1) Scalable Multimodal Frameworks – Modern AI models must meet a growing demand for thousands of capabilities. My research addresses this challenge by introducing: (a) Unified generative frameworks that flexibly accommodate diverse modalities and tasks with a single architecture and a single generative objective – VL-T5 (ICML 2021) / X-LXMERT (EMNLP 2020) / TVLT (NeurIPS 2022 Oral), and (b) Efficient finetuning frameworks that significantly reduce the parameter and memory requirements of creating task-specific models – VL-Adapter (CVPR 2022) / LST (NeurIPS 2022) / Ctrl-Adapter (2024).

(2) Faithful Multimodal Reasoning – Scaling alone is not enough. Large models that rely on black-box reasoning and encode all knowledge within their parameters often struggle with basic tasks and produce hallucinations. My research makes their reasoning more accurate and interpretable by introducing: (a) Planning-based frameworks that decompose complex visual generation problems into faithful, human-interpretable step-by-step reasoning processes – VPGen (NeurIPS 2023) / VideoDirectorGPT (COLM 2024) / DiagrammerGPT (COLM 2024), and (b) Retrieval-augmented generation frameworks that improve accuracy and factuality by retrieving relevant information before generating outputs – M3DocRAG (2024) / HiREST (CVPR 2023).

(3) Evaluation and Refinement of Multimodal Generation – With recent advances in multimodal generation models, conventional evaluation metrics have often become saturated and no longer provide meaningful insights into future research directions. To address this, my research introduces: (a) Fine-grained evaluation frameworks that comprehensively measure model skills along multiple dimensions to uncover detailed strengths and weaknesses – DALL-Eval (ICCV 2023) / VPEval (NeurIPS 2023) / DSG (ICLR 2024) / LayoutBench (CVPRW 2024 Oral) / FineCapEval (Findings of NAACL 2022) / M3DocVQA (2024), and (b) Automatic model refinement frameworks that use these evaluations to detect models’ weaknesses and refine their reasoning – EnvGen (COLM 2024) / DataEnvGym (2024) / SELMA (NeurIPS 2024) / VideoRepair (2024).

Education
  • Ph.D. in Computer Science, 2025 (expected)

    University of North Carolina at Chapel Hill

  • M.S. in Computer Science, 2022

    University of North Carolina at Chapel Hill

  • B.S. in Industrial Engineering, 2018

    Seoul National University

All Publications

See my Google Scholar for a more complete list.
DOCCI: Descriptions of Connected and Contrasting Images
High-quality, long, human-annotated descriptions of 15K images - ECCV 2024
Paxion: Patching Action Knowledge in Video-Language Foundation Models
Analyzing and patching action knowledge in video-language models - NeurIPS 2023 (Spotlight)
MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding
A question answering benchmark on real-world news articles for multi-media and multi-hop reasoning - AAAI 2022