Jaemin Cho

PhD candidate @ UNC Chapel Hill

🚨 I’m on the 2024-2025 academic job market!

I am a PhD candidate in Computer Science at UNC Chapel Hill (MURGe-Lab, UNC NLP group), advised by Prof. Mohit Bansal. During my PhD, I have spent summers at Bloomberg (2024), Google Research (2023), Microsoft Research (2022), and Adobe Research (2021). Prior to UNC, I did machine learning research at AI2, Naver, and SNU. My research has earned multiple spotlight/oral awards (e.g., at NeurIPS, NAACL, and a CVPR workshop) and has been recognized with the Bloomberg Data Science Ph.D. Fellowship and media coverage (e.g., MIT Technology Review, IEEE Spectrum, WIRED). I have actively reviewed for conferences across natural language processing (ACL, EMNLP, NAACL, COLING, ARR), machine learning (NeurIPS, ICLR), and computer vision (CVPR, WACV), earning a Top Reviewer Award at NeurIPS 2024. I have also co-organized the “T4V: Transformers for Vision” workshops at CVPR 2023 and 2024.

I work on multimodal AI, making models’ reasoning more scalable and faithful in both understanding and generation tasks.

(1) Scalable Multimodal Frameworks – Modern AI models must meet the growing demand for thousands of capabilities. My research addresses this challenge by introducing: (a) Unified generative frameworks that flexibly accommodate diverse modalities and tasks using a single architecture and a single generative objective – VL-T5 (ICML 2021) / X-LXMERT (EMNLP 2020) / TVLT (NeurIPS 2022 Oral); and (b) Efficient finetuning frameworks that significantly reduce the parameter and memory requirements of creating task-specific models – VL-Adapter (CVPR 2022) / LST (NeurIPS 2022) / Ctrl-Adapter (2024).

(2) Faithful Multimodal Reasoning – Scaling alone is not enough. Large models that rely on black-box reasoning and encode all knowledge within their parameters often struggle with basic tasks and produce hallucinations. My research makes their reasoning process more accurate and interpretable by introducing: (a) Planning-based frameworks that decompose complex visual generation problems into faithful, human-interpretable step-by-step reasoning processes – VPGen (NeurIPS 2023) / VideoDirectorGPT (COLM 2024) / DiagrammerGPT (COLM 2024); and (b) Retrieval-augmented generation (RAG) frameworks that improve accuracy and factuality by retrieving relevant information before generating outputs – M3DocRAG (2024) / HiREST (CVPR 2023).

(3) Evaluation and Refinement of Multimodal Generation – With recent advances in multimodal generation models, conventional evaluation metrics have often become saturated and no longer provide meaningful insights into future research directions. To address this, my research introduces: (a) Fine-grained evaluation frameworks that comprehensively measure model skills along multiple dimensions to uncover detailed strengths and weaknesses – DALL-Eval (ICCV 2023) / VPEval (NeurIPS 2023) / DSG (ICLR 2024) / LayoutBench (CVPRW 2024 Oral) / FineCapEval (Findings of NAACL 2022) / M3DocVQA (2024); and (b) Automatic model refinement frameworks that use these evaluations to detect models’ weaknesses and refine their reasoning process – EnvGen (COLM 2024) / DataEnvGym (2024) / SELMA (NeurIPS 2024) / VideoRepair (2024).

Education
  • Ph.D. in Computer Science, 2025 (expected)

    University of North Carolina at Chapel Hill

  • M.S. in Computer Science, 2022

    University of North Carolina at Chapel Hill

  • B.S. in Industrial Engineering, 2018

    Seoul National University

Publications

M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

arXiv preprint, 2024

Preprint Cite Project

DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

arXiv preprint, 2024

Preprint Cite Code Project

EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

In COLM, 2024

Preprint Cite Code Project

An Assessment of Reported Biases and Harms of Large Language Models

In ICA (Top Paper Award), 2024

PDF Cite

DOCCI: Descriptions of Connected and Contrasting Images

In ECCV, 2024

Preprint Cite Dataset Project

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

In CVPR Workshop (Oral), 2024

Preprint Cite Code Dataset Project

Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation

In ICLR, 2024

Preprint Cite Code Project

Paxion: Patching Action Knowledge in Video-Language Foundation Models

In NeurIPS, 2023

Preprint Cite Code

Hierarchical Video-Moment Retrieval and Step-Captioning

In CVPR, 2023

Preprint Cite Code Project

TVLT: Textless Vision-Language Transformer

In NeurIPS, 2022

Preprint Cite Code

Fine-grained Image Captioning with CLIP Reward

In Findings of NAACL, 2022

Preprint Cite Code

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

In AAAI, 2022

Preprint Cite

Unifying Vision-and-Language Tasks via Text Generation

In ICML, 2021

Preprint Cite Code

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

In EMNLP, 2020

Preprint Cite Code Project

Mixture Content Selection for Diverse Sequence Generation

In EMNLP, 2019

Preprint Cite Code