Jaemin Cho

Young Investigator @ AI2
Incoming Assistant Professor @ JHU

I am joining the Computer Science Department at Johns Hopkins University as an Assistant Professor in Fall 2026. Until then, I am spending a gap year at AI2 (Fall 2025 to Fall 2026). I will be recruiting Ph.D. students for Fall 2026; please see this page for more details. I plan to attend ICCV 2025 (in Hawaii) and NeurIPS 2025 (in San Diego). Please feel free to reach out if you would like to chat in person!

My research focuses on multimodal AI, integrating diverse data types (e.g., images, videos, text, audio, and motion) to develop models that are interpretable, controllable, and scalable. Some of my recent interests include learning embodied knowledge (e.g., robotics, human actions) from unlabeled visual demonstrations, and developing human-in-the-loop frameworks and software that enhance productivity in various applications. Below are my research interests from my Ph.D. years.

(1) Scalable Multimodal Frameworks – Modern AI models must meet a growing demand for thousands of capabilities. My research has addressed this challenge by introducing: (a) Unified generative frameworks that flexibly accommodate diverse modalities and tasks using a single architecture and a single generative objective – VL-T5 (ICML 2021) / X-LXMERT (EMNLP 2020) / TVLT (NeurIPS 2022 Oral); and (b) Efficient finetuning frameworks that significantly reduce the parameter and memory requirements for creating task-specific models – VL-Adapter (CVPR 2022) / LST (NeurIPS 2022) / Ctrl-Adapter (ICLR 2025 Oral).

(2) Faithful Multimodal Reasoning – Scaling alone is not enough. Large models that rely on black-box reasoning and encode all knowledge within their parameters often struggle with basic tasks and produce hallucinations. My research makes their reasoning process more accurate and interpretable by introducing: (a) Planning-based frameworks that decompose complex visual generation problems into faithful, human-interpretable step-by-step reasoning processes – VPGen (NeurIPS 2023) / VideoDirectorGPT (COLM 2024) / DiagrammerGPT (COLM 2024) / Video-MSG; and (b) Retrieval-augmented generation (RAG) frameworks that enhance accuracy and factuality by retrieving relevant information before generating outputs – M3DocRAG (2024) / HiREST (CVPR 2023).

(3) Evaluation and Refinement of Multimodal Generation – With recent advances in multimodal generation models, conventional evaluation metrics have often become saturated and no longer provide meaningful insight into future research directions. To address this, my research introduces: (a) Fine-grained evaluation frameworks that comprehensively measure model skills along multiple dimensions to uncover detailed strengths and weaknesses – DALL-Eval (ICCV 2023) / VPEval (NeurIPS 2023) / DSG (ICLR 2024) / LayoutBench (CVPRW 2024 Oral) / FineCapEval (Findings of NAACL 2022) / M3DocVQA (2024) / CAPTURe (ICCV 2025); and (b) Automatic model refinement frameworks that use these evaluations to detect models' weaknesses and refine their reasoning processes – EnvGen (COLM 2024) / DataEnvGym (ICLR 2025 Spotlight) / SELMA (NeurIPS 2024) / VideoRepair (2024).

I’m also sharing some of my past application materials below. I know these are far from perfect, but I hope they help you with your applications!

Education
  • Ph.D. in Computer Science, 2025

    University of North Carolina at Chapel Hill

  • B.S. in Industrial Engineering, 2018

    Seoul National University

Publications

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

arXiv preprint, 2025

Preprint Cite Code

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

arXiv preprint, 2025

Preprint Cite Code Project

A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

arXiv preprint, 2025

Preprint Cite

Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

arXiv preprint, 2025

Preprint Cite Dataset Project

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

arXiv preprint, 2025

Preprint Cite Code Project

DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

In ICLR (Spotlight), 2025

Preprint Cite Code Project

EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

In COLM, 2024

Preprint Cite Code Project

DOCCI: Descriptions of Connected and Contrasting Images

In ECCV, 2024

Preprint Cite Dataset Project

An Assessment of Reported Biases and Harms of Large Language Models

In ICA (Top Paper Award), 2024

PDF Cite

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

In CVPR Workshop (Oral), 2024

Preprint Cite Code Dataset Project

Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation

In ICLR, 2024

Preprint Cite Code Project

Paxion: Patching Action Knowledge in Video-Language Foundation Models

In NeurIPS, 2023

Preprint Cite Code

Hierarchical Video-Moment Retrieval and Step-Captioning

In CVPR, 2023

Preprint Cite Code Project

TVLT: Textless Vision-Language Transformer

In NeurIPS (Oral), 2022

Preprint Cite Code

Fine-grained Image Captioning with CLIP Reward

In Findings of NAACL, 2022

Preprint Cite Code

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

In AAAI, 2022

Preprint Cite

Unifying Vision-and-Language Tasks via Text Generation

In ICML, 2021

Preprint Cite Code

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

In EMNLP, 2020

Preprint Cite Code Project

Mixture Content Selection for Diverse Sequence Generation

In EMNLP, 2019

Preprint Cite Code