Jaemin Cho

Young Investigator @ AI2
Incoming Assistant Professor @ JHU

I am joining the Computer Science Department at Johns Hopkins University as an Assistant Professor in Fall 2026. Until then, I am spending a gap year at AI2 (Fall 2025 to Fall 2026). I will be recruiting Ph.D. students for Fall 2026; please see this page for more details. I plan to attend ICCV 2025 (in Hawaii) and NeurIPS 2025 (in San Diego). Please feel free to reach out if you would like to chat in person!

My research focuses on multimodal AI, integrating diverse data types (e.g., images, videos, text, audio, and motion) to develop models that are interpretable, controllable, and scalable. Some of my recent interests include learning embodied knowledge (e.g., robotics, human actions) from unlabeled visual demonstrations, and developing human-in-the-loop frameworks and software that enhance productivity in various applications. Below are my research interests from my Ph.D. years.

(1) Scalable Multimodal Frameworks – Modern AI models must meet a growing demand for thousands of capabilities. My research has addressed this challenge by introducing: (a) Unified generative frameworks that flexibly accommodate diverse modalities and tasks using a single architecture and a single generative objective – VL-T5 (ICML 2021) / X-LXMERT (EMNLP 2020) / TVLT (NeurIPS 2022 Oral); and (b) Efficient finetuning frameworks that significantly reduce the parameter and memory requirements for creating task-specific models – VL-Adapter (CVPR 2022) / LST (NeurIPS 2022) / Ctrl-Adapter (ICLR 2025 Oral).

(2) Faithful Multimodal Reasoning – Scaling alone is not enough. Large models that rely on black-box reasoning and encode all knowledge within their parameters often struggle with basic tasks and produce hallucinations. My research makes their reasoning process more accurate and interpretable by introducing: (a) Planning-based frameworks that decompose complex visual generation problems into faithful, human-interpretable step-by-step reasoning processes – VPGen (NeurIPS 2023) / VideoDirectorGPT (COLM 2024) / DiagrammerGPT (COLM 2024) / Video-MSG; and (b) Retrieval-augmented generation (RAG) frameworks that enhance accuracy and factuality by retrieving relevant information before generating outputs – M3DocRAG (2024) / HiREST (CVPR 2023).

(3) Evaluation and Refinement of Multimodal Generation – With recent advances in multimodal generation models, conventional evaluation metrics have often become saturated and no longer provide meaningful insight into future research directions. To address this, my research introduces: (a) Fine-grained evaluation frameworks that comprehensively measure model skills along multiple dimensions to uncover detailed strengths and weaknesses – DALL-Eval (ICCV 2023) / VPEval (NeurIPS 2023) / DSG (ICLR 2024) / LayoutBench (CVPRW 2024 Oral) / FineCapEval (Findings of NAACL 2022) / M3DocVQA (2024) / CAPTURe (ICCV 2025); and (b) Automatic model refinement frameworks that use these evaluations to detect models' weaknesses and refine their reasoning processes – EnvGen (COLM 2024) / DataEnvGym (ICLR 2025 Spotlight) / SELMA (NeurIPS 2024) / VideoRepair (2024).

I’m also sharing some of my past application materials below. I know these are far from perfect, but I hope they help you with your applications!

Education
  • Ph.D. in Computer Science, 2025

    University of North Carolina at Chapel Hill

  • B.S. in Industrial Engineering, 2018

    Seoul National University

Publications

RotBench: Evaluating Multimodal Large Language Models on Identifying Image Rotation

arXiv preprint, 2025

Preprint Cite Code

Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents

arXiv preprint, 2025

Preprint Cite Code Project

A Survey on Long-Video Storytelling Generation: Architectures, Consistency, and Cinematic Quality

arXiv preprint, 2025

Preprint Cite

Executable Functional Abstractions: Inferring Generative Programs for Advanced Math Problems

arXiv preprint, 2025

Preprint Cite Dataset Project

Training-free Guidance in Text-to-Video Generation via Multimodal Planning and Structured Noise Initialization

arXiv preprint, 2025

Preprint Cite Code Project

DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback

In ICLR (Spotlight), 2025

Preprint Cite Code Project

EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents

In COLM, 2024

Preprint Cite Code Project

DOCCI: Descriptions of Connected and Contrasting Images

In ECCV, 2024

Preprint Cite Dataset Project

An Assessment of Reported Biases and Harms of Large Language Models

In ICA (Top Paper Award), 2024

PDF Cite

Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation

In CVPR Workshop (Oral), 2024

Preprint Cite Code Dataset Project

Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation

In ICLR, 2024

Preprint Cite Code Project

Paxion: Patching Action Knowledge in Video-Language Foundation Models

In NeurIPS, 2023

Preprint Cite Code

Hierarchical Video-Moment Retrieval and Step-Captioning

In CVPR, 2023

Preprint Cite Code Project

TVLT: Textless Vision-Language Transformer

In NeurIPS (Oral), 2022

Preprint Cite Code

Fine-grained Image Captioning with CLIP Reward

In Findings of NAACL, 2022

Preprint Cite Code

MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding

In AAAI, 2022

Preprint Cite

Unifying Vision-and-Language Tasks via Text Generation

In ICML, 2021

Preprint Cite Code

X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers

In EMNLP, 2020

Preprint Cite Code Project

Mixture Content Selection for Diverse Sequence Generation

In EMNLP, 2019

Preprint Cite Code