I am joining the Computer Science Department at Johns Hopkins University as an Assistant Professor in Fall 2026. Until then, I am spending a gap year at AI2 (Fall 2025 - Fall 2026). I will be recruiting Ph.D. students for Fall 2026; please see this page for more details. I plan to attend ICCV 2025 (in Hawaii) and NeurIPS 2025 (in San Diego). Please feel free to reach out if you would like to chat in person!
My research focuses on multimodal AI, integrating diverse data types (e.g., images, videos, text, audio, and motion) to develop models that are interpretable, controllable, and scalable. Some of my recent interests include learning embodied knowledge (e.g., robotics, human actions) from unlabeled visual demonstrations, and developing human-in-the-loop frameworks and software that enhance productivity in various applications. Below are my research interests from my Ph.D. years.
(1) Scalable Multimodal Frameworks – Modern AI models must meet a growing demand for thousands of capabilities. My research has addressed this challenge by introducing: (a) Unified generative frameworks that flexibly accommodate diverse modalities and tasks using a single architecture and a generative objective – VL-T5 (ICML 2021) / X-LXMERT (EMNLP 2020) / TVLT (NeurIPS 2022 Oral), and (b) Efficient finetuning frameworks that significantly reduce the parameter and memory requirements of creating task-specific models – VL-Adapter (CVPR 2022) / LST (NeurIPS 2022) / Ctrl-Adapter (ICLR 2025 Oral)
(2) Faithful Multimodal Reasoning – Scaling alone is not enough. Large models that rely on black-box reasoning and encode all knowledge within their parameters often struggle with basic tasks and produce hallucinations. My research makes their reasoning process more accurate and interpretable by introducing: (a) Planning-based frameworks that decompose complex visual generation problems into faithful, human-interpretable step-by-step reasoning processes – VPGen (NeurIPS 2023) / VideoDirectorGPT (COLM 2024) / DiagrammerGPT (COLM 2024) / Video-MSG, and (b) Retrieval-augmented generation (RAG) frameworks that enhance accuracy and factuality by retrieving relevant information before generating outputs – M3DocRAG (2024) / HiREST (CVPR 2023)
(3) Evaluation and Refinement of Multimodal Generation – With recent advances in multimodal generation models, conventional evaluation metrics have often become saturated and no longer provide meaningful insights into future research directions. To this end, my research introduces: (a) Fine-grained evaluation frameworks that comprehensively measure model skills along multiple dimensions to uncover detailed strengths and weaknesses – DALL-Eval (ICCV 2023) / VPEval (NeurIPS 2023) / DSG (ICLR 2024) / LayoutBench (CVPRW 2024 Oral) / FineCapEval (Findings of NAACL 2022) / M3DocVQA (2024) / CAPTURe (ICCV 2025), and (b) Automatic model refinement frameworks that use these evaluations to detect models’ weaknesses and refine their reasoning process – EnvGen (COLM 2024) / DataEnvGym (ICLR 2025 Spotlight) / SELMA (NeurIPS 2024) / VideoRepair (2024)
I’m also sharing some of my past application materials below. I know they are far from perfect, but I hope they help you with your applications!
Sep 2025 - 1 paper accepted at NeurIPS 2025:
Sep 2025 - I’m starting my gap year at the AI2 PRIOR team as a Young Investigator!
Aug 2025 - 1 paper accepted at Findings of EMNLP 2025:
Aug 2025 - New preprints:
Jul 2025 - New preprints:
Jun 2025 - 1 paper accepted at ICCV 2025:
Jun 2025 - New preprints:
May 2025 - New preprints:
Apr 2025 - New preprints:
Feb 2025 - 2 papers accepted at ICLR 2025:
Ph.D. in Computer Science, 2025
University of North Carolina at Chapel Hill
B.S. in Industrial Engineering, 2018
Seoul National University