I am joining the Computer Science Department at Johns Hopkins University as an Assistant Professor in Fall 2026. Until then, I am spending a gap year at AI2.
I plan to attend ICML 2026 (in Seoul, South Korea 🇰🇷). Please feel free to reach out if you would like to chat in person!
My research focuses on multimodal AI, integrating diverse data types (e.g., images, videos, text, audio, and motion) to develop models that are interpretable, controllable, and scalable. My recent research interests include:
(1) Scalable Multimodal Frameworks – Modern AI models must support an ever-growing range of capabilities. My research addresses this challenge by introducing: (a) Unified generative frameworks that flexibly accommodate diverse modalities and tasks with a single architecture and a single generative objective – VL-T5 (ICML 2021) / X-LXMERT (EMNLP 2020) / TVLT (NeurIPS 2022 Oral) and (b) Efficient finetuning frameworks that significantly reduce the parameter and memory requirements of creating task-specific models – VL-Adapter (CVPR 2022) / LST (NeurIPS 2022) / Ctrl-Adapter (ICLR 2025 Oral)
(2) Faithful Multimodal Reasoning – Scaling alone is not enough. Large models that rely on black-box reasoning and encode all knowledge within their parameters often struggle with basic tasks and produce hallucinations. My research makes their reasoning processes more accurate and interpretable by introducing: (a) Planning-based frameworks that decompose complex visual generation problems into faithful, human-interpretable step-by-step reasoning processes – VPGen (NeurIPS 2023) / VideoDirectorGPT (COLM 2024) / DiagrammerGPT (COLM 2024) / Video-MSG (2024) and (b) Retrieval-augmented generation (RAG) frameworks that improve accuracy and factuality by retrieving relevant information before generating outputs – M3DocRAG (Findings of ICCV 2025) / HiREST (CVPR 2023)
(3) Evaluation and Refinement of Multimodal Generation – With recent advances in multimodal generation models, conventional evaluation metrics have often become saturated and no longer provide meaningful insights into future research directions. To this end, my research introduces: (a) Fine-grained evaluation frameworks that comprehensively measure model skills along multiple dimensions to uncover detailed strengths and weaknesses – DALL-Eval (ICCV 2023) / VPEval (NeurIPS 2023) / DSG (ICLR 2024) / LayoutBench (CVPRW 2024 Oral) / FineCapEval (Findings of NAACL 2022) / M3DocVQA (Findings of ICCV 2025) / CAPTURe (ICCV 2025) and (b) Automatic model refinement frameworks that use these evaluations to detect models’ weaknesses and refine their reasoning processes – EnvGen (COLM 2024) / DataEnvGym (ICLR 2025 Spotlight) / SELMA (NeurIPS 2024) / VideoRepair (Findings of ACL 2026)
I know these materials are far from perfect, but I hope they help you with your applications!
Ph.D. in Computer Science, 2025
University of North Carolina at Chapel Hill
B.S. in Industrial Engineering, 2018
Seoul National University
Apr 2026 - New preprint:
Apr 2026 - 1 paper accepted at Findings of ACL 2026:
Apr 2026 - 1 paper accepted at CVPR 2026 MUSI Workshop:
Mar 2026 - New preprints:
Feb 2026 - New preprint:
Jan 2026 - 1 paper accepted at ICLR 2026:
Jan 2026 - 1 paper accepted at EACL 2026:
Dec 2025 - Panel discussion at DCVLR: Data Curation for Vision Language Reasoning NeurIPS 2025 Workshop
Sep 2025 - 1 paper accepted at NeurIPS 2025:
Sep 2025 - Starting gap year at AI2 PRIOR team as a Young Investigator!
Feb 2025 - 2 papers accepted at ICLR 2025: