Jaemin Cho
Publications
Bifrost-1: Bridging Multimodal LLMs and Diffusion Models with Patch-level CLIP Latents
A unified framework that bridges multimodal LLMs and diffusion models with patch-level CLIP latents.
Han Lin, Jaemin Cho, Amir Zadeh, Chuan Li, Mohit Bansal
Preprint
Video-Skill-CoT: Skill-based Chain-of-Thoughts for Domain-Adaptive Video Reasoning
A framework that automatically constructs and leverages skill-aware CoT supervision for domain-adaptive video reasoning.
Daeun Lee*, Jaehong Yoon*, Jaemin Cho, Mohit Bansal
Preprint
CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
A VLM benchmark that tests spatial reasoning by requiring models to count objects under occlusion.
Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Preprint
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
A plug-and-play framework that reuses any existing ControlNet for any video/image diffusion model.
Han Lin*, Jaemin Cho*, Abhay Zala, Mohit Bansal
Preprint
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
A testbed of teacher environments for data generation agents across diverse tasks.
Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Preprint
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
A multi-modal RAG framework and dataset for multi-page, multi-document understanding.
Jaemin Cho, Debanjan Mahata, Ozan İrsoy, Yujie He, Mohit Bansal
Preprint
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
SELMA improves T2I models by fine-tuning on automatically generated multi-skill image-text datasets, with skill-specific LoRA expert learning and merging.
Jialu Li*, Jaemin Cho*, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
NeurIPS 2024
DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning
Uses an LLM (GPT-4) to generate a ‘diagram plan’ with fine-grained layouts (objects, text labels, arrows, etc.) and renders it as either raster images (via diffusion) or vector graphics (via PowerPoint, Inkscape, or other tools).
Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal
COLM 2024
EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents
EnvGen uses LLMs to adaptively create training environments that help smaller embodied RL agents learn the skills they are weak at.
Abhay Zala*, Jaemin Cho*, Han Lin, Jaehong Yoon, Mohit Bansal
COLM 2024
VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning
Uses an LLM (GPT-4) to generate a ‘video plan’ for consistent multi-scene video generation.
Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
COLM 2024