Jaemin Cho
Publications
CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting
A VLM benchmark that tests spatial reasoning by having models count objects under occlusion.
Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Preprint
Cite
Code
Ctrl-Adapter: An Efficient and Versatile Framework for Adapting Diverse Controls to Any Diffusion Model
A plug-and-play framework that reuses any existing ControlNet for any video/image diffusion model
Han Lin*, Jaemin Cho*, Abhay Zala, Mohit Bansal
Preprint
Cite
Code
Project
DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback
A new testbed of teacher environments for data generation agents for diverse tasks.
Zaid Khan, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal
Preprint
Cite
Code
Project
M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding
Multi-modal RAG framework and dataset for multi-page multi-document understanding
Jaemin Cho, Debanjan Mahata, Ozan İrsoy, Yujie He, Mohit Bansal
Preprint
Cite
Code
Project
SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data
SELMA improves T2I models by fine-tuning on automatically generated multi-skill image-text datasets, with skill-specific LoRA expert learning and merging.
NeurIPS 2024
Jialu Li*, Jaemin Cho*, Yi-Lin Sung, Jaehong Yoon, Mohit Bansal
Preprint
Cite
Code
Project
DiagrammerGPT: Generating Open-Domain, Open-Platform Diagrams via LLM Planning
Uses an LLM (GPT-4) to generate a ‘diagram plan’ with fine-grained layouts (objects, text labels, arrows, etc.) and renders it as either raster images (via diffusion) or vector graphics (via PowerPoint, Inkscape, or other tools).
COLM 2024
Abhay Zala, Han Lin, Jaemin Cho, Mohit Bansal
Preprint
Cite
Code
Project
EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents
EnvGen is a novel framework that uses LLMs to adaptively create training environments that help smaller embodied RL agents learn the skills they are weak at.
COLM 2024
Abhay Zala*, Jaemin Cho*, Han Lin, Jaehong Yoon, Mohit Bansal
Preprint
Cite
Code
Project
VideoDirectorGPT: Consistent Multi-Scene Video Generation via LLM-Guided Planning
Uses an LLM (GPT-4) to generate a ‘video plan’ for consistent multi-scene video generation.
COLM 2024
Han Lin, Abhay Zala, Jaemin Cho, Mohit Bansal
Preprint
Cite
Code
Project
DOCCI: Descriptions of Connected and Contrasting Images
High-quality, long, human-annotated descriptions of 15K images.
ECCV 2024
Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, Jason Baldridge
Preprint
Cite
Dataset
Project
Contrastive Region Guidance: Improving Grounding in Vision-Language Models without Training
CRG is a training-free method that helps VLMs understand visual prompts by contrasting model outputs with and without the visual prompts.
ECCV 2024
David Wan, Jaemin Cho, Elias Stengel-Eskin, Mohit Bansal
Preprint
Cite
Code
Project