Publications

(2024). M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding.

(2024). SELMA: Learning and Merging Skill-Specific Text-to-Image Experts with Auto-Generated Data. In NeurIPS.

(2024). DataEnvGym: Data Generation Agents in Teacher Environments with Student Feedback.

(2024). EnvGen: Generating and Adapting Environments via LLMs for Training Embodied Agents. In COLM.

(2024). An Assessment of Reported Biases and Harms of Large Language Models. In ICA 2024 (Top Paper Award).

(2024). DOCCI: Descriptions of Connected and Contrasting Images. In ECCV.

(2024). Diagnostic Benchmark and Iterative Inpainting for Layout-Guided Image Generation. In CVPR Workshop (Oral).

(2023). Davidsonian Scene Graph: Improving Reliability in Fine-Grained Evaluation for Text-to-Image Generation. In ICLR.

(2023). Self-Chained Image-Language Model for Video Localization and Question Answering. In NeurIPS.

(2023). Paxion: Patching Action Knowledge in Video-Language Foundation Models. In NeurIPS.

(2023). Hierarchical Video-Moment Retrieval and Step-Captioning. In CVPR.

(2022). TVLT: Textless Vision-Language Transformer. In NeurIPS.

(2022). Fine-grained Image Captioning with CLIP Reward. In Findings of NAACL.

(2021). MuMuQA: Multimedia Multi-Hop News Question Answering via Cross-Media Knowledge Extraction and Grounding. In AAAI.

(2021). Unifying Vision-and-Language Tasks via Text Generation. In ICML.

(2020). X-LXMERT: Paint, Caption and Answer Questions with Multi-Modal Transformers. In EMNLP.

(2019). Mixture Content Selection for Diverse Sequence Generation. In EMNLP.
