Fine-grained Image Captioning with CLIP Reward

Abstract

Generating a detailed textual description of an image that distinguishes it from others is crucial for many use cases, such as image search engines and accessibility for the visually impaired. Modern image captioning models are trained with text-similarity objectives. However, since reference captions in public datasets often describe only the most salient common objects, models trained with these objectives tend to ignore the specific and detailed aspects of an image that distinguish it from others. Toward more descriptive and distinctive caption generation, we propose to use CLIP, a multimodal encoder trained on a huge number of image-text pairs from the web, to calculate multimodal similarity and use it as a reward function. We also propose a simple CLIP finetuning strategy that improves grammar without requiring extra text annotation. To comprehensively evaluate descriptive captions, we introduce FineCapEval, a new dataset for caption evaluation with fine-grained criteria: overall, background, object, and relations. In our experiments on text-to-image retrieval and FineCapEval, the proposed CLIP-guided model generates more distinctive captions than a CIDEr-optimized model. We also show the effectiveness of our grammar finetuning strategy. Lastly, we present a human analysis in which annotators strongly prefer the CLIP reward over CIDEr and MLE objectives across diverse criteria.
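At its core, the CLIP reward mentioned above is the image-text similarity score produced by CLIP's two encoders. A minimal sketch of such a reward is the cosine similarity between unit-normalized image and caption embeddings; the `clip_reward` helper below is hypothetical and omits the actual CLIP encoders and any reward shaping used in the paper:

```python
import numpy as np

def clip_reward(image_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """Cosine similarity between an image embedding and a caption
    embedding, used as a scalar reward for the generated caption.

    In practice, image_emb and text_emb would come from CLIP's image
    and text encoders; here they are plain vectors for illustration.
    """
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_emb = text_emb / np.linalg.norm(text_emb)
    # Dot product of unit vectors = cosine similarity in [-1, 1].
    return float(image_emb @ text_emb)

# Illustration: a caption embedding closer to the image embedding
# receives a higher reward, pushing the policy toward captions
# that are distinctive for this particular image.
image = np.array([1.0, 0.0, 0.0])
good_caption = np.array([0.9, 0.1, 0.0])
bad_caption = np.array([0.0, 1.0, 0.0])
print(clip_reward(image, good_caption) > clip_reward(image, bad_caption))
```

During reinforcement-learning finetuning, this scalar would replace a text-similarity reward such as CIDEr in a policy-gradient update.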