Prompt engineering is a powerful tool for enhancing the performance of pre-trained models on downstream tasks. For example, the prompt “Let's think step by step” improved GPT-3's reasoning accuracy to 63% on MultiArith, while the prompt “a photo of” filled with a class name enables CLIP to achieve 80% zero-shot accuracy on ImageNet. While previous research has explored prompt learning for the visual modality, analysis of what constitutes a good visual prompt specifically for image recognition remains limited. In addition, existing visual prompt tuning methods generalize worse than text-only prompt tuning.
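As a concrete reference point, the sketch below illustrates CLIP-style zero-shot classification with the hand-crafted “a photo of a {class}” text prompt, using OpenAI's `clip` package. The class names and image path are placeholders, and the RN50 backbone is chosen only to mirror the CoOp setting referenced later; this is a minimal sketch, not part of LoGoPrompt itself.

```python
# Minimal sketch of CLIP zero-shot classification with hand-crafted text prompts.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("RN50", device=device)  # ResNet-50 backbone, as in CoOp

class_names = ["cat", "dog", "car"]  # placeholder class names
prompts = clip.tokenize([f"a photo of a {c}." for c in class_names]).to(device)
image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)  # placeholder image

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(prompts)
    # Cosine similarity between the image and each class prompt
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    logits = 100.0 * image_features @ text_features.T
    pred = logits.softmax(dim=-1).argmax(dim=-1).item()

print(f"Predicted class: {class_names[pred]}")
```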
This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To that end, we propose LoGoPrompt, which reformulates the classification objective as visual prompt selection and addresses the chicken-and-egg challenge of whether to first add synthetic text images as class-wise visual prompts or to first predict the class. Without any trainable visual prompt parameters, experimental results on 16 datasets demonstrate that our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.
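To make the notion of a class-wise visual prompt concrete, the following sketch renders a class name as a synthetic text image and pastes it onto an input image. The background color, default font, patch size, and top-left placement are illustrative assumptions for exposition, not the paper's exact recipe.

```python
# Hedged sketch: render a class name as a synthetic text image and use it as a
# class-wise visual prompt. Details (font, colors, placement) are assumptions.
from PIL import Image, ImageDraw, ImageFont

def make_text_image(class_name: str, size=(224, 224)) -> Image.Image:
    """Create a plain image containing only the rendered class name."""
    canvas = Image.new("RGB", size, color=(127, 127, 127))  # neutral grey background
    draw = ImageDraw.Draw(canvas)
    font = ImageFont.load_default()  # a real setup would pick a larger TrueType font
    draw.text((10, size[1] // 2), class_name, fill=(255, 255, 255), font=font)
    return canvas

def add_class_prompt(image: Image.Image, class_name: str) -> Image.Image:
    """Paste a small synthetic text image of the class name onto the input image."""
    patch = make_text_image(class_name, size=(80, 24))
    prompted = image.copy()
    prompted.paste(patch, (0, 0))  # top-left placement is an illustrative choice
    return prompted
```

Under this view, classifying an image amounts to asking which class name, rendered and added as a visual prompt, best matches the image, which is the visual prompt selection objective described in the overview below.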
Overview of LoGoPrompt, which (a) generates class-wise visual prompts as synthetic images containing text class names and (b) reformulates the classification objective as visual prompt selection, addressing the chicken-and-egg challenge via (c) the proposed min-max contrastive learning.
The superior performance of LoGoPrompt on both base and new classes shows its strong generalizability.
LoGoPrompt consistently outperforms the compared methods on all 11 datasets. Following CoOp, CLIP's ResNet-50 is used as the vision backbone.