LoGoPrompt

Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models

ICCV 2023


Cheng Shi, Sibei Yang*

*Denotes Corresponding Author


Paper | Video (Coming Soon) | Code (Coming Soon)

Our interesting findings



Abstract

Prompt engineering is a powerful tool for enhancing the performance of pre-trained models on downstream tasks. For example, the prompt "Let's think step by step" improved GPT-3's reasoning accuracy on MultiArith to 63%, while the prompt "a photo of" filled with a class name enables CLIP to achieve 80% zero-shot accuracy on ImageNet. While previous research has explored prompt learning for the visual modality, analysis of what constitutes a good visual prompt specifically for image recognition remains limited. In addition, existing visual prompt tuning methods generalize worse than text-only prompt tuning.
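To make the text-prompting side concrete, below is a minimal sketch of CLIP zero-shot classification where "a photo of a" is filled with each class name. The checkpoint name, example classes, and image path are illustrative assumptions, not part of the paper.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Any CLIP checkpoint works; this one is just a common public choice.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

classes = ["dog", "cat", "car"]                      # illustrative class names
texts = [f"a photo of a {c}" for c in classes]       # prompt filled with class name
image = Image.open("example.jpg")                    # hypothetical input image

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-to-text similarities, softmaxed over the candidate classes.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(classes, probs[0].tolist())))
```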

This paper explores our key insight: synthetic text images are good visual prompts for vision-language models! To exploit it, we propose LoGoPrompt, which reformulates the classification objective as visual prompt selection, thereby resolving the chicken-and-egg challenge of whether to first add the class-wise synthetic text image as a visual prompt or to first predict the class. Without any trainable visual prompt parameters, experimental results on 16 datasets demonstrate that our method consistently outperforms state-of-the-art methods in few-shot learning, base-to-new generalization, and domain generalization.
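To picture what a "synthetic text image" is, the sketch below renders a class name onto a plain canvas with Pillow. The font, canvas size, colors, and placement are illustrative choices, not the paper's exact recipe.

```python
from PIL import Image, ImageDraw, ImageFont

def make_text_image(class_name, size=(224, 224), font_size=48):
    """Render a class name onto a plain background: a synthetic text image."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    try:
        # Assumed font file; fall back to Pillow's built-in font if absent.
        font = ImageFont.truetype("DejaVuSans.ttf", font_size)
    except OSError:
        font = ImageFont.load_default()
    # textbbox returns (left, top, right, bottom); use it to center the text.
    left, top, right, bottom = draw.textbbox((0, 0), class_name, font=font)
    x = (size[0] - (right - left)) // 2
    y = (size[1] - (bottom - top)) // 2
    draw.text((x, y), class_name, fill="black", font=font)
    return img

# Usage: one visual prompt per candidate class.
prompts = {c: make_text_image(c) for c in ["dog", "cat", "car"]}
```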

Method

Overview of LoGoPrompt, which (a) generates class-wise visual prompts as synthetic images of text class names and (b) reformulates the classification objective as visual prompt selection, addressing the chicken-and-egg challenge via (c) the proposed min-max contrastive learning.
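Under the selection view, inference can be pictured as scoring each class by attaching its synthetic text image to the input and asking CLIP which prompted image fits best. The sketch below is an assumption-laden illustration, not the paper's method: the alpha-blending scheme and prompt template are mine, and it reuses model, processor, and make_text_image from the sketches above.

```python
import torch
from PIL import Image

def classify_by_prompt_selection(model, processor, image, classes, alpha=0.5):
    # Score each candidate class by blending its synthetic text image into the
    # input, then measuring CLIP similarity against that class's text prompt.
    # NOTE: fixed-alpha blending is an illustrative stand-in for the paper's
    # prompt-composition scheme.
    scores = []
    for c in classes:
        prompted = Image.blend(image.convert("RGB").resize((224, 224)),
                               make_text_image(c), alpha)
        inputs = processor(text=[f"a photo of a {c}"], images=prompted,
                           return_tensors="pt", padding=True)
        with torch.no_grad():
            out = model(**inputs)
        scores.append(out.logits_per_image[0, 0])
    # Predict the class whose visual prompt matches the image best.
    return classes[int(torch.stack(scores).argmax())]
```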



Performance

Base-to-new Generalization

The superior performance of LoGoPrompt on both base and new classes shows its strong generalizability.

Few-shot classification

LoGoPrompt consistently outperforms the compared methods on all 11 datasets. Following CoOp, CLIP's ResNet-50 is used as the vision backbone.






Bibtex
@InProceedings{Shi_2023_ICCV,
    author    = {Shi, Cheng and Yang, Sibei},
    title     = {LoGoPrompt: Synthetic Text Images Can Be Good Visual Prompts for Vision-Language Models},
    booktitle = {Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)},
    month     = {October},
    year      = {2023}
}