Prompt-Driven Contrastive Learning for
Transferable Adversarial Attacks

ECCV 2024 (Oral)

KAIST, ADD

In a joint vision-language space, a single text can encapsulate the core semantics shared by numerous images from diverse domains. From the adversary's perspective, two clear challenges arise: (a) generating effective prompt-driven feature guidance, and (b) identifying robust prompts that maximize attack effectiveness.

PDCL-Attack enhances transferable adversarial attacks via CLIP guidance and prompt learning.



Abstract

Recent vision-language foundation models, such as CLIP, have demonstrated superior capabilities in learning representations that transfer across a diverse range of downstream tasks and domains. With the emergence of such powerful models, it has become crucial to effectively leverage their capabilities for tackling challenging vision tasks. On the other hand, only a few works have focused on devising adversarial examples that transfer well to both unknown domains and model architectures. In this paper, we propose a novel transfer attack method called PDCL-Attack, which leverages the CLIP model to enhance the transferability of adversarial perturbations generated by a generative model-based attack framework. Specifically, we formulate effective prompt-driven feature guidance by harnessing the semantic representation power of text, particularly from the ground-truth class labels of input images. To the best of our knowledge, we are the first to introduce prompt learning to enhance transferable generative attacks. Extensive experiments conducted across various cross-domain and cross-model settings empirically validate our approach, demonstrating its superiority over state-of-the-art methods.



Method


Overview of PDCL-Attack.  For effective transfer attacks leveraging CLIP, our pipeline consists of three sequential phases: Phases 1 and 2 form the training stage, and Phase 3 is the inference stage. The goal of Phase 1 is to pre-train the Prompter, optimizing its context words to yield generalizable text features for Phase 2. In Phase 1, only the learnable context word vectors are updated, while the weights of the CLIP image and text encoders remain frozen. In Phase 2, we train a perturbation generator that crafts adversarial perturbations driving the surrogate model to mispredict the perturbed inputs. In Phase 3, we employ the trained generator to produce transferable adversarial examples against unknown domains and victim models.
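To make the Phase 2 guidance concrete, below is a simplified, non-contrastive sketch (not the released code) of prompt-driven feature guidance: the CLIP image feature of an adversarial image is pushed away from the text feature of its ground-truth class prompt. The function name, the hand-crafted prompt template, and the assumption that inputs are already CLIP-preprocessed are all illustrative; the actual method uses learned context words and a contrastive formulation.

import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP package

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, _ = clip.load("ViT-B/32", device=device)  # both encoders stay frozen
clip_model.eval()

def prompt_guidance_loss(adv_images, labels, class_names):
    # adv_images: CLIP-preprocessed adversarial images crafted by the generator.
    # labels: ground-truth class indices; class_names: list of class-name strings.
    prompts = clip.tokenize(
        [f"a photo of a {class_names[int(y)]}" for y in labels]
    ).to(device)
    with torch.no_grad():
        text_feat = F.normalize(clip_model.encode_text(prompts), dim=-1)
    img_feat = F.normalize(clip_model.encode_image(adv_images), dim=-1)
    # Cosine similarity to the ground-truth class text feature; the generator
    # is trained to minimize this, pushing adversarial images away from their
    # true class in the joint vision-language space.
    return (img_feat * text_feat).sum(dim=-1).mean()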



Experimental Results


Cross-domain evaluation results.  The perturbation generator is trained on ImageNet-1K with VGG-16 as the surrogate model and evaluated on black-box domains with unseen models. We report the top-1 classification accuracy after attack.



Cross-model evaluation results.  The perturbation generator is trained on ImageNet-1K with VGG-16 as the surrogate model and evaluated on black-box victim models. We report the top-1 classification accuracy after attack.
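For reference, here is a hedged sketch of the evaluation protocol used in both tables above: perturb each image with the trained generator, project the perturbation onto the L-infinity budget, and measure the victim model's top-1 accuracy on the bounded adversarial images. The generator output convention, the budget eps, and all names are assumptions for illustration.

import torch

@torch.no_grad()
def top1_accuracy_after_attack(generator, victim, loader, eps=16 / 255, device="cuda"):
    generator.eval()
    victim.eval()
    correct, total = 0, 0
    for images, labels in loader:
        images, labels = images.to(device), labels.to(device)
        unbounded = generator(images)                       # unbounded adversarial images
        delta = torch.clamp(unbounded - images, -eps, eps)  # enforce the L-infinity budget
        adv = torch.clamp(images + delta, 0.0, 1.0)         # bounded adversarial images
        preds = victim(adv).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
    return correct / total  # lower accuracy indicates a stronger attack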



Effect of learnable context words.  Learnable context words outperform hand-crafted heuristic prompts, and increasing their capacity further improves attack effectiveness; a minimal sketch of this prompt-learning component follows below.
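The learnable context words follow a CoOp-style prompt-learning idea: a sequence of continuous context vectors is prepended to the class-name embeddings and optimized while both CLIP encoders stay frozen. The sketch below is a minimal illustration with assumed dimensions, not the exact Prompter architecture; here, increasing capacity corresponds to using more context vectors (larger n_ctx).

import torch
import torch.nn as nn

class LearnableContext(nn.Module):
    # n_ctx shared context vectors optimized in CLIP's token-embedding space.
    def __init__(self, n_ctx=16, ctx_dim=512):
        super().__init__()
        self.ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim).normal_(std=0.02))

    def forward(self, class_name_embeddings):
        # class_name_embeddings: (num_classes, n_tokens, ctx_dim) token
        # embeddings of the class names from CLIP's embedding layer.
        n_cls = class_name_embeddings.size(0)
        ctx = self.ctx.unsqueeze(0).expand(n_cls, -1, -1)
        # Prepend the shared context to every class-name embedding; in practice
        # the concatenated sequence (with start/end tokens) is fed to the
        # frozen CLIP text encoder, and only self.ctx receives gradients.
        return torch.cat([ctx, class_name_embeddings], dim=1)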



Qualitative results on ImageNet-1K.  PDCL-Attack successfully fools the classifier: images with the clean labels shown in black are predicted as the incorrect classes shown at the bottom in red. From top to bottom: clean images, unbounded adversarial images, and bounded adversarial images, which are the actual inputs to the classifier.


Qualitative results on distribution-shifted variants of ImageNet-1K.  From top to bottom: clean images, bounded adversarial images, and unbounded adversarial images. In the middle, the zero-shot CLIP-predicted class labels are displayed for both clean and adversarial inputs. Our method effectively induces the zero-shot CLIP model to misclassify images, even under various distribution shifts. For inference, we employ the text prompt "a photo of a [class]", following the common practice for CLIP.
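The zero-shot CLIP predictions above follow the standard inference recipe, sketched below with OpenAI's clip package; the image path and the class-name subset are placeholders.

import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

class_names = ["goldfish", "tabby cat", "school bus"]  # placeholder subset of the label set
text = clip.tokenize([f"a photo of a {c}" for c in class_names]).to(device)

image = preprocess(Image.open("adv_example.png")).unsqueeze(0).to(device)  # placeholder path
with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print("predicted:", class_names[probs.argmax(dim=-1).item()])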



Contact

PDCL-Attack (pdcl.attack@gmail.com)


BibTeX

@InProceedings{yang2024PDCLAttack,
      title={Prompt-Driven Contrastive Learning for Transferable Adversarial Attacks},
      author={Hunmin Yang and Jongoh Jeong and Kuk-Jin Yoon},
      booktitle={European Conference on Computer Vision (ECCV)},
      year={2024}
}