Efficient and Versatile Robust Fine-Tuning of Zero-shot Models

1 Department of Computer Science and Engineering, POSTECH, Korea
2 Department of Artificial Intelligence, Korea University, Korea
3 Graduate School of Artificial Intelligence, POSTECH, Korea
ECCV 2024


R-Adapter combines the strengths of robust fine-tuning and parameter-efficient fine-tuning (PEFT).

Abstract

Large-scale image-text pre-trained models enable zero-shot classification and provide consistent accuracy across various data distributions. Nonetheless, optimizing these models in downstream tasks typically requires fine-tuning, which reduces generalization to out-of-distribution (OOD) data and demands extensive computational resources. We introduce Robust Adapter (R-Adapter), a novel method for fine-tuning zero-shot models to downstream tasks while simultaneously addressing both of these issues. Our method integrates lightweight modules into the pre-trained model and employs novel self-ensemble techniques to boost OOD robustness and substantially reduce storage costs. Furthermore, we propose the MPM-NCE loss, designed for fine-tuning on vision-language downstream tasks; it ensures precise alignment of multiple image-text pairs and discriminative feature learning. By extending the benchmark for robust fine-tuning beyond classification to diverse tasks such as cross-modal retrieval and open-vocabulary segmentation, we demonstrate the broad applicability of R-Adapter. Our extensive experiments show that R-Adapter achieves state-of-the-art performance across a diverse set of tasks while tuning only 13% of the parameters of the CLIP encoders.
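
For intuition, the core design can be pictured as a lightweight residual adapter attached to each frozen pre-trained layer, whose contribution can be re-scaled at inference to interpolate between zero-shot and fine-tuned behavior. Below is a minimal PyTorch-style sketch of this idea; the names (ResidualAdapter, rank, alpha) and the exact placement are illustrative assumptions, not the released R-Adapter implementation.

    # Minimal sketch of a residual adapter around a frozen pre-trained layer.
    # Names and hyperparameters are illustrative, not the official R-Adapter code.
    import torch
    import torch.nn as nn

    class ResidualAdapter(nn.Module):
        def __init__(self, dim: int, rank: int | None = None, alpha: float = 0.5):
            super().__init__()
            self.alpha = alpha  # re-scaling coefficient applied to the adapted part
            if rank is None:    # "Full-Rank": a single square projection
                self.adapter = nn.Linear(dim, dim)
            else:               # "r-Rank": low-rank bottleneck (down- then up-projection)
                self.adapter = nn.Sequential(nn.Linear(dim, rank), nn.Linear(rank, dim))

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # Residual form: alpha = 0 recovers the frozen zero-shot features,
            # larger alpha moves toward the fully adapted features.
            return x + self.alpha * self.adapter(x)

    # Usage: wrap the output of a frozen pre-trained layer; only the adapter is trained.
    frozen = nn.Linear(512, 512).requires_grad_(False)  # stands in for a CLIP layer
    adapter = ResidualAdapter(dim=512, rank=64)
    features = adapter(frozen(torch.randn(8, 512)))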

Experimental Results

- ImageNet classification under distribution shifts

ImageNet Classification Results

Table 1. Top-1 accuracy of models with different robust fine-tuning methods on ImageNet (ID) and OOD datasets. “OOD avg” is the average accuracy across the five OOD datasets. Entries in green indicate methods that tune fewer parameters than full fine-tuning, while those in red use more.

- Few-shot ImageNet classification

Few-shot ImageNet Classification Results

Table 2. Top-1 accuracy for adapting CLIP to 16-shot ImageNet classification on ID and OOD datasets. “OOD avg” is the average accuracy across the four OOD datasets. “r-Rank” denotes our models whose adapters employ low-rank decomposition, while “Full-Rank” denotes adapters without decomposition. All methods adopt CLIP ViT-B/16 as the backbone.
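
As a rough sense of scale (our own back-of-the-envelope arithmetic, not figures reported in the paper): for a layer of width d, a full-rank adapter adds about d² weights, while a rank-r bottleneck adds about 2·d·r, so a small r sharply cuts the number of tuned parameters. The helper name below is illustrative.

    # Back-of-the-envelope adapter parameter counts (illustrative; bias terms ignored).
    def adapter_weight_count(dim: int, rank: int | None = None) -> int:
        return dim * dim if rank is None else 2 * dim * rank

    print(adapter_weight_count(768))           # Full-Rank at ViT-B/16 width: 589,824 weights
    print(adapter_weight_count(768, rank=32))  # rank-32 bottleneck:           49,152 weights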

- Cross-modal retrieval

Cross-modal Retrieval Results

Table 3. Cross-modal retrieval performance on the COCO (5K test set) and Flickr30K datasets in Recall at K (R@K). B and L denote the use of 12-layer and 24-layer transformer encoders, respectively. FLYP-L training failed due to memory constraints.
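
The MPM-NCE loss from the abstract targets exactly this setting, where a batch can contain several captions per image and vice versa. As a rough illustration only (not the paper's exact objective), the sketch below shows a symmetric contrastive loss that spreads the target distribution over all matching pairs and applies label smoothing; the names match, temperature, and smoothing are our own.

    # Illustrative multi-positive contrastive loss with label smoothing.
    # This is a simplified reading of the idea, not the exact MPM-NCE objective.
    import torch
    import torch.nn.functional as F

    def soft_targets(match: torch.Tensor, smoothing: float) -> torch.Tensor:
        # Spread the target mass over every positive, then smooth the labels.
        target = match / match.sum(dim=1, keepdim=True)
        return (1.0 - smoothing) * target + smoothing / match.size(1)

    def multi_positive_nce(img, txt, match, temperature=0.07, smoothing=0.1):
        """img, txt: L2-normalized embeddings [N, D];
        match[i, j] = 1 if image i and text j form a matching pair."""
        logits = img @ txt.t() / temperature  # cosine similarities scaled by temperature
        loss_i2t = -(soft_targets(match, smoothing) * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
        loss_t2i = -(soft_targets(match.t(), smoothing) * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
        return 0.5 * (loss_i2t + loss_t2i)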

- Open-vocabulary segmentation

Open-vocabulary Segmentation Results

Table 4. Comparison of mIoU between OVSeg fine-tuned with our method and existing open-vocabulary segmentation models. Note that OVSeg (Org.) is trained in two stages, full CLIP fine-tuning followed by mask prompt tuning, whereas OVSeg (Ours) involves single-stage adapter training.

Ablation Studies

- Effectiveness of key components

Ablation Study Results

Table 5. Ablation study on the key components of our method and comparison with other adapter-tuning methods using a full-rank structure. The experiments are performed on ImageNet classification with ViT-B/32. The last row (E10) corresponds to our default configuration. DO: Dropout in Adapters. DP: Drop-path in pre-trained layers. AD: Adapter Dropping. AC: Accumulation. RS: Re-scaling. LS: Label Smoothing.
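
To make the abbreviations concrete: adapter dropping (AD) stochastically bypasses the adapter during training, accumulation (AC) maintains a temporal average of adapter weights, and re-scaling (RS) damps the adapter's contribution at inference. The schematic below reflects this reading with assumed names and values (drop_prob, ema_decay, alpha); it is not the official implementation.

    # Schematic of the self-ensembling components ablated as AD, AC, and RS.
    # drop_prob, ema_decay, and alpha are assumed values, not the official settings.
    import torch

    def training_features(x, frozen_layer, adapter, drop_prob=0.2):
        h = frozen_layer(x)
        if torch.rand(()) < drop_prob:       # AD: occasionally fall back to the zero-shot path
            return h
        return h + adapter(h)

    @torch.no_grad()
    def accumulate(adapter, adapter_avg, ema_decay=0.999):
        # AC: temporal (EMA-style) ensemble of adapter weights across training iterations
        for p, p_avg in zip(adapter.parameters(), adapter_avg.parameters()):
            p_avg.mul_(ema_decay).add_(p, alpha=1.0 - ema_decay)

    def inference_features(x, frozen_layer, adapter_avg, alpha=0.5):
        h = frozen_layer(x)
        return h + alpha * adapter_avg(h)    # RS: damp the adapted part toward zero-shot features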

- Re-scaling coefficient

Re-scaling Coefficient Variation

Figure 1. Performance of our method with varying re-scaling coefficient α, compared against WiSE-FT.
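
For reference, WiSE-FT ensembles two full models in weight space after fine-tuning, whereas here α acts inside a single model on the adapter's contribution. In our shorthand, with θ_zs and θ_ft the zero-shot and fine-tuned weights, h_zs a frozen layer's output, and g the adapter:

    \[
    \theta_{\text{WiSE-FT}} = (1-\alpha)\,\theta_{\text{zs}} + \alpha\,\theta_{\text{ft}},
    \qquad
    h_{\text{R-Adapter}} = h_{\text{zs}} + \alpha\, g(h_{\text{zs}}).
    \]

In both shorthands, α = 0 recovers the zero-shot model and α = 1 the fully fine-tuned one; Figure 1 compares the two ways of sweeping this trade-off.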

Acknowledgements

This work was supported by NRF grants (NRF-2021R1A2C3012728, 30%; NRF-2018R1A5A1060031, 30%; RS-2024-00341514, 25%) and IITP grants (RS-2019-II191906, 10%, Artificial Intelligence Graduate School Program at POSTECH; RS-2019-II190079, 5%, Artificial Intelligence Graduate School Program at Korea University) funded by the Ministry of Science and ICT, Korea.

BibTeX

@inproceedings{kim2024efficient,
    title={Efficient and Versatile Robust Fine-Tuning of Zero-shot Models},
    author={Kim, Sungyeon and Jeong, Boseung and Kim, Donghyun and Kwak, Suha},
    booktitle={European Conference on Computer Vision (ECCV)},
    year={2024},
}