Abstract

We introduce the integrative task of few-shot classification and segmentation (FS-CS) that aims to both classify and segment target objects in a query image when the target classes are given with a few examples. This task combines two conventional few-shot learning problems, few-shot classification and segmentation. FS-CS generalizes them to more realistic episodes with arbitrary image pairs, where each target class may or may not be present in the query. To address the task, we propose the integrative few-shot learning (iFSL) framework for FS-CS, which trains a learner to construct class-wise foreground maps for multi-label classification and pixel-wise segmentation. We also develop an effective iFSL model, attentive squeeze network (ASNet), that leverages deep semantic correlation and global self-attention to produce reliable foreground maps. In experiments, the proposed method shows promising performance on the FS-CS task and also achieves the state of the art on standard few-shot segmentation benchmarks.

Three main contributions of this work

1. A new task of integrative few-shot classification and segmentation (FS-CS)
    which combines few-shot classification and few-shot segmentation into a single task while addressing their limitations.
2. A new learning framework of integrative few-shot learning (iFSL)
    which learns to both classify and segment a query image using class-wise foreground maps.
3. A new neural architecture of attentive squeeze network (ASNet)
    which squeezes semantic correlations into a foreground map for iFSL via strided global self-attention.

Limitations of existing few-shot classification (FS-C) and few-shot segmentation (FS-S) setups


Figure 1. The limitations of the conventional few-shot classification and few-shot segmentation setups. (a) & (b): FS-C presumes that the query always contains exactly one of the target classes, while FS-S allows multiple classes to be present but strictly assumes that the query class set (C) and the support class set (Cs) are always identical. (c): FS-S learners are trained to segment a query image using a semantically coupled support set and thus often blindly highlight any salient object regardless of the support semantics. The proposed FS-CS learners are trained to predict class presence as well as the corresponding masks, and thus correctly discriminate what to segment based on the semantic relevance between the support and the query.

Integrative task of few-shot classification and segmentation (FS-CS)

Figure 2. Integrative task of few-shot classification and segmentation. FS-CS combines few-shot classification and segmentation while generalizing them to more realistic episodes with arbitrary image pairs, where each target class may or may not be present in the query.
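To make the episode structure concrete, below is a minimal sketch of how an N-way K-shot FS-CS episode could be represented. The container name and field layout are our own illustration for this page, not the released code.

```python
from dataclasses import dataclass
import torch

@dataclass
class FSCSEpisode:
    """A hypothetical container for an N-way K-shot FS-CS episode.

    Unlike conventional few-shot classification/segmentation episodes,
    the query is an arbitrary image: it may contain any subset of the
    N support classes, possibly none of them.
    """
    support_images: torch.Tensor   # [N, K, 3, H, W], K shots per class
    support_masks: torch.Tensor    # [N, K, H, W], binary foreground masks
    query_image: torch.Tensor      # [3, H, W]
    class_presence: torch.Tensor   # [N], multi-label ground truth in {0, 1}
    query_mask: torch.Tensor       # [H, W], labels in {0..N}, 0 = background
```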

Integrative few-shot learning framework (iFSL)

Figure 3. Integrative Few-Shot Learning (iFSL) for FS-CS. An iFSL learner estimates the class presence and the corresponding segmentation mask from shared class-wise foreground maps.
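As a rough sketch of the iFSL inference, class presence can be read off by spatially pooling each class-wise foreground map and thresholding it, while the segmentation mask can be obtained by a per-pixel argmax over the foreground maps together with an estimated background map. The function below is our illustration; the threshold value and the averaged-complement background estimate are assumptions, not the exact released implementation.

```python
import torch

def ifsl_inference(fg_maps: torch.Tensor, delta: float = 0.5):
    """Derive multi-label class presence and a segmentation mask
    from class-wise foreground maps (a sketch, not the official code).

    fg_maps: [N, H, W] per-class foreground probabilities in [0, 1].
    delta:   presence threshold (an assumed value).
    """
    # Multi-label classification: a class is present if its average
    # foreground probability exceeds the threshold.
    presence = fg_maps.mean(dim=(1, 2)) >= delta                    # [N] bool

    # Segmentation: estimate a background map as the averaged complement
    # of the foreground maps, then take the per-pixel argmax over
    # {background, class 1, ..., class N}.
    bg_map = (1.0 - fg_maps).mean(dim=0, keepdim=True)              # [1, H, W]
    seg_mask = torch.cat([bg_map, fg_maps], dim=0).argmax(dim=0)    # [H, W]
    return presence, seg_mask
```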

Attentive squeeze network (ASNet)

(a) Attentive squeeze layer (ASLayer)

(b) Attentive squeeze network

Figure 4. Overview of ASNet, which consists of ASLayers. ASNet first constructs a hypercorrelation from the image feature maps of a query (colored red) and a support (colored blue), where the 4D correlation is depicted as two 2D squares for illustrative simplicity. ASNet then learns to transform the correlation into a foreground map by gradually squeezing the support dimensions at each query position via global self-attention. The input correlation, intermediate features, and output foreground map each have a channel dimension, which is omitted in the illustration.
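For intuition on the first step, the sketch below builds a 4D query-support correlation from a single pair of feature maps by masking the support features with the support annotation and taking ReLU-clamped cosine similarities. The shapes, the masking detail, and the single-layer simplification are assumptions for illustration; the subsequent strided attentive squeeze of the support dimensions is not shown here.

```python
import torch
import torch.nn.functional as F

def build_correlation(query_feat: torch.Tensor, support_feat: torch.Tensor,
                      support_mask: torch.Tensor) -> torch.Tensor:
    """Build a 4D correlation between a query and a masked support feature map.
    A simplified sketch of the hypercorrelation idea, not the released code.

    query_feat:   [C, Hq, Wq] query feature map
    support_feat: [C, Hs, Ws] support feature map
    support_mask: [Hs, Ws] binary support foreground mask (at feature resolution)
    returns:      [Hq, Wq, Hs, Ws] cosine-similarity correlation
    """
    # Restrict the support features to the annotated foreground region.
    support_feat = support_feat * support_mask.unsqueeze(0)

    # Channel-wise L2 normalization so that dot products are cosine similarities.
    q = F.normalize(query_feat.flatten(1), dim=0)    # [C, Hq*Wq]
    s = F.normalize(support_feat.flatten(1), dim=0)  # [C, Hs*Ws]

    # Pairwise similarities between all query and support positions,
    # clamped at zero to keep only positive correlations.
    corr = torch.relu(q.t() @ s)                     # [Hq*Wq, Hs*Ws]
    return corr.view(*query_feat.shape[1:], *support_feat.shape[1:])
```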

Performance comparison of methods trained under the iFSL framework on the FS-CS task

Table 1. Performance comparison of ASNet and other methods on the FS-CS task on Pascal-5i. All methods are trained and evaluated under the iFSL framework given strong labels, i.e., class segmentation masks, except for ASNetw, which is trained only with weak labels, i.e., class tags.

Performance comparison of ASNet with others on the FS-S task

Table 2. FS-S results on 1-way 1-shot and 1-way 5-shot setups on Pascal-5i using ResNet50 (R50).

Studies on the FS-CS task

(a) Multi-way classification and segmentation. (b) Task transfer between FS-S, FS-C, and FS-CS.

Figure 6. (a) FS-CS is extensible to multi-class problems with an arbitrary number of classes, whereas FS-S is not as flexible in the wild. (b) We evaluate the transferability among FS-CS, FS-C, and FS-S by training a model on one task and evaluating it on another.

Qualitative results of ASNet for the FS-CS task

Figure 7. 2-way 1-shot FS-CS segmentation prediction maps on the COCO-20i benchmark.

Acknowledgements

This work was supported by Samsung Advanced Institute of Technology (SAIT) and also by Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD190031RD).

Paper

Integrative Few-Shot Learning for Classification and Segmentation
Dahyun Kang, Minsu Cho
CVPR, 2022
[paper] [Bibtex]

Code

Check out our GitHub repository: [GitHub]