Abstract

We propose to address the problem of few-shot classification by meta-learning "what to observe" and "where to attend" from a relational perspective. Our method leverages relational patterns within and between images via self-correlational representation (SCR) and cross-correlational attention (CCA). Within each image, the SCR module transforms a base feature map into a self-correlation tensor and learns to extract structural patterns from the tensor. Between images, the CCA module computes cross-correlation between two image representations and learns to produce co-attention between them. Our Relational Embedding Network (RENet) combines the two relational modules to learn relational embeddings in an end-to-end manner. In experimental evaluation, it achieves consistent improvements over state-of-the-art methods on four widely used few-shot classification benchmarks: miniImageNet, tieredImageNet, CUB-200-2011, and CIFAR-FS.

Overall architecture of RENet (Relational Embedding Networks)

Figure 1. The base representations Zq and Zs are transformed into self-correlation tensors Rq and Rs, which are then updated by the convolutional block g into self-correlational representations Fq and Fs, respectively. To obtain co-attention maps, the cross-correlation C between the two image representations is computed and refined by the convolutional block h into h(C), which is bidirectionally aggregated to generate two attention maps, Aq and As. These co-attention maps are applied to the corresponding image representations, Fq and Fs, and the attended features are aggregated to produce the final relational embeddings, q and s, respectively.
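To make the pipeline concrete, below is a minimal PyTorch sketch of the first step, turning a base feature map Z into a self-correlation tensor R. The 5x5 neighborhood size, the ReLU, and the channel-wise Hadamard form of the correlation are assumptions of this sketch, not necessarily the exact choices of the released implementation.

import torch
import torch.nn.functional as F

def self_correlation(z: torch.Tensor, u: int = 5, v: int = 5) -> torch.Tensor:
    """Turn a base feature map Z into a self-correlation tensor R.

    z: (B, C, H, W) base representation.
    Returns R of shape (B, C, H, W, U, V), correlating the L2-normalized
    feature at each position with its U x V spatial neighborhood.
    """
    b, c, h, w = z.shape
    z = F.normalize(F.relu(z), p=2, dim=1)              # unit-norm features
    # Gather the U x V neighborhood around every spatial position
    # (zero padding keeps the H x W resolution).
    nbhd = F.unfold(z, kernel_size=(u, v), padding=(u // 2, v // 2))
    nbhd = nbhd.view(b, c, u, v, h, w)
    # Channel-wise correlation of each position with its neighborhood.
    r = z.unsqueeze(2).unsqueeze(3) * nbhd              # (B, C, U, V, H, W)
    return r.permute(0, 1, 4, 5, 2, 3).contiguous()     # (B, C, H, W, U, V)

Given R, the convolutional block g (detailed in Figure 2a) convolves over the (U, V) dimensions, and its output is added back to Z to form F.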

Self-correlational representation (SCR) & Cross-correlational attention (CCA)

Architecture of SCR and CCA. Effects of SCR and CCA.

Figure 2. (a): The SCR module captures relational patterns in the input self-correlation R by convolving it over the U × V dimensions. The result g(R) is added to the base representation Z to form the self-correlational representation F (Eq. 2). (b): The CCA module refines the cross-correlation, which is then summarized into the co-attention maps Aq and As (Eq. 4). Both modules consistently improve classification accuracy on the miniImageNet and CUB-200-2011 datasets.
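Both modules can be sketched in a few lines of PyTorch, under stated assumptions: the names SCRBlock and cca_attention, the channel widths c_mid/c_out, and the temperature tau are illustrative choices of this sketch, and the CCA part omits the 4D convolutional refinement h, aggregating raw cosine correlations directly via the bidirectional softmax of Eq. 4.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SCRBlock(nn.Module):
    """Sketch of g: 2D convolutions over the (U, V) dims of R, shrinking
    the 5x5 neighborhood to 1x1 so the output matches Z spatially.
    c_out must equal the channel count of Z for the residual sum."""
    def __init__(self, c_in: int, c_mid: int = 64, c_out: int = 640):
        super().__init__()
        self.g = nn.Sequential(
            nn.Conv2d(c_in, c_mid, kernel_size=3),    # (U, V): 5x5 -> 3x3
            nn.BatchNorm2d(c_mid), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_out, kernel_size=3),   # (U, V): 3x3 -> 1x1
            nn.BatchNorm2d(c_out), nn.ReLU(inplace=True),
        )

    def forward(self, r: torch.Tensor, z: torch.Tensor) -> torch.Tensor:
        b, c, h, w, u, v = r.shape                    # R from self_correlation
        r = r.permute(0, 2, 3, 1, 4, 5).reshape(b * h * w, c, u, v)
        gr = self.g(r).view(b, h, w, -1).permute(0, 3, 1, 2)
        return z + gr                                 # Eq. 2: F = Z + g(R)

def cca_attention(fq: torch.Tensor, fs: torch.Tensor, tau: float = 5.0):
    """Bidirectional co-attention from cross-correlation (cf. Eq. 4).

    fq, fs: (B, C, H, W) query/support representations.
    Returns Aq, As of shape (B, 1, H, W).
    """
    b, c, h, w = fq.shape
    q = F.normalize(fq.flatten(2), p=2, dim=1)        # (B, C, HWq)
    s = F.normalize(fs.flatten(2), p=2, dim=1)        # (B, C, HWs)
    corr = torch.bmm(q.transpose(1, 2), s)            # (B, HWq, HWs) cosines
    # Aq: softmax over query positions, averaged over support positions.
    aq = F.softmax(corr / tau, dim=1).mean(dim=2).view(b, 1, h, w)
    # As: softmax over support positions, averaged over query positions.
    a_s = F.softmax(corr / tau, dim=2).mean(dim=1).view(b, 1, h, w)
    return aq, a_s

The attention maps would then reweight Fq and Fs before pooling into the final embeddings q and s, e.g., (fq * aq).flatten(2).sum(-1), mirroring the aggregation step in Figure 1.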

Comparison to the state-of-the-art methods

(a) miniImageNet (b) tieredImageNet

(c) CUB-200-2011 (d) CIFAR-FS

Table 1. Performance comparison in terms of accuracy (%) with 95% confidence intervals on (a) miniImageNet, (b) tieredImageNet, (c) CUB-200-2011, and (d) CIFAR-FS. “†” denotes methods using a backbone larger than ResNet12.

Qualitative results of RENet

Figure 3. (a): Channel activation of the base representation. (b): Channel activation of SCR. (c): Attention map of CCA. The SCR module can deactivate irrelevant features by abstracting self-correlated neighborhoods, e.g., the activation of a building behind a truck decreases. The subsequent CCA module generates co-attention maps that focus on the common context between a query and a support, e.g., the hands grasping the bars are co-attended.

Acknowledgements

This work was supported by Samsung Electronics Co., Ltd. (IO201208-07822-01) and by IITP grants (No. 2019-0-01906, AI Graduate School Program, POSTECH; No. 2021-0-00537, Visual common sense through self-supervised learning for restoration of invisible parts in images) funded by the Ministry of Science and ICT, Korea.

Paper

Relational Embedding for Few-Shot Classification
Dahyun Kang, Heeseung Kwon, Juhong Min, Minsu Cho
ICCV, 2021
[paper] [BibTeX]

Code

Check out our GitHub repository: [GitHub]