Abstract

Establishing correspondences between images remains a challenging task, especially under large appearance changes due to different viewpoints or intra-class variations. In this work, we introduce a strong semantic image matching learner, dubbed TransforMatcher, which builds on the success of transformer networks in vision domains. Unlike existing convolution- or attention-based schemes for correspondence, TransforMatcher performs global match-to-match attention for precise match localization and dynamic refinement. To handle a large number of matches in a dense correlation map, we develop a lightweight attention architecture that considers global match-to-match interactions. We also propose to utilize a multi-channel correlation map for refinement, treating the multi-level scores as features rather than a single score to fully exploit the richer layer-wise semantics. In experiments, TransforMatcher sets a new state of the art on SPair-71k while performing on par with existing state-of-the-art methods on PF-PASCAL.

Overall pipeline of TransforMatcher.

Figure 1. The feature maps extracted from an image pair are used to compute a multi-channel correlation map, which is processed by our match-to-match attention module for refinement; the multi-level scores serve as the feature of each match. We construct a dense flow field from the resulting correlation map, which can be used to transfer keypoints for training with keypoint pair annotations. The training objective minimizes the average Euclidean distance between the predicted and ground-truth target keypoints.
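To make the pipeline concrete, below is a minimal PyTorch sketch of its two endpoints: building the multi-channel correlation map and the keypoint-transfer objective. It assumes feature maps from several backbone layers and uses a plain soft-argmax for keypoint prediction; the function names and the temperature value are illustrative simplifications, not the paper's exact implementation.

import torch
import torch.nn.functional as F

def multi_channel_correlation(src_feats, trg_feats):
    # src_feats, trg_feats: lists of (C_l, H, W) feature maps, one per layer.
    # Returns an (L, H, W, H, W) correlation map: one score channel per layer.
    corr_layers = []
    for fs, ft in zip(src_feats, trg_feats):
        c, h, w = fs.shape
        fs = F.normalize(fs.reshape(c, -1), dim=0)      # (C, HW)
        ft = F.normalize(ft.reshape(c, -1), dim=0)      # (C, HW)
        corr = fs.t() @ ft                              # (HW, HW) cosine scores
        corr_layers.append(corr.reshape(h, w, h, w))
    return torch.stack(corr_layers, dim=0)

def keypoint_transfer_loss(corr, src_kps, trg_kps, temperature=0.05):
    # corr: (H, W, H, W) refined single-channel correlation map.
    # src_kps, trg_kps: (N, 2) keypoints in (x, y) feature-grid coordinates.
    h, w = corr.shape[2], corr.shape[3]
    gy, gx = torch.meshgrid(
        torch.arange(h, dtype=corr.dtype, device=corr.device),
        torch.arange(w, dtype=corr.dtype, device=corr.device),
        indexing='ij')
    losses = []
    for (x, y), trg in zip(src_kps.long(), trg_kps.float()):
        # Soft-argmax over target positions for a differentiable prediction.
        prob = torch.softmax(corr[y, x].reshape(-1) / temperature, dim=0)
        pred = torch.stack([(prob * gx.reshape(-1)).sum(),
                            (prob * gy.reshape(-1)).sum()])
        losses.append(torch.norm(pred - trg))           # Euclidean distance
    return torch.stack(losses).mean()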

Dynamic Global Match-to-Match Attention.

Figure 2. Our main contribution is dynamic global match-to-match attention using transformers. Previous methods either enhance features through self- and cross-attention [1,2], refine matches using static local match-to-match attention [3,4,5], or refine matches by aggregating features and matches together through patch-to-patch attention [6]. In contrast, we propose to refine and update each match in the 4D correlation map individually, attending to non-local global cues among the matches in a dynamic manner.
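In code terms, match-to-match attention amounts to treating every cell of the 4D correlation map as a token; a hypothetical two-liner, reusing the multi_channel_correlation sketch from Figure 1:

corr = multi_channel_correlation(src_feats, trg_feats)   # (L, H, W, H, W)
tokens = corr.reshape(corr.shape[0], -1).t()             # (H*W*H*W, L): one L-dim feature per match

Each match token is then refined by attending over all other match tokens, rather than over a fixed local neighborhood.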

Additive attention for linear complexity.

Figure 3. The multi-channel correlation map is projected to query, key, and value matrices, to which rotary positional embeddings are applied. The match-to-match attention module uses an additive attention mechanism that aggregates the query/key matrices into global vectors, which are combined with the token representations via element-wise products to induce global context awareness. The final output is projected to a single channel and reshaped into a refined 4D correlation map. Since this additive attention has linear complexity, in contrast to the quadratic complexity of vanilla transformers, we can effectively model long-range match-to-match interactions within feasible computational budgets.
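The following PyTorch sketch illustrates linear-complexity additive attention in the spirit of Fastformer-style aggregation; it is a simplified single-head approximation of the module in the figure (rotary embeddings omitted but marked where they would go), not the paper's exact code.

import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    # Linear-complexity additive attention over match tokens: pool a global
    # query/key vector instead of forming an N x N attention matrix.
    def __init__(self, dim):
        super().__init__()
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.w_q = nn.Linear(dim, 1)    # scores tokens to pool a global query
        self.w_k = nn.Linear(dim, 1)    # scores tokens to pool a global key
        self.proj = nn.Linear(dim, dim)
        self.scale = dim ** -0.5

    def forward(self, x):               # x: (N, dim) match tokens
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        # (Rotary positional embeddings would be applied to q and k here.)
        alpha = torch.softmax(self.w_q(q) * self.scale, dim=0)   # (N, 1)
        global_q = (alpha * q).sum(dim=0)       # (dim,) global query vector
        p = global_q * k                        # element-wise product, (N, dim)
        beta = torch.softmax(self.w_k(p) * self.scale, dim=0)    # (N, 1)
        global_k = (beta * p).sum(dim=0)        # (dim,) global key vector
        u = global_k * v                        # broadcast to every token
        return self.proj(u) + q                 # residual; O(N) overall

Because only global vectors are pooled and broadcast, the cost grows linearly in the number of match tokens (H*W*H*W) rather than quadratically; a final linear layer then projects each refined token back to a single score for the refined 4D correlation map.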

Experimental results on standard benchmarks of semantic correspondence.

Table 1. Higher PCK is better. All results reported in the table use a pretrained ResNet-101 model as the feature extractor. Methods in the first group were trained with weak supervision (image-pair annotations), while those in the second group were trained with strong supervision (sparse keypoint-match annotations). Models with * are retrained using keypoint annotations from ANC-Net. † indicates the use of data augmentation during training. Numbers in bold indicate the best performance, followed by the underlined numbers. Notably, TransforMatcher trained without data augmentation outperforms CATs trained with data augmentation, demonstrating the efficacy of our 4D match-to-match attention and multi-level correlation score features.
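For reference, PCK (percentage of correct keypoints) counts a transferred keypoint as correct when it falls within a threshold proportional to the object size. A minimal sketch, assuming the SPair-71k convention of thresholding by alpha * max(h_bbox, w_bbox) (PF-PASCAL evaluation commonly uses the image size instead):

import numpy as np

def pck(pred_kps, gt_kps, bbox_hw, alpha=0.1):
    # pred_kps, gt_kps: (N, 2) arrays of (x, y) keypoint coordinates.
    # bbox_hw: (h, w) of the target object's bounding box.
    threshold = alpha * max(bbox_hw)
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= threshold).mean())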

Sample qualitative results.

Figure 4. The source image is TPS-transformed to the target image using the predicted correspondences; the results show the transformed images. TransforMatcher establishes accurate correspondences under challenging intra-class variations.

Analysis on nonlocality of match-to-match attention.

Figure 5. Nonlocality distributions of high-dimensional convolutional kernels (left) and TransforMatcher's attention layers (right). Convolutional layers statically transform matches with fixed, local receptive fields. In contrast, TransforMatcher layers dynamically transform matches, adaptively deciding where to attend with global receptive fields.
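As one plausible way to quantify this (an assumption on our part; the paper's exact definition may differ), nonlocality can be measured as the weighted mean distance between a match and the positions contributing to its update:

import torch

def nonlocality(weights, coords):
    # weights: (N, N) row-normalized contribution map (attention weights, or
    # a convolutional kernel's footprint expanded to all N positions).
    # coords: (N, d) float position of each match token in the 4D map.
    dists = torch.cdist(coords, coords)         # (N, N) pairwise distances
    return (weights * dists).sum(dim=1).mean()  # mean weighted reach per match

Under such a measure, a local convolution concentrates its weight on nearby positions and scores low, while global attention can place weight anywhere and scores high.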

Figure 6. Proportion of image-pair difficulty with respect to nonlocality. For all difficulty types, the proportion of hard/medium samples increases with increasing nonlocality. This trend is especially visible for truncation and occlusion: our model attends to larger contexts to better perceive truncated or occluded parts. In short, the harder an image pair is to match, the more TransforMatcher relies on non-local global cues to establish correspondences.

Acknowledgements

This work was supported by Samsung Advanced Institute of Technology (SAIT) and also by the NRF grant (NRF-2021R1A2C3012728) and the IITP grants (No.2021-0-02068: AI Innovation Hub, No.2019-0-01906: Artificial Intelligence Graduate School Program at POSTECH) funded by the Korea government (MSIT).

Paper

TransforMatcher: Match-to-Match Attention for Semantic Correspondence
Seungwook Kim, Juhong Min, Minsu Cho
CVPR, 2022
[paper] [Bibtex]

Code

Check out our GitHub repository: [GitHub]

References