Efficient Semantic Matching with Hypercolumn Correlation

Recent studies show that leveraging the match-wise relationships within the 4D correlation map yields significant improvements in establishing semantic correspondences, but at the cost of increased computation and latency. In this work, we observe that the performance improvements of recent methods can also largely be attributed to the use of multi-scale correlation maps, which carry information ranging from low-level geometric cues to high-level semantic contexts. To this end, we propose HCCNet, an efficient yet effective semantic matching method that exploits the full potential of multi-scale correlation maps while eschewing expensive match-wise relationship mining on the 4D correlation map. Specifically, HCCNet performs feature slicing on the bottleneck features to yield a richer set of intermediate features, which are used to construct a hypercolumn correlation. HCCNet can consequently establish semantic correspondences effectively by reducing conventional high-dimensional convolution or self-attention operations to efficient point-wise convolutions. HCCNet demonstrates state-of-the-art or competitive performance on the standard benchmarks of semantic matching, while incurring notably lower latency and computation overhead than existing SoTA methods.
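To give a rough sense of why point-wise convolutions are cheaper, the sketch below counts multiply-accumulate operations (MACs) per output position for a dense 4D convolution and for a point-wise convolution. The function names and the channel and kernel sizes are illustrative assumptions, not values from the paper, and existing methods may use cheaper 4D variants than the dense one counted here.

```python
def conv4d_macs_per_position(kernel_size, c_in, c_out):
    """MACs per output position for a dense 4D convolution with a k^4 kernel."""
    return kernel_size ** 4 * c_in * c_out


def pointwise_macs_per_position(c_in, c_out):
    """MACs per output position for a point-wise (kernel size 1) convolution."""
    return c_in * c_out


# Illustrative sizes only (assumed, not taken from the paper).
k, c_in, c_out = 3, 16, 16
ratio = conv4d_macs_per_position(k, c_in, c_out) / pointwise_macs_per_position(c_in, c_out)
print(ratio)  # 81.0
```

With a kernel size of 3, the dense 4D kernel touches 3^4 = 81 neighbours per position, so the per-position cost ratio is 81 regardless of the channel counts.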

Overall pipeline of HCCNet (HyperColumn Correlation Network).

The intermediate feature maps extracted from an image pair are first sliced and then used to compute a multi-channel correlation map with a correspondingly larger number of channels. We then identify and exploit the position-specific consensus among the correlation channels to produce a refined single-channel correlation map. From the refined correlation map, we construct a dense flow field, which is used to transfer the given source keypoints to the target image so that HCCNet can be supervised with ground-truth keypoint pair annotations.
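The pipeline described above can be sketched in NumPy as follows. This is a minimal illustration under several assumptions that the text does not confirm: all feature maps share one spatial resolution, correlation is ReLU-clamped cosine similarity, the point-wise refinement is a single linear layer with fixed weights standing in for learned ones, and keypoints are transferred with a soft-argmax. All function names and parameters are hypothetical.

```python
import numpy as np


def slice_features(feat, num_slices):
    """Split a (C, H, W) feature map into `num_slices` equal channel groups."""
    return np.split(feat, num_slices, axis=0)


def cosine_correlation(src, trg):
    """All-pairs cosine similarity: (C, H, W) x (C, H, W) -> (H*W, H*W)."""
    c = src.shape[0]
    s = src.reshape(c, -1)
    t = trg.reshape(c, -1)
    s = s / (np.linalg.norm(s, axis=0, keepdims=True) + 1e-8)
    t = t / (np.linalg.norm(t, axis=0, keepdims=True) + 1e-8)
    return np.maximum(s.T @ t, 0.0)  # ReLU clamp (assumption)


def hypercolumn_correlation(src_feats, trg_feats, num_slices):
    """One correlation map per slice per layer, stacked to (K, N_src, N_trg)."""
    maps = []
    for fs, ft in zip(src_feats, trg_feats):
        src_slices = slice_features(fs, num_slices)
        trg_slices = slice_features(ft, num_slices)
        for ss, ts in zip(src_slices, trg_slices):
            maps.append(cosine_correlation(ss, ts))
    return np.stack(maps, axis=0)


def pointwise_reduce(corr, weights, bias=0.0):
    """Point-wise (1x1) linear layer over the channel axis: (K, N, M) -> (N, M)."""
    return np.tensordot(weights, corr, axes=([0], [0])) + bias


def transfer_keypoints(corr, src_kps, hw, temperature=0.02):
    """Soft-argmax transfer of (x, y) grid keypoints using a (N, M) correlation."""
    h, w = hw
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    grid = np.stack([xs.ravel(), ys.ravel()], axis=1).astype(float)  # (M, 2)
    out = []
    for x, y in src_kps:
        row = corr[y * w + x] / temperature
        prob = np.exp(row - row.max())
        prob /= prob.sum()
        out.append(prob @ grid)  # expected target (x, y)
    return np.array(out)
```

As a sanity check, passing the same feature maps as both source and target should map each keypoint back to itself, since every position correlates most strongly with itself. In training, the point-wise weights would be learned and the predicted keypoints compared against the ground-truth annotations.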

Experimental results on standard benchmarks of semantic correspondence.

All methods reported in the above table use a pretrained ResNet-101 model as the feature extractor. The first group of methods was trained with weak supervision (image pair annotations), and the second group was trained with strong supervision (keypoint pair annotations). Models marked with * are retrained using keypoint annotations from ANC-Net. † indicates the use of data augmentation during training. Numbers in bold indicate the best performance, and underlined numbers indicate the second best.

Sample qualitative results.

Qualitative comparison of HCCNet against TransforMatcher under large viewpoint, occlusion, and truncation variations. Green lines represent ground-truth correspondences, and blue lines represent predicted correspondences. The source images are TPS-warped to the target image using the predicted correspondences for better comparison and visibility.

Analysis on Feature Slicing.

Figure 5. Average magnitude for a group g with and without feature slicing. The high variance of the magnitudes under feature slicing implies that slicing enables fine-grained differentiation of relevant correlation activations, which helps establish more reliable correspondences.

Acknowledgements

This work was supported by the NRF grant (NRF-2021R1A2C3012728 (50%)) and the IITP grants (2022-0-00290: Visual Intelligence for Space-Time Understanding and Generation based on Multi-layered Visual Common Sense (40%), 2019-0-01906: AI Graduate School Program at POSTECH (10%)) funded by the Ministry of Science and ICT, Korea.