Convolutional Hough Matching Networks

Abstract

Despite advances in feature representation, leveraging geometric relations is crucial for establishing reliable visual correspondences under large variations of images. In this work we introduce a Hough transform perspective on convolutional matching and propose an effective geometric matching algorithm, dubbed Convolutional Hough Matching (CHM). The method distributes similarities of candidate matches over a geometric transformation space and evaluates them in a convolutional manner. We cast it into a trainable neural layer with a semi-isotropic high-dimensional kernel, which learns non-rigid matching with a small number of interpretable parameters. To validate the effect, we develop a neural network with CHM layers that perform convolutional matching in the space of translation and scaling. Our method sets a new state of the art on standard benchmarks for semantic visual correspondence, demonstrating strong robustness to challenging intra-class variations.

Convolutional Hough matching

Figure 1. Convolutional Hough matching (left) and visualization of learned kernel (right). The CHM carries out local geometric voting in high-dimensional space, e.g., translation and scale spaces, with a small number of interpretable parameters.
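The local voting idea in Figure 1 can be illustrated with a deliberately reduced toy example. The sketch below is not the authors' implementation: it assumes 1D "images", so the correlation tensor is only a 2D matrix and the vote is a 2D convolution over it (for 2D images the same operation would be 4D, and higher with scale). The function names `chm_1d` and `toy_kernel` are hypothetical, and the kernel, which ties its weights to the offset disagreement |di - dj|, is a simplifying assumption used only to show parameter sharing; it is not the semi-isotropic kernel learned in the paper.

```python
def toy_kernel(radius, weights):
    """Build a (2r+1) x (2r+1) kernel whose entry for the offset pair (di, dj)
    depends only on |di - dj|. This sharing rule is an illustrative assumption,
    not the kernel used by the authors."""
    offsets = range(-radius, radius + 1)
    return [[weights[abs(di - dj)] for dj in offsets] for di in offsets]


def chm_1d(corr, kernel):
    """Toy convolutional voting on a correlation matrix.

    corr[i][j] is the similarity between source position i and target
    position j. Each candidate match (i, j) collects the similarities of
    neighbouring matches (i + di, j + dj), weighted by kernel[di][dj].
    """
    n_src, n_trg = len(corr), len(corr[0])
    r = len(kernel) // 2  # kernel radius (kernel is square and odd-sized)
    votes = [[0.0] * n_trg for _ in range(n_src)]
    for i in range(n_src):
        for j in range(n_trg):
            total = 0.0
            for di in range(-r, r + 1):
                for dj in range(-r, r + 1):
                    ii, jj = i + di, j + dj
                    # Zero padding: neighbours outside the matrix cast no vote.
                    if 0 <= ii < n_src and 0 <= jj < n_trg:
                        total += kernel[di + r][dj + r] * corr[ii][jj]
            votes[i][j] = total
    return votes


# Toy input: matches on the diagonal agree on a single translation,
# while the entry at (0, 3) is an isolated, inconsistent match.
corr = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
corr[0][3] = 1.0
votes = chm_1d(corr, toy_kernel(1, [1.0, 0.0, 0.0]))
```

In this toy setting the match at (1, 1) gathers support from its two translation-consistent neighbours, whereas the isolated match at (0, 3) retains only its own similarity, which is the intuition behind geometric voting.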

Experimental results

1. Quantitative results on standard benchmarks of semantic visual correspondence.

Table 1. Performance on standard benchmarks in accuracy, FLOPs, per-pair inference time, and memory footprint. Subscripts denote backbone networks. Some results are from [27, 31, 40, 42, 45, 47]. Numbers in bold indicate the best performance and underlined ones are the second best. Models with an asterisk (∗) are retrained using keypoint annotations (strong supervision) from [40]. The first column shows supervisory signals used in training: image-level labels (I), and keypoint matches (K). Superscript † denotes inference time using our implementation of nD conv.

* Benchmark datasets are available at [SPair-71k] [PF-PASCAL] [PF-WILLOW]

2. Precision-recall curves.

Figure 3. PR curves on SPair-71k (left) and PF-PASCAL (right). For each model, we define a set of coordinates on a regular grid on the input pair of images and assign their best matches using the model's own keypoint transfer method, thus providing the same number of (fairly collected) candidate matches to every model that we compare. For each candidate match, we define its match score as the score at the nearest spatial position in the correlation tensor. Given the top-k matches according to their match scores, we define true positives (TPs) as matches falling inside object segmentation masks (or bounding boxes) and false positives (FPs) as those lying outside object masks (boxes).
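The top-k protocol in the caption can be sketched as follows. This is a minimal illustration rather than the evaluation code behind Figure 3: the function name `pr_curve` and its inputs are hypothetical, and the recall denominator (all candidate matches that fall inside the object mask) is an assumption, since the caption defines only TPs and FPs.

```python
def pr_curve(scores, inside_mask):
    """Precision and recall for every top-k cut of the candidate matches.

    scores[i]      : match score of candidate i
    inside_mask[i] : True if candidate i lands inside the object mask (or box)
    """
    # Rank candidates from the highest to the lowest match score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    # Assumed recall denominator: every candidate that lies inside the mask.
    total_pos = sum(1 for flag in inside_mask if flag)
    tp = 0
    precision, recall = [], []
    for k, idx in enumerate(order, start=1):
        if inside_mask[idx]:
            tp += 1  # true positive; the remaining k - tp are false positives
        precision.append(tp / k)
        recall.append(tp / total_pos if total_pos else 0.0)
    return precision, recall


precision, recall = pr_curve(
    scores=[0.9, 0.8, 0.3, 0.1],
    inside_mask=[True, False, True, False],
)
```

Because every model contributes the same number of grid-sampled candidates, curves computed this way are comparable across models.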

3. Qualitative results

Figure 4. Example results with large illumination and scale differences, and truncation from SPair-71k [9]: (a) source image, (b) target image, (c) CHMNet (ours), (d) DHPF [10], (e) ANC-Net [6], (f) HPF [8], (g) DCCNet [4], and (h) NCNet [12].

Figure 5. Visualization of max-pooled positions in scale space. In each image pair, we show source keypoints (given) and their corresponding target keypoints (predicted) as circles in the left and right images, respectively. The size (large, medium, or small) of each circle indicates the max-pooled position in scale space. If both circles of a match are large, its match score is pooled from position (√2, √2) in scale space. If one circle is medium and the other is small, its match score is from position (1, 1/√2), and so on. We show ground-truth target keypoints as crosses, each with a line that depicts the matching error.
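The circle-size convention can be made concrete with a small sketch. It assumes three scales per image, 1/√2, 1, and √2, drawn as small, medium, and large circles, which is what the two examples in the caption imply; the function name `maxpool_scale_space` and the 3 x 3 score layout are hypothetical and serve only as an illustration.

```python
import math

# Assumed scale set and the circle size used to draw each scale.
SCALES = (1 / math.sqrt(2), 1.0, math.sqrt(2))
SIZES = ("small", "medium", "large")


def maxpool_scale_space(scores):
    """Pick the best-scoring (source scale, target scale) pair for one match.

    scores[s][t] is the match score at source scale index s and target
    scale index t.
    """
    s, t = max(
        ((s, t) for s in range(len(SCALES)) for t in range(len(SCALES))),
        key=lambda st: scores[st[0]][st[1]],
    )
    return {
        "score": scores[s][t],
        "position": (SCALES[s], SCALES[t]),  # position in scale space
        "circles": (SIZES[s], SIZES[t]),     # circle sizes to draw
    }


# The best score sits at source scale 1 and target scale 1/sqrt(2),
# i.e. a medium source circle and a small target circle.
result = maxpool_scale_space([
    [0.1, 0.2, 0.1],
    [0.9, 0.3, 0.2],
    [0.1, 0.4, 0.5],
])
```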

Acknowledgements

This work was supported by Samsung Advanced Institute of Technology (SAIT), the NRF grants (NRF-2017R1E1A1A01077999, NRF-2021R1A2C3012728), and the IITP grant (No.2019-0-01906, AI Graduate School Program - POSTECH) funded by Ministry of Science and ICT, Korea.
