Abstract

Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class. One of the recent approaches to this problem is to estimate the parameters of a global transformation model that densely aligns one image to the other. Since an entire correlation map between all feature pairs across images is typically used to predict such a global transformation, noisy features from different backgrounds, clutter, and occlusion distract the predictor from correctly estimating the alignment. This issue is particularly challenging in semantic correspondence, where a large degree of image variation is often involved. In this paper, we introduce an attentive semantic alignment method that focuses on reliable correlations, filtering out distractors. For effective attention, we also propose an offset-aware correlation kernel that learns to capture translation-invariant local transformations in computing correlation values over spatial locations. Experiments demonstrate the effectiveness of the attentive model and the offset-aware kernel, and the proposed model combining both techniques achieves state-of-the-art performance.
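To make the correlation map mentioned above concrete, the following is a minimal PyTorch sketch (not the authors' exact implementation; all names and shapes are illustrative) of how a dense correlation volume between two feature maps can be computed.

import torch
import torch.nn.functional as F

def dense_correlation(feat_src, feat_tgt):
    """Correlate every source location with every target location.

    feat_src, feat_tgt: (B, C, H, W) feature maps from a shared backbone.
    Returns a (B, H*W, H, W) volume whose channel q holds the correlation
    between each source location and target location q.
    """
    b, c, h, w = feat_src.shape
    # L2-normalize channels so each correlation is a cosine similarity.
    src = F.normalize(feat_src, dim=1).view(b, c, h * w)   # (B, C, HW)
    tgt = F.normalize(feat_tgt, dim=1).view(b, c, h * w)   # (B, C, HW)
    corr = torch.bmm(tgt.transpose(1, 2), src)             # (B, HW_tgt, HW_src)
    return corr.view(b, h * w, h, w)                       # spatial dims index source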

A2Net (Attentive alignment network)

Figure 1. Overall architecture of A2Net. A2Net takes two images as input and estimates a set of global transformation parameters using three main components: a feature extractor, a local transformation encoder, and an attentive global transformation estimator.
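The sketch below shows how the three components of Figure 1 compose; module names, layer sizes, and the backbone are assumptions for illustration, not the authors' released design. It reuses the dense_correlation helper sketched above.

import torch
import torch.nn as nn

class A2NetSketch(nn.Module):
    # Illustrative structural sketch of Figure 1; not the released implementation.
    def __init__(self, feature_extractor, corr_channels, num_params=6):
        super().__init__()
        self.feature_extractor = feature_extractor  # shared CNN applied to both images
        # Local transformation encoder: turns the correlation volume into a
        # feature map whose vectors encode local transformations.
        self.local_encoder = nn.Sequential(
            nn.Conv2d(corr_channels, 128, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(128, 64, kernel_size=3, padding=1), nn.ReLU())
        self.attention = nn.Conv2d(64, 1, kernel_size=1)   # one logit per location
        self.regressor = nn.Linear(64, num_params)         # e.g. 6 affine parameters

    def forward(self, img_src, img_tgt):
        f_src = self.feature_extractor(img_src)
        f_tgt = self.feature_extractor(img_tgt)
        corr = dense_correlation(f_src, f_tgt)             # sketched above
        local = self.local_encoder(corr)                   # (B, 64, H, W)
        b, c, h, w = local.shape
        # Attention over spatial locations, then weighted aggregation.
        probs = torch.softmax(self.attention(local).view(b, -1), dim=1)
        pooled = (local.view(b, c, -1) * probs.unsqueeze(1)).sum(dim=2)
        return self.regressor(pooled)                      # global transform parameters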

Offset-aware correlation kernels (OAC kernels)

Figure 2. Offset-aware correlation kernel at different source locations: (a) at (0,0) and (b) at (0,1). Each dotted line connects a source and a target feature whose correlation is computed, and w represents the kernel weight for that line. Note that the kernel weights are associated with different correlation pairs as the source location varies.
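One way to realize the translation invariance illustrated in Figure 2 is to index correlations by source-to-target offset instead of absolute target position: a kernel weight then always multiplies pairs with the same displacement, regardless of where the source is. The sketch below restricts offsets to a local window for simplicity; this is an assumption for illustration, and the paper's exact kernel formulation may differ in scope.

import torch
import torch.nn.functional as F

def offset_indexed_correlation(feat_src, feat_tgt, max_disp=4):
    # out[:, d, y, x] = <feat_src[:, :, y, x], feat_tgt[:, :, y+dy, x+dx]>,
    # where channel d enumerates offsets (dy, dx) in [-max_disp, max_disp]^2.
    b, c, h, w = feat_src.shape
    feat_src = F.normalize(feat_src, dim=1)
    feat_tgt = F.pad(F.normalize(feat_tgt, dim=1),
                     (max_disp, max_disp, max_disp, max_disp))
    channels = []
    for dy in range(2 * max_disp + 1):
        for dx in range(2 * max_disp + 1):
            shifted = feat_tgt[:, :, dy:dy + h, dx:dx + w]
            channels.append((feat_src * shifted).sum(dim=1))  # (B, H, W)
    return torch.stack(channels, dim=1)  # (B, (2*max_disp+1)**2, H, W)

# Because each channel now corresponds to a fixed offset, a 1x1 convolution
# over these channels applies the same weight to every correlation pair with
# that displacement -- the weights shift with the source location, as in
# Figure 2 (a) vs. (b).
oac = torch.nn.Conv2d((2 * 4 + 1) ** 2, 128, kernel_size=1)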

Attentive global transformation estimator

Figure 3. Illustration of the attention process. Noisy features in the local transformation feature map are filtered out by assigning lower probabilities to their locations. Arrows in the boxes of the local transformation feature map represent features encoding local transformations, and the grayscale colors in the attention distribution represent the magnitudes of the probabilities, with brighter colors indicating higher probabilities.
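The filtering in Figure 3 can be written in a few lines; this expands the attention step of the pipeline sketch above, again with illustrative names and shapes.

import torch

def attend_local_transforms(local_feats, attn_logits):
    # local_feats: (B, C, H, W) local transformation feature map.
    # attn_logits: (B, 1, H, W) unnormalized attention scores.
    b, c, h, w = local_feats.shape
    # Softmax over all H*W locations: noisy locations receive low probability.
    probs = torch.softmax(attn_logits.view(b, -1), dim=1)          # (B, H*W)
    # Probability-weighted aggregation suppresses the filtered locations.
    attended = (local_feats.view(b, c, -1) * probs.unsqueeze(1)).sum(dim=2)
    return attended                                                # (B, C)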

Results

1. Results on PF-WILLOW and PF-PASCAL datasets.

(Updated 26 Oct 2018) Note that there has been some inconsistency across related papers in determining the error tolerance threshold for PCK: some papers (DeepFlow, GMK, SIFTFlow, DSP, ProposalFlow) determine the threshold based on the object bounding box size, whereas others (UCN, FCSS, SCNet) use the entire image size.
Unfortunately, this issue confuses comparisons across methods. In producing the PF-PASCAL benchmark comparison in our paper, we used the code from the previous methods and, in doing so, made mistakes in some of the evaluations due to this issue.
Although the overall performance tendencies between models remain unchanged, the scores of some models were overestimated. We have posted a new version of our paper on arXiv with all scores correctly measured with respect to bounding box sizes. Please refer to this version for the correct scores: https://arxiv.org/abs/1808.02128. We apologize for any inconvenience.
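For reference, a minimal sketch of the PCK metric makes the difference between the two conventions explicit; the exact choice of reference length (e.g., the maximum of height and width) also varies across papers, so treat this as illustrative.

import numpy as np

def pck(pred_kps, gt_kps, ref_size, alpha=0.1):
    # pred_kps, gt_kps: (N, 2) arrays of (x, y) keypoint coordinates.
    # A keypoint is correct if it lies within alpha * ref_size of the ground truth.
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return float((dists <= alpha * ref_size).mean())

# Bounding-box convention (DeepFlow, GMK, SIFTFlow, DSP, ProposalFlow):
#   score = pck(pred, gt, ref_size=max(bbox_h, bbox_w))
# Image-size convention (UCN, FCSS, SCNet):
#   score = pck(pred, gt, ref_size=max(img_h, img_w))
# Since the bounding box is never larger than the image, the first threshold
# is tighter and yields lower (more conservative) scores.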

Figure 4. Experimental results on PF-WILLOW and PF-PASCAL. PCK is measured with α=0.1. Scores for other models are taken from their respective papers, while scores marked with an asterisk (*) are obtained from models reproduced with the officially released code. PCK scores marked with a dagger (†) are measured with respect to the image height and width instead of the bounding box size. Note that PCK w.r.t. the bounding box size is more conservative than PCK w.r.t. the image size, resulting in lower scores. For example, the PCK scores of A2Net on PF-PASCAL measured with image sizes are 0.59 (affine), 0.65 (affine+TPS), and 0.71 (affine+TPS; ResNet101).

2. Ablation study

Figure 5. PCKs of ablated models on PF-WILLOW, trained with PASCAL VOC 2011. Scores of GeoCNN are obtained from the code released by its authors. The numbers of network parameters exclude the feature extractor since all models share the same one.

3. Qualitative results

Figure 6. Qualitative results of the attentive semantic alignment. Each row shows an example from the PF-PASCAL benchmark. Given the source and target images shown in the first and third columns, we visualize the attention map of the affine model (second column), the image transformed by the affine model (fourth column), and the final image transformed by the affine+TPS model (last column). Since the models learn the inverse transformation, the target image is transformed toward the source image while the attention distribution is drawn over the source image. The model attends to the objects to be matched and estimates dense correspondences despite intra-class variations and background clutter.
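As a side note on the inverse transformation mentioned above: with PyTorch's affine_grid/grid_sample, the predicted parameters map output coordinates back into the input image, so warping the target toward the source follows directly. A minimal sketch (illustrative, not the authors' exact code):

import torch
import torch.nn.functional as F

def warp_with_affine(image, theta):
    # theta: (B, 2, 3) affine matrices mapping output (source-aligned)
    # coordinates back into the input (target) image, i.e. the inverse map.
    grid = F.affine_grid(theta, image.size(), align_corners=False)
    return F.grid_sample(image, grid, align_corners=False)

# Identity parameters leave the image unchanged.
img = torch.rand(1, 3, 240, 240)
theta = torch.tensor([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]])
warped = warp_with_affine(img, theta)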

Paper

Attentive Semantic Alignment with Offset-Aware Correlation Kernels
Paul Hongsuck Seo, Jongmin Lee, Deunsol Jung, Bohyung Han, and Minsu Cho
ECCV, 2018
[arXiv] [Bibtex]

Code

Check out our GitHub repository: [github]