Paul Hongsuck Seo1, Jongmin Lee1, Deunsol Jung1, Bohyung Han2, Minsu Cho1
1POSTECH, 2Seoul National University
Semantic correspondence is the problem of establishing correspondences across images depicting different instances of the same object or scene class. One recent approach to this problem is to estimate the parameters of a global transformation model that densely aligns one image to the other. Since an entire correlation map between all feature pairs across images is typically used to predict such a global transformation, noisy correlations from different backgrounds, clutter, and occlusion distract the predictor from correctly estimating the alignment. This is a particularly challenging issue in semantic correspondence, where a large degree of image variation is often involved. In this paper, we introduce an attentive semantic alignment method that focuses on reliable correlations, filtering out distractors. For effective attention, we also propose an offset-aware correlation kernel that learns to capture translation-invariant local transformations in computing correlation values over spatial locations. Experiments demonstrate the effectiveness of the attentive model and the offset-aware kernel, and the proposed model combining both techniques achieves state-of-the-art performance.
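The dense correlation map mentioned above can be illustrated with a minimal sketch: for two feature maps, it holds the similarity of every source location to every target location. This is not the paper's implementation, just an assumed plain-NumPy version of the standard construction (cosine similarity of L2-normalized features).

```python
import numpy as np

def correlation_map(feat_s, feat_t):
    """Dense correlation between all feature pairs of two maps.

    feat_s, feat_t: (H, W, C) feature maps (assumed L2-normalized
    per location so dot products are cosine similarities).
    Returns an (H, W, H, W) array: entry [i, j, k, l] is the
    correlation between source location (i, j) and target (k, l).
    """
    H, W, C = feat_s.shape
    fs = feat_s.reshape(H * W, C)
    ft = feat_t.reshape(H * W, C)
    corr = fs @ ft.T  # all pairwise dot products, (H*W, H*W)
    return corr.reshape(H, W, H, W)
```

Every source location thus carries a full similarity profile over the target image; the attention and kernel components described below operate on this 4D map.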
Figure 1. Overall architecture of the A2Net. A2Net takes two images as inputs and estimates a set of global transformation parameters using three main components: a feature extractor, a local transformation encoder, and an attentive global transformation estimator.
Figure 2. Offset-aware correlation kernel at different source locations: (a) at (0,0) and (b) at (0,1). Each dotted line connects source and target features to compute correlation, and w represents a kernel weight for the dotted line. Note that kernel weights are associated with different correlation pairs when source locations vary.
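The offset-aware kernel in Figure 2 can be sketched as follows: instead of tying a weight to an absolute target position, each weight is tied to the *offset* between source and target locations, so the same weight set is reused as the source location moves. This is a simplified, single-output-channel illustration under assumed shapes (the actual model learns many such kernels to produce a local transformation feature vector per location); `radius` and the zero-padding at borders are assumptions of this sketch.

```python
import numpy as np

def offset_aware_kernel(corr, weights, radius):
    """Apply one offset-aware correlation kernel (illustrative sketch).

    corr:    (H, W, H, W) dense correlation map.
    weights: (2*radius+1, 2*radius+1) kernel with one weight per
             source-target offset, shared across all source locations
             (this sharing gives translation invariance).
    Returns an (H, W) response map.
    """
    H, W = corr.shape[:2]
    out = np.zeros((H, W))
    for i in range(H):
        for j in range(W):
            for dk in range(-radius, radius + 1):
                for dl in range(-radius, radius + 1):
                    k, l = i + dk, j + dl
                    if 0 <= k < H and 0 <= l < W:  # skip out-of-bounds offsets
                        out[i, j] += weights[dk + radius, dl + radius] * corr[i, j, k, l]
    return out
```

Because `weights` is indexed by offset rather than by target position, a local transformation (e.g. a small shift) produces the same response wherever it occurs in the image, which is exactly the property Figure 2 illustrates by moving the source location from (0,0) to (0,1).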
Figure 3. Illustration of the attention process. Noisy features in the local transformation feature map are filtered out by assigning lower probabilities to their locations. Arrows in the boxes of the local transformation feature map depict features encoding local transformations, and grayscale values in the attention distribution represent probability magnitudes, where brighter colors mean higher probabilities.
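The filtering step in Figure 3 amounts to a softmax attention over spatial locations followed by a probability-weighted sum of the local transformation features. A minimal sketch, assuming per-location attention logits are already computed (how the logits are produced is not shown here):

```python
import numpy as np

def attend(local_feats, logits):
    """Attention-weighted pooling over spatial locations (sketch).

    local_feats: (H, W, D) local transformation features.
    logits:      (H, W) unnormalized attention scores; noisy
                 locations should receive low scores.
    Returns a (D,) attended feature passed on to the estimator.
    """
    e = np.exp(logits - logits.max())  # numerically stable softmax
    probs = e / e.sum()                # (H, W), sums to 1
    return (local_feats * probs[..., None]).sum(axis=(0, 1))
```

With uniform logits this reduces to average pooling; sharply peaked logits select the features of a few reliable locations, which is how distractor regions are suppressed before the global transformation is regressed.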
Figure 4. Experimental results on PF-WILLOW and PF-PASCAL. PCK is measured with α=0.1. Scores for other models are taken from their respective papers, while scores marked with an asterisk (*) are from models we reproduced using the released official code. PCK scores marked with a dagger (†) are measured with respect to the height and width of the image instead of the bounding box. Note that PCK w.r.t. the bounding box size is more conservative than PCK w.r.t. the image size, resulting in lower scores. For example, the PCK scores of A2Net on PF-PASCAL measured w.r.t. image size are 0.59 (affine), 0.65 (affine+TPS), and 0.71 (affine+TPS; ResNet101).
Figure 5. PCKs of ablation models on PF-WILLOW, trained with PASCAL VOC 2011. Scores of GeoCNN are obtained from the code released by the authors. The network parameter counts exclude the feature extractor since all models share the same one.
Figure 6. Qualitative results of the attentive semantic alignment. Each row shows an example from the PF-PASCAL benchmark. Given the source and target images shown in the first and third columns, we visualize the attention map of the affine model (second column), the image transformed by the affine model (fourth column), and the final image transformed by the affine+TPS model (last column). Since the models learn the inverse transformation, the target image is transformed toward the source image while the attention distribution is drawn over the source image. The model attends to the objects to be matched and estimates dense correspondences despite intra-class variations and background clutter.
Check our GitHub repository: [github]