Abstract

Existing datasets for semantic correspondence are often limited in both the amount of labeled data and the diversity of labeled keypoints due to the tremendous cost of manual correspondence labeling. To address this issue, we propose the first self-supervised learning framework that utilizes a large number of web videos collected and annotated fully automatically. Our main motivation is that smooth changes between consecutive video frames make it possible to build accurate space-time correspondences with no human intervention. Hence, we establish space-time correspondences within each web video and leverage them to derive pseudo correspondence labels between two distant frames of the video. In addition, we present a dedicated training strategy that facilitates stable training on web videos with such pseudo labels. Our experiments on public benchmarks demonstrate that the proposed method surpasses existing self-supervised learning models and that using our self-supervised learning as pretraining for supervised learning improves performance substantially. Our codebase for web video crawling and pseudo label generation will be released publicly to promote future research.
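To make the core idea concrete, below is a minimal sketch, in our own notation rather than the released code, of how correspondences between consecutive frames can be chained to propagate keypoints from one frame to a distant frame; the per-pair displacement fields (`flows`) and the nearest-pixel lookup are simplifying assumptions.

```python
import numpy as np

def propagate_keypoints(keypoints, flows):
    """Chain consecutive-frame correspondences to map keypoints from
    frame t to a distant frame t+k, yielding pseudo labels.

    keypoints: (N, 2) array of (x, y) positions in the first frame.
    flows: list of k dense displacement fields, each of shape (H, W, 2),
           where flows[i][y, x] is the displacement of pixel (x, y)
           from frame t+i to frame t+i+1.
    """
    pts = keypoints.astype(np.float64).copy()
    for flow in flows:
        h, w = flow.shape[:2]
        # Nearest-pixel lookup for simplicity; bilinear sampling
        # would be more accurate in practice.
        x = np.clip(np.round(pts[:, 0]).astype(int), 0, w - 1)
        y = np.clip(np.round(pts[:, 1]).astype(int), 0, h - 1)
        pts = pts + flow[y, x]
    return pts  # pseudo-corresponding positions in the distant frame
```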

Overall Architecture

Figure 1. Our algorithm for collecting web videos and annotating them with pseudo correspondence labels fully automatically. The algorithm first downloads only thumbnail images and classifies them to identify relevant videos. The relevant videos are then downloaded and divided into multiple clips with no abrupt transitions. Finally, the algorithm trains a space-time correspondence model on the clips to generate dense pseudo correspondence labels for arbitrary pairs of frames within each clip.
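The clip-splitting step can be illustrated with a simple shot-boundary detector; the sketch below is our own stand-in, assuming normalized color-histogram intersection as the similarity measure, with illustrative threshold values.

```python
import numpy as np

def split_into_clips(frames, threshold=0.5, min_len=8):
    """Split a frame sequence into clips with no abrupt transitions.

    A clip boundary is declared wherever the color-histogram
    intersection between consecutive frames drops below `threshold`.
    frames: list of HxWx3 uint8 arrays; returns (start, end) index spans.
    """
    def color_hist(frame):
        h, _ = np.histogramdd(frame.reshape(-1, 3), bins=(8, 8, 8),
                              range=((0, 256),) * 3)
        h = h.ravel()
        return h / (h.sum() + 1e-8)

    hists = [color_hist(f) for f in frames]
    clips, start = [], 0
    for i in range(1, len(frames)):
        # Intersection of normalized histograms lies in [0, 1];
        # a low value indicates an abrupt cut.
        if np.minimum(hists[i - 1], hists[i]).sum() < threshold:
            if i - start >= min_len:
                clips.append((start, i))
            start = i
    if len(frames) - start >= min_len:
        clips.append((start, len(frames)))
    return clips
```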

Figure 2. An overview of our framework using web videos. The web videos, together with their pseudo labels, are used for conventional supervised learning of the base model (CATs in this figure). Additionally, a common dataset is employed for domain adaptation, aiming to bridge the domain gap between the web videos and the common dataset without using any form of supervision from that dataset.
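The two-part objective can be summarized as follows; this is a sketch under our own assumptions (a keypoint-regression loss on pseudo-labeled web pairs plus a generic label-free adaptation term), not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def training_step(model, web_batch, common_batch, adaptation_loss,
                  optimizer, lam=0.1):
    """One optimization step combining both parts of Figure 2.

    web_batch:    (src, trg, pseudo_kps), a web-video frame pair with
                  pseudo correspondence labels.
    common_batch: (src, trg), an unlabeled pair from the common dataset.
    adaptation_loss: any label-free term (e.g., a consistency or
                  adversarial loss); `lam` balances the two terms.
    """
    src, trg, pseudo_kps = web_batch
    pred_kps = model(src, trg)                 # predicted target keypoints
    sup_loss = F.mse_loss(pred_kps, pseudo_kps)

    c_src, c_trg = common_batch
    da_loss = adaptation_loss(model, c_src, c_trg)  # no labels used

    loss = sup_loss + lam * da_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```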

Performance

1. Comparisons with self-/weakly-supervised methods

Table 1. Comparisons with self-/weakly-supervised methods in PCK (%) on PF-PASCAL, PF-Willow, and SPair-71k. The backbone network of each method is indicated in the subscript. The best and second-best results are marked in bold and underlined, respectively.
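For reference, PCK can be computed as below; this sketch assumes the common convention on these benchmarks, a threshold of alpha times the larger side of the object bounding box (e.g., PCK@0.10).

```python
import numpy as np

def pck(pred_kps, gt_kps, bbox_size, alpha=0.1):
    """Percentage of Correct Keypoints (PCK), as reported in the tables.

    A prediction counts as correct if it lies within
    alpha * max(bbox height, bbox width) of the ground truth.
    pred_kps, gt_kps: (N, 2) arrays of (x, y); bbox_size: (h, w).
    """
    threshold = alpha * max(bbox_size)
    dists = np.linalg.norm(pred_kps - gt_kps, axis=1)
    return 100.0 * np.mean(dists <= threshold)
```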

2. Comparisons with supervised methods

Table 2. Comparisons with supervised methods in PCK (%) on PF-PASCAL, PF-Willow, and SPair-71k. The backbone network of each method is indicated in the subscript. The best and second-best results are marked in bold and underlined, respectively.

Qualitative results

1. Qualitative results on the SPair-71k test set

Figure 1. Qualitative results on the SPair-71k test set.

2. Qualitative results of pseudo correspondence labels on web-crawled videos

Figure 2. Qualitative results of pseudo correspondence labels on web-crawled videos. We randomly sampled eight keypoints from the pseudo correspondence labels of each frame pair. Note that the source and target images are augmented differently.
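The sampling used for this visualization can be sketched as follows; the function and argument names are hypothetical, assuming a dense pseudo-label map with a validity mask.

```python
import numpy as np

def sample_pseudo_keypoints(dense_map, valid_mask, num_kps=8, rng=None):
    """Draw random keypoint pairs from a dense pseudo-label map,
    e.g., the eight keypoints shown per frame pair in Figure 2.

    dense_map:  (H, W, 2) array giving each source pixel's
                pseudo-corresponding (x, y) in the target frame.
    valid_mask: (H, W) bool marking reliable correspondences.
    """
    rng = rng or np.random.default_rng()
    ys, xs = np.nonzero(valid_mask)
    idx = rng.choice(len(xs), size=min(num_kps, len(xs)), replace=False)
    src_kps = np.stack([xs[idx], ys[idx]], axis=1)   # source keypoints
    trg_kps = dense_map[ys[idx], xs[idx]]            # matched targets
    return src_kps, trg_kps
```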

Paper

Self-Supervised Learning of Semantic Correspondence Using Web Videos
Donghyeon Kwon, Minsu Cho and Suha Kwak
WACV, 2024
[Paper] [Bibtex]

Code

Check out our GitHub repository: [GitHub Repository]