Donghyeon Kwon | Minsu Cho | Suha Kwak
POSTECH
Existing datasets for semantic correspondence are often limited in both the amount of labeled data and the diversity of labeled keypoints due to the tremendous cost of manual correspondence labeling. To address this issue, we propose the first self-supervised learning framework that utilizes a large number of web videos collected and annotated fully automatically. Our main motivation is that the smooth changes between consecutive video frames allow us to build accurate space-time correspondences with no human intervention. Hence, we establish space-time correspondences within each web video and leverage them to derive pseudo correspondence labels between two distant frames of the video. In addition, we present a dedicated training strategy that facilitates stable training on web videos with such pseudo labels. Our experiments on public benchmarks demonstrate that the proposed method surpasses existing self-supervised learning models and that using our self-supervised learning as pretraining for supervised learning improves performance substantially. Our codebase for web video crawling and pseudo label generation will be released publicly to promote future research.
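To illustrate the idea of deriving pseudo labels from space-time correspondences, the minimal sketch below chains per-frame correspondence (flow) fields to propagate keypoints from one frame to a distant frame; the resulting point pairs can serve as pseudo correspondence labels. This is only a conceptual sketch, not the authors' implementation: the function names, the use of NumPy arrays, and the assumption that dense frame-to-frame flow fields are available (e.g., from an off-the-shelf estimator) are all illustrative assumptions.

# Conceptual sketch (not the authors' code): chain consecutive-frame flow
# fields to propagate keypoints from frame t to a distant frame t+K,
# yielding pseudo correspondence labels between the two frames.
import numpy as np


def warp_points(points, flow):
    """Displace (x, y) points by the flow sampled at their rounded locations.

    points: (N, 2) array of x, y coordinates.
    flow:   (H, W, 2) array of per-pixel displacements to the next frame.
    """
    h, w = flow.shape[:2]
    xs = np.clip(np.round(points[:, 0]).astype(int), 0, w - 1)
    ys = np.clip(np.round(points[:, 1]).astype(int), 0, h - 1)
    return points + flow[ys, xs]


def propagate_keypoints(keypoints, flows):
    """Chain consecutive-frame flows into long-range pseudo correspondences.

    keypoints: (N, 2) keypoints in the first frame.
    flows:     list of K flow fields; flows[i] maps frame i to frame i+1.
    Returns the estimated keypoint locations in the last frame.
    """
    points = keypoints.astype(np.float64)
    for flow in flows:  # smooth inter-frame changes keep each step reliable
        points = warp_points(points, flow)
    return points  # (keypoints, points) pairs act as pseudo labels

The key design point this sketch reflects is that each single step is easy because consecutive frames change smoothly, while composing many such steps yields the harder long-range correspondences that are actually useful for training.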