Sungyeon Kim1 | Dongwon Kim1 | Minsu Cho1, 2 | Suha Kwak1, 2 |
1Department of CSE, POSTECH | 2Graduate School of AI, POSTECH |
We present a novel self-taught framework for unsupervised metric learning, which alternates between predicting class-equivalence relations between data via a moving average of an embedding model and learning the model with the predicted relations as pseudo labels. At the heart of our framework lies an algorithm that investigates contexts of data on the embedding space to predict their class-equivalence relations as pseudo labels. The algorithm enables efficient end-to-end training since it demands no off-the-shelf module for pseudo labeling. Moreover, the class-equivalence relations provide rich supervisory signals for learning an embedding space. On standard benchmarks for metric learning, our framework clearly outperforms existing unsupervised learning methods and sometimes even beats supervised learning models using the same backbone network. It can also be applied to semi-supervised metric learning as a way of exploiting additional unlabeled data, where it achieves the state of the art by substantially boosting the performance of supervised learning.
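To make the alternating procedure concrete, below is a minimal PyTorch-style sketch of the loop implied by the abstract: a frozen teacher (the moving average of the student) predicts soft pairwise relations that supervise the student, and the teacher is then updated as an exponential moving average of the student. The names and hyperparameters here (`train_stml`, `ema_momentum`, the cosine-similarity pseudo label, and the MSE stand-in objective) are illustrative assumptions, not the authors' implementation; the paper's pseudo labels are contextualized semantic similarities and its objective is a relaxed contrastive loss (sketched after Figure 1).

```python
import copy
import torch
import torch.nn.functional as F

# Minimal sketch of the alternating self-taught loop, under simplifying assumptions:
# `student` is any embedding network and `loader` yields batches of images.
# The pseudo label below is plain cosine similarity, a stand-in for the paper's
# contextualized semantic similarity.

def train_stml(student, loader, epochs=90, ema_momentum=0.999, lr=1e-4):
    teacher = copy.deepcopy(student)          # teacher = moving average of the student
    for p in teacher.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)

    for _ in range(epochs):
        for images in loader:
            # 1) Predict class-equivalence relations on the teacher's embedding space.
            with torch.no_grad():
                t = F.normalize(teacher(images), dim=1)
                pseudo = (t @ t.t()).clamp(min=0)   # soft pairwise similarity in [0, 1]

            # 2) Train the student with the predicted relations as pseudo labels.
            #    (MSE on similarity matrices is a stand-in for the relaxed contrastive loss.)
            s = F.normalize(student(images), dim=1)
            loss = F.mse_loss(s @ s.t(), pseudo)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # 3) Update the teacher as an exponential moving average of the student.
            with torch.no_grad():
                for tp, sp in zip(teacher.parameters(), student.parameters()):
                    tp.mul_(ema_momentum).add_(sp, alpha=1.0 - ema_momentum)
    return student
```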
Figure 1. An overview of our STML framework. First, the contextualized semantic similarity between a pair of data is estimated on the embedding space of the teacher network. The semantic similarity is then used as a pseudo label, and the student network is optimized by the relaxed contrastive loss with KL divergence. Pink arrows represent backward gradient flows. Finally, the teacher network is updated as an exponential moving average of the student. The student network learns by iterating these steps a number of times, and its backbone and embedding layer (in light green) constitute our final model.
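As a companion to the caption above, here is a hedged sketch of a relaxed contrastive loss driven by the teacher's soft similarities: the hard positive/negative labels of the ordinary contrastive loss are replaced by a pseudo-label weight w_ij in [0, 1], and pairwise distances are normalized by their batch mean. The exact formulation in the paper (including the KL-divergence term and the margin value `delta`) may differ from this sketch, so treat it as an illustrative assumption rather than the authors' code.

```python
import torch
import torch.nn.functional as F

# Sketch of a relaxed contrastive loss with soft pairwise pseudo labels.
# embeddings: (n, d) student embeddings; w: (n, n) soft similarity in [0, 1].
def relaxed_contrastive_loss(embeddings, w, delta=1.0, eps=1e-8):
    dist = torch.cdist(embeddings, embeddings)            # pairwise Euclidean distances
    rel_dist = dist / (dist.mean() + eps)                 # relative (scale-normalized) distance
    attract = w * rel_dist.pow(2)                         # pull pairs together in proportion to w
    repel = (1.0 - w) * F.relu(delta - rel_dist).pow(2)   # push pairs apart, weighted by 1 - w
    n = embeddings.size(0)
    off_diag = 1.0 - torch.eye(n, device=embeddings.device)
    return ((attract + repel) * off_diag).sum() / n
```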
Figure 2. Accuracy in Recall@1 versus embedding dimension on the CUB-200-2011 dataset using a GoogleNet backbone. Superscripts denote embedding dimensions and † indicates supervised learning methods. Our model with a 128-dimensional embedding outperforms all previous methods, even those with higher embedding dimensions, and sometimes surpasses supervised learning methods.
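For reference, Recall@K (the metric plotted above) counts a query as correct if at least one of its K nearest neighbors in the embedding space shares the query's class. Below is a small, self-contained sketch of how it is commonly computed; the function name and interface are illustrative.

```python
import torch
import torch.nn.functional as F

def recall_at_k(embeddings, labels, k=1):
    """embeddings: (n, d) tensor, labels: (n,) tensor; each sample queries all others."""
    x = F.normalize(embeddings, dim=1)
    sim = x @ x.t()
    sim.fill_diagonal_(float('-inf'))        # exclude the query itself from its neighbors
    knn = sim.topk(k, dim=1).indices         # indices of the k nearest neighbors
    hit = (labels[knn] == labels.unsqueeze(1)).any(dim=1)
    return hit.float().mean().item()
```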
Table 1. Performance of unsupervised and supervised metric learning methods on the three datasets. Network architectures are denoted by abbreviations (G: GoogleNet, BN: Inception with BatchNorm), and superscripts denote embedding dimensions.
Table 2. Performance on the SOP dataset using ResNet18 without pre-trained weights.
Table 3. Performance of the supervised and semi-supervised methods on the two datasets. Network architectures of all methods are ResNet50 (R50) and superscripts denote their embedding dimensions. The column “Init.” indicates whether the models are pre-trained on ImageNet or by SwAV. Our model and SLADE are both fine-tuned with the MS loss.
Figure 3. t-SNE visualization of (a) the ImageNet pre-trained model and (b) the model trained by STML on the test split of the SOP dataset. A green dotted box indicates query samples, and a red cross marks samples whose class labels differ from the query's.
Figure 4. Top-3 retrievals of our model at every 30 epochs on the CUB-200-2011 dataset. Images with green borders are correct retrievals and those with red borders are incorrect.
Check our GitHub repository: [github]