Sungyeon Kim1 | Dongwon Kim1 | Minsu Cho1, 2 | Suha Kwak1, 2 |
1Department of CSE, POSTECH | 2Graduate School of AI, POSTECH |
We present a novel self-taught framework for unsupervised metric learning, which alternates between predicting class-equivalence relations between data via a moving average of an embedding model and learning the model with the predicted relations as pseudo labels. At the heart of our framework lies an algorithm that investigates contexts of data on the embedding space to predict their class-equivalence relations as pseudo labels. The algorithm enables efficient end-to-end training since it demands no off-the-shelf module for pseudo labeling. Moreover, the class-equivalence relations provide rich supervisory signals for learning an embedding space. On standard benchmarks for metric learning, our framework clearly outperforms existing unsupervised learning methods and sometimes even beats supervised learning models using the same backbone network. It can also be applied to semi-supervised metric learning as a way of exploiting additional unlabeled data, where it achieves the state of the art by substantially boosting the performance of supervised learning.
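To make the alternating procedure concrete, below is a minimal PyTorch-style sketch of the loop implied by the abstract: a frozen teacher (the moving average of the student) predicts soft pairwise relations that supervise the student, and the teacher is then updated as an exponential moving average of the student. The names and hyperparameters here (`train_stml`, `ema_momentum`, the cosine-similarity pseudo label, and the MSE stand-in objective) are illustrative assumptions, not the authors' implementation; the paper's pseudo labels are contextualized semantic similarities and its objective is a relaxed contrastive loss (sketched after Figure 1).

```python
import copy
import torch
import torch.nn.functional as F

# Minimal sketch of the alternating self-taught loop, under simplifying assumptions:
# `student` is any embedding network and `loader` yields batches of images.
# The pseudo label below is plain cosine similarity, a stand-in for the paper's
# contextualized semantic similarity.

def train_stml(student, loader, epochs=90, ema_momentum=0.999, lr=1e-4):
    teacher = copy.deepcopy(student)          # teacher = moving average of the student
    for p in teacher.parameters():
        p.requires_grad_(False)
    optimizer = torch.optim.AdamW(student.parameters(), lr=lr)

    for _ in range(epochs):
        for images in loader:
            # 1) Predict class-equivalence relations on the teacher's embedding space.
            with torch.no_grad():
                t = F.normalize(teacher(images), dim=1)
                pseudo = (t @ t.t()).clamp(min=0)   # soft pairwise similarity in [0, 1]

            # 2) Train the student with the predicted relations as pseudo labels.
            #    (MSE on similarity matrices is a stand-in for the relaxed contrastive loss.)
            s = F.normalize(student(images), dim=1)
            loss = F.mse_loss(s @ s.t(), pseudo)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

            # 3) Update the teacher as an exponential moving average of the student.
            with torch.no_grad():
                for tp, sp in zip(teacher.parameters(), student.parameters()):
                    tp.mul_(ema_momentum).add_(sp, alpha=1.0 - ema_momentum)
    return student
```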
Figure 1. An overview of our STML framework. First, the contextualized semantic similarity between a pair of data is estimated on the embedding space of the teacher network. The semantic similarity is then used as a pseudo label, and the student network is optimized by the relaxed contrastive loss with KL divergence. Pink arrows represent backward gradient flows. Finally, the teacher network is updated as an exponential moving average of the student. The student network learns by iterating these steps a number of times, and its backbone and embedding layer (in light green) constitute our final model.
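As a companion to the caption above, here is a hedged sketch of a relaxed contrastive loss driven by the teacher's soft similarities: the hard positive/negative labels of the ordinary contrastive loss are replaced by a pseudo-label weight w_ij in [0, 1], and pairwise distances are normalized by their batch mean. The exact formulation in the paper (including the KL-divergence term and the margin value `delta`) may differ from this sketch, so treat it as an illustrative assumption rather than the authors' code.

```python
import torch
import torch.nn.functional as F

# Sketch of a relaxed contrastive loss with soft pairwise pseudo labels.
# embeddings: (n, d) student embeddings; w: (n, n) soft similarity in [0, 1].
def relaxed_contrastive_loss(embeddings, w, delta=1.0, eps=1e-8):
    dist = torch.cdist(embeddings, embeddings)            # pairwise Euclidean distances
    rel_dist = dist / (dist.mean() + eps)                 # relative (scale-normalized) distance
    attract = w * rel_dist.pow(2)                         # pull pairs together in proportion to w
    repel = (1.0 - w) * F.relu(delta - rel_dist).pow(2)   # push pairs apart, weighted by 1 - w
    n = embeddings.size(0)
    off_diag = 1.0 - torch.eye(n, device=embeddings.device)
    return ((attract + repel) * off_diag).sum() / n
```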
Figure 2. Accuracy in Recall@1 versus embedding dimension on the CUB-200-2011 dataset using a GoogleNet backbone. Superscripts denote embedding dimensions and † indicates supervised learning methods. Our model with a 128-dimensional embedding outperforms all previous methods, even those with higher embedding dimensions, and sometimes surpasses supervised learning methods.
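For reference, Recall@K (the metric plotted above) counts a query as correct if at least one of its K nearest neighbors in the embedding space shares the query's class. Below is a small, self-contained sketch of how it is commonly computed; the function name and interface are illustrative.

```python
import torch
import torch.nn.functional as F

def recall_at_k(embeddings, labels, k=1):
    """embeddings: (n, d) tensor, labels: (n,) tensor; each sample queries all others."""
    x = F.normalize(embeddings, dim=1)
    sim = x @ x.t()
    sim.fill_diagonal_(float('-inf'))        # exclude the query itself from its neighbors
    knn = sim.topk(k, dim=1).indices         # indices of the k nearest neighbors
    hit = (labels[knn] == labels.unsqueeze(1)).any(dim=1)
    return hit.float().mean().item()
```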
Table 1. Performance of unsupervised and supervised metric learning methods on the three datasets. Network architectures are denoted by abbreviations (G: GoogleNet, BN: Inception with BatchNorm), and superscripts denote embedding dimensions.
Table 2. Performance on the SOP dataset using ResNet18 without pre-trained weights.
Table 3. Performance of the supervised and semi-supervised methods on the two datasets. Network architectures of all methods are ResNet50 (R50) and superscripts denote their embedding dimensions. The column “Init.” indicates whether the models are pre-trained on ImageNet or by SwAV. Our model and SLADE are both fine-tuned with the MS loss.
Figure 3. t-SNE visualization of (a) the ImageNet pre-trained model and (b) the model trained by STML on the test split of the SOP dataset. A green dotted box indicates query samples, and a red cross marks samples whose class labels differ from the query's.
Figure 4. Top-3 retrievals of our model at every 30 epochs on the CUB-200-2011 dataset. Images with green borders are correct retrievals and those with red borders are incorrect.
Check our GitHub repository: [github]