Existing metric learning losses can be categorized into two classes: pair-based and proxy-based losses. The former class can leverage fine-grained semantic relations between data points, but generally converges slowly due to its high training complexity. In contrast, the latter class enables fast and reliable convergence, but cannot consider the rich data-to-data relations. This paper presents a new proxy-based loss that takes advantage of both pair- and proxy-based methods and overcomes their limitations. Thanks to the use of proxies, our loss boosts the speed of convergence and is robust against noisy labels and outliers. At the same time, it allows embedding vectors of data to interact with each other through its gradients to exploit data-to-data relations. Our method is evaluated on four public benchmarks, where a standard network trained with our loss achieves state-of-the-art performance and converges most quickly.

Comparison Between Other Metric Learning Losses

Figure 1. Comparison between popular metric learning losses and ours. Small nodes are embedding vectors of data in a batch, and black ones indicate proxies; their different shapes represent distinct classes. The associations defined by the losses are expressed by edges, and thicker edges get larger gradients. Also, embedding vectors associated with the anchor are colored in red if they are of the same class as the anchor (i.e., positive) and in blue otherwise (i.e., negative). (a) Triplet loss associates each anchor with a positive and a negative data point without considering their hardness. (b) N-pair loss and (c) Lifted Structure loss reflect hardness of data, but do not utilize all data in the batch. (d) Proxy-NCA loss cannot exploit data-to-data relations since it associates each data point only with proxies. (e) Our loss handles all data in the batch, and associates them with each proxy while taking into account their relative hardness, which is determined by data-to-data relations. See the text for more details.
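
The association in (e) can be sketched in a few lines of code. The snippet below is a minimal, unoptimized illustration and not the official implementation: it assumes cosine similarity, one proxy per class, and the log-sum-exp form of the loss with a scaling factor `alpha` and a margin `delta`. The function names and the default values `alpha=32.0` and `delta=0.1` are assumptions made for this sketch. Each proxy acts as an anchor and is compared with every embedding in the batch; because all positives (or negatives) of a proxy share one log-sum-exp term, the gradient for each embedding depends on how hard it is relative to the others.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def proxy_anchor_loss(embeddings, labels, proxies, alpha=32.0, delta=0.1):
    """Sketch of a proxy-anchor style loss for one batch.

    embeddings: list of embedding vectors in the batch
    labels:     class index of each embedding
    proxies:    proxies[c] is the (learnable) proxy of class c
    alpha, delta: scaling factor and margin (illustrative defaults)
    """
    pos_term, neg_term = 0.0, 0.0
    num_pos_proxies = 0  # proxies that have at least one positive in the batch
    for c, proxy in enumerate(proxies):
        sims = [cosine(x, proxy) for x in embeddings]
        pos = [s for s, y in zip(sims, labels) if y == c]
        neg = [s for s, y in zip(sims, labels) if y != c]
        if pos:
            num_pos_proxies += 1
            # Pull positives toward the proxy; low-similarity positives dominate.
            pos_term += math.log1p(sum(math.exp(-alpha * (s - delta)) for s in pos))
        # Push negatives away from the proxy; high-similarity negatives dominate.
        neg_term += math.log1p(sum(math.exp(alpha * (s + delta)) for s in neg))
    if num_pos_proxies:
        pos_term /= num_pos_proxies
    return pos_term + neg_term / len(proxies)


# Toy batch: two classes in 2-D, proxies on the coordinate axes.
proxies = [[1.0, 0.0], [0.0, 1.0]]
batch = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
print(proxy_anchor_loss(batch, [0, 0, 1], proxies))
```

In practice the proxies would be trainable parameters updated together with the embedding network, and the computation would be vectorized on the GPU; see the GitHub repository for the actual code.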

Speed of Convergence

Figure 2. Accuracy in Recall@1 versus training time on the Cars-196 dataset. Note that all methods were trained with a batch size of 150 on a single Titan Xp GPU. Our loss achieves the highest accuracy and converges faster than the baselines in terms of both the number of epochs and the actual training time.

Quantitative Results

Table 1. Recall@K (%) on the CUB-200-2011 and Cars-196 datasets. Superscripts denote embedding sizes and indicates models using larger input images. Backbone networks of the models are denoted by abbreviations: G–GoogleNet, BN–Inception with batch normalization, R50–ResNet50.

Table 2. Recall@K (%) on the SOP and In-Shop datasets. Superscripts denote embedding sizes and indicates models using larger input images.
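
For readers unfamiliar with the metric in the tables, the following is a minimal sketch of how Recall@K can be computed. It assumes cosine similarity and a protocol in which every test image is used as a query against all other test images; a benchmark with a separate query and gallery split would rank the gallery instead. The function names are hypothetical and chosen for this illustration. A query counts as a hit if at least one of its K nearest neighbors shares its class label.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(embeddings, labels, k):
    """Percentage of queries with a same-class sample among their k nearest neighbors."""
    hits = 0
    for i, query in enumerate(embeddings):
        # Rank every other sample by similarity to the query (the query itself is excluded).
        ranked = sorted(
            ((cosine(query, x), labels[j]) for j, x in enumerate(embeddings) if j != i),
            key=lambda pair: pair[0],
            reverse=True,
        )
        if any(label == labels[i] for _, label in ranked[:k]):
            hits += 1
    return 100.0 * hits / len(embeddings)


# Toy example: two tight clusters, one per class.
points = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
print(recall_at_k(points, [0, 0, 1, 1], k=1))
```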

Qualitative Results

Figure 3. Qualitative results on the CUB-200-2011 (a), Cars-196 (b), SOP (c), and In-Shop (d) datasets. For each query image (leftmost), the top-4 retrievals are presented. The results with red boundaries are failure cases, which are nevertheless substantially similar to their query images in terms of appearance.


Proxy Anchor Loss for Deep Metric Learning
Sungyeon Kim, Dongwon Kim, Minsu Cho and Suha Kwak
CVPR, 2020
[arXiv] [Bibtex]


Check our GitHub repository: [github]