Sungyeon Kim1 | Minkyo Seo1 | Ivan Laptev2 | Suha Kwak1 | Minsu Cho1
1POSTECH | 2Inria
Metric learning for visual similarity has mostly adopted binary supervision indicating whether a pair of images belongs to the same class or not. Such a binary indicator covers only a limited subset of image relations, and is not sufficient to represent semantic similarity between images described by continuous and/or structured labels such as object poses, image captions, and scene graphs. Motivated by this, we present a novel method for deep metric learning using continuous labels. First, we propose a new triplet loss that allows distance ratios in the label space to be preserved in the learned metric space. The proposed loss thus enables our model to learn the degree of similarity rather than just the order. Furthermore, we design a triplet mining strategy adapted to metric learning with continuous labels. We address three different image retrieval tasks with continuous labels in terms of human poses, room layouts, and image captions, and demonstrate the superior performance of our approach compared to previous methods.
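To make the ratio-preserving idea concrete, below is a minimal PyTorch-style sketch of a triplet loss that penalizes the discrepancy between log distance ratios in the embedding space and in the continuous label space. The function name, the use of squared Euclidean distances, and the epsilon smoothing are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import torch

def ratio_preserving_triplet_loss(anchor, x_i, x_j, y_dist_ai, y_dist_aj, eps=1e-8):
    """Sketch of a loss that matches log distance ratios in the embedding
    space to log distance ratios in the continuous label space.

    anchor, x_i, x_j: embeddings of the anchor and two neighbors, shape (B, D).
    y_dist_ai, y_dist_aj: continuous label distances between the anchor and
    each neighbor (e.g., distances between human poses), shape (B,).
    """
    # Squared Euclidean distances in the embedding space (smoothed by eps).
    d_ai = (anchor - x_i).pow(2).sum(dim=1) + eps
    d_aj = (anchor - x_j).pow(2).sum(dim=1) + eps

    # Log distance ratios in the embedding space and in the label space.
    log_ratio_embed = torch.log(d_ai) - torch.log(d_aj)
    log_ratio_label = torch.log(y_dist_ai + eps) - torch.log(y_dist_aj + eps)

    # Penalize the squared difference so label-space ratios are preserved.
    return (log_ratio_embed - log_ratio_label).pow(2).mean()
```

Under this formulation, neighbors need not be labeled positive or negative; any pair of neighbors with known label distances to the anchor can form a training triplet.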
Figure 1. A conceptual illustration comparing existing methods and ours. Each image is labeled by a human pose and colored red if its pose similarity to the anchor is high. (a) Existing methods categorize neighbors into positive and negative classes, and learn a metric space where positive images are close to the anchor and negative ones are far apart. In such a space, the distance between a pair of images is not necessarily related to their semantic similarity, since the order and degrees of similarity between them are disregarded. (b) Our approach allows distance ratios in the label space to be preserved in the learned metric space, thereby overcoming this limitation.
Figure 2. Quantitative evaluation of the three retrieval tasks in terms of mean label distance (top) and mean nDCG (bottom).
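For reference, nDCG with continuous relevance values can be computed as in the minimal sketch below; it assumes per-query relevance scores derived from label distances (closer labels yielding higher relevance), which may differ from the exact relevance definition used in our evaluation.

```python
import numpy as np

def ndcg(relevance):
    """Sketch: nDCG for one query, given relevance of retrieved items in
    retrieval order (e.g., relevance derived from label distance)."""
    rel = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..n -> log2(2..n+1)
    dcg = np.sum(rel / discounts)
    ideal = np.sort(rel)[::-1]                       # best possible ordering
    idcg = np.sum(ideal / discounts)
    return dcg / idcg if idcg > 0 else 0.0

# Mean nDCG over queries: np.mean([ndcg(r) for r in per_query_relevance])
```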
Table 1. Captioning performance on the Karpathy test split. We report scores obtained by a single model with the beam search algorithm (beam size = 2). ATT: Att2all2. TD: Top-down. Img: ImageNet pretrained feature. Cap: Caption-aware feature. XE: Pretrained with cross-entropy. RL: Finetuned by reinforcement learning. B4: BLEU-4. C: CIDEr-D. M: METEOR. R: ROUGE-L. S: SPICE.
Figure 3. Qualitative results of human pose retrieval.
Figure 4. Qualitative results of room layout retrieval. For easier evaluation, the retrieved images are blended with their ground-truth masks, and their mIoU scores are reported together. Binary Tri.: L(Triplet)+M(Binary). ImgNet: ImageNet pretrained ResNet101.
Figure 5. Qualitative results of caption-aware image retrieval. Binary Tri.: L(Triplet)+M(Binary). ImgNet: ImageNet pretrained ResNet101.
Figure 6. Attention maps of typical examples from the reinforcement-learned Att2all2 model with the ImageNet pretrained feature (Img RL) and the caption-aware feature (Cap RL).
Check our GitHub repository: [github]