Sungyeon Kim1 | Minkyo Seo1 | Ivan Laptev2 | Suha Kwak1 | Minsu Cho1
1POSTECH | 2Inria
Metric learning for visual similarity has mostly adopted binary supervision indicating whether a pair of images belongs to the same class or not. Such a binary indicator covers only a limited subset of image relations, and is not sufficient to represent semantic similarity between images described by continuous and/or structured labels such as object poses, image captions, and scene graphs. Motivated by this, we present a novel method for deep metric learning using continuous labels. First, we propose a new triplet loss that allows distance ratios in the label space to be preserved in the learned metric space. The proposed loss thus enables our model to learn the degree of similarity rather than just the order. Furthermore, we design a triplet mining strategy adapted to metric learning with continuous labels. We address three different image retrieval tasks with continuous labels in terms of human poses, room layouts, and image captions, and demonstrate the superior performance of our approach compared to previous methods.
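To make the ratio-preserving idea concrete, below is a minimal PyTorch-style sketch of a triplet loss that penalizes the discrepancy between log distance ratios in the embedding space and in the continuous label space. The function name, the use of squared Euclidean distances, and the epsilon smoothing are illustrative assumptions, not necessarily the exact formulation used in the paper.

```python
import torch

def ratio_preserving_triplet_loss(anchor, x_i, x_j, y_dist_ai, y_dist_aj, eps=1e-8):
    """Sketch of a loss that matches log distance ratios in the embedding
    space to log distance ratios in the continuous label space.

    anchor, x_i, x_j: embeddings of the anchor and two neighbors, shape (B, D).
    y_dist_ai, y_dist_aj: continuous label distances between the anchor and
    each neighbor (e.g., distances between human poses), shape (B,).
    """
    # Squared Euclidean distances in the embedding space (smoothed by eps).
    d_ai = (anchor - x_i).pow(2).sum(dim=1) + eps
    d_aj = (anchor - x_j).pow(2).sum(dim=1) + eps

    # Log distance ratios in the embedding space and in the label space.
    log_ratio_embed = torch.log(d_ai) - torch.log(d_aj)
    log_ratio_label = torch.log(y_dist_ai + eps) - torch.log(y_dist_aj + eps)

    # Penalize the squared difference so label-space ratios are preserved.
    return (log_ratio_embed - log_ratio_label).pow(2).mean()
```

Under this formulation, neighbors need not be labeled positive or negative; any pair of neighbors with known label distances to the anchor can form a training triplet.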
Figure 1. A conceptual illustration comparing existing methods and ours. Each image is labeled by a human pose and colored red if its pose similarity to the anchor is high. (a) Existing methods categorize neighbors into positive and negative classes, and learn a metric space where positive images are close to the anchor and negative ones are far apart. In such a space, the distance between a pair of images is not necessarily related to their semantic similarity, since the order and degrees of similarity between them are disregarded. (b) Our approach allows distance ratios in the label space to be preserved in the learned metric space, thereby overcoming this limitation.
Figure 2. Quantitative evaluation of the three retrieval tasks in terms of mean label distance (top) and mean nDCG (bottom).
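For reference, nDCG with continuous relevance values can be computed as in the minimal sketch below; it assumes per-query relevance scores derived from label distances (closer labels yielding higher relevance), which may differ from the exact relevance definition used in our evaluation.

```python
import numpy as np

def ndcg(relevance):
    """Sketch: nDCG for one query, given relevance of retrieved items in
    retrieval order (e.g., relevance derived from label distance)."""
    rel = np.asarray(relevance, dtype=float)
    discounts = np.log2(np.arange(2, rel.size + 2))  # positions 1..n -> log2(2..n+1)
    dcg = np.sum(rel / discounts)
    ideal = np.sort(rel)[::-1]                       # best possible ordering
    idcg = np.sum(ideal / discounts)
    return dcg / idcg if idcg > 0 else 0.0

# Mean nDCG over queries: np.mean([ndcg(r) for r in per_query_relevance])
```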
Table 1. Captioning performance on the Karpathy test split. We report scores obtained by a single model with the beam search algorithm (beam size = 2). ATT: Att2all2. TD: Top-down. Img: ImageNet pretrained feature. Cap: Caption-aware feature. XE: Pretrained with cross-entropy. RL: Finetuned by reinforcement learning. B4: BLEU-4. C: CIDEr-D. M: METEOR. R: ROUGE-L. S: SPICE.
Figure 3. Qualitative results of human pose retrieval.
Figure 4. Qualitative results of room layout retrieval. For easier evaluation, the retrieved images are blended with their ground-truth masks, and their mIoU scores are reported together. Binary Tri.: L(Triplet)+M(Binary). ImgNet: ImageNet pretrained ResNet101.
Figure 5. Qualitative results of caption-aware image retrieval. Binary Tri.: L(Triplet)+M(Binary). ImgNet: ImageNet pretrained ResNet101.
Figure 6. Attention maps of typical examples from the reinforcement-learned Att2all2 model with the ImageNet pretrained feature (Img RL) and the caption-aware feature (Cap RL).
Check our GitHub repository: [github]