Abstract

Supervision for metric learning has long been given in the form of equivalence between human-labeled classes. Although this type of supervision has been a basis of metric learning for decades, we argue that it hinders further advances in the field. In this regard, we propose a new regularization method, dubbed HIER, to discover the latent semantic hierarchy of training data and to deploy the hierarchy to provide richer and more fine-grained supervision than the inter-class separability induced by common metric learning losses. HIER achieves this goal with no annotation of the semantic hierarchy, but by learning hierarchical proxies in hyperbolic space. The hierarchical proxies are learnable parameters, each of which is trained to serve as an ancestor of a group of data points or other proxies so as to approximate the semantic hierarchy among them. HIER handles the proxies together with the data in hyperbolic space, since the geometric properties of that space are well suited to representing their hierarchical structure. The efficacy of HIER is evaluated on four standard benchmarks, where it consistently improves the performance of conventional methods when integrated with them, and consequently achieves the best results, surpassing even the existing hyperbolic metric learning technique, in almost all settings.
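
For background, hyperbolic metric learning methods of this kind typically measure distances on the Poincaré ball. Below is a minimal, self-contained sketch of that distance in PyTorch; the function name, the curvature fixed at -1, and the clamping constants are our illustrative choices, not taken from the paper.

```python
import torch

def poincare_distance(x, y, eps=1e-5):
    """Geodesic distance on the Poincare ball of curvature -1.

    x, y: tensors of shape (..., d) with norms strictly below 1.
    """
    # Clamp squared norms so points stay strictly inside the unit ball.
    x2 = x.pow(2).sum(-1).clamp(max=1 - eps)
    y2 = y.pow(2).sum(-1).clamp(max=1 - eps)
    xy2 = (x - y).pow(2).sum(-1)
    # d(x, y) = arccosh(1 + 2||x - y||^2 / ((1 - ||x||^2)(1 - ||y||^2)))
    arg = 1 + 2 * xy2 / ((1 - x2) * (1 - y2))
    return torch.acosh(arg.clamp(min=1.0))
```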

Figure 1. Motivation of HIER. HIER aims to discover an inherent but latent semantic hierarchy of data (colored dots on the boundary) by learning hierarchical proxies (larger black dots) in hyperbolic space. The semantic hierarchy is deployed to provide rich and fine-grained supervision that cannot be derived from human-labeled classes alone.

Conceptual illustration of HIER

Figure 2. A conceptual illustration of the learning objective. Each hierarchical proxy is colored in black, and different colors indicate different classes. The associations defined by the losses are expressed by edges, where red solid lines denote pulling and blue dotted lines denote pushing. Relevant pairs are pulled toward their lowest common ancestor (LCA), and the remaining sample is pulled toward the LCA of the entire triplet.
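
To make the objective above concrete, here is a triplet-style hierarchical regularizer in the spirit of Figure 2, reusing the hypothetical `poincare_distance` helper from the Abstract. The hard argmin-based LCA selection, the margin `delta`, and all names are illustrative assumptions; the actual method learns soft proxy assignments, so treat this as one reading of the caption rather than the authors' implementation.

```python
import torch

def hier_triplet_loss(x_i, x_j, x_k, proxies, delta=0.1):
    """Sketch of a hierarchical triplet regularizer (hard-LCA approximation).

    x_i, x_j: the relevant pair; x_k: the remaining sample (all shape (d,));
    proxies: (P, d) learnable hierarchical proxies on the Poincare ball.
    """
    def dist_to_proxies(x):
        return poincare_distance(x.unsqueeze(0).expand_as(proxies), proxies)

    d_i, d_j, d_k = map(dist_to_proxies, (x_i, x_j, x_k))
    # Approximate the LCA of a set as the proxy minimizing its worst-case
    # distance to the set (the paper uses a softer, learned assignment).
    lca_ij = torch.argmin(torch.maximum(d_i, d_j))
    lca_ijk = torch.argmin(torch.maximum(torch.maximum(d_i, d_j), d_k))
    # Pull the relevant pair toward its LCA and the remaining sample toward
    # the LCA of the whole triplet (red vs. blue edges in Figure 2).
    return (
        torch.relu(d_i[lca_ij] - d_i[lca_ijk] + delta)
        + torch.relu(d_j[lca_ij] - d_j[lca_ijk] + delta)
        + torch.relu(d_k[lca_ijk] - d_k[lca_ij] + delta)
    )
```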

Experimental results

1. Performance comparison with the state of the art

Table 1. Performance of metric learning methods on the four datasets. Their network architectures are denoted by abbreviations: R–ResNet50, B–Inception with BatchNorm, De–DeiT, DN–DINO, and V–ViT. Note that ViT is pretrained on ImageNet-21k. Superscripts denote embedding dimensions, and † indicates models using larger input images.

2. Visualization of embedding space

Figure 3. UMAP visualization of our embedding space learned on the train splits of Cars, CUB, SOP, and In-shop. Pink points indicate hierarchical proxies, and the other colors represent distinct classes. The gray lines indicate ancestor-descendant relations between hierarchical proxies and data points.

Figure 4. UMAP visualizations of our embedding space learned on the train split of the Cars dataset at different epochs. Pink points indicate hierarchical proxies, and the other colors represent distinct classes. The gray lines indicate ancestor-descendant relations between hierarchical proxies and data points.
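
For reference, a figure like the ones above can be produced with the umap-learn package. The snippet below is a generic sketch with stand-in arrays and illustrative hyperparameters, not the paper's visualization script.

```python
import numpy as np
import umap  # pip install umap-learn

# Stand-ins for the learned image embeddings and hierarchical proxies;
# in practice these come from the trained model (shapes are illustrative).
embeddings = np.random.randn(1000, 128).astype(np.float32)
proxies = np.random.randn(512, 128).astype(np.float32)

points = np.concatenate([embeddings, proxies], axis=0)
# 2-D projection; color the last len(proxies) rows pink to mimic Figures 3-4.
coords = umap.UMAP(n_neighbors=15, min_dist=0.1).fit_transform(points)
```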

3. Analysis of the semantic hierarchy of HIER

Figure 5. Class-to-class affinity matrices of Proxy Anchor and ours on the CUB (a) and Cars (b) datasets, showing inter-class similarity. The different colors of the sidebar (13 colors for CUB and 9 for Cars) indicate distinct ground-truth super-classes at the order level, following the hierarchy labels.
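
One simple way to build such a class-to-class affinity matrix is to average the embeddings of each class and take cosine similarities between the class means. The sketch below shows this baseline construction, which may differ from the exact affinity used in Figure 5.

```python
import numpy as np

def class_affinity(embeddings, labels):
    """Cosine affinity between per-class mean embeddings -> (C, C) matrix.

    embeddings: (N, d) array; labels: (N,) integer array of class ids.
    """
    classes = np.unique(labels)
    means = np.stack([embeddings[labels == c].mean(axis=0) for c in classes])
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    return means @ means.T
```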

Figure 6. Top-4 neighbors of hierarchical proxies that lie close to the boundary of the Poincaré ball on the CUB (a) and Cars (b) datasets. The samples in each row are the nearest neighbors of a single hierarchical proxy.
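
The retrieval behind Figure 6 can be approximated as follows: rank proxies by their norm (a larger norm means a point closer to the boundary of the ball) and fetch each proxy's nearest data points under the hyperbolic distance. The snippet reuses the hypothetical `poincare_distance` helper from the earlier sketch and uses stand-in tensors.

```python
import torch

def random_ball(n, d, max_norm=0.95):
    """Stand-in points strictly inside the Poincare ball."""
    v = torch.randn(n, d)
    v = v / v.norm(dim=-1, keepdim=True)
    return v * (max_norm * torch.rand(n, 1))

proxies, embeddings = random_ball(512, 128), random_ball(5000, 128)

# Proxies with larger norm lie closer to the boundary of the ball.
boundary_first = proxies.norm(dim=-1).argsort(descending=True)

# Pairwise hyperbolic distances (P, N) via broadcasting, then top-4 neighbors.
dists = poincare_distance(proxies.unsqueeze(1), embeddings.unsqueeze(0))
top4 = dists.topk(4, dim=1, largest=False).indices  # rows as in Figure 6
```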

4. Ablation studies

Table 2. Recall@1 accuracy of our method with two metric learning losses [1, 2] and their variants on the four datasets. The network architecture is DeiT with a 128-dimensional embedding. HIER_sph denotes HIER applied in spherical space.

Figure 7. Comparison of HIER and Hyp [3] with varying embedding dimensions on the In-shop dataset, using ViT as the backbone network.

5. Qualitative results

Figure 8. Qualitative results of our method and Proxy Anchor on the four public benchmark datasets: CUB (a), Cars (b), SOP (c), and In-Shop (d). Queries and the top-4 retrieval results of our method are presented. True and false matches are colored green and red, respectively.

Acknowledgements

This work was supported by the NRF grant and the IITP grant funded by the Ministry of Science and ICT, Korea (NRF-2021R1A2C3012728–20%, IITP-2019-0-01906–10%, IITP-2020-0-00842–50%, IITP-2022-0-00926–20%).

Paper

HIER: Metric Learning Beyond Class Labels via Hierarchical Regularization
Sungyeon Kim, Boseung Jeong and Suha Kwak
CVPR, 2023
[arXiv] [Bibtex]

Code

Check our GitHub repository: [github]

References

[1] Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Proxy anchor loss for deep metric learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

[2] Xun Wang, Xintong Han, Weilin Huang, Dengke Dong, and Matthew R Scott. Multi-similarity loss with general pair weighting for deep metric learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2019.

[3] Aleksandr Ermolov, Leyla Mirvakhabova, Valentin Khrulkov, Nicu Sebe, and Ivan Oseledets. Hyperbolic vision transformers: Combining improvements in metric learning. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.