Abstract

Attribute-based person search is the task of finding person images that best match a set of text attributes given as a query. The main challenge of this task is the large modality gap between attributes and images. To reduce the gap, we present a new loss for learning cross-modal embeddings in the context of attribute-based person search. We regard a set of attributes as a category of people sharing the same traits. In a joint embedding space of the two modalities, our loss pulls images close to their person categories for modality alignment. More importantly, it pushes apart a pair of person categories by a margin determined adaptively by their semantic distance, where the distance metric is learned end-to-end so that the loss considers the importance of each attribute when relating person categories. Our loss, guided by the adaptive semantic margin, leads to more discriminative and semantically well-arranged distributions of person images. As a consequence, it enables a simple embedding model to achieve state-of-the-art records on public benchmarks without bells and whistles.
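
Below is a minimal sketch, not the official implementation, of how an adaptive semantic margin between two person categories could be computed: the disagreement between their binary attribute vectors is weighted by a learnable per-attribute importance, so the margin itself is learned end-to-end. The class name `AdaptiveSemanticMargin`, the argument `num_attributes`, and the exact weighting scheme are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AdaptiveSemanticMargin(nn.Module):
    """Sketch: semantic distance between person categories with learned attribute importance."""

    def __init__(self, num_attributes: int):
        super().__init__()
        # Learnable importance of each attribute, trained end-to-end with the rest of the model.
        self.attr_weight = nn.Parameter(torch.ones(num_attributes))

    def forward(self, cat_a: torch.Tensor, cat_b: torch.Tensor) -> torch.Tensor:
        # cat_a, cat_b: (B, num_attributes) binary attribute vectors of two person categories.
        importance = torch.softmax(self.attr_weight, dim=0)
        # Weighted disagreement between the two categories serves as their
        # semantic distance, which in turn determines the adaptive margin.
        return ((cat_a - cat_b).abs() * importance).sum(dim=1)
```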

Overall Architecture

Figure 1. Overall pipeline of our method. An image is embedded by a conventional CNN followed by an MLP, while the query set of attributes, called a person category, is converted to a binary vector and encoded by a separate embedding network. In their joint embedding space, a positive pair of an image embedding and a semantic prototype is pulled together while a negative pair is pushed apart for cross-modal alignment. Also, each pair of semantic prototypes is pushed apart or pulled together by a margin determined adaptively by their semantic affinity.
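
The following is a hedged sketch of a two-branch embedding model matching this pipeline, assuming a ResNet-50 backbone and two-layer MLPs; the backbone choice, layer sizes, and the name `JointEmbeddingModel` are illustrative, not necessarily the paper's exact configuration.

```python
import torch
import torch.nn as nn
import torchvision.models as models

class JointEmbeddingModel(nn.Module):
    """Sketch: image branch (CNN + MLP) and attribute branch (MLP over a binary vector)."""

    def __init__(self, num_attributes: int, embed_dim: int = 512):
        super().__init__()
        backbone = models.resnet50()          # ImageNet-pretrained weights would typically be loaded here
        backbone.fc = nn.Identity()           # keep the 2048-d pooled features
        self.image_encoder = backbone
        self.image_mlp = nn.Sequential(
            nn.Linear(2048, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )
        # The query attributes (person category) enter as a binary vector.
        self.attr_mlp = nn.Sequential(
            nn.Linear(num_attributes, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, images: torch.Tensor, attrs: torch.Tensor):
        img_emb = self.image_mlp(self.image_encoder(images))
        cat_emb = self.attr_mlp(attrs)        # semantic prototype of each person category
        # L2-normalize so both modalities live in the same joint embedding space.
        return (nn.functional.normalize(img_emb, dim=1),
                nn.functional.normalize(cat_emb, dim=1))
```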

Learning Objective

Figure 2. A conceptual illustration of the learning objective in Eq. (1) of our paper. The modality alignment loss pulls images close to their person categories within the margin 𝛄. Meanwhile, ASMR controls margins between person categories according to their semantic affinities.
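
As a rough illustration of this objective (the exact functional form in the paper may differ), the sketch below combines a margin-based modality alignment term with an ASMR-style regularizer that encourages the distance between two semantic prototypes to follow their adaptive semantic margin. The function names, the hardest-negative term, and the squared-error form of the regularizer are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def modality_alignment_loss(img_emb, cat_emb, labels, gamma=0.2):
    # img_emb: (B, D) image embeddings, cat_emb: (C, D) semantic prototypes,
    # labels: (B,) index of each image's person category.
    dist = torch.cdist(img_emb, cat_emb)                   # (B, C) pairwise distances
    pos = dist.gather(1, labels.view(-1, 1)).squeeze(1)    # distance to the own category
    mask = F.one_hot(labels, cat_emb.size(0)).bool()
    neg = dist.masked_fill(mask, float('inf')).min(dim=1).values  # hardest negative category
    # Pull positives within the margin gamma and push the hardest negative beyond it.
    return (F.relu(pos - gamma) + F.relu(gamma - neg)).mean()

def asmr_regularizer(cat_emb, semantic_margin):
    # semantic_margin: (C, C) adaptive margins derived from learned attribute importance.
    proto_dist = torch.cdist(cat_emb, cat_emb)
    # Encourage prototype distances to match the adaptive semantic margins.
    return ((proto_dist - semantic_margin) ** 2).mean()
```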

Datasets

1. Statistics

Table 1. Statistics of the PETA, Market-1501 Attribute, and PA100K datasets.

2. Attribute Groups

Table 2. Lists of attribute groups in the three benchmark datasets for attribute-based person search.

Quantitative Results

Table 3. Performance on standard benchmarks in Rank@k and mAP. Dim indicates the embedding dimension of the methods based on cross-modal embeddings. Bold and underline denote the best and the second best, respectively. † indicates results reproduced using the official implementation.

Qualitative Results

1. Qualitative results on the PETA dataset.

Figure 3. Top 5 retrieval results of our method on the PETA dataset. Images are sorted from left to right according to their ranks. Green and red boxes indicate true and false matches, respectively. Queries are presented above their retrieved images; blanks indicate attributes that do not exist in the query.

2. Qualitative results on the Market-1501 Attribute dataset.

Figure 4. Top 10 retrieval results of our method on the Market-1501 Attribute dataset. Images are sorted from left to right according to their ranks. Green and red boxes indicate true and false matches, respectively. Queries are presented above their retrieved images; blanks indicate attributes that do not exist in the query.

3. Qualitative results on the PA100K dataset.

Figure 5. Top 5 retrieval results of our method on the PA100K dataset. Images are sorted from left to right according to their ranks. Green and red boxes indicate true and false matches, respectively. Queries are presented above their retrieved images; blanks indicate attributes that do not exist in the query.

4. Effectiveness of ASMR

Figure 6. Top 5 retrieval results of our method and its variant without ASMR on the Market-1501 Attribute dataset. Images are sorted from left to right according to their ranks. Green and red boxes indicate true and false matches, respectively.

Paper

ASMR: Learning Attribute-Based Person Search with Adaptive Semantic Margin Regularizer
Boseung Jeong*, Jicheol Park*, Suha Kwak
ICCV, 2021
[TBD] [TBD]

Code

TBD