Abstract

Recent state-of-the-art methods for HOI detection typically build on transformer architectures with two decoder branches, one for human-object pair detection and the other for interaction classification. Such disentangled transformers, however, may suffer from insufficient context exchange between the branches, lacking the contextual information needed for relational reasoning, which is critical in discovering HOI instances. In this work, we propose the multiplex relation network (MUREN) that performs rich context exchange between three decoder branches using unary, pairwise, and ternary relations of human, object, and interaction tokens. The proposed method learns comprehensive relational contexts for discovering HOI instances, achieving state-of-the-art performance on two standard benchmarks for HOI detection, HICO-DET and V-COCO.

Multiplex Relation Network (MUREN)

Figure 1. The overall architecture of MUREN. The proposed method adopts a three-branch architecture with a human branch, an object branch, and an interaction branch, responsible for human detection, object detection, and interaction classification, respectively. The input image is fed into a CNN backbone followed by a transformer encoder to extract the image tokens. A transformer decoder layer in each branch extracts task-specific tokens for its sub-task. MURE takes the task-specific tokens as input and generates the multiplex relation context for relational reasoning. The attentive fusion module propagates the multiplex relation context to each sub-task for context exchange. The outputs of the last layer of each branch are used to predict the HOI instances.
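For readers who prefer code over diagrams, below is a minimal PyTorch sketch of one such decoder layer: three branch decoders over the image tokens, a relation-context step, and an attentive fusion step that feeds the context back to each branch. All module choices, dimensions, and names here are illustrative assumptions rather than the released implementation; a closer sketch of the relation module itself follows Figure 2.

import torch
import torch.nn as nn

class MURENLayer(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # One transformer decoder layer per branch: human, object, interaction.
        self.branches = nn.ModuleList(
            [nn.TransformerDecoderLayer(d_model, nhead, batch_first=True) for _ in range(3)])
        # Stand-in for MURE: mix the three task-token sets, then attend to the image tokens.
        self.mix = nn.Linear(3 * d_model, d_model)
        self.relation = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        # Attentive fusion: each branch attends to the multiplex relation context.
        self.fusion = nn.ModuleList(
            [nn.MultiheadAttention(d_model, nhead, batch_first=True) for _ in range(3)])

    def forward(self, queries, image_tokens):
        # queries: list of three [B, N, d] tensors (human, object, interaction queries)
        # image_tokens: [B, HW, d] tokens from the CNN backbone + transformer encoder
        task = [dec(q, image_tokens) for dec, q in zip(self.branches, queries)]
        fused = self.mix(torch.cat(task, dim=-1))                        # [B, N, d]
        relation_ctx, _ = self.relation(fused, image_tokens, image_tokens)
        # Propagate the relation context back to every sub-task (context exchange).
        return [tok + fuse(tok, relation_ctx, relation_ctx)[0]
                for fuse, tok in zip(self.fusion, task)]

# Usage: 100 HOI queries per branch on a 256-d token space.
layer = MURENLayer()
outputs = layer([torch.randn(2, 100, 256) for _ in range(3)], torch.randn(2, 1024, 256))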

Multiplex Relation Embedding Module (MURE)

Figure 2. The architecture of the multiplex relation embedding module (MURE). MURE takes the i-th task-specific tokens and the image tokens as input, and embeds the unary and pairwise relation contexts into the ternary relation context. The multiplex relation context, the output of MURE, is fed into the subsequent attentive fusion module for context exchange.
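As a rough sketch of what the caption describes, the module below builds unary, pairwise, and ternary relation contexts from the human, object, and interaction tokens and grounds the result in the image tokens via cross-attention. Only the unary/pairwise/ternary structure is taken from the paper; the specific projections, attention placement, and names are assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class MURE(nn.Module):
    def __init__(self, d_model=256, nhead=8):
        super().__init__()
        # Unary contexts: one projection per token type (human, object, interaction).
        self.unary = nn.ModuleList([nn.Linear(d_model, d_model) for _ in range(3)])
        # Pairwise contexts: one projection per token pair (h-o, o-i, i-h).
        self.pairwise = nn.ModuleList([nn.Linear(2 * d_model, d_model) for _ in range(3)])
        # Ternary context: all three token types together.
        self.ternary = nn.Linear(3 * d_model, d_model)
        # Cross-attention grounds the relation context in the image tokens.
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, task_tokens, image_tokens):
        h, o, i = task_tokens                                            # each [B, N, d]
        unary = [f(t) for f, t in zip(self.unary, (h, o, i))]
        pairwise = [f(torch.cat(p, dim=-1))
                    for f, p in zip(self.pairwise, ((h, o), (o, i), (i, h)))]
        ternary = self.ternary(torch.cat((h, o, i), dim=-1))
        # Embed the unary and pairwise contexts into the ternary relation context.
        ctx = ternary + sum(unary) + sum(pairwise)
        multiplex_ctx, _ = self.attn(ctx, image_tokens, image_tokens)
        return multiplex_ctx                                             # [B, N, d]

# Usage: the resulting multiplex relation context is passed to the attentive fusion module.
mure = MURE()
ctx = mure([torch.randn(2, 100, 256) for _ in range(3)], torch.randn(2, 1024, 256))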

Experimental results

Performance comparison with the state of the art (HICO-DET)

Table 1. Performance comparison on the HICO-DET [1] dataset. The letters in the Feature column stand for A: Appearance/Visual features, S: Spatial features, L: Linguistic features, P: Human pose features, M: Multi-scale features. The best score is highlighted in bold, and the second-best score is underlined.

Performance comparison with the state of the art (V-COCO)

Table 2. Performance comparison on the V-COCO [2] dataset. The letters in the Feature column stand for A: Appearance/Visual features, S: Spatial features, L: Linguistic features, P: Human pose features, M: Multi-scale features. The best score is highlighted in bold, and the second-best score is underlined.

Ablation studies

Table 3. The impact of each type of relation context on relational reasoning. The ‘ternary’, ‘unary’, and ‘pairwise’ columns indicate the use of the ternary, unary, and pairwise relation contexts, respectively.

Table 4. The impact of the multiplex relation context on each sub-task. The ‘human’, ‘object’, and ‘interaction’ columns indicate the propagation of the multiplex relation context to the human, object, and interaction branches, respectively.

Table 5. The impact of disentangling the human and object branches. MUREN-(k) denotes sharing parameters between the human and object branches across k layers; parameters are shared only between corresponding layers. The variant of MUREN is obtained by adjusting the number of layers L.

Qualitative results

Figure 3. Visualization of HOI detection results on HICO-DET [1]. Red boxes, blue boxes, and green lines indicate humans, objects, and interactions, respectively.

Attention map visualization

Figure 4. Visualization of the HOI detection results and the cross-attention maps in each branch and in the multiplex relation embedding module (MURE).

Acknowledgements

This work was supported by the IITP grants (2021-0-00537: Visual common sense through self-supervised learning for restoration of invisible parts in images (50%), 2021-0-02068: AI Innovation Hub (40%), and 2019-0-01906: AI graduate school program at POSTECH (10%)) funded by the Korea government (MSIT).

Paper

Relational Context Learning for Human-Object Interaction Detection
Sanghyun Kim, Deunsol Jung and Minsu Cho
CVPR, 2023
[arXiv] [Bibtex]

Code

Check out our GitHub repository: [github]

Reference

[1] Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 381–389. IEEE, 2018.

[2] Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.