Experimental results
Performance comparison with the state of the art (HICO-DET)
Table 1.
Performance comparison on the HICO-DET [1] dataset.
The letters in the Feature column stand for A: Appearance/Visual features, S: Spatial features, L: Linguistic features, P: Human pose features, and M: Multi-scale features.
The best score is highlighted in bold, and the second-best score is underlined.
Performance comparison with the state of the art (V-COCO)
Table 2.
Performance comparison on the V-COCO [2] dataset.
The letters in the Feature column stand for A: Appearance/Visual features, S: Spatial features, L: Linguistic features, P: Human pose features, and M: Multi-scale features.
The best score is highlighted in bold, and the second-best score is underlined.
Ablation studies
Table 3.
The impact of each relation context information on relational reasoning.
The ‘ternary’, ‘unary’, and ‘pairwise’ columns indicate the ternary, unary, and pairwise relation contexts, respectively.
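To make this ablation concrete, below is a minimal PyTorch-style sketch of how the three relation contexts could be fused into a single multiplex context. The module name, dimensions, and the concatenate-then-MLP fusion are illustrative assumptions, not the exact MUREN implementation; dropping a context corresponds to turning off the matching flag.

```python
import torch
import torch.nn as nn

class RelationContextFusion(nn.Module):
    """Illustrative fusion of unary, pairwise, and ternary relation contexts.

    Assumes each context is a d-dimensional embedding per HOI query; the
    concatenate-then-MLP fusion is an assumption made for this sketch.
    """
    def __init__(self, dim=256, use_unary=True, use_pairwise=True, use_ternary=True):
        super().__init__()
        self.flags = (use_unary, use_pairwise, use_ternary)
        n_active = sum(self.flags)
        self.mlp = nn.Sequential(
            nn.Linear(n_active * dim, dim),
            nn.ReLU(inplace=True),
            nn.Linear(dim, dim),
        )

    def forward(self, unary, pairwise, ternary):
        # Keep only the contexts enabled for this ablation setting.
        contexts = [c for c, keep in zip((unary, pairwise, ternary), self.flags) if keep]
        return self.mlp(torch.cat(contexts, dim=-1))
```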
Table 4.
The impact of the multiplex relation context on each subtask.
The ‘human’, ‘object’, and ‘interaction’ columns indicate the propagation of the multiplex relation context to human, object, and interaction branch, respectively.
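The sketch below illustrates, under the same hypothetical setup, how the multiplex relation context could be propagated to the three branches. The additive update and the per-branch projections are assumptions; disabling a branch flag corresponds to withholding the context from that branch, as in Table 4.

```python
import torch.nn as nn

class ContextPropagation(nn.Module):
    """Illustrative propagation of the multiplex relation context to the
    human, object, and interaction branches."""
    def __init__(self, dim=256, to_human=True, to_object=True, to_interaction=True):
        super().__init__()
        self.flags = {"human": to_human, "object": to_object, "interaction": to_interaction}
        self.proj = nn.ModuleDict({name: nn.Linear(dim, dim) for name in self.flags})

    def forward(self, tokens, context):
        # tokens: dict of branch name -> (num_queries, dim) branch features
        # context: (num_queries, dim) multiplex relation context
        out = {}
        for name, feat in tokens.items():
            # Add the projected context only to the branches enabled for this run.
            out[name] = (feat + self.proj[name](context)) if self.flags[name] else feat
        return out
```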
Table 5.
The impact of disentangling the human and object branches.
MUREN-(k) denotes the sharing of parameters between the human and object branches across k layers. The parameters are shared only between corresponding layers.
MUREN† is a variant of MUREN obtained by adjusting the number of layers L.
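The following is a minimal sketch of how the MUREN-(k) setting could be realized: the first k corresponding layers of the human and object branches reuse the same module (tied parameters), while the remaining layers are disentangled. The layer type, dimensions, and builder function are assumptions for illustration only.

```python
import torch.nn as nn

def build_branches(num_layers=6, shared_k=0, dim=256, nhead=8):
    """Illustrative construction of human/object branches for MUREN-(k)."""
    human, obj = [], []
    for i in range(num_layers):
        layer = nn.TransformerDecoderLayer(d_model=dim, nhead=nhead, batch_first=True)
        if i < shared_k:
            # Shared: the same layer object serves both branches (tied parameters).
            human.append(layer)
            obj.append(layer)
        else:
            # Disentangled: each branch gets its own parameters.
            human.append(layer)
            obj.append(nn.TransformerDecoderLayer(d_model=dim, nhead=nhead, batch_first=True))
    return nn.ModuleList(human), nn.ModuleList(obj)
```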
Qualitative results
Figure 3.
Visualization of HOI detection results on HICO-DET [1]. Red boxes, blue boxes, and green lines indicate humans, objects, and interactions, respectively.
Attention map visualization
Figure 4.
Visualization of HOI detection results and the cross-attention maps in each branch and in the multiplex relation embedding module (MURE).
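As a rough guide to reproducing such visualizations, the sketch below collects cross-attention weights with forward hooks. The layer names are placeholders, and it assumes the attention modules are registered nn.MultiheadAttention submodules that return attention weights; the released code may expose these maps differently.

```python
import torch

def collect_cross_attention(model, images, layer_names):
    """Illustrative extraction of cross-attention maps for visualization."""
    maps, handles = {}, []

    def make_hook(name):
        def hook(module, inputs, output):
            # nn.MultiheadAttention returns (attn_output, attn_weights);
            # attn_weights is None unless need_weights=True in the forward call.
            attn = output[1]
            maps[name] = attn.detach().cpu() if attn is not None else None
        return hook

    modules = dict(model.named_modules())
    for name in layer_names:
        handles.append(modules[name].register_forward_hook(make_hook(name)))
    with torch.no_grad():
        model(images)
    for h in handles:
        h.remove()
    return maps
```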
Acknowledgements
This work was supported by the IITP grants (2021-0-00537: Visual common sense through self-supervised learning for restoration of invisible parts in images (50%), 2021-0-02068: AI Innovation Hub (40%), and 2019-0-01906: AI graduate school program at POSTECH (10%)) funded by the Korea government (MSIT).
Paper
Relational Context Learning for Human-Object Interaction Detection
Sanghyun Kim, Deunsol Jung and Minsu Cho
CVPR, 2023
[arXiv] [Bibtex]
Code
Check out our GitHub repository: [github]
Reference
[1] Yu-Wei Chao, Yunfan Liu, Xieyang Liu, Huayi Zeng, and Jia Deng. Learning to detect human-object interactions. In 2018 IEEE Winter Conference on Applications of Computer Vision (WACV), pages 381–389. IEEE, 2018.
[2] Saurabh Gupta and Jitendra Malik. Visual semantic role labeling. arXiv preprint arXiv:1505.04474, 2015.