Scene graph generation aims to construct a semantic graph structure from an image such that its nodes and edges respectively represent objects and their relationships. A major challenge for the task lies in the presence of distracting objects and relationships in images: contextual reasoning is strongly distracted by irrelevant objects or backgrounds and, more importantly, by a vast number of irrelevant candidate relations. To tackle this issue, we propose the Selective Quad Attention Network (SQUAT), which learns to select relevant object pairs and disambiguate them via diverse contextual interactions. SQUAT consists of two main components: edge selection and quad attention. The edge selection module selects relevant object pairs, i.e., edges in the scene graph, which facilitates contextual reasoning, and the quad attention module then updates the edge features using both edge-to-node and edge-to-edge cross-attentions to capture contextual information between objects and object pairs. Experiments demonstrate the strong performance and robustness of SQUAT, which achieves the state of the art on the Visual Genome and Open Images v6 benchmarks.

SQUAT (Selective Quad Attention Network)

Figure 1. The overall architecture of the Selective Quad Attention Network (SQUAT). SQUAT consists of three components: the node detection module, the edge selection module, and the quad attention module. First, the node detection module extracts nodes N by detecting object candidate boxes and extracting their features; all possible pairs of the nodes are then constructed as initial edges E. Second, the edge selection module selects valid edges with high relatedness scores. Third, the quad attention module updates the node and edge features via four types of attention. Finally, the output features are passed into a classifier to predict the scene graph.
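The edge-selection step above can be sketched in PyTorch. This is a minimal illustration, not the paper's exact architecture: the scoring MLP, its layer sizes, and the simple top-k selection over all ordered pairs are assumptions made for clarity; only the overall idea of scoring every candidate pair and keeping the top ρ fraction follows the description above.

```python
import torch
import torch.nn as nn

class EdgeSelection(nn.Module):
    """Sketch of an edge selection module: score every ordered
    (subject, object) pair of detected nodes and keep the top
    rho fraction as valid edges. The scorer MLP is illustrative."""

    def __init__(self, dim, rho=0.35):
        super().__init__()
        self.rho = rho  # keeping ratio for candidate edges
        self.scorer = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1)
        )

    def forward(self, nodes):
        n, d = nodes.shape
        # Build all n*n ordered pairs (subject, object) as candidate edges.
        subj = nodes.unsqueeze(1).expand(n, n, d)
        obj = nodes.unsqueeze(0).expand(n, n, d)
        pairs = torch.cat([subj, obj], dim=-1).reshape(n * n, 2 * d)
        # Predict a relatedness score for each candidate pair.
        scores = self.scorer(pairs).squeeze(-1)
        # Keep the top rho fraction of candidates as valid edges.
        k = max(1, int(self.rho * n * n))
        keep = scores.topk(k).indices
        return pairs[keep], keep
```

With 5 detected nodes and ρ = 35%, the module keeps 8 of the 25 candidate pairs; the retained pair features and their indices are passed on to the quad attention module.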

Quad Attention Module

Figure 2. (left) Overview of the quad attention. The node features are updated by node-to-node (N2N) and node-to-edge (N2E) attentions, and the edge features are updated by edge-to-node (E2N) and edge-to-edge (E2E) attentions. (right) Detailed architecture of the quad attention. The node features are updated by node-to-node and node-to-edge attentions, and the valid edge features, selected by ESMQ, are updated by edge-to-node and edge-to-edge attentions. The keys and values of the node-to-edge and edge-to-edge attentions are selected by ESMN2E and ESME2E, respectively.
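One quad-attention block can be sketched as four cross-attentions whose outputs are fused into the node and edge streams. This is a hedged sketch, not the paper's exact design: the head count, the residual-sum fusion, and the use of `nn.MultiheadAttention` are assumptions; in the full model the keys and values of N2E and E2E would additionally be restricted to edges selected by ESMN2E and ESME2E.

```python
import torch
import torch.nn as nn

class QuadAttention(nn.Module):
    """Sketch of one quad-attention block: nodes attend to nodes
    (N2N) and to edges (N2E); edges attend to nodes (E2N) and to
    edges (E2E). Fusion by residual sum is an illustrative choice."""

    def __init__(self, dim, heads=4):
        super().__init__()
        self.n2n = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.n2e = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.e2n = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.e2e = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, nodes, edges):
        # nodes: (1, N, dim); edges: (1, E, dim) -- valid edges only.
        n = nodes + self.n2n(nodes, nodes, nodes)[0] \
                  + self.n2e(nodes, edges, edges)[0]
        e = edges + self.e2n(edges, nodes, nodes)[0] \
                  + self.e2e(edges, edges, edges)[0]
        return n, e
```

Both streams keep their shapes, so blocks of this form can be stacked, with the updated node and edge features finally fed to the scene graph classifier.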

Experimental Analysis

Results on Visual Genome

Table 1. Scene graph generation performance on three subtasks of the Visual Genome (VG) dataset with graph constraints. † denotes that bi-level sampling is applied to the model. ‡ denotes results reported from [A]. SQUAT outperforms the state-of-the-art models in every setting: PredCls, SGCls, and SGDet. In particular, SQUAT outperforms the state-of-the-art models by a large margin of 3.9 in mR@100 on the SGDet setting, which is the most realistic and important setting in practice since no perfect object detector exists.

Effect of the edge selection module

Table 2. Ablation study on message passing for scene graph generation. There are four settings depending on which graph is used in message passing: No, Full, ES, and GT. Every model with message passing over the ground-truth graph outperforms the state-of-the-art models by a substantial margin, showing that removing invalid edges is crucial for scene graph generation. The edge selection module clearly improves not only the performance of SQUAT but also that of BGNN, the previous state-of-the-art model. This indicates that the edge selection module effectively removes invalid edges and can be used as a plug-and-play module for message-passing-based scene graph methods.

Ablation Study

Table 3. (left) Ablation study on model variants of edge selection, removing the edge selection module for query selection and for key-value selection. (right) Ablation study on model variants of quad attention. N2N, N2E, E2N, and E2E denote the node-to-node, node-to-edge, edge-to-node, and edge-to-edge attentions, respectively.

Qualitative Results

Figure 3. Qualitative results of the edge selection module ESMQ for query selection. The edges retained after edge selection are drawn in the right graph. Green arrows denote valid pairs, and gray arrows denote invalid pairs. The keeping ratio for both settings is set to ρ = 35%. All of the valid edges remain, and most of the invalid edges are removed.


Devil's on the Edges: Selective Quad Attention for Scene Graph Generation
Deunsol Jung, Sanghyun Kim, Won Hwa Kim, Minsu Cho
IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023
[arXiv] [Bibtex]


Check our GitHub repository: [Github]


[A] Unbiased Scene Graph Generation from Biased Training, Tang et al., CVPR 2020