Convolution has arguably been the most important feature transform for modern neural networks, driving the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitation of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, remain limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed the relational self-attention (RSA), that leverages rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms convolution and self-attention counterparts, achieving the state of the art on the standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.

Relational Self-Attention (RSA)

Figure 1. Computational graph of RSA. RSA consists of two types of kernels (basic and relational kernels) and two types of contexts (basic and relational contexts).
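To make the computational graph concrete, below is a minimal numpy sketch of the idea: a basic kernel generated from the query alone, a relational kernel generated from query-context correlations, and output aggregation over both the basic and a relational context. The specific projections (U, V, W_r) and the Gram-style relational context here are simplified stand-ins for illustration, not the exact formulation in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
C, M, D = 16, 9, 4                  # channels, context size, latent dim

q = rng.standard_normal(C)          # query feature
X = rng.standard_normal((M, C))     # spatio-temporal context around the query

# Basic kernel: a dynamic kernel generated from the query alone,
# via a low-rank (latent dim D) projection (hypothetical U, V).
U = rng.standard_normal((C, D)) * 0.1
V = rng.standard_normal((D, M)) * 0.1
k_basic = (q @ U) @ V               # (M,) kernel weights

# Relational kernel: generated from query-context relations
# (Hadamard-style correlation between q and each context element).
R = X * q                           # (M, C) relation map
W_r = rng.standard_normal(C) * 0.1  # hypothetical projection
k_rel = R @ W_r                     # (M,)

# Basic context is X itself; the relational context encodes
# context-context relations (here, a normalized Gram aggregation).
X_rel = (X @ X.T / C) @ X           # (M, C)

# Output: the dynamic kernels aggregate both contexts.
y = (k_basic + k_rel) @ (X + X_rel) # (C,)
```

In the full model, this transform is applied at every query position over its local spatio-temporal window, with the projections learned end-to-end.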

Complexity of RSA

Table 1. Complexity of RSA. B, N, M, C, D, L denote batch size, input size, context size, channel dimension, latent channel dimension, and number of queries, respectively. We simplify the complexity terms using ML, DC/L for ease of description.
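As a rough illustration of why the latent channel dimension D matters for the complexity in Table 1, the arithmetic below compares generating a kernel with a full query projection against a low-rank decomposition through D. The dimensions and the exact cost terms are hypothetical examples, not the figures from the paper.

```python
# Hypothetical sizes: N queries, context size M, channels C, latent dim D.
N, M, C, D = 8 * 7 * 7, 5 * 7 * 7, 64, 8

# Full projection: mapping a C-dim query to an (M x C)-dim kernel directly.
full_flops = N * C * (M * C)

# Decomposed: project C -> D, then D -> (M x C).
decomposed_flops = N * (C * D + D * M * C)

print(full_flops, decomposed_flops, full_flops / decomposed_flops)
```

With these example sizes, the decomposition cuts the kernel-generation cost by roughly a factor of C / D, which is the intuition behind the reduction reported in the ablation on decomposing H.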

Experimental results

1. Performance comparison with other feature transform methods

Table 2. Performance comparison with other spatio-temporal feature transform methods on SS-v1. WjWi(∙) indicates a sequential transform of Wi followed by Wj. σ denotes softmax.

2. Performance comparison on Something-Something V1 & V2, Diving-48, and FineGym

Table 3. Performance comparison on (a) Something-Something v1 and v2, (b) Diving-48, and (c) FineGym.

3. Ablation studies on Something-Something-V1

Table 4. Ablation experiments on SS-V1. (a) Table 4a compares the performance of a single RSA layer with different combinations of dynamic kernels and contexts. (b) Table 4b shows the effectiveness of decomposing H with latent channel dimension D. Decomposing H significantly reduces the computational cost. (c) Table 4c summarizes the effect of the group-wise correlation. The Hadamard product (G=CQ) achieves the highest accuracy. Note that FLOPs remain constant across varying G due to the switched computation order. (d) Table 4d compares the effect of the kernel size M. In most cases, a larger kernel results in higher accuracy.

4. Kernel visualization

Figure 2. Kernel visualization results on SS-V1. From top to bottom in each subfigure, we visualize the input RGB frames, the self-attention kernels, the basic kernels, and the relational kernels. The query position and the context are marked in red and yellow in the RGB frames, respectively. The size of the spatio-temporal kernel M is set to mt × mh × mw = 5 × 7 × 7, and 6 out of L = 8 kernels are shown for each transform.


This work was supported by the NRF grants (NRF-2017R1E1A1A01077999, NRF-2021R1A2C3012728) and the IITP grant (No.2019-0-01906, AI Graduate School Program - POSTECH) funded by the Ministry of Science and ICT, Korea. This work was done while Manjin was working as an intern at Microsoft Research Asia.


Relational Self-Attention: What's Missing in Attention for Video Understanding
Manjin Kim*, Heeseung Kwon*, Chunyu Wang, Suha Kwak, and Minsu Cho
NeurIPS, 2021
[arXiv] [Bibtex]


We will make our code available online as soon as possible. Check our GitHub repository: [github]