Manjin Kim1* | Heeseung Kwon1* | Chunyu Wang2 | Suha Kwak1 | Minsu Cho1
1 Pohang University of Science and Technology (POSTECH) | 2 Microsoft Research Asia (MSRA)
*Equal contribution
Convolution has arguably been the most important feature transform for modern neural networks, leading to the advance of deep learning. The recent emergence of Transformer networks, which replace convolution layers with self-attention blocks, has revealed the limitations of stationary convolution kernels and opened the door to the era of dynamic feature transforms. The existing dynamic transforms, including self-attention, however, are all limited for video understanding, where correspondence relations in space and time, i.e., motion information, are crucial for effective representation. In this work, we introduce a relational feature transform, dubbed relational self-attention (RSA), that leverages the rich structures of spatio-temporal relations in videos by dynamically generating relational kernels and aggregating relational contexts. Our experiments and ablation studies show that the RSA network substantially outperforms its convolution and self-attention counterparts, achieving the state of the art on standard motion-centric benchmarks for video action recognition, such as Something-Something-V1 & V2, Diving48, and FineGym.
This work was supported by the NRF grants (NRF-2017R1E1A1A01077999, NRF-2021R1A2C3012728) and the IITP grant (No. 2019-0-01906, AI Graduate School Program - POSTECH) funded by the Ministry of Science and ICT, Korea. This work was done while Manjin Kim was working as an intern at Microsoft Research Asia.
We will make our code available online as soon as possible. Check our GitHub repository: [github]