Spatio-temporal convolution often fails to learn motion dynamics in videos, and thus an effective motion representation is required for video understanding in the wild. In this paper, we propose a rich and robust motion representation based on spatio-temporal self-similarity (STSS). Given a sequence of frames, STSS represents each local region as similarities to its neighbors in space and time. By converting appearance features into relational values, it enables the learner to better recognize structural patterns in space and time. We leverage the whole volume of STSS and let our model learn to extract an effective motion representation from it. The proposed neural block, dubbed SELFY, can be easily inserted into neural architectures and trained end-to-end without additional supervision. With a sufficient volume of the neighborhood in space and time, it effectively captures long-term interaction and fast motion in the video, leading to robust action recognition. Our experimental analysis demonstrates its superiority over previous methods for motion modeling as well as its complementarity to spatio-temporal features from direct convolution. On the standard action recognition benchmarks, Something-Something-V1 & V2, Diving-48, and FineGym, the proposed method achieves state-of-the-art results.

Spatio-temporal self-similarity (STSS) representation learning.

Figure 1. STSS describes each position (query) by its similarities (STSS tensor) to its neighbors in space and time (neighborhood). This allows a generalized, far-sighted view of motion: short-term and long-term, forward and backward, as well as spatial self-motion. Our method learns to extract a rich motion representation from STSS without additional supervision.
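The STSS computation described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the function name `stss`, the `(T, H, W, C)` layout, the use of cosine similarity, and the zero-filling of out-of-range neighbors are all assumptions made for illustration.

```python
import numpy as np

def stss(features, temporal_offsets=(-1, 0, 1), radius=1):
    """Hypothetical STSS sketch.

    features: array of shape (T, H, W, C).
    Returns an array of shape (T, H, W, L, U, V), where L is the number of
    temporal offsets and U = V = 2 * radius + 1 is the spatial window size.
    """
    T, H, W, C = features.shape
    # L2-normalize channels so a dot product equals cosine similarity (assumed).
    norm = np.linalg.norm(features, axis=-1, keepdims=True)
    unit = features / np.maximum(norm, 1e-8)
    side = 2 * radius + 1
    out = np.zeros((T, H, W, len(temporal_offsets), side, side), dtype=unit.dtype)
    # Zero-pad spatially so out-of-frame neighbors get similarity 0 (assumed).
    padded = np.pad(unit, ((0, 0), (radius, radius), (radius, radius), (0, 0)))
    for li, l in enumerate(temporal_offsets):
        for t in range(T):
            if not 0 <= t + l < T:
                continue  # neighbor frame lies outside the clip: keep zeros
            for ui in range(side):
                for vi in range(side):
                    # Neighbor at spatial offset (ui - radius, vi - radius) in frame t + l.
                    neighbor = padded[t + l, ui:ui + H, vi:vi + W]
                    out[t, :, :, li, ui, vi] = np.sum(unit[t] * neighbor, axis=-1)
    return out
```

With the offsets `(-1, 0, 1)`, the slice at temporal offset 0 is the spatial self-similarity of each frame, while the slices at -1 and +1 relate each position to the previous and next frames, which corresponds to the backward and forward views mentioned in the caption.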

Overview of our self-similarity representation block (SELFY).

Figure 2. The SELFY block takes a video feature tensor V as input, transforms it into an STSS tensor S, and extracts a feature tensor F from S. Feature integration then produces the final STSS representation Z, which has the same size as the input V. The resulting representation Z is fused into the input feature V by element-wise addition, so that SELFY acts as a residual block. See text for details.
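The data flow from S to the fused output can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' architecture: S is assumed to be precomputed with shape `(T, H, W, L, U, V)`, feature extraction is reduced to a single linear layer with ReLU over each flattened `(U, V)` map, and feature integration is reduced to one pointwise projection back to C channels. The names `selfy_residual`, `w_extract`, and `w_integrate` are hypothetical.

```python
import numpy as np

def selfy_residual(video, stss_tensor, w_extract, w_integrate):
    """Hypothetical sketch of the V -> S -> F -> Z -> V + Z flow.

    video:       V, shape (T, H, W, C)
    stss_tensor: S, shape (T, H, W, L, U, V), assumed precomputed
    w_extract:   shape (U * V, C_F), stand-in for learned extraction weights
    w_integrate: shape (L * C_F, C), stand-in for learned integration weights
    """
    T, H, W, C = video.shape
    L = stss_tensor.shape[3]
    # Feature extraction: flatten each (U, V) similarity map, then linear + ReLU.
    flat = stss_tensor.reshape(T, H, W, L, -1)
    feat = np.maximum(flat @ w_extract, 0.0)      # F: (T, H, W, L, C_F)
    # Feature integration: merge L and C_F, project pointwise to C channels.
    merged = feat.reshape(T, H, W, -1)
    z = merged @ w_integrate                      # Z: (T, H, W, C)
    assert z.shape == video.shape                 # Z must match V for the addition
    # Residual fusion by element-wise addition.
    return video + z
```

Because the output is `video + z`, setting the integration weights to zero makes the block an identity mapping, which is the usual property that lets a residual block be inserted into an existing network.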

Experimental results

1. Performance comparison on Something-Something V1&V2.

Table 1. Top-1 and top-5 accuracy (%) and FLOPs (G) are shown.

2. Performance comparison on Diving-48 & FineGym.

Table 2. Tables 2a and 2b show performance comparisons on Diving-48 and FineGym, respectively. Top-1 and top-5 accuracy (%) and FLOPs are shown in Table 2a, and averaged per-class accuracy (%) is shown in Table 2b.

3. Ablation studies on Something-Something-V1.

Table 3. Table 3a compares different types of similarity in the SELFY block; {·} denotes the set of temporal offsets l. Table 3b compares different feature extraction and feature integration methods. Smax denotes the soft-argmax operation, and MLP consists of 4 FC layers. The 1×1×1 layer in the feature integration stage is omitted.
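For readers unfamiliar with the soft-argmax operation (Smax) named in Table 3b, the following NumPy sketch shows the general technique applied to a single spatial similarity map. It reduces the map to an expected displacement, a differentiable stand-in for picking the best-matching offset. The function name `soft_argmax` and the `temperature` parameter are assumptions; the source does not specify how the operation is configured.

```python
import numpy as np

def soft_argmax(sim_map, temperature=0.01):
    """Hypothetical soft-argmax over a (U, V) similarity map.

    Returns the expected displacement (du, dv), measured from the window center.
    """
    side_u, side_v = sim_map.shape
    # Softmax over the whole map; subtracting the max keeps exp() stable.
    logits = (sim_map - sim_map.max()) / temperature
    weights = np.exp(logits)
    weights /= weights.sum()
    # Offset coordinates centered on the query position.
    us = np.arange(side_u) - side_u // 2
    vs = np.arange(side_v) - side_v // 2
    # Expected offset under the softmax distribution.
    du = float((weights.sum(axis=1) * us).sum())
    dv = float((weights.sum(axis=0) * vs).sum())
    return du, dv
```

A low temperature makes the result approach the hard argmax offset, while a high temperature pulls it toward the window center. Either way, the full similarity map is collapsed to two numbers per position.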

4. Relation with self-attention mechanisms.

Table 4. Table 4 shows performance comparison with self-attention methods. LSA, NL, and MHSA denote a local self-attention block, non-local block, and multi-head self-attention block, respectively.

5. Complementarity of STSS features.

Figure 3 & Table 5. Figure 3 illustrates the basic blocks and their combinations: (a) the spatio-temporal convolution block (STCB), (b) the SELFY-s block, and (c-f) their different combinations. Table 5 validates that STSS features are complementary to spatio-temporal features. The basic blocks and their combinations in Fig. 3 are compared on Something-Something-V1.

6. Improving robustness with STSS.

Figure 4. Figure 4 illustrates the results of the robustness experiments. (a) and (b) show the top-1 accuracy of SELFYNet variants (Table 3a) when different degrees of occlusion and motion blur, respectively, are added to the input. (c) shows qualitative examples where SELFYNet ({-3, ... ,3}) succeeds while SELFYNet ({1}) fails.


This work is supported by Samsung Advanced Institute of Technology (SAIT), the NRF grant (NRF-2021R1A2C3012728), and the IITP grant (No.2019-0-01906, AI Graduate School Program - POSTECH) funded by the Ministry of Science and ICT, Korea.


Learning Self-Similarity as Generalized Motion for Video Action Recognition
Heeseung Kwon*, Manjin Kim*, Suha Kwak, and Minsu Cho
ICCV, 2021
[arXiv] [Bibtex]


We will make our code available online as soon as possible. Check our GitHub repository: [github]