Abstract

Motion plays a crucial role in understanding videos, and most state-of-the-art neural models for video classification incorporate motion information, typically using optical flows extracted by a separate off-the-shelf method. As frame-by-frame optical flow extraction requires heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we replace the external, heavy computation of optical flows with internal, light-weight learning of motion features. We propose a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on the Something-Something-V1&V2 datasets.

Overall architecture

Figure 1. Overall architecture of the proposed approach. The model takes video frames as input and converts them into frame-wise appearance features using convolutional layers. The proposed MotionSqueeze (MS) module generates motion features from the frame-wise appearance features and feeds them to the next downstream layer.
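
As a rough sketch of where such a module sits in the network, the PyTorch-style code below folds the temporal dimension into the batch, computes frame-wise appearance features with the early stage of a 2D backbone, adds the output of a motion module, and continues with the remaining layers and a classifier. The class and argument names (`MSNetSketch`, `stage1`, `ms_module`, ...) and the simple temporal averaging are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MSNetSketch(nn.Module):
    def __init__(self, stage1, ms_module, stage2, classifier, num_frames):
        super().__init__()
        self.stage1 = stage1          # early convolutional layers (appearance)
        self.ms = ms_module           # MotionSqueeze-style motion module
        self.stage2 = stage2          # remaining convolutional layers
        self.classifier = classifier  # classification head
        self.num_frames = num_frames

    def forward(self, frames):
        # frames: (N, T, 3, H, W); fold time into the batch axis for 2D convs
        n, t, c, h, w = frames.shape
        x = frames.view(n * t, c, h, w)
        x = self.stage1(x)                 # frame-wise appearance features
        x = x + self.ms(x)                 # add motion features for the next layer
        x = self.stage2(x)
        x = x.mean(dim=[2, 3])             # global spatial pooling
        x = x.view(n, t, -1).mean(dim=1)   # temporal averaging over frames
        return self.classifier(x)
```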

MotionSqueeze (MS) module

Figure 2. Overall process of the MotionSqueeze (MS) module. The MS module estimates motion across the frame-wise feature maps of two adjacent frames. A correlation tensor is computed between the two feature maps, a displacement tensor is then estimated from the correlation tensor, and convolution layers transform the displacement tensor into the final motion feature.
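
As a rough illustration of this pipeline, the PyTorch sketch below computes a local correlation tensor between the feature maps of two adjacent frames, converts it into a displacement tensor with a plain soft-argmax, and transforms the displacements with a small stack of convolution layers. The names, window size, and layer configuration are illustrative assumptions, not the authors' exact implementation (which, as the later visualizations show, also produces confidence maps).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MotionSqueezeSketch(nn.Module):
    """Sketch of an MS-style module for a pair of adjacent frame features."""

    def __init__(self, out_channels, max_disp=3):
        super().__init__()
        self.max_disp = max_disp
        # (dx, dy) offsets of the local search window, shape (k*k, 2)
        offsets = torch.tensor(
            [[dx, dy]
             for dy in range(-max_disp, max_disp + 1)
             for dx in range(-max_disp, max_disp + 1)],
            dtype=torch.float32,
        )
        self.register_buffer("offsets", offsets)
        # convolution layers that transform displacements into motion features
        self.transform = nn.Sequential(
            nn.Conv2d(2, out_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
        )

    def correlation(self, f1, f2):
        # f1, f2: (N, C, H, W) feature maps of two adjacent frames.
        # Returns (N, k*k, H, W) correlation scores over a local window.
        d = self.max_disp
        h, w = f1.shape[2], f1.shape[3]
        f2_pad = F.pad(f2, [d, d, d, d])
        scores = []
        for i in range(2 * d + 1):          # vertical offset dy = i - d
            for j in range(2 * d + 1):      # horizontal offset dx = j - d
                f2_shift = f2_pad[:, :, i:i + h, j:j + w]
                scores.append((f1 * f2_shift).sum(dim=1))
        return torch.stack(scores, dim=1)

    def forward(self, f1, f2):
        corr = self.correlation(f1, f2)     # correlation tensor
        prob = F.softmax(corr, dim=1)       # match probabilities per position
        # displacement tensor via soft-argmax over the window, (N, 2, H, W)
        disp = torch.einsum("nkhw,kc->nchw", prob, self.offsets)
        return self.transform(disp)         # final motion feature
```

In a full model, such a module would be applied to every pair of adjacent frames, and the resulting motion features would be fused with the appearance features as in Figure 1.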

Quantitative results

1. Results on Something-Something-V1&V2.

Figure 3. Video classification performance comparison on Something-Something-V1 in terms of accuracy, computational cost, and model size. The proposed architecture (MSNet) achieves the best trade-off between accuracy and efficiency compared to the state-of-the-art methods TSM [21], TRN [47], ECO [48], I3D [2], NL-I3D [41], and GCN [42].

Table 1. Performance comparison on Something-Something-V1&V2.

2. Results on other datasets.

Table 2. Performance comparison on Kinetics and HMDB51.

Qualitative results

1. Qualitative results on Something-Something-V1.

Figure 5. Visualization on Something-Something-V1: (a) "Pulling two ends of something so that it gets stretched" and (b) "Wiping something off of something". Video RGB frames, displacement maps, and confidence maps are shown from top to bottom in each subfigure.

2. Qualitative results on Kinetics.

Figure 6. Visualization on Kinetics: (a) "Pull ups" and (b) "Skateboarding". Video RGB frames, displacement maps, and confidence maps are shown from top to bottom in each subfigure.

Paper

MotionSqueeze: Neural Motion Feature Learning for Video Understanding
Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho
ECCV, 2020
[arXiv] [Bibtex]

Code

We will make our code available online as soon as possible. Please check our GitHub repository: [github]