Heeseung Kwon1,2, Manjin Kim1, Suha Kwak1, Minsu Cho1,2
1POSTECH, 2NPRC
Motion plays a crucial role in understanding videos, and most state-of-the-art neural models for video classification incorporate motion information, typically using optical flows extracted by a separate off-the-shelf method. As extracting optical flows frame by frame requires heavy computation, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we replace external and heavy computation of optical flows with internal and light-weight learning of motion features. We propose a trainable neural module, dubbed MotionSqueeze, for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate that the proposed method provides a significant gain on four standard benchmarks for action recognition with only a small amount of additional cost, outperforming the state of the art on the Something-Something-V1&V2 datasets.
Figure 1. Overall architecture of the proposed approach. The model first takes video frames as input and converts them into frame-wise appearance features using convolutional layers. The proposed MotionSqueeze (MS) module generates motion features from the frame-wise appearance features, and the motion features are fed into the next downstream layer.
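As a rough illustration of Figure 1, the sketch below shows how an MS-style module could be inserted into a frame-wise 2D CNN. This is a minimal PyTorch sketch under our own assumptions: the split of the backbone into early and late stages, fusion by element-wise addition, and temporal average pooling are illustrative choices, not the authors' exact implementation. The MS module itself is sketched after Figure 2 below.

```python
# Sketch only: illustrative wiring of an MS-style module into a 2D CNN backbone.
import torch
import torch.nn as nn

class MSNetSketch(nn.Module):
    def __init__(self, backbone_early, backbone_late, ms_module, classifier):
        super().__init__()
        self.early = backbone_early   # conv layers before the MS module
        self.ms = ms_module           # maps two adjacent frame features to a motion feature
        self.late = backbone_late     # conv layers after the MS module
        self.classifier = classifier

    def forward(self, clip):
        # clip: (B, T, 3, H, W) video frames
        B, T = clip.shape[:2]
        x = self.early(clip.flatten(0, 1))              # frame-wise appearance features
        x = x.view(B, T, *x.shape[1:])
        # motion features between adjacent frames; the last frame reuses the
        # previous motion feature so that the temporal length T is preserved
        motion = [self.ms(x[:, t], x[:, t + 1]) for t in range(T - 1)]
        motion.append(motion[-1])
        motion = torch.stack(motion, dim=1)
        # combine motion with appearance and feed the next downstream layers
        x = (x + motion).flatten(0, 1)
        x = self.late(x)                                # (B*T, C', H', W')
        x = x.mean(dim=[2, 3])                          # global average pooling
        logits = self.classifier(x.view(B, T, -1).mean(dim=1))  # temporal average
        return logits
```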
Figure 2. Overall process of the MotionSqueeze (MS) module. The MS module estimates motion across the frame-wise feature maps of two adjacent frames. A correlation tensor is computed between the two feature maps, a displacement tensor is then estimated from it, and the final motion feature is obtained by transforming the displacement tensor with convolution layers.
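The following is a minimal PyTorch sketch of the process in Figure 2, under assumed details: a 15x15 local search window, cosine-similarity correlation, a soft-argmax with temperature 10 for displacement estimation, a max-correlation confidence map, and a small two-layer convolutional head. These specifics are illustrative and may differ from the authors' implementation.

```python
# Sketch only: correlation -> displacement/confidence -> motion features.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionSqueezeSketch(nn.Module):
    def __init__(self, channels, patch=15, hidden=64, temperature=10.0):
        super().__init__()
        self.patch = patch
        self.temperature = temperature
        # displacement grid over the local search window, shape (2, P*P)
        r = patch // 2
        ys, xs = torch.meshgrid(
            torch.arange(-r, r + 1, dtype=torch.float32),
            torch.arange(-r, r + 1, dtype=torch.float32),
            indexing="ij",
        )
        self.register_buffer("grid", torch.stack([xs, ys]).reshape(2, -1))
        # small conv head that turns (dx, dy, confidence) into motion features
        self.transform = nn.Sequential(
            nn.Conv2d(3, hidden, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=3, padding=1),
        )

    def forward(self, feat_t, feat_tp1):
        # feat_t, feat_tp1: (B, C, H, W) appearance features of adjacent frames
        B, C, H, W = feat_t.shape
        feat_t = F.normalize(feat_t, dim=1)
        feat_tp1 = F.normalize(feat_tp1, dim=1)
        # local correlation: compare each position in frame t with a PxP
        # neighborhood in frame t+1 -> (B, P*P, H*W)
        neigh = F.unfold(feat_tp1, kernel_size=self.patch,
                         padding=self.patch // 2)                 # (B, C*P*P, H*W)
        neigh = neigh.view(B, C, self.patch * self.patch, H * W)
        corr = (feat_t.view(B, C, 1, H * W) * neigh).sum(dim=1)   # (B, P*P, H*W)
        # soft-argmax over the window gives a sub-pixel displacement per position
        prob = F.softmax(corr * self.temperature, dim=1)
        disp = torch.einsum("dk,bkn->bdn", self.grid, prob)       # (B, 2, H*W)
        conf = corr.max(dim=1, keepdim=True).values               # (B, 1, H*W)
        motion = torch.cat([disp, conf], dim=1).view(B, 3, H, W)
        # transform (displacement, confidence) into motion features
        return self.transform(motion)
```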
Figure 3. Video classification performance comparison on Something-Something-V1 in terms of accuracy, computational cost, and model size. The proposed architecture (MSNet) achieves the best trade-off between accuracy and efficiency compared to the state-of-the-art methods TSM [21], TRN [47], ECO [48], I3D [2], NL-I3D [41], and GCN [42].
Table 1. Performance comparison on Something-Something-V1&V2.
Table 2. Performance comparison on Kinetics and HMDB51.
Figure 5. Visualization on Something-Something-V1: (a) "Pulling two ends of something so that it gets stretched" and (b) "Wiping off something of something". In each subfigure, video RGB frames, displacement maps, and confidence maps are shown from top to bottom.
Figure 6. Visualization on Kinetics: (a) "Pull ups" and (b) "Skateboarding". In each subfigure, video RGB frames, displacement maps, and confidence maps are shown from top to bottom.
We will make our code publicly available online as soon as possible; see our GitHub repository: [github]