Motion plays a crucial role in understanding videos, and thus most state-of-the-art neural models for video classification incorporate motion information, typically by extracting optical flow frame by frame using a separate off-the-shelf method. Since optical flow is expensive to compute, incorporating motion information has remained a major computational bottleneck for video understanding. In this work, we attempt to replace the external, heavy computation of optical flow with internal, lightweight learning of motion features. The proposed method, dubbed the MotionCapture (MoCap) module, is an end-to-end trainable block for effective motion feature extraction. Inserted in the middle of any neural network, it learns to establish correspondences across frames and convert them into motion features, which are readily fed to the next downstream layer for better prediction. We demonstrate the effectiveness of our method on three standard benchmarks for video action recognition, where inserting the proposed module achieves state-of-the-art results at only a small additional cost.

Overall architecture

Figure 1. Overall architecture of the proposed approach. The model first takes multiple video frames as input and converts them into frame-wise appearance features using convolutional layers. The proposed MotionCapture (MoCap) module generates motion features from the frame-wise appearance features and feeds them into the next downstream layer.

MotionCapture (MoCap) module

Figure 2. The MoCap module estimates motion across the frame-wise feature maps of two adjacent frames. A correlation tensor is computed between the two feature maps, and a displacement tensor is then estimated from it. The final motion feature is obtained by transforming the displacement tensor through three convolution layers.
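The two core steps in Figure 2 can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: it computes a local correlation tensor between two frame-wise feature maps within a search window, then estimates a 2-channel displacement map via soft-argmax over the correlation scores. The function names, the window radius `max_disp`, and the `temperature` parameter are illustrative assumptions; the learned three-layer convolutional transform that follows is omitted.

```python
import numpy as np

def correlation_tensor(f1, f2, max_disp=2):
    """Local correlation between two frame-wise feature maps.

    f1, f2: (C, H, W) appearance features of adjacent frames.
    Returns a (P, H, W) correlation tensor, P = (2*max_disp+1)**2,
    one correlation map per candidate displacement.
    (Hypothetical sketch; not the paper's exact formulation.)
    """
    C, H, W = f1.shape
    pad = max_disp
    f2p = np.pad(f2, ((0, 0), (pad, pad), (pad, pad)))
    corrs = []
    for dy in range(-max_disp, max_disp + 1):
        for dx in range(-max_disp, max_disp + 1):
            # f2 window shifted by (dy, dx), correlated with f1 per position
            shifted = f2p[:, pad + dy:pad + dy + H, pad + dx:pad + dx + W]
            corrs.append((f1 * shifted).sum(axis=0) / C)
    return np.stack(corrs)

def displacement_from_correlation(corr, max_disp=2, temperature=0.1):
    """Soft-argmax over candidate displacements -> (2, H, W) displacement map."""
    offsets = np.array([(dy, dx)
                        for dy in range(-max_disp, max_disp + 1)
                        for dx in range(-max_disp, max_disp + 1)], dtype=float)
    w = np.exp(corr / temperature)
    w /= w.sum(axis=0, keepdims=True)  # softmax over the P displacements
    # Expected (dy, dx) at each spatial position
    return np.einsum('phw,pc->chw', w, offsets)
```

As a sanity check, correlating a feature map with a copy of itself shifted one pixel to the right yields a displacement map whose dx channel is close to +1 in the interior. In the real module this differentiable estimate, rather than a hard argmax, is what allows motion features to be learned end to end.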

Quantitative results

1. Results on Something-Something V1.

Figure 3. Each section of the table contains the results of 2D CNN methods [33,39], 3D CNN methods [35,36,41], TSM ResNet [21], and the proposed method, respectively. Compared to the baseline, our method obtains significant gains in top-1 and top-5 accuracy at the cost of only a 6.2% increase in FLOPs and a 1.2% increase in parameters. The figure on the left shows that our method achieves the best trade-off in terms of accuracy, FLOPs, and number of parameters.

2. Results on other datasets.

Figure 4. Each table summarizes the results on Kinetics-400 and HMDB-51, respectively.

Qualitative results

1. Qualitative results on Something-Something V1.

Figure 5. Visualization on Something-Something V1: (a) "Moving something and something closer to each other" and (b) "Wiping off something of something". Video frames, reconstructed frames, displacement maps, and confidence maps are shown from the top row in each subfigure.

2. Qualitative results on Kinetics-400.

Figure 6. Visualization on Kinetics-400: (a) "Pull ups" and (b) "Skateboarding". Video frames, reconstructed frames, displacement maps, and confidence maps are shown from the top row in each subfigure.


MotionCapture: Neural Motion Feature Learning for Video Understanding
Heeseung Kwon, Manjin Kim, Suha Kwak, and Minsu Cho
arXiv, 2019


We will make our code available online as soon as possible.