Abstract

The task of predicting future actions from a video is crucial for a real-world agent interacting with others. When anticipating actions in the distant future, we humans typically consider long-term relations over the whole sequence of actions, i.e., not only observed actions in the past but also potential actions in the future. In a similar spirit, we propose an end-to-end attention model for action anticipation, dubbed Future Transformer (FUTR), that leverages global attention over all input frames and output tokens to predict a minutes-long sequence of future actions. Unlike previous autoregressive models, the proposed method learns to predict the whole sequence of future actions with parallel decoding, enabling more accurate and faster inference for long-term anticipation. We evaluate our method on two standard benchmarks for long-term action anticipation, Breakfast and 50 Salads, achieving state-of-the-art results.
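
To illustrate the contrast between autoregressive and parallel decoding, here is a minimal PyTorch sketch of the two inference schemes using a generic transformer decoder; all module names, sizes, and tensors are illustrative stand-ins, not the released FUTR code.

```python
import torch
import torch.nn as nn

# Toy decoder used to contrast the two inference schemes; sizes are illustrative.
d_model, n_future, n_classes = 256, 8, 48
dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
classifier = nn.Linear(d_model, n_classes)

memory = torch.randn(1, 100, d_model)            # encoded past frames (stand-in)

# Parallel decoding (FUTR-style): a single decoder pass over a fixed set of action queries.
queries = torch.randn(1, n_future, d_model)      # learned action queries (stand-in)
parallel_logits = classifier(decoder(queries, memory))     # (1, n_future, n_classes)

# Autoregressive decoding (transformer-style): n_future sequential passes,
# each conditioned on the tokens generated so far.
tokens = torch.randn(1, 1, d_model)              # start token (stand-in)
for _ in range(n_future):
    t = tokens.size(1)
    mask = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)  # causal mask
    out = decoder(tokens, memory, tgt_mask=mask)
    tokens = torch.cat([tokens, out[:, -1:]], dim=1)
ar_logits = classifier(tokens[:, 1:])                       # (1, n_future, n_classes)
```

The parallel scheme runs the decoder once regardless of how many future actions are predicted, which is the source of the faster inference noted above.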

Future Transformer (FUTR)

Figure 1. Overall architecture of FUTR. The proposed method is composed of an encoder and a decoder: the encoder classifies action labels of past frames (action segmentation), and the decoder anticipates future action labels and their durations (action anticipation). The encoder learns distinctive feature representations from past actions via self-attention, and the decoder learns long-term relations between past and future actions via self-attention and cross-attention.
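
As a rough companion to Figure 1, the following is a minimal sketch of the encoder-decoder design in PyTorch, assuming generic transformer layers; the dimensions, head names (seg_head, cls_head, dur_head), number of action queries, and the extra end-of-sequence class are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn as nn

class FUTRSketch(nn.Module):
    """Minimal sketch of the encoder-decoder design described in Figure 1.
    Module names, sizes, and heads are illustrative assumptions."""

    def __init__(self, feat_dim=2048, d_model=256, n_classes=48, n_queries=8):
        super().__init__()
        self.input_proj = nn.Linear(feat_dim, d_model)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
        self.decoder = nn.TransformerDecoder(dec_layer, num_layers=2)
        self.action_queries = nn.Embedding(n_queries, d_model)   # learned action queries
        self.seg_head = nn.Linear(d_model, n_classes)            # past-frame action segmentation
        self.cls_head = nn.Linear(d_model, n_classes + 1)        # future labels (+1: assumed end class)
        self.dur_head = nn.Linear(d_model, 1)                    # relative duration per future action

    def forward(self, frame_feats):                 # (B, T, feat_dim) observed frame features
        memory = self.encoder(self.input_proj(frame_feats))
        seg_logits = self.seg_head(memory)          # action segmentation of past frames
        queries = self.action_queries.weight.unsqueeze(0).expand(frame_feats.size(0), -1, -1)
        dec_out = self.decoder(queries, memory)     # self- + cross-attention, decoded in parallel
        return seg_logits, self.cls_head(dec_out), self.dur_head(dec_out).squeeze(-1)
```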

Experimental results

1. Performance comparison with the state of the art

Table 1. Comparison with the state of the art. Our models set a new state of the art on Breakfast and 50 Salads. Bold and underlined numbers indicate the highest and the second-highest accuracy, respectively.

2. Analysis of FUTR on Breakfast

Table 2. Analysis of FUTR on Breakfast. (a) Table 2a shows the effectiveness of the parallel decoding in FUTR. FUTR-A anticipates action labels and durations autoregressively, similar to the Transformer (Vaswani et al., NeurIPS'17). FUTR-M is equivalent to FUTR except that it applies masked self-attention in the decoder. (b) Table 2b verifies the efficacy of global self-attention (GSA) compared to local self-attention (LSA) in both the encoder and the decoder. (c) Table 2c demonstrates the effectiveness of the output structuring of FUTR. In training FUTR, ground-truth labels and durations are directly assigned to the outputs of the action queries in sequential order. FUTR-S is a variant of our method that is trained to predict a temporal window of starting and ending points. FUTR-H is a DETR-like variant (Carion et al., ECCV'20) that treats the action queries as an unordered set rather than a sequence and predicts a start-end window from each query in the set. (d) Table 2d shows the effectiveness of the action segmentation loss.
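
To make the output structuring of Table 2c and the segmentation loss of Table 2d concrete, below is a minimal sketch of one plausible training objective, assuming the i-th action query is supervised by the i-th future action; the loss weights, the duration normalization, and the function names are assumptions, not the exact losses of the paper.

```python
import torch
import torch.nn.functional as F

def futr_loss_sketch(seg_logits, cls_logits, dur_pred,
                     past_labels, future_labels, future_durations,
                     lambda_seg=1.0, lambda_dur=1.0):
    """Illustrative objective combining (i) past-frame action segmentation,
    (ii) future action classification with ground truths assigned to queries in
    sequential order, and (iii) duration regression. Weights and exact loss
    forms are assumptions."""
    # (i) action segmentation loss on observed frames (Table 2d)
    seg_loss = F.cross_entropy(seg_logits.flatten(0, 1), past_labels.flatten())

    # (ii) sequential assignment: the i-th query is supervised by the i-th future action (Table 2c)
    n = future_labels.size(1)
    cls_loss = F.cross_entropy(cls_logits[:, :n].flatten(0, 1), future_labels.flatten())

    # (iii) relative durations, here normalized to sum to 1 over the future sequence (an assumption)
    dur_loss = F.mse_loss(dur_pred[:, :n].softmax(dim=-1), future_durations)

    return cls_loss + lambda_dur * dur_loss + lambda_seg * seg_loss
```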

3. Cross-attention map visualization

Figure 2. Cross-attention map visualization on Breakfast. We visualize the cross-attention layers of the decoder in Figure 2. FUTR effectively leverages long-term dependencies over the entire sequence of past frames, regardless of position, and detects key frames of the activity. The horizontal and vertical axes indicate the indices of past frames and action queries, respectively. Brighter colors indicate higher attention scores. Video frames whose attention scores are highly activated are highlighted with yellow boxes.
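
A map like the one in Figure 2 can be produced, in spirit, by plotting the attention weights of a decoder cross-attention layer over the past frames. The sketch below uses a standalone nn.MultiheadAttention as a stand-in for a FUTR decoder layer, with illustrative sizes (8 action queries attending over 100 past frames).

```python
import torch
import torch.nn as nn
import matplotlib.pyplot as plt

# Standalone cross-attention module as a stand-in for one decoder layer's cross-attention.
d_model = 256
cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)

queries = torch.randn(1, 8, d_model)     # decoder action queries (stand-in)
memory = torch.randn(1, 100, d_model)    # encoded past frames (stand-in)

# need_weights=True returns head-averaged attention weights of shape (batch, n_queries, n_frames).
_, attn = cross_attn(queries, memory, memory, need_weights=True)

plt.imshow(attn[0].detach().numpy(), aspect="auto", cmap="viridis")
plt.xlabel("past frame index")
plt.ylabel("action query index")
plt.colorbar(label="attention score")
plt.show()
```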

4. Prediction results

Figure 3. Prediction results on Breakfast. Each subfigure visualizes the ground-truth labels and the predictions of FUTR and Cycle Cons. (Farha et al., GCPR'20). We set the observation ratio alpha to 0.3 and the prediction ratio beta to 0.5 in this experiment. We decode the predicted action labels and durations into frame-wise action classes. Each color in the color bar indicates the action label written above it.
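
For reference, the sketch below shows one straightforward way to decode a predicted sequence of action labels and relative durations into frame-wise action classes over a fixed prediction horizon (e.g., beta times the video length); the normalization and rounding choices are illustrative assumptions, not the paper's exact procedure.

```python
import torch

def decode_framewise(labels, rel_durations, horizon_frames):
    """Expand predicted (label, relative duration) pairs into a frame-wise sequence.
    labels: (N,) predicted action indices, in temporal order.
    rel_durations: (N,) nonnegative relative durations, normalized to sum to 1 here.
    horizon_frames: number of future frames to fill."""
    rel = rel_durations.clamp(min=0)
    rel = rel / rel.sum().clamp(min=1e-8)
    counts = (rel * horizon_frames).round().long()
    counts[-1] += horizon_frames - counts.sum()       # absorb rounding error in the last segment
    frames = torch.repeat_interleave(labels, counts.clamp(min=0))
    return frames[:horizon_frames]

# Example: three future actions covering 20%, 50%, and 30% of a 100-frame horizon.
labels = torch.tensor([3, 7, 1])
durations = torch.tensor([0.2, 0.5, 0.3])
print(decode_framewise(labels, durations, horizon_frames=100))
```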

Acknowledgements

This research was supported by NCSOFT, the IITP grant funded by MSIT (No. 2019-0-01906, AI Graduate School Program - POSTECH), and the Center for Applied Research in Artificial Intelligence (CARAI) grant funded by DAPA and ADD (UD190031RD).

Paper

Future Transformer for Long-term Action Anticipation
Dayoung Gong, Joonseok Lee, Manjin Kim, Seong Jong Ha, and Minsu Cho
CVPR, 2022
[arXiv] [Bibtex]

Code

Check out our GitHub repository: [github]