Dongkeun Kim¹, Jinsung Lee², Minsu Cho¹,², Suha Kwak¹,²
¹Department of CSE, POSTECH · ²Graduate School of AI, POSTECH
Group activity recognition is the task of understanding the activity conducted by a group of people as a whole in a multi-person video. Existing models for this task are often impractical in that they demand ground-truth bounding box labels of actors even at test time or rely on off-the-shelf object detectors. Motivated by this, we propose a novel model for group activity recognition that depends neither on bounding box labels nor on object detectors. Based on a Transformer, our model localizes and encodes partial contexts of a group activity by leveraging the attention mechanism, and represents a video clip as a set of partial context embeddings. These embeddings are then aggregated into a single group representation that reflects the entire context of the activity while capturing the temporal evolution of each partial context. Our method achieves outstanding performance on two benchmarks, the Volleyball and NBA datasets, surpassing not only the state of the art trained with the same level of supervision but also some existing models relying on stronger supervision.
Figure 1. Overall architecture of our model. A CNN incorporating motion feature computation modules extracts a motion-augmented feature map from each frame. At each frame, a set of learnable tokens (unpainted jigsaw puzzle pieces) is embedded through the attention mechanism of the Transformer encoder to localize clues useful for group activity recognition. The token embeddings (painted jigsaw puzzle pieces) are then fused into a group representation in two steps: first, embeddings of the same token (pieces with the same shape) are aggregated across time; then, the results of different tokens (pieces with different shapes and colors) are aggregated. Finally, the group representation is fed into the classifier, which predicts group activity class scores.
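The pipeline in Figure 1 can be summarized in code. Below is a minimal PyTorch sketch of how the token-based encoding and the two-step aggregation might be wired together; the class name `PartialContextModel`, the single-convolution stand-in backbone, the plain cross-attention, the mean poolings, and all dimensions are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of Figure 1's pipeline. Module choices and sizes are
# illustrative assumptions, not the paper's exact configuration.
import torch
import torch.nn as nn

class PartialContextModel(nn.Module):
    def __init__(self, num_tokens=12, dim=256, num_heads=8, num_classes=9):
        super().__init__()
        # Stand-in for the motion-augmented CNN backbone.
        self.backbone = nn.Conv2d(3, dim, kernel_size=7, stride=4, padding=3)
        # Learnable tokens: the "unpainted puzzle pieces".
        self.tokens = nn.Parameter(torch.randn(num_tokens, dim))
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.temporal = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, num_heads, batch_first=True),
            num_layers=1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, clip):                       # clip: (B, T, 3, H, W)
        B, T = clip.shape[:2]
        feats = self.backbone(clip.flatten(0, 1))  # (B*T, D, H', W')
        feats = feats.flatten(2).transpose(1, 2)   # (B*T, H'*W', D)
        q = self.tokens.unsqueeze(0).expand(B * T, -1, -1)
        # Tokens localize partial contexts by attending over the feature map.
        emb, _ = self.cross_attn(q, feats, feats)  # (B*T, K, D)
        emb = emb.reshape(B, T, *emb.shape[1:])    # (B, T, K, D)
        # Step 1: aggregate each token across time (temporal evolution
        # of each partial context); here via a small temporal Transformer.
        per_token = torch.stack(
            [self.temporal(emb[:, :, k]).mean(dim=1)
             for k in range(emb.shape[2])], dim=1)  # (B, K, D)
        # Step 2: aggregate across tokens into one group representation.
        group = per_token.mean(dim=1)               # (B, D)
        return self.classifier(group)               # group activity scores
```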
Table 1. Comparison with state-of-the-art methods on the NBA and Volleyball datasets. Numbers in bold indicate the best performance; underlined ones are the second best.
Table 2. (a) Ablation on the proposed modules. (b) Ablation on the number of tokens per frame. (c) Ablation on the token aggregation methods. (d) Ablation on the position of the motion feature module.
Figure 2. Visualization of the Transformer encoder attention maps on the NBA dataset.
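For reference, attention maps like those in Figure 2 could be extracted as follows. This is a hypothetical sketch built on the `PartialContextModel` sketch above (the function `token_attention_maps` and the `feat_hw` argument are our own), not the released visualization code.

```python
# Hypothetical extraction of per-token attention maps, assuming the
# PartialContextModel sketch above; not the released code.
import torch
import torch.nn.functional as F

@torch.no_grad()
def token_attention_maps(model, clip, feat_hw):
    """Per-token attention over frame locations, upsampled to clip size."""
    B, T, _, H, W = clip.shape
    feats = model.backbone(clip.flatten(0, 1)).flatten(2).transpose(1, 2)
    q = model.tokens.unsqueeze(0).expand(B * T, -1, -1)
    # need_weights=True returns attention weights averaged over heads.
    _, attn = model.cross_attn(q, feats, feats, need_weights=True)  # (B*T, K, H'*W')
    attn = attn.reshape(B * T, -1, *feat_hw)                        # (B*T, K, H', W')
    return F.interpolate(attn, size=(H, W), mode='bilinear', align_corners=False)
```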
Figure 3. t-SNE visualization of feature embeddings learned by different model variants on the NBA dataset.
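A plot like Figure 3 can be produced with scikit-learn once the group representations have been collected; `plot_tsne`, the perplexity, and the colormap below are arbitrary choices for illustration, assuming `embeddings` is an (N, D) array with integer activity labels `labels`.

```python
# Minimal t-SNE sketch; hyperparameters and styling are arbitrary.
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def plot_tsne(embeddings, labels, path='tsne_nba.png'):
    """Project (N, D) group representations to 2-D, colored by label."""
    xy = TSNE(n_components=2, perplexity=30, init='pca').fit_transform(embeddings)
    plt.figure(figsize=(6, 6))
    plt.scatter(xy[:, 0], xy[:, 1], c=labels, cmap='tab10', s=8)
    plt.axis('off')
    plt.savefig(path, dpi=300)
```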
This work was supported by NRF grants and IITP grants funded by the Ministry of Science and ICT, Korea (NRF-2021R1A2C3012728, NRF-2018R1A5A1060031, IITP-2020-0-00842, IITP-2021-0-00537, No. 2019-0-01906 Artificial Intelligence Graduate School Program, POSTECH).
We will make our code available as soon as possible; check our GitHub repository: [Github]