Figure 1. Visualization of the constructed graph for our absorbing Markov chain.


We propose a simple but effective unsupervised learning algorithm to detect a common activity (co-activity) from a set of videos, which is formulated using absorbing Markov chain in a principled way. In our algorithm, a complete multipartite graph is first constructed, where vertices correspond to subsequences extracted from videos using a temporal sliding window and edges connect between the vertices originated from different videos; the weight of an edge is proportional to the similarity between the features of two end vertices. Then, we extend the graph structure by adding edges between temporally overlapped subsequences in a video to handle variable-length co-activities using temporal locality, and create an absorbing vertex connected from all other nodes. The proposed algorithm identifies a subset of subsequences as co-activity by estimating absorption time in the constructed graph efficiently. The great advantage of our algorithm lies in the properties that it can handle more than two videos naturally and identify multiple instances of a co-activity with variable lengths in a video. Our algorithm is evaluated intensively in a challenging dataset and illustrates outstanding performance quantitatively and qualitatively. data through ensemble with the fully convolutional network.


YouTube co-activity dataset
Accuracy of individual co-activity classes in YouTube co-activity dataset.

Additional datasets
The proposed algorithm is evaluated in three more datasets: THUMOS14, Hollywood and UCF101 co-activity dataset. The first dataset is THUMOS14 (Jiang et al. 2014), which is a large scale dataset for temporal action detection and contains more than 200 temporally untrimmed videos for 20 activity classes. This is a very challenging dataset since the best known supervised action detection algorithm achieves only 14.7% in mean average precision. The second dataset is Hollywood (Laptev et al. 2008) which is orignially designed for activity recognition and contains more than 400 temporally untrimmed videos for 8 classes of activity videos. We generate another dataset, UCF101 co-activity dataset, by concatenating temporally trimmed videos from UCF101 (Soomro, Zamir, and Shah 2012). This dataset consists of all 101 classes, each class contains 25 videos, and each video is constructed from 4 trimmed videos and has 2 co-activity instances. We exclude the results in these dataset in the main paper since the annotations of Hollywood dataset are noisy and the concatenated videos in UCF 101 co-activity videos are artificial.

Figure 2. Quantitative co-activity detection results when all videos in each class are tested.

Figure 3. Quantitative co-activity detection results when every pair of videos in each class is tested.

Code and Dataset

Code and YouTube co-activity dataset is now available.

Paper Link


This work is funded by the Samsung Electronics Co., (DMC R&D center).


[Jiang et al. 2014] Jiang, Y.-G.; Liu, J.; Roshan Zamir, A.; Toderici, G.; Laptev, I.; Shah, M.; and Sukthankar, R. 2014. THUMOS challenge: Action recognition with a large number of classes. http: //
[Laptev et al. 2008] Laptev, I.; Marszalek, M.; Schmid, C.; and Rozenfeld, B. 2008. Learning realistic human actions from movies. In CVPR, 1–8.
[Soomro, Zamir, and Shah 2012] Soomro, K.; Zamir, A. R.; and Shah, M. 2012. Ucf101: A dataset of 101 human actions classes from videos in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence abs/1212.0402.