We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed automatically from gameplay videos. Our work is motivated by the fact that existing models are often tested only on datasets that either require excessively high-level reasoning or mostly contain instances answerable from a single frame. Hence, it is difficult to measure the capacity and flexibility of trained models, and existing techniques often rely on ad-hoc implementations of deep neural networks without clear insight into datasets and models. We are particularly interested in understanding temporal relationships between video events to solve VideoQA problems, because reasoning about temporal dependencies is among the aspects that most clearly distinguish videos from images. To this end, we automatically generate a customized synthetic VideoQA dataset from Super Mario Bros. gameplay videos so that it contains events with different levels of reasoning complexity. Using the dataset, we show that properly constructed datasets with events at various complexity levels are critical for learning effective models and improving overall performance.
Figure 1. Overall QA generation procedure in our MarioQA dataset. Given a gameplay video and event logs shown on the left, (a) a target event is selected (marked as a green box), (b) a question semantic chunk is generated from the target event, (c) a question template is sampled from the template pool, and (d) QA pairs are generated by filling the template and linguistically realizing the answer.
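The four steps in Figure 1 can be illustrated with a minimal sketch. It is not the released generation code: the event log layout, the template pool, the answer phrases, and the names `EVENT_LOG`, `TEMPLATES`, `ANSWER_PHRASES`, and `generate_qa` are all assumptions made for illustration.

```python
import random

# Hypothetical event log entries: (frame, action, target, means).
# The real log format used by MarioQA may differ.
EVENT_LOG = [
    (12, "kill", "red koopa troopa", "stomping"),
    (40, "kill", "green koopa troopa", "fireball"),
]

# Hypothetical template pool, keyed by (action, queried argument).
TEMPLATES = {
    ("kill", "means"): [
        "How did Mario attack a {target}?",
        "How was a {target} killed?",
    ],
}

# Hypothetical phrases used to linguistically realize the answer.
ANSWER_PHRASES = {
    "stomping": "by stomping",
    "fireball": "by a fireball",
    "shell": "by a shell",
}


def generate_qa(event, rng):
    """Turn one target event into a (question, answer) pair."""
    _frame, action, target, means = event
    # (b) Build the question semantic chunk: what is known and what is asked.
    chunk = {"action": action, "query": "means", "target": target}
    # (c) Sample a question template for this kind of chunk.
    template = rng.choice(TEMPLATES[(chunk["action"], chunk["query"])])
    # (d) Fill the template and realize the answer as a phrase.
    question = template.format(target=chunk["target"])
    answer = ANSWER_PHRASES[means]
    return question, answer


rng = random.Random(0)          # seeded only to make the sketch repeatable
target_event = EVENT_LOG[0]     # (a) select a target event
question, answer = generate_qa(target_event, rng)
print(f"Q: {question} A: {answer}")
```

Keeping the semantic chunk separate from the template pool is what would let one event yield several differently worded questions with the same answer.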
NT question | ET question | HT question |
---|---|---|
Q: How did Mario attack a red koopa troopa? A: by stomping | Q: How did a green koopa troopa die after a red koopa troopa was killed by a shell? A: by a fireball | Q: How was an enemy killed before Mario killed a red koopa troopa by stomping? A: by a shell |
Figure 2. We are interested in understanding temporal relationships between multiple events, which is an important aspect of VideoQA. We therefore consider the situation in which the temporal region where a target event (marked as a red box) occurs is constrained by a reference event (marked as a blue box). Accordingly, we generate our MarioQA dataset with three subsets, which contain questions with different characteristics in temporal relationships: questions with no temporal relationship (NT), with easy temporal relationships (ET), and with hard temporal relationships (HT). NT asks questions about events that are unique in the entire video, without any temporal relationship phrase. ET and HT contain questions with temporal relationships at different levels of difficulty. While ET contains questions about globally unique events, HT involves distracting events (marked as green boxes), which force a VQA system to choose the right answer out of multiple identical events using temporal reasoning; for a target event kill(PGoomba, stomping), any kill(*,*) events in the same video clip are considered distracting events.
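The distinction between the three subsets can be sketched as a small rule. This is a simplified reading of the caption, not the dataset's actual generation criterion: the `Event` type and the functions `has_distractor` and `question_subset` are hypothetical, and a distracting event is taken to be any other event with the same action as the target, following the kill(*,*) example.

```python
from typing import NamedTuple, Optional, Sequence


class Event(NamedTuple):
    frame: int
    action: str
    args: tuple


def has_distractor(events: Sequence[Event], target: Event) -> bool:
    # Assumption: any other event sharing the target's action,
    # e.g. kill(*,*) for a kill target, counts as a distracting event.
    return any(e is not target and e.action == target.action for e in events)


def question_subset(events: Sequence[Event], target: Event,
                    reference: Optional[Event] = None) -> str:
    """Return "NT", "ET", or "HT" for a question about `target`."""
    if has_distractor(events, target):
        if reference is None:
            # Without a temporal constraint the target would be ambiguous.
            raise ValueError("a reference event is needed to disambiguate")
        return "HT"
    # The target is globally unique: a temporal phrase is optional.
    return "NT" if reference is None else "ET"


unique_clip = [
    Event(10, "jump", ()),
    Event(30, "kill", ("PGoomba", "stomping")),
]
crowded_clip = unique_clip + [Event(55, "kill", ("Goomba", "fireball"))]

print(question_subset(unique_clip, unique_clip[1]))                    # NT
print(question_subset(unique_clip, unique_clip[1], unique_clip[0]))    # ET
print(question_subset(crowded_clip, crowded_clip[1], crowded_clip[0])) # HT
```

Under this reading, an ET question could still be answered without using its temporal phrase, whereas an HT question cannot.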
The MarioQA dataset consists of three types of questions: event-centric, counting, and state questions. Each type is also posed at three different levels of temporal relationship (NT, ET, and HT). Examples of each combination are as follows.
Event-Centric (NT) | Counting (NT) | State (NT) |
---|---|---|
Q: How did a flower die? A: by a shell | Q: How many coin blocks did Mario hit? A: 5 | Q: What is the current stage? A: cave |

Event-Centric (ET) | Counting (ET) | State (ET) |
---|---|---|
Q: Where did Mario throw a shell after jumping? A: pipe | Q: How many coin blocks were hit by Mario before a spiky appeared? A: 2 | Q: What is the state of Mario after a mushroom was eaten by Mario? A: super form |

Event-Centric (HT) | Counting (HT) | State (HT) |
---|---|---|
Q: Which item appeared after Mario held a shell? A: fireflower | Q: How often did Mario jump after a goomba was killed by Mario's stomp? A: 2 | Q: What is Mario's state before Mario ate a mushroom? A: tiny form |
Table 1. Accuracies (%) of the models on the test splits; refer to the paper for details of the models. Models are trained on different combinations of subsets, namely NT (case 1), NT+ET (case 2), and NT+ET+HT (case 3), to see the impact of each subset on accuracy. Each trained model is then tested on the test split of each subset.
Check out the GitHub repository to obtain the QA generation code and to train and test models: GitHub Repository
To obtain a copy of the MarioQA dataset, please download the agreement and read it carefully. Then send the fully completed, signed, and scanned agreement to Jonghwan Mun (jonghwan.mun [at] postech.ac.kr). We recommend listing all students and researchers on your team who will use the dataset. We verify each request manually and then send the download script. If you use the subject line [MarioQA Dataset Request], we will be able to reply more quickly.