Abstract

We present a framework to analyze various aspects of models for video question answering (VideoQA) using customizable synthetic datasets, which are constructed automatically from gameplay videos. Our work is motivated by the fact that existing models are often tested only on datasets that require excessively high-level reasoning or mostly contain instances accessible through single-frame inferences. Hence, it is difficult to measure the capacity and flexibility of trained models, and existing techniques often rely on ad-hoc implementations of deep neural networks without clear insight into datasets and models. We are particularly interested in understanding temporal relationships between video events to solve VideoQA problems, because reasoning about temporal dependencies is what most clearly distinguishes videos from images. To address this objective, we automatically generate a customized synthetic VideoQA dataset using Super Mario Bros. gameplay videos so that it contains events with different levels of reasoning complexity. Using the dataset, we show that properly constructed datasets with events at various complexity levels are critical for learning effective models and improving overall performance.

MarioQA Dataset Construction

Figure 1. Overall QA generation procedure in our MarioQA dataset. Given a gameplay video and the event logs shown on the left, (a) a target event is selected (marked as a green box), (b) a question semantic chunk is generated from the target event, (c) a question template is sampled from the template pool, and (d) a QA pair is generated by filling the template and linguistically realizing the answer.
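
The four steps in Figure 1 can be sketched in a few lines of Python. The event-log record, the template pool contents, and all names below are illustrative assumptions for exposition, not the actual released implementation:

```python
import random
from dataclasses import dataclass

# Hypothetical event-log entry: what happened, to whom, how, and when.
@dataclass
class Event:
    time: int    # frame index in the gameplay video (assumed)
    action: str  # e.g. "kill"
    target: str  # e.g. "red koopa troopa"
    means: str   # e.g. "stomping"

# Hypothetical template pool keyed by event action; <...> are fillable slots.
TEMPLATE_POOL = {
    "kill": [
        ("How did Mario attack a <target>?", "by <means>"),
        ("How was a <target> killed?", "by <means>"),
    ],
}

def generate_qa(events):
    """One QA pair from an event log, following steps (a)-(d) of Figure 1."""
    target_event = random.choice(events)                  # (a) select a target event
    chunk = {"action": target_event.action,               # (b) build the semantic chunk:
             "target": target_event.target,               #     the predicate-argument
             "means": target_event.means}                 #     structure of the event
    q_tmpl, a_tmpl = random.choice(TEMPLATE_POOL[chunk["action"]])  # (c) sample template
    question = q_tmpl.replace("<target>", chunk["target"])          # (d) fill the template
    answer = a_tmpl.replace("<means>", chunk["means"])              #     and realize answer
    return question, answer

log = [Event(time=120, action="kill", target="red koopa troopa", means="stomping")]
print(generate_qa(log))  # e.g. ('How did Mario attack a red koopa troopa?', 'by stomping')
```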

MarioQA Dataset Characteristics

NT question
Q: How did Mario attack a red koopa troopa?
A: by stomping

ET question
Q: How did a green koopa troopa die after a red koopa troopa was killed by a shell?
A: by a fireball

HT question
Q: How was an enemy killed before Mario killed a red koopa troopa by stomping?
A: by a shell

Figure 2. Since we are interested in understanding temporal relationships between multiple events, which is an important aspect of VideoQA, we consider situations where the temporal region in which a target event (marked as a red box) occurs is constrained by a reference event (marked as a blue box). We therefore construct our MarioQA dataset from three subsets containing questions with different characteristics in their temporal relationships: questions with no temporal relationship (NT), with an easy temporal relationship (ET), and with a hard temporal relationship (HT). NT questions ask about unique events in the entire video without any temporal relationship phrase. ET and HT questions involve temporal relationships at different levels of difficulty: while ET contains questions about globally unique events, HT involves distracting events (marked as green boxes) that force a VideoQA system to choose the right answer among multiple identical events using temporal reasoning; for a target event kill(PGoomba, stomping), any kill(*,*) events in the same video clip are considered distracting events.
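
To make the NT/ET/HT distinction concrete, here is a minimal sketch of how a generator might detect distracting events and use a reference event to disambiguate among them. It reuses the hypothetical Event record from the sketch above; the function names and the classification logic are illustrative assumptions, not the actual implementation:

```python
def distractors(events, target):
    """Events in the same clip matching the target's action pattern,
    e.g. any kill(*, *) event for a kill target."""
    return [e for e in events if e is not target and e.action == target.action]

def subset_for(events, target):
    """Assign a candidate target event to a question subset (assumed logic)."""
    if not distractors(events, target):
        return "NT/ET"  # globally unique: answerable without resolving duplicates
    return "HT"         # identical events exist: temporal reasoning is required

def resolve_target(events, reference, action, relation):
    """For an HT question, pick the event matching the action pattern whose
    temporal region is constrained by the reference event."""
    if relation == "after":
        candidates = [e for e in events
                      if e.action == action and e.time > reference.time]
    else:  # "before"
        candidates = [e for e in events
                      if e.action == action and e.time < reference.time]
    # A well-posed question leaves exactly one candidate in the window.
    return candidates[0] if len(candidates) == 1 else None
```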

MarioQA Dataset Examples

The MarioQA dataset consists of three types of questions: event-centric, counting, and state questions. Each question is also composed with one of three levels of temporal relationship. Examples of each combination are given below, followed by a sketch of one possible record format.

Event-Centric (NT)
Q: How did a flower die?
A: by a shell

Counting (NT)
Q: How many coin blocks did Mario hit?
A: 5

State (NT)
Q: What is the current stage?
A: cave

Event-Centric (ET)
Q: Where did Mario throw a shell after jumping?
A: pipe

Counting (ET)
Q: How many coin blocks were hit by Mario before a spiky appeared?
A: 2

State (ET)
Q: What is the state of Mario after a mushroom was eaten by Mario?
A: super form

Event-Centric (HT)
Q: Which item appeared after Mario held a shell?
A: fireflower

Counting (HT)
Q: How often did Mario jump after a goomba was killed by Mario's stomp?
A: 2

State (HT)
Q: What is Mario's state before Mario ate a mushroom?
A: tiny form
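
For reference, a single MarioQA instance could be stored roughly as follows; the field names and the identifier format are illustrative assumptions, since the released files may use a different schema:

```python
# Hypothetical record for a single MarioQA instance (assumed schema).
example = {
    "clip_id": "gameplay_000123",  # assumed identifier format
    "question": "What is the state of Mario after a mushroom was eaten by Mario?",
    "answer": "super form",
    "question_type": "state",      # one of: event-centric, counting, state
    "temporal_level": "ET",        # one of: NT, ET, HT
}
```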

Quantitative Results

Table 1. Accuracies (%) of the models on the test splits; refer to the paper for details of the models. Models are trained on different combinations of subsets: NT (case 1), NT+ET (case 2), and NT+ET+HT (case 3) to see the impact of each subset on accuracy. Each trained model is then tested on the test split of each subset.

Paper

MarioQA: Answering Questions by Watching Gameplay Videos
Jonghwan Mun*, Paul Hongsuck Seo*, Ilchae Jung, Bohyung Han (* equal contribution)
In ICCV, 2017
[arXiv preprint] [BibTeX]

Code

Check out the GitHub repository to obtain the QA generation code and to train/test models: GitHub Repository

Dataset

To obtain a copy of the MarioQA dataset, please download the agreement and read it carefully. Then, send the fully completed, signed, and scanned agreement to Jonghwan Mun (jonghwan.mun [at] postech.ac.kr). We recommend listing all students and researchers on your team who will use the dataset. We will verify your request manually and send you the download script. Note that if you send the email with the subject [MarioQA Dataset Request], we will be able to reply more quickly.