Overview

We tackle the image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on the question. For the adaptive parameter prediction, we employ a separate parameter prediction network, which consists of gated recurrent units (GRU) taking a question as input and a fully-connected layer generating a set of candidate weights as output. Since the dynamic parameter layer is a fully-connected layer, it is challenging to predict the large number of parameters in the layer needed to construct the CNN for ImageQA. We reduce the complexity of this problem by incorporating a hashing technique, where the candidate weights given by the parameter prediction network are selected using a predefined hash function to determine the individual weights in the dynamic parameter layer. The proposed network, a joint network combining the CNN for ImageQA with the parameter prediction network, is trained end-to-end through back-propagation, with its weights initialized from a pre-trained CNN and GRU. The proposed algorithm achieves state-of-the-art performance on all available public ImageQA benchmarks.
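To make the hashing step concrete, the following is a minimal sketch of how a small vector of candidate weights could be expanded into a full weight matrix. It is not the released implementation: the names `hashed_weight_matrix` and `dynamic_layer` are hypothetical, MD5 merely stands in for the predefined hash function, and the optional sign hash is an assumption borrowed from common feature-hashing practice.

```python
import hashlib


def _hash_int(key: str) -> int:
    """Deterministic integer hash of a string (stand-in for the predefined hash)."""
    return int.from_bytes(hashlib.md5(key.encode()).digest()[:8], "big")


def hashed_weight_matrix(candidates, rows, cols, use_sign=True):
    """Build a rows x cols matrix whose entries are picked from `candidates`.

    Entry (m, n) is candidates[h(m, n) mod K], optionally multiplied by a
    hashed sign of +1 or -1 (the sign hash is an assumption of this sketch).
    """
    k = len(candidates)
    weights = []
    for m in range(rows):
        row = []
        for n in range(cols):
            idx = _hash_int(f"idx:{m},{n}") % k
            sign = 1.0
            if use_sign and _hash_int(f"sign:{m},{n}") % 2 == 1:
                sign = -1.0
            row.append(sign * candidates[idx])
        weights.append(row)
    return weights


def dynamic_layer(weights, x):
    """Apply the fully-connected layer y = W x using the hashed weights."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Hypothetical candidate weights, as a parameter prediction network might emit.
candidates = [0.5, -1.0, 0.25, 2.0]
W = hashed_weight_matrix(candidates, rows=3, cols=5)
y = dynamic_layer(W, [1.0, 0.0, -1.0, 2.0, 0.5])
```

Here 4 candidate weights fill a 3 x 5 matrix of 15 entries, so the prediction network only needs to output K values rather than one value per matrix entry. Because the hash is fixed, the same candidate is shared by every entry that maps to it.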

DPPnet (Dynamic Parameter Prediction Network)

Figure 1. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classification network and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from the candidate weights obtained from the parameter prediction network.

ImageQA using DPPnet

  • Q: How does the woman feel?
    DPPnet: happy
  • Q: What type of hat is she wearing?
    DPPnet: cowboy
  • Q: Is it raining?
    DPPnet: no
  • Q: What is he holding?
    DPPnet: umbrella
  • Q: What is he doing?
    DPPnet: skateboarding
  • Q: Is this person dancing?
    DPPnet: no
  • Q: How many cranes are in the image?
    DPPnet: 2(3)
  • Q: How many people are on the bench?
    DPPnet: 2(1)
  • Q: What is the boy holding?
    DPPnet: surfboard
    DPPnet: bat
  • Q: What animal is shown?
    DPPnet: giraffe
    DPPnet: elephant
  • Q: What is this room?
    DPPnet: living room
    DPPnet: kitchen
  • Q: What is the animal doing?
    DPPnet: resting (relaxing)
    DPPnet: swimming (fishing)

Figure 2. Sample images and questions in the VQA dataset [1]. Each question requires a different type and/or level of understanding of the corresponding input image to find the correct answer. Answers in blue are correct while answers in red are incorrect.

Performance

The proposed algorithm achieves state-of-the-art performance on all available public ImageQA benchmarks (VQA [1], COCO-QA [2], DAQUAR [3]). Table 1 shows the quantitative results on VQA. Results on the other datasets (COCO-QA, DAQUAR) can be found in our paper.

Table 1: Evaluation results on VQA test-dev in terms of the official evaluation metric of VQA [1].

                 Open-Ended                      Multiple-Choice
Method           All    Y/N    Num    Others     All    Y/N    Num    Others
Question [1]     48.09  75.66  36.70  27.14      53.68  75.71  37.05  38.64
Image [1]        28.13  64.01  00.42  03.77      30.53  69.87  00.45  03.76
Q+I [1]          52.64  75.55  33.67  37.37      58.97  75.59  34.35  50.33
LSTM Q [1]       48.76  78.20  35.68  26.59      54.75  78.22  36.82  38.78
LSTM Q+I [1]     53.74  78.94  35.24  36.42      57.17  78.95  35.80  43.41
DPPnet           57.22  80.71  37.24  41.69      62.48  80.79  38.94  52.16
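For readers unfamiliar with the scores in Table 1, the VQA metric [1] credits an answer according to how many human annotators gave the same answer, with full credit once at least three agree. The sketch below illustrates only that core formula. The function name `vqa_accuracy` is hypothetical, and the sketch assumes answers are already normalized strings, so it leaves out the preprocessing and averaging done by the official evaluation tools.

```python
def vqa_accuracy(predicted, human_answers):
    """Per-question score: min(number of matching human answers / 3, 1)."""
    matches = sum(1 for answer in human_answers if answer == predicted)
    return min(matches / 3.0, 1.0)


# Hypothetical annotations for a counting question with ten annotators.
human_answers = ["2"] * 2 + ["3"] * 8
score_two = vqa_accuracy("2", human_answers)    # partial credit
score_three = vqa_accuracy("3", human_answers)  # full credit
```

The reported numbers are such per-question scores averaged over the test questions and expressed as percentages.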

Paper

Image Question Answering using Convolutional Neural Network with Dynamic Parameter Prediction
Hyeonwoo Noh, Paul Hongsuck Seo, Bohyung Han

Please refer to our arXiv pre-print for more details.

Code

Check out the DPPnet GitHub repository.

Supplementary Examples

We provide more comprehensive results of our algorithm to aid understanding. The supplementary examples include additional results from the sentence retrieval experiment described in the paper and more qualitative results on various ImageQA pairs. Supplementary examples can be found at the following link.

References