Hyeonwoo Noh | Paul Hongsuck Seo | Bohyung Han
POSTECH
We tackle the image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on a question. For adaptive parameter prediction, we employ a separate parameter prediction network, which consists of gated recurrent units (GRUs) taking a question as input and a fully-connected layer generating a set of candidate weights as output. Since the dynamic parameter layer is a fully-connected layer, predicting the large number of parameters needed to construct the CNN for ImageQA is challenging. We reduce the complexity of this problem by incorporating a hashing technique: the candidate weights given by the parameter prediction network are selected using a predefined hash function to determine the individual weights in the dynamic parameter layer. The proposed network---the joint network consisting of the CNN for ImageQA and the parameter prediction network---is trained end-to-end through back-propagation, where its weights are initialized using a pre-trained CNN and GRU. The proposed algorithm achieves state-of-the-art performance on all available public ImageQA benchmarks.
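As a concrete illustration of the pipeline, below is a minimal PyTorch sketch of the parameter prediction network: a GRU encodes the question and a fully-connected layer emits the candidate weight vector. The layer sizes and names here are illustrative assumptions, not the configuration used in the paper (the original implementation is in Torch).

```python
import torch
import torch.nn as nn

class ParameterPredictionNet(nn.Module):
    """Sketch of the parameter prediction network: GRU question encoder
    followed by a fully-connected layer producing candidate weights.
    All sizes below are illustrative assumptions."""

    def __init__(self, vocab_size, embed_dim=200, hidden_dim=512,
                 num_candidates=40000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, num_candidates)

    def forward(self, question_tokens):
        # question_tokens: (batch, seq_len) tensor of word indices
        emb = self.embed(question_tokens)
        _, h_n = self.gru(emb)          # h_n: (1, batch, hidden_dim)
        # candidate weight vector p: (batch, num_candidates)
        return self.fc(h_n.squeeze(0))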
Figure 1. Overall architecture of the proposed Dynamic Parameter Prediction network (DPPnet), which is composed of the classification network and the parameter prediction network. The weights in the dynamic parameter layer are mapped by a hashing trick from the candidate weights obtained from the parameter prediction network.
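The hashing step can be sketched as follows: each entry of the dynamic layer's weight matrix is filled as w[m, n] = ξ(m, n) · p[ψ(m, n)], where p is the candidate weight vector, ψ picks a candidate index, and ξ flips the sign. The fixed-seed random mapping below is our stand-in for the predefined hash functions, an assumption of this sketch rather than the paper's exact construction.

```python
import torch

def build_dynamic_weights(candidates, out_dim, in_dim, seed=0):
    """Expand a 1-D candidate weight vector into the full weight matrix
    of the dynamic parameter layer: w[m, n] = xi(m, n) * p[psi(m, n)].
    A fixed-seed generator simulates the predefined hash functions."""
    g = torch.Generator().manual_seed(seed)  # fixed, so the mapping never changes
    psi = torch.randint(len(candidates), (out_dim, in_dim), generator=g)
    xi = torch.randint(0, 2, (out_dim, in_dim), generator=g) * 2 - 1
    return xi * candidates[psi]              # (out_dim, in_dim) weight matrix
```

Because ψ and ξ are fixed before training, gradients flow through the indexing back to the candidate weights, which is what keeps the joint network trainable end-to-end.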
(Figure 2 images omitted; the questions and DPPnet answers for the two example images per question are listed below.)

| Question | DPPnet (image 1) | DPPnet (image 2) |
|---|---|---|
| What is the boy holding? | surfboard | bat |
| What animal is shown? | giraffe | elephant |
| What is this room? | living room | kitchen |
| What is the animal doing? | resting (relaxing) | swimming (fishing) |
Figure 2. Sample images and questions from the VQA dataset [1]. Each question requires a different type and/or level of understanding of the corresponding input image to find the correct answer. In the original figure, correct answers are shown in blue and incorrect answers in red.
The proposed algorithm achieves state-of-the-art performance on all available public ImageQA benchmarks (VQA [1], COCO-QA [2], DAQUAR [3]). Table 1 shows the quantitative results on VQA. Results on the other datasets (COCO-QA, DAQUAR) can be found in our paper.
Table 1: Evaluation results on VQA test-dev in terms of the official evaluation metric of VQA [1]. (OE: Open-Ended, MC: Multiple-Choice)
| Method | All (OE) | Y/N (OE) | Num (OE) | Others (OE) | All (MC) | Y/N (MC) | Num (MC) | Others (MC) |
|---|---|---|---|---|---|---|---|---|
| Question [1] | 48.09 | 75.66 | 36.70 | 27.14 | 53.68 | 75.71 | 37.05 | 38.64 |
| Image [1] | 28.13 | 64.01 | 00.42 | 03.77 | 30.53 | 69.87 | 00.45 | 03.76 |
| Q+I [1] | 52.64 | 75.55 | 33.67 | 37.37 | 58.97 | 75.59 | 34.35 | 50.33 |
| LSTM Q [1] | 48.76 | 78.20 | 35.68 | 26.59 | 54.75 | 78.22 | 36.82 | 38.78 |
| LSTM Q+I [1] | 53.74 | 78.94 | 35.24 | 36.42 | 57.17 | 78.95 | 35.80 | 43.41 |
| DPPnet | 57.22 | 80.71 | 37.24 | 41.69 | 62.48 | 80.79 | 38.94 | 52.16 |
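For reference, the official VQA metric used in Table 1 scores a predicted answer by its agreement with the ten human answers collected per question. Below is a minimal sketch of the commonly quoted form; the official evaluation script additionally normalizes answer strings and averages this score over all ten leave-one-out subsets of annotators.

```python
def vqa_accuracy(prediction, human_answers):
    """Simplified form of the official VQA metric [1]: an answer is
    fully correct if at least 3 of the 10 annotators gave it, and
    proportionally correct otherwise."""
    matches = sum(ans == prediction for ans in human_answers)
    return min(matches / 3.0, 1.0)

# e.g. vqa_accuracy("surfboard", ["surfboard"] * 7 + ["board"] * 3) -> 1.0
```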
Check out the GitHub repository: DPPnet GitHub Repository
We provide more comprehensive results of our algorithm to aid understanding. The supplementary examples include more results on the sentence retrieval experiment described in the paper and more qualitative results on various ImageQA pairs. They can be found at the following link.