Hyeonwoo Noh | Paul Hongsuck Seo | Bohyung Han |
POSTECH |
We tackle the image question answering (ImageQA) problem by learning a convolutional neural network (CNN) with a dynamic parameter layer whose weights are determined adaptively based on the question. For the adaptive parameter prediction, we employ a separate parameter prediction network, which consists of gated recurrent units (GRU) taking a question as input and a fully connected layer generating a set of candidate weights as output. Because the dynamic parameter layer is itself fully connected, predicting the large number of parameters needed to construct the CNN for ImageQA is challenging. We reduce the complexity of this problem with a hashing technique: a predefined hash function selects, from the candidate weights produced by the parameter prediction network, the individual weights of the dynamic parameter layer. The proposed network, a joint network combining the CNN for ImageQA and the parameter prediction network, is trained end-to-end through back-propagation, with its weights initialized from a pre-trained CNN and GRU. The proposed algorithm achieves state-of-the-art performance on all available public ImageQA benchmarks.
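The hashing step described above can be illustrated with a minimal NumPy sketch. It is not the released implementation: the CRC32-based hash, the function names, and the layer sizes are all assumptions chosen for illustration, and the candidate vector here is random rather than produced by a GRU-based parameter prediction network.

```python
import zlib

import numpy as np


def hash_index(m: int, n: int, num_candidates: int) -> int:
    """Hypothetical predefined hash: map weight position (m, n) to a candidate index."""
    return zlib.crc32(f"{m},{n}".encode()) % num_candidates


def build_dynamic_weights(candidates: np.ndarray, out_dim: int, in_dim: int) -> np.ndarray:
    """Fill an (out_dim, in_dim) weight matrix by looking up hashed candidate weights."""
    k = candidates.shape[0]
    idx = np.array(
        [[hash_index(m, n, k) for n in range(in_dim)] for m in range(out_dim)]
    )
    # Integer-array indexing: each weight is a copy of one shared candidate.
    return candidates[idx]


def dynamic_parameter_layer(image_feat: np.ndarray, candidates: np.ndarray, out_dim: int) -> np.ndarray:
    """Apply a fully connected layer whose weights come from the candidates."""
    weights = build_dynamic_weights(candidates, out_dim, image_feat.shape[0])
    return weights @ image_feat


rng = np.random.default_rng(0)
candidates = rng.standard_normal(64)   # stand-in for the prediction network output
image_feat = rng.standard_normal(32)   # stand-in for a CNN image feature
output = dynamic_parameter_layer(image_feat, candidates, out_dim=16)
print(output.shape)  # (16,)
```

In this sketch the layer has 16 × 32 = 512 weights, but only 64 values need to be predicted per question, because many positions share the same candidate. The hash is fixed in advance, so the same position always reads the same candidate slot.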
Q: What is the boy holding? | | Q: What animal is shown? | |
DPPnet: surfboard | DPPnet: bat | DPPnet: giraffe | DPPnet: elephant |
Q: What is this room? | | Q: What is the animal doing? | |
DPPnet: living room | DPPnet: kitchen | DPPnet: resting (relaxing) | DPPnet: swimming (fishing) |
The proposed algorithm achieves state-of-the-art performance on all available public ImageQA benchmarks (VQA [1], COCO-QA [2], DAQUAR [3]). Table 1 shows the quantitative results on VQA. Results on the other datasets (COCO-QA and DAQUAR) can be found in our paper.
Table 1: Evaluation results on VQA test-dev, measured with the official VQA evaluation metric [1].
 | Open-Ended | | | | Multiple-Choice | | | |
Method | All | Y/N | Num | Others | All | Y/N | Num | Others |
Question [1] | 48.09 | 75.66 | 36.70 | 27.14 | 53.68 | 75.71 | 37.05 | 38.64 |
Image [1] | 28.13 | 64.01 | 0.42 | 3.77 | 30.53 | 69.87 | 0.45 | 3.76 |
Q+I [1] | 52.64 | 75.55 | 33.67 | 37.37 | 58.97 | 75.59 | 34.35 | 50.33 |
LSTM Q [1] | 48.76 | 78.20 | 35.68 | 26.59 | 54.75 | 78.22 | 36.82 | 38.78 |
LSTM Q+I [1] | 53.74 | 78.94 | 35.24 | 36.42 | 57.17 | 78.95 | 35.80 | 43.41 |
DPPnet | 57.22 | 80.71 | 37.24 | 41.69 | 62.48 | 80.79 | 38.94 | 52.16 |
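For readers unfamiliar with the VQA metric, the sketch below shows the per-question accuracy as it is commonly stated for VQA [1]: an answer counts as fully correct if at least three human annotators gave it. The function name is hypothetical, and the answer normalization here is limited to lowercasing and trimming, which is an assumption and is simpler than what the official evaluation code does.

```python
from collections import Counter
from typing import Sequence


def vqa_accuracy(predicted: str, human_answers: Sequence[str]) -> float:
    """Per-question accuracy: min(number of humans who gave this answer / 3, 1)."""
    # Simplified normalization (assumption): lowercase and strip whitespace only.
    counts = Counter(answer.strip().lower() for answer in human_answers)
    return min(counts[predicted.strip().lower()] / 3.0, 1.0)


# Hypothetical annotations for one question (10 human answers).
answers = ["surfboard"] * 2 + ["board"] * 8
print(vqa_accuracy("board", answers))      # agreed with 8 annotators
print(vqa_accuracy("surfboard", answers))  # agreed with only 2 annotators
```

The scores in Table 1 correspond to this kind of per-question accuracy averaged over the questions of each answer type and reported as a percentage.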
Check out the DPPnet GitHub repository.
We provide more comprehensive results of our algorithm to aid understanding. The supplementary examples include additional results from the sentence retrieval experiment described in the paper and more qualitative results on various ImageQA pairs. They can be found at the following link.