Training Recurrent Answering Units with Joint Loss Minimization for VQA

Overview

We propose a novel algorithm for visual question answering based on a recurrent deep neural network, where every module in the network corresponds to a complete answering unit with attention mechanism by itself. The network is optimized by minimizing loss aggregated from all the units, which share model parameters while receiving different information to compute attention probability. For training, our model attends to a region within image feature map, updates its memory based on the question and attended image feature, and answers the question based on its memory state. This procedure is performed to compute loss in each step. The motivation of this approach is our observation that multi-step inferences are often required to answer questions while each problem may have a unique desirable number of steps, which is difficult to identify in practice. Hence, we always make the first unit in the network solve problems, but allow it to learn the knowledge from the rest of units by backpropagation unless it degrades the model. To implement this idea, we early-stop training each unit as soon as it starts to overfit. Note that, since more complex models tend to overfit on easier questions quickly, the last answering unit in the unfolded recurrent neural network is typically killed first while the first one remains last. We make a single-step prediction for a new question using the shared model. This strategy works better than the other options within our framework since the selected model is trained effectively from all units without overfitting. The proposed algorithm achieves the state-of-the-art performance on the standard benchmark dataset without data augmentation.

Recurrent Answering Units

Figure 1. Overall architecture of the proposed network. The proposed network is a recurrent deep neural network, where each recurrent unit corresponds to a complete module for answering a question about an image. For training, we unfold the network to predict answer and give supervision for every steps. For testing, we use a single answering unit to answer a question about an image.

Effect of Joint Loss Minimization

The most interesting observation from our paper is that training recurrent answering units with multi-step joint loss minimization and earlystopping improves VQA accuracy of a single-step answering unit. This effect cannot be expected from the previous approaches based on multi-step attention based methods [1,2,3], because they use independent weight parameters for each attention steps. Interestingly, our single-step answering unit trained with proposed training method outperforms other approaches based on multi-step attention and prediction.

Figure 2. Training and validation accuracy of Ours_FULL and Ours_SS. Both Ours_FULL and Ours_SS use a single-step for prediction, but trained differently. Ours_FULL is trained with proposed multi-step joint loss minimization and early-stopping, and Ours_SS is trained with single-step prediction loss minimization.

Qualitative Results

Image	Ours_FULL	Ours_SS	Question / Answer	Image	Ours_FULL	Ours_SS	Question / Answer
			Q: What color is the man's hat? Ours_FULL: orange Ours_SS: blue Correct Answer: orange				Q: Is it a rainy dat? Ours_FULL: no Ours_SS: yes Correct Answer: no
			Q: What color is her hair? Ours_FULL: black Ours_SS: blonde Correct Answer: black				Q: What is the pattern of the woman's dress? Ours_FULL: stripes Ours_SS: plaid Correct Answer: stripes
			Q: Can you see the man's hands? Ours_FULL: no Ours_SS: yes Correct Answer: no				Q: Which sign is the man showing? Ours_FULL: peace Ours_SS: stop Correct Answer: peace
			Q: What is the girl walking with? Ours_FULL: dog Ours_SS: suitcase Correct Answer: dog				Q: What kind of bedspread is that? Ours_FULL: plaid Ours_SS: white Correct Answer: plaid

Figure 2. Qualitative comparisons of attention between Ours_Full and Ours_SS.

Performance

The proposed algorithm outperforms other multi-step attention based models on VQA datasets[4].

Table 1: Comparison with other multi-step attention based models on VQA test-dev [4].

	Open-Ended				Multiple-Choice
	All	Y/N	Num	Others	All	Y/N	Num	Others
SAN (VGG)[1]	58.7	79.3	36.6	46.1	-	-	-	-
DMN (VGG) [2]	60.3	80.5	36.8	48.3	-	-	-	-
HieCoAtt (VGG) [3]	60.5	79.6	38.4	49.1	64.9	79.7	40.1	57.9
Ours_FULL (VGG)	61.3	81.5	37.0	49.6	66.1	81.5	39.5	58.9
HieCoAtt (ResNet) [3]	61.8	79.7	38.7	51.7	65.8	79.7	40.0	59.8
Ours_ResNet (ResNet)	63.3	81.9	39.0	53.0	67.7	81.9	41.1	61.5

References

Z. Yang, X. He, J. Gao, Li Deng, and A. Smola. Stacked attention networks for image question answering. In CVPR, 2016. [link]
C. Xiong, S. Merity, and R. Socher. Dynamic memory networks for visual and textual question answering. In ICML, 2016. [link]
J. Lu, J Yang, D Batra, D Parikh. Hierarchical Question-Image Co-Attention for Visual Question Answering. In NIPS, 2016. [link]
S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. L.Zitnick, and D. Parikh. VQA: visual question answering. In ICCV, 2015. [link]

Training Recurrent Answering Units
with Joint Loss Minimization for VQA

Overview

Recurrent Answering Units

Effect of Joint Loss Minimization

Qualitative Results

Performance

Paper

Code

References