Supplementary examples

1. Retrieved sentences before and after fine-tuning GRU

Table A.1 extends Table 6 in the main paper with additional examples of how the parameter prediction network understands questions. Before fine-tuning, the retrieved sentences are determined by the words they share with the query question; after fine-tuning, they reflect the task that the question asks the network to solve. More examples of retrieved sentences can be found at the link below.

Query question: What is written on the teddy bear's feet?

Before fine-tuning	After fine-tuning
What is the orange item below the person's feet?	What is written on the surfboard?
What is near the baby elephant's feet?	What is written on the surfboard?
What color are the shoes on the person's feet?	What is written on the skateboard?
What are the long sticks under the person's feet?	What is written on the bottom of the snowboard?
What is on the person's feet?	What is written on the airplane's tail?
What is underneath the girl's feet?	What is written on the hydrant?
What are on the person's feet?	What is written on the signpost?
What is below the zebra's feet?	What is written on the pipes?
What are on the child's feet?	What is written on the leftmost pant leg?
What color are the bear's feet?	What is written on the ramp?
What color are the horse's feet?	What is written on the kayak?
What shape is on the bottom of the bear's feet?	What is written on the black bag?
Where are the brown faced dog's feet?	What is written on the wall next to the skier?
What is attached to the man's feet?	What is written on the chalkboard?
What is laying by the man's feet?	What is written on the ramp rail?
What is on everyone's feet?	What is written at the top of the racket body?
What is on the bottom of the man's feet?	What is written on the wall of the pitch?
What is on the woman's feet?	What is written on the donuts?

Table A.1. Retrieved sentences before and after fine-tuning GRU. Retrieved sentences are listed in descending order based on cosine similarity with the query question. (column 1) retrieved sentences before fine-tuning, (column 2) retrieved sentences after fine-tuning.
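The ranking described in the caption can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the three-dimensional vectors are made-up stand-ins for the GRU sentence embeddings, and the helper names `cosine_similarity` and `retrieve` are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen for this sketch: zero vectors match nothing
    return dot / (norm_u * norm_v)


def retrieve(query_vec, candidates, top_k=3):
    """Return the top_k (sentence, similarity) pairs, most similar first."""
    scored = [(sentence, cosine_similarity(query_vec, vec))
              for sentence, vec in candidates]
    scored.sort(key=lambda pair: pair[1], reverse=True)  # descending similarity
    return scored[:top_k]


# Toy embeddings (assumed values, for illustration only).
query = [1.0, 0.0, 1.0]
candidates = [
    ("What is on the person's feet?", [0.0, 1.0, 0.0]),
    ("What is written on the surfboard?", [1.0, 0.0, 0.9]),
    ("What is written on the hydrant?", [0.9, 0.3, 0.5]),
]

ranked = retrieve(query, candidates, top_k=2)
for sentence, score in ranked:
    print(f"{score:.3f}  {sentence}")
```

With real embeddings, the same ranking applied before and after fine-tuning would yield the two columns of the table.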

More examples

2. Results on multiple questions for a single image

Figure A.1 extends Figure 4(a) in the main paper with additional examples of the network's ability to perform different recognition tasks depending on the question. The network often fails on questions involving tasks that are difficult to learn from image-level annotations alone (e.g., object detection). More results can be found at the link below.


Image 1	Image 2	Image 3	Image 4
Q: What is the color of the horse? DPPnet: brown	Q: Is the beach crowded? DPPnet: yes	Q: What is this place? DPPnet: train station	Q: Is this person wearing a tie? DPPnet: yes
Q: Is the man on the horse skinny? DPPnet: no	Q: What is on the ground? DPPnet: sand	Q: Is this a black and white photo? DPPnet: yes	Q: Is he wearing glasses? DPPnet: yes
Q: Does this man like horseback riding? DPPnet: yes	Q: How many people catching the ball? DPPnet: 5 (2)	Q: How many trains? DPPnet: 1 (5)	Q: Is the man dancing? DPPnet: no

Figure A.1. Results of the proposed algorithm on multiple questions for a single image. Each question requires a different type and/or level of understanding of the corresponding input image to find the correct answer. Answers in blue are correct while answers in red are incorrect.

More examples

3. Results on a single common question for multiple images

Figure A.2 extends Figure 4(b) in the main paper with additional examples showing that the network performs the recognition task determined by the question reasonably well across different images. More results can be found at the link below.

Q: What game is he playing?

DPPnet: baseball	DPPnet: tennis	DPPnet: wii	DPPnet: frisbee

Q: Where is the horse?

DPPnet: sidewalk	DPPnet: street	DPPnet: in field	DPPnet: behind fence

Figure A.2. Results of the proposed algorithm on a single common question for multiple images. Given the question shown above each row of images, the network performs the relevant recognition task reasonably well. Answers in blue are correct while answers in red are incorrect.

More examples

Image Question Answering using Convolutional Neural Network
with Dynamic Parameter Prediction