Visual dialog is the task of answering a series of inter-dependent questions about an input image, and it often requires resolving visual references among the questions. This problem differs from visual question answering (VQA), which relies on spatial attention (a.k.a. visual grounding) estimated from an image and question pair. We propose a novel attention mechanism that exploits past visual attentions to resolve the current reference in the visual dialog scenario. The proposed model is equipped with an associative attention memory storing a sequence of previous (attention, key) pairs. From this memory, the model retrieves the previous attention that is most relevant to the current question, taking recency into account, in order to resolve potentially ambiguous references. The model then merges the retrieved attention with a tentative one to obtain the final attention for the current question; specifically, we use dynamic parameter prediction to combine the two attentions conditioned on the question. Through extensive experiments on a new synthetic visual dialog dataset, we show that our model significantly outperforms the state of the art (by ~16 percentage points) in situations where visual reference resolution plays an important role. Moreover, the proposed model achieves superior performance (~2 percentage points improvement) on the Visual Dialog dataset, despite having significantly fewer parameters than the baselines.

Overall Architecture: Attention-based Encoder-Decoder

Figure 1. Overall architecture of the proposed network. An input (question, history, image) is encoded into a vector which is decoded to an answer by the answer decoder. The gray box represents the proposed attention process.

Attention with Attention Memory

Figure 2. Attention process for the visual dialog task. (a) The tentative and relevant attentions are first obtained independently and then dynamically combined depending on the question embedding. (b) The two boxes represent the memory, which contains attentions and their corresponding keys. The question embedding is projected into the key space and compared with the keys using inner products, denoted by crossed circles, to generate an address vector. The address vector is then used as weights for computing a weighted average of all memory entries, which gives the retrieved memory entry.
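
The memory read described in Figure 2(b) can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the query is assumed to be already projected into the key space, the address vector is assumed to be a softmax over the inner products, attention maps are flattened lists, and `recency_weight` is a hypothetical scalar penalty standing in for the sequential preference, whose exact form is not given here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def retrieve_attention(query_key, memory_keys, memory_attentions,
                       recency_weight=0.0):
    """Read one attention map from an (attention, key) memory.

    query_key         -- question embedding, assumed already projected
                         into the key space
    memory_keys       -- one key vector per past dialog step, oldest first
    memory_attentions -- one flattened attention map per past dialog step
    recency_weight    -- hypothetical penalty per step of age (assumption,
                         not the paper's formulation)
    """
    n = len(memory_keys)
    scores = []
    for i, key in enumerate(memory_keys):
        # Inner product between the projected question and the stored key.
        score = sum(q * k for q, k in zip(query_key, key))
        age = n - 1 - i  # 0 for the most recent entry
        scores.append(score - recency_weight * age)

    # Address vector: normalized weights over memory entries.
    address = softmax(scores)

    # Weighted average of all stored attention maps.
    size = len(memory_attentions[0])
    retrieved = [
        sum(w * att[j] for w, att in zip(address, memory_attentions))
        for j in range(size)
    ]
    return address, retrieved


if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0]]
    attentions = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
    address, retrieved = retrieve_attention([5.0, 0.0], keys, attentions)
    print(address, retrieved)
```

Because the address vector sums to one, the retrieved map stays normalized whenever the stored maps are. Combining it with the tentative attention through dynamic parameter prediction is a separate step that this sketch does not cover.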

MNIST Dialog Dataset

Figure 3. Example from MNIST Dialog. Each pair consists of an image (left) and a set of sequential questions with answers (right). You can download the dataset generation code and the generated images with metadata.


Figure 4. Results on MNIST Dialog. Answer prediction accuracy [%] of all models for all questions (left) and accuracy curves of four models at different dialog steps (right). +H and +SEQ represent the use of history embeddings in models and addressing with sequential preference, respectively.

Figure 5. Characteristics of dynamically predicted weights for attention combination. Dynamic weights are computed from 1,500 random samples at dialog step 3 and plotted by t-SNE. Each figure presents clusters formed by different semantics of questions. (left) Clusters generated by different question types. (middle) Subclusters formed by types of spatial relationships in attribute questions. (right) Subclusters formed by ways of specifying targets in counting questions; cluster sub_targets contains questions whose target digits are included in the targets of the previous question.

Figure 6. Qualitative analysis on MNIST Dialog. Given an input image and a series of questions with their visual grounding history, we present the memory-retrieved and final attentions for the current question in the second and third columns, respectively. The proposed network correctly attends to the target reference and predicts the correct answer. The last two columns present the manually modified attention and the final attention obtained from the modified attention, respectively. This experiment shows the consistency of the transformation between attentions and the semantic interpretability of our model.

Figure 7. Experimental results on VisDial. We show the number of parameters, mean reciprocal rank (MRR), recall@k and mean rank (MR). +H and ATT indicate use of history embeddings in prediction and attention mechanism, respectively.
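
For readers unfamiliar with the retrieval metrics named in Figure 7, the following minimal sketch shows how they are commonly computed from the rank of the ground-truth answer among the candidate answers. The function name `retrieval_metrics` and the cutoffs `ks=(1, 5, 10)` are assumptions made for illustration; they are not taken from the evaluation code used in the paper.

```python
def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute MRR, recall@k and mean rank.

    gt_ranks -- 1-indexed rank of the ground-truth answer for each question
    ks       -- cutoffs for recall@k (illustrative choice)
    """
    n = len(gt_ranks)
    # Mean reciprocal rank: higher is better, 1.0 means always ranked first.
    mrr = sum(1.0 / r for r in gt_ranks) / n
    # Mean rank: lower is better.
    mean_rank = sum(gt_ranks) / n
    # Recall@k: fraction of questions whose ground truth is in the top k.
    recall = {k: sum(1 for r in gt_ranks if r <= k) / n for k in ks}
    return {"mrr": mrr, "mean_rank": mean_rank, "recall": recall}


if __name__ == "__main__":
    print(retrieval_metrics([1, 2, 4, 10]))
```

Note that MRR and recall@k improve as they increase, whereas mean rank improves as it decreases.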


Visual Reference Resolution using Attention Memory
Paul Hongsuck Seo, Andreas Lehrmann, Bohyung Han, Leonid Sigal

Please refer to our paper for more details.