Visual attention plays an important role in understanding images and has proven effective for generating natural language descriptions of images. On the other hand, recent studies show that language associated with an image can steer visual attention in the scene during our cognitive process. Inspired by this, we introduce a text-guided attention model for image captioning, which learns to drive visual attention using associated captions. For this model, we propose an exemplar-based learning approach that retrieves captions associated with each image from the training data and uses them to learn attention on visual features. Our attention model can describe the detailed state of a scene by effectively distinguishing small or confusable objects. We validate our model on the MS-COCO Captioning benchmark and achieve state-of-the-art performance on standard metrics.

Overall Architecture

Figure 1. Overall architecture for image captioning with text-guided attention. Given an input image, we first retrieve the top-k candidate captions (CC) from the training data using both visual similarity and caption consensus scores. During training, we randomly select one of them as the guidance caption; at test time, we use all candidates as guidance captions. The text-guided attention layer (T-ATT) computes an attention weight map in which regions relevant to the given guidance caption receive higher attention weights. A context vector is obtained by aggregating the image feature vectors weighted by the attention map. Finally, the LSTM decoder generates an output caption from the context vector.
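
The attention step described in Figure 1 can be illustrated with a minimal NumPy sketch. This is not the released implementation: the additive (tanh) scoring function, the function name text_guided_attention, the projection matrices W_img, W_txt, w_att, and all dimensions are assumptions chosen for illustration. The sketch only shows the general idea of scoring each image region against a guidance caption feature, normalizing the scores with a softmax, and aggregating the region features into a context vector.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def text_guided_attention(region_feats, caption_feat, W_img, W_txt, w_att):
    """Hypothetical text-guided attention layer (illustrative only).

    region_feats: (R, D_img) image feature vectors, one per region
    caption_feat: (D_txt,)   feature vector of the guidance caption
    W_img: (D_img, H), W_txt: (D_txt, H), w_att: (H,)  assumed parameters
    Returns the attention weights (R,) and the context vector (D_img,).
    """
    img_proj = region_feats @ W_img          # (R, H) regions in joint space
    txt_proj = caption_feat @ W_txt          # (H,)   caption in joint space
    joint = np.tanh(img_proj + txt_proj)     # (R, H) caption broadcast to regions
    scores = joint @ w_att                   # (R,)   one relevance score per region
    weights = softmax(scores)                # (R,)   attention weight map
    context = weights @ region_feats         # (D_img,) weighted sum of regions
    return weights, context


# Toy example with random values; all sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
R, D_img, D_txt, H = 49, 8, 6, 4
region_feats = rng.normal(size=(R, D_img))
caption_feat = rng.normal(size=D_txt)
W_img = rng.normal(size=(D_img, H))
W_txt = rng.normal(size=(D_txt, H))
w_att = rng.normal(size=H)

weights, context = text_guided_attention(
    region_feats, caption_feat, W_img, W_txt, w_att
)
```

In this sketch, setting w_att to zeros makes every score equal, so the weights become uniform and the context vector reduces to the mean of the region features, which corresponds to a uniform attention baseline.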


The proposed algorithm outperforms other state-of-the-art methods.

1. Results on MS-COCO test split

Figure 2. Performance on MS-COCO test split. Numbers in red and blue denote the best and second-best algorithms, respectively.

2. Results on MS-COCO Image Captioning Challenge

Figure 3. Performance on MS-COCO Image Captioning Challenge. Numbers in red and blue denote the best and second-best algorithms, respectively.

Qualitative Results

Figure 4. Qualitative results of our text-guided attention model. The two images in each example are the input image (left) and the attention map (right). The three captions below each image pair are the guidance caption (top), the caption generated by our proposed model (middle), and the caption generated by the uniform attention model (bottom). The more appropriate expressions generated by our attention model are marked in bold blue, and the corresponding phrases in the other two captions are marked in bold black.


Text-guided Attention Model for Image Captioning
Jonghwan Mun, Minsu Cho, Bohyung Han
In AAAI, 2017
[arXiv preprint] [Bibtex]


Check out the GitHub repository.


This work is funded by Samsung Electronics Co. (DMC R&D Center).


[Vinyals et al. 2015] Vinyals, O.; Toshev, A.; Bengio, S.; and Erhan, D. 2015. Show and tell: A neural image caption generator. In CVPR.
[Mao et al. 2015] Mao, J.; Xu, W.; Yang, Y.; Wang, J.; Huang, Z.; and Yuille, A. 2015. Deep captioning with multimodal recurrent neural networks (m-RNN). In ICLR.
[Donahue et al. 2015] Donahue, J.; Anne Hendricks, L.; Guadarrama, S.; Rohrbach, M.; Venugopalan, S.; Saenko, K.; and Darrell, T. 2015. Long-term recurrent convolutional networks for visual recognition and description. In CVPR.
[Fang et al. 2015] Fang, H.; Gupta, S.; Iandola, F.; Srivastava, R. K.; Deng, L.; Dollár, P.; Gao, J.; He, X.; Mitchell, M.; Platt, J. C.; et al. 2015. From captions to visual concepts and back. In CVPR.
[Xu et al. 2015] Xu, K.; Ba, J.; Kiros, R.; Cho, K.; Courville, A.; Salakhutdinov, R.; Zemel, R.; and Bengio, Y. 2015. Show, attend and tell: Neural image caption generation with visual attention. In ICML.
[You et al. 2015] You, Q.; Jin, H.; Wang, Z.; Fang, C.; and Luo, J. 2016. Image captioning with semantic attention. In CVPR.
[Wu et al. 2015] Wu, Q.; Shen, C.; Liu, L.; Dick, A.; and van den Hengel, A. 2016. What value do explicit high level concepts have in vision to language problems? In CVPR.