ReSTR: Convolution-free Referring Image Segmentation Using Transformers

Abstract

Referring image segmentation is an advanced semantic segmentation task where the target is not a predefined class but is described in natural language. Most existing methods for this task rely heavily on convolutional neural networks, which, however, have trouble capturing long-range dependencies between entities in the language expression and are not flexible enough for modeling interactions between the two modalities. To address these issues, we present the first convolution-free model for referring image segmentation using transformers, dubbed ReSTR. Since it extracts features of both modalities through transformer encoders, it can capture long-range dependencies between entities within each modality. Also, ReSTR fuses features of the two modalities with a self-attention encoder, which enables flexible and adaptive interactions between the two modalities in the fusion process. The fused features are fed to a segmentation module, which works adaptively according to the image and language expression at hand. ReSTR is evaluated and compared with previous work on all public benchmarks, where it outperforms all existing models.

Overall Architecture of ReSTR

Figure 1. (Left) Overall architecture of ReSTR. (a) The feature extractors for the two modalities are each composed of transformer encoders and operate independently. (b) The multimodal fusion encoder consists of two transformer encoders: the visual-linguistic encoder and the linguistic-seed encoder. (c) The coarse-to-fine segmentation decoder transforms a patch-level prediction into a pixel-level prediction. (Right) Transformer encoder used in all encoders and the composition of the coarse-to-fine segmentation decoder.
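The core idea behind fusing modalities with self-attention can be illustrated with a minimal sketch. This is not the ReSTR implementation: it assumes a single attention head, identity query/key/value projections, toy 2-dimensional tokens, and it covers only the step of letting visual and linguistic tokens attend to each other in one sequence. The names `self_attention`, `visual_tokens`, and `word_tokens` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head self-attention with identity Q/K/V projections.

    Assumption for illustration only: a real transformer layer uses learned
    projections, multiple heads, residual connections, and an MLP.
    """
    dim = len(tokens[0])
    scale = 1.0 / math.sqrt(dim)
    weights = []
    for q in tokens:
        # Scaled dot product between this query and every key.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is an attention-weighted sum of all value tokens.
    outputs = [
        [sum(w * v[d] for w, v in zip(row, tokens)) for d in range(dim)]
        for row in weights
    ]
    return outputs, weights


# Toy inputs (made up): 4 visual patch tokens and 3 word tokens.
visual_tokens = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
word_tokens = [[1.0, 0.2], [0.8, 0.0], [0.9, 0.1]]

# Joint self-attention: both modalities are concatenated into one sequence,
# so every patch can attend to every word and vice versa.
fused, attn = self_attention(visual_tokens + word_tokens)
fused_visual = fused[: len(visual_tokens)]
```

Because every token attends to the full concatenated sequence, a patch token can draw on any word regardless of its position in the expression, which is the long-range, cross-modal interaction the caption refers to.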

Experimental results

1. Performance comparison with other methods

Table 1. Quantitative results on four datasets in IoU (%). DCRF denotes post-processing with DenseCRF [21]. The best results are in bold, while second-best ones are underlined.
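For readers unfamiliar with the metric, the following is a minimal sketch of intersection-over-union between a predicted and a ground-truth binary mask, reported in percent. How scores are aggregated over a dataset (per-image mean versus cumulative intersection over cumulative union) is not stated here, so this only shows the per-mask quantity. The function name `mask_iou` and the convention for two empty masks are assumptions.

```python
def mask_iou(pred, target):
    """IoU (%) between two binary masks given as equally sized nested lists."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    # Assumed convention: two empty masks count as a perfect match.
    return 100.0 * inter / union if union else 100.0


pred = [[1, 1, 0],
        [0, 1, 0]]
target = [[1, 0, 0],
          [0, 1, 1]]
score = mask_iou(pred, target)  # 2 shared pixels out of 4 in the union
```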

2. Relationship between language expression length and performance

Table 2. Performance according to the length of the referring expression on Gref, UNC, UNC+ and ReferIt in IoU (%). The best results are in bold, while second-best ones are underlined.
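A length-wise breakdown like this can be produced by grouping test samples by the word count of their expression and averaging the score within each group. The sketch below is only illustrative: the bucket boundaries, the sample data, the whitespace tokenization, and the per-sample averaging are all assumptions, not the protocol used for the table.

```python
from collections import defaultdict


def bucket_label(n_words, boundaries):
    """Map a word count to a label such as '1-2', '3-5', or '6+'."""
    lower = 1
    for upper in boundaries:
        if n_words <= upper:
            return f"{lower}-{upper}"
        lower = upper + 1
    return f"{lower}+"


def iou_by_length(samples, boundaries):
    """Average IoU (%) per expression-length bucket.

    samples: iterable of (expression, iou_percent) pairs.
    boundaries: sorted inclusive upper bounds of the buckets (hypothetical).
    """
    groups = defaultdict(list)
    for expression, iou in samples:
        n_words = len(expression.split())  # naive whitespace tokenization
        groups[bucket_label(n_words, boundaries)].append(iou)
    return {label: sum(vals) / len(vals) for label, vals in groups.items()}


# Made-up samples for illustration only.
samples = [
    ("the dog", 80.0),
    ("man in red shirt", 60.0),
    ("left cat", 70.0),
    ("the woman standing behind the white table", 40.0),
]
result = iou_by_length(samples, boundaries=(2, 5))
```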

3. Ablation studies on Gref val set

Table 3. Performance for ablation study of ReSTR on Gref val set. # layers denotes the number of transformer layers in the multimodal fusion encoder. w share denotes weight sharing of the multimodal fusion encoder.

4. In-depth Analysis of ReSTR

Figure 2. The variants of the multimodal fusion encoder based on transformer architecture. (a) Self-attention fusion encoder on all sequences in parallel. (b) Independent fusion encoder between the visual features and the class seed embedding. (c) Indirect conjugating fusion encoder between the visual features and the class seed embedding.

Table 4. (a) Averaged attention score (%) of the visual and linguistic features to the class seed embedding at each transformer layer of VME on Gref train set. (b) Performance of the variants of the multimodal fusion encoder on Gref val set in IoU (%). (c) Comparison of computations and performance with recent methods. MACs are computed with an input image of 320 × 320. † denotes the fusion encoder with weight sharing. The best results are in bold, while second-best ones are underlined.
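One possible reading of the attention score in (a) is the share of the class seed embedding's attention mass that falls on visual tokens versus linguistic tokens within a layer. The sketch below computes that share for a single attention row. The token ordering (visual tokens first, then linguistic tokens), the normalization, and the function name `attention_share` are assumptions; the exact definition behind the table may differ.

```python
def attention_share(seed_attention, num_visual, num_linguistic):
    """Split one attention row of the class seed into modality shares (%).

    seed_attention: attention weights from the class seed to each token,
    assumed to be ordered as visual tokens followed by linguistic tokens.
    """
    visual = sum(seed_attention[:num_visual])
    linguistic = sum(seed_attention[num_visual:num_visual + num_linguistic])
    total = visual + linguistic
    return 100.0 * visual / total, 100.0 * linguistic / total


# Made-up attention row over 3 visual tokens and 2 linguistic tokens.
row = [0.1, 0.1, 0.2, 0.3, 0.3]
visual_pct, linguistic_pct = attention_share(row, num_visual=3, num_linguistic=2)
```

Averaging these shares over all samples and heads for each layer would give a per-layer summary in the spirit of the table.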

5. Qualitative results

Figure 3. (a) Qualitative results of ReSTR on Gref val set. (b) Visualization examples of ReSTR according to different language expression queries for an image on Gref val set.

Acknowledgements

We thank Manjin Kim and Sehyun Hwang for fruitful discussions. This work was supported by the MSRA Collaborative Research Program, and the NRF grant and the IITP grant funded by the Ministry of Science and ICT, Korea (NRF2021R1A2C3012728, IITP-2020-0-00842, No.2019-0-01906 Artificial Intelligence Graduate School Program-POSTECH).
