Abstract

This paper studies semi-supervised learning of semantic segmentation, which assumes that only a small portion of the training images are labeled and the rest remain unlabeled. The unlabeled images are usually assigned pseudo labels for use in training, which, however, risks degrading performance because of confirmation bias toward errors in the pseudo labels. We present a novel method that resolves this chronic issue of pseudo labeling. At the heart of our method lies the error localization network (ELN), an auxiliary module that takes an image and its segmentation prediction as input and identifies pixels whose pseudo labels are likely to be wrong. ELN makes semi-supervised learning robust against inaccurate pseudo labels by disregarding label noise during training, and it can be naturally integrated with self-training and contrastive learning. Moreover, we introduce a new learning strategy for ELN that simulates plausible and diverse segmentation errors during its training to enhance its generalization. Our method is evaluated on PASCAL VOC 2012 and Cityscapes, where it outperforms all existing methods in every evaluation setting.

Overall Architecture

Figure 1. Our semi-supervised learning framework incorporating ELN. It employs two segmentation networks: the student (s), which will be our final model, and the teacher (t), which generates pseudo labels. The student is trained using the pseudo labels of the teacher in two different ways, self-training and contrastive learning. Specifically, the decoder has two heads, one for segmentation (Seg) and the other for feature embedding (Proj); self-training and contrastive learning are applied to the outputs of the Seg and Proj heads, respectively. The teacher is then updated as an exponential moving average (EMA) of the student. ELN makes both self-training and contrastive learning robust against noise in the pseudo labels by identifying and disregarding pixels whose pseudo labels are likely to be noisy.
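The two mechanisms named in the caption, the EMA teacher update and the exclusion of unreliable pixels from the self-training loss, can be sketched in NumPy as below. This is a minimal illustration rather than the released implementation: the function names, the dictionary-of-arrays parameter layout, the decay value, and the boolean keep mask standing in for ELN's output are all assumptions.

```python
import numpy as np

def ema_update(teacher, student, decay=0.99):
    """Move each teacher parameter toward the student's value (in place).

    teacher, student: dicts mapping parameter names to arrays (assumed layout).
    decay: EMA momentum; 0.99 is an illustrative value, not taken from the paper.
    """
    for name, s_param in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * s_param
    return teacher

def masked_cross_entropy(logits, pseudo_labels, keep_mask):
    """Mean cross-entropy over pixels whose pseudo labels are trusted.

    logits:        (H, W, C) student segmentation outputs
    pseudo_labels: (H, W) integer class ids produced by the teacher
    keep_mask:     (H, W) bool, True where the pseudo label is kept
    """
    # Numerically stable log-softmax over the class axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Log-probability assigned to the pseudo label at every pixel.
    picked = np.take_along_axis(log_probs, pseudo_labels[..., None], axis=-1)[..., 0]
    n_kept = keep_mask.sum()
    if n_kept == 0:
        return 0.0  # no trusted pixel in this image
    return float(-(picked * keep_mask).sum() / n_kept)
```

Pixels outside the keep mask contribute nothing to the loss, so an error in their pseudo labels cannot be reinforced by self-training.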

Figure 2. Training ELN along with the main segmentation network and the auxiliary decoders. (left) The main segmentation network is trained with the ordinary cross-entropy loss Lsup, whereas the auxiliary decoders are trained with constrained cross-entropy losses Laux so that they remain inferior to the main segmentation network and their predictions intentionally contain plausible and diverse errors. (right) All predictions from the decoders are used as training input to ELN, which learns to localize errors in the predictions. Note that ELN and the other components are trained simultaneously, although their training processes are drawn separately in this figure for clarity.
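The supervision for ELN in the right panel can be sketched as follows: on labeled images, each decoder's prediction is compared with the ground truth to obtain a per-pixel binary target, and ELN is trained to reproduce that target. The encoding (1 for a correct pixel, 0 for an error), the ignore index 255, the unweighted binary cross-entropy, and the function names are assumptions made for illustration. The constrained losses Laux are not covered here.

```python
import numpy as np

def error_target(pred_labels, gt_labels, ignore_index=255):
    """Build the binary target ELN learns to predict.

    Returns (target, valid): target is 1.0 where the prediction matches
    the ground truth and 0.0 where it is wrong; valid is False on pixels
    whose ground truth carries the ignore index.
    """
    valid = gt_labels != ignore_index
    correct = (pred_labels == gt_labels) & valid
    return correct.astype(np.float32), valid

def binary_cross_entropy(prob_correct, target, valid, eps=1e-7):
    """Mean binary cross-entropy over valid pixels.

    prob_correct: (H, W) probabilities in [0, 1] that a pixel is correct,
    standing in for the output of ELN.
    """
    p = np.clip(prob_correct, eps, 1.0 - eps)
    loss = -(target * np.log(p) + (1.0 - target) * np.log(1.0 - p))
    n_valid = valid.sum()
    if n_valid == 0:
        return 0.0
    return float((loss * valid).sum() / n_valid)
```

Because the auxiliary decoders are deliberately weaker than the main network, their predictions supply many more error pixels (zeros in the target) than the main prediction alone would.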

Performance

1. Results on PASCAL VOC 2012

Table 1. mIoU on the PASCAL VOC 2012 val set with different labeled-unlabeled ratios. All results of our experiments are averaged over three different subsets of the same ratio.
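For reference, mean intersection-over-union (mIoU) can be computed from a confusion matrix as in the sketch below. It scores a single prediction and ground-truth pair for brevity, whereas benchmark evaluation normally accumulates the confusion matrix over the whole val set; the function name and the ignore index 255 are assumptions.

```python
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    """Mean IoU over the classes that occur in pred or gt."""
    valid = gt != ignore_index
    # Encode each (gt, pred) pair as one index and count occurrences.
    idx = gt[valid].astype(np.int64) * num_classes + pred[valid].astype(np.int64)
    conf = np.bincount(idx, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)  # rows: gt, columns: pred
    inter = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    present = union > 0
    if not present.any():
        return float("nan")  # nothing to score
    return float((inter[present] / union[present]).mean())
```

Averaging over three labeled subsets, as described in the caption, is then the plain mean of the three per-subset scores.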

2. Results on Cityscapes

Table 2. mIoU on the Cityscapes val set with different labeled-unlabeled ratios. All results of our experiments are averaged over three different subsets of the same ratio.

Qualitative results

1. Results on PASCAL VOC 2012

Figure 3. Qualitative results on the val set of PASCAL VOC 2012 for various ratios of labeled to unlabeled data.

Figure 4. Qualitative results on unlabeled training images of PASCAL VOC 2012 at a labeled ratio of 1/20. (a) Segmentation prediction from the main segmentation network. (b) Ground truth binary mask. (c) Binary mask predicted by ELN. (d) Segmentation prediction filtered by the predicted binary mask. Erroneous predictions, colored in white in (d), are not used as pseudo labels.
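The filtering step that produces panel (d) from panels (a) and (c) can be sketched as below. The threshold of 0.5, the use of 255 as the "not a pseudo label" value, and the function name are assumptions for illustration only.

```python
import numpy as np

IGNORE_INDEX = 255  # assumed marker for pixels excluded from pseudo labels

def filter_pseudo_labels(pseudo_labels, prob_correct, threshold=0.5):
    """Drop pseudo labels that the error mask flags as unreliable.

    pseudo_labels: (H, W) integer class ids, as in panel (a)
    prob_correct:  (H, W) probability that each pixel is correct,
                   standing in for the ELN output in panel (c)
    Returns the filtered label map, as in panel (d), and the keep mask.
    """
    keep = prob_correct >= threshold
    filtered = np.where(keep, pseudo_labels, IGNORE_INDEX)
    return filtered, keep
```

A loss that skips `IGNORE_INDEX` pixels then trains only on the pseudo labels that were kept.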

2. Results on Cityscapes

Figure 5. Qualitative results on the val set of Cityscapes for various ratios of labeled to unlabeled data.

Figure 6. Qualitative results on unlabeled training images of Cityscapes at a labeled ratio of 1/8. (a) Segmentation prediction from the main segmentation network. (b) Ground truth binary mask. (c) Binary mask predicted by ELN. (d) Segmentation prediction filtered by the predicted binary mask. Erroneous predictions, colored in white in (d), are not used as pseudo labels.

Paper

Semi-supervised Semantic Segmentation with Error Localization Network
Donghyeon Kwon, Suha Kwak
CVPR, 2022
[Paper] [Bibtex]

Code

Check our GitHub repository: GitHub Repository