Seunghoon Hong1,3 | Donghun Yeo1 | Suha Kwak2 | Honglak Lee3 | Bohyung Han1
1Dept. of Computer Science and Engineering, POSTECH, Korea
2Dept. of Information and Communication Engineering, DGIST, Korea
3Dept. of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, USA
We propose a novel algorithm for weakly supervised semantic segmentation based on image-level class labels only. In the weakly supervised setting, it is commonly observed that the trained model focuses on discriminative parts of objects rather than the entire object area. Our goal is to overcome this limitation with no additional human intervention, by retrieving videos relevant to the target class labels from a web repository and generating segmentation labels from the retrieved videos to simulate strong supervision for semantic segmentation. During this process, we take advantage of image classification with a discriminative localization technique to reject false alarms in the retrieved videos and to identify relevant spatio-temporal volumes within them. Although the entire procedure requires no additional supervision, the segmentation annotations obtained from videos are sufficiently strong to learn a model for semantic segmentation. The proposed algorithm substantially outperforms existing methods based on the same level of supervision and is even competitive with approaches relying on extra annotations.
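As a rough illustration of the discriminative localization step, the sketch below computes a class activation map (CAM) in the spirit of Zhou et al. (CVPR 2016), which can be used both to reject false alarms and to seed relevant regions in retrieved frames. The ResNet-18 backbone, class index, and score threshold are illustrative assumptions, not the implementation to be released.

```python
# A minimal sketch of class activation mapping (CAM) for discriminative
# localization. The backbone, class index, and threshold below are
# illustrative assumptions, not the released implementation.
import torch
import torchvision.models as models

model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT).eval()

features = {}
# Capture the last convolutional feature maps before global average pooling.
model.layer4.register_forward_hook(
    lambda module, inp, out: features.update(conv=out.detach()))

def class_activation_map(image, class_idx):
    """Return a coarse attention map for class_idx on image (1x3xHxW)."""
    with torch.no_grad():
        logits = model(image)
    fmap = features["conv"][0]            # (C, h, w) convolutional features
    weights = model.fc.weight[class_idx]  # (C,) classifier weights for class
    cam = torch.relu(torch.einsum("c,chw->hw", weights, fmap))
    cam = cam / (cam.max() + 1e-8)        # normalize to [0, 1]
    score = logits.softmax(dim=1)[0, class_idx].item()
    return cam.numpy(), score

# Frames whose classification score for the target class falls below a
# threshold can be rejected as false alarms; high-attention regions seed
# the spatio-temporal volumes extracted from the remaining frames.
image = torch.randn(1, 3, 224, 224)       # placeholder input frame
cam, score = class_activation_map(image, class_idx=283)
keep_frame = score > 0.5                  # hypothetical rejection threshold
```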
Our method substantially outperforms existing approaches based on image-level labels, improving the state-of-the-art result by more than 7% mIoU. It is even competitive with approaches based on extra supervision, which rely on additional human intervention. In particular, our method outperforms some approaches based on relatively stronger supervision (e.g., point supervision [1] and segmentation annotations of other classes [3]). These results show that segmentation annotations obtained from videos are strong enough to simulate the segmentation supervision missing in weakly annotated images.
Table 1. Evaluation results on PASCAL VOC 2012 test set.
[arXiv preprint]
The code and trained model for the proposed method will be released soon.
Below we present more comprehensive results of the algorithm described in the paper.
The following figures provide additional qualitative results of our method on the PASCAL VOC validation set. Compared to previous approaches using image-level labels only [9] or weakly annotated videos [8], our approach captures object boundaries and the extent of object areas more accurately. More qualitative results can be found at the link below.
Input image | Ground-truth | SEC [9] | MCNN [8] | Ours
The following videos illustrate examples of YouTube videos sanitized by our method, along with segmentation results on those videos produced by the procedure described in Section 3.3 of the main paper. We sampled a few videos per category for clear demonstration. The segmentation results from videos are sometimes inaccurate, containing noise caused by imprecise attention maps, background clutter, lack of motion, etc. Despite these challenges, our method effectively learns a segmentation model, since the noise in the generated annotations usually exhibits no consistent pattern, whereas the foreground areas corresponding to the target object are captured consistently. A simplified sketch of this label-generation step is given after the category list below.
aeroplane | bicycle |
bird | boat |
bottle | bus |
car | cat |
chair | cow |
dog | horse |
motorbike | person |
plant | sofa |
table | sheep |
train | tv/monitor |
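The sketch below illustrates one way to fuse a per-frame attention map with motion cues to produce the kind of noisy foreground labels shown above. The full procedure in Section 3.3 of the paper is more involved; the Farneback optical flow and the thresholds here are illustrative assumptions.

```python
# A simplified sketch of generating noisy foreground labels from a video
# frame by fusing the class attention map with motion cues. The optical
# flow method and all thresholds are illustrative assumptions.
import cv2
import numpy as np

def foreground_mask(prev_frame, frame, attention,
                    attn_thresh=0.4, flow_thresh=1.0):
    """Fuse an attention map (H, W) in [0, 1] with optical-flow magnitude.

    prev_frame and frame are consecutive BGR frames of the same spatial
    size as the attention map.
    """
    prev_gray = cv2.cvtColor(prev_frame, cv2.COLOR_BGR2GRAY)
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude = np.linalg.norm(flow, axis=2)  # per-pixel motion strength
    # Label a pixel as foreground only when the class attention and the
    # motion evidence agree; everything else is treated as background.
    return (attention > attn_thresh) & (magnitude > flow_thresh)

# Usage: mask = foreground_mask(prev_bgr, cur_bgr, cam_resized_to_frame)
```

Because such per-frame masks are individually noisy, it is the consistency of the foreground across frames, rather than any single mask, that provides a useful training signal.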