We propose a novel algorithm for weakly supervised semantic segmentation based on image-level class labels only. In the weakly supervised setting, it is commonly observed that a trained model focuses excessively on discriminative parts rather than the entire object area. Our goal is to overcome this limitation with no additional human intervention by retrieving videos relevant to the target class labels from a web repository and generating segmentation labels from the retrieved videos to simulate strong supervision for semantic segmentation. During this process, we take advantage of image classification with a discriminative localization technique to reject false alarms in the retrieved videos and to identify relevant spatio-temporal volumes within them. Although the entire procedure does not require any additional supervision, the segmentation annotations obtained from videos are sufficiently strong to learn a model for semantic segmentation. The proposed algorithm substantially outperforms existing methods based on the same level of supervision and is even competitive with approaches relying on extra annotations.

Architecture Overview

Figure 1. Overall framework of the proposed algorithm. Our algorithm first learns a model for classification and localization from a set of weakly annotated images. The learned model is used to eliminate noisy frames and generate coarse localization maps in web-crawled videos, where the per-pixel segmentation masks are obtained by solving a graph-based optimization problem. The obtained segmentations serve as annotations to train a decoder. Semantic segmentation on still images is then performed by applying the entire network to images.
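
The first two stages of the framework (rejecting noisy frames and producing coarse localization maps) can be sketched as follows. This is a minimal illustration and not the released implementation: it assumes the localization step behaves like a class activation map, that is, a weighted sum of the last convolutional feature channels using the classifier weights of the target class. The function names, the threshold value, and the toy inputs are all hypothetical.

```python
import numpy as np


def class_activation_map(features, class_weights):
    """Weighted sum of feature channels for one class (assumed CAM-style).

    features: (K, H, W) activations of the last convolutional layer.
    class_weights: (K,) classifier weights for the target class.
    Returns an (H, W) map rescaled to [0, 1].
    """
    cam = np.tensordot(class_weights, features, axes=1)  # shape (H, W)
    cam = np.maximum(cam, 0.0)  # keep only positive evidence
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def filter_frames(frame_scores, threshold=0.5):
    """Return indices of frames whose class score reaches the threshold.

    The threshold is a hypothetical value chosen for illustration.
    """
    scores = np.asarray(frame_scores, dtype=float)
    return np.flatnonzero(scores >= threshold)


# Toy example: 2 feature channels on a 2x2 grid, 3 video frames.
features = np.array([[[1.0, 0.0], [0.0, 0.0]],
                     [[0.0, 0.0], [0.0, 2.0]]])
weights = np.array([1.0, 1.0])
cam = class_activation_map(features, weights)
kept = filter_frames([0.9, 0.2, 0.7])
```

In this sketch the coarse map only marks where the classifier finds evidence for the class; the per-pixel masks described in the caption would still require the graph-based optimization step, which is not shown here.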


Our method substantially outperforms existing approaches based on image-level labels, improving the state-of-the-art result by more than 7% mIoU. Its performance is even competitive with approaches based on extra supervision, which rely on additional human intervention. In particular, our method outperforms some approaches based on relatively stronger supervision (e.g., point supervision [1] and segmentation annotations of other classes [3]). These results show that segmentation annotations obtained from videos are sufficiently strong to simulate the segmentation supervision missing in weakly annotated images.

Table 1. Evaluation results on PASCAL VOC 2012 test set.
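
For readers unfamiliar with the mIoU metric quoted above, the following sketch computes the mean intersection-over-union for a single label map. It is a simplified illustration: the function name and toy arrays are hypothetical, and a benchmark evaluation would typically accumulate the confusion matrix over the whole test set and skip ignored pixels, which this sketch does not do.

```python
import numpy as np


def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over the classes present in pred or target."""
    pred = np.asarray(pred).ravel()
    target = np.asarray(target).ravel()
    # confusion[i, j] counts pixels with ground truth i that were predicted as j.
    confusion = np.bincount(
        target * num_classes + pred, minlength=num_classes ** 2
    ).reshape(num_classes, num_classes)
    intersection = np.diag(confusion)
    union = confusion.sum(axis=0) + confusion.sum(axis=1) - intersection
    valid = union > 0  # ignore classes absent from both maps
    return float((intersection[valid] / union[valid]).mean())


# Toy 2x2 label maps with two classes.
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
score = mean_iou(pred, target, num_classes=2)
```

Here class 0 has an IoU of 1/2 and class 1 has an IoU of 2/3, so the mean is 7/12.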


Weakly Supervised Semantic Segmentation using Web-Crawled Videos
Seunghoon Hong, Donghun Yeo, Suha Kwak, Honglak Lee and Bohyung Han
In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017
  title={Weakly Supervised Semantic Segmentation using Web-Crawled Videos},
  author={Hong, Seunghoon and Yeo, Donghun and Kwak, Suha and Lee, Honglak and Han, Bohyung },
  journal = {CVPR},
[arxiv preprint]


The code and trained model for the proposed method will be released soon.

Supplementary Examples

The following sections present more comprehensive results of the algorithm described in the paper.

1. Qualitative Results on Semantic Segmentation

The following figures provide additional qualitative results of our method on the PASCAL VOC validation set. Compared to previous approaches using image-level labels only [9] or weakly annotated videos [8], the segmentation results of our approach capture object boundaries and the extent of the object area more accurately. More qualitative results can be found at the link below.

Input image | Ground-truth | SEC [9] | MCNN [8] | Ours
More images

2. Qualitative Results on Video Segmentation

The following videos illustrate examples of YouTube videos sanitized by our method, and segmentation results on the videos obtained by the procedure described in Section 3.3 of the main paper. We sampled a few videos for each category for clear demonstration. The segmentation results from videos are sometimes inaccurate and contain noise caused by inaccurate attention maps, background clutter, static motion, etc. Despite these challenges, our model effectively learns to segment, since the noise in the segmentation annotations usually has no clear pattern while the foreground areas corresponding to the target object are consistently captured by the segmentation results.
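
The intuition that unpatterned noise tends to cancel out while a consistent foreground survives can be illustrated with a toy simulation. This is only an analogy under strong assumptions: the noise is modeled as independent pixel flips with a hypothetical rate of 0.3, and the aggregation is an explicit per-pixel majority vote, whereas the actual model absorbs the noise implicitly during training. All sizes and values below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical ground-truth foreground: a square in a 16x16 frame.
truth = np.zeros((16, 16), dtype=bool)
truth[4:12, 4:12] = True

# 51 noisy masks: each pixel is flipped independently with probability 0.3.
flips = rng.random((51, 16, 16)) < 0.3
noisy_masks = truth[None] ^ flips

# Per-pixel majority vote across the noisy masks.
consensus = noisy_masks.mean(axis=0) > 0.5

# Fraction of pixels that agree with the ground truth.
single_accuracy = (noisy_masks[0] == truth).mean()
consensus_accuracy = (consensus == truth).mean()
```

Each individual mask is wrong on roughly 30% of its pixels, yet the consensus agrees with the ground truth far more often, because the errors do not repeat at the same locations across masks.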



  [1] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei. What’s the Point: Semantic Segmentation with Point Supervision. In ECCV, 2016.
  [2] J. Dai, K. He, and J. Sun. BoxSup: Exploiting bounding boxes to supervise convolutional networks for semantic segmentation. In ICCV, 2015.
  [3] S. Hong, J. Oh, H. Lee, and B. Han. Learning transferrable knowledge for semantic segmentation with deep convolutional neural network. In CVPR, 2016.
  [4] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016.
  [5] G. Papandreou, L.-C. Chen, K. Murphy, and A. L. Yuille. Weakly-and semi-supervised learning of a DCNN for semantic image segmentation. In ICCV, 2015.
  [6] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In ICCV, 2015.
  [7] P. O. Pinheiro and R. Collobert. From image-level to pixel-level labeling with convolutional networks. In CVPR, 2015.
  [8] P. Tokmakov, K. Alahari, and C. Schmid. Learning semantic segmentation with weakly-annotated videos. In ECCV, 2016.
  [9] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In ECCV, 2016.