WEAKLY- AND SEMI-SUPERVISED PANOPTIC SEGMENTATION
Qizhu Li*, Anurag Arnab*, Philip H.S. Torr
* Indicates equal contribution
INTRODUCTION
We present a weakly supervised model that jointly performs both semantic and
instance segmentation – a particularly relevant problem given the substantial cost of
obtaining pixel-perfect annotations for these tasks. In contrast to many popular instance
segmentation approaches based on object detectors, our method does not predict
any overlapping instances. Moreover, we are able to segment both "thing" and "stuff"
classes, and thus explain all the pixels in the image.
QUANTITATIVE RESULTS

Table 1. Semantic and instance segmentation performance on Pascal VOC with varying
levels of supervision. We obtain state-of-the-art results for both full and weak
supervision.

VOC sup.  COCO sup.  IoU   APvol_r  PQ
Weak      Weak       75.7  55.5     59.5
Weak      Full       75.8  56.1     59.8
Full      Weak       77.5  58.9     62.7
Full      Full       79.0  59.5     63.1

Table 2. Semantic segmentation results on the Cityscapes validation set. FS% is the
weakly supervised IoU as a fraction of the fully supervised IoU. Using the more
informative bounding-box cues for "thing" classes leads to a higher FS% than for
"stuff" classes, which are trained with only image-level tags.

Method      IoU (weak)  IoU (full)  FS%
Ours (th.)  68.2        70.4        96.9
Ours (st.)  60.2        72.4        83.1
Ours (all)  63.6        71.6        88.8

Table 3. Instance-level segmentation results on Cityscapes. On the validation set, we
report results for both "thing" (th.) and "stuff" (st.) classes. The online server,
which evaluates the test set, only computes APvol_r for "thing" classes. We compare
to other fully supervised methods which produce non-overlapping instances.

                                   |                    Validation                     |    Test
Method                             | APvol_r th.  APvol_r st.  APvol_r all  PQ th.  PQ st.  PQ all | APvol_r th.
Ours (weak, ImageNet init.)        | 17.0         33.1         26.3         35.8    43.9    40.5   | 12.8
Ours (full, ImageNet init.)        | 24.3         42.6         34.9         39.6    52.9    47.3   | 18.8
Ours (full, PSPNet [8] init.) [1]  | 28.6         52.6         42.5         42.5    62.1    53.8   | 23.4
Pixel Encoding [3]                 | 9.9          -            -            -       -       -      | 8.9
RecAttend [4]                      | -            -            -            -       -       -      | 9.5
InstanceCut [5]                    | -            -            -            -       -       -      | 13.0
DWT [6]                            | 21.2         -            -            -       -       -      | 19.4
SGN [7]                            | 29.2         -            -            -       -       -      | 25.0
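The panoptic quality (PQ) metric reported above can be sketched as follows. This is a minimal per-class illustration (segments represented as sets of pixel indices, matched greedily at IoU > 0.5, which guarantees a unique match), not the official evaluation code:

```python
def panoptic_quality(pred_segments, gt_segments):
    """Compute PQ for one class.

    pred_segments, gt_segments: lists of sets of pixel indices.
    A predicted segment matches a ground-truth segment when their
    IoU exceeds 0.5; PQ = sum of matched IoUs / (TP + FP/2 + FN/2).
    """
    matched_ious = []
    matched_gt = set()
    for p in pred_segments:
        for gi, g in enumerate(gt_segments):
            if gi in matched_gt:
                continue
            iou = len(p & g) / len(p | g)
            if iou > 0.5:  # IoU > 0.5 makes the match unique
                matched_ious.append(iou)
                matched_gt.add(gi)
                break
    tp = len(matched_ious)
    fp = len(pred_segments) - tp
    fn = len(gt_segments) - tp
    if tp + fp + fn == 0:
        return 0.0
    return sum(matched_ious) / (tp + 0.5 * fp + 0.5 * fn)
```

For example, one perfectly matched segment alongside one spurious prediction and one missed ground-truth segment gives PQ = 1.0 / (1 + 0.5 + 0.5) = 0.5.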
QUALITATIVE RESULTS

[Qualitative results: semantic and instance segmentation on Pascal VOC (weak, semi,
full); semantic segmentation on Cityscapes (weak, full); instance-level segmentation
on Cityscapes (weak, full).]
SEGMENTATION NETWORK STRUCTURE

[Architecture diagram: the input image (H × W × 3) is passed to a fully convolutional
network (the category-level segmentation module) and to a "thing" detector. The
category-level module outputs H × W × (Cs + Ct) class probabilities, and the detector
outputs Dt detections (5 × Dt). The box consistency term and the global term, each of
size H × W × (Cs + Dt), feed an instance CRF (the instance-level segmentation module)
that produces the instance-level segmentation, also of size H × W × (Cs + Dt). The
diagram distinguishes forward-and-backward computation paths from forward-only ones.]
We use the network architecture proposed in our previous fully supervised work [1],
which produces non-overlapping instances. Each of the Dt detections (a variable number
per image) defines a possible "thing" instance. We assume that there can only be a
single instance of each "stuff" class in an image. Therefore, there can be (Cs + Dt)
instances per image which we need to label.
The box consistency term psi_Box encourages pixels inside a bounding box B_i (given by
the detector for "things", or covering the whole image for "stuff") to associate with
the i-th instance:

    psi_Box(V_k = i) = { s_i * Q_k(l_i)   if k in B_i
                       { 0                otherwise

where s_i is the detection score and Q_k(l_i) is the category-level segmentation
probability of pixel k taking the class label l_i of instance i.
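As a sketch (not the authors' released code), the box consistency term could be realised with NumPy as below; `Q`, `boxes`, `labels`, and `scores` are illustrative names for the segmentation probabilities, instance boxes, instance class labels, and detection scores:

```python
import numpy as np

def box_consistency(Q, boxes, labels, scores):
    """Compute psi_Box[k, i] = s_i * Q_k(l_i) inside box B_i, else 0.

    Q:      (H, W, C) category-level class probabilities per pixel.
    boxes:  list of (y0, x0, y1, x1) boxes, one per instance
            (whole-image boxes for "stuff" instances).
    labels: class index l_i of each instance.
    scores: detection score s_i of each instance.
    Returns an (H, W, num_instances) array of potentials.
    """
    H, W, _ = Q.shape
    psi = np.zeros((H, W, len(boxes)))
    for i, ((y0, x0, y1, x1), l, s) in enumerate(zip(boxes, labels, scores)):
        # Only pixels inside B_i receive a non-zero potential.
        psi[y0:y1, x0:x1, i] = s * Q[y0:y1, x0:x1, l]
    return psi
```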
The global term psi_Global handles poor detection localisation:

    psi_Global(V_k = i) = Q_k(l_i)
We use the same CRF formulation as our earlier work [1], with densely connected
pairwise terms [2]:

    E(V = v) = - sum_{i=1}^{N} ln( w_1 psi_Box(v_i) + w_2 psi_Global(v_i) + eps )
               + sum_{i<j}^{N} psi_Pairwise(v_i, v_j)
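A minimal sketch of the unary part of this energy (the dense Gaussian pairwise term of [2] is omitted); `psi_box` and `psi_global` are assumed precomputed per pixel and instance, and `w1`, `w2`, `eps` stand in for the weighting constants:

```python
import numpy as np

def unary_energy(psi_box, psi_global, w1=1.0, w2=1.0, eps=1e-6):
    """Per-pixel, per-instance unary energy:
    -ln(w1 * psi_Box + w2 * psi_Global + eps).

    psi_box, psi_global: (N, I) arrays over N pixels and I instances.
    eps keeps the logarithm finite when both potentials are zero.
    """
    return -np.log(w1 * psi_box + w2 * psi_global + eps)

def unary_labelling(psi_box, psi_global, **kw):
    """Pairwise-free MAP estimate: argmin of the unary energy per pixel."""
    return np.argmin(unary_energy(psi_box, psi_global, **kw), axis=1)
```

With the pairwise term included, the full energy would instead be minimised by mean-field inference as in [1, 2].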
[Qualitative comparison: (a) input image; (b) weakly supervised model; (c) fully
supervised model.]
[1] A. Arnab et al. Pixelwise instance segmentation with a dynamically instantiated network. In CVPR, 2017.
[2] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[3] J. Uhrig et al. Pixel-level encoding and depth layering for instance-level semantic labeling. In GCPR, 2016.
[4] M. Ren and R. S. Zemel. End-to-end instance segmentation with recurrent attention. In CVPR, 2017.
[5] A. Kirillov et al. InstanceCut: from edges to instances with multicut. In CVPR, 2017.
[6] M. Bai and R. Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[7] S. Liu et al. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
[8] H. Zhao et al. Pyramid scene parsing network. In CVPR, 2017.

Project page: qizhuli.github.io/publication/weakly-supervised-panoptic-segmentation/
Code release: github.com/qizhuli/Weakly-Supervised-Panoptic-Segmentation
WEAKLY- AND SEMI-SUPERVISED TRAINING

"THING" BRANCH: bounding-box annotations are converted into coarse foreground masks
(FM) by running MCG and GrabCut on each box.
"STUFF" BRANCH: image-level tags are used to train a multi-label classifier, whose
class activation maps (CAM) weakly localise the "stuff" classes.
The outputs of the two branches are merged into pseudo ground truth, on which the
segmentation network is trained.
ITERATIVE TRAINING: the trained network's predictions are merged with the weak cues
to produce better pseudo ground truth, and the network is retrained; this loop is
repeated n times.
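A toy sketch of the merging step, under simple assumptions (label maps are H × W integer arrays, an "ignore" value of 255 marks pixels excluded from the loss, and all names are illustrative, not the authors' implementation):

```python
import numpy as np

IGNORE = 255  # pixels excluded from the training loss

def merge_pseudo_gt(stuff_labels, thing_masks, thing_classes):
    """Merge weak cues into a pseudo ground-truth label map.

    stuff_labels:  (H, W) label map from CAMs (IGNORE where unconfident).
    thing_masks:   list of (H, W) boolean foreground masks (MCG/GrabCut).
    thing_classes: class id of each thing mask.
    Thing masks overwrite stuff labels; where two thing masks overlap,
    the weak cues conflict, so the pixel is marked IGNORE.
    """
    pseudo = stuff_labels.copy()
    claimed = np.zeros(stuff_labels.shape, dtype=bool)
    for mask, cls in zip(thing_masks, thing_classes):
        overlap = mask & claimed
        pseudo[mask] = cls
        pseudo[overlap] = IGNORE  # conflicting cues: ignore in the loss
        claimed |= mask
    return pseudo
```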
Figure 1. Approximate ground truth generated from image-level tags using weak
localisation cues from a multi-label classification network. Panels: (1a) input
image; (1b) localisation heatmaps; (1c) approximate ground truth.

Figure 2. Approximate ground truth generated from bounding boxes using coarse object
masks from MCG and GrabCut. Panels: (2a) bounding boxes; (2b) approximate semantic
ground truth; (2c) approximate instance ground truth.

Figure 3. (3a–3e): Using the output of the trained network, the initial approximate
ground truth (Iteration 0) can be iteratively refined. Black regions are "ignore"
labels over which the loss is not computed during training. Note that for instance
segmentation, permutations of instance labels of the same class are equivalent.
(3f): The panoptic quality (PQ) of our panoptic segmentation results improves
significantly with iterative training. Panels: (3a) input image; (3b) iteration 0;
(3c) iteration 2; (3d) iteration 5; (3e) ground truth; (3f) PQ vs. iteration.
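Schematically, the iterative refinement shown in Figure 3 is a loop of the following shape; `train`, `predict`, and `merge` are placeholders for the real training, inference, and cue-merging components, not actual APIs:

```python
def iterative_training(initial_pseudo_gt, train, predict, merge, n_iters=5):
    """Alternate between training on pseudo ground truth and
    regenerating it from the trained network's predictions.

    train(pseudo_gt) -> model      (loss skips IGNORE pixels)
    predict(model)   -> predictions on the training images
    merge(predictions) -> refined pseudo ground truth
    """
    pseudo_gt = initial_pseudo_gt
    for _ in range(n_iters):
        model = train(pseudo_gt)
        predictions = predict(model)
        pseudo_gt = merge(predictions)
    return pseudo_gt
```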