
Interactive Full Image Segmentation by Considering All Regions Jointly

Eirikur Agustsson

Google Research

[email protected]

Jasper R. R. Uijlings

Google Research

[email protected]

Vittorio Ferrari

Google Research

[email protected]

I) Input image with extreme points provided by annotator

II) Machine predictions from extreme points

III) Corrective scribbles provided by annotator

IV) Machine predictions from extreme points and corrective scribbles

Figure 1. Illustration of our interactive full image segmentation workflow. First (I) the annotator marks extreme points. Then (II) our

model (Sec. 3) uses them to generate a segmentation. This is presented to the annotator, after which we iterate: (III) the annotator makes

corrections using scribbles (Sec. 4), and (IV) our model uses them to update the predicted segmentation (Sec. 3).

Abstract

We address interactive full image annotation, where the

goal is to accurately segment all object and stuff regions in

an image. We propose an interactive, scribble-based an-

notation framework which operates on the whole image to

produce segmentations for all regions. This enables shar-

ing scribble corrections across regions, and allows the an-

notator to focus on the largest errors made by the machine

across the whole image. To realize this, we adapt Mask-

RCNN [22] into a fast interactive segmentation framework

and introduce an instance-aware loss measured at the pixel-

level in the full image canvas, which lets predictions for

nearby regions properly compete for space. Finally, we

compare to interactive single object segmentation on the

COCO panoptic dataset [11, 27, 34]. We demonstrate that

our interactive full image segmentation approach leads to a

5% IoU gain, reaching 90% IoU at a budget of four extreme

clicks and four corrective scribbles per region.

1. Introduction

We address the task of interactive full image segmen-

tation, where the goal is to obtain accurate segmentations

for all object and stuff regions in the image. Full im-

age annotations are important for many applications such

as self-driving cars [17, 19], navigation assistance for the

blind [51], and automatic image captioning [25, 56]. How-

ever, creating such datasets requires large amounts of hu-

man labor. For example, annotating a single image took

1.5 hours for Cityscapes [17]. For COCO+stuff [11, 34],

annotating one image took 19 minutes (80 seconds per ob-

ject [34] plus 3 minutes for stuff regions [11]), which totals

39k hours for the 123k images. So there is a clear need for

faster annotation tools.

This paper proposes an efficient interactive framework

for full image segmentation (Fig. 1 and 2). Given an image,

an annotator first marks extreme points [41] on all object

and stuff regions. These provide a tight bounding box with

four boundary points for each region, and can be efficiently

collected (7s per region [41]). Next, the machine predicts

an initial segmentation for the full image based on these

extreme points. Afterwards we present the whole image

with the predicted segmentation to the annotator and iterate

between (A) the annotator providing scribbles on the errors

of the current segmentation, and (B) the machine updating

the predicted segmentation accordingly (Fig. 1).

Our approach of full image segmentation brings several

advantages over modern interactive single object segmenta-

tion methods [24, 30, 31, 32, 36, 37, 60]: (I) It enables the

annotator to focus on the largest errors in the whole image,

rather than on the largest error of one given object. (II) It

shares annotations across multiple object and stuff regions.

In our approach, a single scribble correction specifies the

extension of one region and the shrinkage of neighboring re-


gions (Sec. 3.2 and Fig. 3). In interactive single object seg-

mentation, corrections are used for the given target object

only. (III) Our approach lets regions compete for space in

the common image canvas, ensuring that a pixel is assigned

exactly one label (Sec. 3.1). In single object segmentation

instead, pixels along boundary regions may be assigned to

multiple objects, leading to contradictory labels, or to none,

leading to holes. At the same time, since regions compete,

corrections of one region influence nearby regions in our

framework (e.g. Fig. 6). (IV) Instead of only annotating

object instances, we also annotate stuff regions, capturing

important classes such as pavement or river.

We realize interactive full image segmenta-

tion by adapting Mask-RCNN [22] (Fig. 2). We start from

extreme points [41], which define bounding boxes. There-

fore we bypass the Region Proposal Network of Mask-

RCNN and use these boxes directly to extract Region-of-

Interest (RoI) features (Sec. 3.1). Afterwards, we incorpo-

rate extreme points and scribble annotations inside Mask-

RCNN by concatenating them to the RoI-features. We en-

code annotations in a way that allows to share them across

regions, enabling advantage (II) above (Sec. 3.2). Finally,

while Mask-RCNN [22] predicts each mask separately, we

project the mask predictions back on the pixels in the com-

mon image canvas (Sec. 3.1). Then we define a new loss which

is instance-aware yet lets predictions properly compete for

space, enabling advantage (III) above (Sec. 3.3).

To the best of our knowledge, all deep interactive single

object segmentation methods [24, 30, 31, 32, 36, 37, 60]

are based on Fully Convolutional Networks (FCNs) [14,

35, 46]. We chose to start from Mask-RCNN [22] for ef-

ficiency. FCN-style interactive segmentation methods con-

catenate corrections to a crop of an RGB image, and pass

that through a large neural net (e.g. ResNet-101 [23]). This

requires a full inference pass for each region at each correc-

tion iteration. In our Mask-RCNN framework instead, the

RGB image is first passed through the large backbone net-

work. Afterwards, for each region only a pass over the final

segmentation head is required (Fig. 2). This is much faster

and more memory efficient (Sec. 6).

We perform thorough experiments in increasingly com-

plex settings: (1) Single object segmentation: On the

COCO dataset [34], our Mask-RCNN style architecture

achieves similar performance to DEXTR [37] on single ob-

ject segmentation from extreme points [41]. (2) Full image

segmentation: We evaluate on the COCO panoptic chal-

lenge [11, 27, 34] the task of segmenting all object and stuff

regions in an image, starting from extreme points. Our idea

to share annotations across regions in combination with our

pixel-wise loss yields a +3% IoU gain over an interactive

single region segmentation baseline. (3) Interactive full im-

age segmentation: On the COCO panoptic challenge, we

demonstrate the combined effects of our three advantages

(I)-(III) above: at a budget of four extreme clicks and four

scribbles per region, we get a total +5% IoU gain over the

interactive single region segmentation baseline.

2. Related Work

Semantic segmentation from weakly labeled data. Many

works address semantic segmentation by training from

weakly labeled data, such as image-level labels [28, 44, 58],

point-clicks [5, 6, 13, 57], boxes [26, 37, 41] and scrib-

bles [33, 59]. Boxes can be efficiently annotated using ex-

treme points [41] which can also be used as an extra sig-

nal for generating segmentations [37, 41]. This is related to our work, as our method also starts from extreme points for each region.

However, the above methods operate from annotations col-

lected before any machine processing. Our work instead is

in the interactive scenario, where the annotator iteratively

provides corrective annotations for the current machine seg-

mentation.

Interactive object segmentation. Interactive object seg-

mentation is a long standing research topic. Most classi-

cal approaches [3, 4, 8, 47, 18, 16, 21, 38] formulate object

segmentation as energy minimization on a regular graph de-

fined over pixels, with unary potentials capturing low-level

appearance properties and pairwise or higher-order poten-

tials encouraging regular segmentation outputs.

Starting from Xu et al. [60], recent methods ad-

dress interactive object segmentation with deep neural net-

works [24, 30, 31, 32, 36, 37, 60]. These works build

on Fully Convolutional architectures such as FCNs [35] or

Deeplab [14]. They input the RGB image plus two extra

channels for object and non-object corrections, and output

a binary mask.

In [15] they perform interactive object segmentation in

video. They use Deeplab [14] to create a pixel-wise embed-

ding space. Annotator corrections are used to create a near-

est neighbor classifier on top of this embedding, enabling

quick updates of the object predictions.

Finally, Polygon-RNN [1, 12] is an interesting alterna-

tive approach. Instead of predicting a mask, it uses a re-

current neural net to predict polygon vertices. Corrections

made by the annotator are used by the machine to refine its

vertex predictions.

Interactive full image segmentation. Recently, [2] pro-

posed Fluid Annotation, which also addresses the task of

full image annotation. Our work shares the spirit of fo-

cusing annotator effort on the biggest errors made by the

machine across the whole image. However, [2] uses Mask-

RCNN [22] to create a large pool of fixed segments and then

provides an efficient interface for the annotator to rapidly

select which of these should form the final segmentation. In

contrast, in our work all segments are created from the ini-

tial extreme points and are all part of the final annotation.

Our method then enables correcting the shape of segments


[Figure 2 diagram: input image → Backbone (ResNet) → backbone features Z → RoI crop using boxes b_i from extreme points → RoI features z_i concatenated with annotations s_i → Segmentation Head → region-level logit predictions l_i → canvas projection → image-level logit predictions L → softmax → full image segmentation P]

Figure 2: Our proposed region-based model for interactive full image segmentation (see Sec. 3.1 for details). We start from Mask-RCNN [22], but use user-provided boxes (from extreme points) instead of a box proposal network for RoI cropping, and concatenate the RoI features with annotator-provided corrective scribbles. Instead of predicting binary masks for each region, we project all region predictions into the common image canvas, where they compete for space. The network is trained end-to-end with a novel pixel-wise loss for

the full image segmentation task (see Sec. 3.3).

to precisely match object boundaries.

Several older works on interactive segmentation handle

multiple labels in a single image [39, 40, 50, 53]. We

present the first interactive deep learning framework which

does this. Moreover, in contrast to those works, we explic-

itly demonstrate the benefits of interactive full image seg-

mentation over interactive single object segmentation.

Other works on interactive annotation. In [48] they com-

bine a segmentation network with a language module to al-

low a human to correct the segmentation by typing feedback

in natural language, such as “there are no clouds visible in

this image”. The work of [42] annotates bounding boxes

using only human verification, while [29] trained agents to

determine whether it is more efficient to verify or draw a

bounding box. The avant-garde work of [49] had a ma-

chine dispatching many labeling questions to annotators, in-

cluding whether an object class is present, box verification,

box drawing, and finding missing instances of a particular

class in the image. In [54] they estimate the informative-

ness of having an image label, a box, or a segmentation

for an image, which they use to guide an active learning

scheme. Finally, several works tackle fine-grained classi-

fication through attributes interactively provided by annota-

tors [9, 43, 7, 55].

3. Our interactive segmentation model

This section describes our model which we use to pre-

dict a segmentation from extreme points and scribble cor-

rections (Fig. 1). We first discuss the model architecture

(Sec. 3.1). We then describe how we feed annotations to the

model (extreme points and scribble corrections, Sec. 3.2).

Finally, we describe model training with our new loss func-

tion (Sec. 3.3).

3.1. Model architecture

Our model is based on Mask-RCNN [22]. In Mask-

RCNN inference is done as follows: (1) An input image X

is passed through a deep neural network backbone such as

ResNet [23], producing a feature map Z. (2) A specialized

network module (RPN [45]) predicts box proposals based

on Z. (3) These box proposals are used to crop out Region-

of-Interest (RoI) features z from Z with a RoI cropping

layer (RoI-align [22]). (4) Then each RoI feature z is fed

into three separate network modules which predict a class

label, refined box coordinates, and a segmentation mask.

Fig. 2 illustrates how we adapt Mask-RCNN [22] for in-

teractive full image segmentation. In particular, our net-

work takes three types of inputs: (1) an image X of

size W × H × 3; (2) N annotation maps S1, · · · ,SN of

size W × H (for extreme points and scribble corrections,

Sec. 3.2); and (3) N boxes b1, · · · ,bN determined by the

extreme points provided by annotators. Here N is the num-

ber of regions that we want to segment, which is determined

by the annotator, and which may vary per image.

As in Mask-RCNN, an image X is fed into our backbone

architecture (ResNet [23]) to produce feature map Z of size (1/r)W × (1/r)H × C, where C is the number of feature channels

and r is a reduction factor. Both C and r are determined by

the choice of backbone architecture.

In contrast to Mask-RCNN, we already have boxes


b1, · · · ,bN , so we do not need a box proposal module. In-

stead, we use each box bi directly to crop out an RoI feature

zi from feature map Z. All features zi have the same fixed

size w×h×C (i.e. w and h are only dependent on the RoI

cropping layer). We concatenate to this the corresponding

annotation map si which is described in Sec. 3.2, and obtain

a feature map vi, which is of size w × h × (C + 2). Using vi, our network predicts a logit map li of size

w′ × h′ which represents the prediction of a single mask.

While Mask-RCNN stops at such mask predictions and pro-

cesses them with a sigmoid to obtain binary masks, we want

to have predictions influence each other. Therefore we use

the boxes b1, · · · ,bN to re-project the logit predictions of

all masks li back into the original image resolution which

results in N prediction maps Li. We concatenate these pre-

diction maps into a single tensor L of size W ×H×N . For

each pixel, we then obtain region probabilities P of dimen-

sion W ×H ×N by applying a softmax to the logits,

(P^{(x,y)}_1, ..., P^{(x,y)}_N) = softmax(L^{(x,y)}_1, ..., L^{(x,y)}_N),   (1)

where P^{(x,y)}_i denotes the probability that pixel (x, y) is assigned to region i. This makes multiple nearby regions compete for space in the common image canvas.
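As an illustration of this projection-and-softmax step, the following minimal NumPy sketch (not the released implementation; the function name, the nearest-neighbour resizing, and the large negative fill value outside each box are our own assumptions) pastes per-region logit maps back into the image canvas and applies Eq. (1) pixel-wise.

import numpy as np

def project_and_softmax(logit_maps, boxes, W, H):
    """Paste per-region logit maps into the image canvas and softmax per pixel.

    logit_maps: list of N arrays of shape (h', w') with region logits l_i.
    boxes:      list of N integer boxes (x0, y0, x1, y1) in image coordinates.
    Returns P of shape (H, W, N): per-pixel region probabilities (Eq. 1).
    """
    N = len(logit_maps)
    # Very low logit outside a region's box, so it gets ~zero probability there.
    L = np.full((H, W, N), -1e9, dtype=np.float32)
    for i, (li, (x0, y0, x1, y1)) in enumerate(zip(logit_maps, boxes)):
        bh, bw = y1 - y0, x1 - x0
        # Nearest-neighbour resize of the (h', w') logit map to the box size.
        ys = (np.arange(bh) * li.shape[0] / bh).astype(int)
        xs = (np.arange(bw) * li.shape[1] / bw).astype(int)
        L[y0:y1, x0:x1, i] = li[ys[:, None], xs[None, :]]
    # Pixel-wise softmax over the N regions (Eq. 1). Pixels covered by no box
    # end up with near-uniform probabilities in this sketch.
    L = L - L.max(axis=-1, keepdims=True)          # numerical stability
    expL = np.exp(L)
    return expL / expL.sum(axis=-1, keepdims=True)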

3.2. Incorporating annotations

Our model in Fig. 2 concatenates RoI features z with

annotation map s. We now describe how we create s. First,

for each region i we create a positive annotation map Si

which is of the same size W ×H as the image. We choose

the annotation map to be binary and we create it by pasting

all extreme points and corrective scribbles for region i onto

it. Extreme points are represented by a circle which is 6

pixels in diameter. Scribbles are 3 pixels wide.

For each region i, we collapse all annotations which do not belong to it into a single negative annotation map ∑_{j≠i} S_j. Then, we concatenate the positive and negative annotation maps into a two-channel annotation map F_i:

F_i := ( S_i, ∑_{j≠i} S_j ),   (2)

which is illustrated in Fig. 3. Finally, we apply RoI-

align [22] to Fi using box bi to obtain the desired cropped

annotation map si.
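The construction of Eq. (2) can be sketched as follows. This is a simplified NumPy version under our own assumptions: the binary full-image maps S_i are taken as already rendered (extreme points as small discs, scribbles as thin lines), and a plain box crop with nearest-neighbour resizing stands in for RoI-align.

import numpy as np

def region_annotation_map(S, i):
    """Two-channel annotation map F_i of Eq. (2).

    S: array of shape (N, H, W) with binary annotation maps S_1..S_N.
    Returns F_i of shape (H, W, 2): the positive channel S_i and the
    negative channel sum_{j != i} S_j (clipped back to a binary map).
    """
    positive = S[i]
    negative = np.clip(S.sum(axis=0) - S[i], 0, 1)
    return np.stack([positive, negative], axis=-1)

def crop_to_roi(F, box, out_size):
    """Crude stand-in for RoI-align: crop F to the box and resize to out_size."""
    x0, y0, x1, y1 = box
    crop = F[y0:y1, x0:x1]
    h, w = out_size
    ys = (np.arange(h) * crop.shape[0] / h).astype(int)
    xs = (np.arange(w) * crop.shape[1] / w).astype(int)
    return crop[ys[:, None], xs[None, :]]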

The way we construct Fi enables the sharing of all an-

notation information across multiple object and stuff re-

gions in the image. The negative annotations for one re-

gion are formed by collecting the positive annotations of

all other regions. In contrast, in single object segmentation

works [3, 8, 18, 16, 21, 24, 30, 31, 32, 38, 36, 37, 47, 60]

both positive and negative annotations are made only on the

target object and they are never shared, so they only have an

effect on that one object.

3.3. Training

Training data. As training data, we have ground-truth

masks for all objects and stuff regions in all images. We

represent the (non-overlapping) N ground truth masks of

an image X with region indices. This results in a map Y

of dimension W × H, which assigns each pixel X^{(x,y)} to a region Y^{(x,y)} ∈ {1, ..., N}.

Pixel-wise loss. Standard Mask-RCNN is trained with Bi-

nary Cross Entropy (BCE) losses for each mask prediction

separately. This means that there is no direct interaction

between adjacent masks, and they might even overlap. In-

stead, we propose a novel instance-aware loss which lets

predictions compete for space in the original image canvas.

In particular, as described in Sec. 3.1 we project all

region-specific logits into a single image-level logit tensor

L, which is softmaxed into region assignment probabili-

ties P of size W ×H ×N .

As described above, the ground-truth segmentation is

represented by Y with values in {1, · · · , N}, which spec-

ifies for each pixel its region index. Since we simulate the

extreme points from the ground-truth masks, there is a di-

rect correspondence between the region assignment proba-

bilities P1, · · · ,PN and Y. Thus, we can train our network

end-to-end for the Categorical Cross Entropy (CCE) loss for

the region assignments:

L_pixelwise = ∑_{(x,y)} −log P^{(x,y)}_{Y^{(x,y)}}   (3)

We note that while the CCE loss is commonly used in fully

convolutional networks for semantic segmentation [14, 35,

46], we instead use it in an architecture based on Mask-

RCNN [22]. Furthermore, usually the loss is defined over a

fixed number of classes [14, 35, 46], whereas we define it

over the number of regions N . This number of regions may

vary per image.

The loss in (3) is computed over the pixels in the full

resolution common image canvas. Consequently, larger re-

gions have a greater impact on the loss. However, in our

experiments we measure Intersection-over-Union (IoU) be-

tween ground-truth masks and predictions, which considers

all regions equally independent of their size. Therefore we

weigh the terms in (3) as follows. For each pixel we find the

smallest box bi which contains it, and reweigh the loss for

that pixel by the inverse of the size of bi. This causes each

region to contribute to the loss approximately equally.
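A minimal sketch of this weighted loss is given below, assuming the probabilities P from Eq. (1) and the ground-truth map Y are available as NumPy arrays (function and variable names are ours, not from the released code; region indices are zero-based here, unlike the paper's 1..N).

import numpy as np

def pixelwise_loss(P, Y, boxes):
    """Weighted categorical cross-entropy of Eq. (3).

    P:     (H, W, N) per-pixel region probabilities from the softmax.
    Y:     (H, W) ground-truth region index in {0, ..., N-1} per pixel.
    boxes: list of N boxes (x0, y0, x1, y1), one per region.
    Each pixel is weighted by 1 / area of the smallest box containing it,
    so every region contributes roughly equally to the loss.
    """
    H, W, N = P.shape
    # Area of the smallest containing box per pixel (inf where no box covers it).
    smallest = np.full((H, W), np.inf)
    for (x0, y0, x1, y1) in boxes:
        area = (x1 - x0) * (y1 - y0)
        smallest[y0:y1, x0:x1] = np.minimum(smallest[y0:y1, x0:x1], area)
    weights = np.where(np.isfinite(smallest), 1.0 / smallest, 0.0)

    yy, xx = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    p_true = P[yy, xx, Y]                       # P^{(x,y)}_{Y^{(x,y)}}
    return np.sum(weights * -np.log(p_true + 1e-12))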

Our loss shares similarities with [10]. They used Fast-

RCNN [20] with selective search regions [52] and generate

a class prediction vector for each region. Then they project

this vector back into the image canvas using its correspond-

ing region, while resolving conflicts using a max operator.

In our work instead, we project a full logit map back into

the image (Fig. 2). Furthermore, while in [10] the number


[Figure 3 panels: Annotations | Positive channel | Negative channel]

Figure 3: We illustrate how we combine all annotations near a region (red) into two

annotation maps specific for that region. The colored regions denote the current pre-

dicted segmentation and the white boundaries depict true object boundaries. For the red

region, the extreme points and the single positive scribble are combined into a single

positive binary channel. All scribbles from other nearby regions are collected into a

single negative binary channel.

Scribble simulation

Figure 4: To simulate a corrective scribble,

we first sample an initial control point to in-

dicate which region we want to expand (yel-

low), followed by two control points (orange)

sampled uniformly from the error region.

of logit channels is equal to the number of classes C, in our

work it depends on the number of regions N , which may

vary per image.

3.4. Implementation details

The original implementation of Mask-RCNN [22] cre-

ates for each RoI feature mask predictions for all classes

that it is trained on. At inference time, it uses the predicted

class to select the corresponding predicted mask. Since we

build on Mask-RCNN, we also do this in our framework

for convenience of the implementation. During training we

use the class labels to train class-specific mask prediction

logits. During inference, for each region i we use the class

label predicted by Mask-RCNN to select which mask log-

its we use as li. Hence during inference time, we

have implicit class labels. However, class labels are never

exposed to the annotator and are considered to be irrelevant

for this paper.
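For illustration, the selection of the mask logits via the predicted class can be sketched as follows (a hypothetical helper with assumed array shapes, not the Mask-RCNN API).

import numpy as np

def select_region_logits(mask_logits, class_scores):
    """Pick the mask-logit map of the class predicted for this RoI (Sec. 3.4).

    mask_logits:  (num_classes, h', w') class-specific mask logits for one RoI.
    class_scores: (num_classes,) classification scores for the same RoI.
    Returns the (h', w') logit map used as l_i for that region.
    """
    return mask_logits[int(np.argmax(class_scores))]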

4. Annotations and their simulation

Our annotations consist of both extreme points and

scribble corrections. We chose scribble corrections [3, 8,

47] instead of click corrections [24, 30, 31, 32, 36, 37, 60]

as they are a more natural choice in our scenario. As we

consider multiple regions in an image, any annotation first

needs to indicate which region should be extended. With

scribbles one can start inside the region to be extended, fol-

lowed by a path which specifies how to extend the region.

In all our experiments we simulate annotations, follow-

ing previous interactive segmentation works [1, 12, 24, 30,

31, 32, 36, 37, 60].

Simulating extreme points. To simulate the extreme points

that the annotator provides at the beginning, we use the code

provided by [37].

Simulating scribble corrections. To simulate scribble cor-

rections during the interactive segmentation process, we

first need to select an error region. Error regions are de-

fined as a connected group of pixels of a ground-truth re-

gion which has been wrongly assigned to a different region

(Fig. 4). We assess the importance of an error region by

measuring how much segmentation quality (IoU) would im-

prove if it was completely corrected. We use this to create

annotator corrections on the most important error regions

(the exact way depends on the particular experiment, details

in Sec. 5).
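One possible way to enumerate and rank error regions is sketched below, using scipy.ndimage.label for connected components. For simplicity it only tracks the IoU improvement of the region the error belongs to; this simplification, and the names used, are our own assumptions rather than the authors' exact criterion.

import numpy as np
from scipy import ndimage

def ranked_error_regions(pred, gt):
    """Find error regions and rank them by the IoU gain a perfect fix would give.

    pred, gt: (H, W) integer maps of predicted / ground-truth region indices.
    Returns a list of (mask, gain) pairs, largest potential IoU gain first.
    """
    candidates = []
    for r in np.unique(gt):
        wrong = (gt == r) & (pred != r)              # missed pixels of region r
        components, n = ndimage.label(wrong)         # connected error regions
        inter = np.sum((gt == r) & (pred == r))
        union = np.sum((gt == r) | (pred == r))
        for c in range(1, n + 1):
            mask = components == c
            # IoU of region r if this error region were completely corrected.
            fixed_pred = np.where(mask, r, pred)
            inter_f = np.sum((gt == r) & (fixed_pred == r))
            union_f = np.sum((gt == r) | (fixed_pred == r))
            gain = inter_f / union_f - inter / union
            candidates.append((mask, gain))
    return sorted(candidates, key=lambda t: t[1], reverse=True)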

To correct an error, we need a scribble that starts inside

the ground-truth region and extends into the error region

(Fig. 1). We simulate such scribbles with a three-step pro-

cess, illustrated in Fig. 4: (1) first we randomly sample the

first point on the border of the error region that touches

the ground-truth region (yellow point in Fig. 4); (2) then we sample two more points uniformly inside the error region (orange points in Fig. 4); (3) we construct a scribble

as a smooth trajectory through these three points (using a

bezier curve). We repeat this process ten times, and keep the

longest scribble that is exclusively inside the ground-truth

region (while all simulated points are within the ground-

truth, the curve could cover parts outside the ground-truth).

5. Results

We use Mask-RCNN as basic segmentation framework

instead of Fully Convolutional architectures [14, 35, 46]

commonly used in single object segmentation works [24,

30, 31, 32, 36, 37, 60]. We first demonstrate in Sec. 5.1

that this is a valid choice by comparing to DEXTR [37] in

the non-interactive setting where we generate masks start-

ing from extreme points [41]. In Sec 5.2 we move to the full

image segmentation task and demonstrate improvements re-

sulting from sharing extreme points across regions and from

our new pixel-wise loss. Finally, in Sec. 5.3 we show results

on interactive full image segmentation, where we also share

scribble corrections across regions, and allow the annotator

to freely allocate scribbles to regions while considering the

whole image.


Method IoU

DEXTR [37] 82.1

DEXTR (released model) 81.9

Our single region model 81.6

Table 1: Performance on COCO (objects only). The accuracy of

our single region model is comparable to DEXTR [37].

X-points not shared X-points shared

Mask-wise loss 75.8 76.0

Pixel-wise loss 78.4 79.1

Table 2: Performance on the COCO Panoptic validation set when

predicting masks from extreme points (X-points). We vary the

loss and whether extreme points are shared across regions. The

top-left entry corresponds to our single region model, the

bottom-right entry corresponds to our full image model.

5.1. Single object segmentation

DEXTR. In DEXTR [37] they predict object masks from

four extreme points [41]. DEXTR is based on Deeplab-

v2 [14], using a ResNet-101 [23] backbone architecture and

a Pyramid Scene Parsing network [61] as prediction head.

As input they crop a bounding box out of the RGB im-

age based on the extreme points provided by the annotator.

The locations of the extreme points are Gaussian blurred

and fed as a heatmap to the network, concatenated to the

cropped RGB input. The DEXTR segmentation model ob-

tained state-of-the-art results on this task [37].

Details of our model. We compare DEXTR to a single ob-

ject segmentation variant of our model (single region

model). It uses the original Mask-RCNN loss, computed in-

dividually per mask, and does not share annotations across

regions. For fair comparison to DEXTR, here we also use

a ResNet-101 [23] backbone, which due to memory con-

straints limits the resolution of our RoI features to 14 × 14 pixels and our predicted mask to 33 × 33. Moreover, we

use their released code to generate simulated extreme point

annotations. In contrast to subsequent experiments, here we

also use the same Gaussian blurred heatmaps to input anno-

tations to our model as used in [37].

Dataset. We follow the experimental setup of [37] on

the COCO dataset [34], which has 80 object classes.

Models are trained on the 2014 training set and evalu-

ated on the 2017 validation set (previously referred to as

2014 minival). We measure performance in terms of

Intersection-over-Union averaged over all instances.

Results. Tab. 1 reports the original results from

DEXTR [37], our reproduction using their publicly released

model, and results of our single region model. Their

publicly released model and our model deliver very similar

results (81.9 and 81.6 IoU). This demonstrates that Mask-

RCNN-style models are competitive with commonly used

FCN-style models for this task.

5.2. Full image segmentation

Experimental setup. Given extreme points for each object

and stuff region, we now predict a full image segmentation.

We demonstrate the benefits of using our pixel-wise loss

(Sec. 3.3) and sharing extreme points across regions (i.e.

extreme points for one region are used as negative informa-

tion for nearby regions, Sec. 3.2).

Details of our model. In preliminary experiments we found

that RoI features of 14× 14 pixels resolution were limiting

accuracy when feeding in annotations into the segmentation

head. Therefore we increased both the RoI features and the

predicted mask to 41 × 41 pixels and switched to ResNet-

50 [23] due to memory constraints. Importantly, in all ex-

periments from now on our model uses the two-channel an-

notation maps described in Sec. 3.2.

Dataset. We perform our experiments on the COCO panop-

tic challenge dataset [11, 27, 34], which has 80 object

classes and 53 stuff classes. Since the final goal is to ef-

ficiently annotate data, we train on only 12.5% of the 2017

training set (15k images). We evaluate on the 2017 valida-

tion set and measure IoU averaged over all object and stuff

regions in all images.

Results. As Tab. 2 shows, our single region model

yields 75.8 IoU. It uses a mask-wise loss and does not share

extreme points across regions. When only sharing extreme

points, we get a small gain of +0.2 IoU. In contrast, when

only switching to our pixel-wise loss, results improve by

+2.6 IoU. Sharing extreme points is more beneficial in com-

bination with our new loss, yielding an additional improve-

ment of +0.7 IoU. Overall this model with both improve-

ments achieves 79.1 IoU, +3.3 higher than the single

region model. We call it our full image model.

5.3. Interactive full image segmentation

We now move to our final system for interactive full im-

age segmentation. We start from the segmentations from

extreme points made by our single region and full

image models from Sec. 5.2. Then we iterate between:

(A) adding scribble corrections by the annotator, and (B)

updating the machine segmentations accordingly.

Dataset and training. As before, we experiment on the

COCO panoptic challenge dataset and report results on the

2017 validation set. Since during iterations our models

input scribble corrections in addition to extreme points,

we train two new interactive models: single region

scribble and full image scribble. These mod-

els have the same architecture as their counterparts in

Sec. 5.2 (which input only extreme points), but are trained

differently. To create training data for one of these inter-


[Figure 5 plot: IoU vs #scribbles/region (0 to 8), comparing full image scribble (one scribble on average), full image scribble (one scribble per region), and single region scribble (one scribble per region)]

Figure 5: Results on the COCO Panoptic validation set for the

interactive full image segmentation task. We measure average IoU

vs the number of scribbles per region. We compare our full

image scribble model under two scribble allocation strate-

gies to the single region scribble baseline.

active models, we apply its counterpart to another 12.5%

of the 2017 training set. We generate simulated correc-

tive scribbles as described in Sec. 4 and train each model

on the combined extreme point and scribble annotations

(Sec. 3.2). We keep these models fixed throughout all it-

erations of interactive segmentation. Note how, in addition

to sharing extreme points as in Sec. 5.2, the full image

scribble model also shares scribble corrections across

regions.

Allocation of scribble corrections. When using our

single region scribble model, in every iteration

we allocate exactly one scribble to each region. Instead,

when using our full image scribble model we also

consider an alternative interesting strategy: one scribble

per region on average, but the annotator can freely allocate

these scribbles to the regions in the image. This enables the

annotator to focus efforts on the biggest errors across the

whole image, typically resulting in some regions receiving

multiple scribbles and some receiving none.
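Under our simulation, one round of this free allocation strategy could be sketched as follows, reusing the hypothetical ranked_error_regions and simulate_scribble helpers from the earlier sketches; the pooling-and-ranking logic shown here is an illustrative assumption, not the exact annotator model.

def one_free_allocation_round(pred, gt, gt_masks):
    """One simulated interaction round with free scribble allocation:
    N scribbles in total (one per region on average), spent on the error
    regions across the whole image with the largest potential IoU gain.

    pred, gt: (H, W) integer maps of predicted / ground-truth region indices.
    gt_masks: dict mapping region index -> (H, W) bool ground-truth mask.
    """
    budget = len(gt_masks)                      # one scribble per region on average
    scribbles = []
    for mask, gain in ranked_error_regions(pred, gt)[:budget]:
        region = int(gt[mask][0])               # region this error belongs to
        s = simulate_scribble(mask, gt_masks[region])
        if s is not None:
            scribbles.append((region, s))
    return scribbles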

Results. Fig. 5 shows annotation quality (IoU) vs cost

(number of scribbles per region). The two starting points

at zero scribbles are the same as the top-left and bottom-

right entries of Tab. 2 since they are made using the same

non-interactive models (from extreme points only).

We first compare single region scribble to

full image scribble while using the same alloca-

tion strategy: exactly one scribble per region. Fig. 5

shows that for both models accuracy rapidly improves

with more scribble corrections. However, full image

scribble always offers a better trade-off between an-

notation effort and segmentation quality, e.g. to reach

85% IoU it takes 4 scribbles per region for the single

region scribble model but only 2 scribbles for our

full image scribble model. Similarly, to reach

88% IoU it takes 7 scribbles vs 4 scribbles. This confirms

that the benefits of sharing annotations across regions and

of our pixel-wise loss persist also in the interactive setting.

We now compare the two scribble allocation strate-

gies on the full image scribble model. As Fig. 5

shows, using the strategy of freely allocating scribbles to

regions (one scribble on average) brings further efficiency

gains. full image scribble reaches a very high

90% IoU at just 4 scribbles per region on average. Reaching

this IoU instead requires allocating exactly 8 scribbles per

region with the other strategy. This demonstrates the benefits

of focusing annotation effort on the largest errors across the

whole image.

Overall, at a budget of four extreme clicks and four scrib-

bles per region, we get a total 5% IoU gain over single

region scribble (90% vs 85%). This gain is brought

by the combined effects of our contributions: sharing anno-

tations across regions, the pixel-wise loss which lets regions

compete on the common image canvas, and the free scribble

allocation strategy.

Fig. 6 shows various examples for how annotation pro-

gresses over iterations. Notice how in the first example, the

corrective scribble on the left bear induces a negative scrib-

ble for the rock, which in turn improves the segmentation

of the right bear. This demonstrates the benefit of sharing

scribble annotations and competition between regions.

6. Discussion

Mask-RCNN vs FCNs. Our work builds on Mask-

RCNN [22] rather than FCN-based models [14, 35, 46] be-

cause it is faster and requires less memory. To see this, we

can reinterpret Fig. 2 as an FCN-based model: ignore the

backbone network, replace the backbone features Z by the

RGB image, and make the segmentation head a full FCN.

At inference time, we need to do a forward pass through

the segmentation head for every region for every correction.

When using Mask-RCNN, the heavy ResNet [23] backbone

network is applied only once for the whole image, and then

only a small 4-layer segmentation head is applied to each re-

gion. For the FCN-style alternative instead, nothing can be

precomputed and the segmentation head itself is the heavy

ResNet. Hence our framework is much faster during inter-

active annotation.
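The amortization argument can be summarized by the following schematic loop (backbone and seg_head are placeholder callables, annotation_maps are the cropped two-channel maps s_i, and crop_to_roi / project_and_softmax refer to the earlier sketches; this is not the authors' code).

import numpy as np

def interactive_inference(image, boxes, annotation_maps, backbone, seg_head):
    """Schematic interactive loop: the heavy backbone runs once per image,
    and only the light segmentation head is re-run per region per correction."""
    H, W = image.shape[:2]
    Z = backbone(image)                          # expensive, computed once and cached
    while True:                                  # one iteration per correction round
        logit_maps = []
        for box, s in zip(boxes, annotation_maps):
            z = crop_to_roi(Z, box, out_size=(41, 41))   # cheap RoI crop
            v = np.concatenate([z, s], axis=-1)          # append annotation channels
            logit_maps.append(seg_head(v))               # small head, per region
        # The caller receives the full-image segmentation and may then update
        # annotation_maps in place with new corrective scribbles.
        yield project_and_softmax(logit_maps, boxes, W, H)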

During training, typically all intermediate network ac-

tivations are stored in memory. Crucially, for each region

distinct activations are generated in the segmentation head.

For FCN-style models this is a heavy ResNet and requires

lots of memory. This is why DEXTR [37] reports a maxi-

mal batch size of 5 regions. Therefore, it would be difficult

to train with our pixel-wise loss in an FCN-style model, as

that requires processing all regions in each image simulta-


[Figure 6 columns: input image with extreme points provided by annotator | machine predictions from extreme points | corrective scribbles (1 scribble/region) provided by annotator | machine predictions from 1 scribble/region and extreme points | final result from 9 scribbles/region and extreme points | ground-truth]

Figure 6: We show example results obtained by our system using the full image scribble model with a free allocation strategy.

The first two columns show the input image with extreme points and predictions. Column 3 shows the first annotation step with one scribble

correction per region on average, and column 4 shows the updated predictions. The last two columns compare the final result after 9 steps

(using 9 scribbles per region on average) with the COCO ground-truth segmentation.

neously (15 regions per image on average).

In fact, our Mask-RCNN-based architecture (Fig. 2) and

its reinterpretation as an FCN-based model span a contin-

uum. Its design space can be explored by varying the size of

the backbone and the segmentation head, as well as their in-

put and output resolution. We leave such exploration of the

trade-off between memory requirements, inference speed,

and model accuracy for future work.

Scribble and point simulations. Like other interactive seg-

mentation works [1, 12, 24, 30, 31, 32, 36, 37, 60], we sim-

ulate annotations. It remains to be studied how to best se-

lect the simulation parameters so that the models generalize

well to real human annotators. The optimal parameters will

likely depend on various factors, such as the desired anno-

tation quality and the accuracy of the provided corrections.
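For concreteness, one simple way to simulate a corrective scribble is sketched below: place the next scribble inside the largest connected component of mislabeled pixels for the chosen region. This is an assumed form, not the simulator used in the paper, whose parameters (scribble length, placement noise, etc.) are exactly what the paragraph above notes still need study; simulate_scribble is a hypothetical name.

```python
# Illustrative scribble simulation (assumed form): place the next corrective
# scribble inside the largest connected blob of wrongly-labelled pixels.
import numpy as np
from scipy import ndimage

def simulate_scribble(pred_mask: np.ndarray, gt_mask: np.ndarray, max_len: int = 20):
    """Return (pixels, is_positive): a crude corrective stroke and its polarity."""
    error = pred_mask != gt_mask                 # false positives and false negatives
    labels, n = ndimage.label(error)
    if n == 0:
        return [], True                          # nothing to correct
    sizes = ndimage.sum(error, labels, index=range(1, n + 1))
    largest = 1 + int(np.argmax(sizes))          # label id of the biggest error blob
    ys, xs = np.nonzero(labels == largest)
    # A false-negative blob (missing from the prediction) gets a positive scribble,
    # a false-positive blob gets a negative one.
    is_positive = bool(gt_mask[ys[0], xs[0]])
    order = np.argsort(xs)                       # crude left-to-right "stroke"
    step = max(1, len(order) // max_len)
    pixels = [(int(ys[i]), int(xs[i])) for i in order[::step][:max_len]]
    return pixels, is_positive
```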

7. Conclusions

We proposed an interactive annotation framework which

operates on the whole image to produce segmentations for

all object and stuff regions. Our key contributions derive

from considering the full image at once: sharing annota-

tions across regions, focusing annotator effort on the biggest

errors across the whole image, and a pixel-wise loss for

Mask-RCNN that lets regions compete on the common im-

age canvas. We have shown through experiments on the

COCO panoptic challenge dataset [11, 27, 34] that all the

elements we propose improve the trade-off between anno-

tation cost and quality, leading to a very high IoU of 90%

using just four extreme points and four corrective scribbles

per region (compared to 85% for the baseline).


References

[1] D. Acuna, H. Ling, A. Kar, and S. Fidler. Efficient interactive

annotation of segmentation datasets with polygon-rnn++. In

CVPR, 2018. 2, 5, 8

[2] M. Andriluka, J. R. R. Uijlings, and V. Ferrari. Fluid an-

notation: A human-machine collaboration interface for full

image annotation. In ACM Multimedia, 2018. 2

[3] X. Bai and G. Sapiro. Geodesic matting: A framework for

fast interactive image and video segmentation and matting.

IJCV, 2009. 2, 4, 5

[4] D. Batra, A. Kowdle, D. Parikh, J. Luo, and T. Chen. Inter-

actively co-segmentating topically related images with intel-

ligent scribble guidance. IJCV, 2011. 2

[5] A. Bearman, O. Russakovsky, V. Ferrari, and L. Fei-Fei.

What’s the point: Semantic segmentation with point super-

vision. In ECCV, 2016. 2

[6] S. Bell, P. Upchurch, N. Snavely, and K. Bala. Material

recognition in the wild with the materials in context database.

In CVPR, 2015. 2

[7] A. Biswas and D. Parikh. Simultaneous active learning of

classifiers & attributes via relative feedback. In CVPR, 2013.

3

[8] Y. Boykov and M. P. Jolly. Interactive graph cuts for optimal

boundary and region segmentation of objects in N-D images.

In ICCV, 2001. 2, 4, 5

[9] S. Branson, C. Wah, F. Schroff, B. Babenko, P. Welinder,

P. Perona, and S. Belongie. Visual recognition with humans

in the loop. In ECCV, 2010. 3

[10] H. Caesar, J. Uijlings, and V. Ferrari. Region-based semantic

segmentation with end-to-end training. In ECCV, 2016. 4

[11] H. Caesar, J. Uijlings, and V. Ferrari. COCO-stuff: Thing

and stuff classes in context. In CVPR, 2018. 1, 2, 6, 8

[12] L. Castrejon, K. Kundu, R. Urtasun, and S. Fidler. Annotat-

ing object instances with a polygon-rnn. In CVPR, 2017. 2,

5, 8

[13] D.-J. Chen, J.-T. Chien, H.-T. Chen, and L.-W. Chang. Tap

and shoot segmentation. In AAAI, 2018. 2

[14] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and

A. Yuille. Deeplab: Semantic image segmentation with deep

convolutional nets, atrous convolution, and fully connected

CRFs. IEEE Trans. on PAMI, 2017. 2, 4, 5, 6, 7

[15] Y. Chen, J. Pont-Tuset, A. Montes, and L. Van Gool. Blaz-

ingly fast video object segmentation with pixel-wise metric

learning. In CVPR, 2018. 2

[16] M.-M. Cheng, V. A. Prisacariu, S. Zheng, P. H. S. Torr, and

C. Rother. Densecut: Densely connected crfs for realtime

grabcut. Computer Graphics Forum, 2015. 2, 4

[17] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler,

R. Benenson, U. Franke, S. Roth, and B. Schiele. The

cityscapes dataset for semantic urban scene understanding.

CVPR, 2016. 1

[18] A. Criminisi, T. Sharp, C. Rother, and P. Perez. Geodesic

image and video editing. In ACM Transactions on Graphics,

2010. 2, 4

[19] A. Geiger, P. Lenz, C. Stiller, and R. Urtasun. Vision

meets robotics: The KITTI dataset. International Journal

of Robotics Research, 2013. 1

[20] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich fea-

ture hierarchies for accurate object detection and semantic

segmentation. In CVPR, 2014. 4

[21] V. Gulshan, C. Rother, A. Criminisi, A. Blake, and A. Zis-

serman. Geodesic star convexity for interactive image seg-

mentation. In CVPR, 2010. 2, 4

[22] K. He, G. Gkioxari, P. Dollar, and R. Girshick. Mask R-

CNN. In ICCV, 2017. 1, 2, 3, 4, 5, 7

[23] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning

for image recognition. In CVPR, 2016. 2, 3, 6, 7

[24] Y. Hu, A. Soltoggio, R. Lock, and S. Carter. A fully con-

volutional two-stream fusion network for interactive image

segmentation. Neural Networks, 2019. 1, 2, 4, 5, 8

[25] J. Johnson, A. Karpathy, and L. Fei-Fei. DenseCap: Fully

convolutional localization networks for dense captioning. In

CVPR, 2016. 1

[26] A. Khoreva, R. Benenson, J. Hosang, M. Hein, and

B. Schiele. Simple does it: Weakly supervised instance and

semantic segmentation. In CVPR, 2017. 2

[27] A. Kirillov, K. He, R. Girshick, C. Rother, and P. Dollar.

Panoptic segmentation. arXiv, 2018. 1, 2, 6, 8

[28] A. Kolesnikov and C. Lampert. Seed, expand and constrain:

Three principles for weakly-supervised image segmentation.

In ECCV, 2016. 2

[29] K. Konyushkova, J. Uijlings, C. Lampert, and V. Ferrari.

Learning intelligent dialogs for bounding box annotation. In

CVPR, 2018. 3

[30] H. Le, L. Mai, B. Price, S. Cohen, H. Jin, and F. Liu. In-

teractive boundary prediction for object selection. In ECCV,

2018. 1, 2, 4, 5, 8

[31] Z. Li, Q. Chen, and V. Koltun. Interactive image segmenta-

tion with latent diversity. In CVPR, 2018. 1, 2, 4, 5, 8

[32] J. Liew, Y. Wei, W. Xiong, S.-H. Ong, and J. Feng. Regional

interactive image segmentation networks. In ICCV, 2017. 1,

2, 4, 5, 8

[33] D. Lin, J. Dai, J. Jia, K. He, and J. Sun. Scribble-

Sup: Scribble-supervised convolutional networks for seman-

tic segmentation. In CVPR, 2016. 2

[34] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ra-

manan, P. Dollar, and C. Zitnick. Microsoft COCO: Com-

mon objects in context. In ECCV, 2014. 1, 2, 6, 8

[35] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional

networks for semantic segmentation. In CVPR, 2015. 2, 4,

5, 7

[36] S. Mahadevan, P. Voigtlaender, and B. Leibe. Iteratively

trained interactive segmentation. In BMVC, 2018. 1, 2, 4,

5, 8

[37] K.-K. Maninis, S. Caelles, J. Pont-Tuset, and L. Van Gool.

Deep extreme cut: From extreme points to object segmenta-

tion. In CVPR, 2018. 1, 2, 4, 5, 6, 7, 8

[38] N. S. Nagaraja, F. R. Schmidt, and T. Brox. Video segmen-

tation with just a few strokes. In ICCV, 2015. 2, 4

[39] C. Nieuwenhuis and D. Cremers. Spatially varying color

distributions for interactive multilabel segmentation. IEEE

Trans. on PAMI, 2013. 3

[40] C. Nieuwenhuis, S. Hawe, M. Kleinsteuber, and D. Cremers.

Co-sparse textural similarity for interactive segmentation. In

ECCV, 2014. 3


[41] D. P. Papadopoulos, J. R. Uijlings, F. Keller, and V. Ferrari.

Extreme clicking for efficient object annotation. In ICCV,

2017. 1, 2, 5, 6

[42] D. P. Papadopoulos, J. R. R. Uijlings, F. Keller, and V. Fer-

rari. We don’t need no bounding-boxes: Training object class

detectors using only human verification. In CVPR, 2016. 3

[43] A. Parkash and D. Parikh. Attributes for classifier feedback.

In ECCV, 2012. 3

[44] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained con-

volutional neural networks for weakly supervised segmenta-

tion. In ICCV, 2015. 2

[45] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: To-

wards real-time object detection with region proposal net-

works. In NIPS, 2015. 3

[46] O. Ronneberger, P. Fischer, and T. Brox. U-net: Convolu-

tional networks for biomedical image segmentation. In MIC-

CAI, 2015. 2, 4, 5, 7

[47] C. Rother, V. Kolmogorov, and A. Blake. Grabcut: Inter-

active foreground extraction using iterated graph cuts. In

SIGGRAPH, 2004. 2, 4, 5

[48] C. Rupprecht, I. Laina, N. Navab, G. D. Hager, and

F. Tombari. Guide me: Interacting with deep networks. In

CVPR, 2018. 3

[49] O. Russakovsky, L.-J. Li, and L. Fei-Fei. Best of both

worlds: human-machine collaboration for object annotation.

In CVPR, 2015. 3

[50] J. Santner, T. Pock, and H. Bischof. Interactive multi-label

segmentation. In ACCV, 2010. 3

[51] M. Serrao, S. Shahrabadi, M. Moreno, J. Jose, J. I. Ro-

drigues, J. M. Rodrigues, and J. H. du Buf. Computer vi-

sion and GIS for the navigation of blind persons in buildings.

Universal Access in the Information Society, 2015. 1

[52] J. R. R. Uijlings, K. E. A. van de Sande, T. Gevers, and

A. W. M. Smeulders. Selective search for object recognition.

IJCV, 2013. 4

[53] V. Vezhnevets and V. Konushin. GrowCut - interactive multi-

label N-D image segmentation by cellular automata. In

GraphiCon, 2005. 3

[54] S. Vijayanarasimhan and K. Grauman. What’s it going to

cost you?: Predicting effort vs. informativeness for multi-

label image annotations. In CVPR, 2009. 3

[55] C. Wah, G. Van Horn, S. Branson, S. Maji, P. Perona, and

S. Belongie. Similarity comparisons for interactive fine-

grained categorization. In CVPR, 2014. 3

[56] G. Wang, P. Luo, L. Lin, and X. Wang. Learning object in-

teractions and descriptions for semantic image segmentation.

In CVPR, 2017. 1

[57] T. Wang, B. Han, and J. Collomosse. Touchcut: Fast im-

age and video segmentation using single-touch interaction.

CVIU, 2014. 2

[58] Y. Wei, H. Xiao, H. Shi, Z. Jie, J. Feng, and T. S. Huang. Re-

visiting dilated convolution: A simple approach for weakly-

and semi-supervised semantic segmentation. In CVPR, 2018.

2

[59] J. Xu, A. G. Schwing, and R. Urtasun. Learning to segment

under various forms of weak supervision. In CVPR, 2015. 2

[60] N. Xu, B. Price, S. Cohen, J. Yang, and T. Huang. Deep

interactive object selection. In CVPR, 2016. 1, 2, 4, 5, 8

[61] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene

parsing network. In CVPR, 2017. 6
