
Deep Snake for Real-Time Instance Segmentation

Sida Peng^1  Wen Jiang^1  Huaijin Pi^1  Xiuli Li^2  Hujun Bao^1  Xiaowei Zhou^1*

^1 Zhejiang University   ^2 Deepwise AI Lab

Abstract

This paper introduces a novel contour-based approach named deep snake for real-time instance segmentation. Unlike some recent methods that directly regress the coordinates of the object boundary points from an image, deep snake uses a neural network to iteratively deform an initial contour to match the object boundary, which implements the classic idea of snake algorithms with a learning-based approach. For structured feature learning on the contour, we propose to use circular convolution in deep snake, which better exploits the cycle-graph structure of a contour compared against generic graph convolution. Based on deep snake, we develop a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation, which can handle errors in object localization. Experiments show that the proposed approach achieves competitive performances on the Cityscapes, KINS, SBD and COCO datasets while being efficient for real-time applications with a speed of 32.3 fps for 512×512 images on a 1080 Ti GPU. The code is available at https://github.com/zju3dv/snake/.

1. Introduction

Instance segmentation is the cornerstone of many computer vision tasks, such as video analysis, autonomous driving, and robotic grasping, which require both accuracy and efficiency. Most of the state-of-the-art instance segmentation methods [18, 27, 5, 19] perform pixel-wise segmentation within a bounding box given by an object detector [36], which may be sensitive to inaccurate bounding boxes. Moreover, representing an object shape as dense binary pixels generally results in costly post-processing.

An alternative shape representation is the object contour, which is a set of vertices along the object silhouette. In contrast to the pixel-based representation, a contour is not limited within a bounding box and has fewer parameters. Such a contour-based representation has long been used in image segmentation since the seminal work by Kass et al. [21],

* The authors from Zhejiang University are affiliated with the State Key Lab of CAD&CG. Corresponding author: Xiaowei Zhou.


Figure 1. The basic idea of deep snake. Given an initial contour, image features are extracted at each vertex (a). Since the contour is a cycle graph, circular convolution is applied for feature learning on the contour (b). The blue, yellow and green nodes denote the input features, the kernel of circular convolution, and the output features, respectively. Finally, offsets are regressed at each vertex to deform the contour to match the object boundary (c).

which is well known as snakes or active contours. Given an initial contour, the snake algorithm iteratively deforms it to match the object boundary by optimizing an energy functional defined with low-level features, such as image intensity or gradient. While many variants [6, 7, 15] have been developed in the literature, these methods are prone to local optima as the objective functions are handcrafted and typically nonconvex.

Some recent learning-based segmentation methods [20, 42, 41] also represent objects as contours and try to directly regress the coordinates of contour vertices from an RGB image. Although such methods are fast, most of them do not perform as well as pixel-based methods. Instead, Ling et al. [25] adopt the deformation pipeline of traditional snake algorithms and train a neural network to evolve an initial contour to match the object boundary. Given a contour with image features, their method regards the input contour as a graph and uses a graph convolutional network (GCN) to predict vertex-wise offsets between contour points and the target boundary points. It achieves competitive accuracy compared with pixel-based methods while being much faster. However, the method proposed in [25] is designed to help annotation and lacks a complete pipeline for automatic instance segmentation. Moreover, treating the contour as a general graph with a generic GCN does not fully exploit the special topology of a contour.


In this paper, we propose a learning-based snake algorithm, named deep snake, for real-time instance segmentation. Inspired by previous methods [21, 25], deep snake takes an initial contour as input and deforms it by regressing vertex-wise offsets. Our innovation is introducing the circular convolution for efficient feature learning on a contour, as illustrated in Figure 1. We observe that the contour is a cycle graph that consists of a sequence of vertices connected in a closed cycle. Since every vertex has the same degree equal to two, we can apply the standard 1D convolution on the vertex features. Considering that the contour is periodic, deep snake introduces the circular convolution, which indicates that an aperiodic function (1D kernel) is convolved in the standard way with a periodic function (features defined on the contour). The kernel of circular convolution encodes not only the feature of each vertex but also the relationship among neighboring vertices. In contrast, the generic GCN performs pooling to aggregate information from neighboring vertices. The kernel function in our circular convolution amounts to a learnable aggregation function, which is more expressive and results in better performance than using a generic GCN, as demonstrated by our experimental results in Section 5.2.

Based on deep snake, we develop a pipeline for instance segmentation. Given an initial contour, deep snake can iteratively deform it to match the object boundary and obtain the object shape. The remaining question is how to initialize a contour, whose importance has been demonstrated in classic snake algorithms. Inspired by [32, 29, 45], we propose to generate an octagon formed by object extreme points as the initial contour, which generally encloses the object tightly. Specifically, we integrate deep snake with an object detector. The detected bounding box initializes a diamond contour defined by the four center points on its edges. Then, deep snake takes the diamond as input and outputs offsets that point from the diamond vertices to the object extreme points, which are used to construct an octagon following [45]. Finally, deep snake deforms the octagon contour to match the object boundary.

Our approach exhibits competitive performances on the Cityscapes [8], KINS [35], SBD [16] and COCO [24] datasets, while being efficient for real-time instance segmentation, running at 32.3 fps for 512×512 images on a GTX 1080 Ti GPU. The following two facts make the learning-based snake fast and accurate. First, our approach can deal with errors in the object localization stage and thus allows a light detector. Second, the contour representation has fewer parameters than the pixel-based representation and does not require costly post-processing, e.g., mask upsampling.

In summary, this work has the following contributions:

• We propose a learning-based snake algorithm for real-time instance segmentation and introduce the circular convolution for feature learning on the contour.

• We propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. Both stages can deal with errors in the initial object localization.

• We demonstrate state-of-the-art performances of our approach on the Cityscapes, KINS, SBD and COCO datasets. For 512×512 images, our algorithm runs at 32.3 fps, which is efficient for real-time applications.

2. Related work

Pixel-based methods. Most methods [9, 23, 18, 27] perform instance segmentation on the pixel level within a region proposal, which works particularly well with standard CNNs. A representative instantiation is Mask R-CNN [18]. It first detects objects and then uses a mask predictor to segment instances within the proposed boxes. To better exploit the spatial information inside the box, PANet [27] fuses mask predictions from fully-connected layers and convolutional layers. Such proposal-based approaches achieve state-of-the-art performance. One limitation of these methods is that they cannot resolve errors in localization, such as too-small or shifted boxes. In contrast, our approach deforms the detected boxes to the object boundaries, so the spatial extent of object shapes will not be limited.

There exist some pixel-based methods [2, 31, 28, 12, 43] that are free of region proposals. In these methods, every pixel produces auxiliary information, and a clustering algorithm then groups pixels into object instances based on this information. The auxiliary information and grouping algorithms can vary. [2] predicts the boundary-aware energy for each pixel and uses the watershed transform algorithm for grouping. [31] differentiates instances by learning instance-level embeddings. [28, 12] consider the input image as a graph and regress pixel affinities, which are then processed by a graph merge algorithm. Since the mask is composed of dense pixels, the post-clustering algorithms tend to be time-consuming.

Contour-based methods. In these methods, the object shape comprises a sequence of vertices along the object boundary. Traditional snake algorithms [21, 6, 7, 15] first introduced the contour-based representation for image segmentation. They deform an initial contour to the object boundary by optimizing a handcrafted energy with respect to the contour coordinates. To improve the robustness of these methods, [30] proposed to learn the energy function in a data-driven manner. Instead of iteratively optimizing the contour, some recent learning-based methods [20, 42] try to regress the coordinates of contour points from an RGB image, which is much faster. However, their reported accuracy is not on par with state-of-the-art pixel-based methods.


In the field of semi-automatic annotation, [4, 1, 25] have tried to perform contour labeling using networks other than standard CNNs. [4, 1] predict the contour points sequentially using a recurrent neural network. To avoid sequential inference, [25] follows the pipeline of snake algorithms and uses a graph convolutional network to predict vertex-wise offsets for contour deformation. This strategy significantly improves the annotation speed while being as accurate as pixel-based methods. However, [25] lacks a pipeline for instance segmentation and does not fully exploit the special topology of a contour. Instead of treating the contour as a general graph, deep snake leverages the cycle-graph topology and introduces the circular convolution for efficient feature learning on a contour.

3. Proposed approach

Inspired by [21, 25], we perform object segmentation by deforming an initial contour to match the object boundary. Specifically, deep snake takes a contour as input and predicts per-vertex offsets pointing to the object boundary. Features at contour vertices are extracted from the input image with a CNN backbone. To fully exploit the contour topology, we propose the circular convolution for efficient feature learning on the contour, which helps deep snake learn the deformation. Based on deep snake, we also develop a pipeline for instance segmentation.

3.1. Learning-based snake algorithm

Given an initial contour, traditional snake algorithms treat the coordinates of the vertices as a set of variables and optimize an energy functional with respect to these variables. By designing proper forces at the contour coordinates, the algorithms can drive the contour to the object boundary. However, since the energy functional is typically nonconvex and handcrafted based on low-level image features, the optimization tends to find local optima.

In contrast, deep snake directly learns to evolve the contour in an end-to-end manner. Given a contour with $N$ vertices $\{x_i \mid i = 1, \dots, N\}$, we first construct a feature vector for each vertex. The input feature $f_i$ for a vertex $x_i$ is a concatenation of learning-based features and the vertex coordinate: $[F(x_i); x_i]$, where $F$ denotes the feature maps. The feature maps $F$ are obtained by applying a CNN backbone to the input image; the backbone is shared with the detector in our instance segmentation pipeline, which will be discussed later. The image feature $F(x_i)$ is computed using bilinear interpolation at the vertex coordinate $x_i$. The appended vertex coordinate is used to encode the spatial relationship among contour vertices. Since the deformation should not be affected by the translation of the contour in the image, we subtract each dimension of $x_i$ by the minimum value over all vertices.
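To make this step concrete, here is a minimal PyTorch sketch of the feature construction (the helper name vertex_features and the tensor shapes are our own choices; the paper does not prescribe an implementation):

```python
import torch
import torch.nn.functional as F

def vertex_features(feature_map, vertices):
    """Build per-vertex input features [F(x_i); x_i] for one contour.

    feature_map: (C, H, W) backbone features; vertices: (N, 2) float
    (x, y) pixel coordinates.
    """
    C, H, W = feature_map.shape
    # Normalize coordinates to [-1, 1] so grid_sample performs the
    # bilinear interpolation F(x_i) at each vertex.
    grid = vertices.clone()
    grid[:, 0] = 2.0 * vertices[:, 0] / (W - 1) - 1.0
    grid[:, 1] = 2.0 * vertices[:, 1] / (H - 1) - 1.0
    grid = grid.view(1, 1, -1, 2)                        # (1, 1, N, 2)
    sampled = F.grid_sample(feature_map[None], grid,
                            mode='bilinear', align_corners=True)
    sampled = sampled.view(C, -1).t()                    # (N, C)
    # Translation invariance: subtract the per-dimension minimum
    # of the coordinates over all vertices.
    coords = vertices - vertices.min(dim=0).values
    return torch.cat([sampled, coords], dim=1)           # (N, C + 2)
```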

Figure 2. Circular convolution. The blue nodes are the input features defined on a contour, the yellow nodes represent the kernel function, and the green nodes are the output features. The highlighted green node is the inner product between the kernel function and the highlighted blue nodes, which is the same as the standard convolution. The output features of circular convolution have the same length as the input features.

Given the input features defined on a contour, deep snake introduces the circular convolution for feature learning, as illustrated in Figure 2. In general, the features of contour vertices can be treated as a 1-D discrete signal $f: \mathbb{Z} \to \mathbb{R}^D$ and processed by the standard convolution, but this breaks the topology of the contour. Therefore, we extend $f$ to be a periodic signal defined as

$$(f_N)_i \triangleq \sum_{j=-\infty}^{\infty} f_{i-jN}, \tag{1}$$

and propose to encode the periodic features by the circular convolution defined as

$$(f_N \ast k)_i = \sum_{j=-r}^{r} (f_N)_{i+j}\, k_j, \tag{2}$$

where $k: [-r, r] \to \mathbb{R}^D$ is a learnable kernel function and the operator $\ast$ is the standard convolution.
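Equations (1) and (2) amount to wrapping the vertex-feature signal around the closed contour before an ordinary 1D convolution. A minimal sketch (the layer and argument names are ours):

```python
import torch.nn as nn
import torch.nn.functional as F

class CircConv(nn.Module):
    """Circular convolution on contour vertex features (Eq. 2): the
    periodic extension of Eq. 1 is realized by circular padding before
    a standard Conv1d, so the kernel wraps around the closed contour."""
    def __init__(self, in_dim, out_dim, kernel_size=9):
        super().__init__()
        assert kernel_size % 2 == 1
        self.r = kernel_size // 2                # kernel radius r in Eq. 2
        self.conv = nn.Conv1d(in_dim, out_dim, kernel_size)

    def forward(self, x):
        # x: (B, D, N) features on the N contour vertices.
        x = F.pad(x, (self.r, self.r), mode='circular')  # wrap-around padding
        return self.conv(x)                              # output length is again N
```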

Similar to the standard convolution, we can construct a network layer based on the circular convolution for feature learning, which is easily integrated into a modern network architecture. After the feature learning, deep snake applies three 1×1 convolution layers to the output features of each vertex and predicts vertex-wise offsets between contour points and the target points, which are used to deform the contour. In all experiments, the kernel size of circular convolution is fixed to nine.

As discussed in the introduction, the proposed circular convolution better exploits the circular structure of the contour than the generic graph convolution. We will show the experimental comparison in Section 5.2. An alternative method is to use standard CNNs to regress a pixel-wise vector field from the input image to guide the evolution of the initial contour [37, 33, 40]. We argue that an important advantage of deep snake over standard CNNs is the object-level structured prediction, i.e., the offset prediction at a vertex depends on the other vertices of the same contour. Therefore, deep snake will predict a more reasonable offset for a vertex located far from the object. Standard CNNs may have difficulty in this case, as the regressed vector field may drive this vertex to another object that is closer.


Figure 3. Proposed contour-based model for instance segmentation. (a) Deep snake consists of three parts: a backbone, a fusion block, and a prediction head. It takes a contour as input and outputs vertex-wise offsets to deform the contour. (b) Based on deep snake, we propose a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. The box proposed by the detector gives a diamond contour, whose four vertices are then shifted to the object extreme points by deep snake. An octagon is constructed based on the extreme points. Taking the octagon as the initial contour, deep snake iteratively deforms it to match the object boundary.

Network architecture. Figure 3(a) shows the detailed schematic. Following ideas from [34, 39, 22], deep snake consists of three parts: a backbone, a fusion block, and a prediction head. The backbone is comprised of 8 "CirConv-Bn-ReLU" layers with residual skip connections for all layers, where "CirConv" means circular convolution. The fusion block aims to fuse the information across all contour points at multiple scales. It concatenates the features from all layers in the backbone and forwards them through a 1×1 convolution layer followed by max pooling. The fused feature is then concatenated with the feature of each vertex. The prediction head applies three 1×1 convolution layers to the vertex features and outputs vertex-wise offsets.
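The following sketch assembles the three parts, assuming the CircConv layer defined above; the hidden widths (128, 256, 64) are assumptions for illustration, not values taken from the paper:

```python
import torch
import torch.nn as nn

class DeepSnake(nn.Module):
    """Sketch of Figure 3(a): an 8-layer CirConv-Bn-ReLU backbone with
    residual skips, a fusion block (concat, 1x1 conv, max pooling), and
    a prediction head of three 1x1 convolutions."""
    def __init__(self, feat_dim, state_dim=128):
        super().__init__()
        self.head0 = nn.Sequential(CircConv(feat_dim, state_dim),
                                   nn.BatchNorm1d(state_dim), nn.ReLU())
        self.blocks = nn.ModuleList([
            nn.Sequential(CircConv(state_dim, state_dim),
                          nn.BatchNorm1d(state_dim), nn.ReLU())
            for _ in range(7)])
        self.fuse = nn.Conv1d(8 * state_dim, 256, 1)
        self.predict = nn.Sequential(
            nn.Conv1d(256 + 8 * state_dim, 256, 1), nn.ReLU(),
            nn.Conv1d(256, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 2, 1))                 # per-vertex (dx, dy) offsets

    def forward(self, x):
        # x: (B, feat_dim, N) per-vertex input features.
        states = [self.head0(x)]
        for block in self.blocks:
            states.append(block(states[-1]) + states[-1])   # residual skip
        state = torch.cat(states, dim=1)                    # multi-scale concat
        global_state = self.fuse(state).max(dim=2, keepdim=True).values
        global_state = global_state.expand(-1, -1, state.size(2))
        return self.predict(torch.cat([state, global_state], dim=1))
```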

3.2. Deep snake for instance segmentation

Figure 3(b) overviews the proposed pipeline for instance segmentation. We combine deep snake with an object detector. The detector first produces object bounding boxes, which are used to construct diamond contours. Then deep snake shifts the diamond vertices to the object extreme points, which are used to construct octagon contours. Finally, deep snake takes the octagons as initial contours and performs iterative contour deformation to obtain the object shape.

Initial contour proposal. Most active contour models require precise initial contours. Since the octagon proposed in [45] tightly encloses the object, we choose it as the initial contour, as shown in Figure 3(b). This octagon is formed by four extreme points, which are the top, leftmost, bottom, and rightmost pixels of an object, denoted by $\{x_i^{ex} \mid i = 1, 2, 3, 4\}$. Given a detected object box, we extract the four center points of the top, left, bottom, and right box edges, denoted by $\{x_i^{bb} \mid i = 1, 2, 3, 4\}$, and then connect them to get a diamond contour. Deep snake takes this contour as input and outputs four offsets that point from each vertex $x_i^{bb}$ to the corresponding extreme point $x_i^{ex}$, namely $x_i^{ex} - x_i^{bb}$. In practice, to consider more context information, the diamond contour is uniformly upsampled to 40 points, and deep snake correspondingly outputs 40 offsets. The loss function only supervises the offsets at the $x_i^{bb}$.

We construct the octagon by generating four line segments based on the extreme points and connecting their endpoints. Specifically, the four extreme points define a new bounding box. From each extreme point, a line is extended along the corresponding box edge in both directions by 1/4 of the edge length and truncated if it meets a box corner. Then, the endpoints of the four line segments are connected to form the octagon.
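The octagon construction can be written down directly from this description. A NumPy sketch (the function name and the clockwise vertex ordering are our choices):

```python
import numpy as np

def build_octagon(extreme_points):
    """Build the octagonal initial contour from the four extreme points,
    given as a (4, 2) array ordered (top, left, bottom, right)."""
    t, l, b, r = extreme_points
    x_min, x_max = l[0], r[0]
    y_min, y_max = t[1], b[1]
    w, h = x_max - x_min, y_max - y_min
    cx = lambda x: np.clip(x, x_min, x_max)   # truncate at box corners
    cy = lambda y: np.clip(y, y_min, y_max)
    # Eight endpoints, traversed clockwise in image coordinates (y down).
    return np.array([
        [cx(t[0] - w / 4), y_min], [cx(t[0] + w / 4), y_min],  # top edge
        [x_max, cy(r[1] - h / 4)], [x_max, cy(r[1] + h / 4)],  # right edge
        [cx(b[0] + w / 4), y_max], [cx(b[0] - w / 4), y_max],  # bottom edge
        [x_min, cy(l[1] + h / 4)], [x_min, cy(l[1] - h / 4)],  # left edge
    ])
```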

Contour deformation. We first uniformly sample $N$ points along the octagon contour, starting from the top extreme point $x_1^{ex}$. Similarly, the ground-truth contour is generated by uniformly sampling $N$ vertices along the object boundary and defining the first vertex as the one nearest to $x_1^{ex}$. Deep snake takes the initial contour as input and outputs $N$ offsets that point from each vertex to the target boundary point. We set $N$ to 128 in all experiments, which can uniformly cover most object shapes.
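The uniform sampling step resamples a polygon to $N$ points by arc length. A NumPy sketch (the helper name is ours):

```python
import numpy as np

def uniform_sample(polygon, num_points=128):
    """Resample a closed polygon ((M, 2) array) to num_points vertices
    spaced uniformly along its perimeter."""
    closed = np.vstack([polygon, polygon[:1]])             # close the loop
    seg = np.linalg.norm(np.diff(closed, axis=0), axis=1)  # edge lengths
    cum = np.concatenate([[0.0], np.cumsum(seg)])          # cumulative arc length
    targets = np.linspace(0.0, cum[-1], num_points, endpoint=False)
    idx = np.clip(np.searchsorted(cum, targets, side='right') - 1,
                  0, len(seg) - 1)
    t = (targets - cum[idx]) / np.maximum(seg[idx], 1e-8)  # interpolation weight
    return closed[idx] + t[:, None] * (closed[idx + 1] - closed[idx])
```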

However, regressing the offsets in one pass is challenging, especially for vertices far away from the object. Inspired by [21, 25, 38], we deal with this problem in an iterative optimization fashion. Specifically, our approach first predicts $N$ offsets based on the current contour and then deforms this contour by adding the offsets to its vertex coordinates. The deformed contour can be used for the next iteration. In experiments, the number of inference iterations is set to 3 unless otherwise stated.
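The iterative scheme itself is a short loop; a sketch that composes the hypothetical helpers introduced above (vertex_features and a DeepSnake instance):

```python
def deform(contour, feature_map, snake, n_iter=3):
    """Iteratively deform a contour; 3 iterations as in the paper."""
    for _ in range(n_iter):
        feats = vertex_features(feature_map, contour)    # (N, C + 2)
        offsets = snake(feats.t()[None])[0].t()          # (N, 2) offsets
        contour = contour + offsets                      # vertex-wise update
    return contour
```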

Note that the contour is an alternative representation of the spatial extent of an object. By deforming the initial contour to match the object boundary, deep snake can address the localization errors from the detector.


Figure 4. Multi-component detection. Given an object box, we perform RoIAlign to obtain the feature map and use a detector to detect the component boxes.

Multi-component detection. Some objects are split into several components due to occlusions, as shown in Figure 4. However, a contour can only outline one component. To overcome this problem, we propose to use another detector to find the object components within the object box. Figure 4 shows the basic idea. Specifically, using the detected box, our approach performs RoIAlign [18] to extract a feature map and adds a detector branch on the feature map to produce the component boxes. For the detected components, we use deep snake to segment each of them and then merge the segmentation results.
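The cropping step could be implemented with the RoIAlign operator shipped with torchvision; the stride and output size below are illustrative assumptions, not values from the paper:

```python
from torchvision.ops import roi_align

def component_feature_maps(features, boxes, stride=4, out_size=(28, 28)):
    """Crop per-object feature maps for the component detector.

    features: (1, C, H, W) backbone features; boxes: (K, 4) object boxes
    in image coordinates (x1, y1, x2, y2). Each returned (C, 28, 28) map
    would be fed to the class-agnostic component detector.
    """
    return roi_align(features, [boxes], out_size,
                     spatial_scale=1.0 / stride, aligned=True)
```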

4. Implementation details

Training strategy. For the training of deep snake, we use the smooth $\ell_1$ loss proposed in [14] to learn the two deformation processes. The loss function for extreme point prediction is defined as

$$L_{ex} = \frac{1}{4} \sum_{i=1}^{4} \ell_1(\tilde{x}_i^{ex} - x_i^{ex}), \tag{3}$$

where $\tilde{x}_i^{ex}$ is the predicted extreme point. The loss function for iterative contour deformation is defined as

$$L_{iter} = \frac{1}{N} \sum_{i=1}^{N} \ell_1(\tilde{x}_i - x_i^{gt}), \tag{4}$$

where $\tilde{x}_i$ is the deformed contour point and $x_i^{gt}$ is the ground-truth boundary point. For the detection part, we adopt the same loss function as the original detection model. The training details vary with datasets and will be described in Section 5.3.
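Both losses map directly onto the built-in smooth-L1 loss. A sketch (note that the mean here also averages over the two coordinates, which only rescales the loss by a constant):

```python
import torch.nn.functional as F

def snake_losses(pred_ex, gt_ex, pred_poly, gt_poly):
    """pred_ex/gt_ex: (4, 2) predicted/ground-truth extreme points;
    pred_poly/gt_poly: (N, 2) deformed/ground-truth contour vertices."""
    l_ex = F.smooth_l1_loss(pred_ex, gt_ex)        # Eq. (3)
    l_iter = F.smooth_l1_loss(pred_poly, gt_poly)  # Eq. (4)
    return l_ex + l_iter
```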

Detector. We adopt CenterNet [44] as the detector for all experiments. CenterNet reformulates the detection task as a keypoint detection problem and achieves an impressive trade-off between speed and accuracy. For the object box detector, we adopt the same setting as [44], which outputs class-specific boxes. For the component box detector, a class-agnostic CenterNet is adopted. Specifically, given an H×W×C feature map, the class-agnostic CenterNet outputs an H×W×1 tensor representing the component center and an H×W×2 tensor representing the box size.

5. Experiments

5.1. Datasets and Metrics

Cityscapes [8] contains 2,975 training, 500 validation and 1,525 testing images with high-quality annotations. Besides, it has 20k images with coarse annotations. The performance is evaluated in terms of the average precision (AP) metric averaged over the eight semantic classes of the dataset.

KINS [35] was created by additionally annotating the KITTI [13] dataset with instance-level semantic annotations. This dataset is used for amodal instance segmentation, which aims to recover complete instance shapes even under occlusion. KINS consists of 7,474 training images and 7,517 testing images. Following its setting, we evaluate our approach on seven object categories in terms of the AP metric.

SBD [16] re-annotates 11,355 images from the PASCAL VOC [10] dataset with instance-level boundaries. The reason that we don't evaluate on PASCAL VOC is that its annotations contain holes, which are not suitable for contour-based methods. SBD is split into 5,623 training images and 5,732 testing images. We report our results in terms of the 2010 VOC APvol [17], AP50, and AP70 metrics. APvol is the average of AP with nine thresholds from 0.1 to 0.9.

COCO [24] is one of the most challenging datasets for instance segmentation. It consists of 115k training, 5k validation and 20k testing images. We report our results in terms of the AP metric.

5.2. Ablation studies

We conduct ablation studies on the SBD dataset, as it has 20 semantic categories, which allows a full evaluation of the ability to handle various object shapes. The three proposed components are evaluated, including our network architecture, the initial contour proposal, and the circular convolution. In these experiments, the detector and deep snake are trained end-to-end for 160 epochs with multi-scale data augmentation. The learning rate starts from 1e-4 and decays by half at 80 and 120 epochs.
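This schedule corresponds to a standard multi-step decay; a sketch (the optimizer choice and the stand-in model are assumptions, since only the learning-rate schedule is specified above):

```python
import torch

model = torch.nn.Conv1d(66, 2, 1)  # stand-in for the detector + snake
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[80, 120], gamma=0.5)
for epoch in range(160):
    # ... one epoch of end-to-end training with multi-scale augmentation ...
    scheduler.step()
```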

Table 1 summarizes the results of the ablation studies. The row "Baseline" lists the result of a direct combination of Curve-gcn [25] with CenterNet [44]. Specifically, the detector produces object boxes, which give ellipses around objects. The ellipses are then deformed towards the object boundaries through Graph-ResNet. Note that this baseline method represents the contour as a graph and uses a graph convolutional network for contour deformation.

To validate the advantages of our network, the model in the second row keeps the convolution operator as graph convolution and replaces Graph-ResNet with our proposed architecture, which yields a 1.4 APvol improvement. The main difference between the two networks is that our architecture appends a global fusion block before the prediction head.

                         APvol  AP50  AP70
Baseline                  50.9  58.8  43.5
+ Architecture            52.3  59.7  46.0
+ Initial proposal        53.6  61.1  47.6
+ Circular convolution    54.4  62.1  48.3

Table 1. Ablation studies on the SBD val set. The baseline is a direct combination of Curve-gcn [25] and CenterNet [44]. The second model keeps the graph convolution and replaces the network architecture with our proposed one, which yields a 1.4 APvol improvement. Then we add the initial contour proposal before contour deformation, which improves APvol by 1.3. The fourth row shows that replacing graph convolution with circular convolution further yields a 0.8 APvol improvement.

                Iter. 1  Iter. 2  Iter. 3  Iter. 4  Iter. 5
Graph conv        50.2     51.5     53.6     52.2     51.6
Circular conv     50.6     54.2     54.4     54.0     53.2

Table 2. Results of models with different convolution operators and different iterations on SBD in terms of the APvol metric. Circular convolution outperforms graph convolution across all inference iterations. Furthermore, circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 AP, indicating a stronger deforming ability. We also find that adding more iterations does not necessarily improve the performance, which shows that it might be harder to train the network with more iterations.

When exploring the influence of the contour initialization, we add the initial contour proposal before the contour deformation. Instead of directly using the ellipse, the proposal step generates an octagon initialization by predicting four object extreme points, which not only compensates for the detection errors but also encloses the object more tightly. The comparison between the second and the third row shows a 1.3 improvement in terms of APvol.

Finally, the graph convolution is replaced with the circular convolution, which achieves a 0.8 APvol improvement. To fully validate the importance of circular convolution, we further compare models with different convolution operators and different inference iterations, as shown in Table 2. Circular convolution outperforms graph convolution across all inference iterations. Circular convolution with two iterations outperforms graph convolution with three iterations by 0.6 APvol. Figure 5 shows qualitative results of graph and circular convolution on SBD, where circular convolution gives a sharper boundary. Both the quantitative and qualitative results indicate that models with the circular convolution have a stronger ability to deform contours.

5.3. Comparison with the state-of-the-art methods

Performance on Cityscapes. Since fragmented instances are very common in Cityscapes, the proposed multi-component detection strategy is adopted. Our network is trained with multi-scale data augmentation and tested at a single resolution of 1216×2432. No testing tricks are used. The detector is first trained alone for 140 epochs, and the learning rate starts from 1e-4 and drops by half at 80 and 120 epochs. Then the detection and snake branches are trained end-to-end for 200 epochs, and the learning rate starts from 1e-4 and drops by half at 80, 120 and 150 epochs. We choose the model that performs best on the validation set.

Figure 5. Comparison between graph convolution (top) and circular convolution (bottom) on SBD, shown over iterations 1-3. The result of circular convolution with two iterations is visually better than that of graph convolution with three iterations.

Table 3 compares our results with other state-of-the-art methods on the Cityscapes validation and test sets. All methods are tested without tricks. Using only the fine annotations, our approach achieves state-of-the-art performances on both the validation and test sets. We outperform PANet by 0.9 AP on the validation set and 1.3 AP50 on the test set. Our approach achieves 28.2 AP on the test set when the strategy of handling fragmented instances is not adopted. Visual results are shown in Figure 6.

Performance on KINS. The KINS dataset is for amodal instance segmentation, where all objects are annotated as single-component, so the multi-component detection strategy is not adopted. We train the detector and snake end-to-end for 150 epochs. The learning rate starts from 1e-4 and decays with factors of 0.5 and 0.1 at 80 and 120 epochs, respectively. We perform multi-scale training and test the model at a single resolution of 768×2496.

Table 4 shows the comparison with [9, 23, 11, 18, 27] on the KINS dataset in terms of the AP metric. Our approach achieves the best performance across all methods. We find that the snake branch can improve the detection performance. When CenterNet is trained alone, it obtains 30.5 AP on detection. When trained with the snake branch, its performance improves by 2.3 AP. For an image resolution of 768×2496 on the KINS dataset, our approach runs at 7.6 fps on a 1080 Ti GPU. Figure 6 shows some qualitative results on KINS.


Figure 6. Qualitative results on the Cityscapes test and KINS test sets. The first two rows show the results on Cityscapes, and the last row lists the results on KINS. Note that the results on KINS are for amodal instance segmentation.

                  training data   fps  AP [val]  AP    AP50  person  rider  car   truck  bus   train  mcycle  bicycle
SGN [26]          fine + coarse   0.6  29.2      25.0  44.9  21.8    20.1   39.4  24.8   33.2  30.8   17.7    12.4
PolygonRNN++ [1]  fine            -    -         25.5  45.5  29.4    21.8   48.3  21.1   32.3  23.7   13.6    13.6
Mask R-CNN [18]   fine            2.2  31.5      26.2  49.9  30.5    23.7   46.9  22.8   32.2  18.6   19.1    16.0
GMIS [28]         fine + coarse   -    -         27.6  49.6  29.3    24.1   42.7  25.4   37.2  32.9   17.6    11.9
Spatial [31]      fine            11   -         27.6  50.9  34.5    26.1   52.4  21.7   31.2  16.4   20.1    18.9
PANet [27]        fine            <1   36.5      31.8  57.1  36.8    30.4   54.8  27.0   36.3  25.5   22.6    20.8
Deep snake        fine            4.6  37.4      31.7  58.4  37.2    27.0   56.0  29.5   40.5  28.2   19.0    16.4

Table 3. Results on the Cityscapes val ("AP [val]" column) and test (remaining columns) sets. Our approach achieves state-of-the-art performance, outperforming PANet [27] by 0.9 AP on the val set and 1.3 AP50 on the test set. In terms of inference speed, our approach is approximately five times faster than PANet. The timing results of other methods were obtained from [31].

                 detection  amodal seg  inmodal seg
MNC [9]          20.9       18.5        16.1
FCIS [23]        25.6       23.5        20.8
ORCNN [11]       30.9       29.0        26.4
Mask R-CNN [18]  31.1       29.2        ×
Mask R-CNN [18]  31.3       29.3        26.6
PANet [27]       32.3       30.4        27.6
Deep snake       32.8       31.3        ×

Table 4. Results on the KINS test set in terms of the AP metric. The amodal bounding box is used as the ground truth in the detection task. × means no such output in the corresponding method.

Performance on SBD. Since annotations of objects on SBD are mostly single-component, the multi-component detection strategy is not adopted. For fragmented instances, our approach detects their components separately instead of detecting the whole object. We train the detection and snake branches end-to-end for 150 epochs with multi-scale data augmentation. The learning rate starts from 1e-4 and drops by half at 80 and 120 epochs. The network is tested at a single scale of 512×512.

             APvol  AP50  AP70
STS [20]     29.0   30.0   6.5
ESE-50 [42]  32.6   39.1  10.5
ESE-20 [42]  35.3   40.7  12.1
Deep snake   54.4   62.1  48.3

Table 5. Results on the SBD val set. Our approach outperforms other contour-based methods by a large margin. The improvement increases with the IoU threshold: 21.4 in AP50 and 36.2 in AP70.

In Table 5, we compare with other contour-based methods [20, 42] on the SBD dataset in terms of the VOC AP metrics. [20, 42] predict object contours by regressing shape vectors. STS [20] defines the object contour as a radial vector, and ESE [42] approximates the object contour with the Chebyshev polynomial. We outperform these methods by a large margin of at least 19.1 APvol. Note that our approach yields 21.4 AP50 and 36.2 AP70 improvements, demonstrating that the improvement increases as the IoU threshold gets higher. This indicates that our method outlines object boundaries more precisely. For 512×512 images on the SBD dataset, our approach runs at 32.3 fps on a 1080 Ti. Some qualitative results are illustrated in Figure 7.

Figure 7. Qualitative results on the SBD val set. Our approach handles errors in object localization in most cases. For example, in the first image, although the detected box doesn't fully enclose the car, our approach recovers the complete car shape. Zoom in for details.

                     YOLACT [3]  ESE [42]  OURS
val (segm AP)        29.9        21.6      30.5
test-dev (segm AP)   29.8        -         30.3

Table 6. Comparison with other real-time methods on COCO.

Performance on COCO. Similar to the experiments on SBD, the multi-component detection strategy is not adopted. The network is trained with multi-scale data augmentation and tested at the original image resolution without tricks (e.g., flip augmentation). The detection and snake branches are trained end-to-end for 160 epochs, where the detector is initialized with the pretrained model released by [44]. The learning rate starts from 1e-4 and drops by half at 80 and 120 epochs. We choose the model that performs best on the validation set. Table 6 compares our method with other real-time methods. Our method achieves 30.3 segm AP and 33.2 bbox AP on the COCO test-dev set at 27.2 fps.

5.4. Running time

Table 7 compares our approach with other methods [9, 23, 18, 20, 42] in terms of running time on the PASCAL VOC dataset. Since the SBD dataset shares images with PASCAL VOC, the running time on the SBD dataset is technically the same as on PASCAL VOC. We obtained the running time of other methods from [42].

For 512×512 images on the SBD dataset, our algorithm runs at 32.3 fps on a desktop with an Intel i7 3.7GHz CPU and a GTX 1080 Ti GPU, which is efficient for real-time instance segmentation. Specifically, CenterNet takes 18.4 ms, the initial contour proposal takes 3.1 ms, and each iteration of contour deformation takes 3.3 ms. Since our approach outputs the object boundary, no post-processing like upsampling is required. If the multi-component detection strategy is adopted, the detector additionally takes 3.6 ms.

            MNC   FCIS  MS    STS   ESE   OURS
time (ms)   360   160   180   27    26    31
fps         2.8   6.3   5.6   37.0  38.5  32.3

Table 7. Running time on the PASCAL VOC dataset. "MS" represents Mask R-CNN [18] and "OURS" represents our approach. The last three methods are contour-based methods.

6. Conclusion

We proposed a learning-based snake algorithm for real-time instance segmentation, which introduces the circular convolution for efficient feature learning on the contour and regresses vertex-wise offsets for the contour deformation. Based on deep snake, we developed a two-stage pipeline for instance segmentation: initial contour proposal and contour deformation. We showed that this pipeline achieves superior performance compared to direct regression of the coordinates of the object boundary points. To overcome the limitation that the contour representation can only outline one connected component, we proposed the multi-component detection strategy and demonstrated its effectiveness on Cityscapes. The proposed model achieved competitive results on the Cityscapes, KINS, SBD and COCO datasets with real-time performance.

Acknowledgements: The authors would like to acknowledge support from NSFC (No. 61806176) and the Fundamental Research Funds for the Central Universities (2019QNA5022).


References

[1] David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Efficient interactive annotation of segmentation datasets with Polygon-RNN++. In CVPR, 2018.
[2] Min Bai and Raquel Urtasun. Deep watershed transform for instance segmentation. In CVPR, 2017.
[3] Daniel Bolya, Chong Zhou, Fanyi Xiao, and Yong Jae Lee. YOLACT: Real-time instance segmentation. In ICCV, 2019.
[4] Lluis Castrejon, Kaustav Kundu, Raquel Urtasun, and Sanja Fidler. Annotating object instances with a Polygon-RNN. In CVPR, 2017.
[5] Kai Chen, Jiangmiao Pang, Jiaqi Wang, Yu Xiong, Xiaoxiao Li, Shuyang Sun, Wansen Feng, Ziwei Liu, Jianping Shi, Wanli Ouyang, et al. Hybrid task cascade for instance segmentation. In CVPR, 2019.
[6] Laurent D Cohen. On active contour models and balloons. CVGIP: Image Understanding, 53(2):211–218, 1991.
[7] Timothy F Cootes, Christopher J Taylor, David H Cooper, and Jim Graham. Active shape models - their training and application. CVIU, 61(1):38–59, 1995.
[8] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The Cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
[9] Jifeng Dai, Kaiming He, and Jian Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.
[10] Mark Everingham, Luc Van Gool, Christopher KI Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes (VOC) challenge. IJCV, 88(2):303–338, 2010.
[11] Patrick Follmann, Rebecca König, Philipp Härtinger, Michael Klostermann, and Tobias Böttger. Learning to see the invisible: End-to-end trainable amodal instance segmentation. In WACV, 2019.
[12] Naiyu Gao, Yanhu Shan, Yupei Wang, Xin Zhao, Yinan Yu, Ming Yang, and Kaiqi Huang. SSAP: Single-shot instance segmentation with affinity pyramid. In ICCV, 2019.
[13] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. IJRR, 32(11):1231–1237, 2013.
[14] Ross Girshick. Fast R-CNN. In ICCV, 2015.
[15] Steve R Gunn and Mark S Nixon. A robust snake implementation; a dual active contour. PAMI, 19(1):63–68, 1997.
[16] Bharath Hariharan, Pablo Arbelaez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In ICCV, 2011.
[17] Bharath Hariharan, Pablo Arbelaez, Ross Girshick, and Jitendra Malik. Simultaneous detection and segmentation. In ECCV, 2014.
[18] Kaiming He, Georgia Gkioxari, Piotr Dollar, and Ross Girshick. Mask R-CNN. In ICCV, 2017.
[19] Zhaojin Huang, Lichao Huang, Yongchao Gong, Chang Huang, and Xinggang Wang. Mask scoring R-CNN. In CVPR, 2019.
[20] Saumya Jetley, Michael Sapienza, Stuart Golodetz, and Philip HS Torr. Straight to shapes: Real-time detection of encoded shapes. In CVPR, 2017.
[21] Michael Kass, Andrew Witkin, and Demetri Terzopoulos. Snakes: Active contour models. IJCV, 1(4):321–331, 1988.
[22] Guohao Li, Matthias Muller, Ali Thabet, and Bernard Ghanem. Can GCNs go as deep as CNNs? In ICCV, 2019.
[23] Yi Li, Haozhi Qi, Jifeng Dai, Xiangyang Ji, and Yichen Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
[24] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollar, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In ECCV, 2014.
[25] Huan Ling, Jun Gao, Amlan Kar, Wenzheng Chen, and Sanja Fidler. Fast interactive object annotation with Curve-GCN. In CVPR, 2019.
[26] Shu Liu, Jiaya Jia, Sanja Fidler, and Raquel Urtasun. SGN: Sequential grouping networks for instance segmentation. In ICCV, 2017.
[27] Shu Liu, Lu Qi, Haifang Qin, Jianping Shi, and Jiaya Jia. Path aggregation network for instance segmentation. In CVPR, 2018.
[28] Yiding Liu, Siyu Yang, Bin Li, Wengang Zhou, Jizheng Xu, Houqiang Li, and Yan Lu. Affinity derivation and graph merge for instance segmentation. In ECCV, 2018.
[29] Kevis-Kokitsi Maninis, Sergi Caelles, Jordi Pont-Tuset, and Luc Van Gool. Deep extreme cut: From extreme points to object segmentation. In CVPR, 2018.
[30] Diego Marcos, Devis Tuia, Benjamin Kellenberger, Lisa Zhang, Min Bai, Renjie Liao, and Raquel Urtasun. Learning deep structured active contours end-to-end. In CVPR, 2018.
[31] Davy Neven, Bert De Brabandere, Marc Proesmans, and Luc Van Gool. Instance segmentation by jointly optimizing spatial embeddings and clustering bandwidth. In CVPR, 2019.
[32] Dim P Papadopoulos, Jasper RR Uijlings, Frank Keller, and Vittorio Ferrari. Extreme clicking for efficient object annotation. In ICCV, 2017.
[33] Sida Peng, Yuan Liu, Qixing Huang, Xiaowei Zhou, and Hujun Bao. PVNet: Pixel-wise voting network for 6DoF pose estimation. In CVPR, 2019.
[34] Charles R Qi, Hao Su, Kaichun Mo, and Leonidas J Guibas. PointNet: Deep learning on point sets for 3D classification and segmentation. In CVPR, 2017.
[35] Lu Qi, Li Jiang, Shu Liu, Xiaoyong Shen, and Jiaya Jia. Amodal instance segmentation with KINS dataset. In CVPR, 2019.
[36] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS, 2015.
[37] Christian Rupprecht, Elizabeth Huaroc, Maximilian Baust, and Nassir Navab. Deep active contours. arXiv preprint arXiv:1607.05074, 2016.
[38] Nanyang Wang, Yinda Zhang, Zhuwen Li, Yanwei Fu, Wei Liu, and Yu-Gang Jiang. Pixel2Mesh: Generating 3D mesh models from single RGB images. In ECCV, 2018.
[39] Yue Wang, Yongbin Sun, Ziwei Liu, Sanjay E Sarma, Michael M Bronstein, and Justin M Solomon. Dynamic graph CNN for learning on point clouds. TOG, 2018.
[40] Zian Wang, David Acuna, Huan Ling, Amlan Kar, and Sanja Fidler. Object instance annotation with deep extreme level set evolution. In CVPR, 2019.
[41] Enze Xie, Peize Sun, Xiaoge Song, Wenhai Wang, Xuebo Liu, Ding Liang, Chunhua Shen, and Ping Luo. PolarMask: Single shot instance segmentation with polar representation. In CVPR, 2020.
[42] Wenqiang Xu, Haiyang Wang, Fubo Qi, and Cewu Lu. Explicit shape encoding for real-time instance segmentation. In ICCV, 2019.
[43] Ze Yang, Yinghao Xu, Han Xue, Zheng Zhang, Raquel Urtasun, Liwei Wang, Stephen Lin, and Han Hu. Dense RepPoints: Representing visual objects with dense point sets. arXiv preprint arXiv:1912.11473, 2019.
[44] Xingyi Zhou, Dequan Wang, and Philipp Krahenbuhl. Objects as points. arXiv preprint arXiv:1904.07850, 2019.
[45] Xingyi Zhou, Jiacheng Zhuo, and Philipp Krahenbuhl. Bottom-up object detection by grouping extreme and center points. In CVPR, 2019.
