Source: http://imlab.postech.ac.kr/dkim/class/csed703g_2019f/maskrcn.pdf
Mask R-CNN (ICCV 2017, Oral)
Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Facebook AI Research (FAIR)
1. Abstract
Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
2. Review - Instance Segmentation
Source: http://blog.naver.com/sogangori/221012300995
Instance segmentation combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances.
Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance.
2. Review - Fast R-CNN & Faster R-CNN
[Architecture diagrams: evolution from R-CNN to Mask R-CNN]
• R-CNN: ~2k region proposals from an independent algorithm; each warped region proposal is run through a CNN; an SVM performs classification and class-specific regression refines the boxes.
• Fast R-CNN: a convolutional backbone computes a shared feature map once; RoIs still come from an independent method; an RoIPool layer extracts a fixed-size feature map per RoI, and fully connected layers output classification and box regression.
• Faster R-CNN: identical to Fast R-CNN, except the RoIs come from a Region Proposal Network (RPN) that shares the convolutional backbone.
• Mask R-CNN: Faster R-CNN with the RoIPool layer replaced by RoIAlign, and a mask branch added in parallel to the classification and box-regression head.
4. Contribution
1. To fix the misalignment between RoIs and extracted features, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
2. We add a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding-box regression.
4. Contribution
1) RoIAlign
We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input.
RoIAlign improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics.
• Previous work: RoIPool
RoIPool max-pools each RoI into a fixed grid (e.g., 7×7), but the input coordinates are rounded off to the feature-map grid. Example with a stride-16 feature map (e.g., a 112×112 region), comparing the quantized coordinate [x/16] with the exact value x/16:

x  | [x/16] | x/16
32 | 2      | 2.00
33 | 2      | 2.06
34 | 2      | 2.12
35 | 2      | 2.18
36 | 2      | 2.25
37 | 2      | 2.31
38 | 2      | 2.37
39 | 2      | 2.43
40 | 3      | 2.50
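To make the rounding concrete, here is a minimal Python sketch (mine, not from the slides) that reproduces the table above; the round-half-up convention is an assumption chosen to match the bracket notation:

```python
# Minimal sketch of coordinate quantization in RoIPool vs. RoIAlign.
# RoIPool snaps coordinates to the stride grid; RoIAlign keeps them exact.
stride = 16

for x in range(32, 41):
    exact = x / stride             # RoIAlign: exact fractional coordinate
    quantized = int(exact + 0.5)   # RoIPool-style rounding, matching [.] above
    print(f"x={x}: [x/{stride}]={quantized}, x/{stride}={exact:.2f}")
```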
4. Contribution 1) RoIAlign
We use bilinear interpolation (Spatial Transformer Networks, Jaderberg et al.) to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).

[Figure: the Spatial Transformer module — a localisation net predicts transform parameters θ, a grid generator produces sampling coordinates, and a sampler maps input feature map U to output V]

The grid generator applies a pointwise affine transform to each target grid location:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

The sampler then computes each output value with a generic sampling kernel,

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y) \qquad \forall i \in [1 \dots H'W'],\ \forall c \in [1 \dots C]$$

which with a bilinear sampling kernel becomes

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)$$

Here $V_i^c$ is the target feature value at location $i$ in channel $c$, $U_{nm}^c$ is the input feature value at location $(n, m)$ in channel $c$, $k(\cdot;\Phi)$ is the sampling kernel with parameters $\Phi$, and $(x_i^s, y_i^s)$ is the sampling coordinate. (With an integer kernel instead, this would simply copy the value at the pixel nearest to $(x_i^s, y_i^s)$ to the output location $(x_i^t, y_i^t)$; the bilinear kernel interpolates the four surrounding pixels.)

Jaderberg, Max, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. "Spatial Transformer Networks." Advances in Neural Information Processing Systems, 2015.
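A small NumPy sketch of the bilinear sampling equation above (my own illustration; `bilinear_sample` is a hypothetical helper, not the paper's code). Only the four pixels surrounding the fractional coordinate get nonzero weight:

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample channel-first feature map U (C, H, W) at fractional (xs, ys),
    implementing V_c = sum_{n,m} U_c[n,m] * max(0,1-|xs-m|) * max(0,1-|ys-n|)."""
    C, H, W = U.shape
    m0, n0 = int(np.floor(xs)), int(np.floor(ys))  # top-left neighbor
    V = np.zeros(C)
    for n in (n0, n0 + 1):                          # the 4 surrounding pixels
        for m in (m0, m0 + 1):
            if 0 <= n < H and 0 <= m < W:
                w = max(0.0, 1 - abs(xs - m)) * max(0.0, 1 - abs(ys - n))
                V += w * U[:, n, m]
    return V

# RoIAlign-style usage: sample 4 regular points in an RoI bin, then average.
U = np.random.rand(256, 14, 14)
points = [(2.31, 3.18), (2.81, 3.18), (2.31, 3.68), (2.81, 3.68)]
bin_value = np.mean([bilinear_sample(U, x, y) for x, y in points], axis=0)
```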
4. Contribution 1) RoIAlign
• Faster R-CNN RoIPool
[Figure: input activation, region projection and pooling sections, max-pooling output]
Source: https://deepsense.io/region-of-interest-pooling-explained/
4. Contribution 1) RoIAlign
• Mask R-CNN RoIAlign
[Figure: input activation, region projection and pooling sections, sampling locations, bilinear-interpolated values, max-pooling output]
Source: Silvio Galesso, https://lmb.informatik.uni-freiburg.de/lectures/seminar_brox/seminar_ss17/maskrcnn_slides.pdf
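In practice both layers exist as off-the-shelf ops, e.g. in torchvision; the snippet below sketches how their outputs are obtained from the same feature map and RoI (values such as `spatial_scale=1/16` assume a stride-16 backbone, and exact argument behavior should be checked against your torchvision version):

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.randn(1, 256, 50, 50)             # backbone feature map, stride 16
# RoIs in input-image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0.0, 33.0, 35.0, 160.0, 170.0]])

pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16,
                    sampling_ratio=2)          # 2x2 sampling points per bin
print(pooled.shape, aligned.shape)             # both: (1, 256, 7, 7)
```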
4. Contribution
2) Network architecture: Faster R-CNN + instance segmentation
[Figure: head architectures]
• Faster R-CNN w/ ResNet [19] (C4 variant; ResNet or ResNeXt backbone): the 7×7×1024 RoI is fed through res5 to 7×7×2048 and average-pooled to a 2048-d vector for the class and box outputs; the mask branch deconvolves to 14×14×256 and predicts 14×14×80 masks.
• Faster R-CNN w/ FPN [27]: the 7×7×256 RoI passes through two 1024-d fully connected layers for the class and box outputs; the mask branch takes a 14×14×256 RoI through four 3×3 convs (14×14×256), a deconv to 28×28×256, and a 1×1 conv to 28×28×80 masks.
In both cases the network splits into a backbone (CNN feature map, RPN, and RoI layer producing a fixed-size feature map) shared with Faster R-CNN, and a head, where box regression and classification are joined by the mask branch.
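Below is a PyTorch sketch of the FPN mask head just described; it is my reconstruction from the figure dimensions, not the authors' code, and the class count 80 assumes COCO:

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """FPN-style mask branch: (N, 256, 14, 14) RoI features -> (N, K, 28, 28) logits."""
    def __init__(self, channels=256, num_classes=80):
        super().__init__()
        layers = []
        for _ in range(4):  # four 3x3 convs, keeping 14x14x256
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)  # -> 28x28
        self.predict = nn.Conv2d(channels, num_classes, 1)  # one mask per class

    def forward(self, x):
        x = self.convs(x)
        x = torch.relu(self.deconv(x))
        return self.predict(x)  # sigmoid is applied in the loss, not here
```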
4. Contribution 2) Network architecture
• Backbone: ResNet

layer name | output size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer
conv1 | 112×112 | 7×7, 64, stride 2 (all variants)
conv2_x | 56×56 | 3×3 max pool, stride 2; then [3×3,64; 3×3,64]×2 | [3×3,64; 3×3,64]×3 | [1×1,64; 3×3,64; 1×1,256]×3 | [1×1,64; 3×3,64; 1×1,256]×3 | [1×1,64; 3×3,64; 1×1,256]×3
conv3_x | 28×28 | [3×3,128; 3×3,128]×2 | [3×3,128; 3×3,128]×4 | [1×1,128; 3×3,128; 1×1,512]×4 | [1×1,128; 3×3,128; 1×1,512]×4 | [1×1,128; 3×3,128; 1×1,512]×8
conv4_x | 14×14 | [3×3,256; 3×3,256]×2 | [3×3,256; 3×3,256]×6 | [1×1,256; 3×3,256; 1×1,1024]×6 | [1×1,256; 3×3,256; 1×1,1024]×23 | [1×1,256; 3×3,256; 1×1,1024]×36
conv5_x | 7×7 | [3×3,512; 3×3,512]×2 | [3×3,512; 3×3,512]×3 | [1×1,512; 3×3,512; 1×1,2048]×3 | [1×1,512; 3×3,512; 1×1,2048]×3 | [1×1,512; 3×3,512; 1×1,2048]×3
(final) | 1×1 | average pool, 1000-d fc, softmax (all variants)
FLOPs | | 1.8×10⁹ | 3.6×10⁹ | 3.8×10⁹ | 7.6×10⁹ | 11.3×10⁹
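For reference, one row of the 50/101/152-layer columns denotes a bottleneck residual block; a minimal PyTorch sketch (simplified: no stride, downsampling, or projection shortcut) looks like this:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """One [1x1,c; 3x3,c; 1x1,4c] bottleneck block with identity shortcut."""
    def __init__(self, channels):  # channels=64 gives 1x1,64 / 3x3,64 / 1x1,256
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 4 * channels, 1, bias=False),
            nn.BatchNorm2d(4 * channels),
        )

    def forward(self, x):                       # x: (N, 4*channels, H, W)
        return torch.relu(x + self.body(x))     # residual connection
```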
• Backbone: FPN
The C4 variants (ResNet-50-C4, ResNet-101-C4) extract features from a single-scale conv4 feature map.
FPN (Feature Pyramid Network) exploits the inherent hierarchy of CNNs to compute multi-scale features; Mask R-CNN replaces the single-scale feature map with an FPN.
Source: Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017.
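A minimal sketch of FPN's top-down pathway (illustrative, assuming backbone stages C2..C5 with the channel counts of the ResNet table above and each stage halving spatial size; layer names are mine):

```python
import torch.nn.functional as F
from torch import nn

class FPN(nn.Module):
    """Build multi-scale features P2..P5 from backbone stages C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):        # feats: [C2, C3, C4, C5], fine -> coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: add upsampled map
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], scale_factor=2, mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]
```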
4. Contribution 2) Network architecture
L = Lcls + Lbox + Lmask
A. Fast R-CNN losses:
• Lcls: log loss over the predicted class
• Lbox: smooth L1 loss on the box regression
B. Mask branch:
• K · (m × m) sigmoid outputs: per-pixel binary classification, one mask for each class, with no competition among classes
• Lmask: mean binary cross-entropy
          AP    AP50  AP75
softmax   24.8  44.1  25.1
sigmoid   30.3  51.2  31.5
          +5.5  +7.1  +6.4

(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
Our definition of Lmask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
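A sketch of this per-class mask loss (my own illustration, not the released implementation): the logits tensor holds one m×m mask per class, and only the ground-truth class channel of each RoI enters the mean binary cross-entropy:

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """mask_logits: (N, K, m, m) per-class mask logits for N positive RoIs
    gt_classes:  (N,) ground-truth class index of each RoI
    gt_masks:    (N, m, m) binary ground-truth masks in {0, 1}"""
    n = mask_logits.shape[0]
    # Select only the mask channel of the ground-truth class: masks of other
    # classes do not compete, unlike a per-pixel softmax over K classes.
    logits = mask_logits[torch.arange(n), gt_classes]          # (N, m, m)
    return F.binary_cross_entropy_with_logits(logits, gt_masks.float())

# Example: 8 RoIs, 80 classes, 28x28 masks
loss = mask_loss(torch.randn(8, 80, 28, 28),
                 torch.randint(0, 80, (8,)),
                 torch.randint(0, 2, (8, 28, 28)))
```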
5. Experiments
• Main dataset: MS COCO
• 80 classes
• 80k train images, plus a 35k subset of val images (trainval35k) for training
• 5k val images (minival) for ablation experiments
• Metric: COCO mask AP (averaged over IoU thresholds), AP50, AP75
5. Experiments
[Figure: qualitative Mask R-CNN detections on COCO test images, with per-instance class labels and confidence scores (persons, cars, horses, kites, zebras, suitcases, donuts, elephants, etc.)]
Figure 4. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).
5. Experiments
net-depth-features   AP    AP50  AP75
ResNet-50-C4         30.3  51.2  31.5
ResNet-101-C4        32.7  54.2  34.3
ResNet-50-FPN        33.6  55.2  35.3
ResNet-101-FPN       35.4  57.3  37.5
ResNeXt-101-FPN      36.7  59.5  38.9

(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.

          AP    AP50  AP75
softmax   24.8  44.1  25.1
sigmoid   30.3  51.2  31.5
          +5.5  +7.1  +6.4

(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).

              align?  bilinear?  agg.  AP    AP50  AP75
RoIPool [12]                     max   26.9  48.8  26.4
RoIWarp [10]          yes        max   27.2  49.2  27.1
RoIWarp [10]          yes        ave   27.1  48.9  27.1
RoIAlign      yes     yes        max   30.2  51.0  31.8
RoIAlign      yes     yes        ave   30.3  51.2  31.5

(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ~3 points and AP75 by ~5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.

           AP    AP50  AP75   APbb  APbb50  APbb75
RoIPool    23.6  46.5  21.6   28.2  52.7    26.9
RoIAlign   30.9  51.8  32.1   34.0  55.3    36.4
           +7.3  +5.3  +10.5  +5.8  +2.6    +9.5

(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features (Table 2c), resulting in massive accuracy gaps.

mask branch                                    AP    AP50  AP75
MLP  fc: 1024 → 1024 → 80·28²                  31.5  53.7  32.8
MLP  fc: 1024 → 1024 → 1024 → 80·28²           31.5  54.0  32.6
FCN  conv: 256 → 256 → 256 → 256 → 256 → 80    33.6  55.2  35.3

(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.
Table 2. Ablations for Mask R-CNN. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.
5. Experiments
                      backbone                 AP    AP50  AP75  APS   APM   APL
MNC [10]              ResNet-101-C4            24.6  44.3  24.8  4.7   25.9  43.6
FCIS [26] +OHEM       ResNet-101-C5-dilated    29.2  49.5  -     7.1   31.3  50.0
FCIS+++ [26] +OHEM    ResNet-101-C5-dilated    33.6  54.5  -     -     -     -
Mask R-CNN            ResNet-101-C4            33.1  54.9  34.8  12.1  35.6  51.1
Mask R-CNN            ResNet-101-FPN           35.7  58.0  37.8  15.5  38.1  52.4
Mask R-CNN            ResNeXt-101-FPN          37.1  60.0  39.4  16.9  39.9  53.5

Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [35]. All entries are single-model results.
[Figure: side-by-side qualitative comparison on COCO — FCIS (top) vs. Mask R-CNN (bottom), with per-instance class labels and scores (persons, umbrellas, cars, giraffes, sports balls, ties, trains, skateboards)]
Figure 5. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.
• Instance segmentation
5. Experiments

                              backbone                  APbb  APbb50  APbb75  APbbS  APbbM  APbbL
Faster R-CNN+++ [19]          ResNet-101-C4             34.9  55.7    37.4    15.6   38.7   50.9
Faster R-CNN w FPN [27]       ResNet-101-FPN            36.2  59.1    39.0    18.2   39.0   48.2
Faster R-CNN by G-RMI [21]    Inception-ResNet-v2 [37]  34.7  55.5    36.7    13.5   38.1   52.0
Faster R-CNN w TDM [36]       Inception-ResNet-v2-TDM   36.8  57.7    39.2    16.2   39.8   52.1
Faster R-CNN, RoIAlign        ResNet-101-FPN            37.3  59.6    40.3    19.8   40.2   48.8
Mask R-CNN                    ResNet-101-FPN            38.2  60.3    41.7    20.1   41.1   50.2
Mask R-CNN                    ResNeXt-101-FPN           39.8  62.3    43.4    22.1   43.2   51.2

Table 3. Object detection single-model results (bounding box AP) vs. state-of-the-art on test-dev. Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models (the mask output is ignored in these experiments). The gains of Mask R-CNN over [27] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).
• Object detection
References
• R. Girshick. Fast R-CNN. In ICCV, 2015.
• S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
• M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
• K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
• T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
• J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
• R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.