Source: http://imlab.postech.ac.kr/dkim/class/csed703g_2019f/maskrcn.pdf
Mask R-CNN (ICCV 2017, Oral)
Kaiming He, Georgia Gkioxari, Piotr Dollár, Ross Girshick. Facebook AI Research (FAIR)
1. Abstract
Our approach efficiently detects objects in an image while simultaneously generating a high-quality segmentation mask for each instance.
Mask R-CNN extends Faster R-CNN by adding a branch for predicting an object mask in parallel with the existing branch for bounding box recognition.
Mask R-CNN is simple to train and adds only a small overhead to Faster R-CNN, running at 5 fps.
2. Review - Instance Segmentation
Source: http://blog.naver.com/sogangori/221012300995
Instance segmentation combines elements from the classical computer vision tasks of object detection, where the goal is to classify individual objects and localize each using a bounding box, and semantic segmentation, where the goal is to classify each pixel into a fixed set of categories without differentiating object instances.
Instance segmentation is challenging because it requires the correct detection of all objects in an image while also precisely segmenting each instance.
2. Review - Fast R-CNN & Faster R-CNN
[Architecture diagrams: evolution from R-CNN to Mask R-CNN]
• R-CNN: ~2k region proposals from an independent algorithm; each warped region proposal is run through a CNN; an SVM performs classification and class-specific regression refines the boxes.
• Fast R-CNN: a convolutional backbone computes a shared feature map once; RoIs still come from an independent method; an RoIPool layer extracts a fixed-size feature map per RoI, and fully connected layers output classification and box regression.
• Faster R-CNN: identical to Fast R-CNN, except the RoIs come from a Region Proposal Network (RPN) that shares the convolutional backbone.
• Mask R-CNN: Faster R-CNN with the RoIPool layer replaced by RoIAlign, and a mask branch added in parallel to the classification and box-regression head.
4. Contribution
1. To fix the misalignment between RoIs and extracted features, we propose a simple, quantization-free layer, called RoIAlign, that faithfully preserves exact spatial locations.
2. We add a branch for predicting segmentation masks on each Region of Interest (RoI), in parallel with the existing branch for classification and bounding-box regression.
4. Contribution
1) RoIAlign
We propose an RoIAlign layer that removes the harsh quantization of RoIPool, properly aligning the extracted features with the input.
RoIAlign improves mask accuracy by a relative 10% to 50%, showing bigger gains under stricter localization metrics.
• Previous work: RoIPool
RoIPool max-pools each RoI into a fixed grid (e.g., 7×7), but the input coordinates are rounded off to the feature-map grid. Example with a stride-16 feature map (e.g., a 112×112 region), comparing the quantized coordinate [x/16] with the exact value x/16:

x  | [x/16] | x/16
32 | 2      | 2.00
33 | 2      | 2.06
34 | 2      | 2.12
35 | 2      | 2.18
36 | 2      | 2.25
37 | 2      | 2.31
38 | 2      | 2.37
39 | 2      | 2.43
40 | 3      | 2.50
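To make the rounding concrete, here is a minimal Python sketch (mine, not from the slides) that reproduces the table above; the round-half-up convention is an assumption chosen to match the bracket notation:

```python
# Minimal sketch of coordinate quantization in RoIPool vs. RoIAlign.
# RoIPool snaps coordinates to the stride grid; RoIAlign keeps them exact.
stride = 16

for x in range(32, 41):
    exact = x / stride             # RoIAlign: exact fractional coordinate
    quantized = int(exact + 0.5)   # RoIPool-style rounding, matching [.] above
    print(f"x={x}: [x/{stride}]={quantized}, x/{stride}={exact:.2f}")
```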
4. Contribution 1) RoIAlign
We use bilinear interpolation (Spatial Transformer Networks, Jaderberg et al.) to compute the exact values of the input features at four regularly sampled locations in each RoI bin, and aggregate the result (using max or average).

[Figure: the Spatial Transformer module — a localisation net predicts transform parameters θ, a grid generator produces sampling coordinates, and a sampler maps input feature map U to output V]

The grid generator applies a pointwise affine transform to each target grid location:

$$\begin{pmatrix} x_i^s \\ y_i^s \end{pmatrix} = \mathcal{T}_\theta(G_i) = A_\theta \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix} = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{pmatrix} x_i^t \\ y_i^t \\ 1 \end{pmatrix}$$

The sampler then computes each output value with a generic sampling kernel,

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, k(x_i^s - m; \Phi_x) \, k(y_i^s - n; \Phi_y) \qquad \forall i \in [1 \dots H'W'],\ \forall c \in [1 \dots C]$$

which with a bilinear sampling kernel becomes

$$V_i^c = \sum_n^H \sum_m^W U_{nm}^c \, \max(0, 1 - |x_i^s - m|) \, \max(0, 1 - |y_i^s - n|)$$

Here $V_i^c$ is the target feature value at location $i$ in channel $c$, $U_{nm}^c$ is the input feature value at location $(n, m)$ in channel $c$, $k(\cdot;\Phi)$ is the sampling kernel with parameters $\Phi$, and $(x_i^s, y_i^s)$ is the sampling coordinate. (With an integer kernel instead, this would simply copy the value at the pixel nearest to $(x_i^s, y_i^s)$ to the output location $(x_i^t, y_i^t)$; the bilinear kernel interpolates the four surrounding pixels.)

Jaderberg, Max, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. "Spatial Transformer Networks." Advances in Neural Information Processing Systems, 2015.
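A small NumPy sketch of the bilinear sampling equation above (my own illustration; `bilinear_sample` is a hypothetical helper, not the paper's code). Only the four pixels surrounding the fractional coordinate get nonzero weight:

```python
import numpy as np

def bilinear_sample(U, xs, ys):
    """Sample channel-first feature map U (C, H, W) at fractional (xs, ys),
    implementing V_c = sum_{n,m} U_c[n,m] * max(0,1-|xs-m|) * max(0,1-|ys-n|)."""
    C, H, W = U.shape
    m0, n0 = int(np.floor(xs)), int(np.floor(ys))  # top-left neighbor
    V = np.zeros(C)
    for n in (n0, n0 + 1):                          # the 4 surrounding pixels
        for m in (m0, m0 + 1):
            if 0 <= n < H and 0 <= m < W:
                w = max(0.0, 1 - abs(xs - m)) * max(0.0, 1 - abs(ys - n))
                V += w * U[:, n, m]
    return V

# RoIAlign-style usage: sample 4 regular points in an RoI bin, then average.
U = np.random.rand(256, 14, 14)
points = [(2.31, 3.18), (2.81, 3.18), (2.31, 3.68), (2.81, 3.68)]
bin_value = np.mean([bilinear_sample(U, x, y) for x, y in points], axis=0)
```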
4. Contribution 1) RoIAlign
• Faster R-CNN RoIPool
[Figure: input activation, region projection and pooling sections, max-pooling output]
Source: https://deepsense.io/region-of-interest-pooling-explained/
4. Contribution 1) RoIAlign
• Mask R-CNN RoIAlign
[Figure: input activation, region projection and pooling sections, sampling locations, bilinear-interpolated values, max-pooling output]
Source: Silvio Galesso, https://lmb.informatik.uni-freiburg.de/lectures/seminar_brox/seminar_ss17/maskrcnn_slides.pdf
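In practice both layers exist as off-the-shelf ops, e.g. in torchvision; the snippet below sketches how their outputs are obtained from the same feature map and RoI (values such as `spatial_scale=1/16` assume a stride-16 backbone, and exact argument behavior should be checked against your torchvision version):

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat = torch.randn(1, 256, 50, 50)             # backbone feature map, stride 16
# RoIs in input-image coordinates: (batch_index, x1, y1, x2, y2)
rois = torch.tensor([[0.0, 33.0, 35.0, 160.0, 170.0]])

pooled = roi_pool(feat, rois, output_size=(7, 7), spatial_scale=1 / 16)
aligned = roi_align(feat, rois, output_size=(7, 7), spatial_scale=1 / 16,
                    sampling_ratio=2)          # 2x2 sampling points per bin
print(pooled.shape, aligned.shape)             # both: (1, 256, 7, 7)
```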
4. Contribution
2) Network architecture: Faster R-CNN + instance segmentation
[Figure: head architectures]
• Faster R-CNN w/ ResNet [19] (C4 variant; ResNet or ResNeXt backbone): the 7×7×1024 RoI is fed through res5 to 7×7×2048 and average-pooled to a 2048-d vector for the class and box outputs; the mask branch deconvolves to 14×14×256 and predicts 14×14×80 masks.
• Faster R-CNN w/ FPN [27]: the 7×7×256 RoI passes through two 1024-d fully connected layers for the class and box outputs; the mask branch takes a 14×14×256 RoI through four 3×3 convs (14×14×256), a deconv to 28×28×256, and a 1×1 conv to 28×28×80 masks.
In both cases the network splits into a backbone (CNN feature map, RPN, and RoI layer producing a fixed-size feature map) shared with Faster R-CNN, and a head, where box regression and classification are joined by the mask branch.
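Below is a PyTorch sketch of the FPN mask head just described; it is my reconstruction from the figure dimensions, not the authors' code, and the class count 80 assumes COCO:

```python
import torch
import torch.nn as nn

class MaskHead(nn.Module):
    """FPN-style mask branch: (N, 256, 14, 14) RoI features -> (N, K, 28, 28) logits."""
    def __init__(self, channels=256, num_classes=80):
        super().__init__()
        layers = []
        for _ in range(4):  # four 3x3 convs, keeping 14x14x256
            layers += [nn.Conv2d(channels, channels, 3, padding=1),
                       nn.ReLU(inplace=True)]
        self.convs = nn.Sequential(*layers)
        self.deconv = nn.ConvTranspose2d(channels, channels, 2, stride=2)  # -> 28x28
        self.predict = nn.Conv2d(channels, num_classes, 1)  # one mask per class

    def forward(self, x):
        x = self.convs(x)
        x = torch.relu(self.deconv(x))
        return self.predict(x)  # sigmoid is applied in the loss, not here
```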
4. Contribution 2) Network architecture
• Backbone: ResNet

layer name | output size | 18-layer | 34-layer | 50-layer | 101-layer | 152-layer
conv1 | 112×112 | 7×7, 64, stride 2 (all variants)
conv2_x | 56×56 | 3×3 max pool, stride 2; then [3×3,64; 3×3,64]×2 | [3×3,64; 3×3,64]×3 | [1×1,64; 3×3,64; 1×1,256]×3 | [1×1,64; 3×3,64; 1×1,256]×3 | [1×1,64; 3×3,64; 1×1,256]×3
conv3_x | 28×28 | [3×3,128; 3×3,128]×2 | [3×3,128; 3×3,128]×4 | [1×1,128; 3×3,128; 1×1,512]×4 | [1×1,128; 3×3,128; 1×1,512]×4 | [1×1,128; 3×3,128; 1×1,512]×8
conv4_x | 14×14 | [3×3,256; 3×3,256]×2 | [3×3,256; 3×3,256]×6 | [1×1,256; 3×3,256; 1×1,1024]×6 | [1×1,256; 3×3,256; 1×1,1024]×23 | [1×1,256; 3×3,256; 1×1,1024]×36
conv5_x | 7×7 | [3×3,512; 3×3,512]×2 | [3×3,512; 3×3,512]×3 | [1×1,512; 3×3,512; 1×1,2048]×3 | [1×1,512; 3×3,512; 1×1,2048]×3 | [1×1,512; 3×3,512; 1×1,2048]×3
(final) | 1×1 | average pool, 1000-d fc, softmax (all variants)
FLOPs | | 1.8×10⁹ | 3.6×10⁹ | 3.8×10⁹ | 7.6×10⁹ | 11.3×10⁹
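For reference, one row of the 50/101/152-layer columns denotes a bottleneck residual block; a minimal PyTorch sketch (simplified: no stride, downsampling, or projection shortcut) looks like this:

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """One [1x1,c; 3x3,c; 1x1,4c] bottleneck block with identity shortcut."""
    def __init__(self, channels):  # channels=64 gives 1x1,64 / 3x3,64 / 1x1,256
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(4 * channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, 4 * channels, 1, bias=False),
            nn.BatchNorm2d(4 * channels),
        )

    def forward(self, x):                       # x: (N, 4*channels, H, W)
        return torch.relu(x + self.body(x))     # residual connection
```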
• Backbone: FPN
The C4 variants (ResNet-50-C4, ResNet-101-C4) extract features from a single-scale conv4 feature map.
FPN (Feature Pyramid Network) exploits the inherent hierarchy of CNNs to compute multi-scale features; Mask R-CNN replaces the single-scale feature map with an FPN.
Source: Lin et al., Feature Pyramid Networks for Object Detection, CVPR 2017.
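A minimal sketch of FPN's top-down pathway (illustrative, assuming backbone stages C2..C5 with the channel counts of the ResNet table above and each stage halving spatial size; layer names are mine):

```python
import torch.nn.functional as F
from torch import nn

class FPN(nn.Module):
    """Build multi-scale features P2..P5 from backbone stages C2..C5."""
    def __init__(self, in_channels=(256, 512, 1024, 2048), out_ch=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_ch, 1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_ch, out_ch, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):        # feats: [C2, C3, C4, C5], fine -> coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down: add upsampled map
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], scale_factor=2, mode="nearest")
        return [s(l) for s, l in zip(self.smooth, laterals)]  # [P2, P3, P4, P5]
```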
4. Contribution 2) Network architecture
L = Lcls + Lbox + Lmask
A. Fast R-CNN losses:
• Lcls: log loss over the predicted class
• Lbox: smooth L1 loss on the box regression
B. Mask branch:
• K · (m × m) sigmoid outputs: per-pixel binary classification, one mask for each class, with no competition among classes
• Lmask: mean binary cross-entropy
          AP    AP50  AP75
softmax   24.8  44.1  25.1
sigmoid   30.3  51.2  31.5
          +5.5  +7.1  +6.4

(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).
Our definition of Lmask allows the network to generate masks for every class without competition among classes; we rely on the dedicated classification branch to predict the class label used to select the output mask. This decouples mask and class prediction. This is different from common practice when applying FCNs [29] to semantic segmentation, which typically uses a per-pixel softmax and a multinomial cross-entropy loss. In that case, masks across classes compete; in our case, with a per-pixel sigmoid and a binary loss, they do not. We show by experiments that this formulation is key for good instance segmentation results.
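A sketch of this per-class mask loss (my own illustration, not the released implementation): the logits tensor holds one m×m mask per class, and only the ground-truth class channel of each RoI enters the mean binary cross-entropy:

```python
import torch
import torch.nn.functional as F

def mask_loss(mask_logits, gt_classes, gt_masks):
    """mask_logits: (N, K, m, m) per-class mask logits for N positive RoIs
    gt_classes:  (N,) ground-truth class index of each RoI
    gt_masks:    (N, m, m) binary ground-truth masks in {0, 1}"""
    n = mask_logits.shape[0]
    # Select only the mask channel of the ground-truth class: masks of other
    # classes do not compete, unlike a per-pixel softmax over K classes.
    logits = mask_logits[torch.arange(n), gt_classes]          # (N, m, m)
    return F.binary_cross_entropy_with_logits(logits, gt_masks.float())

# Example: 8 RoIs, 80 classes, 28x28 masks
loss = mask_loss(torch.randn(8, 80, 28, 28),
                 torch.randint(0, 80, (8,)),
                 torch.randint(0, 2, (8, 28, 28)))
```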
5. Experiments
• Main dataset: MS COCO
• 80 classes
• 80k train images, plus a 35k subset of val images (trainval35k) for training
• 5k val images (minival) for ablation experiments
• Metric: COCO mask AP (averaged over IoU thresholds), AP50, AP75
5. Experiments
[Figure: qualitative Mask R-CNN detections on COCO test images, with per-instance class labels and confidence scores (persons, cars, horses, kites, zebras, suitcases, donuts, elephants, etc.)]
Figure 4. More results of Mask R-CNN on COCO test images, using ResNet-101-FPN and running at 5 fps, with 35.7 mask AP (Table 1).
5. Experiments
net-depth-features   AP    AP50  AP75
ResNet-50-C4         30.3  51.2  31.5
ResNet-101-C4        32.7  54.2  34.3
ResNet-50-FPN        33.6  55.2  35.3
ResNet-101-FPN       35.4  57.3  37.5
ResNeXt-101-FPN      36.7  59.5  38.9

(a) Backbone Architecture: Better backbones bring expected gains: deeper networks do better, FPN outperforms C4 features, and ResNeXt improves on ResNet.

          AP    AP50  AP75
softmax   24.8  44.1  25.1
sigmoid   30.3  51.2  31.5
          +5.5  +7.1  +6.4

(b) Multinomial vs. Independent Masks (ResNet-50-C4): Decoupling via per-class binary masks (sigmoid) gives large gains over multinomial masks (softmax).

              align?  bilinear?  agg.  AP    AP50  AP75
RoIPool [12]                     max   26.9  48.8  26.4
RoIWarp [10]          yes        max   27.2  49.2  27.1
RoIWarp [10]          yes        ave   27.1  48.9  27.1
RoIAlign      yes     yes        max   30.2  51.0  31.8
RoIAlign      yes     yes        ave   30.3  51.2  31.5

(c) RoIAlign (ResNet-50-C4): Mask results with various RoI layers. Our RoIAlign layer improves AP by ~3 points and AP75 by ~5 points. Using proper alignment is the only factor that contributes to the large gap between RoI layers.

           AP    AP50  AP75   APbb  APbb50  APbb75
RoIPool    23.6  46.5  21.6   28.2  52.7    26.9
RoIAlign   30.9  51.8  32.1   34.0  55.3    36.4
           +7.3  +5.3  +10.5  +5.8  +2.6    +9.5

(d) RoIAlign (ResNet-50-C5, stride 32): Mask-level and box-level AP using large-stride features. Misalignments are more severe than with stride-16 features (Table 2c), resulting in massive accuracy gaps.

mask branch                                    AP    AP50  AP75
MLP  fc: 1024 → 1024 → 80·28²                  31.5  53.7  32.8
MLP  fc: 1024 → 1024 → 1024 → 80·28²           31.5  54.0  32.6
FCN  conv: 256 → 256 → 256 → 256 → 256 → 80    33.6  55.2  35.3

(e) Mask Branch (ResNet-50-FPN): Fully convolutional networks (FCN) vs. multi-layer perceptrons (MLP, fully-connected) for mask prediction. FCNs improve results as they take advantage of explicitly encoding spatial layout.
Table 2. Ablations for Mask R-CNN. We train on trainval35k, test on minival, and report mask AP unless otherwise noted.
5. Experiments
                      backbone                 AP    AP50  AP75  APS   APM   APL
MNC [10]              ResNet-101-C4            24.6  44.3  24.8  4.7   25.9  43.6
FCIS [26] +OHEM       ResNet-101-C5-dilated    29.2  49.5  -     7.1   31.3  50.0
FCIS+++ [26] +OHEM    ResNet-101-C5-dilated    33.6  54.5  -     -     -     -
Mask R-CNN            ResNet-101-C4            33.1  54.9  34.8  12.1  35.6  51.1
Mask R-CNN            ResNet-101-FPN           35.7  58.0  37.8  15.5  38.1  52.4
Mask R-CNN            ResNeXt-101-FPN          37.1  60.0  39.4  16.9  39.9  53.5

Table 1. Instance segmentation mask AP on COCO test-dev. MNC [10] and FCIS [26] are the winners of the COCO 2015 and 2016 segmentation challenges, respectively. Without bells and whistles, Mask R-CNN outperforms the more complex FCIS+++, which includes multi-scale train/test, horizontal flip test, and OHEM [35]. All entries are single-model results.
[Figure: side-by-side qualitative comparison on COCO — FCIS (top) vs. Mask R-CNN (bottom), with per-instance class labels and scores (persons, umbrellas, cars, giraffes, sports balls, ties, trains, skateboards)]
Figure 5. FCIS+++ [26] (top) vs. Mask R-CNN (bottom, ResNet-101-FPN). FCIS exhibits systematic artifacts on overlapping objects.
• Instance segmentation
5. Experiments

                              backbone                  APbb  APbb50  APbb75  APbbS  APbbM  APbbL
Faster R-CNN+++ [19]          ResNet-101-C4             34.9  55.7    37.4    15.6   38.7   50.9
Faster R-CNN w FPN [27]       ResNet-101-FPN            36.2  59.1    39.0    18.2   39.0   48.2
Faster R-CNN by G-RMI [21]    Inception-ResNet-v2 [37]  34.7  55.5    36.7    13.5   38.1   52.0
Faster R-CNN w TDM [36]       Inception-ResNet-v2-TDM   36.8  57.7    39.2    16.2   39.8   52.1
Faster R-CNN, RoIAlign        ResNet-101-FPN            37.3  59.6    40.3    19.8   40.2   48.8
Mask R-CNN                    ResNet-101-FPN            38.2  60.3    41.7    20.1   41.1   50.2
Mask R-CNN                    ResNeXt-101-FPN           39.8  62.3    43.4    22.1   43.2   51.2

Table 3. Object detection single-model results (bounding box AP) vs. state-of-the-art on test-dev. Mask R-CNN using ResNet-101-FPN outperforms the base variants of all previous state-of-the-art models (the mask output is ignored in these experiments). The gains of Mask R-CNN over [27] come from using RoIAlign (+1.1 APbb), multitask training (+0.9 APbb), and ResNeXt-101 (+1.6 APbb).
• Object detection
References
• R. Girshick. Fast R-CNN. In ICCV, 2015.
• S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015.
• M. Jaderberg, K. Simonyan, A. Zisserman, and K. Kavukcuoglu. Spatial transformer networks. In NIPS, 2015.
• K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
• S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He. Aggregated residual transformations for deep neural networks. In CVPR, 2017.
• T.-Y. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid networks for object detection. In CVPR, 2017.
• J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, 2015.
• Y. Li, H. Qi, J. Dai, X. Ji, and Y. Wei. Fully convolutional instance-aware semantic segmentation. In CVPR, 2017.
• R. Girshick, F. Iandola, T. Darrell, and J. Malik. Deformable part models are convolutional neural networks. In CVPR, 2015.