Jason Bolito, Research School of Computer Science, ANU
RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN
Supervisors: Yiran Zhong & Hongdong Li
Outline
1. Motivation and Background
2. Proposed Method
3. Implementation, Experiment and Results
4. Conclusion and Future Work
Motivation – Semantic Segmentation
• Understanding road scenes.
• Useful for autonomous cars and drones.
Source: Cityscapes dataset.
Semantic Segmentation vs. Object Recognition
[Figure: object recognition labels a box “Person”; semantic segmentation labels every pixel: “Road”, “Person”, “Vegetation”, “Motorcycle”. Source: cityscapes-datasets.com]
What we want from our method
• Leverage both 3D and colour information.
• Attain more accurate and robust semantic segmentation.
Background – RGB Semantic Labelling
• Earlier days: CRFs (low-level vision cues).
• Recently: deep neural nets.
Background – Fully Convolutional Networks
• Pixels-to-pixels approach.
• Builds on VGG16 (encoder).
• Upsampling using deconvolution to get label map (decoder).
Source: FCNs for semantic segmentation by J. Long et al.
Background – Deconvolution Networks
• Expands VGG16 (encoder).
• Uses unpooling + deconv to get label map (decoder).
Source: Learning Deconvolution Network for Semantic Segmentation by H. Noh et al.
Background – SegNet
• Similar encoder-decoder structure.
• Removes fully connected layers.
• Prioritises memory efficiency.
Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation by V. Badrinarayanan et al.
Background – RGBd Semantic Labelling
• HHA representation (Gupta et al., 2014).
• Hard mutex constraints (Deng et al., 2015).
• LSTM-F (Li et al., 2016).
• FuseNet (Hazirbas et al., 2016).
Background (cont’d)
• Presented methods use depth as a channel.
• Depth used as generic information.
• 3D structure not considered/learned.
Proposed Method – Ideas
• Use depth to partially reconstruct 3D scene.
• Use 3D convolution to capture structure.
• Apply encoder-decoder design to achieve rich segmentation maps.
Proposed Method – S3D
[Architecture diagram: encoder of Conv3D + ReLU blocks with Conv3D (2x stride) + ReLU for downsampling; decoder of Deconv3D + ReLU blocks; final Softmax. Feature maps: 32, 64, 64, 64, 128.]
S3D building blocks – Input Layer
• Input RGB image I is voxelised via disparity map D:
  I_{3D}(z, x, y, c) := I(x, y, c) for z = \lfloor D(x, y) \rfloor, and 0 otherwise.
• 2.5D reconstruction of environment.
• Points at infinity have disparity 0.
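The voxelisation step above can be sketched in NumPy (an illustrative sketch, not the thesis code; `z_bins`, the number of disparity slices in the volume, is an assumed parameter):

```python
import numpy as np

def voxelise(image, disparity, z_bins):
    """Place each RGB pixel at the voxel slice given by its floored
    disparity; every other voxel stays zero, yielding the 2.5D
    reconstruction I3D(z, x, y, c)."""
    h, w, channels = image.shape
    vol = np.zeros((z_bins, h, w, channels), dtype=image.dtype)
    z = np.floor(disparity).astype(int).clip(0, z_bins - 1)
    rows, cols = np.indices((h, w))
    # I3D(z, x, y, c) = I(x, y, c) for z = floor(D(x, y))
    vol[z, rows, cols, :] = image
    return vol

# Points at infinity (disparity 0) land in slice z = 0.
img = np.random.rand(4, 6, 3).astype(np.float32)
disp = np.random.uniform(0, 8, size=(4, 6))
vol = voxelise(img, disp, z_bins=8)
```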
S3D building blocks – Encoder
• Feature extraction via 3D convolution:
  F_{out}(z, x, y, c_{out}) = \sum_{k, i, j, c_{in}} F_{in}(z + k, x + i, y + j, c_{in}) \, K_{c_{out}}(k, i, j, c_{in})
• Each 3x3x3 filter is a learnable template.
• High response = input matches template.
• 3D structure = 3D input + 3D templates.
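The sum above can be rendered naively in NumPy to make the indexing concrete (loop-based for clarity; real frameworks such as TensorFlow's `tf.nn.conv3d` do this efficiently, and the kernel layout here is an assumption):

```python
import numpy as np

def conv3d(f_in, kernels):
    """Naive 'valid' 3D convolution matching the slide's formula
    (a correlation, as in deep-learning frameworks).
    f_in: (Z, X, Y, C_in); kernels: (C_out, kz, kx, ky, C_in)."""
    Z, X, Y, C_in = f_in.shape
    C_out, kz, kx, ky, _ = kernels.shape
    out = np.zeros((Z - kz + 1, X - kx + 1, Y - ky + 1, C_out))
    for z in range(out.shape[0]):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                patch = f_in[z:z + kz, x:x + kx, y:y + ky, :]
                # F_out(z,x,y,c_out) = sum_{k,i,j,c_in} F_in(z+k,x+i,y+j,c_in) K_{c_out}(k,i,j,c_in)
                out[z, x, y, :] = np.tensordot(
                    kernels, patch, axes=([1, 2, 3, 4], [0, 1, 2, 3]))
    return out
```

A high output value at (z, x, y) means the local 3D patch matched the c_out-th template.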
S3D building blocks – Encoder (cont’d)
• Non-linear activation function: ReLU(x) = max(0, x).
• Good gradients for backprop.
• Learnable downsampling = strided 3D convolution.
S3D building blocks – Decoder
• 3D deconvolution = “inverse” of 3D convolution:
  F_{out}(z + k, x + i, y + j, c_{out}) \mathrel{+}= \sum_{c_{in}} F_{in}(z, x, y, c_{in}) \, K_{c_{out}}(k, i, j, c_{in})
• Already implemented as the backward pass of Conv3D.
• Learnable upsampling = strided 3D deconvolution.
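The scatter form of this += update can be sketched the same way (illustrative NumPy, kernel layout as in the encoder sketch above):

```python
import numpy as np

def deconv3d(f_in, kernels):
    """Naive 3D transposed convolution ('deconvolution'): each input
    voxel stamps a scaled copy of the kernel into the output, i.e.
    the backward pass of Conv3D.
    f_in: (Z, X, Y, C_in); kernels: (C_out, kz, kx, ky, C_in)."""
    Z, X, Y, C_in = f_in.shape
    C_out, kz, kx, ky, _ = kernels.shape
    out = np.zeros((Z + kz - 1, X + kx - 1, Y + ky - 1, C_out))
    for z in range(Z):
        for x in range(X):
            for y in range(Y):
                # F_out(z+k, x+i, y+j, c_out) += sum_{c_in} F_in(z,x,y,c_in) K_{c_out}(k,i,j,c_in)
                contrib = np.tensordot(kernels, f_in[z, x, y, :],
                                       axes=([4], [0]))  # (C_out, kz, kx, ky)
                out[z:z + kz, x:x + kx, y:y + ky, :] += contrib.transpose(1, 2, 3, 0)
    return out
```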
S3D building blocks – Decoder (cont’d)
• Skip layers (top-down modulation): shallow Conv3D encoder features (low-level) are fed across to the corresponding Deconv3D decoder layers (high-level knowledge).
• Helps with convergence and refines features.
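One minimal way such a skip connection can merge features is an element-wise sum of same-shape tensors (a hypothetical sketch; the exact merge operator used in S3D is not specified on this slide):

```python
import numpy as np

def skip_merge(decoder_feat, encoder_feat):
    """Element-wise sum: shallow, low-level encoder features modulate
    the deep decoder features top-down, refining spatial detail."""
    assert decoder_feat.shape == encoder_feat.shape
    return decoder_feat + encoder_feat
```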
S3D building blocks – Inference
• Use softmax to get probability cube:
  \hat{P}(z, x, y, c) := \exp(F(z, x, y, c)) / \sum_{c' \in \text{Classes}} \exp(F(z, x, y, c'))
• Argmax over classes to get 3D labels:
  \hat{L}_{3D}(z, x, y) := \operatorname{argmax}_{c \in \text{Classes}} \hat{P}(z, x, y, c)
• Project using D to get 2D labels:
  \hat{L}(x, y) := \hat{L}_{3D}(\lfloor D(x, y) \rfloor, x, y)
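The three inference steps can be sketched together in NumPy (illustrative; `logits` stands for the final feature cube F):

```python
import numpy as np

def infer_labels(logits, disparity):
    """Softmax over the class axis gives the probability cube P^,
    argmax gives the 3D labels L^3D, and indexing by the floored
    disparity projects them back to a 2D label map L^.
    logits: (Z, X, Y, C); disparity: (X, Y)."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    p_hat = e / e.sum(axis=-1, keepdims=True)                # P^(z, x, y, c)
    l3d = p_hat.argmax(axis=-1)                              # L^3D(z, x, y)
    z = np.floor(disparity).astype(int).clip(0, logits.shape[0] - 1)
    rows, cols = np.indices(disparity.shape)
    return l3d[z, rows, cols]                                # L^(x, y)
```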
Implementation
• Implemented using a deep-learning facade API and TensorFlow.
Experiment and Results
• Dataset: Cityscapes (urban scene dataset)
• Splits: 2975 training / 500 test images over 50 cities.
• GPU: Nvidia GeForce Titan X Pascal.
Experiment and Results (cont’d)
• Image size: 128x64x128
• Iterations: around 30k
• Results: (state of the art has mIoU = 80.1%)
• Learning feature extraction takes a while.
• Can we do better?
G      mIoU   C      G_test  mIoU_test  C_test
0.908  0.533  0.399  0.833   0.444      0.295
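For reference, the mIoU figures in these tables can be computed as below (an illustrative sketch of the standard metric, not the thesis's evaluation code; G and C would be computed analogously as global and per-class accuracy):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union: per-class IoU = |pred ∩ gt| / |pred ∪ gt|,
    averaged over the classes that actually occur."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```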
Experiment and Results (cont’d)
• Trick: Let pre-trained 2D DCNN do feature extraction.
• Use S3D on extracted features.
• Not state of the art, but matches DeepLab (71.4%)!
• Depth resolution gives an accuracy/efficiency trade-off!
Method             G      mIoU   C      G_test  mIoU_test  C_test  time/it (s)
S3D-ResNet-38-128  0.958  0.717  0.606  0.949   0.691      0.56    1.35
S3D-ResNet-38-48   0.96   0.748  0.622  0.942   0.715      0.57    0.513
S3D-ResNet-38-16   0.954  0.722  0.597  0.943   0.69       0.548   0.169
Conclusion and Future Work
• Presented a DNN solution for semantic segmentation.
• Solution fully utilises 3D structure.
• Achieves good results especially when used on pre-extracted features.
• Good results achieved without any extra goodies! (CRFs, data augmentation, …)
• There is plenty of room for improvement!
Conclusion and Future Work (cont’d)
• Need to push S3D to the limit.
• Can be done with post-processing, balancing, upsampling, …
• What happens when we generalise one of the other architectures to 3D?
Questions?
Thank You!