Jason Bolito, Research School of Computer Science, ANU
RGBd Image Semantic Labelling for Urban Driving Scenes via a DCNN
Supervisors: Yiran Zhong & Hongdong Li
Outline
1. Motivation and Background
2. Proposed Method
3. Implementation, Experiment and Results
4. Conclusion and Future Work
Motivation – Semantic Segmentation
• Understanding road scenes.
• Useful for autonomous cars and drones.
Source: Cityscapes dataset.
Semantic Segmentation vs. Object Recognition
[Figure: object recognition labels a box “Person”; semantic segmentation labels every pixel: “Road”, “Person”, “Vegetation”, “Motorcycle”. Source: cityscapes-datasets.com]
What we want from our method
• Leverage both 3D and colour information.
• Attain more accurate and robust semantic segmentation.
Background – RGB Semantic Labelling
• Earlier days: CRFs (low-level vision cues).
• Recently: deep neural nets.
Background – Fully Convolutional Networks
• Pixels-to-pixels approach.
• Builds on VGG16 (encoder).
• Upsampling using deconvolution to get label map (decoder).
Source: FCNs for semantic segmentation by J. Long et al.
Background – Deconvolution Networks
• Expands VGG16 (encoder).
• Uses unpooling + deconv to get label map (decoder).
Source: Learning Deconvolution Network for Semantic Segmentation by H. Noh et al.
Background – SegNet
• Similar encoder-decoder structure.
• Removes fully connected layers.
• Prioritises memory efficiency.
Source: SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation by V. Badrinarayanan et al.
Background – RGBd Semantic Labelling
• HHA representation (Gupta et al., 2014).
• Hard mutex constraints (Deng et al., 2015).
• LSTM-F (Li et al., 2016).
• FuseNet (Hazirbas et al., 2016).
Background (cont’d)
• Presented methods use depth as a channel.
• Depth used as generic information.
• 3D structure not considered/learned.
Proposed Method – Ideas
• Use depth to partially reconstruct 3D scene.
• Use 3D convolution to capture structure.
• Apply encoder-decoder design to achieve rich segmentation maps.
Proposed Method – S3D
[Architecture diagram: encoder of Conv3D + ReLU blocks with Conv3D (2x stride) + ReLU for downsampling; decoder of Deconv3D + ReLU blocks; final Softmax. Feature maps: 32, 64, 64, 64, 128.]
S3D building blocks – Input Layer
• Input RGB image I is voxelised via disparity map D:
  I_{3D}(z, x, y, c) := I(x, y, c) for z = \lfloor D(x, y) \rfloor, and 0 otherwise.
• 2.5D reconstruction of environment.
• Points at infinity have disparity 0.
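The voxelisation step above can be sketched in NumPy (an illustrative sketch, not the thesis code; `z_bins`, the number of disparity slices in the volume, is an assumed parameter):

```python
import numpy as np

def voxelise(image, disparity, z_bins):
    """Place each RGB pixel at the voxel slice given by its floored
    disparity; every other voxel stays zero, yielding the 2.5D
    reconstruction I3D(z, x, y, c)."""
    h, w, channels = image.shape
    vol = np.zeros((z_bins, h, w, channels), dtype=image.dtype)
    z = np.floor(disparity).astype(int).clip(0, z_bins - 1)
    rows, cols = np.indices((h, w))
    # I3D(z, x, y, c) = I(x, y, c) for z = floor(D(x, y))
    vol[z, rows, cols, :] = image
    return vol

# Points at infinity (disparity 0) land in slice z = 0.
img = np.random.rand(4, 6, 3).astype(np.float32)
disp = np.random.uniform(0, 8, size=(4, 6))
vol = voxelise(img, disp, z_bins=8)
```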
S3D building blocks – Encoder
• Feature extraction via 3D convolution:
  F_{out}(z, x, y, c_{out}) = \sum_{k, i, j, c_{in}} F_{in}(z + k, x + i, y + j, c_{in}) \, K_{c_{out}}(k, i, j, c_{in})
• Each 3x3x3 filter is a learnable template.
• High response = input matches template.
• 3D structure = 3D input + 3D templates.
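The sum above can be rendered naively in NumPy to make the indexing concrete (loop-based for clarity; real frameworks such as TensorFlow's `tf.nn.conv3d` do this efficiently, and the kernel layout here is an assumption):

```python
import numpy as np

def conv3d(f_in, kernels):
    """Naive 'valid' 3D convolution matching the slide's formula
    (a correlation, as in deep-learning frameworks).
    f_in: (Z, X, Y, C_in); kernels: (C_out, kz, kx, ky, C_in)."""
    Z, X, Y, C_in = f_in.shape
    C_out, kz, kx, ky, _ = kernels.shape
    out = np.zeros((Z - kz + 1, X - kx + 1, Y - ky + 1, C_out))
    for z in range(out.shape[0]):
        for x in range(out.shape[1]):
            for y in range(out.shape[2]):
                patch = f_in[z:z + kz, x:x + kx, y:y + ky, :]
                # F_out(z,x,y,c_out) = sum_{k,i,j,c_in} F_in(z+k,x+i,y+j,c_in) K_{c_out}(k,i,j,c_in)
                out[z, x, y, :] = np.tensordot(
                    kernels, patch, axes=([1, 2, 3, 4], [0, 1, 2, 3]))
    return out
```

A high output value at (z, x, y) means the local 3D patch matched the c_out-th template.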
S3D building blocks – Encoder (cont’d)
• Non-linear activation function: ReLU(x) = max(0, x).
• Good gradients for backprop.
• Learnable downsampling = strided 3D convolution.
S3D building blocks – Decoder
• 3D deconvolution = “inverse” of 3D convolution:
  F_{out}(z + k, x + i, y + j, c_{out}) \mathrel{+}= \sum_{c_{in}} F_{in}(z, x, y, c_{in}) \, K_{c_{out}}(k, i, j, c_{in})
• Already implemented as the backward pass of Conv3D.
• Learnable upsampling = strided 3D deconvolution.
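The scatter form of this += update can be sketched the same way (illustrative NumPy, kernel layout as in the encoder sketch above):

```python
import numpy as np

def deconv3d(f_in, kernels):
    """Naive 3D transposed convolution ('deconvolution'): each input
    voxel stamps a scaled copy of the kernel into the output, i.e.
    the backward pass of Conv3D.
    f_in: (Z, X, Y, C_in); kernels: (C_out, kz, kx, ky, C_in)."""
    Z, X, Y, C_in = f_in.shape
    C_out, kz, kx, ky, _ = kernels.shape
    out = np.zeros((Z + kz - 1, X + kx - 1, Y + ky - 1, C_out))
    for z in range(Z):
        for x in range(X):
            for y in range(Y):
                # F_out(z+k, x+i, y+j, c_out) += sum_{c_in} F_in(z,x,y,c_in) K_{c_out}(k,i,j,c_in)
                contrib = np.tensordot(kernels, f_in[z, x, y, :],
                                       axes=([4], [0]))  # (C_out, kz, kx, ky)
                out[z:z + kz, x:x + kx, y:y + ky, :] += contrib.transpose(1, 2, 3, 0)
    return out
```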
S3D building blocks – Decoder (cont’d)
• Skip layers (top-down modulation): shallow Conv3D encoder features (low-level) are fed across to the corresponding Deconv3D decoder layers (high-level knowledge).
• Helps with convergence and refines features.
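One minimal way such a skip connection can merge features is an element-wise sum of same-shape tensors (a hypothetical sketch; the exact merge operator used in S3D is not specified on this slide):

```python
import numpy as np

def skip_merge(decoder_feat, encoder_feat):
    """Element-wise sum: shallow, low-level encoder features modulate
    the deep decoder features top-down, refining spatial detail."""
    assert decoder_feat.shape == encoder_feat.shape
    return decoder_feat + encoder_feat
```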
S3D building blocks – Inference
• Use softmax to get probability cube:
  \hat{P}(z, x, y, c) := \exp(F(z, x, y, c)) / \sum_{c' \in \text{Classes}} \exp(F(z, x, y, c'))
• Argmax over classes to get 3D labels:
  \hat{L}_{3D}(z, x, y) := \operatorname{argmax}_{c \in \text{Classes}} \hat{P}(z, x, y, c)
• Project using D to get 2D labels:
  \hat{L}(x, y) := \hat{L}_{3D}(\lfloor D(x, y) \rfloor, x, y)
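The three inference steps can be sketched together in NumPy (illustrative; `logits` stands for the final feature cube F):

```python
import numpy as np

def infer_labels(logits, disparity):
    """Softmax over the class axis gives the probability cube P^,
    argmax gives the 3D labels L^3D, and indexing by the floored
    disparity projects them back to a 2D label map L^.
    logits: (Z, X, Y, C); disparity: (X, Y)."""
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))  # stable softmax
    p_hat = e / e.sum(axis=-1, keepdims=True)                # P^(z, x, y, c)
    l3d = p_hat.argmax(axis=-1)                              # L^3D(z, x, y)
    z = np.floor(disparity).astype(int).clip(0, logits.shape[0] - 1)
    rows, cols = np.indices(disparity.shape)
    return l3d[z, rows, cols]                                # L^(x, y)
```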
Implementation
• Implemented using a deep-learning facade API and TensorFlow.
Experiment and Results
• Dataset: Cityscapes (urban scene dataset)
• Splits: 2975 training / 500 test images over 50 cities.
• GPU: Nvidia GeForce Titan X Pascal.
Experiment and Results (cont’d)
• Image size: 128x64x128
• Iterations: around 30k
• Results: (state of the art has mIoU = 80.1%)
• Learning feature extraction takes a while.
• Can we do better?
G      mIoU   C      G_test  mIoU_test  C_test
0.908  0.533  0.399  0.833   0.444      0.295
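For reference, the mIoU figures in these tables can be computed as below (an illustrative sketch of the standard metric, not the thesis's evaluation code; G and C would be computed analogously as global and per-class accuracy):

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """Mean intersection-over-union: per-class IoU = |pred ∩ gt| / |pred ∪ gt|,
    averaged over the classes that actually occur."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```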
Experiment and Results (cont’d)
• Trick: Let pre-trained 2D DCNN do feature extraction.
• Use S3D on extracted features.
• Not state of the art, but matches DeepLab (71.4%)!
• Depth resolution gives an accuracy/efficiency trade-off!
Method             G      mIoU   C      G_test  mIoU_test  C_test  time/it (s)
S3D-ResNet-38-128  0.958  0.717  0.606  0.949   0.691      0.56    1.35
S3D-ResNet-38-48   0.96   0.748  0.622  0.942   0.715      0.57    0.513
S3D-ResNet-38-16   0.954  0.722  0.597  0.943   0.69       0.548   0.169
Conclusion and Future Work
• Presented a DNN solution for semantic segmentation.
• Solution fully utilises 3D structure.
• Achieves good results especially when used on pre-extracted features.
• Good results achieved without any extra goodies! (CRFs, data augmentation, …)
• There is plenty of room for improvement!
Conclusion and Future Work (cont’d)
• Need to push S3D to the limit.
• Can be done with post-processing, balancing, upsampling, …
• What happens when we generalise one of the other architectures to 3D?
Questions?
Thank You!