building and road detection from large aerial imagery

Building and road detection from large aerial imagery

Shunta SAITO*, Yoshimitsu AOKI* * Graduate School of Science and Technology,

Keio University, Japan

Motivation• Understanding aerial image is highly demanded for generating maps,

analyzing disaster scale, detecting changes to manage estate, etc. • But it’s been usually done by human experts, so that it’s both slow and costly. • Remote sensing community has been focused on this task but it’s still difficult

to detect terrestrial objects automatically from aerial image with high accuracy.

Goal

Input aerial image (RGB)

Output 3-channels map R: road, G: building, B: others

The trade-off between different objects at a same pixel

Previous WorksSenaras et al., Building detection with decision fusion, 2013

ResultProcess flow

Input

Predicted labels

Ground truth

Mean shift segmentation

Extract 15 different features

Combine the multiple classifica-tion results

segments

feature

feature

feature

Vegetation maskShadow mask

Infrared-Red image

Input aerial image (Infrared, RGB)

Infrared-Red-Green image

Hue-Saturation-Intensity (HSI)

image

Normalized Difference

Vegetation Index (NDVI) image

classifier

classifier

classifier

Previous WorksVolodymyr Mnih, Machine Learning for Aerial Image Labeling, 2013

• They take patch-based approach which is very suited to use Convolutional Neural Network (CNN).

• They formulate the problem as obtaining a mapping from aerial image patch to label image patch.

• However, they train two CNNs, one for building, another for road, despite there may be trade-offs between them.

Process flow Result Description

Aerial imagery

Predicted Label

Noise model

Patches

CNNCNNCNN

Dataset

Aerial image Building labelx 151

Road labelx 1109

Convolutional Neural Network

64 x 64 x 3 (RGB) sized patches

Correct answers

Predictions

16 x 16 x 3 (building, road, other)

Calculate loss

Back

prop

agat

ion

Our ApproachWe train a Convolutional Neural Network (CNN) as a mapping from an input aerial image patch and a 3-channel label image patch using stochastic gradient descent.

RG

B

Input aerialimage patch

Predicted map patch

FC(4096)C(64, 9x9/2) P(2/1) C(128, 7x7/1) C(128, 5x5/1) FC(768)

• We train a CNN which has the above architecture. • The CNN takes a small RGB image patch as input, and

output a predicted 3-channel label patch. • The predicted label patch is consisted of Road channel

and Building channel and others channel. • No pre-processing like segmentation is needed and we

don’t need to design any image features. CNN obtains good feature extractors automatically.

Patch-based framework

allow the network to use surrounding pixels to predict labels in the center patch

Using context

It’s building!?

1

0

0

0

0

1

m̃i2

m̃i1

m̃i3

i

s

m̃

wm

wm

ws

ws

Road

Building

Otherwise

Aerial imagepatch

Predicted label patch

p(m̃|s) =

w2m�

i=1

p(m̃i|s)We learn with CNN.

Loss function• Each pixel in predicted label image

• is independent each other (assumption) • is always belonging to only one of the 3 labels (building, road, others)

m̂i

(1.56, 4.37, 3.11)

softmax

(0.05, 0.74, 0.21)

Predicted label

(1, 0, 0)

m̃i

Correct label

c : channel (Building, Road, Others)P : correct label distribution (1-of-3 coding)Q : predicted label distribution

Asymmetric cross entropy

and just minimize this cross entropy by Stochastic Gradient Descent

wc : weight for each channel loss

wbuilding = 1.5, wroad = 1.5, wothers = 0

*because prediction loss in the others channel is not important

Dataset

+ →

Building label Road label 3-channel labelAerial image

• We combine the Volodymyr’s Road and Building detection datasets* to create our 3-channel map dataset.

• Our dataset contains 147 sets of aerial images and 3-channel label images. - 137 sets for training - 10 sets for testing

• Each image is 1500 x 1500 pixel sized at 1m^2/pixel resolution.

• The entire dataset covers roughly 340 km^2 of mainly urban region in Massachusetts, the United States.

Aerial images 3-ch labels

Dataset

* http://www.cs.toronto.edu/~vmnih/data/

Experiment

• Training with 137 images and labels

• Testing with 10 images and labels

• We test some variants of the basic architecture.

RG

B

Input aerialimage patch

Predicted map patch

FC(4096)C(64, 9x9/2) P(2/1) C(128, 7x7/1) C(128, 5x5/1) FC(768)

Basic architecture

Activation Dropout rate Filter sizeS-ReLU(Basic) ReLU N/A 9-7-5

S-ReLU-Dropout ReLU 0.5 9-7-5

S-Maxout Maxout 0.5 9-7-5

ReLU ReLU N/A 16-4-3

ReLU-Dropout ReLU 0.5 16-4-3

Maxout Maxout 0.5 16-4-3

Tested architectures

Input aerial image Predicted 3-channel label image from the basic architecture

Example of Test Results

Input aerial image

Example of Test Results

Predicted 3-channel label image from the basic architecture

Precision-recall curve

Compare road channel result with Volodymyr Mnih’s result

Compare building channel result with Volodymyr Mnih’s result

Precision-recall curve

Road BuildingS-ReLU(Basic) 0.8905 0.9241

S-ReLU-Dropout 0.8889 0.9220

S-Maxout 0.8842 0.9185

ReLU 0.8657 0.8984

ReLU-Dropout 0.8650 0.8973

Maxout 0.8548 0.8940

Volodymyr 0.8873 0.9150

Precision at breakeven point• Our basic architecture achieved the

best results. • Using Maxout or Dropout, or both

seem not to improve the performance. • The architecture which has smaller

filter size is better than ones with bigger filters.

Conclusion• We propose a CNN-based building and road extraction method for aerial

imagery. • Our method doesn’t need hand-designed image features because the good

feature extractors are automatically constructed by training CNN. • Our CNN predicts building and road regions simultaneously at state-of-the-art

accuracy.

Thank you for your kind attention.

All codes to generate our dataset, perform training of CNN, and test of the

resulting models are available on GitHub.

https://github.com/mitmul/ssai

https://github.com/mitmul/ssai