nonlinear structured prediction using the gpu: deep...

37
Nonlinear Structured Prediction using the GPU: Deep Learning meets Structured Prediction Alexander G. Schwing in collaboration with L.-C. Chen, A. L. Yuille and R. Urtasun GPU Technology Conference, March 18, 2015 A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 1 / 28

Upload: dangtruc

Post on 05-Sep-2018

222 views

Category:

Documents


1 download

TRANSCRIPT

Page 1: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Nonlinear Structured Prediction using the GPU:Deep Learning meets Structured Prediction

Alexander G. Schwing

in collaboration with L.-C. Chen, A. L. Yuille and R. Urtasun

GPU Technology Conference, March 18, 2015

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 1 / 28

Page 2: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Classification

Scene understanding Tag prediction Segmentation

x = image

x = image x = image x = image

s ∈ S : categories

s ∈ S : room layout s ∈ S : tag combos s ∈ S : segmentation

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 2 / 28

Page 3: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Classification Scene understanding

Tag prediction Segmentation

x = image x = image

x = image x = image

s ∈ S : categories s ∈ S : room layout

s ∈ S : tag combos s ∈ S : segmentation

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 2 / 28

Page 4: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Classification Scene understanding Tag prediction

Segmentation

x = image x = image x = image

x = image

s ∈ S : categories s ∈ S : room layout s ∈ S : tag combos

s ∈ S : segmentation

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 2 / 28

Page 5: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Classification Scene understanding Tag prediction Segmentation

x = image x = image x = image x = image

s ∈ S : categories s ∈ S : room layout s ∈ S : tag combos s ∈ S : segmentation

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 2 / 28

Page 6: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Inference

s∗ = arg maxs∈S

F (s, x ,w)

Challenge: The domain size |S| is potentially largeImageNet classification: |S| = 1000Scene understanding: |S| = 504

Tag prediction: |S| = 2Number of tags

Image segmentation: |S| = CNumber of pixels

Computation of F (s, x ,w) for all possible s ∈ S in general often intractable.

Observation: Interest in jointly predicting multiple variables s = (s1, . . . , sn)

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 3 / 28

Page 7: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Assumption: function/model decomposes additively

F (s, x ,w) = F (s1, . . . , sn, x ,w) =∑

r

fr (sr , x ,w)

Restriction r : sr = (si)i∈r

Discrete domain:

f{1,2}(s{1,2}) = f{1,2}(s1, s2) =[f{1,2}(1,1), f{1,2}(1,2), . . .

]Visualization

s1,2

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 4 / 28

Page 8: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

[Werner‘07; Wainwright&Jordan’08; Koller&Friedman‘09]

Example

s∗ = arg maxs

f1(s1, x ,w) + f2(s2, x ,w) + f3(s3, x ,w) + f1,2(s1,2, x ,w) + f2,3(s2,3, x ,w)

s1 s2 s3

s1,2 s2,3

Dual decomposition techniques for distributed inference

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 5 / 28

Page 9: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

How to find the parameters w of the scoring function F (s, x ,w)?

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 6 / 28

Page 10: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Learning [Lafferty et al.’01; Taskar et al.’03; Joachims et al.’04;Hinton et al.’84; Bengio et al.’89 ; LeCun et al.’89;

Salakhutdinov et al.’08; Lee et al.’09; Fergus et al.’10;]

good parameters from annotated examples

D = {(x , s)}

Log-linear models (CRFs, structured SVMs):

F (s, x ,w) = w>F (s, x)

Non-linear models, e.g., CNNs (this talk):

F (s, x ,w)

Dealing with multiple variables using CNNs

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 7 / 28

Page 11: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Learning [Lafferty et al.’01; Taskar et al.’03; Joachims et al.’04;Hinton et al.’84; Bengio et al.’89 ; LeCun et al.’89;

Salakhutdinov et al.’08; Lee et al.’09; Fergus et al.’10;]

good parameters from annotated examples

D = {(x , s)}

Log-linear models (CRFs, structured SVMs):

F (s, x ,w) = w>F (s, x)

Non-linear models, e.g., CNNs (this talk):

F (s, x ,w)

Dealing with multiple variables using CNNs

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 7 / 28

Page 12: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Inference:s∗ = arg max

s∈SF (s, x ,w)

Probability of a configuration s:

p(s | x ,w) =1

Z (x ,w)exp F (s, x ,w)

Z (x ,w) =∑s∈S

exp F (s, x ,w)

Inference alternatively:s∗ = arg max

s∈Sp(s | x ,w)

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 8 / 28

Page 13: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Probability of a configuration s:

p(s | x ,w) =1

Z (x ,w)exp F (s, x ,w)

Z (x ,w) =∑s∈S

exp F (s, x ,w)

Maximize the likelihood of training data via

w∗ = arg maxw

∏(x ,s)∈D

p(s|x ,w)

= arg maxw

∑(x ,s)∈D

F (s, x ,w)− ln∑s∈S

exp F (s, x ,w)

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 9 / 28

Page 14: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Maximum likelihood is equivalent to maximizing cross-entropy:Target distribution: p(x ,s),tg(s) = δ(s = s)Cross-Entropy:

maxw

∑(x ,s)∈D,s

p(x ,s),tg(s) ln p(s | x ;w)

= maxw

∑(x ,s)∈D

ln p(s | x ;w)

= maxw

ln∏

(x ,s)∈D

p(s | x ;w)

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 10 / 28

Page 15: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Program of interest:max

w

∑(x ,s)∈D,s

p(x ,s),tg(s) ln p(s | x ;w)

Optimize via gradient ascent

∂w

∑(x ,s)∈D,s

p(x ,s),tg(s) ln p(s | x ;w)

=∑

(x ,s)∈D,s

(p(x ,s),tg(s)− p(s | x ;w)

) ∂

∂wF (s, x ,w)

=∑

(x ,s)∈D

(Ep(x,s),tg

[∂

∂wF (s, x ,w)

]− Ep(x,s)

[∂

∂wF (s, x ,w)

])

Compute predicted distribution p(s | x ;w)

Use chain rule to pass back difference between prediction and observation

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 11 / 28

Page 16: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Algorithm: Deep Learning

Repeat until stopping criteria

1 Forward pass to compute F (s, x ,w)

2 Compute p(s | x ,w)

3 Backward pass via chain rule to obtaingradient

4 Update parameters w

Why are large output spaces a challenge?

How do we even represent F (s, x ,w) if S is large?How do we compute p(s | x ,w)?

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 12 / 28

Page 17: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Algorithm: Deep Learning

Repeat until stopping criteria

1 Forward pass to compute F (s, x ,w)

2 Compute p(s | x ,w)

3 Backward pass via chain rule to obtaingradient

4 Update parameters w

Why are large output spaces a challenge?

How do we even represent F (s, x ,w) if S is large?How do we compute p(s | x ,w)?

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 12 / 28

Page 18: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Domain size of typical applications:ImageNet classification: |S| = 1000Scene understanding: |S| = 504

Tag prediction: |S| = 2Number of tags

Image segmentation: |S| = CNumber of pixels

Solution:Interest in jointly predicting multiple variables s = (s1, . . . , sn)

Assumption: function/model decomposes additively

F (s, x ,w) = F (s1, . . . , sn, x ,w) =∑

r

fr (sr , x ,w)

Every fr (sr , x ,w) is an arbitrary function, e.g., a CNN

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 13 / 28

Page 19: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

F (s, x ,w) = F (s1, . . . , sn, x ,w) =∑

r

fr (sr , x ,w)

How to compute gradient:

∂w

∑(x ,s)∈D,s

p(x ,s),tg(s) ln p(s | x ;w)

=∑

(x ,s)∈D

(Ep(x,s),tg

[∂

∂wF (s, x ,w)

]− Ep(x,s)

[∂

∂wF (s, x ,w)

])

=∑

(x ,s)∈D,r

(Ep(x,s),r,tg

[∂

∂wfr (sr , x ,w)

]− Ep(x,s),r

[∂

∂wfr (sr , x ,w)

])How to obtain marginals pr (sr |x ,w)?

Approximate marginals br (sr |x ,w) via:Sampling methodsVariational methods

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 14 / 28

Page 20: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Intuition: Standard CNN

CNN

s1

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 15 / 28

Page 21: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Intuition: Independent Prediction

CNN1

s1

CNN2

s2

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 15 / 28

Page 22: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Intuition: Deep Structured Learning

CNN1

s1

CNN2

s2

CNN3

s3

CNN4

s1,2

CNN5

s2,3

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 15 / 28

Page 23: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Approximated Deep Structured Learning[Domke’13; Deng et al.’14;

Tompson et al.’14; Li&Zemel’14]

Sample parallel implementation:

Partition data D onto compute nodesRepeat until stopping criteria

1 Each compute node uses GPU for CNNs forward pass to compute fr (sr , x ,w)

2 Each compute node estimates beliefs br (sr | x ,w) for assigned samples3 Backpropagation of difference using GPU to obtain machine local gradient4 Synchronize gradient across all machines using MPI5 Update parameters w

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 16 / 28

Page 24: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Dealing with large number |D| of training examples:Parallelized across samples (any number of machines and GPUs)Usage of mini batches

Dealing with large output spaces S:Variational approximationsBlending of learning and inference

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 17 / 28

Page 25: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

ImageNet dataset [Russakovsky et al.’14Krizhvsky et al.’13; Simonyan&Zisserman’14; Szegedy et al.’14; Jia et al.’ 14]

Model:

s1

|S| = 10001.2 million training examples50,000 validation examples

Model Top 5 validation set error [%]AlexNet 19.95

DeepNet16 10.29DeepNet19 10.37

Different from reported results because of missing averaging, different image crops, etc.

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 18 / 28

Page 26: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Visual results

Groundtruth: Alp

s1 F (s1, x ,w)

Ski 85.28%Alp 9.10%Shovel 4.86%King penguin 0.14%Dogsled 0.10%

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 19 / 28

Page 27: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Visual results

Groundtruth: Puma

s1 F (s1, x ,w)

Puma 99.87%Lynx 0.06%Koala 0.03%Lion 0.01%Egyptian cat 0.00%

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 19 / 28

Page 28: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Visual results

Groundtruth: Cradle

s1 F (s1, x ,w)

Diaper 12.51%Beagle 10.98%Teddy bear 8.12%Pajama 7.06%Crib 5.22%

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 19 / 28

Page 29: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Layout dataset [Hoiem et al.’05; Hedau et al.’09]

Given a single image x , predict a 3D parametric box that best describes the observedroom layout

|S| = 504

Linear model205 training examples104 test examples

s1 s2 s3 s4

s1,3 s1,4 s2,3 s2,4

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 20 / 28

Page 30: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Pixel-wise prediction errors [%] on layout dataset:

OM GC OM + GC Others[Hoiem07] - 28.9 - -[Hedau09] - 21.2 - -[Wang10] 22.2 - - -[Lee10] 24.7 22.7 18.6 -[Pero12] - - - 16.3

Ours 18.63 15.35 13.59 -

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 21 / 28

Page 31: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Flickr dataset[Huiskes&Lew’08]

Assign a subset of 38 possible tags to a given image x

Model: K38

|S| = 238

10000 training examples10000 test examples

Training method Prediction error [%]Unary only 9.36Piecewise 7.70

Joint (with pre-training) 7.25

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 22 / 28

Page 32: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Visual results

female/indoor/portrait sky/plant life/tree water/animals/sea indoor/flower/plant lifefemale/indoor/portrait sky/plant life/tree water/animals/sky ∅

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 23 / 28

Page 33: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Learned class correlations:

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 24 / 28

Page 34: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Pascal VOC 2012 dataset[Everingham et al.’14; Chen et al.’15; Long et al.’15; Zheng et al.’15

Krähenbühl&Koltun’11,’13; Vineet et al.’12]

Model: K350·500

|S| = 21350·500

≈ 10000 training examples≈ 1500 validation examples

Training method Mean IoU Accuracy [%]Unary only 61.476

Joint 64.060

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 25 / 28

Page 35: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Visual results

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 26 / 28

Page 36: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Thanks

collaborators L.-C. Chen, A. L. Yuille and R. UrtasunNVIDIA for donating one Tesla K40 GPU

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 27 / 28

Page 37: Nonlinear Structured Prediction using the GPU: Deep ...on-demand.gputechconf.com/gtc/2015/presentation/S5368-Alexander... · Nonlinear Structured Prediction using the GPU: Deep Learning

Nonlinear Structured PredictionModeling of correlations between variablesNonlinear dependence on parametersJoint training of many convolutional neural networksDistributed onto multiple compute nodesEach using a GPU

http://alexander-schwing.de

A. G. Schwing (University of Toronto) Structured Prediction & Deep Learning 2014 28 / 28