
Week 6, Lecture 1: Deep Learning

(based on lectures by Fuxin Li, CS 519: Deep Learning)

Announcements: HW 3 Due TODAY

Midterm Exam on 11/6

Reading: Survey paper on Deep Learning (Schmidhuber 2015)

ROB 537: Learning-Based Control

• In weeks 1-3 we talked about neural networks
• Message was: Neural networks use data to learn a mapping from inputs to outputs … with a few caveats

Learning: Mapping Inputs to Outputs


Recall: Dog vs. Cat

• Dog vs. Cat:  Cat


Recall: Dog vs. Cat or ???

• Dog vs. Cat:  Cat
• Movement vs. stationary:  "Dog" maybe
• Indoor vs. outdoor:  "Dog"
• Red vs. not red animal:  "Dog"


Let’s revisit what happens in such a mapping

Input: X  →  ML  →  Output: Y

• images/video → ML → Label: "Motorcycle", suggest tags, image search, …
• audio → ML → speech recognition, music classification, speaker identification, …
• text → ML → web search, anti-spam, machine translation, …

We want to map this picture to a label … → ML → "motorcycle"


Why is this hard?

You see this:

But the camera sees this:

Raw Representation

[Figure: a raw motorbike image is fed into a learning algorithm; Motorbikes and "Non"-Motorbikes are plotted in raw pixel space (pixel 1 vs. pixel 2)]


What we Want

[Figure: the raw image is first mapped to a feature representation (handlebars, wheels), which is fed to the learning algorithm; Motorbikes and "Non"-Motorbikes are plotted in feature space (handlebars vs. wheels)]

E.g., does it have handlebars? Wheels?


• Wheels and handlebars "represent" key aspects of a motorcycle
• Looking for those allows an algorithm to recognize a motorbike
• What we did is "feature engineering"
  – Identify key features using domain knowledge
  – Extract key features from the image
  – Map key features to labels

Representation

• Feature engineering requires
  – Domain knowledge
  – Specific to data sets
  – Labor intensive
• How about feature learning?
  – Edges, corners
  – Circles
  – Shapes?
  – …

Feature Engineering vs. Feature Learning

Deep Learning


Deep Learning: Let’s learn the representation!

pixels → edges, corners → shapes → object parts (nose, eye) → object models (Joe … happy)

Deep Learning

Neural network architecture with many layers

… wait a minute…

… is this new? Different than "just" neural networks?

Yes and No


• Cybernetics
• Connectionism
• Deep Learning

History of Neural Networks

• Cybernetics: 1950s and 60s. Perceptron (Rosenblatt, 1957); Adaline, Madaline (Widrow and Hoff, 1959)
• Connectionism: 1980s and early 90s. Backpropagation (Rumelhart, Hinton, Williams 1986; Werbos 1981); universal approximation theorems (Cybenko 1989, Hornik et al. 1991)
• Deep Learning: 2005 onward (mostly 2010s)

Three Waves of Neural Networks


• Cybernetics: 1950s and 60s
  1970s: Disillusionment 1 - XOR (Minsky, Papert 1969)
• Connectionism: 1980s and early 90s
  ~1995-2005: Disillusionment 2 - (Support Vector Machines …)
• Deep Learning: 2005 onward (mostly 2010s)

Three Waves of Neural Networks

First Two Waves Focused on

[Figure: a one-hidden-layer NN and a two-hidden-layer NN]


Third Wave: Deep Learning Neural Networks

[Figure: a deep network mapping pixels → edges → shapes → object parts → object models]



Why Didn’t Deep NN Idea Catch on Before?

• Backpropagation doesn't like too many layers
  – Gradient either goes to zero or blows up
• Training requires a lot of labeled training data
  – How do you get millions of labeled images?
• The learning time does not scale well
• NNs may overfit, especially with many hidden layers
• Not enough computing power

Why Didn’t Deep NN Idea Catch on Before?

13

Back to Deep Learning

pixels → edges → shapes → object parts → object models

What's going on in these layers?

• You're applying a filter

What’s going on between Layers?


• Subgradient
• Rectifier Linear Units (ReLU)
• Pooling
• Stride
• Padding

Some Basic Concepts

Subgradient

• What if a function is non-differentiable?
• Subgradient: an approximate derivative
• For a convex function f(x), c is a subgradient at x0 if

  f(x) − f(x0) ≥ c (x − x0)   for all x
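As a concrete example (a small numerical check in Python; the choice f(x) = |x| is an illustration, not from the slides), every c in [-1, 1] is a subgradient of |x| at x0 = 0:

```python
import numpy as np

# Subgradient check for the convex, non-differentiable function f(x) = |x| at x0 = 0.
# Any c in [-1, 1] satisfies f(x) - f(x0) >= c * (x - x0) for all x.
f = np.abs
x0 = 0.0
xs = np.linspace(-5, 5, 1001)

for c in [-1.0, -0.5, 0.0, 0.7, 1.0]:
    ok = np.all(f(xs) - f(x0) >= c * (xs - x0))
    print(f"c = {c:+.1f} is a subgradient at x0 = 0: {ok}")   # True for every c in [-1, 1]
```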


Rectifier Linear Units

Activation function: Rectifier Linear Units (ReLU)

  f(x) = max(x, 0)

Subgradient:

  f'(x) = 1 if x > 0, 0 otherwise
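In code, a minimal NumPy sketch of the activation and its subgradient:

```python
import numpy as np

def relu(x):
    # ReLU activation: f(x) = max(x, 0), applied elementwise
    return np.maximum(x, 0.0)

def relu_subgradient(x):
    # Subgradient: 1 where x > 0, 0 otherwise
    # (at x = 0 any value in [0, 1] would also be a valid subgradient)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))              # [0.  0.  0.  0.5 3. ]
print(relu_subgradient(x))  # [0. 0. 0. 1. 1.]
```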

Pooling

• Pooling is downsampling.
• You can average, take the max, etc.
• Example: 2x2 max pooling

Why? Because the exact location of an object (or edge or face) doesn't matter.
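A minimal NumPy sketch of 2x2 max pooling (the 4x4 input values are placeholders):

```python
import numpy as np

def maxpool2x2(x):
    # 2x2 max pooling with stride 2: keep the largest value in each
    # non-overlapping 2x2 block, halving both spatial dimensions
    h, w = x.shape
    x = x[:h - h % 2, :w - w % 2]                 # crop to even size if needed
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

x = np.array([[1, 3, 2, 4],
              [5, 6, 1, 0],
              [7, 2, 8, 3],
              [0, 1, 4, 9]])
print(maxpool2x2(x))
# [[6 4]
#  [7 9]]
```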


Stride

• How much the filter moves at each step
• Stride 1: 7x7 to 5x5
• Stride 2: 7x7 to 3x3
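These sizes follow the standard output-size formula; a small sketch (assuming a 3x3 filter, as in the examples above):

```python
def conv_output_size(n, k, stride=1, padding=0):
    # floor((n + 2*padding - k) / stride) + 1
    return (n + 2 * padding - k) // stride + 1

print(conv_output_size(7, 3, stride=1))  # 5 -> a 7x7 input with a 3x3 filter gives 5x5
print(conv_output_size(7, 3, stride=2))  # 3 -> with stride 2 it gives 3x3
```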

Padding

• Notice that as we applied filters, our dimension decreased
• What happens when you apply many layers?
• Padding keeps the dimensionality of the previous layer
• Take a 5x5 input and apply a 3x3 filter:

[Figure: the 5x5 input surrounded by a border of zeros (padding), with the 3x3 filter sliding over it]
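A minimal NumPy sketch of zero-padding (the 5x5 input here is a placeholder):

```python
import numpy as np

x = np.arange(25).reshape(5, 5)      # a 5x5 input
padded = np.pad(x, pad_width=1)      # add a one-pixel border of zeros
print(padded.shape)                  # (7, 7)
# A 3x3 filter applied to the 7x7 padded input produces a 5x5 output,
# so the layer keeps the dimensionality of the previous layer.
```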


• Exploit structure in the image
• Neighboring pixels carry local correlation
• Shapes carry long-range correlation

Convolutional Neural Networks

The Convolution Operator

[Figure: an input image * (convolved with) a Sobel filter yields an edge map]
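For instance, a small SciPy sketch (the random image is a placeholder; the kernel shown is the horizontal-gradient Sobel filter):

```python
import numpy as np
from scipy.signal import convolve2d

# Sobel filter that responds to horizontal intensity changes (vertical edges)
sobel_x = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])

image = np.random.rand(64, 64)                   # placeholder grayscale image
edges = convolve2d(image, sobel_x, mode='same')  # convolve the image with the filter
print(edges.shape)                               # (64, 64)
```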


2D Convolution with Padding and Stride 1

Input (a 3x3 image, zero-padded to 5x5):

   0   0   0   0   0
   0   1   3   1   0
   0   0  -1   1   0
   0   2   2  -1   0
   0   0   0   0   0

Filter (3x3):

  -2  -2   1
  -2   0   1
   1   1   1

With the filter over the top-left position, only two entries under it are nonzero, so the first output element is

  3×1 + (−1)×1 = 2


Sliding the filter one position to the right gives the second output element:

  1×(−2) + 1×1 + 1×(−1) + 1×1 = −1


Repeating this at every position (stride 1) fills in the full 3x3 output:

   2  -1  -6
   4  -3  -5
   1  -2  -4
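A minimal NumPy sketch of this walkthrough (the filter is applied without flipping, i.e. as cross-correlation, which matches the arithmetic above):

```python
import numpy as np

image = np.array([[1,  3,  1],
                  [0, -1,  1],
                  [2,  2, -1]])

kernel = np.array([[-2, -2, 1],
                   [-2,  0, 1],
                   [ 1,  1, 1]])

padded = np.pad(image, 1)                  # zero padding, as on the slide
out = np.zeros_like(image)
for i in range(out.shape[0]):              # stride 1: slide one position at a time
    for j in range(out.shape[1]):
        window = padded[i:i + 3, j:j + 3]
        out[i, j] = np.sum(window * kernel)

print(out)
# [[ 2 -1 -6]
#  [ 4 -3 -5]
#  [ 1 -2 -4]]
```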

What’s the shape of weights and input

[Figure: a stack of Convolution + ReLU layers; with, say, 64 filters in the first layer, its filters are 3x3x3 and the next layer's filters are 3x3x64, …]


What’s the shape of weights and input

• e.g., 64 filters at level 1, 128 filters at level 2

  Input:     224 x 224 x 3
  Weights:   3 x 3 x 3 x 64 (level 1), 3 x 3 x 64 x 128 (level 2)
  Output 1:  224 x 224 x 64
  Output 2:  224 x 224 x 128
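A minimal PyTorch sketch of these shapes (assuming 3x3 filters with padding 1, so the 224x224 spatial size is preserved):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 224, 224)                       # one 224x224 RGB image (N, C, H, W)

conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)    # level 1: 64 filters of size 3x3x3
conv2 = nn.Conv2d(64, 128, kernel_size=3, padding=1)  # level 2: 128 filters of size 3x3x64

out1 = torch.relu(conv1(x))
out2 = torch.relu(conv2(out1))
print(conv1.weight.shape)  # torch.Size([64, 3, 3, 3])
print(conv2.weight.shape)  # torch.Size([128, 64, 3, 3])
print(out1.shape)          # torch.Size([1, 64, 224, 224])
print(out2.shape)          # torch.Size([1, 128, 224, 224])
```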

Dramatic reduction in the number of parameters

• A fully-connected NN on a 10-class, 256 x 256 image task with 500 hidden units
  – Num. of params = 65536 * 3 * 500 + 500 * 10 = 98.3 Million
• A 1-hidden-layer convolutional network on a 256 x 256 image with 11x11 filters and 500 hidden units?
  – Num. of params = 11 * 11 * 3 * 500 + 500 * 10 = 186,500
• A 2-hidden-layer convolutional network on a 256 x 256 image with 11x11 and 3x3 sized filters and 500 hidden units in each layer?
  – Num. of params = 181,500 + 3 * 3 * 500 * 500 + 500 * 10 ≈ 2.4 Million
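As a quick sanity check of this arithmetic (a sketch that, like the slide, ignores bias terms):

```python
def fc_params(n_in, n_hidden, n_out):
    # fully connected: every input feeds every hidden unit, plus the output layer
    return n_in * n_hidden + n_hidden * n_out

def conv_layer_params(k, c_in, n_filters):
    # convolutional: each of the n_filters shared filters is k x k x c_in
    return k * k * c_in * n_filters

print(fc_params(256 * 256 * 3, 500, 10))             # 98309000  (~98.3 Million)
print(conv_layer_params(11, 3, 500) + 500 * 10)      # 186500
print(conv_layer_params(11, 3, 500)
      + conv_layer_params(3, 500, 500) + 500 * 10)   # 2436500  (~2.4 Million)
```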


Recall: Deep Learning Neural Network Flow

[Figure: a deep neural network with many hidden layers]

Convolution + Rectifier Linear Unit

• We need nonlinearity
• Makes the gradient sparser and simpler to compute


Convolution + ReLU + Pooling

• Pooling allows invariance of features
• Pooling makes higher layers' filters cover a larger region of the input

State of the Art in Deep Learning


Computer Vision – Image Classification

• ImageNet
  – Over 1 million images, 1000 classes, different sizes, avg 482x415, color
• 16.42%: deep CNN with dropout, 2012
• 6.66%: 22-layer CNN (GoogLeNet), 2014
• 3.6%: super-human performance (Microsoft Research Asia), 2015

Sources: Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks; Lee et al., Deeply-Supervised Nets, 2014; Szegedy et al., Going Deeper with Convolutions, ILSVRC 2014; Sanchez & Perronnin, CVPR 2011; http://www.clarifai.com/; Benenson, http://rodrigob.github.io/are_we_there_yet/build/classification_datasets_results.html

Impact on speech recognition


Unsupervised Deep Learning

• CNNs are most successful with a lot of training examples
• What can we do if we do not have any training examples?
  – Or have very few of them?

Dimensionality Reduction: Principal Component Analysis

• Project data onto a new subspace
• Bases are orthogonal
• Optimal under some assumptions (Gaussian)

Assumptions almost never true in real data
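A minimal NumPy sketch of PCA via the SVD (the random data here is just a placeholder):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))      # 200 samples in 5 dimensions (placeholder data)

Xc = X - X.mean(axis=0)            # center the data
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

k = 2                              # keep the top-2 principal components
Z = Xc @ Vt[:k].T                  # project onto the orthogonal basis
print(Z.shape)                     # (200, 2)
```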


Neural Network as PCA: Autoencoder

[Figure: input vector → code → input vector (reconstruction)]

• Standard neural network, but the output is the input
• Goal: minimize reconstruction error

Deep Autoencoder

[Figure: input vector → many encoding layers → code → many decoding layers → reconstructed input vector]
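A minimal PyTorch sketch of an autoencoder trained to reconstruct its input (the layer sizes and data are illustrative assumptions, not from the slides):

```python
import torch
import torch.nn as nn

# Encoding layers compress the input to a small code; decoding layers reconstruct it
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(),   # encoder
    nn.Linear(128, 32),  nn.ReLU(),   # 32-dimensional "code"
    nn.Linear(32, 128),  nn.ReLU(),   # decoder
    nn.Linear(128, 784),
)

x = torch.rand(16, 784)               # a batch of 16 flattened inputs (placeholder data)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(100):
    x_hat = model(x)
    loss = ((x_hat - x) ** 2).mean()  # reconstruction error between output and input
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(loss.item())                    # reconstruction error after a few training steps
```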


Engineering applications:
• Computer vision
• Speech recognition
• Natural language understanding
• Robotics


Deep Learning in Neural Networks

Back to … Deep Learning

Neural network architecture with many layers

… wait a minute…

… is this new? Different than "just" neural networks?

Yes and No


Why Didn’t Deep NN Idea Catch on Before?

• Not clear what deep fully connected networks learn
• Backpropagation doesn't like too many layers
  – Gradient either goes to zero or blows up
• Training requires a lot of labeled training data
  – Million+ labeled images (Amazon Mechanical Turk)
• The learning time does not scale well
• NNs tend to overfit, especially with many hidden layers
• Not enough computing power

So What’s New ?

•  ConvoluYon  neural  networks    

•  New  acYvaYon  funcYons  and  subgradients  –  ReLU  

•  Everything  is  on  the  internet  …  A  lot  of  labeled  data  –  Million+  labeled  images  (Amazon  Mechanical  Turk)  

•  More  efficient  learning  algorithms:  stochasYc  gradient  descent  

•  Pooling  (dimensionality  control)    

•  Much  befer  compuYng  power  …  GPUs  


What’s Next ?

Deep  Learning:  Methods  and  ApplicaYons,  L.  Deng  and  D.  Yu,  “FoundaYons  and  Trends  in  Signal  Processing,”  Vol.  7,  Nos.  3–4,  197–387,  2013.  (also,  hfp://en.wikipedia.org/wiki/Hype_cycle  )  

What’s Next?

Deep  Learning:  Methods  and  ApplicaYons,  L.  Deng  and  D.  Yu,  “FoundaYons  and  Trends  in  Signal  Processing,”  Vol.  7,  Nos.  3–4,  197–387,  2013.