“Deep” Learning
demo.clab.cs.cmu.edu/.../files/slides/26-deep-learning.pdf · 2020. 4. 21.


Page 1

“Deep” Learning

Page 2

Big picture: natural language analyzers

[Figure: a natural language analyzer mapping input signals to output analyses]

Natural language input signal:
- Web page
- Question
- Search query
- Tweet
- Voice command

Output analysis:
- Question
- Answer
- Command to a robot
- Trending topics

Page 3

Big picture: natural language analyzers

[Figure: the analyzer expanded into components - sentiment analyzer, speech recognizer, tokenizer, POS tagger, syntactic parser, semantic parser, machine translator, named entity recognizer, spell corrector, coreference resolution, classification]

Natural language input signal:
- Web page
- Question
- Search query
- Tweet
- Voice command

Output analysis:
- Question
- Answer
- Command to a robot
- Trending topics

Page 4

Today: deep learning for NLP components

[Same figure and input/output lists as Page 3]

Page 5

Agenda

• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results

Page 6

Running example: document classification

[Figure: the component diagram from Page 3 with "classification" highlighted: document → category]

Page 7

Running example: document classification

Categories:
• Politics
• Business
• Science
• Sports
• Health

output = argmax_l f(l, d)

Example: d = "Barcelona lost to Real Madrid" → l = Sports

Page 8

How to define f(l, d): linear models

Linear models: f(l, d) = w · g(l, d)

[Figure: a sparse binary feature vector g(l, d), e.g., (0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, …), dotted with a weight vector w, e.g., (0.4, -1.2, 0.2, 0.2, -0.4, -1.0, 5.1, 1.1, 2.3, 0.8, -0.1, …); example features: the number of times "Barcelona" appears in a document labeled Sports, the number of times "lost" appears in a document labeled Sports]

l = Sports, d = "Barcelona lost to Real Madrid"
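To make the scoring function concrete, here is a minimal sketch of such a linear classifier with label-conjoined bag-of-words features; the label set matches the slide, while the vocabulary and the random weights are hypothetical stand-ins for learned values.

```python
import numpy as np

# Hypothetical vocabulary for illustration; w would be learned, not random.
LABELS = ["Politics", "Business", "Science", "Sports", "Health"]
VOCAB = {"barcelona": 0, "lost": 1, "to": 2, "real": 3, "madrid": 4}

def g(label, doc):
    """Joint features g(l, d): word counts, placed in the label's block."""
    x = np.zeros(len(LABELS) * len(VOCAB))
    offset = LABELS.index(label) * len(VOCAB)
    for word in doc.lower().split():
        if word in VOCAB:
            x[offset + VOCAB[word]] += 1
    return x

w = np.random.randn(len(LABELS) * len(VOCAB))  # learned in practice

def predict(doc):
    # output = argmax_l f(l, d) = argmax_l w . g(l, d)
    return max(LABELS, key=lambda label: w @ g(label, doc))

print(predict("Barcelona lost to Real Madrid"))
```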

Page 9

How to define f(l, d): linear models

Linear models: f(l, d) = w · g(l, d)
- Easy to implement
- Easy to optimize w

Two possible improvements:
- Define more complex functions
- Find better representations of (l, d)

Figure credits: Barbara Rosario

Page 10

Agenda

• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results

Page 11

neural network v1.0: linear model

Linear models: f(l, d) = w · g(l, d) = w(l) · x(d)
e.g., y_1 = x_1 w_{1,1} + x_2 w_{2,1} + x_3 w_{3,1} + x_4 w_{4,1} + x_5 w_{5,1} = w(1) · x(d)

[Figure: a one-layer network; inputs x_i count word occurrences (e.g., the number of times "Barcelona" or "lost" appears in a document), outputs y are scores for Politics, Science, and Sports, with weights w_{1,1} … w_{5,3} on the edges]

Page 12

neural network v1.0: linear model

Linear models: f(l, d) = w · g(l, d) = w(l) · x(d)
e.g., y_1 = x_1 w_{1,1} + x_2 w_{2,1} + x_3 w_{3,1} + x_4 w_{4,1} + x_5 w_{5,1} = w(1) · x(d)

[Figure: the same network written as a matrix-vector product y = W x, with outputs Politics, Science, Sports]

similar words still share no parameters!

Page 13

neural network v2.0: representation learning

Big idea: induce low-dimensional dense feature representations of high-dimensional objects

Page 14

neural network v2.1: representation learning

Big idea: embed words in a dense vector space and use the word embeddings as dense features

[Figure: the network y = W x, now with a dense input vector x]

Did this really solve the problem?
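As a concrete illustration, a sketch of an embedding table used to build dense features; the vocabulary and dimensionality are hypothetical, and real embeddings would be learned (or pretrained) rather than random as here.

```python
import numpy as np

# Hypothetical vocabulary; E would be learned jointly or pretrained.
vocab = {"barcelona": 0, "lost": 1, "to": 2, "real": 3, "madrid": 4}
E = np.random.randn(len(vocab), 50) * 0.01  # |V| x 50 embedding matrix

def embed_document(doc):
    """Dense x(d): the average of the document's word embeddings."""
    ids = [vocab[w] for w in doc.lower().split() if w in vocab]
    return E[ids].mean(axis=0)

x = embed_document("Barcelona lost to Real Madrid")  # 50-dim dense features
```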

Page 15

neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: a one-layer network x → y with weights W]

y = W x

Page 16

neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: a two-layer network x → h1 → y with weights W1 and W2]

y = W2 h1 = W2 (W1 x) = W x

Wait! Is that true?!

Page 17

neural network v3.0: complex functions

Big idea: define more complex functions by adding a hidden layer

[Figure: the two-layer network x → h1 → y; h1 are induced features, and a plot shows the logistic curve a1(z)]

y = W2 h1 = W2 a1(W1 x)

a1 is a non-linear function, e.g., the logistic function a1(z) = 1 / (1 + e^{-z})

Universal approximation theorem: Cybenko, G. (1989)
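A minimal sketch of this forward pass; the layer sizes and random weights are hypothetical placeholders for learned parameters.

```python
import numpy as np

def logistic(z):
    # a1(z) = 1 / (1 + e^{-z}), applied elementwise
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.1, size=(20, 50))  # hypothetical: 50-dim x, 20 hidden units
W2 = rng.normal(scale=0.1, size=(3, 20))   # 3 output labels

def forward(x):
    h1 = logistic(W1 @ x)  # induced features
    return W2 @ h1         # y = W2 a1(W1 x): no longer collapsible to W x

y = forward(rng.normal(size=50))
```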

Page 18

neural network v3.0: complex functions

Popular activation/transfer/non-linear functions:

[Figure: a table of activation functions]

https://en.wikipedia.org/wiki/Activation_function
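For reference, a short sketch of a few common choices (standard definitions, as on the linked page):

```python
import numpy as np

# Popular activation functions, applied elementwise.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))  # logistic: squashes to (0, 1)

def tanh(z):
    return np.tanh(z)                # squashes to (-1, 1)

def relu(z):
    return np.maximum(0.0, z)        # rectified linear unit
```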

Page 19

neural network v3.5: "deeper" networks

[Figure: a three-layer network x → h1 → h2 → y with weights W1, W2, W3]

y = W3 h2 = W3 a2( W2 a1(W1 x) )

Wait, but why do we need more layers?
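Generalizing the pattern, a sketch of the forward pass for an arbitrary stack of layers; the choice of activation and the linear output layer are assumptions for illustration.

```python
import numpy as np

def deep_forward(x, weights, activation=np.tanh):
    """y = W_n a(... a(W_2 a(W_1 x)) ...): apply each hidden layer's
    weights followed by a non-linearity; keep the final layer linear."""
    h = x
    for W in weights[:-1]:
        h = activation(W @ h)
    return weights[-1] @ h
```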

Page 20

neural network v3.5: "deeper" networks

[Same figure and equation as Page 19]

Page 21

neural network v4.0: recurrent neural networks

Big idea: use hidden layers to represent sequential state

[Figure: a feed-forward network (x → y) next to a recurrent network (x1, x2, x3 → y)]

Figure credits: Andrej Karpathy

How did we represent x for document classification?
As a bag of words: "Real …. Madrid" = "Real Madrid"
With sequential state: "Real …. Madrid" ≠ "Real Madrid"

Do we share parameters across states?

Page 22

neural network v4.0: recurrent neural networks

[Figure: an unrolled recurrent network]

Figure credits: Christopher Olah

How to compute the hidden layers?
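One standard answer is the simple (Elman-style) recurrence sketched below; this is an illustrative formulation, not necessarily the one in the slide's figure. Note that the same parameters are shared across all time steps.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b):
    """h_t = tanh(W_hh h_{t-1} + W_xh x_t + b); parameters are shared
    across time steps."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b)

def run_rnn(xs, W_hh, W_xh, b):
    h = np.zeros(W_hh.shape[0])  # initial hidden state
    for x_t in xs:
        h = rnn_step(x_t, h, W_hh, W_xh, b)
    return h  # final state summarizes the whole sequence
```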

Page 23

neural network v4.1: output sequences

[Figure: recurrent architectures that produce output sequences]

Figure credits: Andrej Karpathy

Page 24

neural network v4.1: output sequences

Example: character-level language models

Figure credits: Andrej Karpathy

Page 25

neural network v4.1: output sequences

Credits: Andrej Karpathy

Sample output: Copyright was the succession of independence in the slop of Syrian influence that was a famous German movement based on a more popular servicious, non-doctrinal and sexual power post. Many governments recognize the military housing of the [[Civil Liberalization and Infantry Resolution 265 National Party in Hungary]], that is sympathetic to be to the [[Punjab Resolution]] (PJS)[http://www.humah.yahoo.com/guardian.cfm/7754800786d17551963s89.htm Official economics Adjoint for the Nazism, Montgomery was swear to advance to the resources for those Socialism's rule, was starting to signing a major tripad of aid exile.]]

Page 26

neural network v4.2: Long Short-Term Memory

[Figure: a regular RNN cell (a single tanh layer) next to an LSTM cell (four interacting gates)]

Figure credits: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/

Page 27

neural network v4.2: Long Short-Term Memory

[Figure: the LSTM cell in detail]

Figure credits: Christopher Olah, http://colah.github.io/posts/2015-08-Understanding-LSTMs/
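A sketch of one LSTM step in the standard formulation; stacking all four gates into one weight matrix W is a common convention assumed here, not something the slides specify.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """W maps the concatenation [h_{t-1}; x_t] to 4H gate pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    H = h_prev.shape[0]
    f = sigmoid(z[0*H:1*H])   # forget gate: what to erase from the cell
    i = sigmoid(z[1*H:2*H])   # input gate: what to write to the cell
    o = sigmoid(z[2*H:3*H])   # output gate: what to expose as h_t
    g = np.tanh(z[3*H:4*H])   # candidate cell values
    c_t = f * c_prev + i * g  # cell state carries long-range memory
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```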

Page 28

neural network v4.3: bidirectional RNNs

[Figure: unidirectional RNNs vs. bidirectional RNNs, which run one pass left-to-right and another right-to-left]

Figure credits: Christopher Olah

Page 29

neural network v4.4: attention models

[Figure: an attention-based encoder-decoder]

Bahdanau et al. (2015)
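As a sketch of the core idea: weight the encoder states by their relevance to the current decoder state and take the weighted sum as a context vector. Bahdanau et al. (2015) score relevance with a small additive MLP; the dot-product score below is a simplification.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def attend(decoder_state, encoder_states):
    """encoder_states: (T, H); decoder_state: (H,). Returns a context vector."""
    scores = encoder_states @ decoder_state  # one relevance score per position
    alpha = softmax(scores)                  # attention weights sum to 1
    return alpha @ encoder_states            # weighted sum of encoder states
```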

Page 30

neural network v5: convolutional NN

[Figure: a convolutional neural network]

Figure credits: Christopher Olah

Page 31

neural network v5: convolutional NN

[Figure: a fully connected feed-forward NN next to a convolutional NN, whose units each look at a local window of the input]

Figure credits: Christopher Olah

Page 32

neural network v5: convolutional NN

[Figure: convolutional layer 1 feeding convolutional layer 2]

Do we share parameters of different convolutions? In the same layer? In different layers?

Figure credits: Christopher Olah

Page 33: “Deep”&Learning&demo.clab.cs.cmu.edu/.../files/slides/26-deep-learning.pdf · 2020. 4. 21. · 3 sen

convolu<onal  layer  2    convolu<onal  layer  1  

33  

neural  network  v5:  convolu<onal  NN  

Figure  credits:  Christopher  Olah  

convolu<onal  layer  2    Max  pooling  layer    convolu<onal  layer  1  

2D  convolu<ons  
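To answer the parameter-sharing question concretely, a sketch of a 1-D convolution over a token sequence with max pooling; the dimensions are hypothetical. Within a layer, the same filter weights W are applied at every position; different layers have their own filters.

```python
import numpy as np

def conv1d(X, W, b):
    """X: (seq_len, d_in); W: (width, d_in, d_out); b: (d_out,).
    Slides the same filters over every window of the sequence.
    Returns (seq_len - width + 1, d_out)."""
    width = W.shape[0]
    out = np.stack([
        np.einsum('wd,wdo->o', X[t:t + width], W) + b
        for t in range(X.shape[0] - width + 1)
    ])
    return np.maximum(out, 0.0)  # ReLU non-linearity

def max_pool(H):
    """One value per filter: the maximum over all positions."""
    return H.max(axis=0)
```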

Page 34

neural network v5.1: recursive NNs

Page 35

neural network v6: dropout
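The slide's figure is not in this transcript; as a sketch, here is the "inverted dropout" variant, which rescales at training time so the network is used unchanged at test time (the original formulation instead rescales weights at test time).

```python
import numpy as np

def dropout(h, p_keep=0.5, rng=np.random.default_rng()):
    """Randomly zero each hidden unit with probability 1 - p_keep,
    scaling survivors so the expected activation is unchanged."""
    mask = rng.random(h.shape) < p_keep
    return (h * mask) / p_keep  # at test time, simply return h
```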

Page 36

Agenda

• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results

Page 37

How to train NN models?
• argmax_l f(d, l) only tells us which label to predict.
• Supervised learning (we need input/output pairs).
• Loss function: e.g., cross-entropy between the empirical distribution and the model distribution (sketched below):

-log p(l* | d) = -log ( e^{f(d, l*)} / Σ_l e^{f(d, l)} )

[Figure: the network x → h1 → y with a softmax layer on top]

Regression problems? Mean squared error: E[(y - y*)^2]
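A minimal sketch of the softmax cross-entropy loss for one training pair (d, l*), where `scores` holds f(d, l) for every label and `gold` is the index of l*:

```python
import numpy as np

def cross_entropy_loss(scores, gold):
    z = scores - scores.max()                # shift for numerical stability
    log_probs = z - np.log(np.exp(z).sum())  # log softmax over labels
    return -log_probs[gold]                  # -log p(l* | d)
```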

Page 38

How to optimize the loss?

• Stochastic gradient descent
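A sketch of the SGD loop; `grad_loss` is a hypothetical function returning the gradient of the per-example loss with respect to the parameters (computed in practice by automatic differentiation, discussed below).

```python
def sgd(params, data, grad_loss, lr=0.1, epochs=10):
    """Update a NumPy parameter vector after each training example:
    theta <- theta - lr * gradient of the example's loss."""
    for _ in range(epochs):
        for d, gold in data:
            params -= lr * grad_loss(params, d, gold)
    return params
```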

 

Page 39

How to optimize the loss?

• Parameter initialization
– Break the symmetry
– Use small values
– Random restarts
– Popular choice: uniform with mean zero and variance = 1 / size of the previous layer (see the sketch below)

http://andyljones.tumblr.com/post/110998971763/an-explanation-of-xavier-initialization
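A sketch of that Xavier-style choice: for Uniform(-a, a) the variance is a^2 / 3, so a = sqrt(3 / fan_in) gives variance 1 / fan_in.

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng=np.random.default_rng()):
    """Zero-mean uniform weights with variance 1 / fan_in."""
    a = np.sqrt(3.0 / fan_in)
    return rng.uniform(-a, a, size=(fan_out, fan_in))
```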

Page 40

How to optimize the loss?

• Other optimization methods
– Variants of stochastic gradient descent (e.g., averaged SGD, SGD with momentum); see http://research.microsoft.com/pubs/192769/tricks-2012.pdf
– Adagrad
– Adam
– Adadelta

Page 41

How to optimize the loss?

• Computing gradients: the hard way
– Analytically derive the expression that represents the gradient with respect to each input.
– Compute that expression.

• Computing gradients: automatic differentiation
– Translate the loss function into a sequence of atomic operations.
– Hard-code the differentiation of each atomic operation with respect to its inputs.
– Recursively compute the gradient of the loss function with respect to the model parameters using the chain rule (see the sketch below).
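A toy reverse-mode sketch of that recipe, with addition and multiplication as the atomic operations; a real implementation would topologically sort the computation graph rather than recurse, but for a simple chain this illustrates the mechanics.

```python
class Var:
    """A value that records its parents and their hard-coded local derivatives."""
    def __init__(self, value, parents=()):
        self.value, self.parents, self.grad = value, parents, 0.0

    def __mul__(self, other):
        # d(xy)/dx = y and d(xy)/dy = x: the local derivatives
        return Var(self.value * other.value,
                   [(self, other.value), (other, self.value)])

    def __add__(self, other):
        return Var(self.value + other.value, [(self, 1.0), (other, 1.0)])

    def backward(self, upstream=1.0):
        # Chain rule: accumulate upstream gradient times the local derivative.
        self.grad += upstream
        for parent, local in self.parents:
            parent.backward(upstream * local)

# Usage: y = w * x + b, so dy/dw = x = 3.0
w, x, b = Var(2.0), Var(3.0), Var(1.0)
y = w * x + b
y.backward()
print(w.grad)  # 3.0
```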

 

Page 42

How to optimize the loss: automatic differentiation

Page 43

How to optimize the loss: automatic differentiation

Page 44

How to optimize the loss: deep learning libraries

https://github.com/soumith/convnet-benchmarks/
Also see: CMU's locally grown library at https://github.com/clab/cnn

Page 45

Agenda

• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results

Page 46

Major results: language modeling

Page 47

Major results: image classification

Krizhevsky et al. (2012)

Page 48

Major results: ImageNet

Krizhevsky et al. (2012): positive and negative examples

Page 49

Major results: ImageNet

Krizhevsky et al. (2012): sample convolution filters

Page 50

Major results: speech recognition

Graves et al. (2013)

Page 51

Major results: translation

Sutskever et al. (2014)

Page 52

Major results: translation

Bahdanau et al. (2015)

Page 53

Major results: dependency parsing

Chen and Manning (2014)

Page 54

Major results: dependency parsing

Dyer et al. (2015)

Page 55

Important things we didn't cover

• Dark knowledge
• Connection to graphical models
• Alternatives to the softmax output layer

Page 56

Agenda

• Big picture
• Why deep learning?
• Building blocks of a deep neural network
• How to train deep neural networks
• Important results

Page 57

Open question: can we do without the intermediate linguistic abstractions?

[Same figure as Page 3: the pipeline of components - sentiment analyzer, speech recognizer, tokenizer, POS tagger, syntactic parser, semantic parser, machine translator, named entity recognizer, spell corrector, coreference resolution, classification - between the natural language input signals and the output analyses]