lecture25 - MIT CSAIL: people.csail.mit.edu/dsontag/courses/ml12/slides/lecture25.pdf

Dimensionality Reduction (continued), Lecture 25. David Sontag, New York University. Slides adapted from Carlos Guestrin and Luke Zettlemoyer


Page 1:

Dimensionality Reduction (continued), Lecture 25

David Sontag, New York University

Slides adapted from Carlos Guestrin and Luke Zettlemoyer

Page 2:

Basic PCA algorithm

• Start from the m by n data matrix X
• Recenter: subtract the mean from each row of X
  – Xc ← X − X̄
• Compute the covariance matrix:
  – Σ ← (1/m) Xc^T Xc
• Find the eigenvectors and eigenvalues of Σ
• Principal components: the k eigenvectors with the highest eigenvalues (see the sketch below)
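The slides give no code, but here is a minimal NumPy sketch of this recipe (function and variable names are my own choices, not from the slides):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via the covariance matrix. X is m x n, one data point per row."""
    m, n = X.shape
    x_bar = X.mean(axis=0)                     # mean of each feature
    Xc = X - x_bar                             # recenter: subtract mean from each row
    Sigma = (Xc.T @ Xc) / m                    # covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh since Sigma is symmetric
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    U = eigvecs[:, order[:k]]                  # principal components as columns (n x k)
    Z = Xc @ U                                 # coordinates of each point in the new basis
    return U, Z, eigvals[order[:k]], x_bar
```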

Page 3:

Linear projections, a review

• Project a point into a (lower-dimensional) space:
  – point: x = (x1, …, xn)
  – select a basis: a set of unit (length 1) basis vectors (u1, …, uk)
    • we consider an orthonormal basis: ui·ui = 1, and ui·uj = 0 for i ≠ j
  – select a center x̄, which defines the offset of the space
  – the best coordinates in the lower-dimensional space are given by dot products: (z1, …, zk), with zi = (x − x̄)·ui (see the sketch below)
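A minimal sketch of these dot-product formulas, assuming NumPy arrays and illustrative names (not from the slides):

```python
import numpy as np

def project(x, x_bar, U):
    """U holds the orthonormal basis vectors u_1..u_k as columns; returns (z_1,...,z_k)."""
    return U.T @ (x - x_bar)

def reconstruct(z, x_bar, U):
    """Map low-dimensional coordinates back: x_hat = x_bar + sum_i z_i u_i."""
    return x_bar + U @ z
```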

Page 4:

PCA finds the projection that minimizes reconstruction error

• Given m data points: x^i = (x^i_1, …, x^i_n), i = 1…m
• We will represent each point as a projection:

    \hat{x}^i = \bar{x} + \sum_{j=1}^{k} z^i_j u_j, with z^i_j = (x^i - \bar{x}) \cdot u_j

• PCA: given k < n, find (u1, …, uk) minimizing the reconstruction error:

    error_k = \sum_{i=1}^{m} \| x^i - \hat{x}^i \|^2

[Figure: 2D data points (axes x1, x2) with the first principal direction u1]

Page 5:

Understanding the reconstruction error

• Note that x^i can be represented exactly by an n-dimensional projection:

    x^i = \bar{x} + \sum_{j=1}^{n} z^i_j u_j

• Rewriting the error (given k < n, find (u1, …, uk) minimizing the reconstruction error):

    error_k = \sum_{i=1}^{m} \Big\| x^i - \Big( \bar{x} + \sum_{j=1}^{k} z^i_j u_j \Big) \Big\|^2

            = \sum_{i=1}^{m} \Big\| \Big( \bar{x} + \sum_{j=1}^{n} z^i_j u_j \Big) - \Big( \bar{x} + \sum_{j=1}^{k} z^i_j u_j \Big) \Big\|^2

            = \sum_{i=1}^{m} \Big\| \sum_{j=k+1}^{n} z^i_j u_j \Big\|^2

            = \sum_{i=1}^{m} \sum_{j=k+1}^{n} z^i_j u_j \cdot u_j z^i_j \; + \; \sum_{i=1}^{m} \sum_{j=k+1}^{n} \sum_{l=k+1,\, l \neq j}^{n} z^i_j u_j \cdot u_l z^i_l

            = \sum_{i=1}^{m} \sum_{j=k+1}^{n} (z^i_j)^2

  (The cross terms vanish because the basis is orthonormal: u_j \cdot u_j = 1 and u_j \cdot u_l = 0 for l \neq j.)

Error is the sum of squared weights for dimensions that have been cut!
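A quick numerical check of this fact, as a sketch on synthetic data (not from the slides; it uses the full orthonormal eigenbasis of the covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 5, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # correlated synthetic data
x_bar = X.mean(axis=0)
Xc = X - x_bar

# Full orthonormal basis from the covariance eigenvectors
eigvals, U = np.linalg.eigh(Xc.T @ Xc / m)
U = U[:, np.argsort(eigvals)[::-1]]          # columns sorted by decreasing eigenvalue

Z = Xc @ U                                   # coordinates z^i_j for all n directions
X_hat = x_bar + Z[:, :k] @ U[:, :k].T        # keep only the top-k terms

error_k = np.sum((X - X_hat) ** 2)
dropped = np.sum(Z[:, k:] ** 2)              # sum of squared weights for the cut dimensions
assert np.allclose(error_k, dropped)
```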

Page 6:

Reconstruction error and covariance matrix

    error_k = \sum_{i=1}^{m} \sum_{j=k+1}^{n} (z^i_j)^2

            = \sum_{i=1}^{m} \sum_{j=k+1}^{n} u_j^T (x^i - \bar{x}) (x^i - \bar{x})^T u_j

            = \sum_{j=k+1}^{n} u_j^T \Big[ \sum_{i=1}^{m} (x^i - \bar{x}) (x^i - \bar{x})^T \Big] u_j

    error_k = \sum_{j=k+1}^{n} u_j^T \Sigma u_j

  (Here z^i_j = u_j \cdot (x^i - \bar{x}), and \Sigma denotes \sum_{i} (x^i - \bar{x})(x^i - \bar{x})^T, i.e., the covariance matrix up to the 1/m factor, which does not change which basis minimizes the error.)

Thus, to minimize the reconstruction error we want to minimize

    \sum_{j=k+1}^{n} u_j^T \Sigma u_j

Recall that to maximize the variance we want to maximize

    \sum_{j=1}^{k} u_j^T \Sigma u_j

These are equivalent, because

    \sum_{j=1}^{k} u_j^T \Sigma u_j + \sum_{j=k+1}^{n} u_j^T \Sigma u_j = \sum_{j=1}^{n} u_j^T \Sigma u_j = \mathrm{trace}(\Sigma),

which is a constant independent of the choice of orthonormal basis.
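As a sanity check of both identities, here is a short sketch on synthetic data (my own; S below is the scatter matrix \sum_i (x^i - \bar{x})(x^i - \bar{x})^T, the slides' Σ up to the 1/m factor, which does not affect the argmin):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 300, 6, 3
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc                                # scatter matrix

eigvals, U = np.linalg.eigh(S)
U = U[:, np.argsort(eigvals)[::-1]]          # u_1..u_n, decreasing eigenvalue

Z = Xc @ U
error_k = np.sum(Z[:, k:] ** 2)              # reconstruction error from the derivation
quad_tail = sum(U[:, j] @ S @ U[:, j] for j in range(k, n))
assert np.allclose(error_k, quad_tail)

# Minimizing the tail is the same as maximizing the head: they sum to trace(S)
quad_head = sum(U[:, j] @ S @ U[:, j] for j in range(k))
assert np.allclose(quad_head + quad_tail, np.trace(S))
```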

Page 7:

Dimensionality reduction with PCA

In high-dimensional problems, data usually lies near a linear subspace, as noise introduces only small variability.

Only keep data projections onto principal components with large eigenvalues

Can ignore the components of lesser significance.

You might lose some information, but if the discarded eigenvalues are small, you don't lose much.

[Bar chart: variance (%) explained by each principal component, PC1 through PC10]

[Slide from Aarti Singh]
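One common heuristic, not spelled out on the slide, is to keep just enough components to reach a chosen fraction of the total variance; a sketch (the threshold and names are illustrative):

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """eigvals: covariance eigenvalues sorted in decreasing order."""
    explained = np.cumsum(eigvals) / np.sum(eigvals)       # cumulative fraction of variance
    return int(np.searchsorted(explained, threshold) + 1)  # smallest k reaching the threshold

# Example: a rapidly decaying spectrum keeps only a few components
print(choose_k(np.array([12.0, 6.0, 1.0, 0.5, 0.3, 0.2])))
```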

Page 8:

PCA example

[Figure panels: Data, Projection, Reconstruction]

Page 9:

What's the difference between the first eigenvector and linear regression?

Suppose we have data { (x,y) }

[Figure panels: Predict y from x, Predict x from y, PCA]

[Pictures from “Cerebral Mastication” blog]
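To make the contrast concrete, here is a small sketch on synthetic 2D data (my own illustration): regression of y on x minimizes vertical errors, while the first principal component minimizes perpendicular reconstruction error, so the two slopes generally differ.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
X = np.column_stack([x, y])

# Least-squares slope for predicting y from x (minimizes vertical errors)
xc, yc = x - x.mean(), y - y.mean()
reg_slope = np.sum(xc * yc) / np.sum(xc * xc)

# First principal component (minimizes perpendicular reconstruction error)
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
u1 = eigvecs[:, np.argmax(eigvals)]
pca_slope = u1[1] / u1[0]

print(reg_slope, pca_slope)   # the two slopes generally differ
```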

Page 10:

Eigenfaces [Turk, Pentland '91]

• Input images: [figure: sample face images]
• Principal components: [figure: the leading principal component images]

Page 11:

Eigenfaces reconstruction

• Each image corresponds to adding together the principal components:
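In code, a reconstruction is just the mean image plus a weighted sum of the leading principal components. A sketch, where `faces` is a hypothetical m by n array with one flattened face image per row (not data from the slides):

```python
import numpy as np

def incremental_reconstructions(faces, image, ks=(1, 5, 20, 50)):
    """Reconstruct `image` using the top k principal components of `faces`, for each k in ks."""
    x_bar = faces.mean(axis=0)
    Xc = faces - x_bar
    # Principal components are the rows of Vt in the SVD of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    z = Vt @ (image - x_bar)                       # coordinates of the image in eigenspace
    return {k: x_bar + Vt[:k].T @ z[:k] for k in ks}
```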

Page 12:

Scaling up

• The covariance matrix can be really big!
  – Σ is n by n
  – 10,000 features can be common!
  – finding eigenvectors is very slow…

• Use the singular value decomposition (SVD)
  – finds the top k eigenvectors
  – great implementations available, e.g., Matlab's svd

Page 13:

SVD

• Write X = W S V^T
  – X ← data matrix, one row per datapoint
  – W ← weight matrix, one row per datapoint: the coordinates of x^i in eigenspace
  – S ← singular value matrix, a diagonal matrix
    • in our setting each entry corresponds to an eigenvalue λj
  – V^T ← singular vector matrix
    • in our setting each row is an eigenvector vj (see the sketch below)
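A brief sketch relating this notation to NumPy's SVD, assuming Xc is the centered data matrix (the code comment spells out the exact eigenvalue relationship, since singular values and covariance eigenvalues differ by a square and the 1/m factor):

```python
import numpy as np

def svd_pieces(Xc):
    """Return the pieces the slide calls W, S, and V^T, plus the covariance eigenvalues."""
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = np.diag(s)                       # diagonal matrix of singular values
    assert np.allclose(Xc, W @ S @ Vt)   # X = W S V^T
    # Rows of Vt are eigenvectors of the covariance matrix (1/m) Xc^T Xc;
    # its eigenvalues are recovered as the squared singular values over m.
    m = Xc.shape[0]
    eigvals = s ** 2 / m
    return W, S, Vt, eigvals
```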

Page 14:

PCA using SVD algorithm

• Start from the m by n data matrix X
• Recenter: subtract the mean from each row of X
  – Xc ← X − X̄
• Call an SVD algorithm on Xc, asking for k singular vectors
• Principal components: the k singular vectors with the highest singular values (rows of V^T)
  – Coefficients: project each point onto the new vectors (see the sketch below)
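Putting the recipe together, a minimal sketch (names are my own; for very large matrices a truncated SVD routine that returns only k singular vectors would typically be used instead of a full decomposition):

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered data matrix. X is m x n, one data point per row."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                           # recenter
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # top-k rows of V^T (highest singular values)
    Z = Xc @ components.T                    # coefficients: project each point onto the new vectors
    return components, Z, x_bar
```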

Page 15:

Non-linear methods

Latent Feature Extraction

• Combinations of observed features provide more efficient representations, and capture underlying relations that govern the data
  – E.g., ego, personality, and intelligence are hidden attributes that characterize human behavior, instead of survey questions; topics (sports, science, news, etc.), instead of documents
  – Often they may not have physical meaning
• Linear: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA)
• Non-linear: Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE)

Slide from Aarti Singh

Latent Dirichlet allocation

“swiss roll”

Page 16:

Probabilistic topic models

(Blei, Ng, Jordan JMLR ‘03)

[Figure: example topics as word distributions, e.g., (gene 0.04, dna 0.02, genetic 0.01, …), (life 0.02, evolve 0.01, organism 0.01, …), (brain 0.04, neuron 0.02, nerve 0.01, …), (data 0.02, number 0.02, computer 0.01, …); panels labeled Topics, Documents, and Topic proportions and assignments]

Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data. See Figure 2 for topics fit from data.

model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out below in Section 2.1.)

We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability, and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these topics are specified before any data has been generated.[1] Now for each document in the collection, we generate the words in a two-stage process.

1. Randomly choose a distribution over topics.

2. For each word in the document

(a) Randomly choose a topic from the distribution over topics in step #1.

(b) Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportion (step #1); each word in each document

[1] Technically, the model assumes that the topics are generated first, before the documents.
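As a toy sketch of this two-stage generative process (the vocabulary, topics, and Dirichlet parameter below are made up for illustration; this is not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["gene", "dna", "brain", "neuron", "data", "computer"]
# Each topic is a distribution over the fixed vocabulary (each row sums to 1)
topics = np.array([[0.4, 0.4, 0.05, 0.05, 0.05, 0.05],    # a "genetics"-like topic
                   [0.05, 0.05, 0.4, 0.4, 0.05, 0.05],    # a "neuroscience"-like topic
                   [0.05, 0.05, 0.05, 0.05, 0.4, 0.4]])   # a "data analysis"-like topic

def generate_document(n_words, alpha=0.5):
    theta = rng.dirichlet(alpha * np.ones(len(topics)))   # 1. choose a distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)              # 2a. choose a topic assignment
        w = rng.choice(len(vocab), p=topics[z])           # 2b. choose a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(20)
print(np.round(theta, 2), words[:8])
```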


[Graphical model sketch: per-document topic proportions θd, topic assignments z1d … zNd, and topic distributions β1 … βT]

Page 17:

Probabilistic topic models

(Blei, Ng, Jordan JMLR ‘03)

Topic word distributions (one per topic):
  – pna .0100, cough .0095, pneumonia .0090, cxr .0085, levaquin .0060, …
  – sore throat .05, swallow .0092, voice .0080, fevers .0075, ear .0016, …
  – cellulitis .0105, swelling .0100, redness .0055, lle .0050, fevers .0045, …

[Slide figure: a triage note is run through inference in the graphical model for Latent Dirichlet Allocation (topic distributions β1, β2, …, βT; per-note topic proportions θd), producing a low-dimensional representation, the distribution of topics for the note, e.g., Pneumonia 0.50, Common cold 0.49, Diabetes 0.01]

Page 18:

Example of learned representation

Paraphrased note:

  "Patient has URI [upper respiratory infection] symptoms like cough, runny nose, ear pain. Denies fevers. History of seasonal allergies."

Inferred topic distribution:

[Pie chart over the topics Allergy, Cold/URI, and Other]

Page 19:

What you need to know

• Dimensionality reduction
  – why and when it's important
• Simple feature selection
• Regularization as a type of feature selection
• Principal component analysis
  – minimizing reconstruction error
  – relationship to the covariance matrix and eigenvectors
  – using SVD