lecture25 - MIT CSAIL: people.csail.mit.edu/dsontag/courses/ml12/slides/lecture25.pdf

Dimensionality Reduction (continued), Lecture 25. David Sontag, New York University. Slides adapted from Carlos Guestrin and Luke Zettlemoyer


Page 1:

Dimensionality Reduction (continued), Lecture 25

David Sontag, New York University

Slides adapted from Carlos Guestrin and Luke Zettlemoyer

Page 2:

Basic PCA algorithm

• Start from the m by n data matrix X
• Recenter: subtract the mean from each row of X
  – Xc ← X − X̄
• Compute the covariance matrix:
  – Σ ← (1/m) Xc^T Xc
• Find the eigenvectors and eigenvalues of Σ
• Principal components: the k eigenvectors with the highest eigenvalues (see the sketch below)
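The slides give no code, but here is a minimal NumPy sketch of this recipe (function and variable names are my own choices, not from the slides):

```python
import numpy as np

def pca_eig(X, k):
    """PCA via the covariance matrix. X is m x n, one data point per row."""
    m, n = X.shape
    x_bar = X.mean(axis=0)                     # mean of each feature
    Xc = X - x_bar                             # recenter: subtract mean from each row
    Sigma = (Xc.T @ Xc) / m                    # covariance matrix (n x n)
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigh since Sigma is symmetric
    order = np.argsort(eigvals)[::-1]          # sort by decreasing eigenvalue
    U = eigvecs[:, order[:k]]                  # principal components as columns (n x k)
    Z = Xc @ U                                 # coordinates of each point in the new basis
    return U, Z, eigvals[order[:k]], x_bar
```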

Page 3:

Linear projections, a review

• Project a point into a (lower-dimensional) space:
  – point: x = (x1, …, xn)
  – select a basis: a set of unit (length 1) basis vectors (u1, …, uk)
    • we consider an orthonormal basis: ui·ui = 1, and ui·uj = 0 for i ≠ j
  – select a center x̄, which defines the offset of the space
  – the best coordinates in the lower-dimensional space are given by dot products: (z1, …, zk), with zi = (x − x̄)·ui (see the sketch below)
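A minimal sketch of these dot-product formulas, assuming NumPy arrays and illustrative names (not from the slides):

```python
import numpy as np

def project(x, x_bar, U):
    """U holds the orthonormal basis vectors u_1..u_k as columns; returns (z_1,...,z_k)."""
    return U.T @ (x - x_bar)

def reconstruct(z, x_bar, U):
    """Map low-dimensional coordinates back: x_hat = x_bar + sum_i z_i u_i."""
    return x_bar + U @ z
```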

Page 4:

PCA finds the projection that minimizes reconstruction error

• Given m data points: x^i = (x^i_1, …, x^i_n), i = 1…m
• We will represent each point as a projection:

    \hat{x}^i = \bar{x} + \sum_{j=1}^{k} z^i_j u_j, with z^i_j = (x^i - \bar{x}) \cdot u_j

• PCA: given k < n, find (u1, …, uk) minimizing the reconstruction error:

    error_k = \sum_{i=1}^{m} \| x^i - \hat{x}^i \|^2

[Figure: 2D data points (axes x1, x2) with the first principal direction u1]

Page 5:

Understanding the reconstruction error

• Note that x^i can be represented exactly by an n-dimensional projection:

    x^i = \bar{x} + \sum_{j=1}^{n} z^i_j u_j

• Rewriting the error (given k < n, find (u1, …, uk) minimizing the reconstruction error):

    error_k = \sum_{i=1}^{m} \Big\| x^i - \Big( \bar{x} + \sum_{j=1}^{k} z^i_j u_j \Big) \Big\|^2

            = \sum_{i=1}^{m} \Big\| \Big( \bar{x} + \sum_{j=1}^{n} z^i_j u_j \Big) - \Big( \bar{x} + \sum_{j=1}^{k} z^i_j u_j \Big) \Big\|^2

            = \sum_{i=1}^{m} \Big\| \sum_{j=k+1}^{n} z^i_j u_j \Big\|^2

            = \sum_{i=1}^{m} \sum_{j=k+1}^{n} z^i_j u_j \cdot u_j z^i_j \; + \; \sum_{i=1}^{m} \sum_{j=k+1}^{n} \sum_{l=k+1,\, l \neq j}^{n} z^i_j u_j \cdot u_l z^i_l

            = \sum_{i=1}^{m} \sum_{j=k+1}^{n} (z^i_j)^2

  (The cross terms vanish because the basis is orthonormal: u_j \cdot u_j = 1 and u_j \cdot u_l = 0 for l \neq j.)

Error is the sum of squared weights for dimensions that have been cut!
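A quick numerical check of this fact, as a sketch on synthetic data (not from the slides; it uses the full orthonormal eigenbasis of the covariance matrix):

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 5, 2
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))   # correlated synthetic data
x_bar = X.mean(axis=0)
Xc = X - x_bar

# Full orthonormal basis from the covariance eigenvectors
eigvals, U = np.linalg.eigh(Xc.T @ Xc / m)
U = U[:, np.argsort(eigvals)[::-1]]          # columns sorted by decreasing eigenvalue

Z = Xc @ U                                   # coordinates z^i_j for all n directions
X_hat = x_bar + Z[:, :k] @ U[:, :k].T        # keep only the top-k terms

error_k = np.sum((X - X_hat) ** 2)
dropped = np.sum(Z[:, k:] ** 2)              # sum of squared weights for the cut dimensions
assert np.allclose(error_k, dropped)
```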

Page 6:

Reconstruction error and covariance matrix

    error_k = \sum_{i=1}^{m} \sum_{j=k+1}^{n} (z^i_j)^2

            = \sum_{i=1}^{m} \sum_{j=k+1}^{n} u_j^T (x^i - \bar{x}) (x^i - \bar{x})^T u_j

            = \sum_{j=k+1}^{n} u_j^T \Big[ \sum_{i=1}^{m} (x^i - \bar{x}) (x^i - \bar{x})^T \Big] u_j

    error_k = \sum_{j=k+1}^{n} u_j^T \Sigma u_j

  (Here z^i_j = u_j \cdot (x^i - \bar{x}), and \Sigma denotes \sum_{i} (x^i - \bar{x})(x^i - \bar{x})^T, i.e., the covariance matrix up to the 1/m factor, which does not change which basis minimizes the error.)

Thus, to minimize the reconstruction error we want to minimize

    \sum_{j=k+1}^{n} u_j^T \Sigma u_j

Recall that to maximize the variance we want to maximize

    \sum_{j=1}^{k} u_j^T \Sigma u_j

These are equivalent, because

    \sum_{j=1}^{k} u_j^T \Sigma u_j + \sum_{j=k+1}^{n} u_j^T \Sigma u_j = \sum_{j=1}^{n} u_j^T \Sigma u_j = \mathrm{trace}(\Sigma),

which is a constant independent of the choice of orthonormal basis.
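As a sanity check of both identities, here is a short sketch on synthetic data (my own; S below is the scatter matrix \sum_i (x^i - \bar{x})(x^i - \bar{x})^T, the slides' Σ up to the 1/m factor, which does not affect the argmin):

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 300, 6, 3
X = rng.normal(size=(m, n)) @ rng.normal(size=(n, n))
x_bar = X.mean(axis=0)
Xc = X - x_bar
S = Xc.T @ Xc                                # scatter matrix

eigvals, U = np.linalg.eigh(S)
U = U[:, np.argsort(eigvals)[::-1]]          # u_1..u_n, decreasing eigenvalue

Z = Xc @ U
error_k = np.sum(Z[:, k:] ** 2)              # reconstruction error from the derivation
quad_tail = sum(U[:, j] @ S @ U[:, j] for j in range(k, n))
assert np.allclose(error_k, quad_tail)

# Minimizing the tail is the same as maximizing the head: they sum to trace(S)
quad_head = sum(U[:, j] @ S @ U[:, j] for j in range(k))
assert np.allclose(quad_head + quad_tail, np.trace(S))
```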

Page 7:

Dimensionality reduction with PCA

In high-dimensional problems, data usually lies near a linear subspace, as noise introduces only small variability.

Only keep data projections onto principal components with large eigenvalues

Can ignore the components of lesser significance.

You might lose some information, but if the discarded eigenvalues are small, you don't lose much.

[Bar chart: variance (%) explained by each principal component, PC1 through PC10]

[Slide from Aarti Singh]
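One common heuristic, not spelled out on the slide, is to keep just enough components to reach a chosen fraction of the total variance; a sketch (the threshold and names are illustrative):

```python
import numpy as np

def choose_k(eigvals, threshold=0.95):
    """eigvals: covariance eigenvalues sorted in decreasing order."""
    explained = np.cumsum(eigvals) / np.sum(eigvals)       # cumulative fraction of variance
    return int(np.searchsorted(explained, threshold) + 1)  # smallest k reaching the threshold

# Example: a rapidly decaying spectrum keeps only a few components
print(choose_k(np.array([12.0, 6.0, 1.0, 0.5, 0.3, 0.2])))
```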

Page 8:

PCA example

[Figure panels: Data, Projection, Reconstruction]

Page 9:

What's the difference between the first eigenvector and linear regression?

Suppose we have data { (x,y) }

[Figure panels: Predict y from x, Predict x from y, PCA]

[Pictures from “Cerebral Mastication” blog]
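To make the contrast concrete, here is a small sketch on synthetic 2D data (my own illustration): regression of y on x minimizes vertical errors, while the first principal component minimizes perpendicular reconstruction error, so the two slopes generally differ.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=500)
y = 0.8 * x + 0.3 * rng.normal(size=500)
X = np.column_stack([x, y])

# Least-squares slope for predicting y from x (minimizes vertical errors)
xc, yc = x - x.mean(), y - y.mean()
reg_slope = np.sum(xc * yc) / np.sum(xc * xc)

# First principal component (minimizes perpendicular reconstruction error)
Xc = X - X.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(Xc.T @ Xc)
u1 = eigvecs[:, np.argmax(eigvals)]
pca_slope = u1[1] / u1[0]

print(reg_slope, pca_slope)   # the two slopes generally differ
```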

Page 10:

Eigenfaces [Turk, Pentland '91]

• Input images: [figure: sample face images]
• Principal components: [figure: the leading principal component images]

Page 11:

Eigenfaces reconstruction

• Each image corresponds to adding together the principal components:
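In code, a reconstruction is just the mean image plus a weighted sum of the leading principal components. A sketch, where `faces` is a hypothetical m by n array with one flattened face image per row (not data from the slides):

```python
import numpy as np

def incremental_reconstructions(faces, image, ks=(1, 5, 20, 50)):
    """Reconstruct `image` using the top k principal components of `faces`, for each k in ks."""
    x_bar = faces.mean(axis=0)
    Xc = faces - x_bar
    # Principal components are the rows of Vt in the SVD of the centered data
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    z = Vt @ (image - x_bar)                       # coordinates of the image in eigenspace
    return {k: x_bar + Vt[:k].T @ z[:k] for k in ks}
```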

Page 12:

Scaling up

• The covariance matrix can be really big!
  – Σ is n by n
  – 10,000 features can be common!
  – finding eigenvectors is very slow…

• Use the singular value decomposition (SVD)
  – finds the top k eigenvectors
  – great implementations available, e.g., Matlab's svd

Page 13:

SVD

• Write X = W S V^T
  – X ← data matrix, one row per datapoint
  – W ← weight matrix, one row per datapoint: the coordinates of x^i in eigenspace
  – S ← singular value matrix, a diagonal matrix
    • in our setting each entry corresponds to an eigenvalue λj
  – V^T ← singular vector matrix
    • in our setting each row is an eigenvector vj (see the sketch below)
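A brief sketch relating this notation to NumPy's SVD, assuming Xc is the centered data matrix (the code comment spells out the exact eigenvalue relationship, since singular values and covariance eigenvalues differ by a square and the 1/m factor):

```python
import numpy as np

def svd_pieces(Xc):
    """Return the pieces the slide calls W, S, and V^T, plus the covariance eigenvalues."""
    W, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    S = np.diag(s)                       # diagonal matrix of singular values
    assert np.allclose(Xc, W @ S @ Vt)   # X = W S V^T
    # Rows of Vt are eigenvectors of the covariance matrix (1/m) Xc^T Xc;
    # its eigenvalues are recovered as the squared singular values over m.
    m = Xc.shape[0]
    eigvals = s ** 2 / m
    return W, S, Vt, eigvals
```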

Page 14:

PCA using SVD algorithm

• Start from the m by n data matrix X
• Recenter: subtract the mean from each row of X
  – Xc ← X − X̄
• Call an SVD algorithm on Xc, asking for k singular vectors
• Principal components: the k singular vectors with the highest singular values (rows of V^T)
  – Coefficients: project each point onto the new vectors (see the sketch below)
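Putting the recipe together, a minimal sketch (names are my own; for very large matrices a truncated SVD routine that returns only k singular vectors would typically be used instead of a full decomposition):

```python
import numpy as np

def pca_svd(X, k):
    """PCA via SVD of the centered data matrix. X is m x n, one data point per row."""
    x_bar = X.mean(axis=0)
    Xc = X - x_bar                           # recenter
    _, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:k]                      # top-k rows of V^T (highest singular values)
    Z = Xc @ components.T                    # coefficients: project each point onto the new vectors
    return components, Z, x_bar
```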

Page 15:

Non-linear methods

Latent Feature Extraction

• Combinations of observed features provide more efficient representations, and capture underlying relations that govern the data
  – E.g., ego, personality, and intelligence are hidden attributes that characterize human behavior, instead of survey questions; topics (sports, science, news, etc.), instead of documents
  – Often they may not have physical meaning
• Linear: Principal Component Analysis (PCA), Factor Analysis, Independent Component Analysis (ICA)
• Non-linear: Laplacian Eigenmaps, ISOMAP, Local Linear Embedding (LLE)

Slide from Aarti Singh

Latent Dirichlet allocation

“swiss roll”

Page 16:

Probabilistic topic models

(Blei, Ng, Jordan JMLR ‘03)

[Figure: example topics as word distributions, e.g., (gene 0.04, dna 0.02, genetic 0.01, …), (life 0.02, evolve 0.01, organism 0.01, …), (brain 0.04, neuron 0.02, nerve 0.01, …), (data 0.02, number 0.02, computer 0.01, …); panels labeled Topics, Documents, and Topic proportions and assignments]

Figure 1: The intuitions behind latent Dirichlet allocation. We assume that some number of "topics," which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data. See Figure 2 for topics fit from data.

model assumes the documents arose. (The interpretation of LDA as a probabilistic model is fleshed out below in Section 2.1.)

We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability, and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these topics are specified before any data has been generated.[1] Now for each document in the collection, we generate the words in a two-stage process.

1. Randomly choose a distribution over topics.

2. For each word in the document

(a) Randomly choose a topic from the distribution over topics in step #1.

(b) Randomly choose a word from the corresponding distribution over the vocabulary.

This statistical model reflects the intuition that documents exhibit multiple topics. Each document exhibits the topics in different proportion (step #1); each word in each document

[1] Technically, the model assumes that the topics are generated first, before the documents.
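As a toy sketch of this two-stage generative process (the vocabulary, topics, and Dirichlet parameter below are made up for illustration; this is not the paper's code):

```python
import numpy as np

rng = np.random.default_rng(3)
vocab = ["gene", "dna", "brain", "neuron", "data", "computer"]
# Each topic is a distribution over the fixed vocabulary (each row sums to 1)
topics = np.array([[0.4, 0.4, 0.05, 0.05, 0.05, 0.05],    # a "genetics"-like topic
                   [0.05, 0.05, 0.4, 0.4, 0.05, 0.05],    # a "neuroscience"-like topic
                   [0.05, 0.05, 0.05, 0.05, 0.4, 0.4]])   # a "data analysis"-like topic

def generate_document(n_words, alpha=0.5):
    theta = rng.dirichlet(alpha * np.ones(len(topics)))   # 1. choose a distribution over topics
    words = []
    for _ in range(n_words):
        z = rng.choice(len(topics), p=theta)              # 2a. choose a topic assignment
        w = rng.choice(len(vocab), p=topics[z])           # 2b. choose a word from that topic
        words.append(vocab[w])
    return theta, words

theta, words = generate_document(20)
print(np.round(theta, 2), words[:8])
```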


[Graphical model sketch: per-document topic proportions θd, topic assignments z1d … zNd, and topic distributions β1 … βT]

Page 17:

Probabilistic topic models

(Blei, Ng, Jordan JMLR ‘03)

Topic word distributions (one per topic):
  – pna .0100, cough .0095, pneumonia .0090, cxr .0085, levaquin .0060, …
  – sore throat .05, swallow .0092, voice .0080, fevers .0075, ear .0016, …
  – cellulitis .0105, swelling .0100, redness .0055, lle .0050, fevers .0045, …

[Slide figure: a triage note is run through inference in the graphical model for Latent Dirichlet Allocation (topic distributions β1, β2, …, βT; per-note topic proportions θd), producing a low-dimensional representation, the distribution of topics for the note, e.g., Pneumonia 0.50, Common cold 0.49, Diabetes 0.01]

Page 18:

Example of learned representation

Paraphrased note:

  "Patient has URI [upper respiratory infection] symptoms like cough, runny nose, ear pain. Denies fevers. History of seasonal allergies."

Inferred topic distribution:

[Pie chart over the topics Allergy, Cold/URI, and Other]

Page 19:

What you need to know

• Dimensionality reduction
  – why and when it's important
• Simple feature selection
• Regularization as a type of feature selection
• Principal component analysis
  – minimizing reconstruction error
  – relationship to the covariance matrix and eigenvectors
  – using SVD