
CSCI B609: β€œFoundations of Data Science”

Grigory Yaroslavtsev http://grigory.us

Lecture 8/9: Faster Power Method and Applications of SVD

Slides at http://grigory.us/data-science-class.html

Faster Power Method

• PM drawback: $A^T A$ is dense even for sparse $A$

• Pick a random Gaussian $x$ and compute $B^k x$, where $B = A^T A$

• $x = \sum_{i=1}^{d} c_i v_i$ (augment the $v_i$'s to an orthonormal basis if $r < d$)

• $B^k x \approx \sigma_1^{2k} v_1 v_1^T \left( \sum_{i=1}^{d} c_i v_i \right) = \sigma_1^{2k} c_1 v_1$

• $B^k x = (A^T A)(A^T A) \cdots (A^T A)\, x$: evaluate right to left as $2k$ sparse matrix-vector products, never forming $A^T A$ (see the sketch after this slide)

• Theorem: If $x$ is a unit vector in $\mathbb{R}^d$ with $x^T v_1 \ge \delta$:

– $V$ = subspace spanned by the $v_j$'s with $\sigma_j \ge (1 - \epsilon)\sigma_1$

– $w$ = unit vector after $k = \frac{1}{2\epsilon} \ln \frac{1}{\epsilon\delta}$ iterations of PM

⇒ $w$ has a component of at most $\epsilon$ orthogonal to $V$

Faster Power Method: Analysis

• $A = \sum_{i=1}^{r} \sigma_i u_i v_i^T$ and $x = \sum_{i=1}^{d} c_i v_i$

• $B^k x = \left( \sum_{i=1}^{d} \sigma_i^{2k} v_i v_i^T \right) \left( \sum_{j=1}^{d} c_j v_j \right) = \sum_{i=1}^{d} \sigma_i^{2k} c_i v_i$ (with $\sigma_i = 0$ for $i > r$)

• $\|B^k x\|_2^2 = \left\| \sum_{i=1}^{d} \sigma_i^{2k} c_i v_i \right\|_2^2 = \sum_{i=1}^{d} \sigma_i^{4k} c_i^2 \ge \sigma_1^{4k} c_1^2 \ge \sigma_1^{4k} \delta^2$

• (Squared) component orthogonal to $V$ is

$\sum_{i=m+1}^{d} \sigma_i^{4k} c_i^2 \le (1 - \epsilon)^{4k} \sigma_1^{4k} \sum_{i=m+1}^{d} c_i^2 \le (1 - \epsilon)^{4k} \sigma_1^{4k}$

(where $m$ = number of singular values with $\sigma_j \ge (1 - \epsilon)\sigma_1$, and $\sum_i c_i^2 = 1$ since $x$ is a unit vector)

• Component of $w$ orthogonal to $V$ is $\le \frac{(1 - \epsilon)^{2k} \sigma_1^{2k}}{\sigma_1^{2k} \delta} = \frac{(1 - \epsilon)^{2k}}{\delta} \le \epsilon$
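The final inequality follows from the stated choice of $k$; a short check, using $1 - \epsilon \le e^{-\epsilon}$:

```latex
\[
\frac{(1-\epsilon)^{2k}}{\delta}
\;\le\; \frac{e^{-2k\epsilon}}{\delta}
\;=\; \frac{1}{\delta} \exp\!\left(-\ln \frac{1}{\epsilon\delta}\right)
\;=\; \frac{\epsilon\delta}{\delta}
\;=\; \epsilon
\qquad \text{for } k = \frac{1}{2\epsilon} \ln \frac{1}{\epsilon\delta}.
\]
```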

Choice of $x$

• $y$ = random spherical Gaussian with unit variance

• $x = \frac{y}{\|y\|_2}$:

$\Pr\left[ |x^T v_1| \le \frac{1}{20\sqrt{d}} \right] \le \frac{1}{10} + 3e^{-d/64}$

• $\Pr\left[ \|y\|_2 \ge 2\sqrt{d} \right] \le 3e^{-d/64}$ (Gaussian Annulus)

• $y^T v_1 \sim N(0, 1)$ ⇒ $\Pr\left[ |y^T v_1| \le \frac{1}{10} \right] \le \frac{1}{10}$

• Can set $\delta = \frac{1}{20\sqrt{d}}$ in the "faster power method"
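A quick Monte Carlo check of the stated bound (an illustration I am adding; the dimension and trial count are arbitrary choices):

```python
import numpy as np

d, trials = 200, 20_000
rng = np.random.default_rng(1)
v = np.zeros(d); v[0] = 1.0                       # any fixed unit vector v_1
y = rng.standard_normal((trials, d))              # spherical Gaussians
x = y / np.linalg.norm(y, axis=1, keepdims=True)  # x = y / ||y||_2
frac = np.mean(np.abs(x @ v) <= 1 / (20 * np.sqrt(d)))
print(frac, 1 / 10 + 3 * np.exp(-d / 64))         # empirical rate vs. bound
```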

Singular Vectors and Eigenvectors

• Right singular vectors are eigenvectors of $A^T A$

• $\sigma_i^2$ are the eigenvalues of $A^T A$

• Left singular vectors are eigenvectors of $A A^T$

• $B = A^T A$ satisfies $\forall x: x^T B x \ge 0$

– $B = \sum_i \sigma_i^2 v_i v_i^T$

– $\forall x: x^T v_i v_i^T x = (x^T v_i)^2 \ge 0$

– Such matrices are called positive semi-definite

• Any p.s.d. matrix can be decomposed as $A^T A$ for some $A$
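A small numerical sanity check of the first two bullets (illustrative; the random matrix and seed are assumptions): the right singular vectors of $A$ are eigenvectors of $A^T A$ with eigenvalues $\sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
U, s, Vt = np.linalg.svd(A, full_matrices=False)
B = A.T @ A
for i, sigma in enumerate(s):
    v = Vt[i]
    assert np.allclose(B @ v, sigma**2 * v)            # B v_i = sigma_i^2 v_i
print(np.allclose(np.linalg.eigvalsh(B)[::-1], s**2))  # eigenvalues = sigma_i^2
```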

Application of SVD: Centering Data

• Minimize the sum of squared distances from the $A_i$'s to $S_k$

• SVD finds the best-fitting $S_k$ if the data is centered

• What if it is not?

• Thm. The $S_k$ that minimizes the squared distance goes through the centroid of the point set: $\frac{1}{n} \sum_i A_i$

• Will only prove it for $k = 1$; the proof for arbitrary $k$ is analogous (see textbook)

Application of SVD: Centering Data

• Thm. The line that minimizes the squared distance goes through the centroid

• Line: $\ell = \{a + \lambda v\}$; distance $dist(A_i, \ell)$; WLOG $\langle a, v \rangle = 0$ (take $a$ to be the point of $\ell$ closest to the origin), so by the Pythagorean theorem:

• $\|A_i - a\|_2^2 = dist(A_i, \ell)^2 + \langle v, A_i \rangle^2$

• Center so that $\sum_{i=1}^{n} A_i = 0$ by subtracting the centroid

• $\sum_{i=1}^{n} dist(A_i, \ell)^2 = \sum_{i=1}^{n} \left( \|A_i - a\|_2^2 - \langle v, A_i \rangle^2 \right)$

$= \sum_{i=1}^{n} \left( \|A_i\|_2^2 + \|a\|_2^2 - 2\langle A_i, a \rangle - \langle v, A_i \rangle^2 \right)$

$= \sum_{i=1}^{n} \|A_i\|_2^2 + n\|a\|_2^2 - 2\left\langle \sum_{i=1}^{n} A_i, a \right\rangle - \sum_{i=1}^{n} \langle v, A_i \rangle^2$

$= \sum_{i=1}^{n} \|A_i\|_2^2 + n\|a\|_2^2 - \sum_{i=1}^{n} \langle v, A_i \rangle^2$

• Minimized when $a = 0$, i.e. the optimal line passes through the centroid
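A sketch of the recipe in practice (illustrative; the synthetic data is an assumption): subtract the centroid, run SVD, and read off the best-fit line through the centroid.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 3)) * np.array([5.0, 1.0, 0.2]) \
    + np.array([10.0, -4.0, 2.0])      # anisotropic cloud, off-center
centroid = A.mean(axis=0)
Ac = A - centroid                      # center: rows now sum to ~0
_, _, Vt = np.linalg.svd(Ac, full_matrices=False)
v = Vt[0]                              # direction of the best-fit line
# best-fit line: {centroid + t * v}; squared distances to it:
resid = Ac - np.outer(Ac @ v, v)       # components orthogonal to the line
print(np.sum(resid**2))                # minimized sum of squared distances
```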

Principal Component Analysis

• $n \times d$ matrix: customers × movies preferences

• $n$ = #customers, $d$ = #movies

• $A_{ij}$ = how much customer $i$ likes movie $j$

• Assumption: $A_{ij}$ can be described with $k$ factors

– Customers and movies: vectors $u_i, v_j \in \mathbb{R}^k$

– $A_{ij} = \langle u_i, v_j \rangle$

• Solution: the rank-$k$ approximation $A_k$ (a small sketch follows below)
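A minimal sketch of this setup (assumed synthetic data, not from the slides): plant a $k$-factor preference matrix plus noise, then recover it with the truncated SVD $A_k$.

```python
import numpy as np

rng = np.random.default_rng(4)
n, d, k = 200, 50, 3
U_true = rng.standard_normal((n, k))   # hypothetical customer factors
V_true = rng.standard_normal((d, k))   # hypothetical movie factors
A = U_true @ V_true.T + 0.1 * rng.standard_normal((n, d))  # k factors + noise

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k]             # rank-k approximation
print(np.linalg.norm(A - A_k) / np.linalg.norm(A))   # small relative error
```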

Class Project

• Survey of 3-5 research papers

– Closely related to the topics of the class:

• Algorithms for high-dimensional data
• Fast algorithms for numerical linear algebra
• Algorithms for machine learning and/or clustering
• Algorithms for streaming and massive data

– Office hours if you need suggestions

– Individual (not a group) project

– 1-page Proposal Due: October 31, 2016 at 23:59 EST

– Final Deadline: December 09, 2016 at 23:59 EST

• Submission by e-mail to Lisul Islam (IU id: islammdl)

– Submission Email Title: Project + Space + "Your Name"

– Submission format: PDF from LaTeX

Separating mixture of $k$ Gaussians

• Sample origin problem:

– Given samples from $k$ well-separated spherical Gaussians

– Q: Did they come from the same Gaussian?

• $\delta$ = distance between centers

• For two Gaussians, naïve separation requires $\delta = \omega(d^{1/4})$

• Thm. $\delta = \Omega(k^{1/4})$ suffices

• Idea (sketched in code below):

– Project on a $k$-dimensional subspace through the centers

– Key fact: this subspace can be found via SVD

– Apply the naïve algorithm
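A hedged sketch of the projection step (the parameters and the crude median-split stand-in for the naïve algorithm are my assumptions, not the lecture's code): project the samples onto the span of the top-$k$ right singular vectors, then separate in $k$ dimensions instead of $d$.

```python
import numpy as np

rng = np.random.default_rng(5)
d, k, m = 100, 2, 500
centers = np.zeros((k, d))
centers[1, 0] = 8.0                               # well-separated centers
X = np.vstack([c + rng.standard_normal((m, d)) for c in centers])

_, _, Vt = np.linalg.svd(X, full_matrices=False)
P = X @ Vt[:k].T                                  # project to top-k subspace
labels = (P[:, 0] > np.median(P[:, 0])).astype(int)  # naive split in k dims
truth = np.repeat([0, 1], m)
print(max(np.mean(labels == truth), np.mean(labels != truth)))  # near 1.0
```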

Separating mixture of $k$ Gaussians

• Easy fact: Projection preserves the property of being a unit-variance spherical Gaussian

• Def. If $p$ is a probability distribution, the best fit line $\{cv : c \in \mathbb{R}\}$ is given by:

$v = \arg\max_{\|v\| = 1} \mathbb{E}_{x \sim p}\left[ (v^T x)^2 \right]$

• Thm: The best fit line for a Gaussian centered at $\mu$ passes through $\mu$ and the origin

Best fit line for a Gaussian

• Thm: The best fit line for a Gaussian centered at $\mu$ passes through $\mu$ and the origin

$\mathbb{E}_{x \sim p}\left[ (v^T x)^2 \right] = \mathbb{E}_{x \sim p}\left[ \left( v^T (x - \mu) + v^T \mu \right)^2 \right]$

$= \mathbb{E}_{x \sim p}\left[ \left( v^T (x - \mu) \right)^2 + 2 (v^T \mu)\, v^T (x - \mu) + (v^T \mu)^2 \right]$

$= \mathbb{E}_{x \sim p}\left[ \left( v^T (x - \mu) \right)^2 \right] + 2 (v^T \mu)\, \mathbb{E}_{x \sim p}\left[ v^T (x - \mu) \right] + (v^T \mu)^2$

$= \mathbb{E}_{x \sim p}\left[ \left( v^T (x - \mu) \right)^2 \right] + (v^T \mu)^2$

$= \sigma^2 + (v^T \mu)^2$

• Where we used:

– $\mathbb{E}_{x \sim p}\left[ v^T (x - \mu) \right] = 0$

– $\mathbb{E}_{x \sim p}\left[ \left( v^T (x - \mu) \right)^2 \right] = \sigma^2$

• The best fit line maximizes $(v^T \mu)^2$, so it is the line through $\mu$ and the origin
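A quick numerical illustration (assumed setup, added for concreteness) of the identity $\mathbb{E}[(v^T x)^2] = \sigma^2 + (v^T \mu)^2$ for a spherical Gaussian:

```python
import numpy as np

rng = np.random.default_rng(6)
d, sigma = 10, 1.0
mu = rng.standard_normal(d)                          # arbitrary center
v = rng.standard_normal(d); v /= np.linalg.norm(v)   # arbitrary unit direction
x = mu + sigma * rng.standard_normal((100_000, d))   # spherical Gaussian samples
print(np.mean((x @ v) ** 2), sigma**2 + (v @ mu) ** 2)  # nearly equal
```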

Best fit subspace for one Gaussian

• Best fit $k$-dimensional subspace $V_k$:

$V_k = \arg\max_{V : \dim(V) = k} \mathbb{E}_{x \sim p}\left[ \|proj(x, V)\|_2^2 \right]$

• For a spherical Gaussian, $V$ is a best-fit $k$-dimensional subspace iff it contains $\mu$

• If $\mu = 0$ then any $k$-dim. subspace is a best fit

• If $\mu \ne 0$ then the best fit line $v$ goes through $\mu$

– The same greedy process as SVD projects on $v$

– After projection we have a Gaussian with $\mu = 0$

– Any $(k-1)$-dimensional subspace would then do

Best fit subspace for $k$ Gaussians

• Thm. $p$ is a mixture of $k$ spherical Gaussians ⇒ the best fit $k$-dim. subspace contains their centers

• $p = w_1 p_1 + w_2 p_2 + \cdots + w_k p_k$

• Let $V$ be a subspace of dimension $\le k$; then (spelled out below):

$\mathbb{E}_{x \sim p}\left[ \|proj(x, V)\|_2^2 \right] = \sum_{i=1}^{k} w_i \, \mathbb{E}_{x \sim p_i}\left[ \|proj(x, V)\|_2^2 \right]$

• Each term is maximized if $V$ contains all the $\mu_i$'s

• If we only have a finite number of samples then the accuracy has to be analyzed carefully
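The mixture identity above is just linearity of expectation; spelled out with $f(x) = \|proj(x, V)\|_2^2$:

```latex
\[
\mathbb{E}_{x \sim p}[f(x)]
= \int f(x) \sum_{i=1}^{k} w_i\, p_i(x)\, dx
= \sum_{i=1}^{k} w_i \int f(x)\, p_i(x)\, dx
= \sum_{i=1}^{k} w_i\, \mathbb{E}_{x \sim p_i}[f(x)].
\]
```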

HITS Algorithm for Hubs and Authorities

• Document ranking: project on the 1st singular vector

• WWW: directed graph with links = edges

• $n$ authorities: pages containing original info

• $d$ hubs: collections of links to authorities

– Authority depends on the importance of the pointing hubs

– Hub quality depends on how authoritative its links are

• Authority vector $v_j$, $j = 1, \ldots, n$: $v_j \sim \sum_{i=1}^{d} u_i A_{ij}$

• Hub vector $u_i$, $i = 1, \ldots, d$: $u_i \sim \sum_{j=1}^{n} v_j A_{ij}$

• Use the power method: $u = Av$, $v = A^T u$

• Converges to the first left/right singular vectors
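A minimal HITS sketch (the tiny link matrix is an assumed example): alternating $v = A^T u$, $u = Av$ with normalization is exactly the power method, so the scores converge to the top left/right singular vectors of $A$.

```python
import numpy as np

A = np.array([[1, 1, 0],    # A[i, j] = 1 if hub i links to authority j
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
u = np.ones(A.shape[0])     # initial hub scores
for _ in range(100):
    v = A.T @ u; v /= np.linalg.norm(v)   # authority update: v = A^T u
    u = A @ v;  u /= np.linalg.norm(u)    # hub update:       u = A v
print(u, v)   # approximations to the first left/right singular vectors
```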

Exercises

• Ex. 1: $A$ is an $n \times n$ matrix with orthonormal rows

– Show that it also has orthonormal columns

• Ex. 2: Interpret the left and right singular vectors of the document × term matrix

• Ex. 3: Use the power method to compute the singular values of the matrix:

$\begin{pmatrix} 1 & 2 \\ 3 & 4 \end{pmatrix}$