Randomized SVD, CUR Decomposition,and SPSD Matrix Approximation
Shusen Wang
Outline
• CX Decomposition & Approximate SVD
• CUR Decomposition
• SPSD Matrix Approximation
CX Decomposition
• Given any matrix A ∈ ℝ^{m×n}
• The CX decomposition of A:
  1. Sketching: C = AP ∈ ℝ^{m×c}
  2. Find X such that A ≈ CX
• E.g. X⋆ = argmin_X ‖A − CX‖_F² = C†A
• It costs O(mnc) time
• CX decomposition ⇔ approximate SVD:
  A ≈ CX = U_C Σ_C V_C^T X = U_C Z = U_C U_Z Σ_Z V_Z^T
CX Decomposition
• Let the sketching matrix P ∈ ℝ^{n×c} be defined in the table.
• min_{rank(X)≤k} ‖A − CX‖_F² ≤ (1 + ε) ‖A − A_k‖_F² holds for the following sketch sizes:
  • Uniform sampling: c ≥ O(νk log k + k/ε)
  • Leverage score sampling: c = O(k log k + k/ε)
  • Gaussian projection: c = O(k/ε)
  • SRHT: c = O((k + log n) log k + k/ε)
  • Count sketch: c = O(k² + k/ε)
• ν is the column coherence of A_k
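As a concrete illustration of the CX decomposition above, here is a minimal NumPy sketch using uniform column sampling as the sketching matrix P. The matrix sizes, rank, and seed are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, c = 100, 80, 20

# Synthetic rank-10 test matrix (illustrative data).
A = rng.standard_normal((m, 10)) @ rng.standard_normal((10, n))

# Sketching: C = A P, with P a uniform column-selection matrix.
idx = rng.choice(n, size=c, replace=False)
C = A[:, idx]                        # shape (m, c)

# X* = argmin_X ||A - C X||_F^2 = C^+ A, costing O(mnc).
X = np.linalg.pinv(C) @ A            # shape (c, n)

err = np.linalg.norm(A - C @ X) / np.linalg.norm(A)
print(err)
```

Since rank(A) = 10 ≤ c here, the sampled columns generically span the column space of A and the error is essentially zero.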
CX Decomposition ⇔ Approximate SVD
• CX decomposition ⇔ approximate SVD:
  A ≈ CX = U_C Σ_C V_C^T X = U_C Z = U_C U_Z Σ_Z V_Z^T
• Step 1. SVD: C = U_C Σ_C V_C^T ∈ ℝ^{m×c}. Time cost: O(mc²)
• Step 2. Let Z = Σ_C V_C^T X ∈ ℝ^{c×n}. Time cost: O(nc²)
• Step 3. SVD: Z = U_Z Σ_Z V_Z^T ∈ ℝ^{c×n}. Time cost: O(nc²)
• Step 4. Multiply U_C U_Z. Time cost: O(mc²)
  • U_C U_Z: m×s matrix with orthonormal columns
  • Σ_Z: diagonal matrix
  • V_Z^T: s×n matrix with orthonormal rows
• Done! Approximate rank-c SVD: A ≈ (U_C U_Z) Σ_Z V_Z^T
• Total time cost: O(mc² + nc² + nc² + mc²) = O(mc² + nc²)
• Given A ∈ ℝ^{m×n} and C ∈ ℝ^{m×c}, the approximate SVD costs
  • O(mnc) time (dominated by computing X = C†A)
  • O(mc + nc) memory
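The four steps above can be sketched in NumPy as follows; only thin SVDs of the small matrices C and Z are computed, and the resulting factors reproduce CX exactly by construction. Test sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, c = 200, 150, 15
A = rng.standard_normal((m, 8)) @ rng.standard_normal((8, n))
C = A[:, rng.choice(n, size=c, replace=False)]
X = np.linalg.pinv(C) @ A

# Step 1: thin SVD of C, O(mc^2).
Uc, Sc, Vct = np.linalg.svd(C, full_matrices=False)
# Step 2: Z = Sigma_C V_C^T X, O(nc^2).
Z = (Sc[:, None] * Vct) @ X
# Step 3: thin SVD of Z, O(nc^2).
Uz, Sz, Vzt = np.linalg.svd(Z, full_matrices=False)
# Step 4: U = U_C U_Z has orthonormal columns, O(mc^2).
U = Uc @ Uz

# A ≈ U diag(Sz) Vzt -- identical to C X by construction.
gap = np.linalg.norm(U @ (Sz[:, None] * Vzt) - C @ X)
print(gap)
```

The expensive O(mnc) step is computing X; everything after it touches only m×c and c×n matrices.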
CX Decomposition
• The CX decomposition of A ∈ ℝ^{m×n}
• Optimal solution: X⋆ = argmin_X ‖A − CX‖_F² = C†A
• How to make it more efficient? It is a regression problem!
Fast CX Decomposition
• Fast CX [Drineas, Mahoney, Muthukrishnan, 2008] [Clarkson & Woodruff, 2013]
• Draw another sketching matrix S ∈ ℝ^{m×s}
• Compute X̃ = argmin_X ‖S^T(A − CX)‖_F² = (S^T C)†(S^T A)
• Time cost: O(ncs) + TimeOfSketch
• When s = Õ(c/ε),
  ‖A − CX̃‖_F² ≤ (1 + ε) · min_X ‖A − CX‖_F²
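A minimal sketch of fast CX with a Gaussian sketching matrix S (the sizes, the noise level, and the choice of Gaussian projection over other sketches are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
m, n, c, s = 500, 300, 20, 80
A = rng.standard_normal((m, 12)) @ rng.standard_normal((12, n)) \
    + 0.01 * rng.standard_normal((m, n))
C = A[:, rng.choice(n, size=c, replace=False)]

# Gaussian sketching matrix S in R^{m x s}.
S = rng.standard_normal((m, s)) / np.sqrt(s)

# X_tilde = (S^T C)^+ (S^T A): only an s x n sketched regression is solved.
X_tilde = np.linalg.pinv(S.T @ C) @ (S.T @ A)

opt = np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A))
fast = np.linalg.norm(A - C @ X_tilde)
print(opt, fast)
```

With s = 4c, the sketched solution is typically within a small factor of the optimal residual, as the (1 + ε) bound suggests.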
Outline
• CX Decomposition & Approximate SVD
• CUR Decomposition
• SPSD Matrix Approximation
CUR Decomposition
• Sketching:
  • C = A P_C ∈ ℝ^{m×c}
  • R = P_R^T A ∈ ℝ^{r×n}
• Find U such that CUR ≈ A
• CUR ⇔ approximate SVD, in the same way as “CX ⇔ approximate SVD”
• 3 types of U
CUR Decomposition
• Type 1 [Drineas, Mahoney, Muthukrishnan, 2008]: U = (P_R^T A P_C)†
• Recall the fast CX decomposition:
  A ≈ CX̃ = C (P_R^T C)† (P_R^T A) = CUR
• They’re equivalent: CX̃ = CUR
• Require c = Õ(k/ε) and r = Õ(c/ε) such that
  ‖A − CUR‖_F² ≤ (1 + ε) ‖A − A_k‖_F²
• Efficient: O(rc²) + TimeOfSketch
• Loose bound: sketch size ∝ ε⁻²
• Bad empirical performance
CUR Decomposition
• Type 2: Optimal CUR
  U⋆ = argmin_U ‖A − CUR‖_F² = C† A R†
• Theory [W & Zhang, 2013], [Boutsidis & Woodruff, 2014]:
  • C and R are selected by the adaptive sampling algorithm
  • c = O(k/ε) and r = O(k/ε)
  • ‖A − CU⋆R‖_F² ≤ (1 + ε) ‖A − A_k‖_F²
• Inefficient: O(mnc) + TimeOfSketch
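The optimal (Type 2) rule U⋆ = C†AR† can be sketched directly; uniform sampling of columns and rows is used below as a simplifying assumption (the theory calls for adaptive sampling):

```python
import numpy as np

rng = np.random.default_rng(3)
m, n, c, r = 120, 100, 25, 25
A = rng.standard_normal((m, 10)) @ rng.standard_normal((10, n))

C = A[:, rng.choice(n, size=c, replace=False)]   # sampled columns
R = A[rng.choice(m, size=r, replace=False), :]   # sampled rows

# Optimal U = C^+ A R^+, costing O(mnc).
U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)

err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
print(err)
```

Here rank(A) = 10 ≤ min(c, r), so C and R generically capture the column and row spaces of A and the reconstruction is essentially exact.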
CUR Decomposition
• Type 3: Fast CUR [W, Zhang, Zhang, 2015]
• Draw 2 sketching matrices S_C ∈ ℝ^{m×s_c} and S_R ∈ ℝ^{n×s_r}
• Solve the problem
  Ũ = argmin_U ‖S_C^T (A − CUR) S_R‖_F² = (S_C^T C)† (S_C^T A S_R) (R S_R)†
• Intuition: the optimal U is obtained from the optimization problem
  U⋆ = argmin_U ‖CUR − A‖_F²;
  approximately solve it by sketching (e.g. by column selection), then solve the small-scale problem.
• Theory:
  • s_c = O(c/ε) and s_r = O(r/ε)
  • ‖A − CŨR‖_F² ≤ (1 + ε) · min_U ‖A − CUR‖_F²
• Efficient: O(s_c s_r (c + r)) + TimeOfSketch
• Good empirical performance
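A sketch of fast (Type 3) CUR with Gaussian S_C and S_R: only the small sketched problem is solved. The sizes s_c = 4c and s_r = 4r, the synthetic data, and the use of Gaussian sketches are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)
m, n, c, r = 300, 250, 20, 20
sc, sr = 4 * c, 4 * r
A = rng.standard_normal((m, 10)) @ rng.standard_normal((10, n)) \
    + 0.01 * rng.standard_normal((m, n))

C = A[:, rng.choice(n, size=c, replace=False)]
R = A[rng.choice(m, size=r, replace=False), :]

# Gaussian sketches S_C (m x s_c) and S_R (n x s_r).
Sc = rng.standard_normal((m, sc)) / np.sqrt(sc)
Sr = rng.standard_normal((n, sr)) / np.sqrt(sr)

# U_tilde = (S_C^T C)^+ (S_C^T A S_R) (R S_R)^+ : a small-scale problem.
U_tilde = np.linalg.pinv(Sc.T @ C) @ (Sc.T @ A @ Sr) @ np.linalg.pinv(R @ Sr)

opt = np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A @ np.linalg.pinv(R)) @ R)
fast = np.linalg.norm(A - C @ U_tilde @ R)
print(opt, fast)
```

The sketched residual is typically close to the optimal one, while A is touched only through the s_c×s_r sketch S_C^T A S_R.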
[Figure: CUR approximations of an image matrix A (m = 1920, n = 1168), with C and R formed from c = r = 100 columns/rows chosen by uniform sampling. Panels: Original; Type 2: Optimal CUR; Type 1: Fast CX; Type 3: Fast CUR with s_c = 2c, s_r = 2r; Type 3: Fast CUR with s_c = 4c, s_r = 4r.]
Conclusions
• Approximate truncated SVD
• CX decomposition
• CUR decomposition (3 types)
  • Fast CUR is the best
Outline
• CX Decomposition & Approximate SVD
• CUR Decomposition
• SPSD Matrix Approximation
Motivation 1: Kernel Matrix
• Given n samples x₁, …, x_n ∈ ℝ^d and a kernel function κ(·,·).
• E.g. the Gaussian RBF kernel
  κ(x_i, x_j) = exp(−‖x_i − x_j‖₂² / σ²).
• Computing the kernel matrix K ∈ ℝ^{n×n}
  • where k_ij = κ(x_i, x_j)
  • costs O(n²d) time
Motivation 2: Matrix Inversion
• Solve the linear system (K + αI_n) w = y to find w ∈ ℝ^n.
  • K ∈ ℝ^{n×n} is the kernel matrix
  • y = (y₁, …, y_n) ∈ ℝ^n contains the labels
• Solution: w⋆ = (K + αI_n)⁻¹ y
• It costs
  • O(n³) time
  • O(n²) memory.
• Performed by
  • Gaussian process regression (equivalently, kernel ridge regression)
  • Least squares kernel SVM
Motivation 3: Eigenvalue Decomposition
• Find the top k (≪ n) eigenvectors of K.
• It costs
  • Õ(n²k) time
  • O(n²) memory.
• Performed by
  • Kernel PCA (k is the target rank)
  • Manifold learning (k is the target rank)
Computational Challenges
• Time costs
  • Computing the kernel matrix: O(n²d)
  • Matrix inversion: O(n³)
  • Rank-k eigenvalue decomposition: Õ(n²k)
  • At least quadratic time!
• Memory costs
  • Inversion and eigenvalue decomposition: O(n²)
  • Because the numerical algorithms are pass-inefficient ⇒ form K and keep it in memory
• When n = 10⁵, the n×n matrix costs 80 GB of memory!
How to Speed Up?
• Efficiently form the low-rank approximation K ≈ C U C^T
• Equivalently, K ≈ L L^T with L ∈ ℝ^{n×c}
Efficient Matrix Inversion
• Approximately solve the linear system (K + αI_n) w = y
• Replace K by LL^T:
  w⋆ = (K + αI_n)⁻¹ y ≈ (LL^T + αI_n)⁻¹ y
• Expand the inverse by the Woodbury identity
  (A + BCD)⁻¹ = A⁻¹ − A⁻¹B (C⁻¹ + DA⁻¹B)⁻¹ DA⁻¹:
  w⋆ ≈ α⁻¹y − α⁻¹L (αI + L^T L)⁻¹ L^T y
• Time cost: O(nc²), linear in n, much better than O(n³)
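The Woodbury-based solve can be sketched as follows; for checking purposes K is taken to be exactly LL^T, so the O(nc²) fast solution must agree with a direct O(n³) solve. Sizes and α are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(6)
n, c, alpha = 500, 30, 0.1
L = rng.standard_normal((n, c))
K = L @ L.T                      # here K = L L^T exactly, for checking
y = rng.standard_normal(n)

# Direct solve: O(n^3) time, O(n^2) memory.
w_direct = np.linalg.solve(K + alpha * np.eye(n), y)

# Woodbury: w = a^{-1} y - a^{-1} L (a I_c + L^T L)^{-1} L^T y : O(n c^2).
t = np.linalg.solve(alpha * np.eye(c) + L.T @ L, L.T @ y)
w_fast = (y - L @ t) / alpha

gap = np.linalg.norm(w_direct - w_fast)
print(gap)
```

Only the c×c system (αI_c + L^T L) is ever factored, so the cost is linear in n.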
Efficient Eigenvalue Decomposition
• Approximately compute the k-eigenvalue decomposition of K
  • SVD: L = U_L Σ_L V_L^T
  • K ≈ LL^T = U_L Σ_L² U_L^T
• Approximate k-eigenvalue decomposition of K:
  • eigenvectors: the first k columns of U_L
  • eigenvalues: the first k diagonal entries of Σ_L²
• Time cost: O(nc²), much lower than Õ(n²k)
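A sketch of this shortcut: one thin SVD of L yields the approximate top-k eigenpairs of K ≈ LL^T (sizes are arbitrary; the direct eigendecomposition appears only to verify the result):

```python
import numpy as np

rng = np.random.default_rng(7)
n, c, k = 400, 30, 5
L = rng.standard_normal((n, c))

# Thin SVD of L: O(n c^2).
UL, SL, _ = np.linalg.svd(L, full_matrices=False)

# Approximate top-k eigenpairs of K ~ L L^T:
# eigenvectors = first k columns of U_L,
# eigenvalues  = squares of the first k singular values of L.
eigvecs, eigvals = UL[:, :k], SL[:k] ** 2

# Check against a direct O(n^3) eigendecomposition of L L^T (testing only).
direct = np.linalg.eigvalsh(L @ L.T)[::-1][:k]
gap = np.max(np.abs(direct - eigvals))
print(gap)
```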
Sketching Based Models
• How to find such an approximation K ≈ C U C^T?
• Sketching based methods: C = KS ∈ ℝ^{n×c} is a sketch of K.
  • S ∈ ℝ^{n×c} can be a column selection or random projection matrix
• Three methods:
  • The prototype model [HMT11, WZ13, WLZ16]
  • The fast model [WZZ15]
  • The Nyström method [WS15, GM13]
The Prototype Model
• Objective: K ≈ C U C^T
• Minimize the approximation error by
  U⋆ = argmin_U ‖K − C U C^T‖_F² = C† K (C†)^T.
• Extension of the randomized SVD to SPSD matrices [HMT11]
• Time: O(n²c)
  • The time complexity is nearly the same as the k-eigenvalue decomposition.
  • It is much faster than the k-eigenvalue decomposition in practice.
• #Passes: one
• Memory: O(nc)
  • Put k_ij in memory only when it is visited
  • Keep C† in memory
• Error bound:
  • k ≪ n is an arbitrary integer
  • P samples c = O(k/ε) columns by adaptive sampling
  • 𝔼 ‖K − CU⋆C^T‖_F² ≤ (1 + ε) ‖K − K_k‖_F²
• Limitations:
  • U⋆ = C† K (C†)^T costs O(n²c) time
  • Requires observing the whole of K
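A minimal NumPy sketch of the prototype model, with S a uniform column-selection matrix (adaptive sampling is omitted as a simplification); K is a synthetic low-rank SPSD matrix, so the approximation is essentially exact:

```python
import numpy as np

rng = np.random.default_rng(8)
n, c = 300, 40
Z = rng.standard_normal((n, 15))
K = Z @ Z.T                          # SPSD, rank 15

idx = rng.choice(n, size=c, replace=False)
C = K[:, idx]                        # C = K S, S a column-selection matrix

# Prototype model: U* = C^+ K (C^+)^T, costing O(n^2 c).
Cp = np.linalg.pinv(C)
U = Cp @ K @ Cp.T

err = np.linalg.norm(K - C @ U @ C.T) / np.linalg.norm(K)
print(err)
```

Note the whole of K is read once to form U⋆, which is exactly the limitation stated above.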
The Fast Model
• Column/row selection: form P^T K P and P^T C
• K ≈ C Ũ C^T, where Ũ = argmin_U ‖P^T (K − C U C^T) P‖_F².
The Fast Model
• Prototype model: U⋆ = argmin_U ‖K − C U C^T‖_F² = C† K (C†)^T
• Fast model: Ũ = argmin_U ‖P^T (K − C U C^T) P‖_F² = (P^T C)† (P^T K P) (C^T P)†.
• Theory:
  • p = O(√(nc)/ε)
  • P is a column selection matrix (according to the row leverage scores of C)
  • Then ‖K − CŨC^T‖_F² ≤ (1 + ε) ‖K − CU⋆C^T‖_F²
• The fast model is nearly as good as the prototype model!
• Overall time cost: O(p²c + nc²) = O(nc²/ε²), linear in n
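A sketch of the fast model, compared against the prototype model. Uniform selection of the p rows/columns is a simplifying assumption here (the theory calls for leverage-score sampling of the rows of C), and the synthetic K is incoherent enough for uniform sampling to behave well.

```python
import numpy as np

rng = np.random.default_rng(10)
n, c, p = 400, 30, 120
Z = rng.standard_normal((n, 15))
K = Z @ Z.T + 0.01 * np.eye(n)        # SPSD with a small full-rank tail

idx = rng.choice(n, size=c, replace=False)
C = K[:, idx]

# P: uniform selection of p indices (assumption; theory uses leverage scores).
pidx = rng.choice(n, size=p, replace=False)
PC = C[pidx, :]                       # P^T C
PKP = K[np.ix_(pidx, pidx)]           # P^T K P

# Fast model: U_tilde = (P^T C)^+ (P^T K P) (C^T P)^+.
U_tilde = np.linalg.pinv(PC) @ PKP @ np.linalg.pinv(PC.T)

# Prototype model for reference, O(n^2 c).
Cp = np.linalg.pinv(C)
U_star = Cp @ K @ Cp.T

proto = np.linalg.norm(K - C @ U_star @ C.T)
fast = np.linalg.norm(K - C @ U_tilde @ C.T)
print(proto, fast)
```

Only the p×p block P^T K P and the p×c block P^T C are needed to form Ũ, which is what makes the model linear in n.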
The Nyström Method
• S (n×c): column selection matrix
• C = KS (n×c), W = S^T K S = S^T C (c×c)
• The Nyström method: K ≈ C W† C^T
• New explanation:
  • Recall the fast model: X̃ = argmin_X ‖P^T (K − C X C^T) P‖_F²
  • Setting P = S, then
    X̃ = argmin_X ‖S^T (K − C X C^T) S‖_F² = (S^T C)† (S^T K S) (C^T S)† = W† W W† = W†
  • The Nyström method is a special instance of the fast model.
  • It is an approximate solution to the prototype model.
• Cost:
  • Time: O(nc²)
  • Memory: O(nc)
  • Very efficient!
• Error bound: weak
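A minimal Nyström sketch. Note that only c columns of K are ever needed; here K is formed explicitly only to measure the error, and W's pseudo-inverse is truncated for numerical stability (an implementation choice, not from the slides).

```python
import numpy as np

rng = np.random.default_rng(9)
n, c = 300, 40
Z = rng.standard_normal((n, 15))
K = Z @ Z.T                          # SPSD, rank 15 <= c

idx = rng.choice(n, size=c, replace=False)
C = K[:, idx]                        # C = K S : c columns of K
W = C[idx, :]                        # W = S^T K S = S^T C

# Nystrom: K ~ C W^+ C^T. Truncating tiny singular values of W avoids
# blow-up when W is numerically rank-deficient.
K_nys = C @ np.linalg.pinv(W, rcond=1e-10) @ C.T

err = np.linalg.norm(K - K_nys) / np.linalg.norm(K)
print(err)
```

Since rank(K) ≤ c and W generically has the same rank as K, the Nyström approximation is exact here; on full-rank kernel matrices it only approximates, with the weak error bound noted above.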
Comparisons
• C = KS ∈ ℝ^{n×c}, W = S^T K S = S^T C ∈ ℝ^{c×c}
• SPSD matrix approximation: K ≈ C U C^T
  • The prototype model: U = C† K (C†)^T
  • The fast model: U = (P^T C)† (P^T K P) (C^T P)†
  • The Nyström method: U = W†
• When P = I_n, the prototype model ⇔ the fast model
• When P = S, the Nyström method ⇔ the fast model
Comparisons
• c = 150, n = 100c; vary p from 2c to 40c
• [Figure: relative error ‖K − C U C^T‖_F² / ‖K‖_F² vs. p/c for the Nyström method (O(nc²) time), the fast model (O(nc² + p²c) time), and the prototype model (O(n²c) time).]
Conclusions
• Motivations:
  • Avoid forming the kernel matrix
  • Avoid inversion/decomposition
• Prototype model, fast model, Nyström
  • They have connections
  • The fast model and Nyström are practical