Low-rank matrix approximations in Python by Christian Thurau, PyData 2014
DESCRIPTION
Low-rank approximations of data matrices have become an important tool in machine learning and data mining. They embed high-dimensional data in lower-dimensional spaces and can therefore mitigate the effects of noise, uncover latent relations, or facilitate further processing. These properties have proven useful in many application areas, such as bioinformatics, computer vision, text processing, recommender systems, and social network analysis. Present-day technologies are characterized by exponentially growing amounts of data. Recent advances in sensor technology, internet applications, and communication networks call for methods that scale to very large and/or growing data matrices. In this talk, we will describe how to efficiently analyze data by means of matrix factorization using the Python Matrix Factorization Toolbox (PyMF) and HDF5. We will briefly cover common methods such as k-means clustering, PCA, and Archetypal Analysis, which can easily be cast as matrix decompositions, and explain their usefulness for everyday data analysis tasks.
TRANSCRIPT
![Page 1: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/1.jpg)
Low-rank matrix approximations with Python
Christian Thurau
![Page 2: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/2.jpg)
Table of Contents
1 Intro
2 The Basics
3 Matrix approximation
4 Some methods
5 Matrix Factorization with Python
6 Example & Conclusion
2
![Page 3: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/3.jpg)
For Starters...
Observations
• Data matrix factorization has become an important tool in information retrieval, data mining, and pattern recognition
• Nowadays, typical data matrices are HUGE
• Examples include:
  • Gene expression data and microarrays
  • Digital images
  • Term-by-document matrices
  • User ratings for movies, products, ...
  • Graph adjacency matrices
3
![Page 4: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/4.jpg)
Matrix Factorization
• given a matrix
V
• determine matrices
W and H
• such that
V = WH or V ≈ WH
• characteristics such as entries, shape, and rank of V, W, and H will depend on the application context
4
![Page 5: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/5.jpg)
The Basics
matrix factorization allows for:
• solving linear equations
• transforming data
• compressing data
matrix factorization facilitates subsequent processing in:
• information retrieval
• pattern recognition
• data mining
5
![Page 6: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/6.jpg)
Low-rank Matrix Approximations
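• Approximate V
$$V \approx WH$$
• where
$$V \in \mathbb{R}^{m \times n}, \quad W \in \mathbb{R}^{m \times k}, \quad H \in \mathbb{R}^{k \times n}$$
• and
$$\operatorname{rank}(W) \ll \operatorname{rank}(V), \quad k \ll \min(m, n)$$
[Figure: block diagram of V = WH]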
6
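The shape and rank conditions above can be checked numerically. The following is a minimal NumPy sketch, not PyMF code; the sizes `m`, `n`, `k` and the random factors are assumptions chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 200, 100, 5            # illustrative sizes with k << min(m, n)

W = rng.random((m, k))           # m x k "basis" matrix
H = rng.random((k, n))           # k x n coefficient matrix
V_approx = W @ H                 # m x n product, rank at most k

print(V_approx.shape)
print(np.linalg.matrix_rank(V_approx))

# Storing the two factors takes far fewer numbers than storing the full matrix.
stored_full = m * n
stored_factors = m * k + k * n
print(stored_full, stored_factors)
```

This is where the compression aspect comes from: the factors hold m·k + k·n numbers instead of m·n.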
![Page 7: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/7.jpg)
Matrix Approximation
• If
$$V = WH$$
• then
$$v_{i,j} = w_{i,*} h_{*,j} = \sum_{x=1}^{k} w_{i,x} h_{x,j}$$
[Figure: block diagram of V = WH]
7
![Page 8: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/8.jpg)
Matrix Approximation
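• More importantly:
$$v_{*,j} = W h_{*,j} = \sum_{x=1}^{k} w_{*,x} h_{x,j}$$
• therefore
W ↔ "basis" matrix
H ↔ coefficient matrix
[Figure: block diagram of V = WH, with one column of V shown as a sum of weighted columns of W]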
8
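Both readings of the product, entry-wise and column-wise, can be verified in a few lines. This is a small NumPy sketch with arbitrary random factors (the sizes and the indices `i`, `j` are assumptions made for the example).

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 6, 4, 3
W = rng.random((m, k))
H = rng.random((k, n))
V = W @ H

i, j = 2, 1
# Entry view: v_ij is the inner product of row i of W and column j of H.
entry = sum(W[i, x] * H[x, j] for x in range(k))
# Column view: column j of V is a weighted sum of the columns of W,
# with the weights taken from column j of H.
column = sum(H[x, j] * W[:, x] for x in range(k))

print(np.isclose(V[i, j], entry), np.allclose(V[:, j], column))
```

The column view is what justifies calling W the basis and H the coefficients: each data point is rebuilt from the same k basis vectors.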
![Page 9: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/9.jpg)
On Matrix Factorization Methods
• matrix factorization ↔ data transformation
• matrix rank reduction ↔ data compression
• Common form: V = WH
• Broad range of methods:
  • K-means clustering
  • SVD/PCA
  • Non-negative Matrix Factorization
  • Archetypal Analysis
  • Binary matrix factorization
  • CUR decomposition
  • ...
• Each method yields a unique view on data . . .
• . . . and is suited for different tasks
9
![Page 10: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/10.jpg)
K-means Clustering¹
• Baseline clustering method
• Constrained quadratic optimization problem:
$$\min_{W,H} \lVert V - WH \rVert^2 \quad \text{s.t.} \quad h_{k,i} \in \{0, 1\}, \quad \sum_k h_{k,i} = 1$$
• Find W, H using expectation maximization
• Optimal k-means partitioning is NP-hard
• Goal: group similar data points
• Interesting: K-means clustering is matrix factorization
¹ J. B. MacQueen, "Some Methods for Classification and Analysis of Multivariate Observations", Berkeley Symposium on Mathematical Statistics and Probability, 1967
10
![Page 11: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/11.jpg)
K-means Clustering is Matrix Factorization!
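$$
\underbrace{\begin{bmatrix}
x_{1,1} & x_{1,2} & x_{1,3} & \cdots & x_{1,n} \\
x_{2,1} & x_{2,2} & x_{2,3} & \cdots & x_{2,n} \\
x_{3,1} & x_{3,2} & x_{3,3} & \cdots & x_{3,n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
x_{m,1} & x_{m,2} & x_{m,3} & \cdots & x_{m,n}
\end{bmatrix}}_{X}
\underbrace{\begin{bmatrix}
b_{1,1} & b_{1,2} & b_{1,3} \\
b_{2,1} & b_{2,2} & b_{2,3} \\
b_{3,1} & b_{3,2} & b_{3,3} \\
\vdots & \vdots & \vdots \\
b_{n,1} & b_{n,2} & b_{n,3}
\end{bmatrix}}_{B}
\underbrace{\begin{bmatrix}
0 & 1 & 1 & \cdots & 0 \\
1 & 0 & 0 & \cdots & 0 \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}}_{A}
$$
• i.e. for $X \in \mathbb{R}^{m \times n}$, $B \in \mathbb{R}^{n \times 3}$, and $A \in \mathbb{R}^{3 \times n}$ as above, the product
$$XBA = MA$$
realizes an assignment
$$x_i \to m_j, \quad \text{where } m_j = X b_j$$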
11
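The product XBA can be made concrete with a few alternating k-means steps written in matrix form. This is a minimal NumPy sketch rather than PyMF's implementation; the synthetic blobs, the number of iterations, and the choice to initialize the centroids with one point from each blob are assumptions made for the example. In the notation of the earlier slides, W corresponds to M = XB and H corresponds to A.

```python
import numpy as np

rng = np.random.default_rng(0)
# Three well-separated 2-D blobs; the columns of X are the data points.
centers = np.array([[0.0, 5.0, 0.0],
                    [0.0, 0.0, 5.0]])
labels_true = np.repeat(np.arange(3), 20)
X = centers[:, labels_true] + 0.3 * rng.standard_normal((2, 60))
n, k = X.shape[1], 3

# Assumed initialization: one data point from each blob as starting centroid.
M = X[:, [0, 20, 40]].copy()
for _ in range(10):
    # Assignment step: squared distance of every point to every centroid (k x n).
    d = ((X[:, None, :] - M[:, :, None]) ** 2).sum(axis=0)
    labels = d.argmin(axis=0)
    A = np.zeros((k, n))
    A[labels, np.arange(n)] = 1.0      # binary, each column sums to 1
    # Update step: column j of B averages the members of cluster j, so M = X B.
    B = A.T / A.sum(axis=1)            # n x k
    M = X @ B

X_approx = X @ B @ A                   # every point replaced by its cluster mean
print(np.allclose(X_approx, M[:, labels]))
```

The sketch does not handle empty clusters, which would cause a division by zero in the update step; a real implementation needs a rule for that case.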
![Page 12: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/12.jpg)
Example: K-means
[Figure: an image approximated as a weighted sum of cluster means with coefficients 0.0, 0.0, ..., 1.0, ..., 0.0]
• Similar images are grouped into k groups
• Approximate data by mapping each data point onto the mean of a cluster region
12
![Page 13: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/13.jpg)
Python Matrix Factorization Toolbox (PyMF)²
• Started in 2010 at Fraunhofer IAIS/University of Bonn
• Vast number of different methods!
• Supports hdf5/h5py and sparse matrices
How to factorize a data matrix V:
>>> import pymf
>>> import numpy as np
>>> data = np.array([[1.0, 0.0, 2.0], [0.0, 1.0, 1.0]])
>>> mdl = pymf.kmeans.Kmeans(data, num_bases=2)
>>> mdl.factorize(niter=10)  # optimize W and H
>>> V_approx = np.dot(mdl.W, mdl.H)  # V ≈ WH
² http://github.com/cthurau/pymf
13
![Page 14: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/14.jpg)
Python Matrix Factorization Toolbox (PyMF)²
• Restarted development a few weeks back ;)
• Looking for contributors!
How to map data onto W:
>>> import pymf
>>> import numpy as np
>>> test_data = np.array([[1.0], [0.3]])
>>> mdl_test = pymf.kmeans.Kmeans(test_data, num_bases=2)
>>> mdl_test.W = mdl.W  # mdl.W -> existing basis W
>>> mdl_test.factorize(compute_w=False)  # keep W fixed, fit H only
>>> test_data_approx = np.dot(mdl.W, mdl_test.H)
² http://github.com/cthurau/pymf
14
![Page 15: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/15.jpg)
PCA
Principal Component Analysis (PCA)³
• SVD/PCA are baseline matrix factorization methods
• Optimize:
$$\min_{W,H} \lVert V - WH \rVert^2 \quad \text{s.t.} \quad W^T W = I$$
• Restrict W to singular vectors of V (orthonormal columns)
• Can (and usually does) violate non-negativity
• Goal: best possible matrix approximation for a given k
• Great for compression or filtering out noise!
³ K. Pearson, "On Lines and Planes of Closest Fit to Systems of Points in Space", Philosophical Magazine, 1901.
15
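The constrained problem above can be solved with a truncated SVD. The sketch below uses plain NumPy rather than PyMF; the random data and the value of `k` are assumptions, and the mean-centering step that PCA normally applies is omitted so that the code matches the formula as written.

```python
import numpy as np

rng = np.random.default_rng(0)
V = rng.standard_normal((20, 100))   # illustrative data, columns are samples
k = 5

U, s, Vt = np.linalg.svd(V, full_matrices=False)
W = U[:, :k]                         # leading left singular vectors, W.T @ W = I
H = W.T @ V                          # coefficients of each column of V in basis W
V_approx = W @ H

err = np.linalg.norm(V - V_approx)
# The Frobenius error equals the norm of the discarded singular values.
print(np.isclose(err, np.sqrt((s[k:] ** 2).sum())))
print(W.min() < 0)                   # entries of W are typically of mixed sign
```

The last line illustrates the point about non-negativity: nothing in the orthogonality constraint keeps the entries of W or H non-negative.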
![Page 16: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/16.jpg)
Example PCA
>>> from pymf.pca import PCA
>>> import numpy as np
>>> mdl = PCA(data, num_bases=2)
>>> mdl.factorize()
>>> V_approx = np.dot(mdl.W, mdl.H)
• Usage for data analysis questionable
• Basis vectors usually not interpretable
[Figure: V ≈ V_approx, with the basis vectors W = ...]
16
![Page 17: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/17.jpg)
Non-negative Matrix Factorization⁴
• For V ≥ 0, a constrained quadratic optimization problem:
$$\min_{W,H} \lVert V - WH \rVert^2 \quad \text{s.t.} \quad W \ge 0, \; H \ge 0$$
• A globally optimal solution provably exists, but algorithms guaranteed to find it remain elusive; exact NMF is NP-hard
• Often W converges to partial representations
• Active area of research
• Goal: reconstruct data by independent parts
⁴ D. D. Lee and H. S. Seung, "Learning the Parts of Objects by Non-Negative Matrix Factorization", Nature, 401(6755), 1999
17
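One widely used heuristic for this problem is the multiplicative update rule commonly attributed to Lee and Seung. The sketch below is a plain NumPy illustration, not PyMF's implementation; the function name `nmf_multiplicative`, the iteration count, and the small constant `eps` that guards against division by zero are assumptions made for the example. It reaches a local optimum only, in line with the hardness remark above.

```python
import numpy as np

def nmf_multiplicative(V, k, n_iter=200, eps=1e-9, seed=0):
    """Approximate V >= 0 by W @ H with W, H >= 0 (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs never flip.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

rng = np.random.default_rng(42)
V = rng.random((30, 4)) @ rng.random((4, 50))   # non-negative test matrix
W, H = nmf_multiplicative(V, k=4)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print(W.min() >= 0, H.min() >= 0, rel_err)
```

Because the factors can only add and never cancel each other, the columns of W tend to look like parts of the data, which is the behaviour described on the slide.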
![Page 18: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/18.jpg)
Example NMF
>>> from pymf.nmf import NMF
>>> import numpy as np
>>> mdl = NMF(data, num_bases=2, iter=50)
>>> mdl.factorize()
>>> V_approx = np.dot(mdl.W, mdl.H)
• Additive combination of parts
• Interesting options for data analysis
[Figure: V ≈ V_approx, with the basis vectors W = ...]
18
![Page 19: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/19.jpg)
Archetypal Analysis⁵
• Convexity-constrained quadratic optimization problem:
$$\min_{W,H} \lVert V - VWH \rVert^2 \quad \text{s.t.} \quad w_{l,i} \ge 0, \; \sum_l w_{l,i} = 1, \quad h_{k,i} \ge 0, \; \sum_k h_{k,i} = 1$$
• Reconstruct data by its archetypes, i.e. convex combinations of polar opposites
• Yields novel and intuitive insights into data
• Great for interpretable data representations!
• O(n²), but efficient approximations for large data exist
⁵ A. Cutler and L. Breiman, "Archetypal Analysis", Technometrics 36(4), 1994
19
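The constraint structure of V ≈ VWH can be illustrated without a solver. The NumPy sketch below does not optimize anything and is not PyMF code; it only draws random feasible factors (the helper `random_column_stochastic` and all sizes are assumptions) to show that archetypes and reconstructions stay inside the range spanned by the data.

```python
import numpy as np

def random_column_stochastic(rows, cols, rng):
    """Non-negative matrix whose columns each sum to 1 (hypothetical helper)."""
    M = rng.random((rows, cols))
    return M / M.sum(axis=0)

rng = np.random.default_rng(0)
m, n, k = 2, 50, 3
V = rng.random((m, n))                        # columns are data points

W = random_column_stochastic(n, k, rng)       # n x k: archetypes as mixtures of data
H = random_column_stochastic(k, n, rng)       # k x n: data as mixtures of archetypes

archetypes = V @ W                            # m x k, convex combinations of data points
V_approx = archetypes @ H                     # m x n, convex combinations of archetypes

# Convex combinations cannot leave the bounding box of the data.
lo = V.min(axis=1, keepdims=True)
hi = V.max(axis=1, keepdims=True)
print(np.all((archetypes >= lo) & (archetypes <= hi)))
```

An actual archetypal analysis would alternate between updating W and H to reduce the reconstruction error under these same constraints; only the feasible set is shown here.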
![Page 20: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/20.jpg)
Example Archetypal Analysis
>>> from pymf.aa import AA
>>> import numpy as np
>>> mdl = AA(data, num_bases=2, iter=50)
>>> mdl.factorize()
>>> V_approx = np.dot(mdl.W, mdl.H)
• Existent data points as basis vectors
• Convex combination allows a probabilistic interpretation
[Figure: V ≈ V_approx, with the basis vectors W = ...]
20
![Page 21: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/21.jpg)
Method Summary
• Common form: V = WH (or V = VWH)
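| Method | W constraint | H constraint | Outcome |
|---|---|---|---|
| PCA | - | - | compressed V |
| K-means | - | $h_{k,i} \in \{0, 1\}$, $\sum_k h_{k,i} = 1$ | groups |
| NMF | $W \ge 0$ | $H \ge 0$ | parts |
| AA | $W \ge 0$, $\sum_l w_{l,i} = 1$ | $H \ge 0$, $\sum_k h_{k,i} = 1$ | opposites |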
• Doesn’t only work for images ;)
• More complex constraints usually result in more complex solvers
• Active area of research deals with approximations for large data
21
![Page 22: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/22.jpg)
Large matrices: PyMF and h5py
>>> import h5py
>>> import numpy as np
>>> from pymf.sivm import SIVM  # uses [6]
>>> file = h5py.File('myfile.hdf5', 'w')
>>> file['dataset'] = np.random.random((100, 1000))
>>> file['W'] = np.random.random((100, 10))
>>> file['H'] = np.random.random((10, 1000))
>>> sivm_mdl = SIVM(file['dataset'], num_bases=10)
>>> sivm_mdl.W = file['W']  # factors live in the HDF5 file, not in memory
>>> sivm_mdl.H = file['H']
>>> sivm_mdl.factorize()
⁶ Thurau, Kersting, and Bauckhage, "Simplex volume maximization for descriptive web-scale matrix factorization", CIKM 2010
22
![Page 23: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/23.jpg)
⁷ Science, 2010: Vol. 330
![Page 24: Low-rank matrix approximations in Python by Christian Thurau PyData 2014](https://reader034.vdocument.in/reader034/viewer/2022051210/54c6c7db4a795938448b4592/html5/thumbnails/24.jpg)
Take Home Message
• Most clustering and data analysis methods are matrix approximations
• Imposed constraints shape the factorization
• Imposed constraints yield different views on data
• One of the most effective and versatile tools for data exploration!
• Python implementation → http://github.com/cthurau/pymf
24