Kyunghoon Kim
Mathematical approach for Text Mining
- Standard Latent Semantic Indexing -
7/17/2014 Standard Latent Semantic Indexing 1
2014. 07. 17.
UNIST Mathematical Sciences
Kyunghoon Kim ( [email protected] )
What is Indexing?
1. Google Glasses is a computer with a head-mounted display.
2. He wore thick glasses. He worked in google corporation.
3. He wore glasses to be able to read signs at a distance.

Term: sentences containing it
google: 1, 2
glasses: 1, 2, 3
is: 1
a: 1, 3
computer: 1
with: 1
head-mounted: 1
display: 1
he: 2, 3
wore: 2, 3
thick: 2
worked: 2
in: 2
corporation: 2
to: 3
be: 3
able: 3
read: 3
…
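The indexing idea above can be sketched in a few lines: for each term, record which sentences contain it. The tokenizer (lowercasing plus a simple regular expression) and the names `sentences` and `build_index` are illustrative assumptions, not part of the original slides.

```python
import re
from collections import defaultdict

sentences = [
    "Google Glasses is a computer with a head-mounted display.",
    "He wore thick glasses. He worked in google corporation.",
    "He wore glasses to be able to read signs at a distance.",
]

def build_index(docs):
    """Map each lowercased term to the sorted 1-based numbers of the documents containing it."""
    index = defaultdict(set)
    for doc_id, text in enumerate(docs, start=1):
        # Keep hyphenated words such as "head-mounted" as a single term.
        for term in re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower()):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}

index = build_index(sentences)
for term in ("google", "glasses", "he", "wore"):
    print(term, index[term])
```

A real indexer would usually also remove stop words such as "is" and "a" and apply stemming; both steps are omitted here to keep the sketch close to the slide.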
SVD with NumPy
>>> import numpy as np
>>> Original = np.matrix([[1, 1, 0, 1], [7, 0, 0, 7], [1, 1, 0, 1], [2, 5, 3, 6]])
>>> Original
matrix([[1, 1, 0, 1],
        [7, 0, 0, 7],
        [1, 1, 0, 1],
        [2, 5, 3, 6]])
>>> U, Sigma, VT = np.linalg.svd(Original)
Singular Value Decomposition (SVD)
Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
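As a rough illustration of the factorization A = U Σ Vᵀ named on this slide, the sketch below checks the properties that `np.linalg.svd` is documented to provide: orthonormal columns in U and V, and singular values returned in descending order. The matrix is the 4 x 4 example from the NumPy slide; the variable names are illustrative.

```python
import numpy as np

A = np.array([[1, 1, 0, 1],
              [7, 0, 0, 7],
              [1, 1, 0, 1],
              [2, 5, 3, 6]], dtype=float)

U, sigma, VT = np.linalg.svd(A)

print(U.shape, sigma.shape, VT.shape)        # (4, 4) (4,) (4, 4)
print(np.allclose(U.T @ U, np.eye(4)))       # columns of U are orthonormal
print(np.allclose(VT @ VT.T, np.eye(4)))     # rows of VT (columns of V) are orthonormal
print(np.all(np.diff(sigma) <= 0))           # singular values are sorted, largest first
```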
Singular Values
>>> np.matrix(np.diag(Sigma))
matrix([[ 1.218e+01, 0.0e+00, 0.0e+00, 0.0e+00],
        [ 0.0e+00, 5.370e+00, 0.0e+00, 0.0e+00],
        [ 0.0e+00, 0.0e+00, 8.823e-01, 0.0e+00],
        [ 0.0e+00, 0.0e+00, 0.0e+00, 1.082e-15]])
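The last singular value above (about 1e-15) is zero up to floating-point round-off: rows 1 and 3 of the matrix are identical, so its rank is 3. A small sketch of counting the singular values above a tolerance follows. The tolerance formula is an assumption modeled on a common convention, and `np.linalg.matrix_rank` serves as a cross-check.

```python
import numpy as np

Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

sigma = np.linalg.svd(Original, compute_uv=False)

# Assumed tolerance: largest singular value * largest dimension * machine epsilon.
tol = sigma.max() * max(Original.shape) * np.finfo(float).eps
numerical_rank = int(np.sum(sigma > tol))

print(numerical_rank)
print(np.linalg.matrix_rank(Original))
```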
Full Recovery
>>> np.matrix(U)*np.matrix(np.diag(Sigma))*np.matrix(VT)
matrix([[ 1.0e+00, 1.0e+00, -5.296e-16, 1.0e+00],
        [ 7.0e+00, 4.302e-16, 7.979e-16, 7.0e+00],
        [ 1.0e+00, 1.0e+00, -2.542e-17, 1.0e+00],
        [ 2.0e+00, 5.0e+00, 3.0e+00, 6.0e+00]])
Original:
matrix([[1, 1, 0, 1],
        [7, 0, 0, 7],
        [1, 1, 0, 1],
        [2, 5, 3, 6]])
Recovering with some singular values
# Calculation with all singular values
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]
# Calculation with 3 of 4
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]
# Calculation with 2 of 4
[[1 1 0 1]
 [7 0 0 7]
 [1 1 0 1]
 [2 5 3 6]]
# Calculation with 1 of 4
[[1 0 0 1]
 [5 3 1 7]
 [1 0 0 1]
 [4 2 1 6]]
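The reconstructions above keep only the k largest singular values. A minimal sketch of that truncation follows, assuming a helper named `reconstruct` (hypothetical, not from the slides). Because the fourth singular value is zero up to round-off, k = 3 already recovers the matrix; for smaller k, the spectral-norm error equals the first discarded singular value (the Eckart-Young theorem).

```python
import numpy as np

Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

U, Sigma, VT = np.linalg.svd(Original)

def reconstruct(k):
    """Rank-k approximation built from the k largest singular values."""
    return U[:, :k] @ np.diag(Sigma[:k]) @ VT[:k, :]

for k in (4, 3, 2, 1):
    # Spectral norm (largest singular value) of the reconstruction error.
    err = np.linalg.norm(Original - reconstruct(k), 2)
    print(k, np.round(err, 4))
```

Comparing with `np.allclose` rather than `==` is the safer check here, since exact equality fails on round-off residue of the order of 1e-16.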
How many singular values to take?
>>> sig2 = Sigma**2
>>> sig2
array([1.48e+02, 2.88e+01, 7.78e-01, 1.17e-30])
>>> sum(sig2)
178.0
>>> sum(sig2)*0.9
160.20000000000002
>>> sum(sig2[:1])
148.375554981108
>>> sum(sig2[:2])
177.22150138532837
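The comparison above suggests a simple rule: keep the smallest number of singular values whose squared sum reaches 90% of the total. Here the first value alone gives 148.4 < 160.2, while the first two give 177.2 ≥ 160.2, so two suffice. A sketch of that rule follows; the helper name `choose_k` is hypothetical, and the 0.9 default is taken from the `sum(sig2)*0.9` line above.

```python
import numpy as np

Original = np.array([[1, 1, 0, 1],
                     [7, 0, 0, 7],
                     [1, 1, 0, 1],
                     [2, 5, 3, 6]], dtype=float)

Sigma = np.linalg.svd(Original, compute_uv=False)

def choose_k(singular_values, threshold=0.9):
    """Smallest k whose squared singular values hold at least `threshold` of the total."""
    energy = np.cumsum(singular_values ** 2)
    # Index of the first cumulative share that reaches the threshold, converted to a count.
    return int(np.searchsorted(energy / energy[-1], threshold) + 1)

print(choose_k(Sigma))
```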
Corpus
Frequency Matrix
• Each term $t_i$ generates a row vector $(a_{i1}, a_{i2}, \cdots, a_{in})$, referred to as a term vector, and each document $d_j$ generates a column vector
$$d_j = \begin{pmatrix} a_{1j} \\ \vdots \\ a_{mj} \end{pmatrix}.$$
>>> A = np.matrix([[1,0,0],[0,1,0],[1,1,1],[1,1,0],[0,0,1]])
>>> A
matrix([[1, 0, 0],
[0, 1, 0],
[1, 1, 1],
[1, 1, 0],
[0, 0, 1]])
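A minimal sketch of how such a term-by-document frequency matrix could be built, with rows as term vectors and columns as document vectors. The toy corpus, the whitespace tokenization, and the names `frequency_matrix` and `F` are hypothetical; this is not the corpus from the slides.

```python
import numpy as np

docs = ["cat pet animal", "dog pet animal", "pet fish pet"]

def frequency_matrix(documents):
    """Return (terms, F) where F[i, j] counts occurrences of terms[i] in documents[j]."""
    tokenized = [doc.split() for doc in documents]
    terms = sorted({word for words in tokenized for word in words})
    row = {term: i for i, term in enumerate(terms)}
    F = np.zeros((len(terms), len(documents)), dtype=int)
    for j, words in enumerate(tokenized):
        for word in words:
            F[row[word], j] += 1
    return terms, F

terms, F = frequency_matrix(docs)
print(terms)
print(F)
print(F[terms.index("pet"), :])   # term vector for "pet" (a row)
print(F[:, 0])                    # document vector for the first document (a column)
```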
Example of SVD :: Full Singular values
U, Sigma, VT = np.linalg.svd(A)
# S holds the singular values on the diagonal of a 5 x 3 matrix, so U @ S @ VT has the shape of A
S = np.zeros((U.shape[1], VT.shape[0]))
S[:3, :3] = np.diag(Sigma)
Recon = U @ S @ VT
print(np.round(Recon))
[[ 1. 0. 0.]
 [ 0. 1. 0.]
 [ 1. 1. 1.]
 [ 1. 1. 0.]
 [ 0. 0. 1.]]
Example of SVD :: 2 singular values
U, Sigma, VT = np.linalg.svd(A)
# Keep only the two largest singular values; the rest of S stays zero
S = np.zeros((U.shape[1], VT.shape[0]))
S[:2, :2] = np.diag(Sigma[:2])
Recon = U @ S @ VT
print(np.round(Recon, 5))
[[ 0.5 0.5 0. ]
 [ 0.5 0.5 0. ]
 [ 1. 1. 1. ]
 [ 1. 1. 0. ]
 [ 0. 0. 1. ]]
# Rounded matrix, for convenience
array([[ 0.5, 0.5, 0. ],
       [ 0.5, 0.5, 0. ],
       [ 1. , 1. , 1. ],
       [ 1. , 1. , 0. ],
       [ 0. , 0. , 1. ]])
# Unrounded matrix
matrix([[ 5.00000000e-01, 5.00000000e-01, 5.27355937e-16],
        [ 5.00000000e-01, 5.00000000e-01, -1.94289029e-16],
        [ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00],
        [ 1.00000000e+00, 1.00000000e+00, 3.33066907e-16],
        [ 5.55111512e-16, -2.49800181e-16, 1.00000000e+00]])
Query
Example with Query
Case 1.
Case 2.
matrix([[ 5.00000000e-01, 5.00000000e-01, 5.27355937e-16],
        [ 5.00000000e-01, 5.00000000e-01, -1.94289029e-16],
        [ 1.00000000e+00, 1.00000000e+00, 1.00000000e+00],
        [ 1.00000000e+00, 1.00000000e+00, 3.33066907e-16],
        [ 5.55111512e-16, -2.49800181e-16, 1.00000000e+00]])
Case 1.
query = np.matrix([[1, 0, 0, 1, 0]])
# Cosine similarity between the query and each document (column) of Recon
for i in range(Recon.shape[1]):
    q = query
    d = Recon[:, i]
    dotproduct = np.dot(q, d).item()  # .item() replaces the deprecated np.asscalar
    normq = np.linalg.norm(q)
    normd = np.linalg.norm(d)
    print(dotproduct / (normq * normd))
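The loop above computes the cosine similarity between the query and each document column. The sketch below runs the same query against both the original matrix A and its rank-2 approximation. Scoring against both matrices is an assumption made for illustration, and `cosine_scores` is a hypothetical helper.

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

U, Sigma, VT = np.linalg.svd(A, full_matrices=False)
Recon = U[:, :2] @ np.diag(Sigma[:2]) @ VT[:2, :]   # rank-2 approximation of A

def cosine_scores(query, M):
    """Cosine similarity between `query` and every column of M."""
    return (query @ M) / (np.linalg.norm(query) * np.linalg.norm(M, axis=0))

query = np.array([1, 0, 0, 1, 0], dtype=float)
print(np.round(cosine_scores(query, A), 4))
print(np.round(cosine_scores(query, Recon), 4))
```

On A the scores are roughly 0.82, 0.41 and 0, so the first document ranks above the second. On the rank-2 approximation the first two documents both score about 0.67, because their columns have become identical.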
What’s the feature of LSI?
Approximation of A = matrix([[ 0.5, 0.5, 0. ],
                             [ 0.5, 0.5, 0. ],
                             [ 1. , 1. , 1. ],
                             [ 1. , 1. , 0. ],
                             [ 0. , 0. , 1. ]])
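One property visible in this approximation: terms 1 and 2 never occur in the same document in A, yet their rows become identical after truncation, because both co-occur with terms 3 and 4. The small check below is offered as an illustration of that effect rather than as the slide's original content; `cosine` is a hypothetical helper.

```python
import numpy as np

A = np.array([[1, 0, 0],
              [0, 1, 0],
              [1, 1, 1],
              [1, 1, 0],
              [0, 0, 1]], dtype=float)

U, Sigma, VT = np.linalg.svd(A, full_matrices=False)
A2 = U[:, :2] @ np.diag(Sigma[:2]) @ VT[:2, :]   # rank-2 approximation of A

def cosine(x, y):
    """Cosine similarity between two vectors."""
    return float(x @ y / (np.linalg.norm(x) * np.linalg.norm(y)))

print(round(cosine(A[0], A[1]), 4))     # term 1 vs term 2 in the original matrix
print(round(cosine(A2[0], A2[1]), 4))   # term 1 vs term 2 in the rank-2 approximation
```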
Related work
Demonstration of LSI
What’s Next?
• Probabilistic Latent Semantic Indexing
• Latent Dirichlet Allocation
References
• Harrington, Peter. Machine learning in action. Manning Publications Co., 2012.
• Simovici, Dan A. Linear algebra tools for data mining. World Scientific, 2012.
• Berry, Michael W., Susan T. Dumais, and Gavin W. O'Brien. "Using linear algebra for intelligent information retrieval." SIAM Review 37.4 (1995): 573-595.