![Page 1: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/1.jpg)
Algebraic Techniques for Analysis of Large Discrete-Valued Datasets
Mehmet Koyuturk¹, Ananth Grama¹, and Naren Ramakrishnan²
1. Dept. of Computer Sciences, Purdue University
{koyuturk, ayg}@cs.purdue.edu
2. Dept. of Computer Sciences, Virginia Tech
![Page 2: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/2.jpg)
Motivation
- Handling large discrete-valued datasets
- Extracting relations between data items
- Summarizing data in an error-bounded fashion
- Clustering of data items
- Finding concise representations for clustered data
![Page 3: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/3.jpg)
Background
Singular Value Decomposition (SVD) [Berry et al., 1995]
- Decompose matrix into $A = U \Sigma V^T$
- $U$ and $V$ orthogonal matrices, $\Sigma$ diagonal with singular values
- Used for Latent Semantic Indexing in Information Retrieval
- Truncate decomposition to compress data
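For reference, truncation keeps only the $k$ largest singular values. One standard way to write it is sketched here; the symbols $k$, $\sigma_i$, $u_i$, $v_i$ are notation introduced for this note and do not appear on the slides:

```latex
A_k = \sum_{i=1}^{k} \sigma_i \, u_i v_i^T,
\qquad \sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_k > 0
```

Storing $A_k$ takes $k(m + n + 1)$ numbers, compared with $mn$ for a dense $m \times n$ matrix.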
![Page 4: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/4.jpg)
Background
Semi-Discrete Decomposition (SDD) [Kolda and O'Leary, 1998]
- Restrict entries of U and V to {-1, 0, 1}
- Requires very small amount of storage
- Can perform as well as SVD in LSI using less than one-tenth the storage
- Effective in finding outlier clusters
- Works well for datasets containing a large number of small clusters
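As a reminder of the form being described, the SDD is usually written as a weighted sum of discrete rank-one terms. The symbols $d_i$, $x_i$, $y_i$ follow the common SDD notation and are an assumption here, since the slide speaks of U and V:

```latex
A \approx \sum_{i=1}^{k} d_i \, x_i y_i^T,
\qquad x_i \in \{-1, 0, 1\}^m,\quad y_i \in \{-1, 0, 1\}^n,\quad d_i > 0
```

Each vector entry can be coded in two bits, whereas SVD factors need a floating-point number per entry, which is where the storage saving comes from.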
![Page 5: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/5.jpg)
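Rank-1 Approximation
$A \approx x y^T$
x : presence vector
y : pattern vector

$$\begin{bmatrix} 0 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 0 & 1 & 1 \end{bmatrix}$$

$$A = \begin{bmatrix} 1 & 0 & 1 & 0 & 1 \\ 1 & 1 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 1 & 0 \end{bmatrix} \approx \begin{bmatrix} 1 \\ 0 \\ 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 & 0 \end{bmatrix}$$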
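A minimal sketch of what the presence and pattern vectors encode, assuming the slide's example is the 4x5 binary matrix with x = (1, 0, 1, 1) and y = (1, 0, 1, 0, 0). That reading of the garbled slide is an assumption, and the function names `outer` and `error_count` are hypothetical.

```python
def outer(x, y):
    """Rank-one binary matrix x y^T: row i is the pattern y if x[i] == 1, else all zeros."""
    return [[xi * yj for yj in y] for xi in x]


def error_count(A, x, y):
    """Number of entries where A and x y^T disagree."""
    approx = outer(x, y)
    return sum(a != b for row_a, row_b in zip(A, approx) for a, b in zip(row_a, row_b))


# Assumed reading of the slide's example.
A = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
]
x = [1, 0, 1, 1]      # presence vector: which rows contain the pattern
y = [1, 0, 1, 0, 0]   # pattern vector: which columns form the pattern

for row in outer(x, y):
    print(row)
print("mismatches:", error_count(A, x, y))
```

Rows flagged by x are replaced by the single pattern y, so the two vectors take m + n bits instead of the m x n bits of A.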
![Page 6: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/6.jpg)
Rank-1 Approximation
Problem: Given discrete matrix $A_{m \times n}$, find discrete vectors $x_{m \times 1}$ and $y_{n \times 1}$ to minimize
$$\|A - xy^T\|_F^2 = |\{a_{ij} \in (A - xy^T) : |a_{ij}| = 1\}|$$
= number of non-zeros in the error matrix
Heuristic: Fix $y$, set $s = \frac{Ay}{\|y\|_2^2}$, solve for $x$ to maximize $\frac{(x^T s)^2}{\|x\|_2^2}$
Iteratively solve for x and y until no improvement possible
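The alternating heuristic can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the SDD-style objective $(x^T s)^2 / \|x\|_2^2$ with $s = Ay / \|y\|_2^2$, binary (0/1) data, and a starting y with at least one 1. The names `best_binary` and `rank_one` are hypothetical, and the inner step sorts for simplicity rather than aiming for linear time.

```python
def best_binary(s):
    """Binary x maximizing (x^T s)^2 / ||x||_2^2 for a nonnegative vector s.

    For a fixed number of ones k, the best choice is the k largest entries
    of s, so it is enough to scan the prefix sums of the sorted entries.
    """
    order = sorted(range(len(s)), key=lambda i: -s[i])
    total, best, k = 0.0, -1.0, 0
    for j, i in enumerate(order, start=1):
        total += s[i]
        if total * total / j > best:
            best, k = total * total / j, j
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(s))]


def rank_one(A, y, max_iter=100):
    """Alternate between x and y until the objective stops improving."""
    m, n = len(A), len(A[0])
    x, best = [0] * m, -1.0
    for _ in range(max_iter):
        # Fix y, set s = Ay / ||y||^2 (for binary y, ||y||^2 is its number of ones).
        s = [sum(a * b for a, b in zip(row, y)) / sum(y) for row in A]
        new_x = best_binary(s)
        # Fix x, set t = A^T x / ||x||^2, and solve for y the same way.
        t = [sum(A[i][j] * new_x[i] for i in range(m)) / sum(new_x) for j in range(n)]
        new_y = best_binary(t)
        xAy = sum(new_x[i] * A[i][j] * new_y[j] for i in range(m) for j in range(n))
        objective = xAy ** 2 / (sum(new_x) * sum(new_y))
        if objective <= best:
            break  # no improvement: keep the previous x and y
        x, y, best = new_x, new_y, objective
    return x, y


A = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
]
x, y = rank_one(A, [1, 1, 1, 1, 1])  # start from an all-ones pattern vector
print("presence:", x, "pattern:", y)
```

Each half-step is an exact maximization with the other vector held fixed, so the objective never decreases; the result is still only a local optimum that depends on the starting y.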
![Page 7: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/7.jpg)
Initialization of pattern vector
- Crucial to escape from local optima
- Must require at most O(nz(A)) time, so as not to dominate the run-time
- Some possible schemes
  - AllOnes: Set all entries to 1. Poor.
  - Threshold: Set only the entries whose corresponding columns have more non-zeros than a threshold. Can lead to bad local optima.
  - Maximum: Set only the entry that corresponds to the column with the maximum number of non-zeros. Risky, since that column may be shared by many patterns.
  - Partition: Partition the rows of the matrix based on a column, then apply the Threshold scheme taking into account only one of the parts. Best among these.
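The four schemes can be sketched in a few lines. The slides do not say how the threshold or the partitioning column are chosen, so `tau` and `col` are left as parameters here and are assumptions, as are the function names. Dense lists are used for readability; a sparse representation would be needed to stay within O(nz(A)).

```python
def column_counts(rows, n):
    """Number of non-zeros in each of the n columns of the given rows."""
    return [sum(r[j] for r in rows) for j in range(n)]


def all_ones(A):
    """AllOnes: every entry of the pattern vector is 1."""
    return [1] * len(A[0])


def threshold(A, tau):
    """Threshold: set entries whose column has more than tau non-zeros."""
    return [1 if c > tau else 0 for c in column_counts(A, len(A[0]))]


def maximum(A):
    """Maximum: set only the entry of the densest column (first one on ties)."""
    counts = column_counts(A, len(A[0]))
    y = [0] * len(counts)
    y[counts.index(max(counts))] = 1
    return y


def partition(A, col, tau):
    """Partition: keep the rows with a 1 in column col, then apply Threshold to them."""
    part = [r for r in A if r[col] == 1]
    return [1 if c > tau else 0 for c in column_counts(part, len(A[0]))]


A = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
]
print(all_ones(A))
print(threshold(A, 2))
print(maximum(A))
print(partition(A, 2, 1))
```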
![Page 8: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/8.jpg)
Recursive Algorithm
- At any step, given rank-one approximation $A \approx xy^T$, split A into A1 and A0 based on rows:
  - if x(i)=1, row i goes to A1
  - if x(i)=0, row i goes to A0
- Stop when
  - the Hamming radius of A1 (the maximum of the Hamming distances of the rows of A1 to the pattern vector) is less than some threshold
  - all rows of A are present in A1
- (If A1 does not satisfy the Hamming radius condition, A1 can be split based on Hamming distances.)
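A compact sketch of the recursion, under several assumptions that are not on the slides: the rank-one step uses the SDD-style objective $(x^T s)^2 / \|x\|_2^2$, it starts from the Maximum initialization scheme, the two stopping conditions are treated as alternatives, and the optional split of A1 by Hamming distance is left out. All function names are hypothetical.

```python
def best_binary(s):
    """Binary vector maximizing (x^T s)^2 / ||x||_2^2: the top-k entries of s."""
    order = sorted(range(len(s)), key=lambda i: -s[i])
    total, best, k = 0.0, -1.0, 0
    for j, i in enumerate(order, start=1):
        total += s[i]
        if total * total / j > best:
            best, k = total * total / j, j
    chosen = set(order[:k])
    return [1 if i in chosen else 0 for i in range(len(s))]


def rank_one(rows, max_iter=50):
    """Alternating heuristic, started from the densest column (Maximum scheme)."""
    n = len(rows[0])
    counts = [sum(r[j] for r in rows) for j in range(n)]
    y = [0] * n
    y[counts.index(max(counts))] = 1
    x = None
    for _ in range(max_iter):
        new_x = best_binary([sum(a * b for a, b in zip(r, y)) / sum(y) for r in rows])
        new_y = best_binary([sum(r[j] * xi for r, xi in zip(rows, new_x)) / sum(new_x)
                             for j in range(n)])
        if new_x == x and new_y == y:
            break
        x, y = new_x, new_y
    return x, y


def hamming(u, v):
    """Number of positions where two binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def decompose(rows, radius, leaves):
    """Split rows into A1 / A0 by the presence vector and recurse."""
    x, y = rank_one(rows)
    a1 = [r for r, xi in zip(rows, x) if xi == 1]
    a0 = [r for r, xi in zip(rows, x) if xi == 0]
    within = max(hamming(r, y) for r in a1) < radius
    if within or not a0:
        leaves.append((y, a1))          # stop: y represents the rows of A1
    else:
        decompose(a1, radius, leaves)   # A1 is still too spread out: refine it
    if a0:
        decompose(a0, radius, leaves)   # rows without the pattern are handled separately
    return leaves


data = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
for pattern, members in decompose(data, radius=2, leaves=[]):
    print(pattern, "represents", len(members), "rows")
```

The recursion terminates because A1 always contains at least one row, so both A1 (when A0 is non-empty) and A0 are strictly smaller than the matrix they came from.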
![Page 9: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/9.jpg)
Recursive Algorithm
![Page 10: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/10.jpg)
Effectiveness of Analysis
![Page 11: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/11.jpg)
Effectiveness of Analysis
![Page 12: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/12.jpg)
Run-time Scalability
- Rank-1 approximation requires O(nz(A)) time
- Total run-time at each level in the recursive tree cannot exceed this, since the total # of nonzeros at each level is at most nz(A)
- Run-time is linear in nz(A)
[Figure panels: runtime vs. # columns, runtime vs. # rows, runtime vs. # nonzeros]
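To illustrate why one step can cost O(nz(A)), here is a sketch in which each row is stored as the list of its non-zero column indices, so both products used by the alternating heuristic touch every non-zero at most once. The storage layout and the function names are assumptions made for this illustration.

```python
def sparse_rows(A):
    """Store each row of a binary matrix as the list of its non-zero column indices."""
    return [[j for j, v in enumerate(row) if v] for row in A]


def times_y(rows, y):
    """Compute A y by visiting each non-zero of A once."""
    return [sum(y[j] for j in cols) for cols in rows]


def transpose_times_x(rows, x, n):
    """Compute A^T x by visiting only the non-zeros of the rows selected by x."""
    out = [0] * n
    for xi, cols in zip(x, rows):
        if xi:
            for j in cols:
                out[j] += 1
    return out


A = [
    [1, 0, 1, 0, 1],
    [1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 0, 1, 1, 0],
]
rows = sparse_rows(A)
print("nz(A) =", sum(len(cols) for cols in rows))
print(times_y(rows, [1, 0, 1, 0, 0]))
print(transpose_times_x(rows, [1, 0, 1, 1], 5))
```

Because A1 and A0 partition the rows of A, the non-zeros handled across all nodes at one level of the recursion add up to at most nz(A).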
![Page 13: Algebraic Techniques for Analysis of Large Discrete-Valued Datasets](https://reader035.vdocument.in/reader035/viewer/2022081603/56813ca6550346895da6559f/html5/thumbnails/13.jpg)
Conclusions and Ongoing Work
- Proposed algorithm is
  - Scalable to extremely high dimensions
  - Effective in discovering dominant patterns
  - Hierarchical in nature, allowing multi-resolution analysis
- Currently working on
  - Real-world applications of the proposed method
  - Effective initialization schemes
  - Parallel implementation