Comparison of Principal Component Analysis and Random Projection in Text Mining
Steve Vincent
April 29, 2004
INFS 795
Dr. Domeniconi
Outline
- Introduction
- Previous Work
- Objective
- Background on Principal Component Analysis (PCA) and Random Projection (RP)
- Test Data Sets
- Experimental Design
- Experimental Results
- Future Work
Introduction
"Random projection in dimensionality reduction: Applications to image and text data" (KDD 2001), by Bingham and Mannila, compared principal component analysis (PCA) to random projection (RP) for text and image data.
For future work, they said: "A still more realistic application of random projection would be to use it in a data mining problem."
Previous Work
- In 2001, Bingham and Mannila compared PCA to RP for images and text.
- In 2001, Torkkola discussed both Latent Semantic Indexing (LSI) and RP in classifying text at very low dimension levels. LSI is very similar to PCA for text data. Used the Reuters-21578 database.
- In 2003, Fradkin and Madigan discussed the background of RP.
- In 2003, Lin and Gunopulos combined LSI with RP.
- No real data mining comparison between the two methods exists.
Objective
- Principal Component Analysis (PCA): finds components that make the projections uncorrelated by selecting the eigenvectors with the highest eigenvalues of the covariance matrix. Maximizes retained variance.
- Random Projection (RP): projects the data onto a lower-dimensional subspace by multiplying by a random matrix, approximately preserving distances. Minimizes computation for a particular dimension size.
- Goal: determine whether RP is a viable dimensionality reduction method.
Principal Component Analysis
1. Normalize the input data, then center it by subtracting the mean; call the result X, used below.
2. Compute the global mean and the covariance matrix of X:

   Cov = (1 / (N - 1)) * sum_{n=1}^{N} (x_n - x̄)(x_n - x̄)^T

3. Compute the eigenvalues and eigenvectors of the covariance matrix.
4. Arrange the eigenvectors in decreasing order of the magnitude of their eigenvalues. Take the first d eigenvectors as the principal components.
5. Put the d eigenvectors as columns in a matrix M.
6. Determine the reduced output E by multiplying M by X.
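The steps above can be sketched in a few lines of NumPy (the original reductions were done in Matlab; this is an illustrative re-implementation, not the author's code):

```python
import numpy as np

def pca_reduce(X, d):
    """Reduce an n x p data matrix X to n x d via the slide's PCA steps."""
    Xc = X - X.mean(axis=0)                 # center by subtracting the mean
    cov = Xc.T @ Xc / (Xc.shape[0] - 1)     # covariance matrix of X
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigendecomposition (symmetric matrix)
    order = np.argsort(eigvals)[::-1]       # decreasing order of eigenvalue magnitude
    M = eigvecs[:, order[:d]]               # first d eigenvectors as columns of M
    return Xc @ M, eigvals[order]           # reduced output E and sorted eigenvalues
```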
Random Projection
With X being an n x p matrix, calculate E using:

   E = (1 / sqrt(q)) * X * P

with projection matrix P (p x q), where q is the number of reduced dimensions. P is a matrix with elements r_ij, where r_ij is random Gaussian.
P can also be constructed in one of the following ways:
- r_ij = ±1, each with probability 1/2
- r_ij = sqrt(3) * (±1), each with probability 1/6, or 0 with probability 2/3
SPAM Email Database
- SPAM E-mail Database, generated June/July 1999
- Task: determine whether an email is spam or not
- Previous tests have produced a 7% misclassification error
- Source of data: http://www.ics.uci.edu/~mlearn/MLRepository.html
- Number of instances: 4,601 (1,813 spam = 39.4%)
SPAM Email Database
Number of attributes: 58
- 48 attributes = word frequency
- 6 attributes = character frequency
- 1 attribute = average length of uninterrupted sequences of capital letters
- 1 attribute = length of the longest uninterrupted sequence of capital letters
- 1 attribute = sum of the lengths of uninterrupted sequences of capital letters
- 1 attribute = class (1 = spam, 0 = not spam)
Yahoo News Categories
- Introduced in "Impact of Similarity Measures on Web-Page Clustering" by Alexander Strehl, et al.
- Located at: ftp://ftp.cs.umn/dept/users/boley/PDDPdata/
- Data consists of 2,340 documents in 20 Yahoo news categories.
- After stemming, the database consists of 21,839 words.
- Strehl was able to reduce the number of words to 2,903 by selecting only those words that appear in 1% to 10% of all articles.
Yahoo News Categories
Number of documents in each category (2,340 total):

Category            No.    Category          No.
Business            142    E: Online          65
Entertainment (E)     9    E: People         248
E: Art               24    E: Review         158
E: Cable             44    E: Stage           18
E: Culture           74    E: Television     187
E: Film             278    E: Variety         54
E: Industry          70    Health            494
E: Media             21    Politics          114
E: Multimedia        14    Sports            141
E: Music            125    Technology         60
Revised Yahoo News Categories
Combined the 15 Entertainment categories into one category:

Category                 No.
Business                 142
Entertainment (Total)  1,389
Health                   494
Politics                 114
Sports                   141
Technology                60
Yahoo News Characteristics
With the various simplifications and revisions, the Yahoo News database has the following characteristics: 2,340 documents, 2,903 words, 6 categories.
Even with these simplifications and revisions, there are still too many attributes to do effective data mining.
Experimental Design
- Perform PCA and RP on each data set for a wide range of dimension numbers. Run RP multiple times due to the random nature of the algorithm. Determine relative times for each reduction.
- Compare PCA and RP results in various data mining techniques, including Naïve Bayes, Nearest Neighbor, and Decision Trees. Determine relative times for each technique.
- Compare PCA and RP on time and accuracy.
Retained Variance
Retained variance (r) is the percentage of the original variance that the PCA-reduced data set covers:

   r = ( sum_{i=1}^{d} λ_i / sum_{i=1}^{m} λ_i ) * 100%

where λ_i are the eigenvalues, m is the original number of dimensions, and d is the reduced number of dimensions.
In many applications, r should be above 90%.
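Given the eigenvalues from the PCA step, the formula above is a one-liner (illustrative NumPy, matching the equation term for term):

```python
import numpy as np

def retained_variance(eigvals, d):
    """Percentage of total variance kept by the top-d principal components."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]  # decreasing eigenvalues
    return 100.0 * lam[:d].sum() / lam.sum()
```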
Retained Variance Percent

SPAM Database
Dimension #   Retained Variance
 5            26.4
10            38.3
15            48.2
20            57.3
25            65.7
30            73.6
35            80.6
40            87.0
45            92.5
50            96.8

Yahoo News Database
Dimension #   Retained Variance
 25           10.92
 50           16.95
100           26.44
150           34.28
200           41.06
250           47.04
300           52.37
600           75.22
PCA and RP Time Comparison

SPAM Database (times in seconds; PCA/RP = time of PCA divided by time of RP)
Dimension #   PCA Time   Average RP Time   PCA/RP
 5            1.3        0.057             23.0
10            1.2        0.075             16.1
15            1.4        0.081             17.1
20            1.4        0.088             15.7
25            1.2        0.100             11.7
30            1.4        0.109             13.2
35            1.2        0.125              9.4
40            1.2        0.126              9.4
45            1.2        0.141              8.6
50            1.2        0.155              8.0

Ran RP 5 times for each dimension. Reduction performed in Matlab on a Pentium III 1 GHz computer with 256 MB RAM.
RP averages over 10 times faster than PCA.
PCA and RP Time Comparison

Yahoo News Database (times in seconds; PCA/RP = time of PCA divided by time of RP)
Dimension #   PCA Time   Average RP Time   PCA/RP
 25           2,404       6.02             399
 50           2,223       7.66             290
100           2,342      11.34             207
150           2,397      14.09             170
200           2,378      16.95             140
250           2,685      18.55             145
300           2,489      23.80             105
600           2,551      33.74              76

Ran RP 5 times for each dimension. Reduction performed in Matlab on a Pentium III 1 GHz computer with 256 MB RAM.
RP averages over 100 times faster than PCA.
Data Mining
Explored various data mining techniques using the Weka software package.
The following produced the best results:
- IB1: Nearest Neighbor
- J48: Decision Trees
The following produced poor results and will not be used:
- Naïve Bayes: overall poor results
- SVM (SMO): too slow, with results similar to the others
Data Mining Procedures
For each data set imported into Weka:
- Convert the numerical categories to nominal values
- Randomize the order of the entries
- Run J48 and IB1 on the data
- Determine % correct and check F-measure statistics
Ran PCA once for each dimension number and RP 5 times for each dimension number. Used a 67% training / 33% testing split. Tested on 1,564 instances for SPAM and 796 for Yahoo.
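The IB1 step above (1-nearest-neighbor percent correct) can be sketched without Weka; this is a minimal NumPy stand-in, assuming the reduced attributes and labels are arrays, and the helper name nn1_accuracy is invented for illustration:

```python
import numpy as np

def nn1_accuracy(Xtr, ytr, Xte, yte):
    """Percent correct for a 1-nearest-neighbor classifier (IB1 analogue)."""
    # squared Euclidean distance from every test point to every training point
    d2 = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=2)
    pred = ytr[d2.argmin(axis=1)]   # label of the single nearest neighbor
    return 100.0 * (pred == yte).mean()
```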
Results - J48 Spam Data (percent correct)

# Dim   PCA      RP Max   RP Min
 5      90.601   76.726   73.210
10      89.578   76.343   72.251
15      90.921   75.000   72.187
20      89.962   76.662   74.552
25      89.578   75.064   74.489
30      89.962   79.540   74.744
35      88.619   75.831   74.425
40      88.875   77.749   74.041
45      90.345   79.220   75.512
Full data set (58 dims): 91.432

- PCA gave uniformly good results at all dimension levels
- PCA gave results comparable to the 91.4% correct for the full data set
- RP was 15% below the full data set results
Results - J48 Spam Data
[Chart: % correct vs. dimension # (5-45) for RP Max and RP Min]
RP gave consistent results, with a very small split between maximum and minimum values.
Results - IB1 Spam Data (percent correct)

# Dim   PCA      RP Max   RP Min
 5      89.450   81.522   78.772
10      91.113   79.987   77.558
15      90.601   81.522   78.069
20      89.450   80.499   77.110
25      89.386   80.755   79.156
30      89.642   81.841   78.900
35      91.049   82.481   79.923
40      90.154   81.266   79.412
45      89.067   81.905   79.028
Full data set (58 dims): 89.450

- PCA gave uniformly good results at all dimension levels
- PCA gave results comparable to the 89.5% correct for the full data set
- RP was 10% below the full data set results
Results - IB1 Spam Data
[Chart: % correct vs. dimension # (5-45) for RP Max and RP Min]
RP gave consistent results, with a very small split between maximum and minimum values.
Results - SPAM Data
- PCA gave consistent results at all dimension levels. Expected the lower dimension levels not to perform as well.
- RP gave consistent, but lower, results at all dimension levels. Also expected the lower dimension levels not to perform as well.
Results - J48 Yahoo Data (percent correct)

# Dim   PCA      RP Max   RP Min
 25     90.452   56.030   52.136
 50     92.085   59.548   54.146
100     89.824   60.553   58.668
150     90.452   58.668   56.281
200     89.322   58.920   58.920
250     91.332   58.794   58.417
300     90.327   58.417   56.658
600     90.201   58.794   58.417

- PCA gave uniformly good results at all dimension levels
- RP was over 30% below the PCA results
Note: Did not run data mining on the full data set due to the large dimension number.
Results - J48 Yahoo Data
[Chart: % correct vs. dimension # (25-600) for PCA, RP Max, and RP Min]
RP gave consistent results, with a very small split between maximum and minimum values. RP results were much lower than PCA.
Results - IB1 Yahoo Data (percent correct)

# Dim   PCA      RP Max   RP Min
 25     93.216   73.869   70.603
 50     94.347   79.987   76.759
100     90.201   82.663   81.658
150     87.814   80.025   79.146
200     87.312   82.035   80.653
250     86.558   82.789   79.271
300     84.422   82.035   78.895
600     79.900   82.789   82.035

- PCA percent correct decreased as the dimension number increased
- RP was 20% below PCA at low dimension numbers, with the gap shrinking to 0% at high dimension numbers
Note: Did not run data mining on the full data set due to the large dimension number.
Results - IB1 Yahoo Data
[Chart: % correct vs. dimension # (25-600) for PCA, RP Max, and RP Min]
RP gave consistent results, with a very small split between maximum and minimum values. RP results were similar to PCA at high dimension levels.
Results - Yahoo Data
PCA showed consistently high results for the Decision Tree output, but showed decreasing results at higher dimensions for the Nearest Neighbor output.
- Could be overfitting in the Nearest Neighbor case
- Decision Trees have pruning to prevent overfitting
Results - Yahoo Data
RP showed consistent results for both Nearest Neighbor and Decision Trees.
- The lower dimension numbers gave slightly lower results: approximately 10-20% lower for dimension numbers less than 100
- The Nearest Neighbor results were 20% higher than the Decision Tree results
Overall Results
- RP gives consistent results, with few inconsistencies over multiple runs.
- In general, RP is one to two orders of magnitude (10x to 100x) faster than PCA, but in most cases produced lower accuracy.
- The RP results are closer to PCA when using the Nearest Neighbor data mining technique.
- Would suggest using RP if speed of processing is most important.
Future Work
- Need to examine additional data sets to determine whether the results are consistent.
- Both PCA and RP are linear tools: they map the original data set using a linear mapping.
- Examine deriving PCA using SVD for speed.
- A more general comparison would include non-linear dimensionality reduction methods such as Kernel PCA and SVM.
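The SVD route mentioned above avoids ever forming the p x p covariance matrix: the right singular vectors of the centered data are exactly the covariance eigenvectors. A minimal NumPy sketch of that idea (not the author's implementation):

```python
import numpy as np

def pca_via_svd(X, d):
    """PCA via SVD of the centered data; skips the explicit covariance matrix."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)  # Xc = U diag(s) Vt
    # rows of Vt are the principal directions, already sorted by singular value;
    # the eigenvalues of the covariance matrix are s**2 / (N - 1)
    return Xc @ Vt[:d].T
```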
References
- E. Bingham and H. Mannila, "Random projection in dimensionality reduction: Applications to image and text data", KDD 2001.
- D. Fradkin and D. Madigan, "Experiments with Random Projections for Machine Learning", SIGKDD '03, August 2003.
- J. Lin and D. Gunopulos, "Dimensionality Reduction by Random Projection and Latent Semantic Indexing", Proceedings of the Text Mining Workshop at the 3rd SIAM International Conference on Data Mining, May 2003.
- K. Torkkola, "Linear Discriminant Analysis in Document Classification", IEEE Workshop on Text Mining (TextDM 2001), November 2001.
Questions?