A comparative study of TF*IDF, LSI and multi-words for text classification
Presenter: Jian-Ren Chen. Authors: Wen Zhang, Taketoshi Yoshida, Xijin Tang. 2011, ESWA.
![Page 1: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/1.jpg)
Intelligent Database Systems Lab
Presenter : JIAN-REN CHEN
Authors : Wen Zhang, Taketoshi Yoshida, Xijin Tang
2011.ESWA
A comparative study of TF*IDF, LSI and multi-words for text classification
![Page 2: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/2.jpg)
Outlines
• Motivation
• Objectives
• Methodology
• Experiments
• Conclusions
• Comments
![Page 3: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/3.jpg)
Motivation
Although TF*IDF, LSI and multi-words were proposed long ago, there has been no comparative study of these indexing methods, and no results have been reported on their classification performance.
![Page 4: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/4.jpg)
Objectives
• A comparative study of TF*IDF, LSI and multi-words for text classification
- information retrieval
- text categorization
• Indexing term:
① semantic quality
② statistical quality
![Page 5: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/5.jpg)
Methodology - TF*IDF
1) $w_{i,j}$: the weight of term $i$ in document $j$
2) $N$: the number of documents in the collection
3) $tf_{i,j}$: the term frequency of term $i$ in document $j$
4) $df_i$: the document frequency of term $i$ in the collection
[Figure: term-document matrix, with terms (keywords) of the document collection on one axis and documents on the other]
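A minimal sketch of how the four quantities above fit together, assuming the common weighting $w_{i,j} = tf_{i,j} \times \log(N / df_i)$. The exact variant used in the paper is not shown here, so the formula is an assumption, and the function name `tfidf_weights` and the toy documents are hypothetical.

```python
import math
from collections import Counter

def tfidf_weights(docs):
    """Return one {term: weight} dict per tokenized document.

    Assumes w[i][j] = tf[i][j] * log(N / df[i]); this common variant is an
    assumption, not necessarily the formula used in the paper.
    """
    n_docs = len(docs)                 # N: number of documents
    df = Counter()                     # df_i: documents containing term i
    for doc in docs:
        df.update(set(doc))            # count each term once per document
    weights = []
    for doc in docs:
        tf = Counter(doc)              # tf_{i,j}: occurrences in this document
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights

# Toy collection (hypothetical data).
docs = [["oil", "price", "oil"], ["trade", "price"], ["oil", "trade", "tariff"]]
w = tfidf_weights(docs)
```

Under this variant, a term that occurs in every document gets weight 0, because $\log(N / df_i) = \log 1 = 0$.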
![Page 6: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/6.jpg)
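Methodology - LSI
Given a term-document matrix $X = [x_1, x_2, \ldots, x_n]$ with columns $x_j \in \mathbb{R}^m$, and supposing the rank of $X$ is $r$, LSI decomposes $X$ using SVD, $X = U \Sigma V^T$, and keeps the $k$ largest singular values to form the rank-$k$ approximation:
$$X_k = U_k \Sigma_k V_k^T$$
[Figure: term-document matrix, with terms (keywords) of the document collection on one axis and documents on the other]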
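The rank-$k$ truncation can be sketched with NumPy's `numpy.linalg.svd`. This is an illustration under stated assumptions rather than the authors' implementation: the helper name `lsi_truncate`, the toy matrix, and the choice $k = 2$ are all hypothetical.

```python
import numpy as np

def lsi_truncate(X, k):
    """Return the rank-k approximation of X and the documents' latent coordinates."""
    # Thin SVD: X = U @ diag(s) @ Vt, singular values sorted in descending order.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    U_k, s_k, Vt_k = U[:, :k], s[:k], Vt[:k, :]
    X_k = U_k @ np.diag(s_k) @ Vt_k        # X_k = U_k Sigma_k V_k^T
    doc_coords = np.diag(s_k) @ Vt_k       # column j: document j in k dimensions
    return X_k, doc_coords

# Toy 4-term x 3-document matrix (hypothetical counts).
X = np.array([[2.0, 0.0, 1.0],
              [1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
X_k, doc_coords = lsi_truncate(X, k=2)
```

Here `doc_coords` holds one $k$-dimensional column per document, which is the representation a classifier would consume in place of the original $m$-dimensional term vectors.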
![Page 7: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/7.jpg)
Methodology - Multi-word
• The length of a multi-word should be between 2 and 6.
• It should occur at least twice in a document.
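The two criteria above can be illustrated with plain n-gram counting. This is a naive sketch, not the extraction procedure from the paper: it assumes length is counted in words, treats every contiguous word sequence as a candidate, and uses the hypothetical name `multiword_candidates` with a made-up sentence.

```python
from collections import Counter

def multiword_candidates(tokens, min_len=2, max_len=6, min_freq=2):
    """Count contiguous word sequences in one document and keep those
    whose length is in [min_len, max_len] and whose frequency >= min_freq."""
    counts = Counter()
    for n in range(min_len, max_len + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {gram: c for gram, c in counts.items() if c >= min_freq}

# One toy document (hypothetical text), already tokenized by whitespace.
tokens = "crude oil prices rose as crude oil exports fell".split()
candidates = multiword_candidates(tokens)
```

In this toy document only the sequence ("crude", "oil") repeats, so it is the only candidate that survives the frequency threshold.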
![Page 8: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/8.jpg)
Experiments - Datasets
• Chinese corpus: TanCorpV1.0
- 14150 documents, 20 categories
- Selected: 1200 documents, 219,115 sentences, 5,468,301 individual words
- Categories: agriculture, history, politics, economy
• English corpus: Reuters-22173 distribution 1.0
- 22173 documents, 135 categories
- Selected: 2032 documents, 50,837 sentences, 281,111 individual words
- Categories: Crude (520), Agriculture (574), Trade (514), Interest (424)
![Page 9: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/9.jpg)
Experiments - Evaluation
![Page 10: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/10.jpg)
Experiments - Chinese
![Page 11: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/11.jpg)
Experiments - English
![Page 12: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/12.jpg)
Experiments - t-test
![Page 13: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/13.jpg)
Comparison
Columns: information retrieval, text categorization, computation complexity
• TF*IDF: Chinese, $O(nm)$
• LSI: English, best, $O(n^2 r^3)$
• multi-word: $O(m s^2)$
![Page 14: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/14.jpg)
Conclusions
• LSI can produce indexing with better discriminative power.
• LSI and multi-words have better semantic quality than TF*IDF, and TF*IDF has better statistical quality than the other two methods.
• The number of dimensions is still a decisive factor for indexing when different indexing methods are used for classification.
![Page 15: A comparative study of TF*IDF , LSI and multi-words for text classification](https://reader035.vdocument.in/reader035/viewer/2022062816/56814c52550346895db964e2/html5/thumbnails/15.jpg)
Comments
• Advantages
- Compares TF*IDF, LSI and multi-words
• Disadvantage
- Semantic quality and statistical quality are judged merely by intuition instead of theory
• Applications
- text mining