text mining mengmeng & jack_lsu
Introduction to Text Mining in SAS® Enterprise Miner
Mengmeng Liu and Jack Dai
November 10, 2014
Growing Text
• First tweet: 2006
• Tweets per day in 2007: 5,000
• Tweets per day in 2013: 500,000,000
What is Text Mining?
Unstructured Text Data → Numeric Data → Statistical Analysis
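The first arrow in this flow, from unstructured text to numeric data, can be illustrated outside SAS with a small Python sketch. It builds a term-by-document count matrix from whitespace-separated words; the function name `term_document_matrix` and the sample reviews are assumptions made up for illustration, not part of Enterprise Miner.

```python
from collections import Counter


def term_document_matrix(documents):
    """Return (sorted vocabulary, count rows), one row of counts per document."""
    # Naive tokenization: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({term for tokens in tokenized for term in tokens})
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        rows.append([counts[term] for term in vocabulary])
    return vocabulary, rows


# Hypothetical mini-corpus of three short reviews.
docs = ["great room great pool", "dirty room", "great staff"]
vocabulary, matrix = term_document_matrix(docs)
```

Each row of `matrix` is now a numeric vector, so the documents can be passed to ordinary statistical methods.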
Text Mining Process Flow in SAS Enterprise Miner
Data Importing
• Clean the text as much as you can before importing
Import node | Data structure
File Import | All text in one file (CSV)
Text Import | Separate documents (TXT, PDF)
Create New Data Sources | SAS dataset (sas7bdat)
Text Parsing
• Parse the text variable with the longest length
• Associate similar terms into one group
• Build a customized dictionary of relevant terms
• Control the number of terms per document
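The parsing ideas above can be sketched in plain Python. This is not how the Text Parsing node is implemented; it only illustrates tokenizing, mapping similar terms to one parent term, and keeping only terms from a customized dictionary. `SYNONYMS`, `START_LIST`, and `parse` are hypothetical names with made-up contents.

```python
import re

# Hypothetical synonym list: each variant is mapped to one parent term.
SYNONYMS = {"rooms": "room", "employees": "staff", "staffs": "staff"}

# Hypothetical customized dictionary of relevant terms (a "start list").
START_LIST = {"room", "staff", "clean", "friendly", "dirty"}


def parse(text, start_list=None):
    """Tokenize text, group synonyms, and optionally keep only start-list terms."""
    tokens = re.findall(r"[a-z']+", text.lower())
    terms = [SYNONYMS.get(token, token) for token in tokens]
    if start_list is not None:
        terms = [term for term in terms if term in start_list]
    return terms


terms = parse("The rooms were clean and the employees were friendly", START_LIST)
# terms == ["room", "clean", "staff", "friendly"]
```

Restricting the output to a start list is one simple way to limit how many terms each document contributes.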
Text Filter
• Correct misspellings
• Assign frequency weights and term weights
• Manually filter out terms using filter view
Target variable | Term weight option
Present | Mutual information
Not present | Entropy
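To make the weighting step concrete, here is a sketch of a log frequency weight and an entropy term weight, using the common log-entropy formulation from the information retrieval literature. It is assumed, not confirmed by the slides, that this matches the exact formulas in the Text Filter node. The mutual information option, which needs a target variable, is not covered. The function names are hypothetical.

```python
import math


def log_frequency_weight(count):
    """Dampen a raw term count within a document: log2(count + 1)."""
    return math.log2(count + 1)


def entropy_term_weight(counts_per_document):
    """Entropy weight of one term, given its count in each of the n documents.

    Computes 1 + sum_j p_j * log2(p_j) / log2(n), where p_j is the share of
    the term's occurrences that fall in document j. The weight is 1 when the
    term occurs in a single document and 0 when it is spread evenly.
    """
    n = len(counts_per_document)
    total = sum(counts_per_document)
    if n < 2 or total == 0:
        return 0.0  # the weight is undefined here; 0.0 is a sketch choice
    weight = 1.0
    for count in counts_per_document:
        if count > 0:
            share = count / total
            weight += share * math.log2(share) / math.log2(n)
    return weight


concentrated = entropy_term_weight([3, 0, 0, 0])  # term in one document only
spread_out = entropy_term_weight([2, 2, 2, 2])    # term in every document equally
```

Terms concentrated in a few documents get weights near 1, while terms that appear everywhere get weights near 0, which is why uninformative words contribute little after weighting.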
Text Cluster
• Text Cluster groups documents with similar text content
• Represent each document by Singular Value Decomposition (SVD) dimensions computed from the term weights and frequency weights
• Group documents into mutually exclusive clusters based on the SVD dimensions
• Select the number of SVD dimensions and the number of clusters
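The two steps of Text Cluster, reducing documents to SVD dimensions and then grouping them, can be sketched with the standard library alone. This sketch makes several assumptions: it approximates the top right singular vectors by power iteration, it swaps in plain k-means for whatever clustering algorithm the node actually uses, and it uses raw counts for a made-up four-document corpus instead of weighted frequencies. All function names are hypothetical.

```python
import math
import random


def top_right_singular_vectors(matrix, k, iterations=200, seed=0):
    """Approximate the top-k right singular vectors by power iteration on A^T A."""
    rng = random.Random(seed)
    n_terms = len(matrix[0])
    found = []
    for _component in range(k):
        v = [rng.random() + 0.1 for _ in range(n_terms)]
        for _step in range(iterations):
            # Deflate: remove the directions that were already found.
            for u in found:
                dot = sum(a * b for a, b in zip(v, u))
                v = [a - dot * b for a, b in zip(v, u)]
            # One power-iteration step: v <- A^T (A v), then normalize.
            av = [sum(x * y for x, y in zip(row, v)) for row in matrix]
            v = [sum(row[j] * av[i] for i, row in enumerate(matrix))
                 for j in range(n_terms)]
            norm = math.sqrt(sum(a * a for a in v))
            if norm == 0.0:
                break
            v = [a / norm for a in v]
        found.append(v)
    return found


def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(points, k, iterations=20):
    """Plain k-means with farthest-first initialization (deterministic)."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(squared_distance(p, c)
                                                     for c in centers)))
    labels = [0] * len(points)
    for _step in range(iterations):
        labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Rows are documents; columns are counts of four hypothetical terms:
# "clean", "friendly", "dirty", "noisy".
matrix = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
vectors = top_right_singular_vectors(matrix, k=2)
points = [[sum(x * y for x, y in zip(row, v)) for v in vectors] for row in matrix]
labels = k_means(points, k=2)
```

Each document receives exactly one label, which is the "mutually exclusive" property of clusters. Both the number of SVD dimensions and the number of clusters are passed in as `k`, mirroring the two choices the node asks for.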
Text Topic
• Create a number of topics that are prevalent in documents
• Score each document on probability of containing the topic
• Each document could have multiple topics
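Unlike clusters, topics are not mutually exclusive. The sketch below scores a parsed document against two made-up topics and assigns every topic whose score reaches a cutoff. The topic definitions, the weights, the cutoff value, and the additive scoring rule are all assumptions for illustration; the slides do not specify how the Text Topic node computes its scores.

```python
# Hypothetical topics: each maps terms to illustrative weights.
TOPICS = {
    "cleanliness": {"clean": 1.0, "dirty": 1.0, "stain": 0.5},
    "staff": {"staff": 1.0, "friendly": 0.5, "rude": 0.5},
}


def topic_scores(terms):
    """Score a parsed document on every topic by summing matching term weights."""
    return {
        name: sum(weights.get(term, 0.0) for term in terms)
        for name, weights in TOPICS.items()
    }


def assigned_topics(terms, cutoff=1.0):
    """Return every topic whose score reaches the cutoff (possibly several)."""
    return sorted(name for name, score in topic_scores(terms).items()
                  if score >= cutoff)


review = ["dirty", "room", "rude", "staff"]
topics_for_review = assigned_topics(review)
# topics_for_review == ["cleanliness", "staff"]
```

The same review lands in both topics, whereas a clustering step would place it in exactly one group.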
Possible statistical analysis methods
For Classification Purposes:
Demo: Hotel Reviews for Riviera
Data Structure
Cleaning the raw data
Manually filtering terms
Terms Eliminated:
• quot
• riviera
• hotel
• stay (verb)
• strip
• vegas
• year
• riv
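The effect of this manual filtering can be mimicked in Python by dropping the eliminated terms from each parsed review. The term set is taken from the list above; the function name `drop_eliminated` and the sample review are hypothetical, and the sketch ignores part of speech, so it drops "stay" in every role rather than only as a verb.

```python
# Terms eliminated in the demo (part-of-speech roles are ignored in this sketch).
ELIMINATED = {"quot", "riviera", "hotel", "stay", "strip", "vegas", "year", "riv"}


def drop_eliminated(terms):
    """Remove terms that appear in almost every review and carry no signal."""
    return [term for term in terms if term not in ELIMINATED]


# Hypothetical parsed review.
parsed_review = ["riviera", "hotel", "room", "dirty", "strip", "view", "great"]
kept = drop_eliminated(parsed_review)
# kept == ["room", "dirty", "view", "great"]
```

Words such as "riviera" or "vegas" occur in nearly every review of this hotel, so removing them lets the remaining terms separate the documents.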
Results
Alternative: Read the Reviews
Questions?
Resources:
• Dr. Jim Love and Dr. Joni Shreve from LSU ISDS Department
• Data obtained from UCI Data Repository
• http://www.internetlivestats.com/twitter-statistics/
• ‘Text Analytics Using SAS Enterprise Miner’