text mining with r · rob zinkov text mining with r october 19th, 2010 9 / 38. readability score...
TRANSCRIPT
![Page 1: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/1.jpg)
Text Mining with R
Rob Zinkov
October 19th, 2010
Rob Zinkov () Text Mining with R October 19th, 2010 1 / 38
![Page 2: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/2.jpg)
Outline
1 Introduction
2 Readability
3 Summarization
4 Topic Modeling
5 Sentiment Analysis
6 Entity Extraction
7 Demo
Rob Zinkov () Text Mining with R October 19th, 2010 2 / 38
![Page 3: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/3.jpg)
Introduction
What is Text Mining?
Text mining is any process or program that:
Raw human written text
↓
Structured information
Rob Zinkov () Text Mining with R October 19th, 2010 3 / 38
![Page 4: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/4.jpg)
Introduction
R is very good for this
Rob Zinkov () Text Mining with R October 19th, 2010 4 / 38
![Page 5: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/5.jpg)
Introduction
Themes
• R is a great glue language
• CRAN already has a lots of packages that work well together
Rob Zinkov () Text Mining with R October 19th, 2010 5 / 38
![Page 6: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/6.jpg)
Introduction
Caveats
• I use outside libraries more than necessary
• Many of these algorithms could be written completely in R
• There are nicer ways to integrate these libraries
• Text mining is a vast field that can’t be covered in 40 minutes
Rob Zinkov () Text Mining with R October 19th, 2010 6 / 38
![Page 7: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/7.jpg)
Readability
Readability
Rob Zinkov () Text Mining with R October 19th, 2010 7 / 38
![Page 8: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/8.jpg)
Readability
• Readability gives us an idea of the difficulty of the document
• It also gives a rough measure of the quality
Rob Zinkov () Text Mining with R October 19th, 2010 8 / 38
![Page 9: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/9.jpg)
Readability
Flesch-Kincaid readability test
Readability can be roughly measured with
206.876− 1.015(ws
)− 84.6
( y
w
)where
w = total wordsy = total syllabless = total sentences
Rob Zinkov () Text Mining with R October 19th, 2010 9 / 38
![Page 10: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/10.jpg)
Readability
Score Notes
90.0-100.0 easily understandable by an average 11-year-old student
60.0-70.0 easily understandable by 13- to 15-year-old students
0.0-30.0 best understood by university graduates
Rob Zinkov () Text Mining with R October 19th, 2010 10 / 38
![Page 11: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/11.jpg)
Readability
Flesch-Kincaid Reading Age
(0.39 ∗ ASL) + (11.8 ∗ ASW )− 15.59
where ASL = average sentence lengthwhere ASW = average syllables per word
Rob Zinkov () Text Mining with R October 19th, 2010 11 / 38
![Page 12: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/12.jpg)
Readability
Rob Zinkov () Text Mining with R October 19th, 2010 12 / 38
![Page 13: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/13.jpg)
Readability
system(paste("java -jar CmdFlesh.jar","test_review.txt"))
Rob Zinkov () Text Mining with R October 19th, 2010 13 / 38
![Page 14: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/14.jpg)
Readability
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 14 / 38
![Page 15: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/15.jpg)
Readability
Notes
This algorithm isn’t hard to implement.Quick trick. Count vowel clusters in words to estimate syllables
Rob Zinkov () Text Mining with R October 19th, 2010 15 / 38
![Page 16: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/16.jpg)
Summarization
Summarization is about a succient distinctdistillation of the relevant content in a document.
Rob Zinkov () Text Mining with R October 19th, 2010 16 / 38
![Page 17: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/17.jpg)
Summarization
Rob Zinkov () Text Mining with R October 19th, 2010 17 / 38
![Page 18: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/18.jpg)
Summarization
Most approaches involve selecting out the most relevant sentencesThe simplest technique is to just look for sentences with popular terms
Rob Zinkov () Text Mining with R October 19th, 2010 18 / 38
![Page 19: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/19.jpg)
Summarization
I use libots, as an example but there is much better work
Rob Zinkov () Text Mining with R October 19th, 2010 19 / 38
![Page 20: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/20.jpg)
Summarization
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 20 / 38
![Page 21: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/21.jpg)
Topic Modeling
Topic Modeling
Topic Modeling is a way to group and categorize documentsUsually unsupervised approach
Rob Zinkov () Text Mining with R October 19th, 2010 21 / 38
![Page 22: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/22.jpg)
Topic Modeling
Topic Modeling - continued
CRAN includes a package for topicmodelingThis package using LDA and CTM
Rob Zinkov () Text Mining with R October 19th, 2010 22 / 38
![Page 23: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/23.jpg)
Topic Modeling
LDA - Latent Dirchilet Allocation
θk |j ∼ D[α]
φw |k ∼ D[β]
zij ∼ θk |j
xij ∼ φw |zij
Rob Zinkov () Text Mining with R October 19th, 2010 23 / 38
![Page 24: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/24.jpg)
Topic Modeling
njkw = #{i : xij = w , zij = k}
Rob Zinkov () Text Mining with R October 19th, 2010 24 / 38
![Page 25: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/25.jpg)
Topic Modeling
CTM - Coorelated Topic Models
Rob Zinkov () Text Mining with R October 19th, 2010 25 / 38
![Page 26: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/26.jpg)
Topic Modeling
CTM - Coorelated Topic Models
θk |j ∼ log(N(µ,Σ))
φw |k ∼ D[β]
zij ∼ θk |j
xij ∼ φw |zij
Rob Zinkov () Text Mining with R October 19th, 2010 26 / 38
![Page 27: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/27.jpg)
Topic Modeling
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 27 / 38
![Page 28: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/28.jpg)
Sentiment Analysis
Rob Zinkov () Text Mining with R October 19th, 2010 28 / 38
![Page 29: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/29.jpg)
Sentiment Analysis
Sentiment analysis is about gauging mood based on the text.
Rob Zinkov () Text Mining with R October 19th, 2010 29 / 38
![Page 30: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/30.jpg)
Sentiment Analysis
Opinion corpus available at:
Wiebe’s corporahttp://www.cs.pitt.edu/mpqa/Sentiwordnet:http://sentiwordnet.isti.cnr.it/
Rob Zinkov () Text Mining with R October 19th, 2010 30 / 38
![Page 31: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/31.jpg)
Sentiment Analysis
For more sophistication
• Best solved using a Conditional Random Field
• This area is still new
• No R libraries
• Entity Extraction needed for more fine-grained sentiment
Rob Zinkov () Text Mining with R October 19th, 2010 31 / 38
![Page 32: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/32.jpg)
Sentiment Analysis
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 32 / 38
![Page 33: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/33.jpg)
Entity Extraction
Named Entity Recognition
The purpose of NER is to extract out and label phrases in a sentence
Bill Clinton arrived at the United Nations Building in Manhattan.
Rob Zinkov () Text Mining with R October 19th, 2010 33 / 38
![Page 34: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/34.jpg)
Entity Extraction
Challenges
• People and locations may be referred to in ambiguous ways.
• Entity may never have been seen before
• Entity may be referred to with pronouns
• Wikipedia and Capitalization heuristics aren’t good enough
Rob Zinkov () Text Mining with R October 19th, 2010 34 / 38
![Page 35: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/35.jpg)
Entity Extraction
I use the Illinois Named Entity Extractorhttp://cogcomp.cs.illinois.edu/page/software view/4
Rob Zinkov () Text Mining with R October 19th, 2010 35 / 38
![Page 36: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/36.jpg)
Entity Extraction
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 36 / 38
![Page 37: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/37.jpg)
Demo
Conclusions
There are lots of interesting things you can do with text miningR is very good at integrating all of them.
Rob Zinkov () Text Mining with R October 19th, 2010 37 / 38
![Page 38: Text Mining with R · Rob Zinkov Text Mining with R October 19th, 2010 9 / 38. Readability Score Notes 90.0-100.0 easily understandable by an average 11-year-old student 60.0-70.0](https://reader034.vdocument.in/reader034/viewer/2022050517/5fa18ac7b41bf0376f650a06/html5/thumbnails/38.jpg)
Demo
Questions?
Rob Zinkov () Text Mining with R October 19th, 2010 38 / 38