text mining with r · rob zinkov text mining with r october 19th, 2010 9 / 38. readability score...
TRANSCRIPT
Text Mining with R
Rob Zinkov
October 19th, 2010
Rob Zinkov () Text Mining with R October 19th, 2010 1 / 38
Outline
1 Introduction
2 Readability
3 Summarization
4 Topic Modeling
5 Sentiment Analysis
6 Entity Extraction
7 Demo
Rob Zinkov () Text Mining with R October 19th, 2010 2 / 38
Introduction
What is Text Mining?
Text mining is any process or program that:
Raw human written text
↓
Structured information
Rob Zinkov () Text Mining with R October 19th, 2010 3 / 38
Introduction
R is very good for this
Rob Zinkov () Text Mining with R October 19th, 2010 4 / 38
Introduction
Themes
• R is a great glue language
• CRAN already has a lots of packages that work well together
Rob Zinkov () Text Mining with R October 19th, 2010 5 / 38
Introduction
Caveats
• I use outside libraries more than necessary
• Many of these algorithms could be written completely in R
• There are nicer ways to integrate these libraries
• Text mining is a vast field that can’t be covered in 40 minutes
Rob Zinkov () Text Mining with R October 19th, 2010 6 / 38
Readability
Readability
Rob Zinkov () Text Mining with R October 19th, 2010 7 / 38
Readability
• Readability gives us an idea of the difficulty of the document
• It also gives a rough measure of the quality
Rob Zinkov () Text Mining with R October 19th, 2010 8 / 38
Readability
Flesch-Kincaid readability test
Readability can be roughly measured with
206.876− 1.015(ws
)− 84.6
( y
w
)where
w = total wordsy = total syllabless = total sentences
Rob Zinkov () Text Mining with R October 19th, 2010 9 / 38
Readability
Score Notes
90.0-100.0 easily understandable by an average 11-year-old student
60.0-70.0 easily understandable by 13- to 15-year-old students
0.0-30.0 best understood by university graduates
Rob Zinkov () Text Mining with R October 19th, 2010 10 / 38
Readability
Flesch-Kincaid Reading Age
(0.39 ∗ ASL) + (11.8 ∗ ASW )− 15.59
where ASL = average sentence lengthwhere ASW = average syllables per word
Rob Zinkov () Text Mining with R October 19th, 2010 11 / 38
Readability
Rob Zinkov () Text Mining with R October 19th, 2010 12 / 38
Readability
system(paste("java -jar CmdFlesh.jar","test_review.txt"))
Rob Zinkov () Text Mining with R October 19th, 2010 13 / 38
Readability
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 14 / 38
Readability
Notes
This algorithm isn’t hard to implement.Quick trick. Count vowel clusters in words to estimate syllables
Rob Zinkov () Text Mining with R October 19th, 2010 15 / 38
Summarization
Summarization is about a succient distinctdistillation of the relevant content in a document.
Rob Zinkov () Text Mining with R October 19th, 2010 16 / 38
Summarization
Rob Zinkov () Text Mining with R October 19th, 2010 17 / 38
Summarization
Most approaches involve selecting out the most relevant sentencesThe simplest technique is to just look for sentences with popular terms
Rob Zinkov () Text Mining with R October 19th, 2010 18 / 38
Summarization
I use libots, as an example but there is much better work
Rob Zinkov () Text Mining with R October 19th, 2010 19 / 38
Summarization
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 20 / 38
Topic Modeling
Topic Modeling
Topic Modeling is a way to group and categorize documentsUsually unsupervised approach
Rob Zinkov () Text Mining with R October 19th, 2010 21 / 38
Topic Modeling
Topic Modeling - continued
CRAN includes a package for topicmodelingThis package using LDA and CTM
Rob Zinkov () Text Mining with R October 19th, 2010 22 / 38
Topic Modeling
LDA - Latent Dirchilet Allocation
θk |j ∼ D[α]
φw |k ∼ D[β]
zij ∼ θk |j
xij ∼ φw |zij
Rob Zinkov () Text Mining with R October 19th, 2010 23 / 38
Topic Modeling
njkw = #{i : xij = w , zij = k}
Rob Zinkov () Text Mining with R October 19th, 2010 24 / 38
Topic Modeling
CTM - Coorelated Topic Models
Rob Zinkov () Text Mining with R October 19th, 2010 25 / 38
Topic Modeling
CTM - Coorelated Topic Models
θk |j ∼ log(N(µ,Σ))
φw |k ∼ D[β]
zij ∼ θk |j
xij ∼ φw |zij
Rob Zinkov () Text Mining with R October 19th, 2010 26 / 38
Topic Modeling
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 27 / 38
Sentiment Analysis
Rob Zinkov () Text Mining with R October 19th, 2010 28 / 38
Sentiment Analysis
Sentiment analysis is about gauging mood based on the text.
Rob Zinkov () Text Mining with R October 19th, 2010 29 / 38
Sentiment Analysis
Opinion corpus available at:
Wiebe’s corporahttp://www.cs.pitt.edu/mpqa/Sentiwordnet:http://sentiwordnet.isti.cnr.it/
Rob Zinkov () Text Mining with R October 19th, 2010 30 / 38
Sentiment Analysis
For more sophistication
• Best solved using a Conditional Random Field
• This area is still new
• No R libraries
• Entity Extraction needed for more fine-grained sentiment
Rob Zinkov () Text Mining with R October 19th, 2010 31 / 38
Sentiment Analysis
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 32 / 38
Entity Extraction
Named Entity Recognition
The purpose of NER is to extract out and label phrases in a sentence
Bill Clinton arrived at the United Nations Building in Manhattan.
Rob Zinkov () Text Mining with R October 19th, 2010 33 / 38
Entity Extraction
Challenges
• People and locations may be referred to in ambiguous ways.
• Entity may never have been seen before
• Entity may be referred to with pronouns
• Wikipedia and Capitalization heuristics aren’t good enough
Rob Zinkov () Text Mining with R October 19th, 2010 34 / 38
Entity Extraction
I use the Illinois Named Entity Extractorhttp://cogcomp.cs.illinois.edu/page/software view/4
Rob Zinkov () Text Mining with R October 19th, 2010 35 / 38
Entity Extraction
Demo!
Rob Zinkov () Text Mining with R October 19th, 2010 36 / 38
Demo
Conclusions
There are lots of interesting things you can do with text miningR is very good at integrating all of them.
Rob Zinkov () Text Mining with R October 19th, 2010 37 / 38
Demo
Questions?
Rob Zinkov () Text Mining with R October 19th, 2010 38 / 38