Text Mining with R

Rob Zinkov

October 19th, 2010

Rob Zinkov () Text Mining with R October 19th, 2010 1 / 38

Outline

1 Introduction

2 Readability

3 Summarization

4 Topic Modeling

5 Sentiment Analysis

6 Entity Extraction

7 Demo

Introduction

What is Text Mining?

Text mining is any process or program that turns raw human-written text into structured information.

R is very good for this

Themes

• R is a great glue language

• CRAN already has a lot of packages that work well together

Caveats

• I use outside libraries more than necessary

• Many of these algorithms could be written completely in R

• There are nicer ways to integrate these libraries

• Text mining is a vast field that can’t be covered in 40 minutes

Readability

• Readability gives us an idea of the difficulty of the document

• It also gives a rough measure of the quality

Flesch-Kincaid readability test

Readability can be roughly measured with

$$206.835 - 1.015\left(\frac{w}{s}\right) - 84.6\left(\frac{y}{w}\right)$$

where

w = total words
y = total syllables
s = total sentences
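
As a minimal sketch of the score above, the function below computes the Flesch reading-ease value from three raw counts, using the commonly published constant 206.835. The function name and the example counts are made up for illustration; they are not from the talk.

```r
# Flesch reading ease from raw document totals.
# words, syllables and sentences are counts for the whole document.
flesch_reading_ease <- function(words, syllables, sentences) {
  stopifnot(words > 0, sentences > 0)
  206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)
}

# Hypothetical counts: 100 words, 140 syllables, 5 sentences.
score <- flesch_reading_ease(words = 100, syllables = 140, sentences = 5)
print(score)
```

Longer sentences and longer words both lower the score, so a lower value means harder text.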

Score        Notes
90.0-100.0   easily understandable by an average 11-year-old student
60.0-70.0    easily understandable by 13- to 15-year-old students
0.0-30.0     best understood by university graduates

Flesch-Kincaid Reading Age

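$$(0.39 \times \mathrm{ASL}) + (11.8 \times \mathrm{ASW}) - 15.59$$

where ASL = average sentence length (words per sentence)
and ASW = average syllables per word

The result is a U.S. school grade level rather than an age in years.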
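
A small sketch of this second formula, again from raw counts. ASL is taken as words per sentence and ASW as syllables per word; the function name and the counts are hypothetical.

```r
# Flesch-Kincaid grade level from raw document totals.
flesch_kincaid_grade <- function(words, syllables, sentences) {
  stopifnot(words > 0, sentences > 0)
  asl <- words / sentences   # average sentence length
  asw <- syllables / words   # average syllables per word
  0.39 * asl + 11.8 * asw - 15.59
}

# Hypothetical counts: 100 words, 140 syllables, 5 sentences.
grade <- flesch_kincaid_grade(words = 100, syllables = 140, sentences = 5)
print(grade)
```

Unlike the reading-ease score, this value rises as the text gets harder.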

# Run the external CmdFlesh.jar scorer on a review file (assumes java is on the PATH)
system(paste("java -jar CmdFlesh.jar", "test_review.txt"))

Demo!

Notes

This algorithm isn’t hard to implement.
Quick trick: count vowel clusters in words to estimate syllables.
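
The vowel-cluster trick could look something like the sketch below in base R. It treats y as a vowel and gives every word at least one syllable; both choices are assumptions of this sketch, and the function name is hypothetical.

```r
# Estimate syllables per word by counting runs of consecutive vowels.
estimate_syllables <- function(words) {
  matches <- gregexpr("[aeiouy]+", tolower(words))
  # gregexpr returns -1 for a word with no match, so count positive positions only
  counts <- vapply(matches, function(m) sum(m > 0), integer(1))
  pmax(counts, 1L)
}

print(estimate_syllables(c("text", "mining", "readability")))
```

The estimate is rough: a silent final e, as in "make", is counted as an extra syllable.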

Summarization

Summarization is about a succinct, distinct distillation of the relevant content in a document.

Most approaches involve selecting the most relevant sentences.
The simplest technique is to just look for sentences with popular terms.
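
A rough base-R sketch of the popular-terms idea: score each sentence by the average document-wide frequency of its words and keep the top few. The function names, the tiny stopword list, and the sample text are all hypothetical, and this is not how libots works internally.

```r
# Split text into sentences at ., ! or ? followed by whitespace.
split_sentences <- function(text) {
  marked <- gsub("([.!?])\\s+", "\\1\n", text, perl = TRUE)
  trimws(strsplit(marked, "\n", fixed = TRUE)[[1]])
}

# Lowercase a sentence and split it into words.
tokenize <- function(sentence) {
  words <- strsplit(tolower(sentence), "[^a-z']+")[[1]]
  words[nchar(words) > 0]
}

summarize_by_popular_terms <- function(text, n = 1,
    stopwords = c("a", "an", "the", "is", "was", "in", "of", "for",
                  "to", "and", "into", "because")) {
  sentences <- split_sentences(text)
  tokens <- lapply(sentences, function(s) {
    w <- tokenize(s)
    w[!(w %in% stopwords)]
  })
  freq <- table(unlist(tokens))  # term frequency over the whole document
  scores <- vapply(tokens, function(w) {
    if (length(w) == 0) 0 else mean(as.numeric(freq[w]))
  }, numeric(1))
  keep <- order(scores, decreasing = TRUE)[seq_len(min(n, length(sentences)))]
  sentences[sort(keep)]          # return picks in document order
}

text <- paste(
  "R is a great glue language.",
  "Text mining turns raw text into structured information.",
  "The weather was pleasant.",
  "Text mining in R is easy because R packages for text work well together."
)
picked <- summarize_by_popular_terms(text, n = 2)
print(picked)
```

Averaging rather than summing the term frequencies keeps long sentences from winning on length alone.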

I use libots as an example, but there is much better work.

Demo!

Topic Modeling

Topic modeling is a way to group and categorize documents.
It is usually an unsupervised approach.

Topic Modeling - continued

CRAN includes a package for topic modeling.
This package uses LDA and CTM.

LDA - Latent Dirichlet Allocation

$$\theta_{k|j} \sim \mathrm{D}[\alpha]$$

$$\phi_{w|k} \sim \mathrm{D}[\beta]$$

$$z_{ij} \sim \theta_{k|j}$$

$$x_{ij} \sim \phi_{w|z_{ij}}$$
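
To make the generative story concrete, here is a small base-R simulation of it, reading D[·] as a Dirichlet distribution. The sizes K, V, J, N and the hyperparameter values are arbitrary illustrative choices, and rdirichlet1 is a helper written for this sketch rather than a function from any package. This only samples from the model; it does not fit one.

```r
set.seed(1)

# One Dirichlet draw via normalized gamma variates.
rdirichlet1 <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

K <- 3    # number of topics
V <- 6    # vocabulary size
J <- 4    # number of documents
N <- 20   # words per document (fixed here for simplicity)
alpha <- rep(0.5, K)
beta  <- rep(0.5, V)

# theta[j, ] is the topic mixture of document j: theta_{k|j} ~ D[alpha]
theta <- t(replicate(J, rdirichlet1(alpha)))
# phi[k, ] is the word distribution of topic k: phi_{w|k} ~ D[beta]
phi <- t(replicate(K, rdirichlet1(beta)))

z <- matrix(0L, nrow = N, ncol = J)  # z[i, j]: topic of word i in document j
x <- matrix(0L, nrow = N, ncol = J)  # x[i, j]: word id of word i in document j
for (j in seq_len(J)) {
  z[, j] <- sample.int(K, N, replace = TRUE, prob = theta[j, ])
  for (i in seq_len(N)) {
    x[i, j] <- sample.int(V, 1, prob = phi[z[i, j], ])
  }
}
```

Inference runs this story in reverse: given only x, it estimates theta, phi and z.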

$$n_{jkw} = \#\{\, i : x_{ij} = w,\ z_{ij} = k \,\}$$
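
The count n_jkw is just a three-way tally: how many tokens in document j have word w and topic k. A sketch with a made-up five-token data set, where the data frame and its column names are hypothetical:

```r
# Hypothetical token-level data: 2 documents, 2 topics, 3 word ids.
tokens <- data.frame(
  doc   = c(1, 1, 1, 2, 2),  # j
  word  = c(1, 1, 3, 2, 3),  # x_ij
  topic = c(1, 1, 2, 2, 2)   # z_ij
)
J <- 2; K <- 2; V <- 3

# n[j, k, w] counts tokens in document j with topic k and word w.
# Explicit factor levels keep zero-count cells in the array.
n <- table(
  doc   = factor(tokens$doc,   levels = seq_len(J)),
  topic = factor(tokens$topic, levels = seq_len(K)),
  word  = factor(tokens$word,  levels = seq_len(V))
)

n[1, 1, 1]  # 2: word 1 appears twice in document 1 under topic 1
```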

CTM - Correlated Topic Models

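$$\theta_{k|j} \sim \log(N(\mu, \Sigma))$$

$$\phi_{w|k} \sim \mathrm{D}[\beta]$$

$$z_{ij} \sim \theta_{k|j}$$

$$x_{ij} \sim \phi_{w|z_{ij}}$$

Here log(N(µ,Σ)) stands for the logistic normal: draw η ∼ N(µ,Σ), then normalize exp(η) so that it sums to one.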
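
The only change from LDA is the prior on the topic mixture. A minimal sketch of one such draw in base R, assuming the logistic-normal reading of the first line (draw a Gaussian vector, then apply softmax). The values of mu and Sigma are made up; the off-diagonal 0.8 makes topics 1 and 2 tend to rise and fall together, which a Dirichlet prior cannot express.

```r
set.seed(1)

K <- 3
mu <- c(0, 0, 0)
Sigma <- matrix(c(1.0, 0.8, 0.0,
                  0.8, 1.0, 0.0,
                  0.0, 0.0, 1.0), nrow = K, byrow = TRUE)

draw_theta <- function(mu, Sigma) {
  # eta ~ N(mu, Sigma), using the Cholesky factor R with t(R) %*% R == Sigma
  R <- chol(Sigma)
  eta <- mu + as.vector(t(R) %*% rnorm(length(mu)))
  # softmax maps eta onto the probability simplex
  w <- exp(eta - max(eta))
  w / sum(w)
}

theta <- draw_theta(mu, Sigma)
print(theta)
```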

Demo!

Sentiment Analysis

Sentiment analysis is about gauging mood based on the text.

Opinion corpus available at:

Wiebe’s corpora: http://www.cs.pitt.edu/mpqa/
Sentiwordnet: http://sentiwordnet.isti.cnr.it/
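
One simple way to use a word-level opinion resource is to add up the polarity of the words that appear in a text. The sketch below uses a six-word toy lexicon invented for illustration; it does not read either resource above, and it ignores negation and context.

```r
# Hypothetical toy lexicon: +1 for positive words, -1 for negative words.
lexicon <- c(good = 1, great = 1, love = 1, bad = -1, awful = -1, hate = -1)

sentiment_score <- function(text, lexicon) {
  words <- strsplit(tolower(text), "[^a-z']+")[[1]]
  words <- words[nchar(words) > 0]
  # Sum the polarity of every word found in the lexicon; unknown words count 0
  sum(lexicon[words[words %in% names(lexicon)]])
}

sentiment_score("I love this great phone", lexicon)             # 2
sentiment_score("The battery is awful and I hate it", lexicon)  # -2
```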

For more sophistication

• Best solved using a Conditional Random Field

• This area is still new

• No R libraries

• Entity Extraction needed for more fine-grained sentiment

Demo!

Entity Extraction

Named Entity Recognition

The purpose of NER is to extract and label phrases in a sentence, for example:

Bill Clinton arrived at the United Nations Building in Manhattan.

Challenges

• People and locations may be referred to in ambiguous ways.

• Entity may never have been seen before

• Entity may be referred to with pronouns

• Wikipedia and Capitalization heuristics aren’t good enough
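
To see why capitalization alone falls short, here is a naive baseline in base R that treats every run of capitalized words as a candidate entity. The function name and the second test sentence are made up for this sketch.

```r
# Naive baseline: return each run of consecutive capitalized words.
capitalized_phrases <- function(sentence) {
  pattern <- "[A-Z][a-z]+(?: [A-Z][a-z]+)*"
  regmatches(sentence, gregexpr(pattern, sentence, perl = TRUE))[[1]]
}

capitalized_phrases("Bill Clinton arrived at the United Nations Building in Manhattan.")
# "Bill Clinton" "United Nations Building" "Manhattan"

capitalized_phrases("The president arrived in new york.")
# "The"
```

The first sentence works, but the phrases come back unlabeled: nothing says which is a person and which is a place. The second shows both failure modes, since the sentence-initial "The" is a false positive and the lowercase "new york" is missed.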

I use the Illinois Named Entity Extractor:
http://cogcomp.cs.illinois.edu/page/software_view/4

Demo!

Demo

Conclusions

There are lots of interesting things you can do with text mining.
R is very good at integrating all of them.

Questions?
