TRANSCRIPT
-
Big Data: Data Analysis Boot Camp
Textual Analysis
Chuck Cartledge, PhD
24 August 2019
-
Intro. Background Hands-on Q & A Conclusion References Files Misc.
Table of contents (1 of 1)
1 Intro.
2 Background
   Contextualize
3 Hands-on
   Examples from the text
   A little silliness
4 Q & A
5 Conclusion
6 References
7 Files
8 Misc.
   Equations
© Old Dominion University
-
What are we going to cover?
We’re going to talk about:
Differences between numerical and textual data analysis.
Define common textual data analysis terms and ideas.
Use different textual analysis tools (knn, naïve Bayes, logit, and support vector machines).
-
Contextualize
Processing textual data is messy.
With numerical data, there are a limited number of ways to get data ready for analysis:
1 Ignore records that are missing/incomplete
2 Fill in missing values (mean, mode, estimated)
3 Accept incomplete records and adjust the uncertainties
Textual data is harder. Data may be complete, but very hard to get ready for analysis.
-
Contextualize
Textual “data wrangling”
There are a few "normal" processing steps to prepare textual data for analysis:
1 Change all text to the same case (usually lower case)
2 Remove all non-textual glyphs (punctuation marks and so on)
3 Remove all numbers
4 Remove all "stop words" (stop words are language and domain specific)
5 Remove all “white space”
6 Apply stemming techniques to what remains
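With the tm package, the six steps above map onto tm_map() calls roughly as follows. This is a sketch, not the class script; the sample sentence is the one the next slide works through:

```r
library(tm)  # text mining framework

corp <- VCorpus(VectorSource(
  "Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data."))

corp <- tm_map(corp, content_transformer(tolower))       # 1. same case
corp <- tm_map(corp, removePunctuation)                  # 2. non-textual glyphs
corp <- tm_map(corp, removeNumbers)                      # 3. numbers
corp <- tm_map(corp, removeWords, stopwords("english"))  # 4. stop words
corp <- tm_map(corp, stripWhitespace)                    # 5. white space
corp <- tm_map(corp, stemDocument)                       # 6. stemming (needs SnowballC)
as.character(corp[[1]])
```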
-
Contextualize
What does all this mean?
A sentence that starts like this [6]:
Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc.
Ends up like this:
text min text analyt appli analyt tool learn collect text data like social media book newspap email etc
-
Contextualize
A few definitions[2]
TF Term Frequency, which measures how frequently a term occurs in a document.

tf(t, d) = (number of times term t appears in document d) / (total number of terms in document d)
IDF Inverse Document Frequency, which measures how important a term is (whether the term is common or rare across all documents).

IDF(t, D) = log( N / |{d ∈ D : t ∈ d}| )

D: the corpus, a collection of documents
N: total number of documents in the corpus, N = |D|
|{d ∈ D : t ∈ d}|: number of documents where the term t appears (i.e., tf(t, d) ≠ 0)
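The two formulas can be checked with a few lines of base R. The toy corpus and term here are invented for illustration:

```r
# Three tokenized "documents" standing in for a corpus D.
D <- list(c("text", "mine", "text"),
          c("data", "mine"),
          c("data", "tool"))

tf  <- function(t, d) sum(d == t) / length(d)
idf <- function(t, D) log(length(D) / sum(sapply(D, function(d) t %in% d)))

tf("text", D[[1]])  # 2 of the 3 terms in document 1 are "text": 2/3
idf("text", D)      # "text" appears in 1 of 3 documents: log(3)
```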
-
Contextualize
TF and IDF for sample string.
The terms:
[1] "analyt" "appli" "book" "collect" "data" "email" "etc"
[8] "learn" "like" "media" "mine" "newspap" "social" "text"
[15] "tool"
The frequency of each term:
[1] 2 1 1 1 1 1 1 1 1 1 1 1 1 3 1
IDF is not a very useful metric with only one document:
weightTfIdf(TermDocumentMatrix(corp1))$v
==> named numeric(0)
(See next slide for code.)
-
Contextualize
R script to create sample text “normalization”
library(NLP)
library(tm)
a
-
Examples from the text
What's happening in the beginning?
We gather up a predefined set of documents, save them locally, and create a term frequency object:
tempFile
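The code on this slide is truncated in the transcript. Below is a hedged stand-in for the pattern it describes; the two documents are invented, whereas the class pulls a predefined review set from the web:

```r
library(tm)

# Save a tiny stand-in document set locally.
tempFile <- tempfile(fileext = ".txt")
writeLines(c("a fine film with strong characters",
             "a dull movie not worth the time"), tempFile)

# Build a corpus and a term frequency object from it.
corp <- VCorpus(VectorSource(readLines(tempFile)))
dtm  <- DocumentTermMatrix(corp)
Frequencies <- colSums(as.matrix(dtm))   # total count of each term
head(sort(Frequencies, decreasing = TRUE), 5)
```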
-
Examples from the text
Afterwards we look at the corpus:
[1] " -- Dumping the object: processed (of type: list, class: VCorpus)"
[2] " -- Dumping the object: processed (of type: list, class: Corpus)"
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 2000
[1] " -- Dumping the object:
head(Frequencies[order(Frequencies, decreasing = T)], 5)
(of type: double, class: numeric)"
film movi one like charact
11109 6857 5759 3998 3855
[1] " -- Dumping the object:
head(DocFrequencies[order(DocFrequencies, decreasing = T)], 5)
(of type: double, class: numeric)"
film one movi like charact
1797 1763 1642 1538 1431
We now know the most common terms across the 2,000 documents in the corpus.
-
Examples from the text
Gathering a few corpus statistics.
It is easy to think about how terms and documents create a 2-dimensional array.
[1] " -- Dumping the object: moreThanOnce (of type: integer, class: integer)"
[1] 9748
[1] " -- Dumping the object: total (of type: integer, class: integer)"
[1] 30585
[1] " -- Dumping the object: prop (of type: double, class: numeric)"
[1] 0.3187183
[1] " -- Dumping the object: ncol(SparseRemoved) (of type: integer, class: integer)"
[1] 202
[1] " -- Dumping the object: sum(rowSums(as.matrix(SparseRemoved)) == 0)
(of type: integer, class: integer)"
[1] 0
[1] " -- Dumping the object: colnames(SparseRemoved) (of type: character, class: character)"
[1] "act" "action" "actor" "actual" "almost" "along"
Columns that have only one entry are assumed not to be too interesting.
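Dropping those uninteresting columns is typically done with removeSparseTerms(). A sketch with an invented mini-corpus; the sparsity threshold here is illustrative, not the value used for the review data:

```r
library(tm)

corp <- VCorpus(VectorSource(c("good film good plot",
                               "bad film weak plot",
                               "odd ending")))
dtm <- DocumentTermMatrix(corp)

# Document frequency of each term; single-entry columns carry little signal.
DocFrequencies <- colSums(as.matrix(dtm) > 0)
moreThanOnce   <- sum(DocFrequencies > 1)

# Drop the sparsest columns (threshold illustrative).
SparseRemoved <- removeSparseTerms(dtm, sparse = 0.4)
colnames(as.matrix(SparseRemoved))
```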
-
Examples from the text
Create a dataframe with all the data
quality
-
Examples from the text
How well do knn classifiers do? (1 of 2)
The code:
Class3n
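The knn code is cut off above. Here is a self-contained sketch of the same idea using class::knn() and caret::confusionMatrix(); the TrainDF of synthetic term counts is invented to stand in for the review dataframe:

```r
library(class)  # knn()
library(caret)  # confusionMatrix()

# Synthetic stand-in for the slides' TrainDF of term counts plus 0/1 quality.
set.seed(1)
TrainDF <- data.frame(good = rpois(200, 2), bad = rpois(200, 2))
TrainDF$quality <- as.integer(TrainDF$good > TrainDF$bad)

predictors <- TrainDF[, c("good", "bad")]
Class3n <- knn(predictors, predictors, cl = TrainDF$quality, k = 3)  # 3 neighbours
Class5n <- knn(predictors, predictors, cl = TrainDF$quality, k = 5)  # 5 neighbours

confusionMatrix(Class3n, as.factor(TrainDF$quality))
```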
-
Examples from the text
How well do knn classifiers do? (2 of 2)
The results:
[1] " -- Dumping the object: confusionMatrix(Class3n,
as.factor(TrainDF$quality)) (of type: list,
class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 358 126
1 134 382
Accuracy : 0.74
...
[1] " -- Dumping the object: confusionMatrix(Class5n,
as.factor(TrainDF$quality)) (of type: list,
class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 336 162
1 156 346
Accuracy : 0.682
-
Examples from the text
How well will a naïve Bayes classifier do? (1 of 2)
The code:
model
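The model line is truncated. A minimal naïve Bayes sketch with e1071::naiveBayes(), again on synthetic stand-in data rather than the review corpus:

```r
library(e1071)  # naiveBayes()
library(caret)  # confusionMatrix()

# Invented term counts plus a 0/1 quality factor.
set.seed(1)
TrainDF <- data.frame(good = rpois(200, 2), bad = rpois(200, 2))
TrainDF$quality <- factor(as.integer(TrainDF$good > TrainDF$bad))

model     <- naiveBayes(quality ~ ., data = TrainDF)
classifNB <- predict(model, TrainDF)
confusionMatrix(TrainDF$quality, classifNB)
```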
-
Examples from the text
How well will a naïve Bayes classifier do? (2 of 2)
The results:
[1] " -- Dumping the object: confusionMatrix(
as.factor(TrainDF$quality), classifNB)
(of type: list, class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 353 139
1 74 434
Accuracy : 0.787
...
[1] " -- Dumping the object: confusionMatrix(
as.factor(TestDF$quality), classifNB)
(of type: list, class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 335 173
1 120 372
Accuracy : 0.707
-
Examples from the text
How well will logistic regression (logit) do? (1 of 2)
The code:
model
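The truncated model line matches the summary on the next slide: a binomial glm() of quality on document length. A base-R sketch with simulated data; the coefficients in the simulation are borrowed from the summary output, the lengths and labels are invented:

```r
set.seed(1)
lengths <- rpois(1000, 350)                               # review lengths in terms
quality <- rbinom(1000, 1, plogis(-0.638 + 0.00183 * lengths))

model   <- glm(quality ~ lengths, family = binomial)      # the slide's logit model
classif <- as.integer(predict(model, type = "response") > 0.5)
table(classif, quality)                                   # raw confusion table
```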
-
Examples from the text
How well will logistic regression (logit) do? (2 of 2)
The results:

[1] " -- Dumping the object: summary(model) (of type: list, class: summary.glm)"
glm(formula = quality ~ lengths, family = binomial)
...
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.6383373 0.1171536 -5.449 5.07e-08 ***
lengths 0.0018276 0.0003113 5.871 4.32e-09 ***
...
[1] " -- Dumping the object: tbl (of type: integer, class: table)"
quality
classif 0 1
0 614 507
1 386 493
...
[1] " -- Dumping the object: confusionMatrix(TrainDF$quality, TrainDF$classif) (of type: list, class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 418 74
1 69 439
Accuracy : 0.857
...
[1] " -- Dumping the object: confusionMatrix(TestDF$quality, TestDF$classif) (of type: list, class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 377 131
1 145 347
Accuracy : 0.724
-
Examples from the text
How well will a support vector machine (svm) do? (1 of 2)

"The support vector classifier is a natural approach for classification in the two-class setting, if the boundary between the two classes is linear."
James, et al. [1]
The code:
modelSVM
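The modelSVM line is truncated. A sketch with e1071::svm() using a linear kernel, per the quote above, on synthetic stand-in data:

```r
library(e1071)  # svm()
library(caret)  # confusionMatrix()

# Invented term counts plus a 0/1 quality factor.
set.seed(1)
TrainDF <- data.frame(good = rpois(300, 2), bad = rpois(300, 2))
TrainDF$quality <- factor(as.integer(TrainDF$good > TrainDF$bad))

modelSVM        <- svm(quality ~ ., data = TrainDF, kernel = "linear")
classifSVMtrain <- predict(modelSVM, TrainDF)
confusionMatrix(TrainDF$quality, classifSVMtrain)
```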
-
Examples from the text
How well will a support vector machine (svm) do? (2 of 2)
The results:
[1] " -- Dumping the object: confusionMatrix(
TrainDF$quality, classifSVMtrain)
(of type: list, class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 449 43
1 38 470
Accuracy : 0.919
...
[1] " -- Dumping the object: confusionMatrix(
TestDF$quality, classifSVMtest)
(of type: list, class: confusionMatrix)"
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 378 130
1 146 346
Accuracy : 0.724
-
A little silliness
Looking at term frequency in a PDF.
We will do a few things:
1 Read text directly from a PDF.
2 "Normalize" the text.
3 Look at the text in different ways.
Attached file.
-
A little silliness
Same image.
Attached file.
-
A little silliness
Look at text frequency as a B&W word cloud
Attached file.
-
A little silliness
Look at text frequency as a color word cloud
Attached file.
-
A little silliness
More colorful examples from Romeo and Juliet
Attached file.
-
Q & A time.
Q: How do you catch a unique rabbit?
A: Unique up on it!

Q: How do you catch a tame rabbit?
A: The tame way!
-
What have we covered?
Compared and contrasted numerical and textual data analysis.
Provided a few numerical definitions (TF, IDF) that are fundamental to textual analysis.
Applied different textual analysis tools and techniques (knn, naïve Bayes, logit, and support vector machine).
Looked at different graphical ways textual data could be displayed.
Next: Serial vs. parallel processing
-
References (1 of 2)
[1] Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani, An Introduction to Statistical Learning, vol. 6, Springer, 2013.

[2] TF-IDF Staff, What does tf-idf mean?, http://www.tfidf.com/, 2017.

[3] Wikipedia Staff, Logistic function, https://en.wikipedia.org/wiki/Logistic_function, 2017.

[4] Wikipedia Staff, Naive Bayes classifier, https://en.wikipedia.org/wiki/Naive_Bayes_classifier, 2017.
-
References (2 of 2)
[5] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, Pearson Education India, 2006.

[6] G. Williams, Hands-On Data Science with R: Text Mining, 2016.
-
Files of interest
1 Revised textual analysis script
2 Silly textual analysis script
3 PDF file used with silly textual analysis script
4 R library script file
5 Other ways to display word clouds
6 Code snippets
rm(list=ls())
library(lattice)
library(ggplot2)
library(NLP)
library(tm)
library(class)
library(caret)
library(e1071)
library(topicmodels)
library(qdapDictionaries)
library(qdapRegex)
library(qdapTools)
library(RColorBrewer)
library(qdap)
library(psych)
source("library.R")
assignBinary threshold]
-
Hands-On Data Science with R
Text Mining
10th January 2016
Visit http://HandsOnDataScience.com/ for more Chapters.
Text Mining (or Text Analytics) applies analytic tools to learn from collections of text data, like social media, books, newspapers, emails, etc. The goal can be considered to be similar to humans learning by reading such material. However, using automated algorithms we can learn from massive amounts of text, very much more than a human can. The material could consist of millions of newspaper articles to perhaps summarise the main themes and to identify those that are of most interest to particular people. Or we might be monitoring twitter feeds to identify emerging topics that we might need to act upon, as they emerge.
The required packages for this chapter include:
library(tm) # Framework for text mining.
library(qdap) # Quantitative discourse analysis of transcripts.
library(qdapDictionaries)
library(dplyr) # Data wrangling, pipe operator %>%().
library(RColorBrewer) # Generate palette of colours for plots.
library(ggplot2) # Plot word frequencies.
library(scales) # Include commas in numbers.
library(Rgraphviz) # Correlation plots.
As we work through this chapter, new R commands will be introduced. Be sure to review the command's documentation and understand what the command does. You can ask for help using the ? command as in:
?read.csv
We can obtain documentation on a particular package using the help= option of library():
library(help=rattle)
This chapter is intended to be hands on. To learn effectively, you are encouraged to have R running (e.g., RStudio) and to run all the commands as they appear here. Check that you get the same output, and you understand the output. Try some variations. Explore.
Copyright © 2013-2015 Graham Williams. You can freely copy, distribute, or adapt this material, as long as the attribution is retained and derivative work is provided under the same license.
http://HandsOnDataScience.com/
http://creativecommons.org/licenses/by-nc-sa/4.0/
-
Data Science with R Hands-On Text Mining
1 Getting Started: The Corpus
The primary package for text mining, tm (Feinerer and Hornik, 2015), provides a framework within which we perform our text mining. A collection of other standard R packages add value to the data processing and visualizations for text mining.
The basic concept is that of a corpus. This is a collection of texts, usually stored electronically, and from which we perform our analysis. A corpus might be a collection of news articles from Reuters or the published works of Shakespeare. Within each corpus we will have separate documents, which might be articles, stories, or book volumes. Each document is treated as a separate entity or record.
Documents which we wish to analyse come in many different formats. Quite a few formats are supported by tm (Feinerer and Hornik, 2015), the package we will illustrate text mining with in this module. The supported formats include text, PDF, Microsoft Word, and XML.
A number of open source tools are also available to convert most document formats to text files. For our corpus used initially in this module, a collection of PDF documents were converted to text using pdftotext from the xpdf application, which is available for GNU/Linux, MS/Windows, and others. On GNU/Linux we can convert a folder of PDF documents to text with:
system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk $f; done")
The -enc ASCII7 ensures the text is converted to ASCII, since otherwise we may end up with binary characters in our text documents.
We can also convert Word documents to text using antiword, which is another application available for GNU/Linux.
system("for f in *.doc; do antiword $f; done")
Copyright © 2013-2015 [email protected] Module: TextMiningO Page: 1 of 46
Draft Only
Generated 2016-01-10 10:00:58+11:00
-
1.1 Corpus Sources and Readers
There are a variety of sources supported by tm. We can use getSources() to list them.
getSources()
## [1] "DataframeSource" "DirSource" "URISource" "VectorSource"
## [5] "XMLSource" "ZipSource"
In addition to different kinds of sources of documents, our documents for text analysis will come in many different formats. A variety are supported by tm:
getReaders()
## [1] "readDOC" "readPDF"
## [3] "readPlain" "readRCV1"
## [5] "readRCV1asPlain" "readReut21578XML"
## [7] "readReut21578XMLasPlain" "readTabular"
## [9] "readTagged" "readXML"
-
1.2 Text Documents
We load a sample corpus of text documents. Our corpus consists of a collection of research papers all stored in the folder we identify below. To work along with us in this module, you can create your own folder called corpus/txt and place into that folder a collection of text documents. It does not need to be as many as we use here, but a reasonable number makes it more interesting.
cname
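The loading code is truncated here. The usual tm pattern for a folder of text documents looks like this; the corpus/txt path follows the convention described above, and this is a sketch rather than the book's exact script:

```r
library(tm)

cname <- file.path(".", "corpus", "txt")   # folder of plain-text documents
docs  <- VCorpus(DirSource(cname))
length(docs)                               # number of documents loaded
```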
-
## ai02.txt 2 PlainTextDocument list
## ai03.txt 2 PlainTextDocument list
## ai97.txt 2 PlainTextDocument list
....
-
1.3 PDF Documents
If instead of text documents we have a corpus of PDF documents, then we can use the readPDF() reader function to convert PDF into text and have that loaded as our Corpus.
docs
-
1.4 Word Documents
A simple open source tool to convert Microsoft Word documents into text is antiword. The separate antiword application needs to be installed, but once it is available it is used by tm to convert Word documents into text for loading into R.
To load a corpus of Word documents we use the readDOC() reader function:
docs
-
2 Exploring the Corpus
We can (and should) inspect the documents using inspect(). This will assure us that data has been loaded properly and as we expect.
inspect(docs[16])
##
## Metadata: corpus specific: 0, document level (indexed): 0
## Content: documents: 1
##
## [[1]]
##
## Metadata: 7
## Content: chars: 44776
viewDocs <- function(d, n) {d %>% extract2(n) %>% as.character() %>% writeLines()}
viewDocs(docs, 16)
## Hybrid weighted random forests for
## classifying very high-dimensional data
## Baoxun Xu1 , Joshua Zhexue Huang2 , Graham Williams2 and
## Yunming Ye1
## 1
##
....
-
3 Preparing the Corpus
We generally need to perform some pre-processing of the text data to prepare for the text analysis. Example transformations include converting the text to lower case, removing numbers and punctuation, removing stop words, stemming and identifying synonyms. The basic transforms are all available within tm.
getTransformations()
## [1] "removeNumbers" "removePunctuation" "removeWords"
## [4] "stemDocument" "stripWhitespace"
The function tm_map() is used to apply one of these transformations across all documents within a corpus. Other transformations can be implemented using R functions and wrapped within content_transformer() to create a function that can be passed through to tm_map(). We will see an example of that in the next section.
In the following sections we will apply each of the transformations, one-by-one, to remove unwanted characters from the text.
-
3.1 Simple Transforms
We start with some manual special transforms we may want to do. For example, we might want to replace "/", used sometimes to separate alternative words, with a space. This will avoid the two words being run into one string of characters through the transformations. We might also replace "@" and "|" with a space, for the same reason.

To create a custom transformation we make use of content_transformer() to create a function to achieve the transformation, and then apply it to the corpus using tm_map().
toSpace
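The toSpace definition is cut off above. The standard shape of such a transformer, wrapped with content_transformer() and applied via tm_map(), is sketched below; the sample string is invented:

```r
library(tm)

# A reusable transformer that replaces a pattern with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

docs <- VCorpus(VectorSource("either/or and name@host and a|b"))
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")
as.character(docs[[1]])
```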
-
3.2 Conversion to Lower Case
docs
-
3.3 Remove Numbers
docs
-
3.4 Remove Punctuation
docs
-
3.5 Remove English Stop Words
docs
-
3.6 Remove Own Stop Words
docs
-
3.7 Strip Whitespace
docs
-
3.8 Specific Transformations
We might also have some specific transformations we would like to perform. The examples here may or may not be useful, depending on how we want to analyse the documents. This is really for illustration using the part of the document we are looking at here, rather than suggesting this specific transform adds value.
toString
-
3.9 Stemming
docs
-
4 Creating a Document Term Matrix
A document term matrix is simply a matrix with documents as the rows and terms as the columns and a count of the frequency of words as the cells of the matrix. We use DocumentTermMatrix() to create the matrix:
dtm
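The dtm line is truncated; the call itself is simple. A minimal sketch with an invented two-document corpus:

```r
library(tm)

docs <- VCorpus(VectorSource(c("data mine data", "mine text")))
dtm  <- DocumentTermMatrix(docs)

inspect(dtm)   # documents as rows, terms as columns, counts as cells
```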
-
5 Exploring the Document Term Matrix
We can obtain the term frequencies as a vector by converting the document term matrix into a matrix and summing the column counts:
freq
-
6 Distribution of Term Frequencies
# Frequency of frequencies.
head(table(freq), 15)
## freq
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
## 2381 1030 503 311 210 188 134 130 82 83 65 61 54 52 51
tail(table(freq), 15)
## freq
## 483 544 547 555 578 609 611 616 703 709 776 887 1366 1446 3101
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
So we can see here that there are 2381 terms that occur just once.
-
7 Conversion to Matrix and Save to CSV
We can convert the document term matrix to a simple matrix for writing to a CSV file, for example, for loading the data into other software if we need to do so. To write to CSV we first convert the data structure into a simple matrix:
m
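The conversion and write are truncated here; the standard pattern follows, with an illustrative output file name and an invented mini-corpus:

```r
library(tm)

docs <- VCorpus(VectorSource(c("data mine data", "mine text")))
dtm  <- DocumentTermMatrix(docs)

m <- as.matrix(dtm)              # plain documents-by-terms matrix
write.csv(m, file = "dtm.csv")   # for loading into other software
```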
-
8 Removing Sparse Terms
We are often not interested in infrequent terms in our documents. Such "sparse" terms can be removed from the document term matrix quite easily using removeSparseTerms():
dim(dtm)
## [1] 46 6508
dtms
-
9 Identifying Frequent Items and Associations
One thing we often want to do first is to get an idea of the most frequent terms in the corpus. We use findFreqTerms() to do this. Here we limit the output to those terms that occur at least 1,000 times:
findFreqTerms(dtm, lowfreq=1000)
## [1] "data" "mine" "use"
So that only lists a few. We can get more of them by reducing the threshold:
findFreqTerms(dtm, lowfreq=100)
## [1] "accuraci" "acsi" "adr" "advers" "age"
## [6] "algorithm" "allow" "also" "analysi" "angioedema"
## [11] "appli" "applic" "approach" "area" "associ"
## [16] "attribut" "australia" "australian" "avail" "averag"
## [21] "base" "build" "call" "can" "care"
## [26] "case" "chang" "claim" "class" "classif"
....
We can also find associations with a word, specifying a correlation limit.
findAssocs(dtm, "data", corlimit=0.6)
## $data
## mine induct challeng know answer
## 0.90 0.72 0.70 0.65 0.64
## need statistician foundat general boost
## 0.63 0.63 0.62 0.62 0.61
## major mani come
....
If two words always appear together then the correlation would be 1.0, and if they never appear together the correlation would be 0.0. Thus the correlation is a measure of how closely associated the words are in the corpus.
-
10 Correlations Plots
[Figure: correlation network graph of 50 frequent terms (accuraci, acsi, adr, advers, age, algorithm, ..., data, databas, dataset, day, decis).]
plot(dtm,
terms=findFreqTerms(dtm, lowfreq=100)[1:50],
corThreshold=0.5)
Rgraphviz (Hansen et al., 2016) from the BioConductor repository for R (bioconductor.org) is used to plot the network graph that displays the correlation between chosen words in the corpus. Here we choose 50 of the more frequent words as the nodes and include links between words when they have at least a correlation of 0.5.
By default (without providing terms and a correlation threshold) the plot function chooses a random 20 terms with a threshold of 0.7.
-
11 Correlations Plot—Options
[Figure: the same correlation network graph, plotted with explicit terms= and corThreshold= options.]
plot(dtm,
terms=findFreqTerms(dtm, lowfreq=100)[1:50],
corThreshold=0.5)
-
12 Plotting Word Frequencies
We can generate the frequency count of all words in a corpus:
freq %
ggplot(aes(word, freq)) +
geom_bar(stat="identity") +
theme(axis.text.x=element_text(angle=45, hjust=1))
[Figure: ggplot2 bar chart of term frequencies (word on the x axis, freq from 0 to 3000 on the y axis; terms include algorithm, can, cluster, data, dataset, featur, method, mine, model, pattern, rule, set, tree, use).]
-
13 Word Clouds
[Figure: word cloud of corpus terms occurring at least 40 times, generated by the wordcloud() call below.]
We can generate a word cloud as an effective way to provide a quick visual overview of the frequency of words in a corpus.
The wordcloud (?) package provides the required function.
library(wordcloud)
set.seed(123)
wordcloud(names(freq), freq, min.freq=40)
Notice the use of set.seed() so that we obtain the same layout each time; otherwise a random layout is chosen, which is usually not an issue.
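For reference, a named frequency vector such as the `freq` used above can be built from any character vector of words. A minimal base R sketch (the words below are illustrative; in the module `freq` is derived from the document term matrix):

```r
# Build a named term-frequency vector like `freq` from a word vector.
# The words are made up for illustration only.
words <- c("data", "mine", "data", "model", "data", "mine")
freq  <- sort(table(words), decreasing=TRUE)
names(freq)     # terms, most frequent first
freq[["data"]]  # count for a particular term
```

Passing `names(freq)` and `freq` to wordcloud(), as above, then sizes each term by its count.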
Copyright © 2013-2015 [email protected] Module: TextMiningO Page: 27 of 46
Draft Only
Generated 2016-01-10 10:00:58+11:00
-
Data Science with R Hands-On Text Mining
13.1 Reducing Clutter With Max Words
[Figure: word cloud limited to the 100 most frequent terms.]
To increase or reduce the number of words displayed we can tune the value of max.words=. Here we have limited the display to the 100 most frequent words.
set.seed(142)
wordcloud(names(freq), freq, max.words=100)
13.2 Reducing Clutter With Min Freq
[Figure: word cloud of terms occurring at least 100 times.]
A more common approach to increasing or reducing the number of words displayed is to tune the value of min.freq=. Here we have limited the display to those words that occur at least 100 times.
set.seed(142)
wordcloud(names(freq), freq, min.freq=100)
13.3 Adding Some Colour
[Figure: word cloud of terms occurring at least 100 times, coloured with a Brewer palette.]
We can also add some colour to the display. Here we make use of brewer.pal() from RColorBrewer (Neuwirth, 2014) to generate a palette of colours to use.
set.seed(142)
wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
13.4 Varying the Scaling
[Figure: word cloud with the font scale range increased.]
We can change the range of font sizes used in the plot using the scale= option. By default the most frequent words have a scale of 4 and the least a scale of 0.5. Here we illustrate the effect of increasing the scale range.
set.seed(142)
wordcloud(names(freq), freq, min.freq=100, scale=c(5, .1), colors=brewer.pal(6, "Dark2"))
13.5 Rotating Words
[Figure: word cloud with 20% of the words rotated 90 degrees.]
We can change the proportion of words that are rotated by 90 degrees from the default 10% to, say, 20% using rot.per=0.2.
set.seed(142)
dark2 <- brewer.pal(6, "Dark2")
wordcloud(names(freq), freq, min.freq=100, rot.per=0.2, colors=dark2)
14 Quantitative Analysis of Text
The qdap (Rinker, 2015) package provides an extensive suite of functions to support the quantitative analysis of text.
We can obtain simple summaries of a list of words, and to do so we will illustrate with the terms from our document term matrix dtm. We first extract the shorter terms from each of our documents into one long word list. To do so we convert dtm into a matrix, extract the column names (the terms) and retain those shorter than 20 characters.
words <- dtm %>%
  as.matrix %>%
  colnames %>%
  (function(x) x[nchar(x) < 20])
We can then summarise the word list. Notice, in particular, the use of dist_tab() from qdap to generate frequencies and percentages.
length(words)
## [1] 6456
head(words, 15)
## [1] "aaai" "aab" "aad" "aadrbhtm" "aadrbltn"
## [6] "aadrhtmliv" "aai" "aam" "aba" "abbrev"
## [11] "abbrevi" "abc" "abcd" "abdul" "abel"
summary(nchar(words))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 3.000 5.000 6.000 6.644 8.000 19.000
table(nchar(words))
##
## 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
## 579 867 1044 1114 935 651 397 268 200 138 79 63 34 28 22
## 18 19
## 21 16
dist_tab(nchar(words))
## interval freq cum.freq percent cum.percent
## 1 3 579 579 8.97 8.97
## 2 4 867 1446 13.43 22.40
## 3 5 1044 2490 16.17 38.57
## 4 6 1114 3604 17.26 55.82
## 5 7 935 4539 14.48 70.31
## 6 8 651 5190 10.08 80.39
## 7 9 397 5587 6.15 86.54
## 8 10 268 5855 4.15 90.69
## 9 11 200 6055 3.10 93.79
## 10 12 138 6193 2.14 95.93
....
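The table that dist_tab() reports can be approximated in base R with table() and cumsum(); a sketch with an illustrative word list (the words here are made up, not the corpus terms):

```r
# Base-R approximation of dist_tab() applied to word lengths:
# frequency, cumulative frequency, percent and cumulative percent.
words <- c("aaai", "abc", "abel", "abbrev", "abcd", "abdul")
tab   <- table(nchar(words))
data.frame(interval    = as.integer(names(tab)),
           freq        = as.integer(tab),
           cum.freq    = cumsum(as.integer(tab)),
           percent     = round(100*as.integer(tab)/length(words), 2),
           cum.percent = round(100*cumsum(as.integer(tab))/length(words), 2))
```

The qdap version adds some niceties, but the underlying computation is just this tabulation.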
14.1 Word Length Counts
[Figure: histogram of word lengths; x axis "Number of Letters", y axis "Number of Words".]
A simple plot is then effective in showing the distribution of the word lengths. Here we create a single column data frame that is passed on to ggplot() to generate a histogram, with a vertical line to show the mean length of words.
data.frame(nletters=nchar(words)) %>%
ggplot(aes(x=nletters)) +
geom_histogram(binwidth=1) +
geom_vline(xintercept=mean(nchar(words)),
colour="green", size=1, alpha=.5) +
labs(x="Number of Letters", y="Number of Words")
14.2 Letter Frequency
[Figure: bar chart of letter proportions, ordered from least frequent (Z, J, Q, X) to most frequent (E).]
Next we want to review the frequency of letters across all of the words in the discourse. Some data preparation will transform the vector of words into a list of letters, for which we then construct a frequency count, and pass this on to be plotted.
We again use a pipeline to string together the operations on the data. Starting from the vector of words stored in words, we split the words into characters using str_split() from stringr (Wickham, 2015), removing the first string (an empty string) from each of the results (using sapply()). Reducing the result into a simple vector, using unlist(), we then generate a data frame recording the letter frequencies, using dist_tab() from qdap. We can then plot the letter proportions.
library(dplyr)
library(stringr)
words %>%
str_split("") %>%
sapply(function(x) x[-1]) %>%
unlist %>%
dist_tab %>%
mutate(Letter=factor(toupper(interval),
levels=toupper(interval[order(freq)]))) %>%
ggplot(aes(Letter, weight=percent)) +
geom_bar() +
coord_flip() +
labs(y="Proportion") +
scale_y_continuous(breaks=seq(0, 12, 2),
label=function(x) paste0(x, "%"),
expand=c(0,0), limits=c(0,12))
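The same letter proportions can be computed without the stringr/qdap pipeline; a base R sketch with illustrative words:

```r
# Count letter frequencies across a vector of words in base R:
# split each word into characters, pool them, and tabulate proportions.
words <- c("text", "mining", "with", "r")
chars <- unlist(strsplit(words, ""))
prop  <- sort(table(toupper(chars))/length(chars), decreasing=TRUE)
round(100*prop, 1)   # percentage for each letter
```

Note that base strsplit() does not produce the leading empty string that the stringr version in the pipeline has to discard.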
14.3 Letter and Position Heatmap
[Figure: heatmap of the proportion of each letter (A-Z) at each position (1-19) within words.]
The qheat() function from qdap provides an effective visualisation of tabular data. Here we transform the list of words into a position count of each letter, and construct a table of the proportions that is passed on to qheat() to do the plotting.
words %>%
lapply(function(x) sapply(letters, gregexpr, x, fixed=TRUE)) %>%
unlist %>%
(function(x) x[x!=-1]) %>%
(function(x) setNames(x, gsub("\\d", "", names(x)))) %>%
(function(x) apply(table(data.frame(letter=toupper(names(x)),
position=unname(x))),
1, function(y) y/length(x))) %>%
qheat(high="green", low="yellow", by.column=NULL,
values=TRUE, digits=3, plot=FALSE) +
labs(y="Letter", x="Position") +
theme(axis.text.x=element_text(angle=0)) +
guides(fill=guide_legend(title="Proportion"))
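The letter-by-position counts behind the heatmap can also be sketched in base R, without qdap, by cross-tabulating each letter against the position it occupies (the words below are illustrative):

```r
# Cross-tabulate letters against their positions within words.
words <- c("data", "mine", "model")
pos   <- do.call(rbind, lapply(words, function(w) {
  ch <- strsplit(w, "")[[1]]
  data.frame(letter=toupper(ch), position=seq_along(ch))
}))
tab <- table(pos)        # counts: letter x position
round(tab/sum(tab), 3)   # proportions, as visualised by qheat()
```

This is the same letter/position table the pipeline above builds via gregexpr(), just computed more directly.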
14.4 Miscellaneous Functions
We can generate gender from a name list, using the genderdata package.
devtools::install_github("lmullen/gender-data-pkg")
name2sex(qcv(graham, frank, leslie, james, jacqui, jack, kerry, kerrie))
## The genderdata package needs to be installed.
## Error in install genderdata package(): Failed to install the genderdata package.
## Please try installing the package for yourself using the following command:
## install.packages("genderdata", repos = "http://packages.ropensci.org", type = "source")
15 Word Distances
Continuous bag of words (CBOW). Word2Vec associates each word in a vocabulary with a unique vector of real numbers of length d. Words that have a similar syntactic context appear closer together within the vector space. The syntactic context is based on a set of words within a specific window size.
install.packages("tmcn.word2vec", repos="http://R-Forge.R-project.org")
## Installing package into ’/home/gjw/R/x86_64-pc-linux-gnu-library/3.2’
## (as ’lib’ is unspecified)
##
## The downloaded source packages are in
## '/tmp/Rtmpt1u3GR/downloaded_packages'
library(tmcn.word2vec)
model
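The idea of words appearing "closer together" in the vector space can be illustrated with cosine similarity. The vectors below are made up for illustration; they are not embeddings learned by word2vec:

```r
# Toy embedding matrix (rows are word vectors of length d = 3).
emb <- rbind(king  = c(0.9, 0.1, 0.4),
             queen = c(0.8, 0.2, 0.5),
             apple = c(0.1, 0.9, 0.2))
# Cosine similarity, the usual closeness measure for word vectors.
cosine <- function(a, b) sum(a*b)/sqrt(sum(a^2)*sum(b^2))
cosine(emb["king",], emb["queen",])   # high: similar contexts
cosine(emb["king",], emb["apple",])   # lower: dissimilar contexts
```

A trained model simply learns such vectors so that words sharing window contexts end up with high cosine similarity.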
16 Review—Preparing the Corpus
Here in one sequence is collected the code to perform a text mining project. Note that we would not necessarily do all of these steps, so pick and choose as is appropriate to your situation.
# Required packages
library(tm)
library(wordcloud)

# Locate and load the Corpus.
cname <- file.path(".", "corpus", "txt")
docs <- Corpus(DirSource(cname))
17 Review—Analysing the Corpus
# Document term matrix.
dtm <- DocumentTermMatrix(docs)
18 LDA
Topic models such as Latent Dirichlet Allocation (LDA) have been popular for text mining in the last 15 years, applied with varying degrees of success. Text is fed into LDA to extract the topics underlying the text document. Examples are the AP corpus and the Science Corpus 1880-2002 (Blei and Lafferty 2009).

When is LDA applicable? It will fail on some data, we need to choose the number of topics to find, and we need to know how many documents are needed. How do we know the topics learned are correct topics?

Two fundamental papers, independently discovered: Blei, Ng and Jordan (NIPS 2001), with 11K citations, and Pritchard, Stephens and Donnelly (Genetics, June 2000), with 14K citations. The models are exactly the same except for minor differences: topics versus population structures.
There is no theoretical analysis as such. How do we guarantee correct topics, and how efficient is the learning procedure?
Observations:
LDA won’t work on many short tweets or very few long documents.
We should not liberally over-fit the LDA with too many redundant topics...
Limiting factors:
We should use as many documents as we can; short documents of less than 10 words won't work even if there are many of them. We need sufficiently long documents.
A small Dirichlet parameter helps, especially if we overfit. See Long Nguyen's keynote at PAKDD 2015 in Vietnam.
The number of documents is the most important factor.

Document length plays a useful role too.

Avoid overfitting: you end up with too many topics and don't really learn anything, as a human needs to cull the topics.
New work detects new topics as they emerge.
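The remark above about a small Dirichlet parameter can be illustrated in base R: a draw from a symmetric Dirichlet is a normalised vector of Gamma variates, and a small alpha concentrates the topic mixture on a few topics. The function and values here are illustrative, not part of any LDA package:

```r
# Draw one K-dimensional topic mixture from a symmetric Dirichlet(alpha).
rdirichlet1 <- function(K, alpha) {
  g <- rgamma(K, shape=alpha)
  g/sum(g)
}
set.seed(1)
round(rdirichlet1(5, alpha=0.1), 2)  # sparse: mass on a few topics
round(rdirichlet1(5, alpha=10), 2)   # even: mass spread over topics
```

In LDA this is why a small document-topic prior encourages each document to be explained by only a handful of topics.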
library(lda)
## Error in library(lda): there is no package called ’lda’
# From demo(lda)
library("ggplot2")
library("reshape2")
data(cora.documents)
## Warning in data(cora.documents): data set ’cora.documents’ not found
data(cora.vocab)
## Warning in data(cora.vocab): data set ’cora.vocab’ not found
theme_set(theme_bw())
set.seed(8675309)
K <- 10  # number of topics (value assumed)
19 Further Reading and Acknowledgements
The Rattle Book, published by Springer, provides a comprehensive introduction to data mining and analytics using Rattle and R. It is available from Amazon. Other documentation on a broader selection of R topics of relevance to the data scientist is freely available from http://datamining.togaware.com, including the Datamining Desktop Survival Guide.
This chapter is one of many chapters available from http://HandsOnDataScience.com. In particular follow the links on the website with a *, which indicates the generally more developed chapters.
Other resources include:
The Journal of Statistical Software article, Text Mining Infrastructure in R, is a good start: http://www.jstatsoft.org/v25/i05/paper
Bilisoly (2008) presents methods and algorithms for text mining using Perl.
Thanks also to Tony Nolan for suggestions of some of the examples used in this chapter.
Some of the qdap examples were motivated by http://trinkerrstuff.wordpress.com/2014/10/31/exploration-of-letter-make-up-of-english-words/.
http://www.amazon.com/gp/product/1441998896/ref=as_li_qf_sp_asin_tl?ie=UTF8&tag=togaware-20&linkCode=as2&camp=217145&creative=399373&creativeASIN=1441998896
http://datamining.togaware.com
http://datamining.togaware.com/survivor/index.html
http://HandsOnDataScience.com
http://www.jstatsoft.org/v25/i05/paper
http://trinkerrstuff.wordpress.com/2014/10/31/exploration-of-letter-make-up-of-english-words/
-
20 References
Bilisoly R (2008). Practical Text Mining with Perl. Wiley Series on Methods and Applications in Data Mining. Wiley. ISBN 9780470382851. URL http://books.google.com.au/books?id=YkMFVbsrdzkC.

Feinerer I, Hornik K (2015). tm: Text Mining Package. R package version 0.6-2, URL https://CRAN.R-project.org/package=tm.

Hansen KD, Gentry J, Long L, Gentleman R, Falcon S, Hahne F, Sarkar D (2016). Rgraphviz: Provides plotting capabilities for R graph objects. R package version 2.12.0.

Neuwirth E (2014). RColorBrewer: ColorBrewer Palettes. R package version 1.1-2, URL https://CRAN.R-project.org/package=RColorBrewer.

R Core Team (2015). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. URL https://www.R-project.org/.

Rinker T (2015). qdap: Bridging the Gap Between Qualitative Data and Quantitative Analysis. R package version 2.2.4, URL https://CRAN.R-project.org/package=qdap.

Wickham H (2015). stringr: Simple, Consistent Wrappers for Common String Operations. R package version 1.0.0, URL https://CRAN.R-project.org/package=stringr.

Williams GJ (2009). “Rattle: A Data Mining GUI for R.” The R Journal, 1(2), 45–55. URL http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Williams.pdf.

Williams GJ (2011). Data Mining with Rattle and R: The art of excavating data for knowledge discovery. Use R! Springer, New York.
This document, sourced from TextMiningO.Rnw bitbucket revision 76, was processed by KnitR version 1.12 of 2016-01-06 and took 41.3 seconds to process. It was generated by gjw on nyx running Ubuntu 14.04.3 LTS with Intel(R) Xeon(R) CPU W3520 @ 2.67GHz having 8 cores and 12.3GB of RAM. It completed the processing 2016-01-10 09:58:57.
-
Getting Started: The Corpus
Corpus Sources and Readers
Text Documents
PDF Documents
Word Documents
Exploring the Corpus
Preparing the Corpus
Simple Transforms
Conversion to Lower Case
Remove Numbers
Remove Punctuation
Remove English Stop Words
Remove Own Stop Words
Strip Whitespace
Specific Transformations
Stemming
Creating a Document Term Matrix
Exploring the Document Term Matrix
Distribution of Term Frequencies
Conversion to Matrix and Save to CSV
Removing Sparse Terms
Identifying Frequent Items and Associations
Correlations Plots
Correlations Plot—Options
Plotting Word Frequencies
Word Clouds
Reducing Clutter With Max Words
Reducing Clutter With Min Freq
Adding Some Colour
Varying the Scaling
Rotating Words
Quantitative Analysis of Text
Word Length Counts
Letter Frequency
Letter and Position Heatmap
Miscellaneous Functions
Word Distances
Review—Preparing the Corpus
Review—Analysing the Corpus
LDA
Further Reading and Acknowledgements
References
-
Creating Shaped Wordclouds Using R
Tidewater Big Data Enthusiasts
Chuck Cartledge
Developer
November 3, 2016 at 11:04pm
Contents

List of Figures
1 Introduction
2 Discussion
3 Conclusion
A Misc. files

List of Figures

1 A sample word cloud based on Romeo and Juliet
2 A more interesting word cloud based on Romeo and Juliette
3 An empty word cloud figure
4 A filled word cloud figure
5 A filled USA word cloud figure
6 A collection of sample word clouds
-
1 Introduction
The R library wordcloud provides an easy way to create an image showing how often a word (or tag) appears in a corpus (see Figure 1). In a word cloud, the size of a word indicates how often that word appears. Word cloud words can be colored as well.
While word clouds are easy to create, often the clouds could be shaped differently to create a more lasting and profound impression (see Figure 2).
2 Discussion
The R library wordcloud2[1] provides the capability of creating a word cloud that takes the shape of an image, or the shape of letters. The collection of predefined shapes includes:
• ’cardioid’ – a heart shape
• ’circle’ – the default
• ’diamond’ – an alias for a square
• ’pentagon’ – the five sided object
• ’star’ – a five pointed star
• ’triangle’ – a triangle with the wide base at the bottom
• ’triangle-forward’ – a triangle with the wide base at the left
This collection of shapes (when combined with a user specified background color) may be enough to satisfy a wide variety of needs. But it is the figPath option that offers the most potential.
The figPath option can point to a figure that contains the image the cloud path should fill.
Here are the steps to create an “interesting” shape to fill with a word cloud:
1. Download/create an image with only two items (see Figure 3):
• A white background, and
• A black outline of the shape.
2. Fill the interior of the shape with the same color as the outline (see Figure 4).

3. Pass the location of the filled image as the figPath parameter (see Figure 5).
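The steps above can be sketched in R as follows. Assumptions: usa.png is the filled silhouette prepared in steps 1 and 2 (the file from the Files section), and demoFreq is the demonstration frequency table shipped with the wordcloud2 package; the size and colour values are illustrative.

```r
library(wordcloud2)

# Fill a user-supplied silhouette: figPath points at the filled image.
wordcloud2(demoFreq, figPath="usa.png", size=1.5,
           color="skyblue", backgroundColor="white")

# Or fill the shape of letters directly:
letterCloud(demoFreq, word="USA")
```

Both calls return an HTML widget, so the result opens in the default browser rather than in an R graphics device, as discussed below.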
-
Figure 1: A sample word cloud based on Romeo and Juliet. The image was created using the wordcloud function in the wordcloud library and the text from “Romeo and Juliet.”
-
Figure 2: A more interesting word cloud based on Romeo and Juliette. The image was created using the wordcloud2 function in the wordcloud2 library and the text from “Romeo and Juliet.”
-
Figure 3: An empty word cloud figure.
-
Figure 4: A filled word cloud figure.
-
Figure 5: A filled USA word cloud figure.
The wordcloud2 function behaves slightly differently than most of the other R plot functions that I’ve used. The result from both wordcloud2 and letterCloud is not displayable within R. These functions actually create an HTML page in a temporary directory with embedded JavaScript that performs the placement of the words within the shape, and provides a level of interaction after the page is displayed. R “understands” that the product from these functions is an HTML widget and starts up the default browser to show the page. The page and its sub-directories are removed when R ends.
The fact that the page uses JavaScript introduces some interesting aspects. Buried in the JavaScript used by the page to place the words in the cloud are a plethora of Math.random() calls. The JavaScript specification says that the Math.random() function has to return a value greater than or equal to 0, and less than 1, which is reasonable for a random function. The specification also says that the implementation of the random function is up to the JavaScript application, and does not specify how the numbers are to be generated. Meaning that the same HTML page being viewed by two different browsers may generate two different sequences of random numbers. Most random number generators have the capability of setting a seed value so that a repeatable sequence can be generated. JavaScript does not support the idea of a random number seed. The HTML page and collection of directories can be moved to a server where they are available for use and support.
All of this means that each loading and viewing of the page will generate a different image, and there is no practical way to “get back” to an image that was good.

In the Files section (see Section A) is an R script and support files to work with. The R script was used to create various images (see Figure 6).

[1] Available from: https://github.com/Lchiffon/wordcloud2
3 Conclusion
The wordcloud2 library enables you to create word clouds of arbitrary shape inside an HTML page, using JavaScript to position and orient each word. Each HTML page and its associated library files are placed in individual directories that are removed when the creating R process terminates. Pages and files can be moved, or copied for safe keeping if desired. Because the pages use the Math.random() JavaScript function, each time the page is loaded, words will be positioned differently in the cloud. If the desired shape has an internal hole, then it is possible that some words may not be placed in the cloud.
wordcloud2 allows you to create word clouds to support your data visualization needs.
-
(a) A heart.
(b) The letters “USA”.
(c) A star.
(d) The USA.
Figure 6: A collection of sample word clouds. These images were created with the attached R script.
-
A Misc. files
The files used to create all these figures are attached to this report. They are:
1. romeoAndJuliet.base64 – default text used to demonstrate the software
2. heart.png – a heart shape with a hole
3. usa.png – an outline of the continental United States
4. wordCloud.R – an R script to demonstrate making word clouds