HANDS ON: TEXT MINING WITH R

Jahnab Kumar Deka

Introduction
• To learn from collections of text documents such as books, newspapers, emails, etc.

Important Terms
• Tokenization
• Tagging (Noun/Verb/…)
• Chunking (Noun Phrase)
• Stemming (-ing/-s/-ed)
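Two of these terms, tokenization and stemming, can be illustrated with a minimal sketch. The sentence is made up, and splitting on non-letters is only one simple way to tokenize; wordStem() comes from SnowballC.

```r
library(SnowballC)

# Made-up example sentence.
sentence <- "She walks, he walked, they are walking"

# Tokenization: lower-case the text and split on runs of non-letters.
tokens <- strsplit(tolower(sentence), "[^a-z]+")[[1]]
tokens <- tokens[nchar(tokens) > 0]

# Stemming: strip suffixes such as -s, -ed and -ing.
stems <- wordStem(tokens, language = "english")
data.frame(token = tokens, stem = stems)
```

The three forms of "walk" reduce to the same stem, which is why stemming is usually applied before counting terms.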

Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palette of colours for plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.

Corpus
• A collection of texts
• A corpus holds separate articles, stories or volumes, each treated as a separate entity or record.
• Many file formats can be converted to plain text files for the corpus, e.g.:
• PDF to text file
• system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk \"$f\"; done")
• Word document to text file (antiword writes to standard output, so redirect it to a file)
• system("for f in *.doc; do antiword \"$f\" > \"${f%.doc}.txt\"; done")

Corpus
• Consider the folder corpus/txt
• List some of the file names
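Listing the files can be sketched as follows. The variable name cname matches the one used in the loading slides; the temporary folder and its two files are made up so that the example runs anywhere.

```r
# Hypothetical stand-in for corpus/txt: a temporary folder with two made-up files.
cname <- file.path(tempdir(), "corpus", "txt")
dir.create(cname, recursive = TRUE, showWarnings = FALSE)
writeLines("Text mining with R.", file.path(cname, "doc1.txt"))
writeLines("A second document.", file.path(cname, "doc2.txt"))

length(dir(cname))  # number of documents in the folder
head(dir(cname))    # the first few file names
```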

Loading Corpus
• Loading the corpus
** Using DirSource(), the source object is passed on to Corpus(), which loads the documents.
• In case of PDF documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF)) ** The xpdf application needs to be installed for readPDF()
• In case of Word documents
• docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC(AntiwordOptions="-r -s"))) ** -r requests that removed text be included in the output ** -s requests that text hidden by Word be included
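For plain text files no special reader is needed. A minimal sketch, assuming a folder of .txt files (here a made-up temporary folder stands in for corpus/txt):

```r
library(tm)

# Hypothetical stand-in for corpus/txt with two made-up documents.
cname <- file.path(tempdir(), "corpus_load", "txt")
dir.create(cname, recursive = TRUE, showWarnings = FALSE)
writeLines("Text mining with R.", file.path(cname, "doc1.txt"))
writeLines("A second document.", file.path(cname, "doc2.txt"))

# DirSource() describes where the files are; Corpus() reads them.
docs <- Corpus(DirSource(cname))
length(docs)  # one entry per file
```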

Exploration of Corpus
• inspect() displays the documents in a corpus
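A short sketch of exploring a corpus, using two made-up in-memory documents instead of files:

```r
library(tm)

# Made-up documents, held in memory via VectorSource().
docs <- VCorpus(VectorSource(c("Text mining with R.",
                               "A second, longer document about corpora.")))

inspect(docs[1])         # summary of a one-document sub-corpus
as.character(docs[[1]])  # the raw text of the first document
```

Single brackets select a sub-corpus, double brackets select one document.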

Preparing the Corpus
• Transformation types
• tm_map() is used to apply one of these transformations
• Other transformations can be implemented as R functions and wrapped within content_transformer()
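As a sketch of both routes: getTransformations() lists the transformations that tm ships, and any ordinary function (here base R's trimws(), chosen only for illustration) can be wrapped with content_transformer(). The input text is made up.

```r
library(tm)

getTransformations()  # names of the built-in transformations

docs <- VCorpus(VectorSource("  Hello   world  "))

# A built-in transformation applied with tm_map().
docs <- tm_map(docs, stripWhitespace)

# An ordinary R function wrapped so that it works on documents.
docs <- tm_map(docs, content_transformer(trimws))

as.character(docs[[1]])
```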

Transformation Example
• Replace “/”, “@” and “|” (written “\\|” as a regular expression) with a space
• Alternate method: a single pattern that matches all three
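Both methods can be sketched as below. The helper name toSpace and the sample text are assumptions made for illustration.

```r
library(tm)

docs <- VCorpus(VectorSource("contact: user@example.org / sales|support"))

# toSpace is a hypothetical helper: replace every match of a pattern with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Method 1: one call per character.
step <- tm_map(docs, toSpace, "/")
step <- tm_map(step, toSpace, "@")
step <- tm_map(step, toSpace, "\\|")

# Alternate method: one regular expression covering all three.
once <- tm_map(docs, toSpace, "/|@|\\|")

as.character(once[[1]])
```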

• Conversion to lower case
• Remove numbers
• Remove punctuation
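These three steps can be sketched with tm's own transformations; the sample sentence is made up.

```r
library(tm)

docs <- VCorpus(VectorSource("R 3.2 was Released in 2015, hooray!"))

docs <- tm_map(docs, content_transformer(tolower))  # lower case
docs <- tm_map(docs, removeNumbers)                 # drop digits
docs <- tm_map(docs, removePunctuation)             # drop punctuation

out <- as.character(docs[[1]])
out
```

tolower() is a base R function, so it has to be wrapped in content_transformer(); removeNumbers and removePunctuation are tm transformations and can be passed directly.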

Contd...
• Remove English stop words
• Remove own stop words
• Strip whitespace
• Specific transformations
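A sketch of these four steps on a made-up sentence. The own stop words ("department", "email"), the phrase being abbreviated and the helper name replacePhrase are all assumptions for illustration.

```r
library(tm)

docs <- VCorpus(VectorSource(
  "the department of the united nations sent an email about the report"))

docs <- tm_map(docs, removeWords, stopwords("english"))      # English stop words
docs <- tm_map(docs, removeWords, c("department", "email"))  # own stop words
docs <- tm_map(docs, stripWhitespace)                        # collapse runs of spaces

# Specific transformation: replacePhrase is a hypothetical helper name.
replacePhrase <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, replacePhrase, "united nations", "UN")

out <- as.character(docs[[1]])
out
```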

Contd...
• Stemming
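At the corpus level, stemming can be sketched with tm's stemDocument transformation, which relies on SnowballC. The input words are made up.

```r
library(tm)
library(SnowballC)

docs <- VCorpus(VectorSource("walking walks walked"))

# Reduce each word to its stem.
stemmed <- tm_map(docs, stemDocument)

as.character(stemmed[[1]])
```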

• Creating a Document Term Matrix: a matrix with documents as the rows, terms as the columns, and the frequency counts of the words as the cells.
• Term frequency
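A minimal sketch of both steps on three made-up documents. Term frequency is taken here as the column sum of the matrix, which is an assumption consistent with the freq vector used on the later slides.

```r
library(tm)

texts <- c("data mining finds patterns",
           "text mining mines text",
           "data about data")
docs <- VCorpus(VectorSource(texts))

# Rows are documents, columns are terms, cells are counts.
dtm <- DocumentTermMatrix(docs)
dim(dtm)

# Term frequency: total count of each term over all documents.
freq <- colSums(as.matrix(dtm))
freq
```

Converting to a dense matrix with as.matrix() is fine for a small corpus but can exhaust memory on a large one.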

Contd...
• Frequency order of items
• ord <- order(freq)
• Least frequent items
• freq[head(ord)]
• Most frequent items
• freq[tail(ord)]
• Document Term Matrix to CSV
• dtm <- DocumentTermMatrix(docs)
• m <- as.matrix(dtm)
• write.csv(m, file="dtm.csv")

Contd...
• Removing sparse terms
• dtms <- removeSparseTerms(dtm, 0.1) # 0.1 = maximal allowed sparsity
• The resulting matrix contains only terms whose sparsity (the fraction of documents in which the term does not occur) is less than the given value.
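The effect of the sparsity threshold can be seen on a small made-up corpus of four documents, where "data" occurs in every document and "mining" in half of them:

```r
library(tm)

texts <- c("data mining", "data science", "data mining tools", "big data")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Strict threshold: only terms missing from fewer than 10% of documents survive.
strict <- removeSparseTerms(dtm, 0.1)
Terms(strict)

# Looser threshold: terms missing from fewer than 60% of documents survive.
loose <- removeSparseTerms(dtm, 0.6)
Terms(loose)
```

A value close to 0 keeps only near-universal terms, while a value close to 1 removes very little.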

• Frequent items and association
** lowfreq = 1000 selects the terms that occur at least 1000 times
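A sketch using tm's findFreqTerms(). The corpus is made up and tiny, so the threshold is 2 rather than 1000.

```r
library(tm)

texts <- c("data mining finds patterns",
           "text mining mines text",
           "data about data")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Terms whose total count over the corpus is at least lowfreq.
frequent <- findFreqTerms(dtm, lowfreq = 2)
frequent
```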

• Association of a word with other terms, subject to a correlation limit
• # association of “data” with other words
• # two words that always appear together have a correlation of 1.0
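A sketch with tm's findAssocs() on made-up documents in which "data" and "mining" always appear together. The limit of 0.6 is an arbitrary choice for the example.

```r
library(tm)

texts <- c("data mining", "data mining text", "text analysis", "word cloud")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Terms whose correlation with "data" across documents is at least corlimit.
assoc <- findAssocs(dtm, "data", corlimit = 0.6)
assoc$data
```

"text" shares only one document with "data" and falls below the limit, so it is not reported.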

Correlation
• 50 of the more frequent words
• With a minimum correlation of 0.5
• Word occurrences of at least 100
• By default
• 20 random terms
• With a minimum correlation of 0.7
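These settings could be passed to tm's plot method for a document term matrix roughly as below. The corpus is made up, the frequency threshold is lowered to 2 to suit it, and the plot itself is only attempted when Rgraphviz is installed.

```r
library(tm)

texts <- c("data mining tools", "data mining text", "text analysis tools",
           "word cloud plot", "data mining plot")
dtm <- DocumentTermMatrix(VCorpus(VectorSource(texts)))

# Up to 50 of the terms that occur at least twice.
terms <- head(findFreqTerms(dtm, lowfreq = 2), 50)

# Term-to-term correlations across documents, the quantity compared
# against the threshold.
cors <- cor(as.matrix(dtm[, terms]))
round(cors, 2)

# Terms are linked when their correlation reaches corThreshold.
if (requireNamespace("Rgraphviz", quietly = TRUE)) {
  plot(dtm, terms = terms, corThreshold = 0.5)
}
```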

Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• # words that occur at least 500 times in the corpus
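One way to draw such a plot with ggplot2 is sketched below. The counts are made up and the threshold is lowered from 500 to 2 to suit them; the bar chart layout is an assumption.

```r
library(ggplot2)

# Made-up term counts standing in for colSums(as.matrix(dtm)).
freq <- sort(c(data = 5, mining = 3, text = 2, cloud = 1, word = 1),
             decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)

# Keep the frequent words and draw one bar per word.
frequent <- subset(wf, freq >= 2)
p <- ggplot(frequent, aes(x = word, y = freq)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
p
```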

Word cloud
• The size of a word reflects its frequency
** wordcloud() comes from library(wordcloud)
• To limit the number of words
• wordcloud(names(freq), freq, max.words=100)
• To limit by term frequency
• wordcloud(names(freq), freq, min.freq=100)
• Adding colour
• wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))

Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter than 20 characters
• Generate frequencies and percentages
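The two steps can be sketched in base R. This is a substitute for qdap's own helpers, not the qdap code itself, and the terms are made up (in practice they would be the column names of the document term matrix).

```r
# Made-up terms standing in for colnames(as.matrix(dtm)).
terms <- c("data", "mining", "text", "analysis", "corpus", "word",
           "averyveryverylongtokenfromanurl")

# Retain the terms shorter than 20 characters.
words <- terms[nchar(terms) < 20]

# Frequency and percentage of each word length.
counts <- table(nchar(words))
dist <- data.frame(length  = as.integer(names(counts)),
                   freq    = as.integer(counts),
                   percent = round(100 * as.integer(counts) / length(words), 2))
dist
```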

Contd...
• Word length counts
** Vertical line = mean length of words
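A plot of this kind could be produced with ggplot2 as sketched here. The words are made up, and the histogram styling is an assumption, not a reproduction of the slide's figure.

```r
library(ggplot2)

# Made-up terms; in practice these come from the document term matrix.
words <- c("data", "mining", "text", "analysis", "corpus", "word")
lens <- data.frame(letters = nchar(words))

# One bar per word length, with a vertical line at the mean length.
p <- ggplot(lens, aes(x = letters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean(lens$letters), colour = "green") +
  labs(x = "Number of letters", y = "Number of words")
p
```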

Letter and Position Heatmap
