Hands On: Text Mining with R
Jahnab Kumar Deka
Introduction
• Goal: to learn from collections of text documents such as books, newspapers, and emails.
Important terms:
• Tokenization
• Tagging (noun/verb/…)
• Chunking (noun phrase)
• Stemming (-ing/-s/-ed)
Important packages in R
• library(tm) # Framework for text mining.
• library(SnowballC) # Provides wordStem() for stemming.
• library(qdap) # Quantitative discourse analysis of transcripts.
• library(qdapDictionaries)
• library(dplyr) # Data preparation and pipes %>%.
• library(RColorBrewer) # Generate palette of colours for plots.
• library(ggplot2) # Plot word frequencies.
• library(scales) # Include commas in numbers.
• library(Rgraphviz) # Correlation plots.
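Corpus
• A corpus is a collection of texts.
• Each corpus holds separate articles, stories, or volumes, each treated as a separate entity or record.
• Other file formats can be converted to text files for the corpus (the pdftotext and antiword tools must be installed), e.g.:
  • PDF to text file
    • system("for f in *.pdf; do pdftotext -enc ASCII7 -nopgbrk \"$f\"; done")
  • Word document to text file (antiword prints to standard output, so the output is redirected to a file)
    • system("for f in *.doc; do antiword \"$f\" > \"${f%.doc}.txt\"; done")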
Corpus
• Consider the folder corpus/txt
• List some of the file names
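
The slide's own listing code is not preserved in this transcript, so the following is only a minimal sketch of how the folder could be inspected from R. The temporary folder and the two file names are hypothetical stand-ins for corpus/txt, created here so the example runs on its own.

```r
# Hypothetical setup: a throwaway folder standing in for corpus/txt.
cname <- file.path(tempdir(), "tm_demo_corpus", "txt")
dir.create(cname, recursive = TRUE, showWarnings = FALSE)
writeLines("Text mining with R.", file.path(cname, "doc1.txt"))
writeLines("Data mining finds patterns in data.", file.path(cname, "doc2.txt"))

# Count the documents and list (some of) their file names.
n_files <- length(dir(cname))
file_names <- dir(cname)
print(n_files)
print(head(file_names))
```

With a real corpus, cname would simply point at the existing folder, for example file.path(".", "corpus", "txt").
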
Loading Corpus
• Loading the corpus
  ** DirSource() creates the source object, which is passed on to Corpus(), which loads the documents.
• In case of PDF documents
  • docs <- Corpus(DirSource(cname), readerControl=list(reader=readPDF))
  ** The xpdf application needs to be installed for readPDF()
• In case of Word documents
  • docs <- Corpus(DirSource(cname), readerControl=list(reader=readDOC(AntiwordOptions="-r -s")))
  ** -r requests that removed text be included in the output
  ** -s requests that text hidden by Word be included
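
For plain text files, no special reader is needed. The sketch below assumes a folder of .txt files; the temporary folder and its contents are hypothetical and exist only to make the example self-contained.

```r
library(tm)

# Hypothetical demo folder standing in for corpus/txt.
cname <- file.path(tempdir(), "tm_demo_load")
dir.create(cname, recursive = TRUE, showWarnings = FALSE)
writeLines("Text mining with R.", file.path(cname, "doc1.txt"))
writeLines("Data mining finds patterns.", file.path(cname, "doc2.txt"))

# DirSource() describes where the files live; Corpus() reads them in.
docs <- Corpus(DirSource(cname))
n_docs <- length(docs)
print(n_docs)
```
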
Exploration of Corpus
• inspect()
• Preparing the corpus
• Transformation types
  • tm_map() is used to apply one of these transformations
  • Other transformations can be implemented as R functions and wrapped within content_transformer()
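
As a rough illustration of the exploration step, the sketch below builds a two-document toy corpus in memory (an assumption, since the slide's corpus is not available), inspects it, and lists the transformations that tm provides out of the box.

```r
library(tm)

# Toy in-memory corpus; the two sentences are made up for illustration.
docs <- VCorpus(VectorSource(c("Text mining with R.",
                               "Data mining finds patterns.")))

inspect(docs[1])                      # summary of the first document
first_text <- content(docs[[1]])      # raw text of the first document
available <- getTransformations()     # names of the built-in transformations
print(available)
```
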
Transformation Example
• Replace “/”, “@” and “\\|” with a space
• Alternate method
• Conversion to lower case
• Remove numbers
• Remove punctuation
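
The code for these steps is not preserved in the transcript, so the following is a sketch of one way to carry them out with tm_map(). The helper name toSpace and the sample sentence are assumptions made for this example.

```r
library(tm)

# Hypothetical sample text containing "/", "@", "|", digits and punctuation.
raw <- "Contact: Alice/Bob @ example|org in 2016!"
docs <- VCorpus(VectorSource(raw))

# Helper (name assumed): replace every match of a pattern with a space.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# One call per character ...
docs <- tm_map(docs, toSpace, "/")
docs <- tm_map(docs, toSpace, "@")
docs <- tm_map(docs, toSpace, "\\|")

# ... or, as an alternate method, a single call with a combined pattern.
docs_alt <- tm_map(VCorpus(VectorSource(raw)), toSpace, "/|@|\\|")
same_result <- identical(content(docs[[1]]), content(docs_alt[[1]]))

# Lower case, then drop numbers and punctuation.
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
docs <- tm_map(docs, removePunctuation)

result <- content(docs[[1]])
print(result)
```

Wrapping tolower in content_transformer() keeps each element a tm document rather than a bare character vector.
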
Contd...
• Remove English stop words
• Remove own stop words
• Strip whitespace
• Specific transformations
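
A possible version of these four steps is sketched below. The sample sentence, the custom stop words, the replacement pair, and the helper name replaceText are all hypothetical choices for this example.

```r
library(tm)

# Hypothetical sample text, already lower-cased.
docs <- VCorpus(VectorSource(
  "the department of   email studies at harvard university"))

# Remove standard English stop words, then a custom list.
docs <- tm_map(docs, removeWords, stopwords("english"))
docs <- tm_map(docs, removeWords, c("department", "email"))

# Collapse the runs of spaces left behind by the removals.
docs <- tm_map(docs, stripWhitespace)

# Specific transformation (helper name assumed): replace one phrase with another.
replaceText <- content_transformer(function(x, from, to) gsub(from, to, x))
docs <- tm_map(docs, replaceText, "harvard university", "HU")

result <- content(docs[[1]])
print(result)
```
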
Contd...
• Stemming
• Creating a Document Term Matrix
  A matrix with documents as the rows, terms as the columns, and word frequency counts as the cells.
• Term frequency
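
To make the three steps concrete, here is a small sketch on a made-up two-document corpus. It assumes the SnowballC package is installed, since stemDocument() relies on it.

```r
library(tm)
library(SnowballC)

# Toy corpus; the documents are invented for illustration.
docs <- VCorpus(VectorSource(c("connected connecting connections data",
                               "data data analysis")))

# Stemming reduces inflected forms to a common stem ("connect").
docs <- tm_map(docs, stemDocument)

# Rows are documents, columns are terms, cells are counts.
dtm <- DocumentTermMatrix(docs)
inspect(dtm)

# Term frequency: total count of each term across all documents.
freq <- colSums(as.matrix(dtm))
print(freq)
```
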
Contd...
• Frequency order of items
  • ord <- order(freq)
• Least frequent items
  • freq[head(ord)]
• Most frequent items
  • freq[tail(ord)]
• Document Term Matrix to CSV
  • dtm <- DocumentTermMatrix(docs)
  • m <- as.matrix(dtm)
  • write.csv(m, file="dtm.csv")
Contd...
• Removing sparse terms
  • dtms <- removeSparseTerms(dtm, 0.1) # sparse factor
  • The resulting matrix keeps only terms whose sparsity (the fraction of documents in which the term does not occur) is below the given sparse factor.
• Frequent items and association
  ** lowfreq = 1000 selects terms that occur at least 1000 times
• Association of a word with other words, subject to a correlation limit
  • # association of “data” with other words
  • # if two words always appear together, the correlation is 1.0
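
The calls behind these bullets are only partly preserved, so the sketch below shows how removeSparseTerms(), findFreqTerms() and findAssocs() might be used together. The four-document corpus and the thresholds (0.4, 3, 0.5) are assumptions scaled down to fit the toy data.

```r
library(tm)

# Toy corpus; "of" is dropped because terms under 3 characters are ignored by default.
docs <- VCorpus(VectorSource(c("data mining data",
                               "data mining text",
                               "text analysis",
                               "analysis of text")))
dtm <- DocumentTermMatrix(docs)

# Keep only terms missing from fewer than 40% of the documents.
dtms <- removeSparseTerms(dtm, 0.4)
kept <- Terms(dtms)

# Terms occurring at least 3 times in total.
frequent <- findFreqTerms(dtm, lowfreq = 3)

# Terms whose correlation with "data" is at least 0.5.
assoc <- findAssocs(dtm, "data", corlimit = 0.5)

print(kept)
print(frequent)
print(assoc)
```

Here "text" survives the sparsity filter because it appears in three of the four documents, while every other term appears in only two.
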
Correlation
• Example plot
  • 50 of the more frequent words
  • Minimum correlation of 0.5
  • Words with at least 100 occurrences
• By default
  • 20 random terms
  • Minimum correlation of 0.7
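
The plot itself needs Rgraphviz and is not reproduced here. As a stand-in, the base R sketch below shows the idea behind the correlation threshold: it computes term-to-term correlations from a small hand-made count matrix (hypothetical data) and lists the pairs that would be linked at a minimum correlation of 0.5. It is not the plotting code from the slides.

```r
# Hypothetical document-term counts: 4 documents (rows) by 4 terms (columns).
m <- matrix(c(2, 1, 0, 0,    # data
              1, 1, 0, 0,    # mining
              0, 1, 1, 1,    # text
              0, 0, 1, 1),   # analysis
            nrow = 4,
            dimnames = list(paste0("doc", 1:4),
                            c("data", "mining", "text", "analysis")))

# Pairwise correlations between term columns.
cm <- cor(m)

# Keep each pair once (upper triangle) if it meets the threshold.
idx <- which(cm >= 0.5 & upper.tri(cm), arr.ind = TRUE)
edges <- data.frame(term1 = rownames(cm)[idx[, "row"]],
                    term2 = colnames(cm)[idx[, "col"]],
                    cor   = round(cm[idx], 2),
                    stringsAsFactors = FALSE)
print(edges)
```
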
Plotting word frequencies
• freq <- sort(colSums(as.matrix(dtm)), decreasing=TRUE)
• wf <- data.frame(word=names(freq), freq=freq)
• # words that occur at least 500 times in the corpus
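
The plotting call is not preserved in the transcript. One way the frequency table could be turned into a bar chart with ggplot2 is sketched below; the frequency values are made up, and the 500 cutoff follows the comment on the slide.

```r
library(ggplot2)

# Hypothetical term frequencies standing in for colSums(as.matrix(dtm)).
freq <- sort(c(data = 900, mining = 650, text = 520, model = 120),
             decreasing = TRUE)
wf <- data.frame(word = names(freq), freq = freq)

# Keep only words that occur at least 500 times.
wf_top <- subset(wf, freq >= 500)

# Bar chart, most frequent word first.
p <- ggplot(wf_top, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity") +
  labs(x = "Word", y = "Frequency") +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
print(p)
```
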
Word cloud
Size of Word & Frequency
• For word limitation
  • wordcloud(names(freq), freq, max.words=100)
• For term frequency limitation
  • wordcloud(names(freq), freq, min.freq=100)
• Adding color
  • wordcloud(names(freq), freq, min.freq=100, colors=brewer.pal(6, "Dark2"))
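
The wordcloud() function comes from the wordcloud package, which is not in the package list earlier in the deck, so it is loaded explicitly in this sketch. The frequencies are hypothetical, and set.seed() is an addition here to make the random layout repeatable.

```r
library(wordcloud)
library(RColorBrewer)

# Hypothetical term frequencies.
freq <- c(data = 120, mining = 90, text = 75, corpus = 40, model = 25, term = 10)

# Words that meet the minimum frequency and will therefore be drawn.
shown <- names(freq)[freq >= 25]

set.seed(123)  # fix the random placement of words
wordcloud(names(freq), freq, min.freq = 25,
          colors = brewer.pal(6, "Dark2"))
```
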
Quantitative Analysis of Text (qdap)
• Extract the column names (the terms) and retain those shorter than 20 characters
• Generate frequencies and percentages
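
The qdap code for this step is not preserved. The sketch below uses base R only, as a stand-in for whatever qdap function the slide showed, to filter terms by length and tabulate word lengths with percentages. The term list is hypothetical.

```r
# Hypothetical terms standing in for colnames(as.matrix(dtm)).
terms <- c("data", "mining", "text", "analysis", "r", "corpus",
           "averyveryverylongextractedtoken")

# Retain only terms shorter than 20 characters.
words <- terms[nchar(terms) < 20]

# Frequency and percentage of each word length.
counts <- table(nchar(words))
dist <- data.frame(length  = as.integer(names(counts)),
                   freq    = as.vector(counts),
                   percent = round(100 * as.vector(counts) / length(words), 2))
dist$cum.freq <- cumsum(dist$freq)
print(dist)
```
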
Contd...
• Word length counts
  ** Vertical line = mean length of words
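
A chart of this kind could be drawn with ggplot2 roughly as follows. The word list is hypothetical, and the dashed line marks the mean word length as described on the slide.

```r
library(ggplot2)

# Hypothetical terms standing in for the corpus vocabulary.
words <- c("data", "mining", "text", "analysis", "r", "corpus")
lens <- data.frame(nletters = nchar(words))
mean_len <- mean(lens$nletters)

# Histogram of word lengths with a vertical line at the mean.
p <- ggplot(lens, aes(x = nletters)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean_len, colour = "green", linetype = "dashed") +
  labs(x = "Number of letters", y = "Number of words")
print(p)
```
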
Letter and Position Heatmap
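
The heatmap on this slide is not preserved. As an approximation of the idea, the sketch below counts how often each letter appears at each position in a word and draws the counts as tiles with ggplot2. The word list is hypothetical, and this is not necessarily the method used in the original deck.

```r
library(ggplot2)

# Hypothetical terms.
words <- c("data", "text", "mining")

# Split each word into letters and record each letter's position in its word.
chars <- strsplit(words, "")
letters_seen <- unlist(chars)
positions <- unlist(lapply(chars, seq_along))

# Cross-tabulate letter by position.
counts <- table(letter = letters_seen, position = positions)
heat <- as.data.frame(counts)

# One tile per letter/position pair, shaded by count.
p <- ggplot(heat, aes(x = position, y = letter, fill = Freq)) +
  geom_tile() +
  labs(x = "Position in word", y = "Letter", fill = "Count")
print(p)
```
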