Text Mining with R for Social Science Research
Ryan Wesslen
Outline
Hour 1: Fundamentals of Text Mining with R
• Why text, examples and AlchemyAPI demo (10 min)
• “Bag of Words” overview (10 min)
• Text preprocessing & visualizations using R Studio (30 min)
• 10 minute break
Hour 2: Applications
• Federalist Papers (History/Political Science), 30 min
  • Naïve Bayes classifier using word occurrences
  • K-nearest neighbor classifier using topic modeling
• Federal Reserve Beige Book (Economics), 30 min
  • Lexicon-based sentiment analysis
Objective
• Learn the basics of text preprocessing using R.
• Learn the “bag of words” approach to text analytics (tokenizing, cleaning, word clouds, associations, visualizations).
• Run three text mining applications for the social sciences: text classification (Naïve Bayes & k-nearest neighbors), topic modeling (LDA) and lexicon-based sentiment analysis.
  • Datasets: the Federalist Papers (Poli/Hist) and the Fed Reserve Beige Book (Econ)
• Learn about resources (papers, textbooks, lecture notes, blogs) to encourage further research in text mining and natural language processing (NLP).
Why analyze text?
• Growing
• Interesting
• Untapped
Language Technology
Mostly solved:
• Spam detection (classification): “Let’s go to Agra!” ✓ vs. “Buy V1AGRA …” ✗
• Part-of-speech (POS) tagging: “Colorless green ideas sleep furiously.” → ADJ ADJ NOUN VERB ADV
• Named entity recognition (NER): “Einstein met with UN officials in Princeton” → PERSON, ORG, LOC
Making good progress:
• Sentiment analysis: “Best roast chicken in San Francisco!” vs. “The waiter ignored us for 20 minutes.”
• Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
• Word sense disambiguation (WSD): “I need new batteries for my mouse.”
• Parsing: “I can see Alcatraz from the window!”
• Machine translation (MT): “第13届上海国际电影节开幕…” ↔ “The 13th Shanghai International Film Festival…”
• Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → Party, May 27, add
Still really hard:
• Question answering (QA): “Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
• Paraphrase: “XYZ acquired ABC yesterday” / “ABC has been taken over by XYZ”
• Summarization: “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”
• Dialog: “Where is Citizen Kane playing in SF?” “Castro Theatre at 7:30. Do you want a ticket?”
Source: Dan Jurafsky
Why else is text mining difficult?
• Non-standard English: “Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥”
• Segmentation issues: “the New York-New Haven Railroad”
• Idioms: dark horse, get cold feet, lose face, throw in the towel
• Neologisms: unfriend, retweet, bromance
• Tricky entity names: “Where is A Bug’s Life playing …”, “Let It Be was recorded …”, “… a mutation on the for gene …”
• Sarcasm: A: “I love Justin Bieber. Do you like him too?” B: “Yeah. Sure. I absolutely love him.”
Source: Dan Jurafsky (modified)
AlchemyAPI Example
• Go to http://www.alchemyapi.com/
• Click on the homepage
• As an introduction, copy/paste Federalist Paper #10: https://www.congress.gov/resources/display/content/The+Federalist+Papers#TheFederalistPapers-10
• Click and explore!
Federalist Papers & Text Classification
1:10pm
Alexander Hamilton
Federalist Paper setup
• Not so true story, bro (about how many papers each wrote)
• Reality: the authorship of twelve papers is disputed
  • Hamilton claimed authorship before he was killed; Madison disputed those claims eight years later.
  • Adair (1944), Mosteller & Wallace (1963), Fung (2003), Collins et al. (2004)
• Three tasks:
  • Pre-process and explore the data (word cloud, associations, etc.)
  • Naïve Bayes classification to predict the author of the 12 disputed papers based on word counts.
  • Topic modeling to identify key themes, and k-nearest neighbors to predict the author based on the papers’ topics.
Basic Text Terminology
Corpus
Document
Term
“Bag of Words” Approach
• Simplest way to quantify text
  • Counts how many times each term occurs in each document
  • Document-Term Matrix
• Ignores word order
• N-grams (uni-, bi-, tri-, etc.)
• Good at classification
  • Like a spam filter
• Bad at semantic meaning
Source: Chris Manning
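The workshop builds its document-term matrix in R (FederalistPapers.R). As a language-neutral illustration only, here is a minimal Python sketch of the same idea using just the standard library. The function names and the two toy sentences are assumptions made for this example, not part of the workshop code.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace; a deliberately naive tokenizer."""
    return text.lower().split()

def ngrams(tokens, n):
    """Return the n-grams of a token list, each joined with a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def document_term_matrix(docs, n=1):
    """Build a document-term matrix of n-gram counts.

    Returns (vocabulary, rows), where rows[i][j] is the number of times
    vocabulary[j] occurs in docs[i].
    """
    counts = [Counter(ngrams(tokenize(d), n)) for d in docs]
    vocabulary = sorted(set().union(*counts))
    rows = [[c[term] for term in vocabulary] for c in counts]
    return vocabulary, rows

# Two toy documents with the same words in a different order.
docs = ["the dog bit the man", "the man bit the dog"]
vocab, dtm = document_term_matrix(docs)             # unigram counts
bi_vocab, bi_dtm = document_term_matrix(docs, n=2)  # bigram counts
```

With unigrams the two rows are identical, which is what "ignores word order" means in practice; with bigrams the rows differ, because bigrams retain some local word order.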
Preprocessing
• Tokenization
• Cleaning: lower case, white space, punctuation
• Stemming, lemmatization and/or collocations
• Filter: remove stop words
Tokenize → Clean → Stem → Filter
Then a hurricane came, and devastation reigned
then a hurricane came and devastation reigned
then a hurricane came and devastation reigned
then a hurricane came and devastation reigned
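The four stages can be sketched in a few lines. This is an illustrative Python version, not the workshop's R code: the stop-word list is a tiny made-up sample, and `toy_stem` is a crude suffix stripper written for this example (a real analysis would use a proper stemmer such as Porter's).

```python
import string

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "the", "then"}

def tokenize(text):
    """Split raw text into tokens on whitespace."""
    return text.split()

def clean(tokens):
    """Lowercase each token and strip punctuation; drop empty results."""
    table = str.maketrans("", "", string.punctuation)
    cleaned = [t.lower().translate(table) for t in tokens]
    return [t for t in cleaned if t]

def toy_stem(token):
    """Crude suffix stripper for illustration only (not the Porter stemmer)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    """Run the stages in slide order: tokenize, clean, stem, filter."""
    tokens = clean(tokenize(text))
    stems = [toy_stem(t) for t in tokens]
    return [t for t in stems if t not in STOP_WORDS]

result = preprocess("Then a hurricane came, and devastation reigned")
```

Here `result` keeps only the content-bearing tokens, with "reigned" reduced to "reign" by the toy stemmer.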
Part 1: R Studio, Working Directory & R Packages
• Open R Studio and FederalistPapers.R (see GitHub site)
1:20pm Code Lines: 1 - 49
Part 2: Load csv (text) file & view
Code Lines: 50-79
Part 3a: Pre-processing
Federalist Paper 1: Before
Federalist Paper 1: After
Code Lines: 71-88
Part 3b: Additional Pre-processing
Federalist Paper 1: After
Code Lines: 89-104
Part 4a: Document-Term Matrices
Code Lines: 142-149
Part 4b: Word Cloud
Code Lines: 151-165
1:30pm
Part 4c: Term Frequencies
Code Lines: 167-171
Part 4d: Word Associations
Code Lines: 173-188
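Word associations are commonly measured as the correlation between two terms' counts across documents. The sketch below assumes Pearson correlation and a minimum-correlation cutoff; it is a Python illustration with invented counts, and the workshop's own R code may compute associations differently.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # correlation is undefined for a constant column; treat as 0
    return cov / sqrt(var_x * var_y)

def associations(dtm, vocabulary, term, min_corr=0.5):
    """Terms whose per-document counts correlate with `term` at or above min_corr."""
    j = vocabulary.index(term)
    target = [row[j] for row in dtm]
    found = {}
    for k, other in enumerate(vocabulary):
        if k == j:
            continue
        r = pearson(target, [row[k] for row in dtm])
        if r >= min_corr:
            found[other] = round(r, 2)
    return found

# Hypothetical counts: rows are documents, columns follow `vocabulary`.
vocabulary = ["army", "militia", "taxes"]
dtm = [
    [3, 2, 0],
    [0, 0, 4],
    [2, 1, 1],
    [0, 0, 3],
]
assoc = associations(dtm, vocabulary, "army")
```

In this toy matrix "militia" rises and falls with "army" and is reported as an association, while "taxes" moves in the opposite direction and is filtered out.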
Part 4e: Word Clustering: Hierarchical
Code Lines: 189-201
Part 4f: K-Means Word Clustering
Code Lines: 202-207
Part 5: Redo with Stemming, Bigrams and additional stop words
Uncomment (CTRL + SHIFT + C) and run lines 107-139, then rerun lines 141-206.
Code Lines: 107-139
Classification Models (Overview)
• Classification models predict class labels
  • Class labels = categories
  • For example, binary (yes or no), ordinal (high, medium, low) or nominal (dog, cat, kangaroo)
• Classification models are a type of supervised learning, as the class labels (“y variables”) are known (observed).
• Determining the authorship of the disputed Federalist papers is a binary classification problem, as the author of each disputed paper is one of two people: Hamilton or Madison.
1:50pm - 2pm
Types of Classification Models
• There are many different models (algorithms) that can be used for classification problems.
  • Examples: logistic regression, decision trees, support vector machines, neural networks
• We are going to use Naïve Bayes and k-nearest neighbors.
• We will use different feature variables (X variables):
  • Naïve Bayes = word presence (1/0) as X variables
  • k-nearest neighbors = topic probabilities as X variables
Naïve Bayes
• Naïve Bayes is an algorithm based on Bayes’ theorem (conditional probability)
• Updates the probability of the predicted class (e.g. who is the author) based on words found in the class (author’s papers).
  • Example in spam filters: the word “Viagra” increases the odds that an email is spam
• Pro: Simple, can handle many features (x variables)
• Con: Difficult to interpret, subject to assumptions (e.g. independence of x variables)
• See these slides for a deeper overview of Naïve Bayes and its assumptions.
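To make the update rule concrete, here is a minimal Python sketch of a Naïve Bayes classifier on word presence (1/0) features. It assumes add-one (Laplace) smoothing and works in log space; the vocabulary, documents and labels are invented toy data, and this is an illustration, not the workshop's R implementation.

```python
from math import log

def train_bernoulli_nb(docs, labels, vocabulary):
    """Estimate class priors and P(word present | class) with add-one smoothing.

    docs: list of sets of words (word presence, not counts).
    """
    classes = sorted(set(labels))
    priors, cond = {}, {}
    for c in classes:
        class_docs = [d for d, y in zip(docs, labels) if y == c]
        priors[c] = len(class_docs) / len(docs)
        cond[c] = {
            w: (sum(w in d for d in class_docs) + 1) / (len(class_docs) + 2)
            for w in vocabulary
        }
    return priors, cond

def predict(doc, priors, cond):
    """Return the class with the highest log posterior for a set of words."""
    scores = {}
    for c, prior in priors.items():
        score = log(prior)
        for w, p in cond[c].items():
            # Naive assumption: words are independent given the class.
            score += log(p) if w in doc else log(1 - p)
        scores[c] = score
    return max(scores, key=scores.get)

# Hypothetical toy data; the documents and labels are invented for illustration.
vocabulary = ["upon", "whilst", "government"]
train_docs = [{"upon", "government"}, {"upon"}, {"whilst", "government"}, {"whilst"}]
train_labels = ["Hamilton", "Hamilton", "Madison", "Madison"]

priors, cond = train_bernoulli_nb(train_docs, train_labels, vocabulary)
guess = predict({"whilst", "government"}, priors, cond)
```

In this toy setup a document containing "whilst" is assigned to Madison, because that word appears only in the Madison-labeled training documents, while "government" appears equally in both classes and does not move the prediction.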
Basics of Predictive Modeling
• In predictive modeling, datasets are divided into training and test sets (the test set is sometimes called validation)
• Federalist Papers:
  • Training dataset = 65 papers* with known author (known label)
  • Test dataset = 12 papers with disputed author (missing label)
  *Excludes the papers written by John Jay (five) and those written jointly by Madison & Hamilton (three)
• Our objective is to build a model that successfully predicts the training dataset authors (accuracy).
• After building the model, apply it to the test dataset to predict the authors for the 12 disputed papers.
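The training/test workflow can be sketched as follows. This Python example is only an illustration: the records are placeholders, and the "model" is a stand-in that always predicts the most common training author, chosen so that the split and the accuracy calculation stay in focus.

```python
def split_by_label(papers):
    """Split records into training (known author) and test (author is None)."""
    train = [p for p in papers if p["author"] is not None]
    test = [p for p in papers if p["author"] is None]
    return train, test

def accuracy(predicted, actual):
    """Share of predictions that match the known labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Hypothetical records; ids and labels are placeholders, not real paper numbers.
papers = [
    {"id": "A", "author": "Hamilton"},
    {"id": "B", "author": "Madison"},
    {"id": "C", "author": "Hamilton"},
    {"id": "D", "author": None},  # disputed: label missing
]
train, test = split_by_label(papers)

# Stand-in "model": always predict the most common training author.
train_authors = [p["author"] for p in train]
majority = max(set(train_authors), key=train_authors.count)

# Step 1: check how well the model reproduces the known training labels.
train_accuracy = accuracy([majority] * len(train), train_authors)

# Step 2: apply the model to the papers whose label is missing.
test_predictions = {p["id"]: majority for p in test}
```

Accuracy can only be computed where labels are known, which is why the disputed papers yield predictions but no accuracy score.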
Part 6: Naïve Bayes Pre-Processing
Code Lines: 208-219
Conditional Probabilities
Update
Code Lines: 231-241
Odds Ratios
Code Lines: 242-248
Train Naïve Bayes and Predict Training Dataset
Code Lines: 250-273
Predict Test (Disputed) Dataset
Code Lines: 275-290
Part 7: Running Topic Modeling
This will take about 4 mins, depending on the computer you run it on
Code Lines: 295-308
Topic Modeling Overview
Source: David Blei (link to article)
Create LDAVis Tool & Label Topics
Code Lines: 295-308
LDAVis Package to Visualize Topics
The output is the Index.html file in the “Federalist” folder in your working directory. Open it with Firefox; Chrome and IE are not supported.
Topic Clustering Heatmap via R Shiny
Code Lines: 321-349
Topic Clustering (Plot)
Nearest Neighbors
Code Lines: 350-370
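A k-nearest-neighbors classifier on topic probabilities can be sketched as below. This Python example assumes Euclidean distance and a simple majority vote, and the topic probabilities are invented; the workshop's R code may use a different distance, a different k, or a packaged implementation.

```python
from collections import Counter
from math import sqrt

def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knn_predict(query, train_vectors, train_labels, k=3):
    """Majority label among the k training vectors closest to `query`."""
    nearest = sorted(
        zip(train_vectors, train_labels),
        key=lambda pair: euclidean(query, pair[0]),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Hypothetical topic probabilities (each row sums to 1); values are invented.
train_vectors = [
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
    [0.6, 0.3, 0.1],
]
train_labels = ["Hamilton", "Hamilton", "Madison", "Madison", "Hamilton"]

guess = knn_predict([0.15, 0.15, 0.70], train_vectors, train_labels, k=3)
```

The prediction depends on the choice of k and on how the topics were estimated, so it is worth rerunning with several values before drawing conclusions.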
Predictions
• Naïve Bayes predicts 9 of the 12 papers as written by Madison.
• K-NN predicts only 4 of the 12 papers as written by Madison.
• Why? How stable are these results?
Code Lines: 371-373
Beige Book & Sentiment Analysis
2:30pm
Sentiment Analysis
• Two main types of textual information: facts and opinions
• Search engines are optimized for facts.
• Sentiment analysis is a growing attempt (not completely solved) to optimize the discovery of opinions.
• Opinion mining, or sentiment analysis, is an attempt to recognize the opinion or sentiment that a person holds toward an object.
Source: Richard Heimann
Where do we find sentiment?
• Movies / Books: Are the reviews of this movie/book positive or negative?
• Product Sales: What do people think of the new iPhone?
• Public Sentiment: How do consumers feel about the economy? How is consumer sentiment affecting sales by sector?
• Politics: How are voters polarized, if at all, around a candidate or policy?
• Prediction: stock prices, election outcomes, market trends, product sales
Source: Richard Heimann
Three Types of Sentiment Analysis
• Dictionary-based sentiment analysis
  • i.e. Is an attitude toward an object positive or negative?
  • Build a dictionary of positive / negative words and count net occurrence
• Supervised learning for sentiment analysis
  • i.e. Given data we have seen in the past, can we predict class assignment for our polarity measure (positive/neutral/negative)?
  • e.g. Naive Bayes, MaxEnt, SVM
• Unsupervised sentiment analysis
  • i.e. No dictionaries. No labeled data. No training algorithms. Instead, scale words (often bi-grams) and users on a single dimension.
  • e.g. latent variable models such as Item Response Theory (IRT)
Source: Richard Heimann
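The dictionary-based approach (count positive words, subtract negative words) fits in a few lines. The Python sketch below uses two tiny made-up lexicons and invented example sentences; it illustrates the net-count idea only and is not the workshop's sentiment.R.

```python
import re

# Tiny hypothetical lexicons; real sentiment dictionaries hold thousands of words.
POSITIVE = {"strong", "growth", "improved", "gains"}
NEGATIVE = {"weak", "decline", "slowed", "losses"}

def sentiment_score(text, positive=POSITIVE, negative=NEGATIVE):
    """Net sentiment: count of positive words minus count of negative words."""
    words = re.findall(r"[a-z']+", text.lower())
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    return pos - neg

# Invented example sentences in the style of an economic report.
reports = [
    "Retail sales improved and employment growth remained strong.",
    "Manufacturing slowed, with weak orders and a decline in exports.",
]
scores = [sentiment_score(r) for r in reports]
```

Raw net counts grow with document length, so scores from documents of different sizes are usually normalized before they are compared.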
Federal Reserve Beige Book
• The Beige Book is a report published by the United States Federal Reserve Board (FRB) eight times a year.
• The Beige Book has been in publication since 1985 and is now published online.
• The report is published by each of the Federal Reserve Bank districts (n=12).
• The content is rather anecdotal. The report draws on interviews with key business contacts, economists, market experts, and others to get their opinion about the economy.
• The data used in this book can be found on GitHub, as well as the Python code for all the scraping and parsing.
Source: Richard Heimann
Beige Book Case Study: Initial Steps
• Step 1: Download SentimentBeigeBook.zip from https://github.com/wesslen/BeigeBookSentimentAnalysis
• Step 2: Save into a local directory. Open R Studio
• Step 3: Open “sentiment_analysis.R” and “sentiment.R”
Step 1: Working Directory
• Run “sentiment.R”. This script computes the net count (positive minus negative words) for a document, given the sentiment (lexicon) dictionary. It will be used later on.
• Set working directory based on where you downloaded the zip file contents.
• Note: For Windows, use “C:\Directory\Folder\” formatting
Step 2: Import in Dictionaries
Step 3: Import Corpus / Text
Step 4: Pre-Processing
Step 5: Create corpus & tokenize
Step 6: Stemming & Stop Words
Step 7: Term-Document Matrix
Step 8: Explore Common Words
Step 9: Add more pos/neg words
Step 10: Word Associations
Step 11: Word Cloud
Step 12: Sentiment Scoring
First six records of BB.sentiment
Step 13: Normalizing Scores
First six records of BB.sentiment (updated)
Step 14: Score Histograms
Raw Scored Sentiment
Scaled Scored Sentiment
Step 15: Plot Historical Sentiment
Step 16: Run beigebookplots.R
Concluding Thoughts
• Bag of words approach is a simple text mining framework
  • Works well for exploratory analysis, classification and basic sentiment analysis.
  • Deeper models are needed to identify semantic meaning (e.g. GloVe, recurrent neural networks; see Stanford Deep Learning NLP class materials)
• R is a great tool for simple, visual-based text mining
  • However, R has limitations (scale, functions, etc.)
  • Python (nltk) and Java are better for large-scale, PhD-dissertation research
• Text mining is an iterative process
  • There is not a single model or method that always works; it depends on context!
  • If your initial results are vague, enhance pre-processing
    • e.g. remove more stop words, try bi/trigrams, try stemming or lemmas, customize lexicon
• Text mining is an art, not a science.
  • Need domain experience AND algorithms
  • If you’re a social scientist, make friends with a computer scientist (and vice versa).
Project Mosaic & Next Workshop
• Project Mosaic offers consulting, workshops and other collaborative research opportunities.
• Upcoming workshops: https://projectmosaic.uncc.edu/events-list/
• Next month: workshop on Text Mining for Twitter
  • Will include reference to SOPHI, UNCC Data Science Initiative’s data warehouse that includes GNIP access to historical Twitter data.
  • If you are planning on attending, please register for Twitter API credentials before the workshop.
  • Follow these instructions (to set up with R connector): http://www.r-bloggers.com/setting-up-the-twitter-r-package-for-text-analytics/
Proprietary Text Mining Tools
• AlchemyAPI
  • Limited free use
• Taste Analytics Signals
  • Two-week free trial
• SAS Enterprise Miner
  • Student version available via UNCC
• SAS Sentiment Analysis
  • Available on some UNCC computers
Hamilton Soundtrack Amazon Reviews
Open Source Text Mining Tools
• R tm package
  • Great for simple analysis but difficult for more complex analysis.
• Python nltk package
  • Probably one of the best open source text mining packages
• Python gensim package
  • Another fantastic Python package; focuses on topic modeling
• Mallet
  • Great NLP toolkit but requires background in Java and the command line
Online Text / NLP Courses
• Introductory / Intermediate:
  • Dan Jurafsky (Stanford), Introductory Text Mining Class
  • Chris Manning / Dan Jurafsky (Stanford), Coursera Natural Language Processing Class
  • ChengXiang Zhai (University of Illinois at Urbana-Champaign), Coursera Text Mining & Analytics Course
• Advanced (but way cool, cutting edge stuff):
  • Richard Socher (Stanford), Deep Learning for Natural Language Processing
Blogs
• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words
• http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis
• https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/
• http://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/
• Want more? Follow this link for all R “text” blogs on the R-bloggers website