Text Mining with R for Social Science Research Ryan Wesslen


Post on 11-Apr-2017


Page 1: Text Mining with R for Social Science Research

Text Mining with R for Social Science Research

Ryan Wesslen

Page 2: Text Mining with R for Social Science Research

Outline

Hour 1: Fundamentals of Text Mining with R
• Why text, examples and AlchemyAPI demo (10 min)
• “Bag of Words” Overview (10 min)
• Text Preprocessing & Visualizations using R Studio (30 min)
• 10 minute break

Hour 2: Applications
• Federalist Papers (History/Political Science) – 30 min
  • Naïve Bayes Classifier using Word Occurrences
  • K-Nearest Neighbor Classifier using Topic Modeling
• Federal Reserve Beige Book (Economics) – 30 min
  • Lexicon-based Sentiment Analysis

Page 3: Text Mining with R for Social Science Research

Objective
• Learn the basics of text preprocessing using R.
• Learn the “bag of words” approach to text analytics (tokenizing, cleaning, word clouds, associations, visualizations).
• Run three text mining applications for social sciences: Text Classification (Naïve Bayes & K-Nearest Neighbors), Topic Modeling (LDA) and Lexicon-based Sentiment Analysis.
  • The Federalist Papers (Poli/Hist) and the Fed Reserve Beige Book (Econ)
• Learn resources (papers, textbooks, lecture notes, blogs) to encourage further research in text mining and natural language processing (NLP).

Page 4: Text Mining with R for Social Science Research

Why analyze text?
• Growing
• Interesting
• Untapped

Page 5: Text Mining with R for Social Science Research

Big Data: Internetlivestats.com

Page 6: Text Mining with R for Social Science Research

Language Technology

Mostly solved:
• Spam detection (classification): “Let’s go to Agra!” ✓ vs. “Buy V1AGRA …” ✗
• Part-of-speech (POS) tagging: Colorless/ADJ green/ADJ ideas/NOUN sleep/VERB furiously/ADV
• Named entity recognition (NER): Einstein (PERSON) met with UN (ORG) officials in Princeton (LOC)

Making good progress:
• Sentiment analysis: “Best roast chicken in San Francisco!” vs. “The waiter ignored us for 20 minutes.”
• Coreference resolution: “Carter told Mubarak he shouldn’t run again.”
• Word sense disambiguation (WSD): “I need new batteries for my mouse.”
• Parsing: “I can see Alcatraz from the window!”
• Machine translation (MT): “第13届上海国际电影节开幕…” → “The 13th Shanghai International Film Festival…”
• Information extraction (IE): “You’re invited to our dinner party, Friday May 27 at 8:30” → Party, May 27, add

Still really hard:
• Question answering (QA): “Q. How effective is ibuprofen in reducing fever in patients with acute febrile illness?”
• Paraphrase: “XYZ acquired ABC yesterday” vs. “ABC has been taken over by XYZ”
• Summarization: “The Dow Jones is up”, “The S&P500 jumped”, “Housing prices rose” → “Economy is good”
• Dialog: “Where is Citizen Kane playing in SF?” “Castro Theatre at 7:30. Do you want a ticket?”

Source: Dan Jurafsky

Page 7: Text Mining with R for Social Science Research

Why else is text mining difficult?

• Non-standard English: “Great job @justinbieber! Were SOO PROUD of what youve accomplished! U taught us 2 #neversaynever & you yourself should never give up either♥”
• Segmentation issues: the New York-New Haven Railroad
• Idioms: dark horse, get cold feet, lose face, throw in the towel
• Neologisms: unfriend, retweet, bromance
• Tricky entity names: “Where is A Bug’s Life playing …”, “Let It Be was recorded …”, “… a mutation on the for gene …”
• Sarcasm: A: “I love Justin Bieber. Do you like him too?” B: “Yeah. Sure. I absolutely love him.”

Source: Dan Jurafsky (modified)

Page 8: Text Mining with R for Social Science Research

AlchemyAPI Example
• Go to http://www.alchemyapi.com/
• Click on the homepage
• As an introduction, copy/paste Federalist Paper #10: https://www.congress.gov/resources/display/content/The+Federalist+Papers#TheFederalistPapers-10
• Click and explore!

Page 9: Text Mining with R for Social Science Research

Federalist Papers & Text Classification

1:10pm

Page 10: Text Mining with R for Social Science Research

Alexander Hamilton

Page 11: Text Mining with R for Social Science Research

Genius.com’s “Non-Stop” Lyrics

Page 12: Text Mining with R for Social Science Research

Federalist Paper setup
• Not so true story, bro (about how many papers each wrote)
• Reality: the authorship of twelve papers is disputed
  • Hamilton claimed authorship before he was killed; Madison disputed those claims eight years later.
  • Adair (1944), Mosteller & Wallace (1963), Fung (2003), Collins et al. (2004)
• Three tasks:
  • Pre-process and explore the data (word cloud, associations, etc.)
  • Naïve Bayes classification to predict the author of the 12 disputed papers based on word counts.
  • Topic modeling to identify key themes, and k-nearest neighbors to predict the author based on the papers’ topics.

Page 13: Text Mining with R for Social Science Research

Basic Text Terminology

Corpus

Document

Term

Page 14: Text Mining with R for Social Science Research

“Bag of Words” Approach

• Simplest way to quantify text
• Counts how often each term occurs in each document
  • Document-Term Matrix
• Ignores word order
  • N-grams (uni-, bi-, tri-, etc.)
• Good at classification
  • Like a spam filter
• Bad at semantic meaning

Source: Chris Manning
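
To make the document-term matrix concrete, here is a minimal sketch in plain Python (the workshop's own scripts are in R; this only illustrates the idea). The three toy documents are invented for the example and are not taken from the workshop data.

```python
from collections import Counter

# Toy corpus: each string is one document (invented for illustration).
docs = [
    "the people of the union",
    "the union of the states",
    "power of the people",
]

# Tokenize on whitespace and count terms per document.
counts = [Counter(doc.split()) for doc in docs]

# The vocabulary is the sorted set of all terms in the corpus.
vocab = sorted(set(term for c in counts for term in c))

# Document-term matrix: one row per document, one column per term.
dtm = [[c[term] for term in vocab] for c in counts]

print(vocab)  # ['of', 'people', 'power', 'states', 'the', 'union']
for row in dtm:
    print(row)

# Word order is ignored: reordering a document leaves its counts unchanged.
same_bag = Counter("union the of people the".split()) == counts[0]
print(same_bag)  # True
```

Each row is one document's "bag": only the counts survive, which is why the approach works for classification but loses meaning that depends on word order.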

Page 15: Text Mining with R for Social Science Research

Preprocessing

• Tokenization
• Cleaning: Lower case, white space, punctuation
• Stemming, Lemmatization and/or Collocations
• Filter: remove stop words

Tokenize Clean Stem Filter

Then a hurricane came, and devastation reigned

then a hurricane came and devastation reigned

then a hurricane came and devastation reigned

then a hurricane came and devastation reigned
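
The four steps can be sketched in a few lines of plain Python (the workshop itself does this in R). The stop-word list and the `toy_stem` helper below are assumptions made up for this illustration; in particular, `toy_stem` is a crude suffix stripper and not the stemmer used in the workshop.

```python
import re

# A tiny stop-word list, invented for this example; real lists are much longer.
STOP_WORDS = {"a", "and", "the", "then"}

def toy_stem(token):
    """Crude suffix stripper for illustration only; NOT a real stemmer."""
    for suffix in ("ation", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = text.split()                                         # tokenize on whitespace
    cleaned = [re.sub(r"[^a-z]", "", t.lower()) for t in tokens]  # lower case, drop punctuation
    cleaned = [t for t in cleaned if t]                           # drop tokens that became empty
    stemmed = [toy_stem(t) for t in cleaned]                      # stem
    return [t for t in stemmed if t not in STOP_WORDS]            # filter stop words

result = preprocess("Then a hurricane came, and devastation reigned")
print(result)  # ['hurricane', 'came', 'devast', 'reign']
```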

Page 16: Text Mining with R for Social Science Research

Part 1: R Studio, Working Directory & R Packages
• Open R Studio and FederalistPapers.R (see GitHub site)

1:20pm

Code Lines: 1-49

Page 17: Text Mining with R for Social Science Research

Part 2: Load csv (text) file & view

Code Lines: 50-79

Page 18: Text Mining with R for Social Science Research

Part 3a: Pre-processing

Federalist Paper 1: Before

Federalist Paper 1: After

Code Lines: 71-88

Page 19: Text Mining with R for Social Science Research

Part 3b: Additional Pre-processing

Federalist Paper 1: After

Code Lines: 89-104

Page 20: Text Mining with R for Social Science Research

Part 4a: Document-Term Matrices

Code Lines: 142-149

Page 21: Text Mining with R for Social Science Research

Part 4b: Word Cloud

Code Lines: 151-165

1:30pm

Page 22: Text Mining with R for Social Science Research

Part 4c: Term Frequencies

Code Lines: 167-171

Page 23: Text Mining with R for Social Science Research

Part 4d: Word Associations

Code Lines: 173-188

Page 24: Text Mining with R for Social Science Research

Part 4e: Word Clustering: Hierarchical

Code Lines: 189-201

Page 25: Text Mining with R for Social Science Research

Part 4f: K-Means Word Clustering

Code Lines: 202-207

Page 26: Text Mining with R for Social Science Research

Part 5: Redo with Stemming, Bigrams and additional stop words

Uncomment (CTRL + SHIFT + C) and run lines 107-139

Code Lines: 107-139, then rerun lines 141-206
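
For readers unfamiliar with bigrams, the sketch below shows how n-grams are built from a token list. It is plain Python for illustration and not the workshop's R code; the `ngrams` helper and the example phrase are made up for this purpose.

```python
def ngrams(tokens, n):
    """Return all n-grams, in order, as space-joined strings."""
    return [" ".join(tokens[i : i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the house of representatives".split()

unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # ['the', 'house', 'of', 'representatives']
print(bigrams)   # ['the house', 'house of', 'of representatives']
```

Counting bigrams instead of single words keeps a little word order, so a phrase such as "house of" becomes its own column in the document-term matrix.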

Page 27: Text Mining with R for Social Science Research

Classification Models (Overview)
• Classification models predict class labels
  • Class labels = categories
  • For example, binary (yes or no), ordinal (high, medium, low) or nominal (dog, cat, kangaroo)
• Classification models are a type of supervised learning, as the class labels (“y variables”) are known (observed).
• Determining the author of the disputed Federalist papers is a binary classification problem, as each disputed paper was written by one of two authors: Hamilton or Madison.

1:50pm - 2pm

Page 28: Text Mining with R for Social Science Research

Types of Classification Models
• There are many different models (algorithms) that can be used for classification problems.
  • Examples: Logistic Regression, Decision Tree, Support Vector Machine, Neural Networks
• We are going to use Naïve Bayes and k-nearest neighbors.
• We will use different feature variables (X variables)
  • Naïve Bayes = Word Presence (1/0) as X Variables
  • k-Nearest Neighbors = Topic Probabilities as X Variables

Page 29: Text Mining with R for Social Science Research

Naïve Bayes
• Naïve Bayes is an algorithm based on Bayes’ theorem (conditional probability)
• Updates the probability of the predicted class (e.g. who is the author) based on words found in the class (author’s papers).
  • Example in spam filters: the word “Viagra” increases the odds an email is spam
• Pro: Simple, can handle many features (x variables)
• Con: Difficult to interpret, subject to assumptions (e.g. independence of x variables)
• See these slides for a deeper overview of Naïve Bayes and its assumptions.
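
As a rough illustration of the idea, the sketch below fits a Naïve Bayes classifier on word presence (1/0) features in plain Python. It is an assumption-laden toy, not the workshop's R code: the four training "papers", the vocabulary, the function names and the add-one smoothing are all choices made for this example.

```python
import math

def train_nb(docs, labels, vocab):
    """Fit a Bernoulli Naive Bayes model on word presence (1/0) features."""
    model = {}
    for cls in set(labels):
        cls_docs = [set(d) for d, y in zip(docs, labels) if y == cls]
        n = len(cls_docs)
        prior = n / len(docs)
        # P(word present | class), with add-one smoothing to avoid zeros.
        p_word = {w: (sum(w in d for d in cls_docs) + 1) / (n + 2) for w in vocab}
        model[cls] = (prior, p_word)
    return model

def predict_nb(model, doc):
    """Return the class with the highest log posterior for a tokenized document."""
    present = set(doc)
    scores = {}
    for cls, (prior, p_word) in model.items():
        score = math.log(prior)
        for w, p in p_word.items():
            # Each word contributes independently: the "naive" assumption.
            score += math.log(p if w in present else 1 - p)
        scores[cls] = score
    return max(scores, key=scores.get)

# Toy training data, invented for illustration (not taken from the papers).
docs = [
    ["upon", "the", "whole"],
    ["upon", "this", "point"],
    ["whilst", "the", "people"],
    ["whilst", "this", "power"],
]
labels = ["Hamilton", "Hamilton", "Madison", "Madison"]
vocab = ["upon", "whilst", "the", "this"]

model = train_nb(docs, labels, vocab)
print(predict_nb(model, ["whilst", "the", "states"]))  # Madison
print(predict_nb(model, ["upon", "this"]))             # Hamilton
```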

Page 30: Text Mining with R for Social Science Research

Basics of Predictive Modeling
• In predictive modeling, datasets are divided into training and test (sometimes called validation) sets
• Federalist Papers:
  • Training Dataset = 65 papers* with known author (known label)
  • Test Dataset = 12 papers with disputed author (missing label)
  *Excludes papers written by John Jay (five) and written by both Madison & Hamilton (three)
• Our objective is to build a model that successfully predicts the training dataset authors (accuracy).
• After building the model, apply it to the test dataset to predict the authors for the 12 disputed papers.
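
The split and the accuracy measure can be sketched as follows, in plain Python rather than the workshop's R. The paper ids, labels and predictions are placeholders invented for the example; `predicted` stands in for the output of a fitted model.

```python
def accuracy(predicted, actual):
    """Share of predictions that match the known labels."""
    matches = sum(p == a for p, a in zip(predicted, actual))
    return matches / len(actual)

# Illustrative records: (paper id, author); None marks a disputed paper.
papers = [("A", "Hamilton"), ("B", "Madison"), ("C", None), ("D", "Hamilton"), ("E", None)]

# Papers with a known author form the training set; the rest form the test set.
train = [(pid, author) for pid, author in papers if author is not None]
test = [pid for pid, author in papers if author is None]

# Placeholder predictions on the training set, standing in for a fitted model.
predicted = ["Hamilton", "Madison", "Madison"]
actual = [author for _, author in train]

print(len(train), len(test))  # 3 2
print(accuracy(predicted, actual))
```

Accuracy can only be computed where the label is known, which is why it is reported on the training papers and not on the disputed ones.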

Page 31: Text Mining with R for Social Science Research

Part 6: Naïve Bayes Pre-Processing

Code Lines: 208-219

Page 32: Text Mining with R for Social Science Research

Conditional Probabilities

Update

Code Lines: 231-241

Page 33: Text Mining with R for Social Science Research

Odds Ratios

Code Lines: 242-248

Page 34: Text Mining with R for Social Science Research

Train Naïve Bayes and Predict Training Dataset

Code Lines: 250-273

Page 35: Text Mining with R for Social Science Research

Predict Test (Disputed) Dataset

Code Lines: 275-290

Page 36: Text Mining with R for Social Science Research

Part 7: Running Topic Modeling

This will take about 4 mins, depending on the computer you run it on

Code Lines: 295-308

Page 37: Text Mining with R for Social Science Research

Topic Modeling Overview

Source: David Blei (link to article)

Page 38: Text Mining with R for Social Science Research

Create LDAVis Tool & Label Topics

Code Lines: 295-308

Page 39: Text Mining with R for Social Science Research

LDAVis Package to Visualize Topics

Open the index.html file in the “Federalist” folder in your working directory with Firefox; it is not supported by Chrome or IE.

Page 40: Text Mining with R for Social Science Research

Topic Clustering Heatmap via R Shiny

Code Lines: 321-349

Page 41: Text Mining with R for Social Science Research

Topic Clustering (Plot)

Page 42: Text Mining with R for Social Science Research

Nearest Neighbors

Code Lines: 350-370
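
The nearest-neighbor step predicts an author from a paper's topic probabilities. A minimal sketch of that idea in plain Python follows (the workshop's script is in R). The topic probabilities, the choice of Euclidean distance and k = 3 are assumptions made for the illustration.

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Predict a label by majority vote among the k nearest training points."""
    # Sort training examples by Euclidean distance to the query vector.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

# Hypothetical topic probabilities (3 topics) for papers with known authors.
train = [
    ([0.7, 0.2, 0.1], "Hamilton"),
    ([0.6, 0.3, 0.1], "Hamilton"),
    ([0.1, 0.2, 0.7], "Madison"),
    ([0.2, 0.2, 0.6], "Madison"),
    ([0.1, 0.3, 0.6], "Madison"),
]

# A disputed paper is assigned to the author of its closest neighbors.
print(knn_predict(train, [0.15, 0.25, 0.60], k=3))  # Madison
print(knn_predict(train, [0.65, 0.25, 0.10], k=3))  # Hamilton
```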

Page 43: Text Mining with R for Social Science Research

Predictions

• Naïve Bayes predicts 9 of the 12 papers as written by Madison.

• K-NN predicts only 4 of the 12 papers as written by Madison.

• Why? How stable are these results?

Code Lines: 371-373

Page 44: Text Mining with R for Social Science Research

Beige Book & Sentiment Analysis

2:30pm

Page 45: Text Mining with R for Social Science Research

Sentiment Analysis
• Two main types of textual information
  • Facts and Opinions
• Search engines are optimized for facts.
• Sentiment Analysis is a growing attempt (not completely solved) to optimize the discovery of opinions.
• Opinion Mining or Sentiment Analysis is an attempt to recognize the opinion or sentiment that a person holds toward an object.

Source: Richard Heimann

Page 46: Text Mining with R for Social Science Research

Where do we find sentiment?
• Movies / Books: Are the reviews of this movie/book positive or negative?
• Product Sales: What is thought of the new iPhone?
• Public Sentiment: How do consumers feel about the economy? How is consumer sentiment affecting sales by sector?
• Politics: How are voters polarized, if at all, around a candidate or policy?
• Prediction: Stock Prices, Election Outcomes, Market Trends, Product Sales

Source: Richard Heimann

Page 47: Text Mining with R for Social Science Research

Three Types of Sentiment Analysis
• Dictionary-Based Sentiment Analysis
  • i.e. Is an attitude toward an object positive or negative?
  • Build a dictionary of positive / negative words and count net occurrence
• Supervised Learning for Sentiment Analysis
  • i.e. Given data we have seen in the past, can we predict class assignment for our polarity measure (positive/neutral/negative)?
  • e.g. Naive Bayes, MaxEnt, SVM
• Unsupervised Sentiment Analysis
  • i.e. No dictionaries. No labeled data. No training algorithms. Instead, scale words (often bi-grams) and users on a single dimension.
  • e.g. latent variable models – Item Response Theory (IRT)

Source: Richard Heimann

Page 48: Text Mining with R for Social Science Research

Federal Reserve Beige Book
• The Beige Book is a report published by the United States Federal Reserve Board (FRB) eight times a year.
• The Beige Book has been in publication since 1985 and is now published online.
• The report is published by each of the 12 Federal Reserve Bank districts.
• The content is rather anecdotal. The report draws on interviews with key business contacts, economists, market experts, and others to get their opinion about the economy.
• The data used in this book can be found on GitHub, as well as the Python code for all the scraping and parsing.

Source: Richard Heimann

Page 49: Text Mining with R for Social Science Research

Beige Book Case Study: Initial Steps• Step 1: Download SentimentBeigeBook.zip from https://github.com/wesslen/BeigeBookSentimentAnalysis

• Step 2: Save into a local directory. Open R Studio

• Step 3: Open “sentiment_analysis.R” and “sentiment.R”

Page 50: Text Mining with R for Social Science Research

Step 1: Working Directory
• Run “sentiment.R”. This counts the net number of positive minus negative words in the document given the sentiment (lexicon) dictionary. It will be used later on.

• Set working directory based on where you downloaded the zip file contents.

• Note: For Windows, write the path as “C:/Directory/Folder/” or “C:\\Directory\\Folder\\”, since R treats a single backslash as an escape character.
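
The net-count scoring described above can be sketched in plain Python as follows. This is not the contents of “sentiment.R”; the two word sets are a tiny made-up lexicon, whereas the workshop loads its own dictionary files, and the sentences are invented examples.

```python
import re

# Tiny illustrative lexicon; the workshop loads its own dictionary files.
POSITIVE = {"strong", "growth", "improved", "gains"}
NEGATIVE = {"weak", "decline", "slowed", "losses"}

def net_sentiment(text):
    """Count positive words minus negative words in a text."""
    words = re.findall(r"[a-z]+", text.lower())
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return pos - neg

print(net_sentiment("Retail sales improved and hiring showed strong gains."))  # 3
print(net_sentiment("Manufacturing slowed amid weak demand."))                 # -2
```

Raw scores grow with document length, which is one reason a later step normalizes them before comparing reports.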

Page 51: Text Mining with R for Social Science Research

Step 2: Import in Dictionaries

Page 52: Text Mining with R for Social Science Research

Step 3: Import Corpus / Text

Page 53: Text Mining with R for Social Science Research

Step 4: Pre-Processing

Page 54: Text Mining with R for Social Science Research

Step 5: Create corpus & tokenize

Page 55: Text Mining with R for Social Science Research

Step 6: Stemming & Stop Words

Page 56: Text Mining with R for Social Science Research

Step 7: Term-Document Matrix

Page 57: Text Mining with R for Social Science Research

Step 8: Explore Common Words

Page 58: Text Mining with R for Social Science Research

Step 9: Add more pos/neg words

Page 59: Text Mining with R for Social Science Research

Step 10: Word Associations

Page 60: Text Mining with R for Social Science Research

Step 11: Word Cloud

Page 61: Text Mining with R for Social Science Research

Step 12: Sentiment Scoring

First six records of BB.sentiment

Page 62: Text Mining with R for Social Science Research

Step 13: Normalizing Scores

First six records of BB.sentiment (updated)

Page 63: Text Mining with R for Social Science Research

Step 14: Score Histograms

Raw Scored Sentiment

Scaled Scored Sentiment

Page 64: Text Mining with R for Social Science Research

Step 15: Plot Historical Sentiment

Page 65: Text Mining with R for Social Science Research

Step 16: Run beigebookplots.R

Page 66: Text Mining with R for Social Science Research

Concluding Thoughts
• Bag of words approach is a simple text mining framework
  • Works well for exploratory analysis, classification and basic sentiment analysis.
  • Deeper models are needed to identify semantic meaning (e.g. GloVe, recurrent neural networks; see Stanford Deep Learning NLP class materials)
• R is a great tool for simple, visual-based text mining
  • However, R has limitations (scale, functions, etc.)
  • Python (nltk) and Java are better for large-scale, PhD-dissertation research
• Text mining is an iterative process
  • There is not a single model or method that always works – depends on context!
  • If your initial results are vague, enhance pre-processing
    • e.g. remove more stop words, try bi/trigrams, try stemming or lemmas, customize lexicon
• Text mining is an art, not a science.
  • Need domain experience AND algorithms
  • If you’re a social scientist, make friends with a computer scientist (and vice versa).

Page 67: Text Mining with R for Social Science Research

Project Mosaic & Next Workshop
• Project Mosaic offers consulting, workshops and other collaborative research opportunities.
• Upcoming Workshops: https://projectmosaic.uncc.edu/events-list/
• Next month: workshop on Text Mining for Twitter
  • Will include reference to SOPHI, the UNCC Data Science Initiative’s data warehouse, which includes GNIP access to historical Twitter data.
  • If you are planning on attending, please register for Twitter API credentials before the workshop.
  • Follow these instructions (to set up with the R connector): http://www.r-bloggers.com/setting-up-the-twitter-r-package-for-text-analytics/

Page 68: Text Mining with R for Social Science Research

Proprietary Text Mining Tools
• AlchemyAPI
  • Limited free use
• Taste Analytics Signals
  • Two-week free trial
• SAS Enterprise Miner
  • Student version available via UNCC
• SAS Sentiment Analysis
  • Available on some UNCC computers

Hamilton Soundtrack Amazon Reviews

Page 69: Text Mining with R for Social Science Research

Open Source Text Mining Tools
• R tm package
  • Great for simple analysis but difficult for more complex analysis.
• Python nltk package
  • Probably one of the best open source text mining packages
• Python gensim package
  • Another fantastic Python package – focuses on topic modeling
• Mallet
  • Great NLP toolkit but requires background in Java and command line

Page 70: Text Mining with R for Social Science Research

Online Text / NLP Courses
• Introductory / Intermediate:
  • Dan Jurafsky (Stanford), Introductory Text Mining Class
  • Chris Manning / Dan Jurafsky (Stanford), Coursera Natural Language Processing Class
  • ChengXiang Zhai (Univ. of Illinois Urbana-Champaign), Coursera Text Mining & Analytics Course
• Advanced (but way cool, cutting edge stuff):
  • Richard Socher (Stanford), Deep Learning for Natural Language Processing

Page 71: Text Mining with R for Social Science Research

Blogs
• https://www.kaggle.com/c/word2vec-nlp-tutorial/details/part-1-for-beginners-bag-of-words

• http://www.alchemyapi.com/developers/getting-started-guide/twitter-sentiment-analysis

• https://eight2late.wordpress.com/2015/09/29/a-gentle-introduction-to-topic-modeling-using-r/

• http://www.r-bloggers.com/sentiment-analysis-on-donald-trump-using-r-and-tableau/

•Want more? Follow this link for all R “text” blogs on Rbloggers website