
Introduction to Text Analysis in R - Environmental Activist Websites

Robert Ackland & Timothy Graham
23 June 2017

First, we need to install some additional packages.

if (!"SnowballC" %in% installed.packages()) install.packages("SnowballC")

if (!"tm" %in% installed.packages()) install.packages("tm")library(tm) #this will also load SnowballC

if (!"lattice" %in% installed.packages()) install.packages("lattice")library(lattice)

if (!"wordcloud" %in% installed.packages()) install.packages("wordcloud")library(wordcloud)

This exercise uses a dataset from [Ackland, R. and M. O’Neil (2011), “Online collective identity: The case of the environmental movement,” Social Networks, 33, 177-190]. The file “nano2seeds_v2.csv” contains website meta keywords for 161 environmental social movement organisations, collected in 2006. Note: not all the websites have meta keywords. The websites are also coded according to SMO ‘type’: Globals (issues of concern include climate change, forest and wildlife preservation, nuclear weapons, and sustainable trade), Toxics (issues include pollutants and environmental justice) and Bios (issues include genetic engineering, organic farming and patenting).

df <- read.csv("http://vosonlab.net/papers/Taiwan_2017/nano2seeds_v2.csv", stringsAsFactors=FALSE)
#df <- read.csv("nano2seeds_v2.csv", stringsAsFactors=FALSE)

Part 1

Remove rows (websites) that do not have meta keywords.

toRemove <- which(df$Meta.keywords=="")

if (length(toRemove) != 0) {
  df <- df[-toRemove,]
}

nrow(df) #81 websites have meta keywords
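As an optional aside (not in the original handout), we can tabulate the remaining websites by SMO type using the Type column that we use again in Part 2:

table(df$Type) #counts of Globals, Toxics and Bios among the websites with meta keywords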

We will work with the character vector of meta keywords.

keywords <- df$Meta.keywords #just for convenience

We convert the character encoding to UTF-8. This avoids errors relating to ‘odd’ characters in the text. This is usually a good idea, but there may be situations when it is not useful, or even detrimental. Note: Mac users may encounter errors/bugs relating to character encoding, and a workaround is to convert to ‘utf-8-mac’:

keywords <- iconv(keywords, to = 'utf-8')
# **MAC USERS ONLY** should use this instead:
#keywords <- iconv(keywords, to = "utf-8-mac")

Now we use the ‘tm’ package to convert the character vector to a VCorpus object (volatile corpus).

myCorpus <- VCorpus(VectorSource(keywords))

Meta keyword text for individual websites can be accessed via the double brackets notation or the ‘dollar sign’ notation for accessing list elements. Let’s have a look at the meta keywords for a particular website.

df$Vertex[3] #http://www.gmwatch.org/
myCorpus[[3]][[1]]
#"GMWatch provides the public with the latest news and comment on genetically modified (GMO) foods and crops Institute of Science in Society"
# another way to access it
myCorpus[[3]]$content

We can perform a number of highly useful transformations of text using the tm_map function (i.e. ‘mapping to the corpus’). Not all of these transformations are useful in every scenario! They should only be used when it makes sense for the data and the analysis at hand.

Note that the text in the provided dataset has already been processed/transformed and so most of the following transformations do not have an effect, but they are useful to know for use with other datasets.

Converting all the text to lowercase:

myCorpus <- tm_map(myCorpus, content_transformer(tolower))

Remove numbers from the text:

myCorpus <- tm_map(myCorpus, removeNumbers)

Remove punctuation from the text:

myCorpus <- tm_map(myCorpus, removePunctuation)

Word stemming is the process of reducing words to their root or base form (see, for example, https://en.wikipedia.org/wiki/Stemming). From the Wikipedia page:

A stemming algorithm reduces the words "fishing", "fished", and "fisher" to the root word "fish". On the other hand, "argue", "argued", "argues", "arguing", and "argus" reduce to the stem "argu" (illustrating the case where the stem is not itself a word or root) but "argument" and "arguments" reduce to the stem "argument".

Note that word stemming can be highly useful, but also highly detrimental! For this exercise we will in fact not use it, and so will comment out the relevant syntax.


#myCorpus <- tm_map(myCorpus, stemDocument, lazy=TRUE)
# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores
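As a side illustration (not part of the exercise), we can see what a stemmer would do by calling the wordStem() function from the SnowballC package directly on a few of the words from the quote above:

library(SnowballC)
wordStem(c("fishing", "fished", "argue", "argues", "arguing", "argument"))
#the default Porter stemmer should reduce "fishing" and "fished" to "fish",
#"argue"/"argues"/"arguing" to "argu", and leave "argument" unchanged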

We can also remove English ‘stop words’ from the text. These are common words (e.g. ‘the’, ‘and’, ‘or’) that we may want to exclude from our analysis. Once again, this is highly useful but also needs to be carefully applied.

myCorpus <- tm_map(myCorpus, removeWords, stopwords("english"), lazy=TRUE)
# use lazy=TRUE argument to avoid warning on some machines with multiple CPU cores

Eliminate unnecessary ‘white space’ from the text. For example, “hello   everyone  my   name is    fred” becomes “hello everyone my name is fred”:

myCorpus <- tm_map(myCorpus, stripWhitespace, lazy=TRUE)

We can observe the difference now by examining website #3 again:

myCorpus[[3]]$content

We could also define our own stop words and transform the text using these. For example, we might think that it is not interesting that an environmental social movement organisation has the word “environment” in its meta keywords.

myStopwords <- c("environment")
myCorpus <- tm_map(myCorpus, removeWords, myStopwords)

Next we create a document-term matrix (DTM) from the myCorpus object. DTMs are a very important concept for text analysis and are highly useful. A DTM can be thought of as a table (i.e. matrix) where the rows are ‘documents’ (i.e. website meta keyword fields in our dataset), and the columns are ‘terms’ (i.e. each unique word found across all the documents in the dataset). The ‘cells’ (i.e. elements) of the matrix indicate how many times term n occurred in document m.

To better understand the concept of a DTM, let’s take a quick digression and look at a simpler corpus than our environmental social movement dataset.

#some test documents
myText <- c("the quick brown furry fox jumped over a second furry brown fox",
            "the sparse brown furry matrix",
            "the quick matrix")

#create the corpus
myVCorpus <- VCorpus(VectorSource(myText))
#create the DTM
myTdm <- DocumentTermMatrix(myVCorpus)
#display the DTM
as.matrix(myTdm)
#produces:
#     Terms
#Docs  brown fox furry jumped matrix over quick second sparse the
#   1      2   2     2      1      0    1     1      1      0   1
#   2      1   0     1      0      1    0     0      0      1   1
#   3      0   0     0      0      1    0     1      0      0   1

Getting back to the environmental SMO dataset, note: we use the control argument to specify that we only want to retain words that are a minimum of 3 characters long, up to a maximum of 20 characters.


dtm <- DocumentTermMatrix(myCorpus, control = list(wordLengths=c(3, 20)))
dtm
inspect(dtm[1:5, 20:30])

With most real-world text datasets we will have a “sparse matrix”, i.e. most of the elements of the matrix are 0; in our dataset, each meta keyword field contains only a small percentage of the ‘vocabulary’ of terms observed across the meta keywords collected from all the websites. What we want to do is remove terms that occur very infrequently, which will leave us with the most ‘important’ terms. We remove sparse terms using the removeSparseTerms function, which removes terms whose sparsity (the proportion of documents in which the term does not appear) exceeds a given threshold.

To better understand the process of removing sparse terms, let’s look at the test dtm again.

#the second argument to the removeSparseTerms() function is the threshold determining which terms are retained
#in the following, only terms that appear in 99% or more of documents are retained
as.matrix(removeSparseTerms(myTdm, .01))
#in the following, only terms that appear in 50% or more of documents are retained
as.matrix(removeSparseTerms(myTdm, .5))

There are 850 terms in our dtm. The following indicates that if we set the threshold for removing sparse terms to 0.95 (so a term has to appear in over 5% of documents), then we’d be left with a dtm containing 52 terms.

removeSparseTerms(dtm, 0.95)

You should use trial and error to establish how many terms to drop from the dtm (note: you may decide not to drop any terms). For our exercise we will set a threshold of 0.98.

dtmSparseRemoved <- removeSparseTerms(dtm, 0.98)
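One way to do this trial and error (a quick sketch, not part of the original handout) is to loop over a few candidate thresholds and count how many terms survive each one:

#count the terms retained at a few candidate sparsity thresholds
for (s in c(0.90, 0.95, 0.98, 0.99)) {
  cat("threshold", s, "->", nTerms(removeSparseTerms(dtm, s)), "terms retained\n")
}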

We can examine term frequencies in our data. We take the column sums of our document-term matrix (implicitly coercing it to a matrix object). This gives us a named numeric vector where the names are the unique terms in our document-term matrix, and the values are the number of times each term occurs across the whole corpus.

freqTerms <- colSums(as.matrix(dtmSparseRemoved))
freqTerms

We order the term frequencies and look at the 5 most frequent terms and then the 5 least frequent terms:

orderTerms <- order(freqTerms, decreasing=TRUE)
#head()/tail() default to 6 elements, so ask for 5 explicitly
freqTerms[head(orderTerms, 5)]
freqTerms[tail(orderTerms, 5)]

Which terms occurred at least 20 times?

findFreqTerms(dtmSparseRemoved, 20)

We can do a basic correlation analysis by looking at the correlations between terms with the findAssocs function. If two words always appear together in the same document then corr = 1. If two terms never appear together then corr = 0. Let’s look at which terms co-occur with the term “genetic”, with a lower correlation limit of 0.5.


findAssocs(dtmSparseRemoved, "genetic", corlimit=0.5)

Next, we can do some text visualisation. First, we can plot our descriptive statistics in various ways. For example, we can use a barchart to visualise the 20 most frequent terms (we will use the lattice package for a nicer bar chart):

png("figures/barchart_frequent_terms.png", width=800, height=700)barchart(freqTerms[orderTerms[1:20]])dev.off()

This results in a bar chart of the 20 most frequent terms (figure not reproduced here).

Part 2

Now we will be creating word clouds, which are a graphical display of the relative frequencies of words/terms within a corpus. There has been a lot of criticism of the use of word clouds - some people argue that bar charts like the one we constructed in Part 1 are a more accurate way of visually displaying text frequency data. However, my opinion is that as long as one knows how to interpret a word cloud, and they are used in the context of descriptive analysis (i.e. not formal testing of hypotheses), then they can be a useful way of quickly understanding the topics/issues that are being discussed or engaged with by online actors (and of communicating these findings to an audience).

Here are some blog pages discussing the merits of word clouds:

• https://onlinejournalismblog.com/2012/01/27/word-cloud-or-bar-chart/

• http://www.niemanlab.org/2011/10/word-clouds-considered-harmful/

• http://dataskeptic.com/epnotes/kill-the-word-cloud.php

• http://www.thrumpledumthrum.com/?p=154

• https://www.r-bloggers.com/building-a-better-word-cloud/

Word Cloud

A word cloud is another way of visually representing frequencies of words in a corpus. In the example below, we will first construct a word cloud for all of the websites in the dataset. Then we will construct a word cloud for only the “bio” websites.

For all websites

First, create character vectors of the meta keywords of each of the different types of websites (Global/Toxic/Bio), by taking a subset of elements from the relevant column of the dataframe. We are also excluding those websites that had no meta keywords.

globalMeta <- df$Meta.keywords[which(df$Type=="Globals" & df$Meta.keywords!="")]
globalMeta <- paste(globalMeta, collapse = " ")

bioMeta <- df$Meta.keywords[which(df$Type=="Bios" & df$Meta.keywords!="")]
bioMeta <- paste(bioMeta, collapse = " ")

toxicMeta <- df$Meta.keywords[which(df$Type=="Toxics" & df$Meta.keywords!="")]
toxicMeta <- paste(toxicMeta, collapse = " ")
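As an aside, the three nearly identical blocks above could be collapsed into a single sapply() call. This is just a compact alternative sketch (metaByType is a hypothetical name, not part of the original handout):

#build the collapsed keyword string for each SMO type in one pass
types <- c(Global="Globals", Bio="Bios", Toxic="Toxics")
metaByType <- sapply(types, function(t)
  paste(df$Meta.keywords[df$Type == t & df$Meta.keywords != ""], collapse = " "))
#metaByType["Global"] is then the same string as globalMeta, and so on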

Now, combine them together into a new dataframe:

df_ALL <- data.frame(group=c("Global","Bio","Toxic"), words=c(globalMeta,bioMeta,toxicMeta))
View(df_ALL)

Now, create a text corpus using a similar approach to Part 1.

# we create a character vector from the "words" column of df_ALL
words <- df_ALL$words

# we will convert the character encoding to UTF-8
# just to be sure there are no odd characters that
# may cause problems later on
words <- iconv(words, to = 'UTF-8')

# ** MAC USERS ONLY **:
#words <- iconv(words, to = 'UTF-8-mac')

# using 'tm' package we convert the character vector to a VCorpus object (volatile corpus)
corp <- VCorpus(VectorSource(words))

## now we do transformations of text using tm_map ('mapping to the corpus')

# eliminate extra whitespace
corp <- tm_map(corp, stripWhitespace)

# convert to all lowercase
corp <- tm_map(corp, content_transformer(tolower))

# perform stemming (not always useful!)
#corp <- tm_map(corp, stemDocument)

# remove numbers (not always useful!)
corp <- tm_map(corp, removeNumbers)

# remove punctuation (not always useful! e.g. text emoticons)
corp <- tm_map(corp, removePunctuation)

# remove stop words (not always useful!)
corp <- tm_map(corp, removeWords, stopwords("english"))

Now we can create the word cloud.

#note: if changing res of png, can't have dimensions in pixels (led to wordclouds with very few words...)
png("figures/word_cloud_enviro_all.png", width=12, height=8, units="in", res=300)
wordcloud(corp, max.words=200, random.order=FALSE)
dev.off()

This results in a word cloud of the meta keywords across all websites (figure not reproduced here).

For “bio” websites

Constructing a word cloud just for the “bio” websites involves a very similar process to the one above, but we start with a character vector of just the meta keywords used by the bio sites.

bioMeta <- iconv(bioMeta, to = 'UTF-8')
# ** MAC USERS ONLY **:
#bioMeta <- iconv(bioMeta, to = 'UTF-8-mac')

bioCorp <- VCorpus(VectorSource(bioMeta))

bioCorp <- tm_map(bioCorp, stripWhitespace)

bioCorp <- tm_map(bioCorp, content_transformer(tolower))


#bioCorp <- tm_map(bioCorp, stemDocument)

bioCorp <- tm_map(bioCorp, removeNumbers)

bioCorp <- tm_map(bioCorp, removePunctuation)

bioCorp <- tm_map(bioCorp, removeWords, stopwords("english"))

We are now ready to create the word cloud.

#let's use a different colour for the text
colorsx <- c("red")

png("figures/word_cloud_enviro_bio.png", width=12, height=8, units="in", res=300)wordcloud(bioCorp,max.words=200,random.order=FALSE,colors=colorsx)dev.off()

This results in a word cloud of the “bio” websites’ meta keywords (figure not reproduced here).

So what if you wanted to create a separate word cloud for each of the groups: bios, globals and toxics? This is where it would make sense to create a function.
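A minimal sketch of such a function is shown below. makeGroupCloud() is a hypothetical helper (not defined in the original handout); it simply wraps the cleaning and plotting steps we used above, and assumes tm and wordcloud are loaded:

#build and save a word cloud for one group's collapsed meta keyword string
#(makeGroupCloud is a hypothetical helper, not part of the original exercise)
makeGroupCloud <- function(text, outfile, colour="black") {
  text <- iconv(text, to = 'UTF-8')
  grpCorp <- VCorpus(VectorSource(text))
  grpCorp <- tm_map(grpCorp, stripWhitespace)
  grpCorp <- tm_map(grpCorp, content_transformer(tolower))
  grpCorp <- tm_map(grpCorp, removeNumbers)
  grpCorp <- tm_map(grpCorp, removePunctuation)
  grpCorp <- tm_map(grpCorp, removeWords, stopwords("english"))
  png(outfile, width=12, height=8, units="in", res=300)
  wordcloud(grpCorp, max.words=200, random.order=FALSE, colors=colour)
  dev.off()
}

#one call per group:
#makeGroupCloud(bioMeta, "figures/word_cloud_bio.png", "red")
#makeGroupCloud(globalMeta, "figures/word_cloud_global.png", "blue")
#makeGroupCloud(toxicMeta, "figures/word_cloud_toxic.png", "green")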

Comparison Cloud

A comparison cloud is used to show the words that are being used by particular types of actors.

In Part 1, we created a document-term matrix (DTM), but here we will create the transpose of this matrix, the term-document matrix (TDM).
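As a quick sanity check (an aside, assuming the toy objects from Part 1 are still in the workspace), we can confirm on the toy corpus that a TDM really is just the transpose of the corresponding DTM:

#the DTM of the toy corpus should equal the transpose of its TDM
all(as.matrix(DocumentTermMatrix(myVCorpus)) == t(as.matrix(TermDocumentMatrix(myVCorpus))))
#should return TRUE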

tdm <- TermDocumentMatrix(corp)
tdm
inspect(tdm[1:10,])

tdm2 <- as.matrix(tdm) #convert to matrix

colnames(tdm2) <- c("Global","Bio","Toxic")
colorsx <- c("blue","red","green")

png("figures/comparison_cloud_enviro.png", width=12, height=8, units="in", res=300)comparison.cloud(tdm2,max.words=200,random.order=FALSE,colors=colorsx)#commonality.cloud(tdm2,random.order=FALSE)dev.off()

This results in a comparison cloud contrasting the terms used by the Global, Bio and Toxic websites (figure not reproduced here).