R for the Digital Humanities Brian J. Reilly 21 September 2014

R for the Digital Humanities

Brian J. Reilly
21 September 2014

Overview

What is R?

Why use R?

Getting Started with R

A Taste of R Programming (Exploration)

A Taste of Humanities Statistical Analysis

What is R?

R is a programming language
  It is the open-source dialect of S

R is an environment for statistical computing
R is the rating of the new pirate movie

So R you ready for R?

Why use R?

R is FREE!
  vs SPSS (IBM)
  vs STATA

R now has an expanded GUI (RStudio)
R runs on Mac, Windows, Unix
R has an AMAZING community
  FAQs, fora, etc.
  Packages!

R is becoming THE program of choice

Why use R?

R can do anything
  order a pizza over the internet
  text parsing (Perl, Python, php, etc)
  database (Excel)
  data analysis (Excel, SPSS, STATA, etc)

Why not to use R?

R does require programming
And programming means typing
  … and typos
  … and writer’s (coder’s?) block

SPSS does not
STATA is flexible
$

An oft-quoted analogy

When talking about user friendliness of computer software I like the analogy of cars vs. busses: [...] Using this analogy programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed. R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back. R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.

Greg Snow, R-help (May 2006)

Getting Started with R & RStudio

R
http://www.r-project.org

RStudio (Integrated Development Environment)
http://www.rstudio.com

Shiny (Web-based Display)
http://shiny.rstudio.com

Getting Started with R & RStudio

For today, navigate to:
http://home.comcast.net/~brian.j.reilly/Brian_J._Reilly/RDH.html

and download the script.

Open that script. RStudio should launch and we’re ready to go.

Alternatively, open RStudio and then go to the File menu and find and open the downloaded file.

Getting to know RStudio

[Screenshot: the RStudio window, with the console and script panes labeled]

Getting to know RStudio

R is an interpreted language

That means you type your code and run it directly. No need for a compiler.

Usually, this means that the programs run more slowly. You won’t be using R to write the next big game, for example.

Let’s begin!

The traditional beginning. Type into your console what follows the > prompt below:

>print("Hello world.")

Since R is an interpreted language, that was rather underwhelming.
But notice: R finished your parens for you!

R Scripts
(Upper Left Pane in RStudio)
Edit, Save, and Retrieve your Scripts
Did I mention save?
Syntax Highlighting
# indicates a comment (Use before each line!)
control + shift + c to (un)comment
You can run the script by line or highlighted section. On a Mac:
control + return (or enter)

Or click “Run” in the GUI.

R is a Fancy Calculator

At the command-line prompt, type:
>2+2
>2*3
>2^4

Order of operations! What is -3^2?
>-3^2
>(-3)^2
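The order-of-operations question above can be checked directly. This is a small sketch; the variable names are only for illustration and are not from the workshop script.

```r
# In R, ^ binds more tightly than unary minus,
# so -3^2 is read as -(3^2).
without_parens <- -3^2
with_parens <- (-3)^2

print(without_parens)
print(with_parens)
```

When in doubt, add the parentheses: they cost nothing and make the intent explicit.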

R is a Fancy Logic Board

At the command-line prompt, type:
>2==3
>2!=3
>TRUE & FALSE
>TRUE | FALSE
>2<3
>2<=3
>2>=3
>isTRUE((3^4)==(9^2))

R is a Not-So Fancy Calculator

At the command-line prompt, type:
>5!
Error: unexpected '!' in "5!"

The problem is that ! means “not”. We need to specify a function (more on this in a bit):
>factorial(5)

R is a Fancy Calculator

>factorial(5)
>choose(8,3)

>"*"(10,c(2,3,4))

>is.infinite(10^(305:310))
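The last two commands on this slide are less obvious than the first two, so here is a commented sketch of what they appear to illustrate. The variable names are made up for the example.

```r
# Operators are ordinary functions: "*"(a, b) is the same as a * b.
# The single value 10 is multiplied against each element of the vector.
products <- "*"(10, c(2, 3, 4))

# Double-precision numbers overflow to Inf somewhere above 1e308,
# so the powers 10^305 ... 10^308 are finite and 10^309, 10^310 are not.
overflow <- is.infinite(10^(305:310))

print(products)
print(overflow)
```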

Command-Line Notes

> indicates the command-line prompt
If you press enter and do not see “>”, then you are still in your command:
> (2+3+

Press ESC to start over
R does not care about spaces:
> (2+3)
> ( 2 + 3)

Command-Line Notes

R is case sensitive!
UP/DOWN arrow keys save you from retyping
Start typing and press TAB for function names
I don’t know why you would want to, but the semi-colon ; allows you to enter more than one command on a single line:
> (2+3)
> (2+5)
> (2+3);(2+5)

HELP!!!

>help(mean)
>?mean
>help.search("piechart")
>??piechart

R’s community is getting nicer (?)
http://badhessian.org/2013/04/has-r-help-gotten-meaner-over-time-and-what-does-mancur-olson-have-to-say-about-it/

Google (surprisingly effective despite “R”)

HELP!!! (e.g. >?pie)
Description
  Gives a description
Usage
  Displays syntax and possible arguments
Arguments
  What inputs you can/need to enter
Note
References
See Also
Examples

Examples
There is an example function that allows you to have R run the examples for you.
>example(mean)

NB: You should look at the function’s description first, since the example will create some variables. (And you should probably not call your variables x, y, z, etc.) To see what variables, etc., you have loaded, see the Environment pane or type:
>ls()

Use rm() to remove a variable.

Functions
Built in Functions (depends on Package)
>sqrt(64)
>abs(-35)
>round(pi)
>?round
>round(pi,digits=4)
>round(pi,4)
>round(34.5)
>round(35.5)
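The two final round() calls are probably there to surprise you. As ?round explains, R rounds a trailing 5 to the even digit rather than always rounding up. A short sketch, with illustrative variable names:

```r
# Rounding a .5 goes to the nearest EVEN integer.
half_to_even_a <- round(34.5)   # goes down to 34
half_to_even_b <- round(35.5)   # goes up to 36

# The second argument of round() is named digits,
# so these two calls are equivalent.
pi_named <- round(pi, digits = 4)
pi_positional <- round(pi, 4)

print(c(half_to_even_a, half_to_even_b))
print(pi_named)
```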

Functions
R allows you to create your own functions
Use the function function
General form:
function(arg1,arg2,…){operation}

>MyFunction.f <- function(x,y){x+(2*y)}

>MyFunction.f(3,7)
Advanced: R has some help if you need to debug a function, using the debug function
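Here is the same function written out over several lines, plus one variation. Weighted.f and its weight argument are hypothetical additions, shown only to illustrate default argument values.

```r
# The slide's function: returns x plus twice y.
MyFunction.f <- function(x, y) {
  x + (2 * y)
}
result <- MyFunction.f(3, 7)

# Hypothetical variant: the multiplier becomes an argument
# with a default value, so it can be omitted in the call.
Weighted.f <- function(x, y, weight = 2) {
  x + (weight * y)
}
same_as_before <- Weighted.f(3, 7)
heavier <- Weighted.f(3, 7, weight = 3)

print(c(result, same_as_before, heavier))
```

To step through a function line by line, call debug(MyFunction.f) before running it and undebug(MyFunction.f) when you are done.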

Functions
Each argument of a function is named:
>?seq

If you follow the order you do not have to name the value for the argument.
>seq(from=3,to=27,by=3)
>seq(3,27,3)
>seq(by=3,to=27,from=3)

Advanced: There are generic functions that look up the appropriate method for you.
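A quick way to convince yourself that the three seq() calls are interchangeable is to store the results and compare them. The variable names below are illustrative.

```r
# Named arguments, in the documented order.
by_name <- seq(from = 3, to = 27, by = 3)

# Positional arguments: the order alone tells R which is which.
by_position <- seq(3, 27, 3)

# Named arguments in a scrambled order: the names do the work.
scrambled <- seq(by = 3, to = 27, from = 3)

print(by_name)
print(identical(by_name, by_position))
print(identical(by_name, scrambled))
```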

Packages
Packages extend R’s functionality.
They are collections of code for new functions, often with help files and even datasets.

Other people out there have saved you a lot of coding, so use them! But if you do, then you should definitely cite them.
How?
>install.packages("packagename")
>library(packagename)
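On the citing side, R ships with a citation() function that prints a suggested reference. A minimal sketch, using the built-in stats package as a stand-in for whatever package you actually rely on:

```r
# Suggested citation for R itself.
r_citation <- citation()
print(r_citation)

# Suggested citation for one package. "stats" comes with R,
# so this runs without installing anything.
stats_citation <- citation("stats")
print(stats_citation)
```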

Packages
Where to get them:
Comprehensive R Archive Network (CRAN)
http://cran.r-project.org

See what others use:
http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/

Variables

Assigning values:
>x<-5 #old-school; <<- (global)
>x=6 #easily confused
>7->x #useful if you forget; ->>

Always points to the variable.
Unless reassigned, value stays the same!
>x - 3
>x
>x <- x-3

Variables

Notice that, when you define a variable, you do not see the result. To see the result at the same time, simply place parens around the whole expression:
>x <- 5
>x
>(x <- 7)
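The two variable slides boil down to one rule: a value changes only when you assign it. A short sketch that walks through this; the names other than x are illustrative.

```r
x <- 5

# Computing with x does not change x ...
difference <- x - 3
x_still <- x

# ... only reassignment does.
x <- x - 3

# Wrapping an assignment in parentheses assigns AND prints.
(z <- 7)

print(c(difference, x_still, x))
```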

Vectors

R is a vector-based language
A vector is a sequence of components
  sequence: so it’s ordered
  components: elements
    all of the same type
      character strings
      logical values
      numeric values

A sequence of different types is a list

Vectors
>Number.v <- c(1,2,3)
>Number.v
>class(Number.v)
>Logic.v <- c(T,T,F,T,F)
>Logic.v
>class(Logic.v)
>Character.v <- c("Red","Yellow","Blue")
>Character.v
>class(Character.v)

Vectors
What if you combined vectors?
>Number2.v <- seq(5,10,.5)
No problem!
>Number3.v <- c(Number.v, Number2.v)
What if they are different classes?
>Mix.v <- c(Number.v, Character.v)
>Mix.v
>class(Mix.v)
R forces the numeric values to be treated like strings. CAVEAT PROGRAMMATOR
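To see why the caveat matters, here is a sketch of what happens after the coercion. The names Failed and Back.v are illustrative; try() is used so that the expected error does not stop the script.

```r
Number.v <- c(1, 2, 3)
Character.v <- c("Red", "Yellow", "Blue")

# Mixing types silently turns the numbers into strings.
Mix.v <- c(Number.v, Character.v)
print(class(Mix.v))

# Arithmetic on the "numbers" now fails ...
Failed <- try(Mix.v[1] + 1, silent = TRUE)
print(inherits(Failed, "try-error"))

# ... until you convert them back explicitly.
Back.v <- as.numeric(Mix.v[1:3])
print(Back.v + 1)
```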

Vectors
R allows you to operate across an entire vector without an explicit loop.
This is really cool!
>Number.v <- 1:10
>Number.v*5

Again notice that this change was not assigned to a vector.
>Number.v

Recycling: Uses and Dangers
R recycles:
>Number.v <- 1:10
>Number2.v <- c(10,20)
>Number.v + Number2.v

R recycled Number2.v five times.
>length(Number.v)
>length(Number2.v)

Notice that the length of Number.v is 10, which is a multiple of the length of Number2.v, 2.
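Recycling is easier to trust once you have spelled it out by hand. In this sketch, rep() repeats the short vector five times, which is what R does implicitly; Sum.v and Explicit.v are illustrative names.

```r
Number.v <- 1:10
Number2.v <- c(10, 20)

# Implicit recycling: 10, 20, 10, 20, ... is lined up against 1:10.
Sum.v <- Number.v + Number2.v

# The same thing with the repetition made explicit.
Explicit.v <- Number.v + rep(Number2.v, times = 5)

print(Sum.v)
print(identical(Sum.v, Explicit.v))
```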

Recycling: Uses and Dangers
R will tell you when (and why) you are wrong:
>x <- 1:10
>z <- 1:17
>w <- x+z

Warning message:
In x + z : longer object length is not a multiple of shorter object length
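Note that this is a warning, not an error: R still computes a result. A sketch of how you might inspect both the message and the result; Caught is an illustrative name.

```r
x <- 1:10
z <- 1:17

# Capture the warning text instead of letting it print.
Caught <- tryCatch(x + z, warning = function(cond) conditionMessage(cond))
print(Caught)

# The addition itself still happens: x is recycled (partially)
# to match the 17 elements of z.
w <- suppressWarnings(x + z)
print(length(w))
print(w[11])   # x starts over here: x[1] + z[11]
```

If you did not intend the recycling, treat this warning as a bug report about your data, not as noise.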

Vectors
Since vectors are ordered you can call up:
>Character.v[2]

You can also call up more than one:
>Character.v[2:3]

And not necessarily sequentially:
>Character.v[c(1,3)]
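The three forms above, collected into one runnable sketch. The last two lines go slightly beyond the slide: negative and logical indexes are standard R, but they are additions here, and all the result names are illustrative.

```r
Character.v <- c("Red", "Yellow", "Blue")

second <- Character.v[2]             # R counts from 1, not 0
two_three <- Character.v[2:3]
first_third <- Character.v[c(1, 3)]

# Addition: a negative index drops that position.
not_first <- Character.v[-1]

# Addition: a logical vector keeps the positions that are TRUE.
long_names <- Character.v[nchar(Character.v) > 3]

print(second)
print(first_third)
print(long_names)
```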

DataFrames
A DataFrame is your basic spreadsheet.
It is a list of vectors of equal length.
The vectors do not have to be of the same type.
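A minimal sketch of that definition, built by hand. The data frame and its contents are invented toy values, not data from the workshop.

```r
# Three vectors of equal length, each of a different type.
# (Toy values, made up for illustration.)
Toy.df <- data.frame(
  Word = c("the", "of", "and"),
  Count = c(12, 7, 5),
  Function.Word = c(TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

print(Toy.df)
print(is.list(Toy.df))        # a data frame really is a list
print(sapply(Toy.df, class))  # one class per column
print(Toy.df$Count)           # pull out one column as a vector
print(Toy.df[2, ])            # pull out one row
```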

Loading Data

Working Directory
Menu: Session/Set Working Directory …
Command-line:
>getwd()
>setwd("~/Path/Folder")
(R uses forward slashes for all OSs.)

Import Data
Menu: Tools/Import Dataset
Command-line: …depends on file type

Loading Data

.CSV
>DataFrameName <- read.csv("~/Path/File.csv")

You can also save/export from the command line:
>write.csv(Data,file="~/Path/File.csv")
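To try both commands without touching your own files, you can round-trip a small data frame through a temporary file. Everything here (the toy data, the names Toy.df, csv_path, Loaded.df) is invented for the sketch.

```r
# Toy data, made up for illustration.
Toy.df <- data.frame(
  Word = c("the", "of", "and"),
  Count = c(12, 7, 5),
  stringsAsFactors = FALSE
)

# Write to a temporary file. row.names = FALSE stops R from
# adding an extra column of row numbers to the file.
csv_path <- tempfile(fileext = ".csv")
write.csv(Toy.df, file = csv_path, row.names = FALSE)

# Read it back and check what came out.
Loaded.df <- read.csv(csv_path, stringsAsFactors = FALSE)
print(class(Loaded.df))
print(Loaded.df)

# Tidy up the temporary file.
unlink(csv_path)
```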

Loading Data

.R (or .r ; also .RData or .rda)
or, using the foreign package:
  STATA: .dta
  SPSS:
  etc…

But Excel sheets …
  Google!
  Need a special package

Loading Data

Direct from the web:
>Data.Name.df <- read.csv("URL")

Always check the class of the result:
>class(Data.Name.df)

Analyzing Data: A Textual Example

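A tour of some R code
Does Chrétien de Troyes’s Lancelot follow Zipf’s Law?
What is Zipf’s Law?
  “a law often called Estoup-Zipf” (« une loi souvent nommée Estoup-Zipf »)
  Word Frequency
  P_n ~ 1/n^a (the frequency P_n of the n-th most frequent word falls off as a power of its rank n)
Now you get to use the lovely phrase: Zipfian Distribution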
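Before turning to Lancelot, here is a toy version of the test. If P_n ~ 1/n^a, then log(frequency) plotted against log(rank) should be roughly a straight line with slope -a. This is only a sketch under that assumption: the word list is invented, far too small to say anything about Zipf, and not the method or data used in the workshop script.

```r
# Invented toy "text" (single letters standing in for words).
Words.v <- c("a", "b", "a", "c", "a", "b", "d", "a", "b", "c", "a", "e")

# Count each word and sort from most to least frequent.
Freq.t <- sort(table(Words.v), decreasing = TRUE)
Count.v <- as.numeric(Freq.t)
Rank.v <- seq_along(Count.v)

# Fit a straight line on the log-log scale.
Fit.lm <- lm(log(Count.v) ~ log(Rank.v))
Slope <- unname(coef(Fit.lm)[2])

print(Freq.t)
print(Slope)   # an estimate of -a
```

On a real text you would also want to look at the picture, for example with plot(log(Rank.v), log(Count.v)).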

Analyzing Data: A Textual Example

Load the text into a new variable
>Lancelot.v <- scan(file="file",
+ what = "character",
+ fileEncoding = "encoding",
+ sep= "\n")
>Lancelot.v <- scan(file="URL",
+ what = "character",
+ fileEncoding = "encoding",
+ sep= "\n")

Analyzing Data: A Textual Example

But first, know your data!

http://www.atilf.fr/dect/

Encoding = ISO-8859-1

This is essential because otherwise the accents are off. (You can try to load it without the encoding and take a look at the text to see.)

Analyzing Data: A Textual Example

The scan function
  The file argument
  The what argument
    (character, logical, numeric, etc)
  The sep argument
    default is white-space
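The effect of the sep argument is easiest to see on a tiny input. This sketch uses scan()'s text argument in place of file so that it runs without any file on disk; the sample string and the variable names are made up.

```r
# Two lines of made-up text.
Sample.txt <- "first line of text\nsecond line"

# sep = "\n": one element per line.
Lines.v <- scan(text = Sample.txt, what = "character",
                sep = "\n", quiet = TRUE)

# Default sep (white-space): one element per word.
Tokens.v <- scan(text = Sample.txt, what = "character",
                 quiet = TRUE)

print(Lines.v)
print(Tokens.v)
```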

Analyzing Data: A Textual Example

The tolower function
The toupper function

Analyzing Data: A Textual Example

The strsplit function
  The character vector argument
  The split argument
    regex (\\W)
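Put together, tolower and strsplit give a very basic tokenizer. The sketch below runs on a made-up English sentence rather than the Lancelot text; the names Line.v, Tokens.v, and Words.v are illustrative.

```r
# Made-up sample line.
Line.v <- "The Knight of the Cart: a romance."

# Lowercase first, so that "The" and "the" count as the same word.
Lower.v <- tolower(Line.v)

# Split on every non-word character. strsplit returns a list
# (one element per input string), so take the first element.
Tokens.v <- strsplit(Lower.v, "\\W")[[1]]
print(Tokens.v)

# Two separators in a row (": ") leave an empty string behind;
# drop those.
Words.v <- Tokens.v[Tokens.v != ""]
print(Words.v)
```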

[Rr]eg([Ee]x(en)?|ular)( [Ee]xpressions)?

Regular Expressions
Regex (pl. Regexen)

Literals:
cat finds “cat,” “catalogue,” “scatology”

Metacharacters:
c.t finds “cat,” “cot,” “facet”

e.g., to find 4-letter words:
\b\w{4}\b
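These three patterns can be tried out with grepl (does the pattern match?) and gregexpr with regmatches (what exactly matched?). A sketch with invented test strings; note that inside an R string every backslash must be doubled.

```r
Words.v <- c("cat", "catalogue", "scatology", "cot", "facet", "dog")

# Literal: matches wherever the letters c-a-t appear in a row.
literal_hits <- grepl("cat", Words.v)

# Metacharacter: the dot stands for any single character.
dot_hits <- grepl("c.t", Words.v)

print(Words.v[literal_hits])
print(Words.v[dot_hits])

# Four-letter words in a made-up sentence.
Sentence <- "some words have four letters, others do not"
Matches <- gregexpr("\\b\\w{4}\\b", Sentence, perl = TRUE)
four_letter <- regmatches(Sentence, Matches)[[1]]
print(four_letter)
```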

[Rr]eg([Ee]x(en)?|ular)( [Ee]xpressions)?
We will use an R function gsub, which uses a regex to find and replace text.
>?gsub

Here is the pattern argument I am using:
<w.+?((?<=lemma=\").+?(?=\")).+?/w>

The replacement argument is also a regex:
\\1

So this:
<w lemma=\"mon1\" type=\"pron/adjposs\">ma</w>

Becomes:
mon1
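A runnable sketch of that substitution. One assumption is added here: because the pattern uses lookarounds such as (?<=...), the call passes perl = TRUE, which R's default regex engine needs for that syntax. The variable names are illustrative.

```r
# The tagged word from the slide, as an R string.
Tagged.v <- "<w lemma=\"mon1\" type=\"pron/adjposs\">ma</w>"

# The slide's pattern: group 1 captures whatever sits between
# lemma=" and the next double quote.
Pattern <- "<w.+?((?<=lemma=\").+?(?=\")).+?/w>"

# Replace each whole <w ...>...</w> element with its lemma.
Lemma.v <- gsub(Pattern, "\\1", Tagged.v, perl = TRUE)
print(Lemma.v)

# The lazy quantifiers (.+?) keep each match inside one element,
# so two elements on one line are handled separately.
Two.v <- paste(Tagged.v, Tagged.v)
print(gsub(Pattern, "\\1", Two.v, perl = TRUE))
```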

Authorship Attribution:
The New Stylometry

Stylometry
  A Long (and not so glorious) History
  Recent Examples
    Robert Galbraith, Cuckoo’s Calling
    Adversarial Stylometry

Authorship Attribution:
Cuckoo’s Calling