R for the Digital Humanities
Brian J. Reilly
21 September 2014
Overview
What is R?
Why use R?
Getting Started with R
A Taste of R Programming (Exploration)
A Taste of Humanities Statistical Analysis
What is R?
R is a programming language
  It is the open-source dialect of S
R is an environment for statistical computing
R is the rating of the new pirate movie
So R you ready for R?
Why use R?
R is FREE!
  vs SPSS (IBM)
  vs STATA
R now has an expanded GUI (RStudio)
R runs on Mac, Windows, Unix
R has an AMAZING community
  FAQs, fora, etc.
  Packages!
R is becoming THE program of choice
Why use R?
R can do anything
  order a pizza over the internet
  text parsing (Perl, Python, PHP, etc.)
  database (Excel)
  data analysis (Excel, SPSS, STATA, etc.)
Why not to use R?
R does require programming
And programming means typing
  … and typos
  … and writer’s (coder’s?) block
SPSS does not
STATA is flexible
$
An oft-quoted analogy
When talking about user friendliness of computer software I like the analogy of cars vs. busses: [...] Using this analogy programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that is not already preprogrammed. R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in the passenger seat, and mountain climbing and spelunking gear in the back. R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning where the bus stops are in SPSS.
Greg Snow, R-help (May 2006)
Getting Started with R & RStudio
R: http://www.r-project.org
RStudio (Integrated Development Environment): http://www.rstudio.com
Shiny (Web-based Display): http://shiny.rstudio.com
Getting Started with R & RStudio
For today, navigate to: http://home.comcast.net/~brian.j.reilly/Brian_J._Reilly/RDH.html
and download the script.
Open that script. RStudio should launch and we’re ready to go.
Alternatively, open RStudio and then go to the File menu and find and open the downloaded file.
Getting to know RStudio
console
script
Getting to know RStudio
R is an interpreted language
That means you type your code and run it directly. No need for a compiler.
Usually, this means that the programs run more slowly. You won’t be using R to write the next big game, for example.
Let’s begin!
The traditional beginning. Type into your console what you see in light blue below:
>print("Hello world.")
Since R is an interpreted language, that was rather underwhelming.
But notice: RStudio finished your parens for you!
R Scripts (Upper Left Pane in RStudio)
Edit, Save, and Retrieve your Scripts
Did I mention save?
Syntax Highlighting
# indicates a comment (Use before each line!)
control + shift + c to (un)comment
You can run the script by line or highlighted section. On a Mac: control + return (or enter)
Or click “Run” in the GUI.
R is a Fancy Calculator
At the command-line prompt, type:
>2+2
>2*3
>2^4
Order of operations! What is -3^2?
>-3^2
>(-3)^2
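A quick sketch of the order-of-operations point, storing both results so they can be compared (the variable names are only illustrative):

```r
# Exponentiation binds more tightly than the unary minus,
# so -3^2 is parsed as -(3^2)
Unparenthesized.n <- -3^2
# With parens, the minus is applied before the exponent
Parenthesized.n <- (-3)^2

print(Unparenthesized.n)
print(Parenthesized.n)
```

When in doubt, add the parens: they cost nothing and make the intent explicit.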
R is a Fancy Logic Board
At the command-line prompt, type:
>2==3
>2!=3
>TRUE & FALSE
>TRUE | FALSE
>2<3
>2<=3
>2>=3
>isTRUE((3^4)==(9^2))
R is a Not-So Fancy Calculator
At the command-line prompt, type:
>5!
Error: unexpected '!' in "5!"
The problem is that ! means “not”. We need to specify a function (more on this in a bit):
>factorial(5)
R is a Fancy Calculator
>factorial(5)
>choose(8,3)
>"*"(10,c(2,3,4))
>is.infinite(10^(305:310))
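The same four calls, annotated as a script. The variable names are illustrative; the point of the last line is that a double-precision number tops out near 1.8e308, so anything larger becomes Inf.

```r
Factorial.n <- factorial(5)        # 5 * 4 * 3 * 2 * 1
Choose.n <- choose(8, 3)           # ways to pick 3 items out of 8

# An operator is just a function: "*"(a, b) is the same as a * b,
# and it works across the whole vector c(2, 3, 4)
Product.v <- "*"(10, c(2, 3, 4))

# 10^305 through 10^308 still fit in a double; 10^309 and 10^310 do not
Overflow.v <- is.infinite(10^(305:310))

print(Factorial.n)
print(Choose.n)
print(Product.v)
print(Overflow.v)
print(.Machine$double.xmax)        # the largest finite double on this machine
```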
Command-Line Notes
> indicates the command-line prompt
If you press enter and do not see “>”, then you are still in your command:
> (2+3+
Press ESC to start over
R does not care about spaces:
> (2+3)
> ( 2 + 3)
Command-Line Notes
R is case sensitive!
UP/DOWN arrow keys save you from retyping
Start typing and press TAB for function names
I don’t know why you would want to, but the semicolon ; allows you to enter more than one command on a single line:
> (2+3)
> (2+5)
> (2+3);(2+5)
HELP!!!
>help(mean)
>?mean
>help.search("piechart")
>??piechart
R’s community is getting nicer (?)
http://badhessian.org/2013/04/has-r-help-gotten-meaner-over-time-and-what-does-mancur-olson-have-to-say-about-it/
Google (surprisingly effective despite “R”)
HELP!!! (e.g. >?pie)
Description
  Gives a description
Usage
  Displays syntax and possible arguments
Arguments
  What inputs you can/need to enter
Note
References
See Also
Examples
Examples
There is an example function that allows you to have R run the examples for you.
>example(mean)
NB: You should look at the function’s description first, since the example will create some variables. (And you should probably not call your variables x, y, z, etc.) To see what variables, etc., you have loaded, see the Environment pane or type:
>ls()
Use rm() to remove a variable.
Functions
Built-in Functions (depends on Package)
>sqrt(64)
>abs(-35)
>round(pi)
>?round
>round(pi,digits=4)
>round(pi,4)
>round(34.5)
>round(35.5)
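The last two calls are there for a reason. As ?round explains, when the value sits exactly on a 5 R follows the IEC 60559 convention and goes to the even digit, so it does not always round up. A small sketch with illustrative variable names:

```r
# Naming the digits argument and passing it by position are equivalent
Named.n <- round(pi, digits = 4)
Positional.n <- round(pi, 4)

# Exactly-half cases go to the even neighbour, not always up
Down.n <- round(34.5)
Up.n <- round(35.5)

print(c(Named.n, Positional.n))
print(c(Down.n, Up.n))
```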
Functions
R allows you to create your own functions
Use the function function
General form:
function(arg1,arg2,…){operation}
>MyFunction.f <- function(x,y){x+(2*y)}
>MyFunction.f(3,7)
Advanced: R has some help if you need to debug a function, using the debug function
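Here is the same function laid out as you might write it in a script, with the debug calls left as comments because they open an interactive browser. Result.n and Named.n are just illustrative names.

```r
# The value of the last expression in the braces is what the function returns
MyFunction.f <- function(x, y) {
  x + (2 * y)
}

Result.n <- MyFunction.f(3, 7)        # x = 3, y = 7, by position
Named.n <- MyFunction.f(y = 7, x = 3) # same call, arguments named

print(Result.n)
print(Named.n)

# To step through the function line by line in an interactive session:
# debug(MyFunction.f)
# MyFunction.f(3, 7)
# undebug(MyFunction.f)
```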
Functions
Each argument of a function is named:
>?seq
If you follow the order you do not have to name the value for the argument.
>seq(from=3,to=27,by=3)
>seq(3,27,3)
>seq(by=3,to=27,from=3)
Advanced: Generic functions look up the appropriate method for you.
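A short check that the three seq calls really are interchangeable (the variable names are illustrative):

```r
Named.v <- seq(from = 3, to = 27, by = 3)     # every argument named
Positional.v <- seq(3, 27, 3)                 # relies on argument order
Shuffled.v <- seq(by = 3, to = 27, from = 3)  # named, so order is free

print(Named.v)
print(identical(Named.v, Positional.v))
print(identical(Named.v, Shuffled.v))
```

Naming the arguments is more typing, but it saves you from having to remember the order.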
Packages
Packages extend R’s functionality.
They are collections of code for new functions, often with help files and even datasets.
Other people out there have saved you a lot of coding, so use them! But if you do, then you should definitely cite them.
How?
>install.packages("packagename")
>library(packagename)
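On the citing part: R's built-in citation function, which the slide does not mention, prints a suggested reference for a package. The sketch below uses the stats package only because it ships with every R installation; swap in the name of whichever package you actually used.

```r
# "stats" is a stand-in here; it is always installed, so nothing is downloaded
PackageName.c <- "stats"

# Install only if the package is missing, then load it
if (!requireNamespace(PackageName.c, quietly = TRUE)) {
  install.packages(PackageName.c)
}
library(PackageName.c, character.only = TRUE)

# Ask R how the package should be cited
Citation.b <- citation(PackageName.c)
print(Citation.b)
```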
Packages
Where to get them:
Comprehensive R Archive Network (CRAN)
http://cran.r-project.org
See what others use:
http://www.r-statistics.com/2013/06/top-100-r-packages-for-2013-jan-may/
Variables
Assigning values:
>x<-5 #old-school; <<- (global)
>x=6 #easily confused
>7->x #useful if you forget; ->>
The arrow always points to the variable.
Unless reassigned, value stays the same!
>x - 3
>x
>x <- x-3
Variables
Notice that, when you define a variable, you do not see the result. To see the result at the same time, simply place parens around the whole expression:
>x <- 5
>x
>(x <- 7)
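The assignment slides in one short script. BeforeReassign.n is an illustrative name used to capture the value of x before it is overwritten.

```r
x <- 5      # leftward arrow
x = 6       # equals sign, same effect at the top level
7 -> x      # rightward arrow, value first

x - 3       # computes and prints a result, but x itself is untouched
BeforeReassign.n <- x

x <- x - 3  # only now does x change
(x <- x)    # parens around an assignment also print the value
```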
Vectors
R is a vector-based language
A vector is a sequence of components
  sequence: so it’s ordered
  components: elements
    all of the same type
      character strings
      logical values
      numeric values
A sequence of different types is a list
Vectors
>Number.v <- c(1,2,3)
>Number.v
>class(Number.v)
>Logic.v <- c(T,T,F,T,F)
>Logic.v
>class(Logic.v)
>Character.v <- c("Red","Yellow","Blue")
>Character.v
>class(Character.v)
Vectors
What if you combined vectors?
>Number2.v <- seq(5,10,.5)
No problem!
>Number3.v <- c(Number.v, Number2.v)
What if they are different classes?
>Mix.v <- c(Number.v, Character.v)
>Mix.v
>class(Mix.v)
R forces the numeric values to be treated like strings. CAVEAT PROGRAMMATOR
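A self-contained version of the mixing experiment. Recovered.v is an illustrative name; it shows one way to get the numbers back if they were coerced by accident.

```r
Number.v <- c(1, 2, 3)
Character.v <- c("Red", "Yellow", "Blue")

# c() needs one type for the whole vector, so the numbers become strings
Mix.v <- c(Number.v, Character.v)
print(Mix.v)
print(class(Mix.v))

# Arithmetic on Mix.v would now fail; convert the numeric part back first
Recovered.v <- as.numeric(Mix.v[1:3])
print(Recovered.v * 2)
```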
Vectors
R allows you to operate across an entire vector without an explicit loop.
This is really cool!
>Number.v <- 1:10
>Number.v*5
Again notice that this change was not assigned to a vector.
>Number.v
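To see what the vectorized form is saving you, here it is next to the explicit loop it replaces (Vectorized.v and Looped.v are illustrative names):

```r
Number.v <- 1:10

# Vectorized: one expression handles every element
Vectorized.v <- Number.v * 5

# The explicit loop that the line above makes unnecessary
Looped.v <- numeric(length(Number.v))
for (i in seq_along(Number.v)) {
  Looped.v[i] <- Number.v[i] * 5
}

print(Vectorized.v)
print(all(Vectorized.v == Looped.v))
```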
Recycling: Uses and Dangers
R recycles:
>Number.v <- 1:10
>Number2.v <- c(10,20)
>Number.v + Number2.v
R recycled Number2.v five times.
>length(Number.v)
>length(Number2.v)
Notice that the length of Number.v is 10, which is a multiple of the length of Number2.v, 2.
Recycling: Uses and Dangers
R will tell you when (and why) you are wrong:
>x <- 1:10
>z <- 1:17
>w <- x+z
Warning message:
In x + z : longer object length is not a multiple of shorter object length
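Both recycling cases in one sketch. rep() spells out what R does silently in the first case, and tryCatch() captures the warning from the second so a script can react to it. All names other than those on the slides are illustrative.

```r
Number.v <- 1:10
Number2.v <- c(10, 20)

# Silent recycling: the short vector is repeated to match the long one
Sum.v <- Number.v + Number2.v
Explicit.v <- Number.v + rep(Number2.v, times = 5)
print(Sum.v)
print(identical(Sum.v, Explicit.v))

# Lengths 10 and 17 do not divide evenly, so R warns
x <- 1:10
z <- 1:17
Caught.c <- tryCatch(x + z, warning = function(w) conditionMessage(w))
print(Caught.c)

# R still returns a result of the longer length despite the warning
w <- suppressWarnings(x + z)
print(length(w))
```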
Vectors
Since vectors are ordered you can call up:
>Character.v[2]
You can also call up more than one:
>Character.v[2:3]
And not necessarily sequentially:
>Character.v[c(1,3)]
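The three indexing calls as a runnable script, with the results stored under illustrative names. Note that R counts from 1, not 0.

```r
Character.v <- c("Red", "Yellow", "Blue")

Second.c <- Character.v[2]              # a single position
LastTwo.v <- Character.v[2:3]           # a run of positions
FirstAndThird.v <- Character.v[c(1, 3)] # any positions, in any order

print(Second.c)
print(LastTwo.v)
print(FirstAndThird.v)
```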
DataFrames
A DataFrame is your basic spreadsheet.
It is a list of vectors of equal length.
The vectors do not have to be of the same type.
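A minimal sketch that builds a data frame from three vectors of different types. The column names and the name Colors.df are made up for illustration.

```r
Number.v <- c(1, 2, 3)
Character.v <- c("Red", "Yellow", "Blue")
Logic.v <- c(TRUE, TRUE, FALSE)

# One column per vector; all three must have the same length
Colors.df <- data.frame(Number = Number.v,
                        Color = Character.v,
                        Flag = Logic.v,
                        stringsAsFactors = FALSE)

print(Colors.df)
print(is.list(Colors.df))      # a data frame really is a list
print(sapply(Colors.df, class)) # each column keeps its own type
print(Colors.df$Color[2])       # a column is an ordinary vector
```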
Loading Data
Working Directory
Menu: Session/Set Working Directory …
Command-line:
>getwd()
>setwd("~/Path/Folder")
(R uses forward slashes for all OSs.)
Import Data
Menu: Tools/Import Dataset
Command-line: …depends on file type
Loading Data
.CSV
>DataFrameName <- read.csv("~/Path/File.csv")
You can also save/export from the command line:
>write.csv(Data,file="~/Path/File.csv")
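A round trip you can run without touching your own files: write a small data frame to a temporary CSV and read it back. The data frame and its column names are made up for illustration, and tempfile() stands in for a real path.

```r
Data.df <- data.frame(Color = c("Red", "Yellow", "Blue"),
                      Number = c(1, 2, 3),
                      stringsAsFactors = FALSE)

# A throwaway path; in practice this would be something like "~/Path/File.csv"
Path.c <- tempfile(fileext = ".csv")

# row.names = FALSE keeps R from writing an extra column of row numbers
write.csv(Data.df, file = Path.c, row.names = FALSE)

Reloaded.df <- read.csv(Path.c, stringsAsFactors = FALSE)
print(class(Reloaded.df))
print(Reloaded.df)
```

Without row.names = FALSE, the reloaded data frame picks up an extra first column holding the old row numbers.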
Loading Data
.R (or .r ; also .RData or .rda)
or, using the foreign package:
STATA: .dta
SPSS:
etc…
But Excel sheets …
Google!
Need a special package
Loading Data
Direct from the web:
>Data.Name.df <- read.csv("URL")
Always check the class of the result:
>class(Data.Name.df)
Analyzing Data: A Textual Example
A tour of some R code
Does Chrétien de Troyes’s Lancelot follow Zipf’s Law?
What is Zipf’s Law?
« une loi souvent nommée Estoup-Zipf » (“a law often called Estoup-Zipf”)
Word Frequency
P_n ~ 1/n^a
Now you get to use the lovely phrase: Zipfian Distribution
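A rough sketch of how one might check P_n ~ 1/n^a in R: rank the words by frequency, then fit a line to log(frequency) against log(rank), whose slope estimates -a. The sentence below is an invented toy text, not the Lancelot, and it is far too short for the estimate to mean anything; this is one plausible approach, not necessarily the one used in the downloaded script.

```r
# Invented toy text standing in for a real corpus
Text.c <- "the knight saw the queen and the queen saw the knight and the dwarf"
Words.v <- strsplit(Text.c, " ")[[1]]

# Count each word and sort from most to least frequent
Freq.t <- sort(table(Words.v), decreasing = TRUE)
Freq.v <- as.numeric(Freq.t)
Rank.v <- seq_along(Freq.v)

# Zipf's law is a straight line on log-log axes: log(P_n) = c - a * log(n)
Fit.lm <- lm(log(Freq.v) ~ log(Rank.v))
Exponent.n <- -unname(coef(Fit.lm)[2])

print(Freq.t)
print(Exponent.n)

# plot(log(Rank.v), log(Freq.v)) would show the same relationship visually
```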
Analyzing Data: A Textual Example
Load the text into a new variable
>Lancelot.v <- scan(file="file",
+ what = "character",
+ fileEncoding = "encoding",
+ sep= "\n")
>Lancelot.v <- scan(file="URL",
+ what = "character",
+ fileEncoding = "encoding",
+ sep= "\n")
Analyzing Data: A Textual Example
But first, know your data!
http://www.atilf.fr/dect/
Encoding = ISO-8859-1
This is essential because otherwise the accents are off. (You can try to load it without the encoding and take a look at the text to see.)
Analyzing Data: A Textual Example
The scan function
The file argument
The what argument
  (character, logical, numeric, etc)
The sep argument
  default is white-space
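To try these arguments without the real file, the sketch below writes two lines to a temporary file as ISO-8859-1 bytes and reads them back with scan. The file contents and the names Path.c and Lines.v are made up for illustration; quiet = TRUE only suppresses the "Read N items" message.

```r
# Build a tiny ISO-8859-1 file by hand: byte 0xe9 is "é" in that encoding
Path.c <- tempfile(fileext = ".txt")
Bytes.r <- c(charToRaw("Lancelot\nChr"), as.raw(0xe9), charToRaw("tien de Troyes\n"))
writeBin(Bytes.r, Path.c)

# sep = "\n" makes each line one element instead of splitting on white-space
Lines.v <- scan(file = Path.c,
                what = "character",
                fileEncoding = "ISO-8859-1",
                sep = "\n",
                quiet = TRUE)

print(Lines.v)
print(class(Lines.v))
```

How the accented line displays depends on your session's locale, which is exactly why it pays to look at the text after loading it.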
Analyzing Data: A Textual Example
The tolower function
The toupper function
Analyzing Data: A Textual Example
The strsplit function
The character vector argument
The split argument
  regex (\\W)
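A small sketch of tolower and strsplit working together on an invented English sentence (ASCII only, so the result does not depend on your locale). Note that strsplit returns a list, and that splitting on \\W leaves empty strings wherever two non-word characters sit side by side.

```r
# Invented sample line
Line.c <- "The knight rode; the knight fought."

Lower.c <- tolower(Line.c)

# strsplit returns a list with one element per input string
Split.l <- strsplit(Lower.c, "\\W")
Raw.v <- Split.l[[1]]
print(Raw.v)

# "; " is two non-word characters in a row, which leaves an empty string
Words.v <- Raw.v[Raw.v != ""]
print(Words.v)
```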
[Rr]eg([Ee]x(en)?|ular)( [Ee]xpressions)?
Regular Expressions
Regex (pl. Regexen)
Literals:
  cat finds “cat,” “catalogue,” “scatology”
Metacharacters:
  c.t finds “cat,” “cot,” “facet”
e.g., to find 4-letter words:
  \b\w{4}\b
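The three patterns tried out in R. Two things to notice: inside an R string every backslash has to be doubled, and the test words "dog" and "ct" plus the sample sentence are additions for illustration. perl = TRUE is passed here as a cautious choice so that \b and \w are handled by the PCRE engine.

```r
# Literal: matches anywhere inside the string
Literal.v <- grepl("cat", c("cat", "catalogue", "scatology", "dog"))
print(Literal.v)

# Metacharacter: the dot stands for exactly one character
Dot.v <- grepl("c.t", c("cat", "cot", "facet", "ct"))
print(Dot.v)

# 4-letter words: \b\w{4}\b becomes "\\b\\w{4}\\b" in R
Sentence.c <- "Some words have four letters"
Matches.l <- gregexpr("\\b\\w{4}\\b", Sentence.c, perl = TRUE)
FourLetter.v <- regmatches(Sentence.c, Matches.l)[[1]]
print(FourLetter.v)
```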
[Rr]eg([Ee]x(en)?|ular)( [Ee]xpressions)?
We will use an R function gsub, which uses a regex to find and replace text.
>?gsub
Here is the pattern argument I am using:
<w.+?((?<=lemma=\").+?(?=\")).+?/w>
The replacement argument is a backreference to the first parenthesized group in the pattern:
\\1
So this:
<w lemma=\"mon1\" type=\"pron/adjposs\">ma</w>
Becomes:
mon1
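The pattern above in a runnable call. One assumption to flag: the pattern uses lookbehind and lookahead, which R's default regex engine does not support, so the sketch passes perl = TRUE; whether the original script does the same is not shown on the slide. The variable names are illustrative.

```r
Tagged.c <- "<w lemma=\"mon1\" type=\"pron/adjposs\">ma</w>"
Pattern.c <- "<w.+?((?<=lemma=\").+?(?=\")).+?/w>"

# perl = TRUE switches to the PCRE engine, which understands (?<= ) and (?= )
Lemma.c <- gsub(Pattern.c, "\\1", Tagged.c, perl = TRUE)
print(Lemma.c)

# gsub replaces every match, so a line with several tags is reduced in one pass
TwoTags.c <- paste(Tagged.c, Tagged.c)
print(gsub(Pattern.c, "\\1", TwoTags.c, perl = TRUE))
```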
Authorship Attribution: The New Stylometry
Stylometry
A Long (and not so glorious) History
Recent Examples
  Robert Galbraith, Cuckoo’s Calling
  Adversarial Stylometry
Authorship Attribution: Cuckoo’s Calling