Introduction to R - staff.pubhealth.ku.dk/~pd/tartu/pdf/Intro.pdf


Using R Basics Data Manipulation

Introduction to R

Peter Dalgaard

Department of Biostatistics
University of Copenhagen

Statistical Practice in Epidemiology, Tartu 2006


Outline

Using R

Basics

Data Manipulation


What is R?

- R is an “environment for statistical computing and graphics”
  - Highly flexible graphics routines
  - Statistical functions (standard tests, modelling)
  - Controlled by a programming language
- In this course we use R exclusively
- The first practical is a workbook exercise designed to help you get started with R
- This lecture is intended to give you the broader picture


Basics of R

- What is R?
- Interacting with R
- Extended user interfaces
- Later: Dealing with R’s workspace


Key Points about R

- Environment built around the programming language R (an Open Source dialect of the S language).
- R is Free Software, and runs on a variety of platforms (I’ll be using Linux. Computer labs run on Windows.)
- Command-line execution based on function calls
- Extensible with user functions
- Workspace containing data and functions
- Graphics devices


Interacting with R

- Command line interface (CLI)
- The basic mode of interaction is “read – evaluate – print”
  - User types an expression at the command line,
  - R evaluates it
  - . . . and prints the result
- Batch variation: read commands from a file
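A minimal sketch of the batch variation, assuming a throwaway script created with tempfile() (the file and its two commands are made up for illustration). The base function source() reads the commands from the file and evaluates them in turn:

```r
# Write two commands to a temporary script file (hypothetical content)
script <- tempfile(fileext = ".R")
writeLines(c("x <- 2 + 2", "x"), script)

# Read and evaluate the file; echo = TRUE shows each command and its result,
# mimicking what you would see when typing at the command line
source(script, echo = TRUE)
```

Without echo = TRUE the commands are still evaluated, but results are not printed automatically.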


Extended Interfaces

- Windows, Macintosh GUI: Fairly simple extensions of the CLI; they mostly offload some tasks to a menu interface and add command recall
- Script editing: The ability to work with multiple lines of R code, save them to a file for later use, etc. A simple script editor is built into the R GUI in recent versions.
- External editor interfaces: TINN-R and R-WinEdt add syntax highlighting. Highly recommended.
- R embedded in a text editor (ESS – Emacs Speaks Statistics). Popular on Unix/Linux systems.


Demo 1

    2+2
    log(10)
    help(log)
    summary(airquality)
    demo(graphics) # pretty pictures...


R packages

- An important new thing in R has been its handling of add-on packages
  - Standard format
  - Easy end-user handling
  - Quality control system (portability, version dependency)
- CRAN – Comprehensive R Archive Network, modelled on CTAN (TeX), CPAN (Perl). Kurt Hornik, Fritz Leisch, TU-Vienna
- Currently (as of 2006) over 500 packages on CRAN.
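As a rough sketch of what end-user handling looks like, assuming the ISwR package as the example (it may or may not be installed on your machine). The check with requireNamespace() is only there so the snippet runs without a network connection:

```r
pkg <- "ISwR"   # example package name; any CRAN package works the same way

if (requireNamespace(pkg, quietly = TRUE)) {
  # Already installed: attach it so its functions and data sets are visible
  library(pkg, character.only = TRUE)
  status <- "attached"
} else {
  # Not installed: install.packages("ISwR") would download it from CRAN
  status <- "not installed"
}
status
```

In interactive use the usual pattern is simply install.packages("ISwR") once, then library(ISwR) in each session.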


Language

- R is a programming language – also on the command line
- Basic structure: Functions acting on objects
- (Functions are also a kind of object, operators a kind of function)
- Print an object by typing its name
- Evaluate an expression by entering it on the command line
- Call a function, giving the arguments in parentheses – possibly empty
- Notice ls vs. ls()
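A short illustration of these points, using only base R (the variable name x is arbitrary):

```r
x <- 2 + 2     # evaluate an expression and store the result in an object
x              # typing the name prints the object
`+`(2, 3)      # operators are functions: this is the same as 2 + 3
ls()           # function call with an empty argument list: lists the workspace
ls             # no parentheses: prints the function object itself
```

The last two lines show the difference between calling a function and merely naming it.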


Objects

- The basic object type is the vector
- Modes: numeric, integer, character, generic (list)
- Operations are vectorized: you can add entire vectors with a + b
- Recycling of objects: If the lengths don’t match, the shorter vector is reused
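A small sketch of vectorized arithmetic and recycling, with made-up numbers:

```r
a <- c(1, 2, 3, 4)
b <- c(10, 20, 30, 40)

a + b            # element by element: 11 22 33 44
a + c(100, 200)  # the length-2 vector is reused as 100, 200, 100, 200
a * 2            # a single number is recycled to the length of a
```

If the longer length is not a multiple of the shorter one, R still recycles but issues a warning.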


Demo 2

    x <- round(rnorm(10,mean=20,sd=5)) # simulate data
    x
    mean(x)
    m <- mean(x)
    x - m # notice recycling
    (x - m)^2
    sum((x - m)^2)
    sqrt(sum((x - m)^2)/9) # divide by n - 1 = 9
    sd(x) # same result as the line above


Smart indexing

- R has several unusual but highly useful indexing mechanisms:
  - a[5] single element
  - a[5:7] several elements
  - a[-6] all except the 6th
  - a[b>200] logical index
  - a["name"] by name
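The indexing forms can be tried on a small named vector. The values and names below are invented for the example, and b is simply a copy of a so that the logical index has something to test:

```r
a <- c(first = 110, second = 250, third = 90, fourth = 310,
       fifth = 180, sixth = 220, seventh = 140)
b <- a

a[5]         # single element (the fifth)
a[5:7]       # several elements
a[-6]        # all except the 6th
a[b > 200]   # logical index: elements where the condition is TRUE
a["third"]   # by name
```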


Lists

- Lists are vectors where the elements can have different types
- Functions often return lists
- lst <- list(A=rnorm(5), B="hello")
- Special indexing:
  - lst$A
  - lst[[1]] first element
  - lst[1] list containing the first element
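A brief sketch of the difference between the indexing forms, plus one example of a function that returns a list. The t.test call on simulated data is only an illustration:

```r
set.seed(1)
lst <- list(A = rnorm(5), B = "hello")

lst$A      # the numeric vector, extracted by name
lst[[1]]   # the same vector, extracted by position
lst[1]     # a list of length 1 that contains the vector

# Many functions return lists: here the result of a t test
tt <- t.test(rnorm(20))
names(tt)    # the components of the returned list
tt$p.value   # extract one component by name
```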


Functions

- logit <- function(p) log(p/(1-p))
- logit(0.5)
- Formal arguments
- Actual arguments
- Positional matching: plot(x,y)
- Keyword matching: t.test(x ~ g, mu=2, alternative="less")
- Partial matching: t.test(x ~ g, mu=2, alt="l")
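The matching rules can be checked directly. This sketch uses simulated data and the one-sample form t.test(x, ...) rather than the formula form shown above, purely to keep it self-contained; the seq calls are an extra illustration:

```r
logit <- function(p) log(p / (1 - p))   # p is the formal argument
logit(0.5)                              # 0.5 is the actual argument

set.seed(1)
x <- rnorm(20, mean = 2)   # simulated data, for illustration only

r1 <- t.test(x, mu = 2, alternative = "less")  # keyword matching
r2 <- t.test(x, mu = 2, alt = "l")             # partial matching
identical(r1$p.value, r2$p.value)              # both calls mean the same

seq(1, 10, 3)                    # positional matching
seq(from = 1, to = 10, by = 3)   # the same call with keywords
```

Partial matching saves typing at the command line, but full names are easier to read in scripts.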


Compound objects

- Attributes (dimensions, dimnames)
- Allows you to define complex data structures
- Matrices, arrays, tables
- Factors (categorical variables)
- Data frames
- Return values from tests, model fits
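A small sketch of how attributes turn a plain vector into a compound object. The row and column names are invented for the example:

```r
v <- 1:6
dim(v) <- c(2, 3)   # adding a dim attribute makes the vector a 2 x 3 matrix
dimnames(v) <- list(c("r1", "r2"), c("a", "b", "c"))
v
attributes(v)       # shows dim and dimnames

f <- factor(c("lo", "hi", "lo"))
attributes(f)       # a factor is integer codes plus levels and a class
```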


Classes, generic functions

- R objects have classes (there are two different class systems, but ignore that for now)
- Functions can behave differently depending on the class of an object
- E.g. summary(x) or print(x) does different things if x is numeric, a factor, or a linear model fit
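To see the generic function summary at work, one can apply it to three kinds of objects built from the airquality data. The particular model (Ozone on Temp) is chosen arbitrarily for illustration:

```r
x <- airquality$Temp                 # numeric vector
f <- factor(airquality$Month)        # factor
fit <- lm(Ozone ~ Temp, data = airquality)   # linear model fit

summary(x)     # quartiles and mean
summary(f)     # counts per level
summary(fit)   # coefficient table, residual standard error, R-squared

class(fit)     # the class that decides which summary method is used
```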


Data Manipulation Functions

- Constructors of simple objects
- Single-column modifications
- Modifying and subsetting data frames


Constructors

- R deals with many kinds of objects besides data sets
- Need to have ways of constructing them from the command line
- We have (briefly) seen the c and list functions
- Notice the naming forms c(boys=1.2, girls=1.1)
- Extracting and setting names with names(x)
- For matrices and arrays, use the (surprise) matrix and array functions. data.frame for data frames.
- It is also fairly common to construct a matrix from its columns using cbind


Demo 3

    x <- c(boys = 1.2, girls = 1.1)
    x
    names(x)
    names(x) <- c("M", "F")
    x
    matrix(1:4,ncol=2)
    cbind(x=0:3,"exp(x)"=exp(0:3))
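The demo does not touch the data.frame and array constructors mentioned earlier. A minimal sketch, with invented column names and values:

```r
# A data frame: columns of equal length, possibly of different types
d <- data.frame(sex = c("M", "F"), ratio = c(1.2, 1.1))
str(d)

# A three-dimensional array filled column by column with 1, ..., 24
arr <- array(1:24, dim = c(2, 3, 4))
dim(arr)
arr[2, 3, 4]   # the last element
```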


The factor Function

- This is typically used when read.table gets it wrong
  - E.g. group codes read as numeric
  - Or read as factors, but with levels in the wrong order (e.g. c("rare", "medium", "well-done") sorted alphabetically.)
- Notice the slightly confusing use of levels and labels arguments.
  - levels are the value codes on input
  - labels are the value codes on output (and become the levels of the resulting factor)
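Both situations can be reproduced with small made-up vectors. The group labels "control" and "treated" are hypothetical:

```r
# Levels in the wrong order: the default is alphabetical
doneness <- c("medium", "rare", "well-done", "rare")
levels(factor(doneness))
f <- factor(doneness, levels = c("rare", "medium", "well-done"))
levels(f)   # now in the intended order

# Group codes read as numeric: levels are the input codes,
# labels are what the codes are called in the resulting factor
codes <- c(1, 2, 2, 1)
g <- factor(codes, levels = 1:2, labels = c("control", "treated"))
g
```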


Demo 4

    aq <- airquality
    aq$Month <- factor(aq$Month, levels=5:9,
                       labels=month.name[5:9])
    aq$Month
    levels(aq$Month) <- month.abb[5:9]
    aq$Month


The cut Function

- The cut function converts a numerical variable into groups according to a set of break points
- Notice that the number of breaks is one more than the number of intervals
- Notice also that the intervals are left-open, right-closed by default (right=FALSE changes that)
- . . . and that the lowest endpoint is not included by default (set include.lowest=TRUE if it bothers you)
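The endpoint rules are easiest to see on a handful of values that sit exactly on the break points. The numbers here are chosen for illustration:

```r
x <- c(10, 11, 12, 14, 16)
breaks <- seq(10, 16, 2)   # 4 break points give 3 intervals

d  <- cut(x, breaks)                         # (10,12] (12,14] (14,16]
il <- cut(x, breaks, include.lowest = TRUE)  # [10,12] (12,14] (14,16]
rf <- cut(x, breaks, right = FALSE)          # [10,12) [12,14) [14,16)

data.frame(x, default = d, include.lowest = il, right.FALSE = rf)
```

With the defaults the value 10 falls outside every interval and becomes NA; with right=FALSE it is the value 16 that is lost.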


Demo 5

    library(ISwR); data(juul)
    age <- subset(juul, age >= 10 & age <= 16)$age
    range(age)
    agegr <- cut(age, seq(10,16,2), right=FALSE,
                 include.lowest=TRUE)
    length(age)
    table(agegr)
    agegr2 <- cut(age, seq(10,16,2), right=FALSE)
    table(agegr2)


Working with Dates

- Dates are usually read as character or factor variables
- Use the as.Date function to convert them to objects of class "Date"
- If data are not in the default format (YYYY-MM-DD) you need to supply a format specification

    > as.Date("11/3-1959",format="%d/%m-%Y")
    [1] "1959-03-11"

- You can calculate differences between Date objects. The result is an object of class "difftime", with a unit of days. You need as.numeric to get the actual number.
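A short sketch of a date difference. The second date is an arbitrary choice for the example:

```r
d1 <- as.Date("11/3-1959", format = "%d/%m-%Y")
d2 <- as.Date("2006-06-01")   # default format, no specification needed

elapsed <- d2 - d1
class(elapsed)                # "difftime"
units(elapsed)                # "days"

days <- as.numeric(elapsed)   # plain number of days
days / 365.25                 # approximate age in years
```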