math580: introduction to r - lancaster universityfearnhea/rintro/r.course.pdf · 2013. 9. 23. ·...

Math580: Introduction to R

Paul Fearnhead

Contents

1 Introduction
2 Getting started (Windows)
3 R as a calculator
    3.1 Scalar calculations
    3.2 Vector calculations
4 Help pages and function arguments
    4.1 Help pages
    4.2 Function arguments
    4.3 More help
5 R Objects (1) - scalars and vectors
    5.1 Scalar objects
    5.2 Vector objects
    5.3 Housekeeping
6 Random variables
    6.1 Simulation
    6.2 Simple statistics
    6.3 Quantiles, distribution functions, and density
7 Data: reading from and writing to the file system
    7.1 save() and load()
    7.2 dump() and source()
    7.3 Importing and exporting data *
8 Graphics (1) - plot
    8.1 Creating a plot
    8.2 Lines and points
9 Functions (1) - introduction
    9.1 A first example function
    9.2 More general “linear” functions
10 Subscripts
    10.1 Slicing more...
    10.2 Don’t bite off more than you can chew...
    10.3 Negative Subscripts *
    10.4 Some more examples *
11 R Objects (3) - matrices and logicals
    11.1 Matrices
        11.1.1 Matrix subscripts
    11.2 Logicals
        11.2.1 Scalar logicals
        11.2.2 Vector logicals
        11.2.3 Logical Subscripts
        11.2.4 True and False as Numbers...
12 Graphics (2) - multiple graphs
    12.1 Multiple graphs
13 Functions (2) - debugging and looping
    13.1 The for loop
14 Functions (3) - decisions
    14.1 The if test
    14.2 The while loop
A Scripts and packages
    A.1 Scripts
    A.2 Packages *
B R Objects (2) - characters, factors, and dataframes
    B.1 Characters
    B.2 Factors
    B.3 Data frames
C Graphics - Legends, outputting graphs
    C.1 Legends
    C.2 Outputting graphs
    C.3 More lines
D R Objects (4) - lists
E Functions - fast looping, local variables, vector decisions
    E.1 Looping versus vectors *
    E.2 Local variables *
    E.3 More complex if tests
    E.4 Vector logical tests *

1 Introduction

R is a programming language and environment for statistical computing and graphics. You will use R in computer labs on many of the courses in this MSc and you will probably use R for your dissertation.

R provides a wide variety of statistical and graphical techniques. One of R’s strengths is the ease with which well-designed publication-quality plots can be produced. Another is its flexibility: if a particular technique is not implemented to your liking (or at all, if it has just been invented!) then since R is a programming language you can simply implement it yourself.

This course aims to introduce you to the R environment, and to R as a tool for calculation, simulation, creating graphics, and programming.

2 Getting started (Windows)

First of all, log into Windows with your username and password.

Access your home directory (H: drive) e.g. via My Computer, and create a folder for this course using Windows Explorer. This is the location where you will save all your files.

Simply find R (not Tinn-R) and click to start it. You should now have a window entitled “RGui” and inside it another window entitled “R Console”. You can type R commands directly at the console window. For example type 5*6 and press return. The following should appear:

> 5*6

[1] 30

The > sign is a prompt - it tells you that R is waiting for your command - you do not need to type it! You can use the up and down arrows to call up previous commands, and the left and right arrows with the backspace and delete keys to edit these. Or you can simply type each line from scratch. Experiment!

* In the RGui, go to the File menu and choose Change dir.... Using the Browse button, navigate to, and select, the directory that you have created to store all your R files.

Keeping notes

You should keep a record of your R work. Open a new R script file by choosing File -> New Script. A new script window will appear in the R GUI. You can open as many separate scripts as you like. Perhaps one for each lab or one for the whole course? In this script window type

5+3

# this does nothing

99*88

(55-10)*2

Nothing should have happened yet in your console window. Click the cursor back on to the third line (99*88) and in the R GUI menu select Edit -> Run line or selection. The calculation should appear in the console window, together with the answer. Now highlight the first three lines in the script window and again choose Edit -> Run line or selection. All the highlighted commands should appear and run in the console window. Any command that starts with # does nothing - use this to write comments to yourself explaining any commands that follow. Use this facility, because you will forget!

Place your cursor on the R Gui symbol that looks a bit like D → D, but squashed. This is a short-cut for Edit -> Run line or selection. Try it out.

Sometimes it is easier to type commands directly into the console window, sometimes it is better to type them into the script file and then run them. Feel free to do both.

Saving your files

You can call your files whatever you like. Save your R files often and make sure you save them to your H: drive! Simply bring your script file window to the top (by clicking on it), and use File -> Save (as).

When you have finished...

Before you leave R, check that you have copied all the relevant commands from the R window into your script file. Save the script file using File -> Save.

To stop running R, either type ‘q()’ at the R console, or simply click the close button at the top right of the R GUI. At this point you will be asked whether you want to save the data from your R session. You can respond “yes”, “no”, or “cancel” to save the data before quitting, quit without saving, or return to the R session. Data which is saved will be available in future R sessions, so (unless your session has actually taken steps backwards...) you will usually wish to click “yes”.

3 R as a calculator

At its most basic level, R can be used as a calculator.

3.1 Scalar calculations

> 1+2+3          Type this

[1] 6          R gives you the answer, labelled with a [1]

> 2*pi          R knows that π = 3.1415926...

[1] 6.283185

Besides the basic arithmetic operators (+, -, *, /, and ^), R contains a number of arithmetic functions. These are:

Function                        Use
sqrt(x)                         Compute the square root of x
abs(x)                          Absolute (positive) value of x
sin(x), cos(x), tan(x)          Trigonometric functions
asin(x), acos(x), atan(x)       Inverse trigonometric functions
sinh(x), cosh(x), tanh(x)       Hyperbolic functions
asinh(x), acosh(x), atanh(x)    Inverse hyperbolic functions
exp(x), log(x)                  Exponential and natural log of x
log10(x)                        Base-10 log of x
gamma(x), lgamma(x)             The gamma function and its log

Exercises

Evaluate the following expressions:

1) $11 + 22 + 33 + 44 + 55 + 66 + 77$

2) $\frac{11 \times 22 - 33 \times 44}{55 \times 66}$

3) $2 \sin^2(\pi/3)$ Hint: how would you do this on a calculator?

4) $\frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}\right)$

Answers: 1) 308, 2) -0.3333333, 3) 1.5, 4) 0.2419707
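One possible way of typing the four exercise expressions is sketched below. The object names ex1 to ex4 are just illustrative labels (storing values in named objects is covered in Section 5); the points to notice are the brackets needed around the numerator and denominator of a fraction, and that sin squared of x is typed as sin(x)^2.

```r
# Each exercise expression typed as an R command
ex1 <- 11 + 22 + 33 + 44 + 55 + 66 + 77
ex2 <- (11*22 - 33*44) / (55*66)   # brackets around numerator and denominator
ex3 <- 2 * sin(pi/3)^2             # sin^2(x) is typed as sin(x)^2
ex4 <- 1/sqrt(2*pi) * exp(-1/2)
print(c(ex1, ex2, ex3, ex4))
```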


3.2 Vector calculations

Vectors in R are analogous to vectors in mathematics (often called arrays by programmers). We use the combine function c() to tell R that we want our values to be represented as a vector. The following tells R about a vector, but does nothing with it!

> c(1,2,4,8,16)

[1] 1 2 4 8 16

Vectors may be combined with scalars and/or with other vectors in arithmetic operations and may also be supplied as arguments to functions.

> c(1,2,4)+2

[1] 3 4 6

> c(1,2,4)*c(3,3,2)

[1] 3 6 8

> sqrt(c(1,2,3,4,5))

[1] 1.000000 1.414214 1.732051 2.000000 2.236068

Because of this, repeated evaluation of a function is much easier in R than in other languages such as C or FORTRAN. For now we will simply use it to produce a graph:

> plot(c(1,2,3,4,5,6,7),log(c(1,2,3,4,5,6,7)))

There are several short cuts to creating vectors:

> rep(3,5)

[1] 3 3 3 3 3

> seq(1,11,by=2)

[1] 1 3 5 7 9 11

> seq(0,10,len=5)

[1] 0.0 2.5 5.0 7.5 10.0

> 1/(1:5)

[1] 1.0000000 0.5000000 0.3333333 0.2500000 0.2000000
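The short cuts above combine naturally with element-wise arithmetic. The sketch below is illustrative only (all object names are made up for the example); it also shows the each argument of rep() and what happens when two vectors of different lengths are added, where the shorter one is recycled.

```r
n <- 1:10
squares <- n^2                       # element-wise: 1 4 9 ... 100
evens <- seq(2, 20, by = 2)          # 2 4 6 ... 20
pattern <- rep(c(0, 1), times = 3)   # 0 1 0 1 0 1
blocks <- rep(c(0, 1), each = 3)     # 0 0 0 1 1 1
short <- c(1, 2, 3, 4) + c(10, 20)   # shorter vector is recycled: 11 22 13 24
print(short)
```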

Exercises

1. Evaluate the square root of all the odd numbers between 1 and 25.

> sqrt(seq(1,25,by=2))

2. Create a vector that repeats 10 times a sequence of integers between 1 and 5.

> myReps <- rep(1:5,10)

3. The function sum() adds the numbers in the vector supplied. For example

> sum(c(1,2,4,9))

[1] 16

Find the sum of the numbers from 1 to 50. Check your answer (again using R) against the formula $S = n(n + 1)/2$. Now find the sum of the reciprocals of the first fifty integers.

> sum(1:50)

[1] 1275

> 50*(50+1)/2

[1] 1275

> sum(1/(1:50))

[1] 4.499205


4 Help pages and function arguments

4.1 Help pages

Every R function has a corresponding help page. To access the help page for function func type ?func or help(func).

> ?atan

> ?seq

4.2 Function arguments

The function rnorm() generates a vector of normal random variables. Look at its help page:

> ?rnorm

A few lines down, under Usage, the following appears

rnorm(n, mean=0, sd=1)

Still further down, under Arguments, the following appears

n: number of observations. If ’length(n) > 1’, the length is

taken to be the number required.

mean: vector of means.

sd: vector of standard deviations.

• The arguments of a function are the values you supply. For example in the call

> rep(3,5)

the arguments are 3 and 5.

• The first piece of text tells us that rnorm() takes 3 arguments: n, mean, and sd.

• The second piece of text tells us what these correspond to (the number of observations, means, and standard deviations) and that mean and sd may be vectors. So

> rnorm(3,c(1,0,5),c(1,2,3))

produces a vector of 3 normal random variables, $N(1, 1)$, $N(0, 2^2)$, and $N(5, 3^2)$.

• If you supply a scalar when a vector is requested, the scalar will just be repeated as many times as is required. So

> rnorm(3,c(1,0,5),3)

produces 3 normal variables: $N(1, 9)$, $N(0, 9)$, and $N(5, 9)$.

• Any argument which in the help pages has an = and then a number takes that number by default if you fail to specify it. So

> rnorm(3,c(1,0,5))

produces 3 normal variables with means 1, 0 and 5, each with the default standard deviation of 1.

• You can specify arguments out of order only if you refer to their names explicitly. So

> rnorm(3,sd=c(1,2,3))

produces 3 normal variables with the default mean of 0 and standard deviations 1, 2 and 3.

• Any argument without a default must be specified. So

> rnorm(mean=2)

produces an error, because n has not been supplied.

• When naming an argument you often need only specify the first few letters of its name. You must specify enough letters so that R can decide, out of all the possible arguments for the function, exactly the one to which you are referring.

If you already know the arguments which a function requires but just cannot remember the order in which to supply them then use the args() function:

> args(rnorm)
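The rules above can be checked directly. In the sketch below, set.seed() fixes the random number generator so that each call draws the same underlying random numbers; the four calls then differ only in how the arguments are supplied. The object names a, b, d and e are arbitrary.

```r
set.seed(1)
a <- rnorm(3, 0, 2)          # positional: n = 3, mean = 0, sd = 2
set.seed(1)
b <- rnorm(3, sd = 2)        # mean takes its default of 0, sd given by name
set.seed(1)
d <- rnorm(sd = 2, n = 3)    # named arguments may appear in any order
set.seed(1)
e <- rnorm(3, s = 2)         # partial matching: s identifies sd uniquely
print(identical(a, b) && identical(a, d) && identical(a, e))
```

All four calls request the same thing, so the comparison in the last line should print TRUE.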

4.3 More help

If you do not know the name of the function you require then you can search on keywords or a phrase using help.search("phrase of interest"). Functions are listed by this command as “package::function”. For more about packages see Section A.2.

You can also browse the online help pages by typing

> help.start()

Exercises

What will the following do? Think about it, decide, and then try it out.

> rnorm(10)

> rnorm(10,s=0.01)

> rgamma(10,s=0.1)

What is the name of the function that plots a pie chart? Try it out!


5 R Objects (1) - scalars and vectors

An object is a thing. In R, objects can store data, and perform functions on data. Besides the data it contains, one of the most important aspects of an object is its name. Always try to give your objects meaningful names, that way you won’t forget what they contain!

5.1 Scalar objects

Type the following to store the value 10 in the scalar object x and retrieve it, overwrite it, and then retrieve the new value:

> x <- 10

> x

> x <- 5

> x

Objects can store the results of operations on other objects:

> y <- x^3 * 4 + 20

> y

> y<-log(x)

> y

The special symbol <- means “gets”. Literally, x gets 10.

Object names in R can only contain alphanumeric characters, ‘.’ and ‘_’, and should begin with an alphabetic character, e.g. t, Temp, and high.5 are all valid, whereas 7up, Barrington-Smythe, and Enormous!Hotel are not. Try storing different values in scalar objects of different names.

So far, you’ve learnt how to store a value in a scalar object. Make sure you have written any examples you have tried out in your script file - they will be useful to refer to later.

5.2 Vector objects

Create a short vector, use it, store it, and then produce a graph:

> myVec <- 0:6

> myVec^2+1

[1] 1 2 5 10 17 26 37

> myNewVec <- myVec^2 -4*myVec +2


> plot(myVec,myNewVec)

Exercise

1. What will the following give? Decide before you try them out and use the help pages if necessary.

> sum(myVec)

> min(myNewVec)

Answers: 21 and -2

2. Create a vector theta of length 100 with values between 0 and 2π. Plot a graph of $\sin(\theta) + \cos^3(2\theta)$ (y-axis) against theta (x-axis).

> theta <- seq(0,2*pi,len=100)

> plot(theta,sin(theta)+(cos(2*theta))^3)

3. Create a vector x of length 200 with values between -2 and 2. Create a vector $y = x + 2x^2 - x^4$ and plot a graph of y vs x. Use the function max() to estimate the maximum value of the function $x + 2x^2 - x^4$. How could you improve the accuracy of this estimate?

> x<-seq(-2,2,len=200)

> y<-x+2*x^2-x^4

> plot(x,y)

> max(y)

[1] 2.055791
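Two possible ways of improving the estimate are sketched below; neither is necessarily the intended answer. The first simply uses a much finer grid. The second uses R's built-in one-dimensional optimiser optimize(), with the search interval (0, 2) chosen by eye from the plot. The helper function f and the other names are illustrative, and writing your own functions is only introduced in Section 9.

```r
# The function being maximised (user-defined functions: see Section 9)
f <- function(x) x + 2*x^2 - x^4

# Option 1: a much finer grid than 200 points
x.fine <- seq(-2, 2, len = 20001)
grid.max <- max(f(x.fine))

# Option 2: numerical optimisation over an interval read off the plot
opt <- optimize(f, interval = c(0, 2), maximum = TRUE)

print(c(grid.max, opt$objective, opt$maximum))
```

Both approaches should agree closely with each other and give a value slightly larger than the 200-point estimate above.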

5.3 Housekeeping

Your R workspace will soon contain a large number of objects. What are they? The following two functions both produce a list of all your objects:

> objects()

> ls()          “list”

Stop your workspace getting too cluttered by deleting any objects which you are sure you will not need again using the function rm() (“remove”):

> rm(myVec)

> ls()


6 Random variables

6.1 Simulation

In many of the courses on this MSc you will need to simulate random variables. R provides various functions for generating random numbers from the common probability distributions. All functions return a vector of length n of realisations from whichever distribution you choose, parameterised by values of your choice. Here are some of the distributions you can simulate from directly:

Function                   Distribution
runif(n,min,max)           Uniform
rnorm(n,mean,sd)           Normal
rt(n,df)                   Student-t
rexp(n,rate)               Exponential
rgamma(n,shape,rate)       Gamma
rpois(n,lambda)            Poisson
rbinom(n,size,prob)        Binomial
rchisq(n,df)               Chi-square
rweibull(n,shape,scale)    Weibull

WARNING: R does not necessarily use the same parameterisation for these probability distributions as you will in your course. Until you get used to using these functions on a regular basis, always check the help pages - they give you the form of the distribution.

The help pages also tell you of any defaults for the parameters using the = sign (Section 4.2).
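As a small illustration of the parameterisation warning, the sketch below checks two common sources of confusion by simulation: rexp() is parameterised by the rate (so the mean is 1/rate), and rnorm() by the standard deviation rather than the variance. The sample size, seed and object names are arbitrary choices for the example.

```r
set.seed(42)
x.exp <- rexp(100000, rate = 0.5)           # rate 0.5, so the mean is 1/0.5 = 2
x.nor <- rnorm(100000, mean = 2, sd = 2)    # sd = 2, so the variance is 4
print(c(mean(x.exp), var(x.nor)))           # should be close to 2 and 4
```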

Exercises

1. What are the default parameters for rnorm? rexp? runif?

Answers: rnorm: µ = 0, σ = 1; rexp: λ = 1; runif: min = 0, max = 1.

2. Simulate a vector x.norm of 1000 independent $N(2, 4)$ variables (recall that 4 is the variance). Plot a histogram of the simulated data using the function hist().

> x.norm<-rnorm(1000,2,2)

> hist(x.norm)

3. Repeat the above (using different variable names!) for an exponential distribution with rate 0.5 and a uniform distribution bounded between 0 and 1.

6.2 Simple statistics

R has a variety of simple functions to obtain basic statistics on a vector of data. These include min(), max(), mean(), sd(), var(), median() and summary(). If it is not obvious what each of these does then use the help pages or simply try them out!

Exercises

1. For each of your three simulated datasets check that the mean and variance are as you would expect. Use the summary() function to write down the first and third quartiles.

Using summary() I obtained: (0.6087,3.3390), (0.6497,2.800) and (0.2423,0.7349).

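2. Let $Z \sim N(0, 1)$. Use simulation to estimate $E[Z^4]$ and $\mathrm{Var}[Z^4]$. Try this several times to get an idea of the repeatability of your results.

mean(rnorm(10000)^4) should give an answer of about 3; var(rnorm(10000)^4) should be roughly 96 (the exact value, since $E[Z^8] = 105$), although this estimate varies considerably from run to run.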
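One way to "try this several times" without retyping is the replicate() function, which evaluates an expression repeatedly and collects the results. This is only a sketch: the number of repetitions (20), the seed and the name est are arbitrary.

```r
set.seed(123)
# 20 independent simulation estimates of E[Z^4], each based on 10000 draws
est <- replicate(20, mean(rnorm(10000)^4))
print(range(est))   # smallest and largest estimate
print(sd(est))      # spread between repeated runs
```

The standard deviation of the repeated estimates gives a direct feel for how far any single run is likely to be from the true value of 3.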

6.3 Quantiles, distribution functions, and density

The first letter "r" (as in rnorm) stands for "random". For each function that simulates a vector of random variables there are three further functions that respectively return the density ("d"), quantiles ("q") and the cumulative distribution function ("p"). You will need all of these in later courses.

Let Z ∼ N(1, 9). To find the value q (quantile) such that P (Z ≤ q) is 0.05 type:

> qnorm(0.05,1,3)

[1] -3.934561

These functions also take vector arguments. For example to find $P(Z \le -3.934561)$, $P(Z \le 0)$ and $P(Z \le 7)$ type:

> pnorm(c(-3.934561,0,7),1,3)

[1] 0.0500000 0.3694413 0.9772499

To find the density of each of 3 observations (1.2, 2.7, 0.8) from a $N(1, 0.5^2)$ distribution type:

> dnorm(c(1.2,2.7,0.8),1,.5)

[1] 0.736540281 0.002464438 0.736540281
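The relationships between these functions can be checked numerically. The sketch below confirms that pnorm() undoes qnorm(), and that dnorm() agrees with the normal density formula $\frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right)$ typed out by hand. The object names are illustrative.

```r
# qnorm and pnorm are inverses of each other
q <- qnorm(0.05, mean = 1, sd = 3)
p <- pnorm(q, mean = 1, sd = 3)      # recovers 0.05

# dnorm agrees with the density formula written out by hand
x <- c(1.2, 2.7, 0.8)
by.hand <- exp(-(x - 1)^2 / (2 * 0.5^2)) / (0.5 * sqrt(2 * pi))
print(p)
print(all.equal(dnorm(x, 1, 0.5), by.hand))
```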

Exercises

1. Using the parameters and distributions from each of your three simulated data sets find the true first and third quartiles of the underlying population and compare these with quantiles from your simulated data.

True quartiles are approximately: (0.651,3.349), (0.575,2.773) and (0.25,0.75).

2. Using the parameters, distributions, and quartile estimates from each of your three simulated data sets find the probability that a random variable (from the correct distribution) would be less than each of the respective quartile estimates.

Values should all be around (0.25,0.75).


7 Data: reading from and writing to the file system

As you will have noticed, any objects you create will be saved in your R workspace at the end of each session provided you ask R to do this (and provided you don’t delete them!). However you may wish to save your data more permanently, or pass it on to someone else, and you will often need to load R data sets which you have not created yourself.

7.1 save() and load()

You can save groups of variables to a file using R’s special internal storage format. The resulting file can only be read by R.

> save(X,y,file="small.matrix.and.vector.rdata")

If you wished to load these variables into a different R workspace you would type (having started R in the new workspace!)

> load("small.matrix.and.vector.rdata")

7.2 dump() and source()

You can save R functions and objects using dump(). This can be read in using source(). Using dump() the first argument is a vector with the names (in inverted commas) of the objects you want to save.

> dump(c("X","y"),file="small.matrix.and.vector.r")

> source("small.matrix.and.vector.r")

You can use source to input a list of R commands.
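The sketch below runs a complete round trip for both pairs of functions. To avoid cluttering your H: drive it writes to temporary files created with tempfile(); the small objects X and y, and the names rdata.file, dump.file and X.saved, are made up for the example.

```r
X <- matrix(1:4, nrow = 2)
y <- c(3, 5)
X.saved <- X                           # keep a copy to compare against

# save() / load(): R's internal binary format
rdata.file <- tempfile(fileext = ".rdata")
save(X, y, file = rdata.file)
rm(X, y)
load(rdata.file)                       # restores X and y under their original names
print(identical(X, X.saved))

# dump() / source(): a plain text file of R commands
dump.file <- tempfile(fileext = ".r")
dump(c("X", "y"), file = dump.file)
rm(X, y)
source(dump.file, local = TRUE)        # local = TRUE: recreate in the current environment
print(identical(X, X.saved))
```

Opening the dumped file in a text editor shows ordinary R commands that recreate the objects, which is why source() can read it back.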

7.3 Importing and exporting data *

You will sometimes wish to import/export data from/to a database or spreadsheet to/from R. This is best done via a text file which (in its simplest format) has one row per row of data, with variables separated by a comma (known as CSV), a space or a tab. Most software allows import and export of such data, and R is no exception. Read the help pages for read.table() and write.table().
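A minimal sketch of a CSV round trip is given below. It assumes a small made-up data frame dat (data frames are introduced in Section B.3) and a temporary file; the column names and values are purely illustrative.

```r
# A small illustrative data set
dat <- data.frame(id = 1:3, height = c(1.62, 1.75, 1.80))

# Export as comma-separated text, without row names
csv.file <- tempfile(fileext = ".csv")
write.table(dat, file = csv.file, sep = ",", row.names = FALSE)

# Import again; header = TRUE says the first line holds the variable names
dat2 <- read.table(csv.file, header = TRUE, sep = ",")
print(dat2)
```

The sep argument is the part to change for space- or tab-separated files (for example sep = "\t" for tabs).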


8 Graphics (1) - plot

In this section, we will use the animals data frame you created in Section B.3 to illustrate how to use some of the graphics functions in R. The functions we will be using are:

Function     Use
plot()       Creates a new plot
plot.ts()    Creates a time-series plot
lines()      Adds a line to an existing plot
points()     Adds points to an existing plot
legend()     Adds a legend to a plot

8.1 Creating a plot

The basic plot() command can be used as follows:

> plot(x,y,main='My Title', xlab='The x-axis', ylab='The y-axis', xlim=c(0,10),

ylim=c(0,15))

where x and y are vectors of x and y coordinates respectively. The other parameters set the plot labelling, and axis limits (x-axis between 0 and 10, and y-axis between 0 and 15) and will be decided by R if you do not specify them explicitly.

Exercise

Load in data of the FTSE100 index (download file "FTSE.R" from Moodle).

>source("FTSE.R")

This inputs two vectors FTSE.close and FTSE.date. These are the closing values of the FTSE100 share index, and the corresponding dates. As the latter is a set of character strings, we will produce a vector of numbers to plot the closing price against:

>n=length(FTSE.close)

>time=1:n

Here we find the length of the FTSE.close vector, and create a time vector (1, . . . , n).

We will plot this data. Try the following alternatives

>plot(time,FTSE.close,pch="+",col="red")

>plot(time,FTSE.close,type="l",ylab="FTSE100",xlab="Time",col=2)

The function plot.ts is specifically designed to plot time-series data. Try:

>plot.ts(FTSE.close)

You can add labels to the axes, and change the colour of the line as with standard plot commands.

8.2 Lines and points

R will plot points individually (the default), joined up by line segments (a line), or both:

> plot(x,y,type="p")

> plot(x,y,type="l")

> plot(x,y,type="b",pch="x")

It will also allow you to add points or lines to an existing graph via the points() and lines() functions:

> x<-seq(0.1,2,.05)

> y1<-x*x

> y2<-x

> y3<-1/x

> plot(x,y1,type="b")

> points(x,y2,pch="*")

> lines(x,y3,lty=2,lwd=3)

The lty and lwd parameters are used in either plot() or lines(). What do they do? Try changing the values.

9 Functions (1) - introduction

The real strength of R is the ease with which a user can write new functions that can then be accessed just like the built-in functions. Code for many of the functions in this section and the other “function” sections can be found in examples.r.

9.1 A first example function

Here is an incredibly simple function - but it serves to demonstrate the key aspects of a function. The code for this particular function is not in examples.r - it will be instructive for you to type it in to the script window yourself.

my.square<-function(a) {

b<-a^2

return(b)

}

What do you think this function should do? Type it, then run it in R.

When you “run” this function from your script window there is no visible result. However if you type ls() in the console window you will see that there is an extra object in R’s memory - the function my.square(). “Running” the function simply caused R to absorb it and check that it “makes sense” (see later sections on debugging for more on what is sensible to R). To get the function to actually “do some work” you must call it. You can call my.square() as often as you like, with any values you wish.

> my.square(7)

[1] 49

> x<-10

> y<-my.square(x)

> y

[1] 100

This simple function demonstrates 4 key ideas

1. Every function has a name. We have named the function in this example my.square().

2. A function generally takes values supplied by the user (the person who calls it); these are called the arguments to the function. my.square() takes a single argument, which it calls a.

3. The main portion of a function (its body) lies between the curly brackets { } and “does something useful”. This usually involves whatever arguments were supplied. my.square() squares the argument and calls the result b.

4. A function generally returns an object which was the result of “doing something useful.” my.square() returns b.

Important 1: When you call a function the argument can be an existing variable, which in the example above was called x; similarly you can put the result of calling the function into another variable (y in the example above).

If all of the instructions inside a function can be used with vectors then the function will work with vector arguments. Here the only instruction in the body of the function is b<-a^2, which can be used with vectors, and so

> x<-1:10

> my.square(x)

[1] 1 4 9 16 25 36 49 64 81 100

Important 2: the return() statement must be the last line in the body of a function.

Alter your function to the following

my.square<-function(a) {

b<- -100

return(b)

b<-a^2

}

Re-run it in R and then call it with several different values. Explain the result.

9.2 More general “linear” functions

Here is an example of a function hypotenuse() that uses Pythagoras’ Theorem to calculate the length of the hypotenuse of a right-angled triangle given the lengths of the other two sides. Copy and paste it from the file examples.r into a script file, or type it directly into your script window. Assimilate and run the code. The function should be ready to use.

hypotenuse <- function(side1,side2) {

print(c("Two sides at right angles: ",side1,side2))

side3 = sqrt(side1*side1 + side2*side2)

print("Returning the hypotenuse...")

return(side3)

}

How many arguments does hypotenuse() have? What are they called inside the function? How many values does it return?

A function may have any number of arguments (0,1,2,...) but may only return at most one object.

The print() function takes a vector (or list - see later) and prints out the elements.

Try calling the function.

> hypotenuse(3,4)

[1] "Two sides at right angles: " "3" "4"

[1] "Returning the hypotenuse..."

[1] 5

> hypotenuse(side1=3,side2=4)

[1] "Two sides at right angles: " "3" "4"

[1] "Returning the hypotenuse..."

[1] 5
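Returning at most one object is less restrictive than it sounds, because that object can be a vector holding several values. The function below is a hypothetical variation on hypotenuse() (it is not in examples.r) that returns both the hypotenuse and the perimeter in a single named vector.

```r
# Hypothetical example: several results packed into one returned vector
triangle.summary <- function(side1, side2) {
  side3 <- sqrt(side1^2 + side2^2)
  result <- c(hypotenuse = side3, perimeter = side1 + side2 + side3)
  return(result)
}

out <- triangle.summary(3, 4)
print(out)                 # both values, labelled by name
print(out["perimeter"])    # pick out one value by its name
```

Lists (Section D) offer a more general way of returning several objects of different types.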

The following function is also in examples.r; it takes no arguments and returns nothing. Before you call the function, try to work out what will happen when you do; then call it several times. To call the function, type fancy.plot() at a script window. Suggest a possible use for this function.

fancy.plot<-function() {

x<-rnorm(100,sd=4)

z<-rnorm(100,sd=3)

y<-x+z

plot(x,y,main="Bivariate normal sample; mu1=mu2=0, sig1=4,sig2=5, rho=4/5")

print(cor(x,y))

return()

}

Why does NULL appear as part of the output?

If there are no arguments to the function return() then you may omit that line. Try removing the line, re-running the function definition and then calling the function fancy.plot().

If time permits, explain why the graph title is statistically appropriate.


10 Subscripts

Suppose we want to extract a part of a vector. R allows us to do this using square brackets, the subscripts starting at 1 (rather than 0 if you’re a C programmer!).

> x <- 10:1

> y <- x[2]

> y

[1] 9

Here y gets the value of the second element of x.

10.1 Slicing more...

You can extract parts of a vector by subscripting with another vector.

> y <- x[c(1,3,5)]

> y

[1] 10 8 6

> x[4:7]

[1] 7 6 5 4

Here the returned value is a vector.

10.2 Don’t bite off more than you can chew...

If you try and subscript a vector with a number out of range...

> x[12]

[1] NA

you get NA. This is the symbol R uses for missing data, or other numeric nonsense.

10.3 Negative Subscripts *

Negative subscripts? What’s the -Nth element of a vector? R defines this as being all the vector except the Nth element. This gives us an easy way of removing some elements from a vector.

> x <- 1:10

> x[-3]

[1] 1 2 4 5 6 7 8 9 10          missing 3

> x[c(3,4,7)]

[1] 3 4 7

> x[-c(3,4,7)]          a vector of negatives

[1] 1 2 5 6 8 9 10

You can’t mix positive and negative subscripts. Why would you want to?

10.4 Some more examples *

> x <- 1:10

> x[-length(x)]          all but the last

[1] 1 2 3 4 5 6 7 8 9

> x[length(x):1]          reverse x

[1] 10 9 8 7 6 5 4 3 2 1

There is a function to do this last one: rev(x) - most of the simple things you may want to do tend to have their own functions! Try using help.search() to find more functions.
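A few further subscript idioms are sketched below on a made-up vector. The last line shows something not covered above: a subscript may also appear on the left of <- to change part of a vector.

```r
x <- c(12, 7, 3, 9, 15, 4)
first3 <- x[1:3]                              # 12 7 3
drop.ends <- x[-c(1, length(x))]              # all but first and last: 7 3 9 15
every.other <- x[seq(1, length(x), by = 2)]   # 12 3 15
x[2] <- 70                                    # subscripts can also be assigned to
print(x)
```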


11 R Objects (3) - matrices and logicals

11.1 Matrices

Creating and manipulating matrices in R is much easier than in lower level languages such as C or FORTRAN. The following creates a 2x2 matrix then finds its inverse, its square, its transpose, and its determinant:

> X<-matrix(c(1,2,3,4),nrow=2,byrow=T)

> X

[,1] [,2]

[1,] 1 2

[2,] 3 4

> solve(X)

[,1] [,2]

[1,] -2.0 1.0

[2,] 1.5 -0.5

> X %*% X # matrix multiplication

[,1] [,2]

[1,] 7 10

[2,] 15 22

> t(X)

[,1] [,2]

[1,] 1 3

[2,] 2 4

> det(X)

Many R functions can take and return matrices:

> Y<-log(X)

> Y

[,1] [,2]

[1,] 0.000000 0.6931472

[2,] 1.098612 1.3862944

Exercises: Use the help facility (if necessary) to answer the first three questions:

1. How does R know how many rows and columns the matrix should have?

2. What does the byrow parameter do?

3. solve() can take more than one argument; what are the first two? Hence find the vector w such that

$Xw = \begin{pmatrix} 1 \\ 8 \end{pmatrix}$

> solve(X,matrix(c(1,8),ncol=1))

[,1]

[1,] 6.0

[2,] -2.5

> solve(X,c(1,8))

[1] 6.0 -2.5

4. The %*% symbol tells R to perform matrix multiplication. What happens with X * X or X^2? Hint: first try X+X, X-X, and X/X.

5. R allows matrix multiplication between matrices and vectors. Create a vector

$y = \begin{pmatrix} 3 \\ 5 \end{pmatrix}$

and hence evaluate $(X^t X)^{-1} X^t y$.

> solve(t(X) %*% X) %*% t(X) %*% y

[,1]

[1,] -1

[2,] 2

11.1.1 Matrix subscripts

Just as for data frames, you can access entire (or partial) rows or columns of a matrix.

> X[1,2]

[1] 2

> X[,2]

[1] 2 4

> X[2,]

[1] 3 4
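Vector subscripts (Section 10) can be used in either position to pull out a sub-matrix. The sketch below uses a made-up 3x4 matrix A; it also shows that extracting a single column gives a plain vector unless drop=FALSE is added.

```r
A <- matrix(1:12, nrow = 3, byrow = TRUE)
corner <- A[1:2, c(1, 4)]          # rows 1-2, columns 1 and 4: a 2x2 matrix
col2 <- A[, 2]                     # a plain vector: 2 6 10
col2.mat <- A[, 2, drop = FALSE]   # stays a 3x1 matrix
print(corner)
print(dim(col2.mat))
```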

11.2 Logicals

11.2.1 Scalar logicals

A logical variable can either be TRUE or FALSE and results from one or more of the conditional tests (such as > and <). R understands the full names (TRUE, FALSE) as well as the short-cuts T and F. You can store logicals in objects just as you can numeric or character data. Try the following to get a feel for logical variables.

> x<-5

> x<10

[1] TRUE

> x>10

[1] FALSE

> y<-x==10

> y

[1] FALSE

> x==5

[1] TRUE

> x !=5

[1] FALSE

ATTENTION! To test whether or not two quantities are equal you must use the double equals sign ==. The single equals sign is for assigning arguments in function calls, or (as an alternative to <-) for assigning variables.

Other simple logical operators are <= and >=.

Logical operations may be combined using the && (AND) and || (OR) operators and negated using the ! (NOT) operator:

> y<-10

> (x==5) && (y==3)

[1] FALSE

> (x==5) || (y==3)

[1] TRUE

> !(x==5)

[1] FALSE

11.2.2 Vector logicals

Logicals can also be arranged in vectors:

> x<-1:10

> x<=5

[1] TRUE TRUE TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE

To combine vectors of logicals we must use the vector AND and OR operators, respectively & and |:

> (x>3) & (x<=7)

[1] FALSE FALSE FALSE TRUE TRUE TRUE TRUE FALSE FALSE FALSE
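The functions any(), all() and which() are often handy for summarising a logical vector. A small sketch, reusing the vector above (the name in.range is just for illustration):

```r
x <- 1:10
in.range <- (x > 3) & (x <= 7)  # element-by-element AND
any(in.range)    # TRUE: at least one element passes both tests
all(in.range)    # FALSE: not every element does
which(in.range)  # 4 5 6 7: the positions that are TRUE
```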

11.2.3 Logical Subscripts

You can use a vector of logical true/false values as a subscript. Example:

> x <- c(6,5,6,4,4,3,4,2,3,4)

> y <- c(5,3,4,2,6,5,4,5,4,3) # couple of random vectors

> xeq4 <- x == 4 # which of x equals 4?

> xeq4

[1] FALSE FALSE FALSE TRUE TRUE FALSE TRUE FALSE FALSE TRUE

> y[xeq4] # which of y correspond to x==4?

[1] 2 6 4 3

> y[x == 4] # or simply...

[1] 2 6 4 3

11.2.4 True and False as Numbers...

If you try and do anything numeric with logical values, R converts true to 1 and false to 0. This can be usefully employed to count the number of true values in a vector. Examples:

> x <- rnorm(1000)

> y <- x > 2

> sum(y)

> sum(y)/length(y)
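Since TRUE counts as 1 and FALSE as 0, the proportion of TRUE values is also just the mean of the logical vector. A minimal sketch follows; the call to set.seed() is our addition so that the simulated numbers are reproducible, and the variable names are illustrative.

```r
set.seed(580)                            # arbitrary seed, for reproducibility only
x <- rnorm(1000)
y <- x > 2
prop.by.sum <- sum(y) / length(y)        # as in the text
prop.by.mean <- mean(y)                  # the same quantity, written more compactly
all.equal(prop.by.sum, prop.by.mean)     # TRUE
```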

Exercise

1. The above estimates the probability that a N(0,1) random variable is greater than 2 (can you see why?). Check your answer by using the pnorm() function. Which value do you trust more? Why?

> 1-pnorm(2)

[1] 0.02275013

2. Let $Z \sim N(0, 1)$. Use simulation to estimate $P(1/(1 + Z^2) > 0.5)$. (Can you also do this using pnorm()? )

> sum(1/(1+z^2) > 0.5)/length(z)

[1] 0.6856

$1/(1+Z^2) > 0.5 \Leftrightarrow Z^2 < 1$, so use

> pnorm(1)-pnorm(-1)

[1] 0.6826895

12 Graphics (2) - multiple graphs

12.1 Multiple graphs

R allows you to place multiple graphs in the same window. Type these commands:

> par(mfrow=c(2,3))

> hist(rnorm(100),main="Normal")

> hist(runif(100),main="Uniform")

> hist(rexp(100),main="Exponential")

> hist(rgamma(100,3),main="Gamma, shape=3")

> hist(rpois(100,2),main="Poisson, mean=2")

> hist(rt(100,df=1),main="Cauchy")

What happens if you replace mfrow with mfcol?

13 Functions (2) - debugging and looping

When a function produces the wrong output or does not even make sense to the R interpreter you need to debug it - i.e. find the gremlins in the code. This is one of the most frequently used programming skills and so we will practice it extensively on this course.

Debugging (1): syntax errors

“Bugs” come in many different varieties. The easiest to identify and deal with are syntax errors. These occur when the punctuation in your code does not make sense to the R checker, and R is unable to even process your code let alone run it. For example:

• > myvar1<-1

> myvar2<-2

> print(c(myvar1;myvar2)

Here we have used a semi-colon when we should have used a comma, and we have missed the closing bracket from the function call.

• > 2 pi

Here we have omitted the symbol * .

Try typing the above examples. The error messages produced by R are rarely informative in themselves; however in longer segments of code they are of some help in pin-pointing the mistake since the error has usually occurred before or on the line of code which R cannot interpret.

Debugging Exercise 1

This exercise is centered around a very short function which produces a “times table” box such as you will have seen in primary school. The idea is that if $v$ is the vector $[1, 2, \ldots, 10]^t$ then $vv^t$ is exactly the matrix we require.

The following function is in your R-file examples.r . Paste it into your editor and try running it. Then try correcting all of the syntax errors until it works.

times.table <- function() {

one.to.ten<-matrax(data=(1;10) ncol=1)

print("Times table"

print(one.to.ten %*% t(one.to.ten)

}

Debugging (2): run time errors

Once you have corrected all the syntax errors the R checker should accept your code. However it is often the case that when you then come to actually run your code or call the function which you have created, error messages will be returned. Such problems usually (but not always) arise from incorrect spellings of variable and function names. Examples include

• Mis-spelling a variable name or function name e.g.

> myvar<-1

> prunt(myvr)

Here we have mis-spelled the function print() and the variable myvar.

• > x<-5

> xsin(x)

This initially looks correct to R - it is a call to the function xsin(). However when R tries to run the code and actually call the function it realises that no such function exists! What should the user have written?


Debugging Exercise 2

This exercise is centered around Stirling’s approximation to a factorial

$$n! \sim \sqrt{2\pi}\, e^{-n}\, n^{n+1/2}$$

As before, the code is available from the R-file examples.r . Paste it into your editor and try running it. Then try correcting all of the run-time errors until it works.

stirling<-function(x) {

approx<-sqt(2*pi)*e^(-n) * n^n+.5

retrun(approx)

}

One of the errors is in the line of algebra; if you have also corrected this then the function will produce the output

> stirling(4)

[1] 23.50618

If you have not corrected the algebra, your function will produce the (incorrect!) output:

> stirling(4)

[1] 12.25309

Read the next section!

Debugging (3): errors in your algebra

R algebra is subject to the usual rules of mathematics e.g. that 5+2*4 is 13 and not 28. It is easy to forget to bracket terms, so that even though your R code runs and produces an answer, it produces the wrong answer.
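To see the effect of operator precedence for yourself, compare the bracketed and unbracketed versions below. This is only an illustrative sketch; the value n <- 4 is an arbitrary choice.

```r
5 + 2 * 4      # 13: multiplication is done before addition
(5 + 2) * 4    # 28: brackets force the addition first
n <- 4
n^n + 0.5      # 256.5: the power is evaluated first, then 0.5 is added
n^(n + 0.5)    # 512: the whole of n + 0.5 is the exponent
```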

This is in fact the case with your function stirling().


Debugging Exercise 3

Find the algebraic error in your stirling() function and correct it.

approx<-sqrt(2*pi)*exp(-n)*n^(n+0.5)

You should now be in a position to write (and debug) your own functions!

Coding Exercises

1. Directly relevant to the project: The Poisson distribution is often used to model count data (e.g. number of days off sick each employee has had in the space of a year, or number of asthma attacks in a month). The likelihood of a set of Poisson random variables $y_1, \ldots, y_n$ with mean $\lambda$ is $L \propto \prod_{i} \lambda^{y_i} e^{-\lambda}$ and so the log-likelihood (up to a constant) is $l = (\log \lambda) \sum_{i} y_i - n\lambda$. Create a function called poisson.log.like() that takes two arguments: lambda,y. These will be respectively the mean parameter and a data vector of non-negative integers (counts). Your function poisson.log.like() should use the length() function to find $n$ and the sum() function to find $\sum_{i} y_i$, and should return the log-likelihood.

Now create a vector of Poisson variables and try finding the likelihood for different values of $\lambda$ (NB: you will obtain different answers to those below - why?).

> po<-rpois(20,3) # vector of 20 Poisson variables

> poisson.log.like(3,po) # your answers will be different: why?

[1] -2.534693

> poisson.log.like(2.5,po)

[1] -2.092732

> poisson.log.like(2.0,po)

[1] -2.671320

poisson.log.like<-function(lambda,y) {

# lambda is the mean parameter

# y is a vector of count data

s<-sum(y)

n<-length(y)

l<-s*log(lambda)-n*lambda

return(l)

}

2. Create a function called interval.plot() that takes two arguments: lo,hi. It should create a vector x of 100 elements between lo and hi and then calculate $y = x^5 e^{-0.5 x^2}$ over this range. It should then draw a (line) graph of y against x with suitable labels.

3. Start with lo = 0 and hi = 20 and use repeated calls to interval.plot() with different values of lo and hi to “zoom in” on the maximum value. At what value of x is y maximised? Check that this agrees with the answer you obtain by differentiating!

interval.plot<-function(lo,hi) {

  x<-seq(lo,hi,len=100)

  y<-x^5*exp(-0.5*x*x)

  plot(x,y,main="y=x^5 exp(-0.5*x*x)",type="l")

}

Q: Why is 100 points a reasonable number? Hint: what would happen if you chose 3 points? What if you chose 3x10^50?

Note: Use the # marker and write comments. It might be obvious to you right now what your code does, but in 2 months time when you come back to it your code may seem impenetrable.

13.1 The for loop

Sometimes we wish to repeat one task over and over - for example to find 5! we could type

> 5*4*3*2*1

But this would become tedious if we wanted to find 38!

The for loop allows us to automate repetitive tasks by repeating the task for each element of a vector. Copy the following function from the examples file and assimilate it:

# Print a numbered shopping list

shopping<-function(to.buy) {

print("My Shopping List")

for (i in 1:length(to.buy)) {

print(paste(i,to.buy[i],sep=". "))

}

}

Call the function with a shopping list, e.g.

> shopping(c("bread","milk","a large cow"))

Make sure you understand how the function works. The line for (i in 1:n) repeats the code within the curly braces once for each element of 1:n. At the first iteration, i takes the value of the first element (i.e. 1); at the second iteration i takes the value of the second element (i.e. 2) etc. Here now is a function to calculate factorials:

# calculate n!

myfac0<-function(n) {

product<-1

for (i in 1:n) {

product<-product*i

}

return(product)

}

Q: What does the following line do? product<-product*i

Make sure you understand exactly what it is doing, then try it out.
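If it helps, you can watch the running product build up by recording it at each iteration. The following sketch does this outside a function for n = 5; the vector name steps is our own invention and is not part of examples.r.

```r
product <- 1
steps <- rep(0, 5)            # storage for the value after each iteration
for (i in 1:5) {
  product <- product * i      # multiply the running product by i
  steps[i] <- product
}
steps                         # 1 2 6 24 120
```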

The for loop in myfac0() ends with only a single value remembered, product. It is often necessary or useful to store one value at each iteration of a loop. The following is in examples.r . Try to work out what it will do and what it will return when you call it with triangle(5); then call it.

# Triangular numbers

triangle<-function(d) {

partial.sum<-rep(0,d)

sum<-0

for (i in 1:d) {

sum<-sum+i

partial.sum[i]<-sum

}

return(partial.sum)

}

Debugging (4): logical errors

Even once your function works and does something it often does not perform as expected or intended! A simple example of this was the algebraic mistake in the previous debugging exercise but there can also be logical mistakes - where your code is performing calculations correctly but may be using the wrong variables or putting the results in the wrong variable, for example. The simplest tool for finding such mistakes is the print() statement. Add print() statements within your function to print out any scalar or vector which you think has even a chance of not behaving as expected, run your program again (with short vectors!) and see where your function starts producing unexpected output. Once you know the “where”, the “why” is often obvious.


Debugging Exercise 4

This exercise makes use of your stirling() function and your new myfac0() function to assess the accuracy of approximation as n increases. The function compare.stirling() should take a single input ntop (the highest value you wish to look at) and return a vector of length ntop, the ith element of which is the ratio of the Stirling approximation to the true value for i!. Copy the code from examples.r to a script window. Helpful print() statements have already been added!

compare.stirling(ntop) {

print(c("ntop: ",ntop))

rat<-rep(0,n)

for (i in 1:topn) {

print(c("i is: ",i))

print(c("rat is: ",rat))

approx<-stirling(ntop)

tru<-myfac0(n)

rat[topn]<-appr\tru

return(ratio)

}

The above code contains an enticing mixture of syntactic, variable, and logical mistakes. Use the errors returned by the R interpreter along with the output from the print statements to correct the function. When it is working you can remove the print statements and the function should produce the following

> compare.stirling(6)

[1] 0.9221370 0.9595022 0.9727016 0.9794240 0.9834931 0.9862197

compare.stirling<-function(top.n) {

  rat<-rep(0,top.n)

  for (i in 1:top.n) {

    approx<-stirling(i)

    tru<-myfac0(i)

    rat[i]<-approx/tru

  }

  return(rat)

}

Exercises

1. Write a function called sum.uniforms which takes a single argument n and returns a sample of size 1000 from the distribution of the sum of n uniform(0,1) random variables. Try it out via

> us<-sum.uniforms(2)

> hist(us)

Hint: you will need to start the function by setting up a variable usum<-rep(0,1000). There are two ways to write the main body of the function. Most people find it conceptually simpler to use a for loop which iterates 1000 times, and to repeatedly call runif(n).

sum.uniforms<-function(n) {

  u.sum<-rep(0,1000)

  for (i in 1:1000) {

    u<-runif(n)

    u.sum[i]<-sum(u)

  }

  return(u.sum)

}

2. Modify your function so that it plots a histogram of the sample automatically, and doesn’t return any value.

3. Try out the function for n=1,2,3 and n=1000. What do you notice? This is the central limit theorem at work.

14 Functions (3) - decisions

14.1 The if test

R can also make decisions e.g.

temp.interpret<-function(temp) {

print("I will now interpret the temperature:")

if (temp<10) {

print("it is cold")

}

print("Temperature interpreted!")

}

Type the function in to your script window and run it. How does it work? First take a few minutes to recall Section 11.2.1 on scalar logicals.

When the R interpreter (the part of the computer that is running your program) encounters an if statement it looks in the following brackets ( ) for a logical variable (here temp<10) which is either TRUE or FALSE. If it is TRUE then the interpreter moves on to the R code in the curly brackets; if it is FALSE then the interpreter skips to the code after the closing curly bracket.

The program does not tell you when it is warm! Modify your script file so the function definition reads as follows:

temp.interpret<-function(temp) {

print("I will now interpret the temperature:")

if (temp<10) {

print("it is cold")

}

else {

print("it is warm")

}

print("Temperature interpreted!")

}

The else statement can also be thought of as an “otherwise” statement. When the if test is FALSE and there is an else statement then the interpreter runs the code in the curly brackets following the else. However if the if test is TRUE then the else code is not run.

Return to your factorial function; it is not complete - for three reasons. Firstly the mathematical operation of factorial is not defined for negative numbers, secondly 0! is defined as 1, and finally non-integer factorials can only be defined via the gamma function (not covered here). Yet the function returns

> myfac0(-3)

[1] 0

> myfac0(0)

[1] 0

> myfac0(4.3)

[1] 24

Why does it do this? Hint: what is 1:-3 ? What is 1:4.3 ?

A well written function should cope with any argument the (potentially ignorant) user might throw at it. The following makes the myfac0() function more robust - but not foolproof!

# Robust (but not perfect) factorial function

myfac1<-function(n) {

if (n<0) {

print("Factorial is not defined for n<0")

product<-NaN

}

else if (n==0) {

product<-1

}

else if (n==trunc(n)) {

product<-1

for (i in 1:n) {

product<-product*i

}

}

else {

print("Cannot cope with non-integer n!")

product<-NaN

}

return(product)

}

Several new concepts have been introduced here, the most important of which is the if ... else if ... else ... structure. Type in the code and run the program. Make sure you understand why it does what it does.

The symbol NaN stands for Not a Number. R returns this when an operation is undefined e.g. 0/0 or $(-1)^{1/2}$.

The logical n==trunc(n) is TRUE if n is an integer. Why? Look up the help page for the function trunc or simply try it out.
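A few quick experiments make these two ideas concrete. The sketch below uses only base R functions; is.nan() is the standard way to test for NaN, since comparing with == does not work.

```r
0/0                 # NaN
is.nan(0/0)         # TRUE
NaN == NaN          # NA, so never test for NaN with ==
trunc(4.3)          # 4: the fractional part is discarded
trunc(-4.3)         # -4: truncation is towards zero
4 == trunc(4)       # TRUE: 4 is a whole number
4.3 == trunc(4.3)   # FALSE
```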

Directly relevant to the project: The if test can also be used to keep a running maximum (or minimum). In section 13 exercise 2 you visually approximated the maximum of $x^5 e^{-0.5 x^2}$; the following function (which is in the examples file) numerically approximates this maximum.

interval.max<-function(hi,lo) {

f.max<--Inf # minus infinity - bound to be exceeded by the function!

x.seq<-seq(lo,hi,len=100)

for (x in x.seq) {

f<-x^5*exp(-0.5*x*x)

if (f>f.max) { # new value exceeds the current max

f.max<-f

x.max<-x

}

}

return(list(x.max=x.max,f.max=f.max))

}

Check that you understand how the function works and then use it to gradually zoom in on the maximum. NB There are much more efficient ways to find the local maximum, and several of these are options in the R function optim(); one might also create a vector of f values and use the R function max(). The point of the exercise above is to aid your learning of the R programming language and to prepare you for the project!
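As a hedged illustration of the vector approach mentioned above, the sketch below evaluates f on a grid and picks out the largest value with which.max(), then compares it with R's built-in one-dimensional optimiser optimize(). The search interval c(0, 5) and the object names are our own choices, not part of the examples file.

```r
f <- function(x) x^5 * exp(-0.5 * x * x)

# Grid search using vectors instead of a loop
x.seq <- seq(0, 5, len = 100)
f.seq <- f(x.seq)
i.max <- which.max(f.seq)          # position of the largest f value
grid.max <- list(x.max = x.seq[i.max], f.max = f.seq[i.max])

# One-dimensional numerical optimisation over the same interval
opt <- optimize(f, interval = c(0, 5), maximum = TRUE)

grid.max$x.max   # accurate to within the grid spacing
opt$maximum      # should agree with your answer from Section 13
```

The grid version is limited by the spacing of x.seq, which is why repeated zooming (or a proper optimiser) is needed for an accurate answer.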

Exercises

1. Modify your function poisson.log.like() from Section 13 so that if lambda is 0 or less it prints an error message and returns without trying to calculate the log-likelihood (and in particular, log λ).

2. Directly relevant to the project! Create a function interval.max.like(), based on the example function interval.max(). It should take three arguments hi,lo,y. It should create a vector, lambda.seq, of length 100 between lo and hi. It should then loop through this sequence and for each value it should calculate the log-likelihood for the Poisson data y using the function that you created in Section 13. It should return the maximum likelihood estimate of lambda and the log-likelihood at this value.

Use the function to zoom in on the MLE for λ.

interval.max.like<-function(lo,hi,y) {

  l.max<- -Inf

  lam.seq<-seq(lo,hi,len=100)

  for (lam in lam.seq) {

    l<-poisson.log.like(lam,y)

    if (l>l.max) {

      l.max<-l

      lam.max<-lam

    }

  }

  return(list(lam.max=lam.max,l.max=l.max))

}

3. Modify your function sum.uniforms() so that it returns an error message if the user inputs n < 1. In such cases it should return NaN or NULL (R’s equivalent of a “blank”) and should not draw a histogram. Check that it works!

Note: Always build functions up in stages, as in Qn2. Write a small function to do part of what you want to do, get this working, and then add to it.

Debugging (5): browser()

For your R assignment, for projects in other modules, and perhaps in your dissertation you will be writing functions which call other functions and which also have for and while loops as well as if tests. There are many ways in which such programs can appear to run correctly yet produce the wrong output. R provides an extremely useful tool for such circumstances: browser().

Place a call to browser() in your code at a point where you would like to know “what is going on”, then run your program as normal. The program will pause at the line where you have called browser. You may now use standard R commands to ascertain whether your code is acting sensibly; for example if you type in a variable’s name then R will tell you the value of that variable. Once you are satisfied that you understand the state of your code then simply press ENTER and the R code will resume running.

Warning: if your call to browser() is within a loop then the R interpreter will pause in its running of your code at every iteration of the loop; this may try your patience! To exit completely from all the running code, type Q .

Debugging Exercise 5 *

One very simple way to check whether or not a particular number, k, is prime is to loop through all the integers that are less than or equal to $\sqrt{k}$. If any of them (except 1!) divide exactly into k then k is not prime. [Why do we only need to check up to $\sqrt{k}$?]

The following function accepts a single argument, k, and should return TRUE if k is prime and FALSE otherwise. It does not! There are, in fact, several problems with the code. Use browser within the for loop to discern the main problem, then get the function working!

is.prime<-function(k) {

highest<-floor(sqrt(k)) # only check up to [k^(1/2)]

for (i in 1:highest) {

factor<-(k/i==floor(k/i)) # divides exactly?

}

return(factor)

}

Solution:

is.prime<-function(k) {

  highest<-floor(sqrt(k))

  factor.found<-FALSE

  if (highest>=2) { # for k<4, 2:highest would count downwards

    for (i in 2:highest) {

      factor<-(k/i==floor(k/i)) # divides exactly?

      if (factor) {

        factor.found<-TRUE

      }

    }

  }

  return(!factor.found)

}

14.2 The while loop

Incorporate if tests into loops through the while statement. What does the following do? See if you can work it out before you run it, then try out the function to see.

NB1: Inf is R’s symbol for infinity.

NB2: so that the function runs quickly, choose z < 2.

# A comment should go here explaining what the function does!

exceed<-function(z) {

n.goes<-0

z.sim<- -Inf

while (z.sim <= z) {

n.goes<-n.goes+1

z.sim<-rnorm(1)

print(c("This z: ",z.sim))

}

return(n.goes)

}

The function counts the number of standard normals it needs to simulate before one of them exceeds the number supplied by the user.
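The number of goes has a geometric distribution, so its expected value is 1/P(Z > z). The short calculation below (a sketch; the names are illustrative) shows why a large z makes the function slow.

```r
z <- c(1, 2, 4)
p.exceed <- 1 - pnorm(z)        # P(Z > z) for a standard normal
expected.goes <- 1 / p.exceed   # mean of the geometric number of goes
round(expected.goes)            # roughly 6, 44 and 31574
```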

A Scripts and packages

A.1 Scripts

Script files are useful things that allow us to build libraries of code. For example, suppose you wanted to use some of the functions you have written above in many different projects. There is an easy way to do this:

1. Create a new R script file by going to File -> New script.

2. Copy your functions into this file using cut and paste

3. Save this file with a .R file extension

Now all you need to do to read this function back into R is:

> source("myFile.R")

remembering to replace myFile.R with the filename you gave your script file.

Note that it is not only functions that can be read in, but raw R commands. What happens if you use source() with the file you have been saving your work in (you have, haven’t you?)?

Note: if you get errors telling you that R cannot find a function, it is likely that you have forgotten to comment out any explanatory notes you have included in a script file. Recall that comments should be preceded with a #, like so:

# This is a comment

It is good practice to make liberal use of comments so that when you come to re-read your code, you will know what is going on.

A.2 Packages *

There are many functions available to you as soon as you start R (e.g. sin() and rnorm()). You have now learned how to write your own, using scripts, and how to include functions which you may have written some time ago, using source().

Other people have written many useful functions in R and these are grouped into packages. For example there is a package, mvtnorm, which contains functions to calculate the density, quantiles, and simulate from the multivariate normal and multivariate t distributions.

First type

> ?rmvnorm

You should find that R is currently unaware of this function. Now type

> library(mvtnorm)

This “loads the library” into R - all the functions in mvtnorm should now be available to you. Again type

> ?rmvnorm

You should now find that the help page for this (and dmvnorm) appears. Let us end by using the function:

> mu<-matrix(data=c(1,10),ncol=1)

> sigma<-matrix(data=c(3,2,2,2),ncol=2)

> rmvnorm(5,mu,sigma)

All R functions are part of a package. The “standard” functions such as sin() and rnorm() are part of packages (base and stats respectively) which are automatically loaded by R on start-up. In many of the specialist modules on this course you will encounter one or more specialist packages e.g. for survival analysis, geostatistics, and environmental epidemiology.

The library() command will only work if the package has been installed on your computer via the install.packages() command. This command downloads the required package from one of the official R server sites (or “mirrors”) in the world. All the packages you are likely to require should have been installed on the lab servers.

B R Objects (2) - characters, factors, and dataframes

B.1 Characters

R can store character data in the same way as we have been storing numeric data. Character data is enclosed in either single quote marks ' or double quote marks ". Use either, as long as they match! Examples:

> c1 <- "Hello" # a scalar character string

> c1

[1] "Hello"

> c2 <- c("Yes",'Maybe',"No") # a vector of characters

> c2

[1] "Yes" "Maybe" "No"

> c3 <- c("Is","Could Be","Isn't") # note single quote inside doubles

You can also use some of the functions we used to construct numeric vectors with character data:

> c3 <- rep("Monkey",4) # using rep()

> c3

[1] "Monkey" "Monkey" "Monkey" "Monkey"

Of course, don’t expect to be able to sum() characters!

B.2 Factors

A factor is used for storing categorical data. Suppose you have data on people and you want to store their sex. You could use a numeric code - 0 for male, 1 for female, or you could use a character code - ‘M’ for male and ‘F’ for female. But you should really use a factor.

Factors are easy to construct from character or numeric vectors. Use the function as.factor():

> c5 <- c('M','F','F','F','M','M')

> n5 <- c(0,1,1,1,0,0)

> f5.c <- as.factor(c5)

> f5.n <- as.factor(n5)

> f5.c

[1] M F F F M M

Levels: F M

> f5.n

[1] 0 1 1 1 0 0

Levels: 0 1

Notice that f5.c is printed out like a character vector, but without the quote marks. The levels are also printed out.
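Behind the scenes a factor stores an integer code for each element together with the vector of levels. A small sketch to illustrate (as.integer() and nlevels() are base R functions):

```r
c5 <- c('M','F','F','F','M','M')
f5.c <- as.factor(c5)
as.integer(f5.c)    # 2 1 1 1 2 2: levels are sorted, so F is 1 and M is 2
nlevels(f5.c)       # 2
```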

The categories in a factor can also be seen, and (if desired) changed, by using the levels() function:

> levels(f5.c)

[1] "F" "M" # the result is a character vector

> levels(f5.n)

[1] "0" "1" # the result is still a character vector

> levels(f5.c) <- c("Female","Male")

> f5.c

[1] Male Female Female Female Male Male

Levels: Female Male

With a factor you can tabulate the counts of each category with the table() function:

> table(f5.c) # 3 of each!

Female Male

3 3

B.3 Data frames

A typical line (“record”) of data might for example correspond to one subject (or location or event...) and record several attributes using several different data types. For example

locn temp hum status

1 Lancaster 13.4 68 urban

2 Bentham 15.8 35 rural

3 Manchester 14.0 50 urban

4 Giggleswick 12.5 87 rural

Here locn is a character variable, temp and hum are numeric and status is a factor. The whole collection of information is called a data frame.

The simplest way to create such a dataset from scratch is by joining together vectors for each column using the data.frame() command. Run the following code (it is in examples.r ).

> location<-c("Lancaster","Bentham","Manchester","Giggleswick")

> temperature<-c(13.4,15.8,14.0,12.5)

> humidity<-c(68,35,50,87)

> type<-c(1,0,1,0)

> type.f<-as.factor(type)

> levels(type.f)<-c("rural","urban")

> small.data<-data.frame(locn=location,temp=temperature,hum=humidity,status=type.f)

> small.data

Make sure you understand why the column names are as they appear. If you are given a data frame and you do not know the column names then you can find out (without printing out the entire data set!) by using the names() command:

> names(small.data)

[1] "locn" "temp" "hum" "status"

Access an individual column using the dollar ($) symbol:

> small.data$temp

[1] 13.4 15.8 14.0 12.5

> plot(small.data$temp,small.data$hum)
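Logical subscripts (Section 11.2.3) work on data frames too, which gives a simple way to pick out rows. The sketch below rebuilds small.data so that it runs on its own; the threshold of 13 degrees and the names warm and mean.temp are arbitrary choices.

```r
small.data <- data.frame(
  locn = c("Lancaster", "Bentham", "Manchester", "Giggleswick"),
  temp = c(13.4, 15.8, 14.0, 12.5),
  hum = c(68, 35, 50, 87),
  status = factor(c("urban", "rural", "urban", "rural"))
)
warm <- small.data[small.data$temp > 13, ]   # rows with temp above 13
nrow(warm)                                   # 3
# mean temperature within each level of the factor status
mean.temp <- tapply(small.data$temp, small.data$status, mean)
mean.temp                                    # rural 14.15, urban 13.70
```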

In practice data frames are normally created in R as data is imported from the text formatted output of a spreadsheet or database (see Section 7).

Exercises

1. Create a dataframe called animals with column names and values as follows; make sure medium is a factor. We will be using this data frame again later. There is some code in examples.r to start you off.

species speed medium weight

1 swift 200 air 0.02

2 falcon 70 air 0.70

3 goose 70 air 2.20

4 starling 50 air 0.05

5 cheetah 70 land 50.00

6 horse 50 land 450.00

7 deer 40 land 50.00

8 man 25 land 80.00

9 squirrel 12 land 0.60

10 bear 35 land 150.00

There are many ways of doing this, e.g.

> species<-c("swift","falcon","goose","starling","cheetah","horse",

"deer","man","squirrel","bear")

> top.speed<-c(200,70,70,50,70,50,40,25,12,35)

> weight<-c(0.02,0.7,2.2,0.05,50,450,50,80,0.6,150)

> medium<-c(0,0,0,0,1,1,1,1,1,1)

> medium<-as.factor(medium)

> levels(medium)<-c("air","land")

> animals<-data.frame(species=species,speed=top.speed,medium=medium,

weight=weight)

2. Try out the following two commands. What was the result of the second command? This works because medium is a factor.

boxplot(animals$speed)

boxplot(animals$speed ~ animals$medium)

C Graphics - Legends, outputting graphs

C.1 Legends

Now we want to add a legend (i.e. a key) to our plot. The legend() function can take a huge number of arguments (see the online help if you are interested). The following 6 are probably the most useful:

• x and y - the coordinates on the plot of the top left of the box

• legend - a vector of names of your data series and/or lines on the plot

• pch - (if needed) a vector of point characters you used to plot your data points

• lty - (if needed) a vector of line types which you used to plot your line graphs

• col - (if needed) a vector of colours that correspond to the data series, in the same order as you supplied for legend

Exercise

Produce a graph that looks exactly like this:

[Figure: plot titled “Functions of x”, with x from 0 to 2 on the horizontal axis and y from 0 to 5 on the vertical axis, showing x^2 (plotted with +), x (plotted with *) and 1/x (dashed line), together with a legend identifying the three series.]

> x=seq(0,2,length=100)

> y1=x^2

> y2=x

> y3=1/x

> plot(x,y1,type="b",xlab="x",ylab="y",main="Functions of x",ylim=c(0,5),

pch="+")

> points(x,y2,pch="*")

> lines(x,y3,lty=2)

> legend(.75,4,c("x^2","x","1/x"),pch=c("+","*","-"))

C.2 Outputting graphs

You’re probably wondering how to get your newly plotted graph into some kind of document, right? There are two methods to do this: a) 'print' an existing graph to a file, or b) use your plotting functions to write directly to a file.

a. Print an existing graph to a file: First note the text in the title bar of your plot. It should say something like “R Graphics: Device x (ACTIVE)” where “x” is a number. If the window is marked as “(INACTIVE)”, you will need to use the command:

> dev.set(x)

where x is the number you see in the title bar. This command allows you to tell R that you want the plot active. Now we are ready to output to a file. To output a graph in PDF format to the file myFile.pdf, I would type:

> dev.print(device=pdf, file="myFile.pdf")

In this case, myFile.pdf will be written to R’s working directory (see Section 2). To create a PostScript graphics file type

> dev.print(file="myFile.ps",horizontal=FALSE)

There are other options that you can pass to dev.print() documented in the help, as well as a whole raft of other devices you can use to output your graph in other formats (e.g. png and jpeg - see the online doc for device). Note that you can use a full filesystem path to specify your output file, so that you don’t necessarily have to write to R’s working directory.

Exercise

Save your plot of 6 histograms as a pdf file and access that file outside of R. Also save it as a ps file. You will use both of these in your LaTeX course. NB Windows stubbornly refuses to acknowledge .ps files, despite their usefulness, and often simply identifies them as a ‘bitmap’ file.

b. Write directly to a file: Here, we sandwich our graphics commands between two graphics functions. The first opens the graphics device, the second closes it. Here is an example of how to produce a PDF of a graph:

> pdf("myGraphDirect.pdf") # open the pdf device to write to myGraphDirect.pdf

> ...graphics commands go here...

> dev.off() # turn the graphics device off again

Again, we can use a variety of different graphics devices found in the online documentation for device.
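A complete, minimal version of method (b) might look as follows. The file is written to R's temporary directory via tempdir() purely so that the sketch does not clutter your working directory; the histogram is just a placeholder for your own graphics commands.

```r
out.file <- file.path(tempdir(), "myGraphDirect.pdf")
pdf(out.file)                          # open the pdf device
hist(rnorm(100), main = "Normal")      # any graphics commands go here
dev.off()                              # close the device so the file is written
file.exists(out.file)                  # TRUE
```

If you forget the call to dev.off() the file is usually incomplete, because the device is still open.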

C.3 More lines

The function abline() allows you to add lines to an existing plot.

Exercises

1. Use the help page for abline to produce a graphic exactly like that below. Hint: start with > plot(c(1,5),c(1,5)).

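[Figure: plot of the two points (1,1) and (5,5), with both axes running from 1 to 5 and labelled c(1, 5), overlaid with a horizontal line, a vertical line and two diagonal lines that all cross at (3,3).]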

> abline(h=3)

> abline(v=3)

> abline(a=0,b=1)

> abline(a=6,b=-1)

2. Simulate a sample of size 100 from a Gamma distribution with shape 3 and rate 1. Use the hist() function to plot a histogram of the variable, adding a suitable title and axis labels in the same way as you did for plot().

> v<-rgamma(100,3,1)

> hist(v,main="Gamma(3,1)",xlab="value",ylab="Freq")

3. Using the qgamma() function, find the true upper and lower quartiles. Use the abline() function to draw vertical lines on the histogram representing the quartiles. Perhaps you should also make the lines a different style so that they stand out?

> lower <- qgamma(0.25,3,1)

> upper <- qgamma(0.75,3,1)

> abline(v=lower,lty=2)

> abline(v=upper,lty=2)

D R Objects (4) - lists

A list is the most general object available in R. Just like a data frame, a list can contain character and numeric data. But list elements may also be vectors, matrices, or even other lists! A list is created from existing objects as follows:

> a<-5

> b<-"hello"

> c<-1:4

> d<-matrix(1:6,ncol=3)

> l<-list(sc=a,ch=b,vec=c,mat=d)

> l

$sc

[1] 5

$ch

[1] "hello"

$vec

[1] 1 2 3 4

$mat

     [,1] [,2] [,3]

[1,]    1    3    5

[2,]    2    4    6

As for data frames, each element can be accessed (and altered) via the $ sign:

> l$vec

[1] 1 2 3 4

Unlike data frames, however, lists cannot be accessed by matrix-like subscripts. Instead individual elements may be specified numerically using the [[ ]] symbol:

> l[[3]]

[1] 1 2 3 4
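The difference between single and double brackets is easy to miss. In the sketch below (the list is rebuilt so that the example is self-contained, and the names sub, elt and byname are chosen only for illustration), l[3] returns a list of length one, whereas l[[3]] and l[["vec"]] return the vector itself.

```r
l <- list(sc = 5, ch = "hello", vec = 1:4, mat = matrix(1:6, ncol = 3))

sub <- l[3]             # single brackets: still a list, holding one element
elt <- l[[3]]           # double brackets: the element itself
byname <- l[["vec"]]    # double brackets also accept a name

is.list(sub)            # TRUE
is.list(elt)            # FALSE
identical(elt, byname)  # TRUE

l$vec[2] <- 10L         # an element can be altered via the $ sign
l$vec
```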

Finally, blank lists can be created and filled in later.

> l2<-vector("list",3)

> l2


[[1]]

NULL

[[2]]

NULL

[[3]]

NULL

> l2[[2]]<-"MSc"

> names(l2)

NULL

> names(l2)<-c("year","course","name")

> l2

$year

NULL

$course

[1] "MSc"

$name

NULL

There are no specific exercises here for lists - you will be using them again and again when you come to write your own functions!
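One typical use, sketched here with a hypothetical function name, is to return several results of different kinds from a single function call.

```r
# Hypothetical helper: several summaries of a numeric vector in one list
summarise.sample <- function(x) {
  out <- list(n = length(x), mean = mean(x), range = range(x))
  return(out)
}

res <- summarise.sample(c(2, 4, 9))
res$n        # 3
res$mean     # 5
res$range    # 2 9
```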


E Functions - fast looping, local variables, vector decisions

E.1 Looping versus vectors *

Sometimes an algorithm can be carried out using either a for loop or vector arithmetic. For example, the factorial function could also have been written using the vector function prod() as

myfac.v<-function(n) {

vec<-1:n

product<-prod(vec)

return(product)

}

The function prod() simply executes a for loop over its vector argument. But it does this in the C programming language, which is much, much faster than R.

Exercise

Write a function to add the cube roots of the numbers from 1 to ten million using a for loop, and write a second function which adds these numbers using a vector and the function sum(). Compare the speeds.
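One possible approach is sketched below. The function names sumcube.loop and sumcube.vec are made up for illustration, and system.time() is used for the timing. Timings depend on the machine, so none are quoted here; n is set to one million rather than ten million only so that the sketch runs quickly.

```r
# Sum of the cube roots of 1, ..., n using an explicit for loop
sumcube.loop <- function(n) {
  total <- 0
  for (i in 1:n) {
    total <- total + i^(1/3)
  }
  return(total)
}

# The same sum using vector arithmetic and sum()
sumcube.vec <- function(n) {
  return(sum((1:n)^(1/3)))
}

n <- 1e6   # increase to 1e7 to match the exercise
system.time(a <- sumcube.loop(n))
system.time(b <- sumcube.vec(n))
all.equal(a, b)   # the two answers should agree up to rounding error
```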

E.2 Local variables *

Type and load the following

silly<-function() {

zoopzoop<-10

print(zoopzoop)

}

Now try

> zoopzoop

> silly()

> zoopzoop

You should find that both of the lines > zoopzoop produced errors. The variable zoopzoop is defined within the function silly(). Variables defined within a function can generally only be used within that function; they are therefore local to that function and are called local variables. If you wish to use the value of zoopzoop within your main R workspace you should change the function definition to


silly<-function() {

zoopzoop<-10

print(zoopzoop)

return(zoopzoop)

}

You can assign any variable name you like to the output of silly(), e.g.

> myzoopzoop<-silly()

> myzoopzoop
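The same behaviour can be checked without triggering an error by using exists(). This is only a sketch, and it assumes a fresh workspace in which no object called zoopzoop has been defined; the names before and after are chosen for illustration.

```r
silly <- function() {
  zoopzoop <- 10      # local to silly()
  return(zoopzoop)
}

before <- exists("zoopzoop")   # FALSE in a fresh workspace
myzoopzoop <- silly()
after <- exists("zoopzoop")    # still FALSE: the call did not create it

c(before, after)
myzoopzoop                     # the returned value is available under the new name
```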

E.3 More complex if tests

You can use the logical AND (&&), OR (||) and NOT (!) operators in if tests. For example:

scary.date<-function(day,date) {

if ((day=="Friday") && (date==13)) {

print("It’s a scary day!")

}

else if ((day=="Friday") || (date==13)){

print("Phew, that was close!")

}

else {

print("All clear!")

}

}

The above silly program contains a slight subtlety. At a console window type

> ("Friday"=="Friday") || (13==13)

The result is TRUE - so why doesn’t scary.date("Friday",13) print out both of the first two messages?

R also allows nested if tests. For example scary.date() could also have been written as

scary.date<-function(day,date) {

if (day=="Friday") {

if (date==13) {

print("It’s a scary day!")

}

else {

print("Phew, that was close!")

}

}

else {

if (date==13) {

print("Phew, that was close!")

}

else {

print("All clear!")

}

}

}

Make sure you understand why this produces exactly the same functionality.

Sometimes it is simpler to use nested tests and sometimes it is simpler to use else if, && or ||.
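One way to convince yourself that the two styles agree is to run both over every combination of cases. The sketch below uses hypothetical names (scary.flat and scary.nested) and returns shortened messages instead of printing them, so that the results can be compared directly; it assumes both "close" branches of the nested version give the same message.

```r
# else-if version, returning a short message instead of printing
scary.flat <- function(day, date) {
  if ((day == "Friday") && (date == 13)) {
    msg <- "scary"
  } else if ((day == "Friday") || (date == 13)) {
    msg <- "close"
  } else {
    msg <- "clear"
  }
  return(msg)
}

# nested version
scary.nested <- function(day, date) {
  if (day == "Friday") {
    if (date == 13) {
      msg <- "scary"
    } else {
      msg <- "close"
    }
  } else {
    if (date == 13) {
      msg <- "close"
    } else {
      msg <- "clear"
    }
  }
  return(msg)
}

# The four combinations: Friday or not, 13th or not
days <- c("Friday", "Friday", "Monday", "Monday")
dates <- c(13, 12, 13, 12)
agree <- rep(FALSE, 4)
for (i in 1:4) {
  agree[i] <- (scary.flat(days[i], dates[i]) == scary.nested(days[i], dates[i]))
}
agree
```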

Exercise* Your recently debugged is.prime() function currently checks through all numbers between 2 and √k; if any divide exactly into k then it knows that the number is not prime. Clearly we actually know that k is not prime as soon as it finds the first number that divides exactly into it - the loop does not need to continue. When checking many numbers k to decide whether or not they are prime, making this change will lead to a massive efficiency saving. Alter your function so that it uses a while loop instead of a for loop. Hint: the while loop will need to check that the counter, i, has not exceeded √k and that no divisor has yet been found.

Solution:

is.prime<-function(k) {

highest<-floor(sqrt(k)) # only check up to [k^(1/2)]

factor<-FALSE # becomes TRUE as soon as a divisor is found

i<-2

while ((!factor) && (i<=highest)) {

factor<-(k/i==floor(k/i)) # TRUE if i divides exactly into k

i<-i+1

}

return(!factor)

}

E.4 Vector logical tests *

The above if and while tests only act on scalar logicals. R’s treatment of vector logicals is very powerful, but care must be taken!
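As a small sketch of the distinction: the single-character operators & and | work element by element on logical vectors, whereas && and || are intended for single values. Functions such as any() and all() reduce a logical vector to one value that an if test can use. The vector x and the names big and odd are chosen only for illustration.

```r
x <- c(2, 7, 11, 4)

big <- (x > 3)            # elementwise comparison gives a logical vector
odd <- (x %% 2 == 1)

big & odd                 # elementwise AND: FALSE TRUE TRUE FALSE
big | odd                 # elementwise OR:  FALSE TRUE TRUE TRUE

# if() needs a single TRUE or FALSE, so reduce the vector first
if (any(big & odd)) {
  print("at least one element is both big and odd")
}
all(big)                  # FALSE, because 2 is not greater than 3

sum(big)                  # logicals act as 0/1 in arithmetic: 3
```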


Imagine we wish to simulate 1,000,000 instances from a mixture distribution

\[
W = \begin{cases} W_1 & \text{with probability } p \\ W_2 & \text{with probability } 1 - p \end{cases}
\]

where $W_1 \sim N(105, 7^2)$ and $W_2 \sim N(130, 20^2)$. This corresponds to weights (in grams) of adult Peregrine Falcons, males ($W_1$) being lighter on average than females ($W_2$). If 30% of the falcons are male then $p = 0.3$.

We could simply write the following function

# Simulate n times from a mixture of two normal distributions

mixture1<-function(p,n,mu1,sigma1,mu2,sigma2) {

z<-rep(0,n)

for (i in 1:n) {

u<-runif(1)

if (u <= p) {

z[i]<-rnorm(1,mean=mu1,sd=sigma1)

}

else {

z[i]<-rnorm(1,mean=mu2,sd=sigma2)

}

}

return(z)

}

Try this out and note how long the code takes to run!

> a<-mixture1(.3,1000000,105,7,130,20)

> hist(a)

Now take a few minutes to recall Section 11.2.4 on the use of logicals as numbers. The following function also returns a sample from the required mixture distribution but does it much, much quicker as all the loops are internal and are therefore in C.

# Simulate n times from a mixture of two normal distributions

mixture2<-function(p,n,mu1,sigma1,mu2,sigma2) {

u<-runif(n)

z.male<-rnorm(n,mean=mu1,sd=sigma1)

z.female<-rnorm(n,mean=mu2,sd=sigma2)

z<- (u <= p) * z.male + (u > p) * z.female

return(z)

}
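For comparison, the same selection can be written with the base R function ifelse(), which picks elementwise from two vectors according to a logical vector. This is an alternative sketch rather than the approach used above: the name mixture3 is hypothetical, the seed is arbitrary, and a smaller n is used so that it runs quickly. The sample mean is compared with the mixture mean p*mu1 + (1-p)*mu2.

```r
# Simulate n times from a mixture of two normal distributions, using ifelse()
mixture3 <- function(p, n, mu1, sigma1, mu2, sigma2) {
  u <- runif(n)
  z.male <- rnorm(n, mean = mu1, sd = sigma1)
  z.female <- rnorm(n, mean = mu2, sd = sigma2)
  z <- ifelse(u <= p, z.male, z.female)   # choose elementwise between the two
  return(z)
}

set.seed(42)                               # arbitrary seed
w <- mixture3(0.3, 100000, 105, 7, 130, 20)

# Mixture mean: 0.3 * 105 + 0.7 * 130 = 122.5
mean(w)
```

Like mixture2, this draws both a "male" and a "female" value for every bird and discards one of them; that wasted work is the price paid for avoiding the R-level loop.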
