a brief guide to r for beginners in econometrics

33
A Brief Guide to R for Beginners in Econometrics Mahmood Arai Department of Economics, Stockholm University First Version: 2002-11-05, This Version: 2009-09-02 1 Introduction 1.1 About R R is published under the GPL (GNU Public License) and exists for all major platforms. R is described on the R Homepage as follows: ”R is a free software environment for statistical computing and graphics. It compiles and runs on a wide variety of UNIX platforms, Windows and MacOS. To download R, please choose your preferred CRAN-mirror.” See R Homepage for manuals and documentations. There are a number of books on R. See http://www.R-project.org/doc/bib/R.bib for a bibliography of the R-related publications. Dalgaard [2008] and Fox [2002] are nice introductory books. For an advanced book see Venables and Ripley [2002] which is a classic reference. Kleiber and Zeileis [2008] covers econometrics. See also CRAN Task View: Computational Econometrics. For weaving R and L A T E Xsee Sweave http: //www.stat.uni-muenchen.de/~leisch/Sweave/. For reproducible research using R see Koenker and Zeileis [2007]. To cite R in publications you can refer to: R Development Core Team [2008] 1.2 About these pages This is a brief manual for beginners in Econometrics. For latest version see http: //people.su.se/~ma/R_intro/R_intro.pdf. For the Sweave file producing these pages see http://people.su.se/~ma/R_intro/R_intro.Rnw. The sym- bol # is used for comments. Thus all text after # in a line is a comment. Lines following > are R-commands executed at the R prompt which as standard looks like >. This is an example: > myexample <- "example" > myexample [1] "example" 1

Upload: others

Post on 12-Sep-2021

6 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: A Brief Guide to R for Beginners in Econometrics

A Brief Guide to R for Beginners in Econometrics

Mahmood AraiDepartment of Economics Stockholm University

First Version 2002-11-05 This Version 2009-09-02

1 Introduction

11 About R

R is published under the GPL (GNU Public License) and exists for all majorplatforms R is described on the R Homepage as follows

rdquoR is a free software environment for statistical computing andgraphics It compiles and runs on a wide variety of UNIX platformsWindows and MacOS To download R please choose your preferredCRAN-mirrorrdquo

See R Homepage for manuals and documentations There are a number of bookson R See httpwwwR-projectorgdocbibRbib for a bibliography ofthe R-related publications Dalgaard [2008] and Fox [2002] are nice introductorybooks For an advanced book see Venables and Ripley [2002] which is a classicreference Kleiber and Zeileis [2008] covers econometrics See also CRAN TaskView Computational Econometrics For weaving R and LATEXsee Sweave httpwwwstatuni-muenchende~leischSweave For reproducible researchusing R see Koenker and Zeileis [2007] To cite R in publications you can referto R Development Core Team [2008]

12 About these pages

This is a brief manual for beginners in Econometrics For latest version see httppeoplesuse~maR_introR_intropdf For the Sweave file producingthese pages see httppeoplesuse~maR_introR_introRnw The sym-bol is used for comments Thus all text after in a line is a comment Linesfollowing gt are R-commands executed at the R prompt which as standardlooks like gt This is an example

gt myexample lt- example

gt myexample

[1] example

1

R-codes including comments of codes that are not executed are indented asfollows

myexample lt- example creates an object named ltmyexamplegt

myexample

The characters within lt gt refer to verbatim names of files functions etc when itis necessary for clarity The names ltmysomethinggt such as ltmydatagtltmyobjectgtare used to refer to a general dataframe object etc

13 Objects and files

R regards things as objects A dataset vector matrix results of a regressiona plot etc are all objects One or several objects can be saved in a file A filecontaining R-data is not an object but a set of objects

Basically all commands you use are functions A command something(object)does something on an object This means that you are going to write lots ofparentheses Check that they are there and check that they are in the rightplace

2 First things

21 Installation

R exists for several platforms and can be downloaded from [CRAN-mirror]

22 Working with R

It is a good idea to create a directory for a project and start R from there Thismakes it easy to save your work and find it in later sessions

If you want R to start in a certain directory in MS-Windows you have tospecify the ltstart in directorygt to be your working directory This is doneby changing the ltpropertiesgt by clicking on the right button of the mousewhile pointing at your R-icon and then going to ltpropertiesgt

Displaying the working directory within R

gt getwd()

[1] homema1RR_beginR_Brief_Guide

Changing the working directory to an existing directory lthomemaproject1gt

setwd(homemaproject1)

2

23 Naming in R

Do not name an object as ltmy_objectgt or ltmy-objectgt use instead ltmyobjectgtNotice that in R ltmyobjectgt and ltMyobjectgt are two different names Namesstarting with a digit (lt1agt) is not accepted You can instead use lta1gt)

You should not use names of variables in a data-frame as names of objects Ifyou do so the object will shadow the variable with the same name in anotherobject The problem is then that when you call this variable you will get theobject ndash the object shadows the variable the variable will be masked by theobject with the same name

To avoid this problem

1- Do not give a name to an object that is identical to the name of a variablein your data frames

2- If you are not able to follow this rule refer to variables by referring to thevariable and the dataset that includes the variable For example the variableltwagegt in the data frame ltdf1gt is called by

df1$wage

The problem of rdquoshadowingrdquo concerns R functions as well Do not use objectnames that are the same as R functions ltconflicts(detail=TRUE)gt checkswhether an object you have created conflicts with another object in the Rpackages and lists them You should only care about those that are listedunder ltGlobalEnvgt ndash objects in your workspace All objects listed underltGlobalEnvgt shadows objects in R packages and should be removed in orderto be able to use the objects in the R packages

The following example creates ltTgt that should be avoided (since ltTgt stands forltTRUEgt) checks conflicts and resolves the conflict by removing ltTgt

T lt- time

conflicts(detail=TRUE)

rm(T)

conflicts(detail=TRUE)

You should avoid using the following one-letter words ltcCDFIqtTgt asnames They have special meanings in R

Extensions for files

It is a good practice to use the extension ltRgt for your files including R-codesA file ltlab1Rgt is then a text-file including R-codes

The extension ltrdagt is appropriate for work images (ie files created by ltsave()gt)The file ltlab1rdagt is then a file including R-objects

The default name for the saved work image is ltRDatagt Be careful not toname a file as ltRDatagt when you use ltRDatagt as extension since you will thenoverwrite the ltRdatagt file

3

24 Saving and loading objects and images of workingspaces

Download the file ltDataWageMacrordagt httppeoplesuse~maR_introdata

You can read the file ltDataWageMacrordagt containing the data frames ltlnugtand ltmacrogt as follows

load(DataWageMacrorda)

ls() lists the objects

The following command saves the object ltlnugt in a file ltmydatardagt

save(lnu file=mydatarda)

To save an image of the your workspace that will be automatically loaded whenyou next time start R in the same directory

saveimage()

You can also save your working image by answering ltyesgt when you quit andare askedltSave workspace image [ync]gt

In this way the image of your workspace is saved in the hidden file ltRDatagt

You can save an image of the current workspace and give it a name ltmyimagerdagt

saveimage(myimagerda)

25 Overall options

ltoptions()gt can be used to set a number of options that governs various aspectsof computations and displaying results

Here are some useful options We start by setting the line with to 60 characters

gt options(width = 60)

options(prompt= Rgt ) changes the prompt to lt Rgt gt

options(scipen=3) From R version 18 This option

tells R to display numbers in fixed format instead of

in exponential form for example lt1446257064291gt instead of

lt1446257e+12gt as the result of ltexp(28)gt

options() displays the options

4

26 Getting Help

helpstart() invokes the help pages

help(lm) help on ltlmgt linear model

lm same as above

3 Elementary commands

ls() Lists all objects

lsstr() Lists details of all objects

str(myobject) Lists details of ltmyobjectgt

listfiles() Lists all files in the current directory

dir() Lists all files in the current directory

myobject Prints simply the object

rm(myobject) removes the object ltmyobjectgt

rm(list=ls()) removes all the objects in the working space

save(myobject file=myobjectrda)

saves the object ltmyobjectgt in a file ltmyobjectrdagt

load(myworkrda) loads myworkrda into memory

summary(mydata) Prints the simple statistics for ltmydatagt

hist(xfreq=TRUE) Prints a histogram of the object ltxgt

ltfreq=TRUEgt yields frequency and

ltfreq=FALSEgt yields probabilities

q() Quits R

The output of a command can be directed in an object by using lt lt- gt anobject is then assigned a value The first line in the following code chunk createsvector named ltVVgt with a values 12 and 3 The second line creates an objectnamed ltVVgt and prints the contents of the object ltVVgt

gt VV lt- c(1 2 3)

gt (VV lt- 12)

[1] 1 2

4 Data management

41 Reading data in plain text format

Data in columns

The data in this example are from a text file lttmptxtgt containing the variablenames in the first line (separated with a space) and the values of these variables(separated with a space) in the following lines

5

The following reads the contents of the file lttmptxtgt and assigns it to an objectnamed ltdatgt

gt FILE lt- httppeoplesuse~maR_introdatatmptxt

gt dat lt- readtable(file = FILE header = TRUE)

gt dat

wage school public female1 94 8 1 02 75 7 0 03 80 11 1 04 70 16 0 05 75 8 1 06 78 11 1 07 103 11 0 08 53 8 0 09 99 8 1 0

The argument ltheader = TRUEgt indicates that the first line includes the namesof the variables The object ltdatgt is a data-frame as it is called in R

If the columns of the data in the file lttmptxtgt were separated by ltgt thesyntax would be

readtable(tmptxt header = TRUE sep=)

Note that if your decimal character is not ltgt you should specify it If thedecimal character is ltgt you can use ltreadcsvgt and specify the following ar-gument in the function ltdec=gt

42 Non-available and delimiters in tabular data

We have a file ltdata1txtgt with the following contents

1 9

6 3 2

where the first observation on the second column (variable) is a missing valuecoded as ltgt To tell R that ltgt is a missing value you use the argumentltnastrings=gt

gt FILE lt- httppeoplesuse~maR_introdatadata1txt

gt readtable(file = FILE nastrings = )

V1 V2 V31 1 NA 92 6 3 2

6

Sometimes columns are separated by other separators than spaces The sepa-rator might for example be ltgt in which case we have to use the argumentltsep=gt

Be aware that if the columns are separated by ltgt and there are spaces in somecolumns like the case below the ltnastrings=gt does not work The NA isactually coded as two spaces a point and two spaces and should be indicatedas ltnastrings= gt

1 9

6 3 2

Sometimes missing value is simply ltblankgt as follows

1 9

6 3 2

Notice that there are two spaces between 1 and 9 in the first line implying thatthe value in the second column is blank This is a missing value Here it isimportant to specify ltsep= gt along with ltnastrings=gt

43 Reading and writing data in other formats

Attach the library ltforeigngt in order to read data in various standard packagesdata formats Examples are SAS SPSS STATA etc

library(foreign)

reads the data ltwagedtagt and put it in the object ltlnugt

lnu lt- readdta(file=wagedta)

ltreadssd()gt ltreadspss()gt etc are other commands in the foreign packagefor reading data in SAS and SPSS format

It is also easy to write data in a foreign format The following codes writes theobject ltlnugt to stata-file ltlnunewdtagt

library(foreign)

writedta(lnulnunewdta)

44 Examining the contents of a data-frame object

Here we use data from Swedish Level of Living Surveys LNU 1991

gt FILE lt- httppeoplesuse~maR_introdatalnu91txt

gt lnu lt- readtable(file = FILE header = TRUE)

7

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 2: A Brief Guide to R for Beginners in Econometrics

R-codes including comments of codes that are not executed are indented asfollows

myexample lt- example creates an object named ltmyexamplegt

myexample

The characters within lt gt refer to verbatim names of files functions etc when itis necessary for clarity The names ltmysomethinggt such as ltmydatagtltmyobjectgtare used to refer to a general dataframe object etc

13 Objects and files

R regards things as objects A dataset vector matrix results of a regressiona plot etc are all objects One or several objects can be saved in a file A filecontaining R-data is not an object but a set of objects

Basically all commands you use are functions A command something(object)does something on an object This means that you are going to write lots ofparentheses Check that they are there and check that they are in the rightplace

2 First things

21 Installation

R exists for several platforms and can be downloaded from [CRAN-mirror]

22 Working with R

It is a good idea to create a directory for a project and start R from there Thismakes it easy to save your work and find it in later sessions

If you want R to start in a certain directory in MS-Windows you have tospecify the ltstart in directorygt to be your working directory This is doneby changing the ltpropertiesgt by clicking on the right button of the mousewhile pointing at your R-icon and then going to ltpropertiesgt

Displaying the working directory within R

gt getwd()

[1] homema1RR_beginR_Brief_Guide

Changing the working directory to an existing directory lthomemaproject1gt

setwd(homemaproject1)

2

23 Naming in R

Do not name an object as ltmy_objectgt or ltmy-objectgt use instead ltmyobjectgtNotice that in R ltmyobjectgt and ltMyobjectgt are two different names Namesstarting with a digit (lt1agt) is not accepted You can instead use lta1gt)

You should not use names of variables in a data-frame as names of objects Ifyou do so the object will shadow the variable with the same name in anotherobject The problem is then that when you call this variable you will get theobject ndash the object shadows the variable the variable will be masked by theobject with the same name

To avoid this problem

1- Do not give a name to an object that is identical to the name of a variablein your data frames

2- If you are not able to follow this rule refer to variables by referring to thevariable and the dataset that includes the variable For example the variableltwagegt in the data frame ltdf1gt is called by

df1$wage

The problem of rdquoshadowingrdquo concerns R functions as well Do not use objectnames that are the same as R functions ltconflicts(detail=TRUE)gt checkswhether an object you have created conflicts with another object in the Rpackages and lists them You should only care about those that are listedunder ltGlobalEnvgt ndash objects in your workspace All objects listed underltGlobalEnvgt shadows objects in R packages and should be removed in orderto be able to use the objects in the R packages

The following example creates ltTgt that should be avoided (since ltTgt stands forltTRUEgt) checks conflicts and resolves the conflict by removing ltTgt

T lt- time

conflicts(detail=TRUE)

rm(T)

conflicts(detail=TRUE)

You should avoid using the following one-letter words ltcCDFIqtTgt asnames They have special meanings in R

Extensions for files

It is a good practice to use the extension ltRgt for your files including R-codesA file ltlab1Rgt is then a text-file including R-codes

The extension ltrdagt is appropriate for work images (ie files created by ltsave()gt)The file ltlab1rdagt is then a file including R-objects

The default name for the saved work image is ltRDatagt Be careful not toname a file as ltRDatagt when you use ltRDatagt as extension since you will thenoverwrite the ltRdatagt file

3

24 Saving and loading objects and images of workingspaces

Download the file ltDataWageMacrordagt httppeoplesuse~maR_introdata

You can read the file ltDataWageMacrordagt containing the data frames ltlnugtand ltmacrogt as follows

load(DataWageMacrorda)

ls() lists the objects

The following command saves the object ltlnugt in a file ltmydatardagt

save(lnu file=mydatarda)

To save an image of the your workspace that will be automatically loaded whenyou next time start R in the same directory

saveimage()

You can also save your working image by answering ltyesgt when you quit andare askedltSave workspace image [ync]gt

In this way the image of your workspace is saved in the hidden file ltRDatagt

You can save an image of the current workspace and give it a name ltmyimagerdagt

saveimage(myimagerda)

25 Overall options

ltoptions()gt can be used to set a number of options that governs various aspectsof computations and displaying results

Here are some useful options We start by setting the line with to 60 characters

gt options(width = 60)

options(prompt= Rgt ) changes the prompt to lt Rgt gt

options(scipen=3) From R version 18 This option

tells R to display numbers in fixed format instead of

in exponential form for example lt1446257064291gt instead of

lt1446257e+12gt as the result of ltexp(28)gt

options() displays the options

4

26 Getting Help

helpstart() invokes the help pages

help(lm) help on ltlmgt linear model

lm same as above

3 Elementary commands

ls() Lists all objects

lsstr() Lists details of all objects

str(myobject) Lists details of ltmyobjectgt

listfiles() Lists all files in the current directory

dir() Lists all files in the current directory

myobject Prints simply the object

rm(myobject) removes the object ltmyobjectgt

rm(list=ls()) removes all the objects in the working space

save(myobject file=myobjectrda)

saves the object ltmyobjectgt in a file ltmyobjectrdagt

load(myworkrda) loads myworkrda into memory

summary(mydata) Prints the simple statistics for ltmydatagt

hist(xfreq=TRUE) Prints a histogram of the object ltxgt

ltfreq=TRUEgt yields frequency and

ltfreq=FALSEgt yields probabilities

q() Quits R

The output of a command can be directed in an object by using lt lt- gt anobject is then assigned a value The first line in the following code chunk createsvector named ltVVgt with a values 12 and 3 The second line creates an objectnamed ltVVgt and prints the contents of the object ltVVgt

gt VV lt- c(1 2 3)

gt (VV lt- 12)

[1] 1 2

4 Data management

41 Reading data in plain text format

Data in columns

The data in this example are from a text file lttmptxtgt containing the variablenames in the first line (separated with a space) and the values of these variables(separated with a space) in the following lines

5

The following reads the contents of the file lttmptxtgt and assigns it to an objectnamed ltdatgt

gt FILE lt- httppeoplesuse~maR_introdatatmptxt

gt dat lt- readtable(file = FILE header = TRUE)

gt dat

wage school public female1 94 8 1 02 75 7 0 03 80 11 1 04 70 16 0 05 75 8 1 06 78 11 1 07 103 11 0 08 53 8 0 09 99 8 1 0

The argument ltheader = TRUEgt indicates that the first line includes the namesof the variables The object ltdatgt is a data-frame as it is called in R

If the columns of the data in the file lttmptxtgt were separated by ltgt thesyntax would be

readtable(tmptxt header = TRUE sep=)

Note that if your decimal character is not ltgt you should specify it If thedecimal character is ltgt you can use ltreadcsvgt and specify the following ar-gument in the function ltdec=gt

42 Non-available and delimiters in tabular data

We have a file ltdata1txtgt with the following contents

1 9

6 3 2

where the first observation on the second column (variable) is a missing valuecoded as ltgt To tell R that ltgt is a missing value you use the argumentltnastrings=gt

gt FILE lt- httppeoplesuse~maR_introdatadata1txt

gt readtable(file = FILE nastrings = )

V1 V2 V31 1 NA 92 6 3 2

6

Sometimes columns are separated by other separators than spaces The sepa-rator might for example be ltgt in which case we have to use the argumentltsep=gt

Be aware that if the columns are separated by ltgt and there are spaces in somecolumns like the case below the ltnastrings=gt does not work The NA isactually coded as two spaces a point and two spaces and should be indicatedas ltnastrings= gt

1 9

6 3 2

Sometimes missing value is simply ltblankgt as follows

1 9

6 3 2

Notice that there are two spaces between 1 and 9 in the first line implying thatthe value in the second column is blank This is a missing value Here it isimportant to specify ltsep= gt along with ltnastrings=gt

43 Reading and writing data in other formats

Attach the library ltforeigngt in order to read data in various standard packagesdata formats Examples are SAS SPSS STATA etc

library(foreign)

reads the data ltwagedtagt and put it in the object ltlnugt

lnu lt- readdta(file=wagedta)

ltreadssd()gt ltreadspss()gt etc are other commands in the foreign packagefor reading data in SAS and SPSS format

It is also easy to write data in a foreign format The following codes writes theobject ltlnugt to stata-file ltlnunewdtagt

library(foreign)

writedta(lnulnunewdta)

44 Examining the contents of a data-frame object

Here we use data from Swedish Level of Living Surveys LNU 1991

gt FILE lt- httppeoplesuse~maR_introdatalnu91txt

gt lnu lt- readtable(file = FILE header = TRUE)

7

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 3: A Brief Guide to R for Beginners in Econometrics

23 Naming in R

Do not name an object as ltmy_objectgt or ltmy-objectgt use instead ltmyobjectgtNotice that in R ltmyobjectgt and ltMyobjectgt are two different names Namesstarting with a digit (lt1agt) is not accepted You can instead use lta1gt)

You should not use names of variables in a data-frame as names of objects Ifyou do so the object will shadow the variable with the same name in anotherobject The problem is then that when you call this variable you will get theobject ndash the object shadows the variable the variable will be masked by theobject with the same name

To avoid this problem

1- Do not give a name to an object that is identical to the name of a variablein your data frames

2- If you are not able to follow this rule refer to variables by referring to thevariable and the dataset that includes the variable For example the variableltwagegt in the data frame ltdf1gt is called by

df1$wage

The problem of rdquoshadowingrdquo concerns R functions as well Do not use objectnames that are the same as R functions ltconflicts(detail=TRUE)gt checkswhether an object you have created conflicts with another object in the Rpackages and lists them You should only care about those that are listedunder ltGlobalEnvgt ndash objects in your workspace All objects listed underltGlobalEnvgt shadows objects in R packages and should be removed in orderto be able to use the objects in the R packages

The following example creates ltTgt that should be avoided (since ltTgt stands forltTRUEgt) checks conflicts and resolves the conflict by removing ltTgt

T lt- time

conflicts(detail=TRUE)

rm(T)

conflicts(detail=TRUE)

You should avoid using the following one-letter words ltcCDFIqtTgt asnames They have special meanings in R

Extensions for files

It is a good practice to use the extension ltRgt for your files including R-codesA file ltlab1Rgt is then a text-file including R-codes

The extension ltrdagt is appropriate for work images (ie files created by ltsave()gt)The file ltlab1rdagt is then a file including R-objects

The default name for the saved work image is ltRDatagt Be careful not toname a file as ltRDatagt when you use ltRDatagt as extension since you will thenoverwrite the ltRdatagt file

3

24 Saving and loading objects and images of workingspaces

Download the file ltDataWageMacrordagt httppeoplesuse~maR_introdata

You can read the file ltDataWageMacrordagt containing the data frames ltlnugtand ltmacrogt as follows

load(DataWageMacrorda)

ls() lists the objects

The following command saves the object ltlnugt in a file ltmydatardagt

save(lnu file=mydatarda)

To save an image of the your workspace that will be automatically loaded whenyou next time start R in the same directory

saveimage()

You can also save your working image by answering ltyesgt when you quit andare askedltSave workspace image [ync]gt

In this way the image of your workspace is saved in the hidden file ltRDatagt

You can save an image of the current workspace and give it a name ltmyimagerdagt

saveimage(myimagerda)

25 Overall options

ltoptions()gt can be used to set a number of options that governs various aspectsof computations and displaying results

Here are some useful options We start by setting the line with to 60 characters

gt options(width = 60)

options(prompt= Rgt ) changes the prompt to lt Rgt gt

options(scipen=3) From R version 18 This option

tells R to display numbers in fixed format instead of

in exponential form for example lt1446257064291gt instead of

lt1446257e+12gt as the result of ltexp(28)gt

options() displays the options

4

26 Getting Help

helpstart() invokes the help pages

help(lm) help on ltlmgt linear model

lm same as above

3 Elementary commands

ls() Lists all objects

lsstr() Lists details of all objects

str(myobject) Lists details of ltmyobjectgt

listfiles() Lists all files in the current directory

dir() Lists all files in the current directory

myobject Prints simply the object

rm(myobject) removes the object ltmyobjectgt

rm(list=ls()) removes all the objects in the working space

save(myobject file=myobjectrda)

saves the object ltmyobjectgt in a file ltmyobjectrdagt

load(myworkrda) loads myworkrda into memory

summary(mydata) Prints the simple statistics for ltmydatagt

hist(xfreq=TRUE) Prints a histogram of the object ltxgt

ltfreq=TRUEgt yields frequency and

ltfreq=FALSEgt yields probabilities

q() Quits R

The output of a command can be directed in an object by using lt lt- gt anobject is then assigned a value The first line in the following code chunk createsvector named ltVVgt with a values 12 and 3 The second line creates an objectnamed ltVVgt and prints the contents of the object ltVVgt

gt VV lt- c(1 2 3)

gt (VV lt- 12)

[1] 1 2

4 Data management

41 Reading data in plain text format

Data in columns

The data in this example are from a text file lttmptxtgt containing the variablenames in the first line (separated with a space) and the values of these variables(separated with a space) in the following lines

5

The following reads the contents of the file lttmptxtgt and assigns it to an objectnamed ltdatgt

gt FILE lt- httppeoplesuse~maR_introdatatmptxt

gt dat lt- readtable(file = FILE header = TRUE)

gt dat

wage school public female1 94 8 1 02 75 7 0 03 80 11 1 04 70 16 0 05 75 8 1 06 78 11 1 07 103 11 0 08 53 8 0 09 99 8 1 0

The argument ltheader = TRUEgt indicates that the first line includes the namesof the variables The object ltdatgt is a data-frame as it is called in R

If the columns of the data in the file lttmptxtgt were separated by ltgt thesyntax would be

readtable(tmptxt header = TRUE sep=)

Note that if your decimal character is not ltgt you should specify it If thedecimal character is ltgt you can use ltreadcsvgt and specify the following ar-gument in the function ltdec=gt

42 Non-available and delimiters in tabular data

We have a file ltdata1txtgt with the following contents

1 9

6 3 2

where the first observation on the second column (variable) is a missing valuecoded as ltgt To tell R that ltgt is a missing value you use the argumentltnastrings=gt

gt FILE lt- httppeoplesuse~maR_introdatadata1txt

gt readtable(file = FILE nastrings = )

V1 V2 V31 1 NA 92 6 3 2

6

Sometimes columns are separated by other separators than spaces The sepa-rator might for example be ltgt in which case we have to use the argumentltsep=gt

Be aware that if the columns are separated by ltgt and there are spaces in somecolumns like the case below the ltnastrings=gt does not work The NA isactually coded as two spaces a point and two spaces and should be indicatedas ltnastrings= gt

1 9

6 3 2

Sometimes missing value is simply ltblankgt as follows

1 9

6 3 2

Notice that there are two spaces between 1 and 9 in the first line implying thatthe value in the second column is blank This is a missing value Here it isimportant to specify ltsep= gt along with ltnastrings=gt

43 Reading and writing data in other formats

Attach the library ltforeigngt in order to read data in various standard packagesdata formats Examples are SAS SPSS STATA etc

library(foreign)

reads the data ltwagedtagt and put it in the object ltlnugt

lnu lt- readdta(file=wagedta)

ltreadssd()gt ltreadspss()gt etc are other commands in the foreign packagefor reading data in SAS and SPSS format

It is also easy to write data in a foreign format The following codes writes theobject ltlnugt to stata-file ltlnunewdtagt

library(foreign)

writedta(lnulnunewdta)

44 Examining the contents of a data-frame object

Here we use data from Swedish Level of Living Surveys LNU 1991

gt FILE lt- httppeoplesuse~maR_introdatalnu91txt

gt lnu lt- readtable(file = FILE header = TRUE)

7

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 4: A Brief Guide to R for Beginners in Econometrics

24 Saving and loading objects and images of workingspaces

Download the file ltDataWageMacrordagt httppeoplesuse~maR_introdata

You can read the file ltDataWageMacrordagt containing the data frames ltlnugtand ltmacrogt as follows

load(DataWageMacrorda)

ls() lists the objects

The following command saves the object ltlnugt in a file ltmydatardagt

save(lnu file=mydatarda)

To save an image of the your workspace that will be automatically loaded whenyou next time start R in the same directory

saveimage()

You can also save your working image by answering ltyesgt when you quit andare askedltSave workspace image [ync]gt

In this way the image of your workspace is saved in the hidden file ltRDatagt

You can save an image of the current workspace and give it a name ltmyimagerdagt

saveimage(myimagerda)

25 Overall options

ltoptions()gt can be used to set a number of options that governs various aspectsof computations and displaying results

Here are some useful options We start by setting the line with to 60 characters

gt options(width = 60)

options(prompt= Rgt ) changes the prompt to lt Rgt gt

options(scipen=3) From R version 18 This option

tells R to display numbers in fixed format instead of

in exponential form for example lt1446257064291gt instead of

lt1446257e+12gt as the result of ltexp(28)gt

options() displays the options

4

26 Getting Help

helpstart() invokes the help pages

help(lm) help on ltlmgt linear model

lm same as above

3 Elementary commands

ls() Lists all objects

lsstr() Lists details of all objects

str(myobject) Lists details of ltmyobjectgt

listfiles() Lists all files in the current directory

dir() Lists all files in the current directory

myobject Prints simply the object

rm(myobject) removes the object ltmyobjectgt

rm(list=ls()) removes all the objects in the working space

save(myobject file=myobjectrda)

saves the object ltmyobjectgt in a file ltmyobjectrdagt

load(myworkrda) loads myworkrda into memory

summary(mydata) Prints the simple statistics for ltmydatagt

hist(xfreq=TRUE) Prints a histogram of the object ltxgt

ltfreq=TRUEgt yields frequency and

ltfreq=FALSEgt yields probabilities

q() Quits R

The output of a command can be directed in an object by using lt lt- gt anobject is then assigned a value The first line in the following code chunk createsvector named ltVVgt with a values 12 and 3 The second line creates an objectnamed ltVVgt and prints the contents of the object ltVVgt

gt VV lt- c(1 2 3)

gt (VV lt- 12)

[1] 1 2

4 Data management

41 Reading data in plain text format

Data in columns

The data in this example are from a text file lttmptxtgt containing the variablenames in the first line (separated with a space) and the values of these variables(separated with a space) in the following lines

5

The following reads the contents of the file lttmptxtgt and assigns it to an objectnamed ltdatgt

gt FILE lt- httppeoplesuse~maR_introdatatmptxt

gt dat lt- readtable(file = FILE header = TRUE)

gt dat

wage school public female1 94 8 1 02 75 7 0 03 80 11 1 04 70 16 0 05 75 8 1 06 78 11 1 07 103 11 0 08 53 8 0 09 99 8 1 0

The argument ltheader = TRUEgt indicates that the first line includes the namesof the variables The object ltdatgt is a data-frame as it is called in R

If the columns of the data in the file lttmptxtgt were separated by ltgt thesyntax would be

readtable(tmptxt header = TRUE sep=)

Note that if your decimal character is not ltgt you should specify it If thedecimal character is ltgt you can use ltreadcsvgt and specify the following ar-gument in the function ltdec=gt

42 Non-available and delimiters in tabular data

We have a file ltdata1txtgt with the following contents

1 9

6 3 2

where the first observation on the second column (variable) is a missing valuecoded as ltgt To tell R that ltgt is a missing value you use the argumentltnastrings=gt

gt FILE lt- httppeoplesuse~maR_introdatadata1txt

gt readtable(file = FILE nastrings = )

V1 V2 V31 1 NA 92 6 3 2

6

Sometimes columns are separated by other separators than spaces The sepa-rator might for example be ltgt in which case we have to use the argumentltsep=gt

Be aware that if the columns are separated by ltgt and there are spaces in somecolumns like the case below the ltnastrings=gt does not work The NA isactually coded as two spaces a point and two spaces and should be indicatedas ltnastrings= gt

1 9

6 3 2

Sometimes missing value is simply ltblankgt as follows

1 9

6 3 2

Notice that there are two spaces between 1 and 9 in the first line implying thatthe value in the second column is blank This is a missing value Here it isimportant to specify ltsep= gt along with ltnastrings=gt

43 Reading and writing data in other formats

Attach the library ltforeigngt in order to read data in various standard packagesdata formats Examples are SAS SPSS STATA etc

library(foreign)

reads the data ltwagedtagt and put it in the object ltlnugt

lnu lt- readdta(file=wagedta)

ltreadssd()gt ltreadspss()gt etc are other commands in the foreign packagefor reading data in SAS and SPSS format

It is also easy to write data in a foreign format The following codes writes theobject ltlnugt to stata-file ltlnunewdtagt

library(foreign)

writedta(lnulnunewdta)

44 Examining the contents of a data-frame object

Here we use data from Swedish Level of Living Surveys LNU 1991

gt FILE lt- httppeoplesuse~maR_introdatalnu91txt

gt lnu lt- readtable(file = FILE header = TRUE)

7

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 5: A Brief Guide to R for Beginners in Econometrics

26 Getting Help

helpstart() invokes the help pages

help(lm) help on ltlmgt linear model

lm same as above

3 Elementary commands

ls() Lists all objects

lsstr() Lists details of all objects

str(myobject) Lists details of ltmyobjectgt

listfiles() Lists all files in the current directory

dir() Lists all files in the current directory

myobject Prints simply the object

rm(myobject) removes the object ltmyobjectgt

rm(list=ls()) removes all the objects in the working space

save(myobject file=myobjectrda)

saves the object ltmyobjectgt in a file ltmyobjectrdagt

load(myworkrda) loads myworkrda into memory

summary(mydata) Prints the simple statistics for ltmydatagt

hist(xfreq=TRUE) Prints a histogram of the object ltxgt

ltfreq=TRUEgt yields frequency and

ltfreq=FALSEgt yields probabilities

q() Quits R

The output of a command can be directed in an object by using lt lt- gt anobject is then assigned a value The first line in the following code chunk createsvector named ltVVgt with a values 12 and 3 The second line creates an objectnamed ltVVgt and prints the contents of the object ltVVgt

gt VV lt- c(1 2 3)

gt (VV lt- 12)

[1] 1 2

4 Data management

41 Reading data in plain text format

Data in columns

The data in this example are from a text file lttmptxtgt containing the variablenames in the first line (separated with a space) and the values of these variables(separated with a space) in the following lines

5

The following reads the contents of the file lttmptxtgt and assigns it to an objectnamed ltdatgt

gt FILE lt- httppeoplesuse~maR_introdatatmptxt

gt dat lt- readtable(file = FILE header = TRUE)

gt dat

wage school public female1 94 8 1 02 75 7 0 03 80 11 1 04 70 16 0 05 75 8 1 06 78 11 1 07 103 11 0 08 53 8 0 09 99 8 1 0

The argument ltheader = TRUEgt indicates that the first line includes the namesof the variables The object ltdatgt is a data-frame as it is called in R

If the columns of the data in the file lttmptxtgt were separated by ltgt thesyntax would be

readtable(tmptxt header = TRUE sep=)

Note that if your decimal character is not ltgt you should specify it If thedecimal character is ltgt you can use ltreadcsvgt and specify the following ar-gument in the function ltdec=gt

42 Non-available and delimiters in tabular data

We have a file ltdata1txtgt with the following contents

1 9

6 3 2

where the first observation on the second column (variable) is a missing valuecoded as ltgt To tell R that ltgt is a missing value you use the argumentltnastrings=gt

gt FILE lt- httppeoplesuse~maR_introdatadata1txt

gt readtable(file = FILE nastrings = )

V1 V2 V31 1 NA 92 6 3 2

6

Sometimes columns are separated by other separators than spaces The sepa-rator might for example be ltgt in which case we have to use the argumentltsep=gt

Be aware that if the columns are separated by ltgt and there are spaces in somecolumns like the case below the ltnastrings=gt does not work The NA isactually coded as two spaces a point and two spaces and should be indicatedas ltnastrings= gt

1 9

6 3 2

Sometimes missing value is simply ltblankgt as follows

1 9

6 3 2

Notice that there are two spaces between 1 and 9 in the first line implying thatthe value in the second column is blank This is a missing value Here it isimportant to specify ltsep= gt along with ltnastrings=gt

43 Reading and writing data in other formats

Attach the library ltforeigngt in order to read data in various standard packagesdata formats Examples are SAS SPSS STATA etc

library(foreign)

reads the data ltwagedtagt and put it in the object ltlnugt

lnu lt- readdta(file=wagedta)

ltreadssd()gt ltreadspss()gt etc are other commands in the foreign packagefor reading data in SAS and SPSS format

It is also easy to write data in a foreign format The following codes writes theobject ltlnugt to stata-file ltlnunewdtagt

library(foreign)

writedta(lnulnunewdta)

44 Examining the contents of a data-frame object

Here we use data from Swedish Level of Living Surveys LNU 1991

gt FILE lt- httppeoplesuse~maR_introdatalnu91txt

gt lnu lt- readtable(file = FILE header = TRUE)

7

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 6: A Brief Guide to R for Beginners in Econometrics

The following reads the contents of the file lttmptxtgt and assigns it to an objectnamed ltdatgt

gt FILE lt- httppeoplesuse~maR_introdatatmptxt

gt dat lt- readtable(file = FILE header = TRUE)

gt dat

wage school public female1 94 8 1 02 75 7 0 03 80 11 1 04 70 16 0 05 75 8 1 06 78 11 1 07 103 11 0 08 53 8 0 09 99 8 1 0

The argument ltheader = TRUEgt indicates that the first line includes the namesof the variables The object ltdatgt is a data-frame as it is called in R

If the columns of the data in the file lttmptxtgt were separated by ltgt thesyntax would be

readtable(tmptxt header = TRUE sep=)

Note that if your decimal character is not ltgt you should specify it If thedecimal character is ltgt you can use ltreadcsvgt and specify the following ar-gument in the function ltdec=gt

42 Non-available and delimiters in tabular data

We have a file ltdata1txtgt with the following contents

1 9

6 3 2

where the first observation on the second column (variable) is a missing valuecoded as ltgt To tell R that ltgt is a missing value you use the argumentltnastrings=gt

gt FILE lt- httppeoplesuse~maR_introdatadata1txt

gt readtable(file = FILE nastrings = )

V1 V2 V31 1 NA 92 6 3 2

6

Sometimes columns are separated by other separators than spaces The sepa-rator might for example be ltgt in which case we have to use the argumentltsep=gt

Be aware that if the columns are separated by ltgt and there are spaces in somecolumns like the case below the ltnastrings=gt does not work The NA isactually coded as two spaces a point and two spaces and should be indicatedas ltnastrings= gt

1 9

6 3 2

Sometimes missing value is simply ltblankgt as follows

1 9

6 3 2

Notice that there are two spaces between 1 and 9 in the first line implying thatthe value in the second column is blank This is a missing value Here it isimportant to specify ltsep= gt along with ltnastrings=gt

43 Reading and writing data in other formats

Attach the library ltforeigngt in order to read data in various standard packagesdata formats Examples are SAS SPSS STATA etc

library(foreign)

reads the data ltwagedtagt and put it in the object ltlnugt

lnu lt- readdta(file=wagedta)

ltreadssd()gt ltreadspss()gt etc are other commands in the foreign packagefor reading data in SAS and SPSS format

It is also easy to write data in a foreign format The following codes writes theobject ltlnugt to stata-file ltlnunewdtagt

library(foreign)

writedta(lnulnunewdta)

44 Examining the contents of a data-frame object

Here we use data from Swedish Level of Living Surveys LNU 1991

gt FILE lt- httppeoplesuse~maR_introdatalnu91txt

gt lnu lt- readtable(file = FILE header = TRUE)

7

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 7: A Brief Guide to R for Beginners in Econometrics

Sometimes columns are separated by other separators than spaces The sepa-rator might for example be ltgt in which case we have to use the argumentltsep=gt

Be aware that if the columns are separated by ltgt and there are spaces in somecolumns like the case below the ltnastrings=gt does not work The NA isactually coded as two spaces a point and two spaces and should be indicatedas ltnastrings= gt

1 9

6 3 2

Sometimes missing value is simply ltblankgt as follows

1 9

6 3 2

Notice that there are two spaces between 1 and 9 in the first line implying thatthe value in the second column is blank This is a missing value Here it isimportant to specify ltsep= gt along with ltnastrings=gt

43 Reading and writing data in other formats

Attach the library ltforeigngt in order to read data in various standard packagesdata formats Examples are SAS SPSS STATA etc

library(foreign)

reads the data ltwagedtagt and put it in the object ltlnugt

lnu lt- readdta(file=wagedta)

ltreadssd()gt ltreadspss()gt etc are other commands in the foreign packagefor reading data in SAS and SPSS format

It is also easy to write data in a foreign format The following codes writes theobject ltlnugt to stata-file ltlnunewdtagt

library(foreign)

writedta(lnulnunewdta)

44 Examining the contents of a data-frame object

Here we use data from Swedish Level of Living Surveys LNU 1991

gt FILE lt- httppeoplesuse~maR_introdatalnu91txt

gt lnu lt- readtable(file = FILE header = TRUE)

7

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 8: A Brief Guide to R for Beginners in Econometrics

Attaching the ltlnugt data by ltattach(lnu)gt allows you to access the contentsof the dataset ltlnugt by referring to the variable names in the ltlnugt If youhave not attached the ltlnugt you can use ltlnu$femalegt to refer to the variableltfemalegt in the data frame ltlnugt When you do not need to have the dataattached anymore you can undo the ltattach()gt by ltdetach()gt

A description of the contents of the data frame lnu

gt str(lnu)

dataframe 2249 obs of 6 variables$ wage int 81 77 63 84 110 151 59 109 159 71 $ school int 15 12 10 15 16 18 11 12 10 11 $ expr int 17 10 18 16 13 15 19 20 21 20 $ public int 0 1 0 1 0 0 1 0 0 0 $ female int 1 1 1 1 0 0 1 0 1 0 $ industry int 63 93 71 34 83 38 82 50 71 37

gt summary(lnu)

wage school exprMin 1700 Min 400 Min 0001st Qu 6400 1st Qu 900 1st Qu 800Median 7300 Median 1100 Median 1800Mean 8025 Mean 1157 Mean 18593rd Qu 8800 3rd Qu1300 3rd Qu2700Max 28900 Max 2400 Max 5000

public female industryMin 00000 Min 00000 Min 11001st Qu00000 1st Qu00000 1st Qu5000Median 00000 Median 00000 Median 8100Mean 04535 Mean 04851 Mean 69743rd Qu10000 3rd Qu10000 3rd Qu9300Max 10000 Max 10000 Max 9500

45 Creating and removing variables in a data frame

Here we create a variable ltlogwagegt as the logarithm of ltwagegt Then weremove the variable

gt lnu$logwage lt- log(lnu$wage)

gt lnu$logwage lt- NULL

Notice that you do not need to create variables that are simple transforma-tions of the original variables You can do the transformation directly in yourcomputations and estimations

8

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 9: A Brief Guide to R for Beginners in Econometrics

46 Choosing a subset of variables in a data frame

Read a ltsubsetgt of variables (wagefemale) in lnu

lnufemale lt- subset(lnu select=c(wagefemale))

Putting together two objects (or variables) in a data frame

attach(lnu)

lnufemale lt- dataframe(wagefemale)

Read all variables in lnu but female

lnux lt- subset(lnu select=-female)

The following keeps all variables from wage to public as listed above

lnuxx lt- subset(lnu select=wagepublic)

47 Choosing a subset of observations in a dataset

attach(lnu)

Deleting observations that include missing value in a variable

lnu lt- naomit(lnu)

Keeping observations for female only

femdata lt- subset(lnu female==1)

Keeping observations for female and public employees only

fempublicdata lt- subset(lnu female==1 amp public==1)

Choosing all observations where wage gt 90

highwage lt- subset(lnu wage gt 90)

48 Replacing values of variables

We create a variable indicating whether the individual has university educationor not by replacing the values in the schooling variable

Copy the schooling variable

gt lnu$university lt- lnu$school

Replace university value with 0 if years of schooling is less than 13 years

gt lnu$university lt- replace(lnu$university lnu$university lt

+ 13 0)

Replace university value with 1 if years of schooling is greater than 12 years

gt lnu$university lt- replace(lnu$university lnu$university gt

+ 12 1)

9

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 10: A Brief Guide to R for Beginners in Econometrics

The variable ltlnu$universitygt is now a dummy for university education Re-member to re-attach the data set after recoding For creating category variablesyou can use ltcutgt See further the section on ltfactorsgt below

gt attach(lnu warnconflicts = FALSE)

gt table(university)

universityFALSE TRUE1516 733

To create a dummy we could simply proceed as follows

gt university lt- school gt 12

gt table(university)

universityFALSE TRUE1516 733

However we usually do not need to create dummies We can compute onltschool gt 12gt directly

gt table(school gt 12)

FALSE TRUE1516 733

49 Replacing missing values

We create a vector Recode one value as missing value And Then replace themissing with the original value

a lt- c(1234) creates a vector

isna(a) lt- a ==2 recode a==2 as NA

a lt- replace(a isna(a) 2) replaces the NA with 2

or

a[isna(a)] lt- 2

410 Factors

Sometimes our variable has to be redefined to be used as a category variablewith appropriate levels that corresponds to various intervals We might wish tohave schooling categories that corresponds to schooling up to 9 years 10 to 12years and above 12 years This could be coded by using ltcut()gt To includethe lowest category we use the argument ltincludelowest=TRUEgt

10

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 11: A Brief Guide to R for Beginners in Econometrics

gt SchoolLevel lt- cut(school c(9 12 max(school)

+ includelowest = TRUE))

gt table(SchoolLevel)

SchoolLevel(19] (912] (1224]608 908 733

Labels can be set for each level Consider the university variable created in theprevious section

gt SchoolLevel lt- factor(SchoolLevel labels = c(basic

+ gymnasium university))

gt table(SchoolLevel)

SchoolLevelbasic gymnasium university608 908 733

The factor defined as above can for example be used in a regression model Thereference category is the level with the lowest value The lowest value is 1 thatcorresponds to verb+ltBasicgt+ and the column for ltBasicgt is not included inthe contrast matrix Changing the base category will remove another columninstead of this column This is demonstrated in the following example

gt contrasts(SchoolLevel)

gymnasium universitybasic 0 0gymnasium 1 0university 0 1

gt contrasts(SchoolLevel) lt- contrtreatment(levels(SchoolLevel)

+ base = 3)

gt contrasts(SchoolLevel)

basic gymnasiumbasic 1 0gymnasium 0 1university 0 0

The following redefines ltschoolgt as a numeric variable

gt lnu$school lt- asnumeric(lnu$school)

11

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 12: A Brief Guide to R for Beginners in Econometrics

411 Aggregating data by group

Let us create a simple dataset consisting of 3 variables V1 V2 and V3 V1 isthe group identity and V2 and V3 are two numeric variables

gt (df1 lt- dataframe(V1 = 13 V2 = 19 V3 = 1119))

V1 V2 V31 1 1 112 2 2 123 3 3 134 1 4 145 2 5 156 3 6 167 1 7 178 2 8 189 3 9 19

By using the command ltaggregategt we can create a new dataframe consistingof group characteristics such as ltsumgt ltmeangt etc Here the function sum isapplied to ltdf1[23]gt that is the second and third columns of ltdf1gt by thegroup identity ltV1gt

gt (aggregatesumdf1 lt- aggregate(df1[ 23] list(df1$V1)

+ sum))

Group1 V2 V31 1 12 422 2 15 453 3 18 48

gt (aggregatemeandf1 lt- aggregate(df1[ 23] list(df1$V1)

+ mean))

Group1 V2 V31 1 4 142 2 5 153 3 6 16

The variable ltGroup1gt is a factor that identifies groups

The following is an example of using the function aggregate Assume thatyou have a data set ltdatgt including a unit-identifier ltdat$idgt The units areobserved repeatedly over time indicated by a variable dat$Time

gt (dat lt- dataframe(id = rep(1112 each = 2)

+ Time = 12 x = 23 y = 56))

12

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 13: A Brief Guide to R for Beginners in Econometrics

id Time x y1 11 1 2 52 11 2 3 63 12 1 2 54 12 2 3 6

This computes group means for all variables in the data frame and drops the vari-able ltTimegt and the automatically created group-indicator variable ltGroup1gt

gt (Bdat lt- subset(aggregate(dat list(dat$id) FUN = mean)

+ select = -c(Time Group1)))

id x y1 11 25 552 12 25 55

Merge ltBdatgt and ltdat$idgt to create a data set with repeated group averagesfor each observation on ltidgt and of the length as ltidgt

gt (dat2 lt- subset(merge(dataframe(id = dat$id)

+ Bdat) select = -id))

x y1 25 552 25 553 25 554 25 55

Now you can create a data set including the ltidgt and ltTimegt indicators andthe deviation from mean values of all the other variables

gt (withindata lt- cbind(id = dat$id Time = dat$Time

+ subset(dat select = -c(Time id)) - dat2))

id Time x y1 11 1 -05 -052 11 2 05 053 12 1 -05 -054 12 2 05 05

412 Using several data sets

We often need to use data from several datasets In R it is not necessary to putthese data together into a dataset as is the case in many statistical packageswhere only one data set is available at a time and all stored data are in the formof a table

13

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 14: A Brief Guide to R for Beginners in Econometrics

It is for example possible to run a regression using one variable from one dataset and another variable from another dataset as long as these variables havethe same length (same number of observations) and they are in the same order(the ith observation in both variables correspond to the same unit) Considerthe following two datasets

gt data1 lt- dataframe(wage = c(81 77 63 84 110

+ 151 59 109 159 71) female = c(1 1 1

+ 1 0 0 1 0 1 0) id = c(1 3 5 6 7

+ 8 9 10 11 12))

gt data2 lt- dataframe(experience = c(17 10 18

+ 16 13 15 19 20 21 20) id = c(1 3

+ 5 6 7 8 9 10 11 12))

We can use variables from both datasets without merging the datasets Let usregress ltdata1$wagegt on ltdata1$femalegt and ltdata2$experiencegt

gt lm(log(data1$wage) ~ data1$female + data2$experience)

Calllm(formula = log(data1$wage) ~ data1$female + data2$experience)

Coefficients(Intercept) data1$female data2$experience

4641120 -0257909 0001578

We can also put together variables from different data frames into a data frameand do our analysis on these data

gt (data3 lt- dataframe(data1$wage data1$female

+ data2$experience))

data1wage data1female data2experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

We can merge the datasets If we have one common variable in both data setsthe data is merged according to that variable

gt (data4 lt- merge(data1 data2))

14

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 15: A Brief Guide to R for Beginners in Econometrics

id wage female experience1 1 81 1 172 3 77 1 103 5 63 1 184 6 84 1 165 7 110 0 136 8 151 0 157 9 59 1 198 10 109 0 209 11 159 1 2110 12 71 0 20

Notice that unlike some other softwares we do not need the observations toappear in the same order as defined by the ltidgt

If we need to match two data sets using a common variable (column) and thecommon variable have different names in the datasets we either can change thenames to the same name or use the data as they are and specify the variablesthat are to be used for matching in the data sets If the matching variablein ltdata2gt and ltdata1gt are called ltid2gt and ltidgt you can use the followingsyntax

merge(data1data2 byx=id byy=id2)

ltbyx=id byy=id2gt arguments says that id is the matching variablein data1 and id2 is the matching variable in data2

You can also put together the datasets in the existing order with help of ltdataframegtor ltcbindgt The data are then matched observation by observation in the ex-isting order in the data sets This is illustrated by the following example

gt data1noid lt- dataframe(wage = c(81 77 63)

+ female = c(1 0 1))

gt data2noid lt- dataframe(experience = c(17 10

+ 18))

gt cbind(data1noid data2noid)

wage female experience1 81 1 172 77 0 103 63 1 18

If you want to add a number of observations at the end of a data set you useltrbindgt The following example splits the clumns 23 and 4 in ltdata4gt in twoparts and then puts themtogether by ltrbindgt

gt dataonetofive lt- data4[15 24]

gt datasixtoten lt- data4[610 24]

gt rbind(dataonetofive datasixtoten)

15

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 16: A Brief Guide to R for Beginners in Econometrics

wage female experience1 81 1 172 77 1 103 63 1 184 84 1 165 110 0 136 151 0 157 59 1 198 109 0 209 159 1 2110 71 0 20

5 Basic statistics

Summary statistics for all variables in a data frame

summary(mydata)

Mean Median Standard deviation Maximum and Minimum of a variable

mean (myvariable)

median (myvariable)

sd (myvariable)

max (myvariable)

min (myvariable)

compute 10 20 90 percentiles

quantile(myvariable 1910)

When R computes ltsumgt ltmeangt etc on an object containing ltNAgt it returnsltNAgt To be able to apply these functions on observations where data existsyou should add the argument ltnarm=TRUEgt Another alternative is to removeall lines of data containing ltNAgt by ltnaomitgt

gt a lt- c(1 NA 3 4)

gt sum(a)

[1] NA

gt sum(a narm = TRUE)

[1] 8

gt table(a exclude = c())

a1 3 4 ltNAgt1 1 1 1

16

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 17: A Brief Guide to R for Beginners in Econometrics

You can also use ltsum(naomit(a))gt that removes the NA and computes thesum or ltsum(a[isna(a)])gt that sums the elements that are not NA (isna)in ltagt

51 Tabulation

Read a dataset first

Cross Tabulation

gt attach(lnu warnconflicts = FALSE)

gt table(female public)

publicfemale 0 1

0 815 3431 414 677

gt (ftablerow lt- cbind(table(female public) total = table(female)))

0 1 total0 815 343 11581 414 677 1091

gt (ftablecol lt- rbind(table(female public) total = table(public)))

0 10 815 3431 414 677total 1229 1020

Try this

relative freq by rows female

ftablerowc(table(female))

relative freq by columns public

ftablecolrep(table(public)each=3)

rep(table(public)each=3) repeats

each value in table(public) 3 times

Creating various statistics by category The following yields average wage formales and females

gt tapply(wage female mean)

0 18875302 7123190

17

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 18: A Brief Guide to R for Beginners in Econometrics

Using ltlengthgt ltmingt ltmaxgt etc yields number of observations minimummaximum etc for males and females

gt tapply(wage female length)

0 11158 1091

The following example yields average wage for males and females in the privateand public sector

gt tapply(wage list(female public) mean)

0 10 8952883 86909621 7154589 7103988

The following computes the average by group creating a vector of the samelength Same length implies that for the group statistics is retained for allmembers of each group Average wage for males and females

gt attach(lnu warnconflicts = FALSE)

gt lnu$wagebysex lt- ave(wage female FUN = mean)

The function ltmeangt can be substituted with ltmingt ltmaxgt ltlengthgt etcyielding group-wise minimum maximum number of observations etc

6 Matrixes

In R we define a matrix as follows (see matrix in R)

A matrix with 3 rows and 4 columns with elements 1 to 12 filled by columns

gt matrix(112 3 4)

[1] [2] [3] [4][1] 1 4 7 10[2] 2 5 8 11[3] 3 6 9 12

A matrix with 3 rows and 4 columns with elements 123 12 filled by rows

gt (A lt- matrix(112 3 4 byrow = TRUE))

[1] [2] [3] [4][1] 1 2 3 4[2] 5 6 7 8[3] 9 10 11 12

18

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 19: A Brief Guide to R for Beginners in Econometrics

gt dim(A)

[1] 3 4

gt nrow(A)

[1] 3

gt ncol(A)

[1] 4

61 Indexation

The elements of a matrix can be extracted by using brackets after the matrixname and referring to rows and columns separated by a comma You can usethe indexation in a similar way to extract elements of other types of objects

A[3] Extracting the third rowA[3] Extracting the third columnA[33] the third row and the third columnA[-1] the matrix except the first rowA[-2] the matrix except the second column

Evaluating some condition on all elements of a matrix

gt A gt 3

[1] [2] [3] [4][1] FALSE FALSE FALSE TRUE[2] TRUE TRUE TRUE TRUE[3] TRUE TRUE TRUE TRUE

gt A == 3

[1] [2] [3] [4][1] FALSE FALSE TRUE FALSE[2] FALSE FALSE FALSE FALSE[3] FALSE FALSE FALSE FALSE

Listing the elements fulfilling some condition

gt A[A gt 6]

[1] 9 10 7 11 8 12

19

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 20: A Brief Guide to R for Beginners in Econometrics

62 Scalar Matrix

A special type of matrix is a scalar matrix which is a square matrix with thesame number of rows and columns all off-diagonal elements equal to zero andthe same element in all diagonal positions The following exercises demonstratessome matrix facilities regarding the diagonals of matrixes See also uppertriand lowertri

gt diag(2 3 3)

[1] [2] [3][1] 2 0 0[2] 0 2 0[3] 0 0 2

gt diag(diag(2 3 3))

[1] 2 2 2

63 Matrix operators

Transpose of a matrix

Interchanging the rows and columns of a matrix yields the transpose of a matrix

gt t(matrix(16 2 3))

[1] [2][1] 1 2[2] 3 4[3] 5 6

Try matrix(1623) and matrix(1632 byrow=T)

Addition and subtraction

Addition and subtraction can be applied on matrixes of the same dimensions ora scalar and a matrix

Try this

A lt- matrix(11234)

B lt- matrix(-1-1234)

C1 lt- A+B

D1 lt- A-B

Scalar multiplication

Try this

A lt- matrix(11234) TwoTimesA = 2A

c(222)A

c(123)A

c(110)A

20

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 21: A Brief Guide to R for Beginners in Econometrics

Matrix multiplication

For multiplying matrixes R uses ltlowastgt and this works only when the matrixesare conform

E lt- matrix(1933)

crossproductofE lt- t(E)E

Or another and more efficient way of obtaining crossproducts is

crossproductofE lt- crossprod(E)

Matrix inversion

The inverse of a square matrix A denoted as Aminus1 is defined as a matrix thatwhen multiplied with A results in an Identity matrix (1rsquos in the diagonal and0rsquos in all off-diagonal elements)

AAminus1 = Aminus1A = I

FF lt- matrix((19)33)

detFFlt- det(FF) we check the determinant

B lt- matrix((19)^233) create an invertible matrix

Binverse lt- solve(B)

Identitymatrix lt- BBinverse

7 Ordinary Least Squares

The function for running a linear regression model using OLS is ltlm()gt In thefollowing example the dependent variable is ltlog(wage)gt and the explanatoryvariables are ltschoolgt and ltfemalegt An intercept is included by defaultNotice that we do not have to specify the data since the data frame ltlnugtcontaining these variables is attached The result of the regression is assignedto the object named ltregmodelgt This object includes a number of interestingregression results that can be extracted as illustrated further below after someexamples for using ltlmgt

Read a dataset first

gt regmodel lt- lm(log(wage) ~ school + female)

gt summary(regmodel)

Calllm(formula = log(wage) ~ school + female)

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

21

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 22: A Brief Guide to R for Beginners in Econometrics

CoefficientsEstimate Std Error t value Pr(gt|t|)

(Intercept) 4087730 0022203 18410 lt2e-16 school 0029667 0001783 1664 lt2e-16 female -0191109 0011066 -1727 lt2e-16 ---Signif codes 0 0001 001 005 01 1

Residual standard error 02621 on 2246 degrees of freedomMultiple R-squared 02101 Adjusted R-squared 02094F-statistic 2988 on 2 and 2246 DF p-value lt 22e-16

Sometimes we wish to run the regression on a subset of our data

lm (log(wage) ~ school + female subset=wagegt100)

Sometimes we need to use transformed values of the variables in the model Thetransformation should be given as the in the function ltI()gt I() means Identityfunction ltexpr^2gt is ltexprgt squared

lm (log(wage) ~ school + female + expr + I(expr^2))

Interacting variables ltfemalegt ltschoolgt

lm (log(wage) ~ femaleschool data=lnu)

Same as

lm (log(wage) ~ female + school + femaleschool data= lnu)

A model with no intercept

regnointercept lt- lm (log(wage) ~ female - 1)

A model with only an intercept

regonlyintercept lt- lm (log(wage) ~ 1 )

The following example runs ltlmgt for females and males in the private and publicsector separately as defined by the variables ltfemalegt and ltpublicgt The dataltlnugt are split into four cells males in private (00) females in private (10)males in public (01) and females in public (11) The obtained object ltbyreggtis a list and to display the ltsummary()gt of each element we use ltlapplygt (listapply)

22

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 23: A Brief Guide to R for Beginners in Econometrics

byreg lt- by(lnu list(femalepublic)

function(x) lm(log(wage) ~ school data=x))

summary of the separate regressions

lapply(byreg summary)

summary for the second element in the

list ie females in private sector

summary(byreg[[2]])

The following lists mean of variables for male and female workers (the first line)creates a list named byfemalelnu of two data sets (the second line) and runsregressions for male and female workers (the third and fourth lines)

by(lnu list(female) mean)

byfemalelnu lt- by(lnu list(female)

function(x) x) str(byfemalelnu)

summary(lm(log(wage) ~ school data=byfemalelnu[[1]]))

summary(lm(log(wage) ~ school data=byfemalelnu[[2]]))

71 Extracting the model formula and results

The model formula

(equation1 lt- formula(regmodel))

log(wage) ~ school + female

The estimated coefficients

gt coefficients(regmodel)

(Intercept) school female408772967 002966711 -019110920

The standard errors

gt coef(summary(regmodel))[ 2]

(Intercept) school female0022203334 0001782725 0011066176

ltcoef(summary(regmodel))[12]gt yields both ltEstimategt and ltStdErrorgt

The t-values

gt coef(summary(regmodel))[ 3]

(Intercept) school female18410432 1664144 -1726967

23

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 24: A Brief Guide to R for Beginners in Econometrics

Try also ltcoef(summary(regmodel))gt Analogously you can extract otherelements of the lm-object by

The variance-covariance matrix ltvcov(regmodel)gt

Residual degrees of freedomltdfresidual(regmodel)gt

The residual sum of squaresltdeviance(regmodel)gt

And other componentsltresiduals(regmodel)gtltfittedvalues(regmodel)gtltsummary(regmodel)$rsquaredgtltsummary(regmodel)$adjrsquaredgtltsummary(regmodel)$sigmagtltsummary(regmodel)$fstatisticgt

72 Whitersquos heteroskedasticity corrected standard errors

The package ltcargt and ltsandwichgt and ltDesigngt has predefined functionsfor computing robust standard errors There are different weighting options

The Whitersquos correction

gt library(car)

gt f1 lt- formula(log(wage) ~ female + school)

gt sqrt(diag(hccm(lm(f1) type = hc0)))

(Intercept) female school0022356311 0010920889 0001929391

Using the library ltsandwichgt

gt library(sandwich)

gt library(lmtest)

gt coeftest(lm(f1) vcov = (vcovHC(lm(f1) HC0)))

t test of coefficients

Estimate Std Error t value Pr(gt|t|)(Intercept) 40877297 00223563 182845 lt 22e-16 female -01911092 00109209 -17499 lt 22e-16 school 00296671 00019294 15376 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

24

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 25: A Brief Guide to R for Beginners in Econometrics

lthc0gt in library ltcargt and ltHC0gt in library sandwich use the original Whiteformula The lthc1gt ltHC1gt multiply the variances with N

Nminusk Using libraryltDesigngt

gt library(Design war warnconflicts = FALSE)

gt f1 lt- formula(log(wage) ~ female + school)

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE))

73 Group-wise non-constant error variance

This uses the library ltDesigngt and account for non-constant error variancewhen data is clustered across sectors indicated by the variable ltindustrygt

gt robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

Linear Regression Model

ols(formula = f1 x = TRUE y = TRUE)

n Model LR df R2 Sigma2249 5305 2 02101 02621

ResidualsMin 1Q Median 3Q Max

-146436 -015308 -001852 013542 110402

CoefficientsValue Std Error t Pr(gt|t|)

Intercept 408773 0036285 11266 0female -019111 0016626 -1149 0school 002967 0002806 1057 0

Residual standard error 02621 on 2246 degrees of freedomAdjusted R-Squared 02094

When the number of groups M are small you can correct the standard errorsas follows

gt library(Design)

gt f1 lt- formula(log(wage) ~ female + school)

gt M lt- length(unique(industry))

gt N lt- length(industry)

gt K lt- lm(f1)$rank

gt cl lt- (M(M - 1)) ((N - 1)(N - K))

gt fm1 lt- robcov(ols(f1 x = TRUE y = TRUE) cluster = industry)

gt fm1$var lt- fm1$var cl

gt coeftest(fm1)

25

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 26: A Brief Guide to R for Beginners in Econometrics

t test of coefficients

Estimate Std Error t value Pr(gt|t|)Intercept 40877297 00222033 184104 lt 22e-16 female -01911092 00110662 -17270 lt 22e-16 school 00296671 00017827 16641 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

74 F-test

Estimate the restricted (restricting some (or all) of slope coefficients to be zero)and the unrestricted model (allowing non-zero as well as zero coefficients) Youcan then use anova() to test the joint hypotheses defined as in the restrictedmodel

gt modrestricted lt- lm(log(wage) ~ 1)

gt modunrestricted lt- lm(log(wage) ~ female + school)

gt anova(modrestricted modunrestricted)

Analysis of Variance Table

Model 1 log(wage) ~ 1Model 2 log(wage) ~ female + schoolResDf RSS Df Sum of Sq F Pr(gtF)

1 2248 1953382 2246 154291 2 41047 29876 lt 22e-16 ---Signif codes 0 0001 001 005 01 1

Under non-constant error variance we use the White variance-Covariance ma-trix and the model F-value is as follows The lt-1gt in the codes below removerelated rowcolumn for the intercept

gt library(car)

gt COV lt- hccm(modunrestricted hc1)[-1 -1]

gt beta lt- matrix(coef(modunrestricted 1))[-1

+ ]

gt t(beta) solve(COV) beta(lm(f1)$rank -

+ 1)

[1][1] 2531234

26

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 27: A Brief Guide to R for Beginners in Econometrics

8 Time-series

Data in rows

Time-series data often appear in a form where series are in rows As an exampleI use data from the Swedish Consumer Surveys by the National Institute ofEconomic Research containing three series consumer confidence index a macroindex and a micro index

First I saved the original data in a file in text-format Before using the data asinput in R I used a text editor and kept only the values for the first three seriesseparated with spaces The series are in rows The values of the three series arelisted in separate rows without variable names

To read this file ltmacrotxtgt the following code puts the data in a matrix of3 columns and 129 rows with help of ltscangt and ltmatrixgt before defining atime-series object starting in 1993 with frequency 12 (monthly) The series arenamed as ltccigtltmacroindexgt and ltmicroindexgt Notice that ltmatrixgtby default fills in data by columns

gt FILE lt- httppeoplesuse~maR_intromacrotxt

gt macro lt- ts(matrix(scan(FILE) 129 3) start = 1993

+ frequency = 12 names = c(cci macroindex

+ microindex))

Here I give an example for creating lag values of a variable and adding it to atime-series data set See also ltdiffgt for computing differences

Let us create a new time series data set with the series in the data frameltmacrogt adding lagged ltccigt (lagged by 1 month) The function lttsuniongtputs together the series keeping all observations while lttsintersectgt wouldkeep only the overlapping part of the series

gt macro2 lt- tsunion(macro lcci = lag(macro[

+ 1] -1))

You can use the function aggregate to change the frequency of you time-seriesdata The following example converts the frequency data nfrequency=1 yieldsannual data FUN=mean computes the average of the variables over time Thedefault is ltsumgt

gt aggregate(macro nfrequency = 1 FUN = mean)

Time SeriesStart = 1993End = 2002Frequency = 1

cci macroindex microindex1993 -197583333 -333833333 -132416671994 -02916667 141666667 -5158333

27

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 28: A Brief Guide to R for Beginners in Econometrics

1995 -108583333 -37250000 -102500001996 -79166667 -205500000 -38666671997 35333333 07833333 21416671998 119083333 163500000 72500001999 176083333 206083333 118916672000 262666667 404083333 149250002001 34583333 -137083333 96083332002 72166667 -73250000 10125000

81 Durbin Watson

ltdwtestgt in the package ltlmtestgt and ltdurbinwatsongt in the package ltcargtcan be used See also ltbgtestgt in the package ltlmtestgt for Breusch-Godfreytest for higher order serial correlation

gt mod1 lt- lm(cci ~ macroindex data = macro)

gt library(lmtest)

gt dwtest(mod1)

Durbin-Watson test

data mod1DW = 0063 p-value lt 22e-16alternative hypothesis true autocorrelation is greater than 0

9 Graphics

91 Save graphs in postscript

postscript(myfileps)

hist(110)

devoff()

92 Save graphs in pdf

pdf(myfilepdf)

hist(110)

devoff()

28

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 29: A Brief Guide to R for Beginners in Econometrics

93 Plotting the observations and the Regression line

Plot the data and the regression line Plot ltschoolgt against ltlog(wage)gt

gt XLABEL = Years of schooling

gt YLABEL = Log hourly wage in SEK

gt TITLE lt- Figure 1 Scatterplot and the Regression line

gt SubTitle lt- Source Level of Living Surveys LNU 1991

gt plot(school log(wage) pch = main = TITLE

+ sub = SubTitle xlab = XLABEL ylab = YLABEL)

gt abline(lm(log(wage) ~ school))

gt abline(v = mean(school) col = red)

gt abline(h = mean(log(wage)) col = red)

5 10 15 20

30

35

40

45

50

55

Figure 1 Scatterplot and the Regression line

Source Level of Living Surveys LNU 1991Years of schooling

Log

hour

ly w

age

in S

EK

29

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 30: A Brief Guide to R for Beginners in Econometrics

94 Plotting time series

Start by reading a time-series data set For details see section 8

Plot series in one diagram

gt TITLE lt- Figure 2 Consumer confidence index in Sweden

gt SubTitle lt- Source Swedish Consumer Surveys

gt XLABEL lt- Year

gt COLORS = c(red blue black)

gt tsplot(macro col = COLORS main = TITLE sub = SubTitle

+ xlab = XLABEL)

gt legend(bottomright legend = colnames(macro)

+ lty = 1 col = COLORS)

Figure 2 Consumer confidence index in Sweden

Source Swedish Consumer SurveysYear

1994 1996 1998 2000 2002 2004

minus60

minus40

minus20

020

40

ccimacroindexmicroindex

ltplotts(macro)gt plots series separately

30

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 31: A Brief Guide to R for Beginners in Econometrics

10 Writing functions

The syntax is myfunction lt- function(x a ) The argu-ments for a function are the variables used in the operations as specified inthe body of the function ie the codes within Once you have written afunction and saved it you can use this function to perform the operation asspecified in by referring to your function and using the arguments relevantfor the actual computation

The following function computes the squared of mean of a variable By definingthe function ltmsgt we can write ltms(x)gt instead of lt(mean(x))^2)gt every timewe want to compute the square of mean for a variable ltxgt

gt ms lt- function(x)

+ (mean(x))^2

+

gt a lt- 1100

gt ms(a)

[1] 255025

The arguments of a function

The following function has no arguments and prints the string of text ltWelcomegt

gt welc lt- function()

+ print(Welcome)

+

gt welc()

[1] Welcome

This function takes an argument x The arguments of the function must besupplied

gt myprognodefault lt- function(x) print(paste(I use

+ x for statistical computation))

If a default value is specified the default value is assumed when no argumentsare supplied

gt myprog lt- function(x = R)

+ print(paste(I use x for statistical computation))

+

gt myprog()

[1] I use R for statistical computation

gt myprog(R and sometimes something else)

[1] I use R and sometimes something else for statistical computation

31

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 32: A Brief Guide to R for Beginners in Econometrics

101 A function for computing Clustered Standard Errors

Here follows a function for computing clustered-Standard Errors (See also thefunction robcov in the library Design discussed above) The arguments are adata frame ltdatgt a model formulaltf1gt and the cluster variable ltclustergt

clusteredstandarderrors lt- function(datf1 cluster)

attach(dat warnconflicts = FALSE)

M lt- length(unique(cluster))

N lt- length(cluster)

K lt- lm(f1)$rank

cl lt- (M(M-1))((N-1)(N-K))

X lt- modelmatrix(f1)

invXpX lt- solve(t(X) X)

ei lt- resid(lm(f1))

uj lt- asmatrix(aggregate(eiXlist(cluster)FUN=sum)[-1])

sqrt(cldiag(invXpXt(uj)ujinvXpX))

Notice that substituting the last line withsqrt( diag(invXpX t(eiX)(Xei)invXpX) )

would yield Whitersquos standard errors

11 Miscellaneous hints

Income Distribution see ineqLogit ltglm(formula family=binomial(link=logit))gt

See ltglmgt amp ltfamilygtNegative binomial ltnegativebinomial or glmnbgt in MASS VRPoisson regression ltglm(formula family=poisson(link=log))gt

See ltglmgt amp ltfamilygtProbit ltglm(formulafamily=binomial(link=probit))gt

See ltglmgt amp ltfamilygtSimultaneous Equations see sem systemfitTime Series see lttsgt tseries urca and strucchangeTobit see lttobingt in survival

12 Acknowledgements

I am grateful to Michael Lundholm Lena Nekby and Achim Zeileis for helpfulcomments

32

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements
Page 33: A Brief Guide to R for Beginners in Econometrics

References

Peter Dalgaard Introductory Statistics with R Springer 2nd edition 2008 URLhttpwwwbiostatkudk~pdISwRhtml ISBN 978-0-387-79053-4

John Fox An R and S-Plus Companion to Applied Regression Sage Publi-cations Thousand Oaks CA USA 2002 URL httpsocservsocscimcmastercajfoxBooksCompanionindexhtml ISBN 0-761-92279-2

Christian Kleiber and Achim Zeileis Applied Econometrics with R SpringerNew York 2008 URL httpwwwspringercom978-0-387-77316-2ISBN 978-0-387-77316-2

Roger Koenker and Achim Zeileis Reproducible econometric research (a criticalreview of the state of the art) Report 60 Department of Statistics and Math-ematics Wirtschaftsuniversitat Wien Research Report Series 2007 URLhttpwwweconuiucedu~rogerresearchrepro

R Development Core Team R A Language and Environment for StatisticalComputing R Foundation for Statistical Computing Vienna Austria 2008URL httpwwwR-projectorg ISBN 3-900051-07-0

William N Venables and Brian D Ripley Modern Applied Statistics with SFourth Edition Springer New York 2002 URL httpwwwstatsoxacukpubMASS4 ISBN 0-387-95457-0

33

  • Introduction
    • About R
    • About these pages
    • Objects and files
      • First things
        • Installation
        • Working with R
        • Naming in R
        • Saving and loading objects and images of working spaces
        • Overall options
        • Getting Help
          • Elementary commands
          • Data management
            • Reading data in plain text format
            • Non-available and delimiters in tabular data
            • Reading and writing data in other formats
            • Examining the contents of a data-frame object
            • Creating and removing variables in a data frame
            • Choosing a subset of variables in a data frame
            • Choosing a subset of observations in a dataset
            • Replacing values of variables
            • Replacing missing values
            • Factors
            • Aggregating data by group
            • Using several data sets
              • Basic statistics
                • Tabulation
                  • Matrixes
                    • Indexation
                    • Scalar Matrix
                    • Matrix operators
                      • Ordinary Least Squares
                        • Extracting the model formula and results
                        • Whites heteroskedasticity corrected standard errors
                        • Group-wise non-constant error variance
                        • F-test
                          • Time-series
                            • Durbin Watson
                              • Graphics
                                • Save graphs in postscript
                                • Save graphs in pdf
                                • Plotting the observations and the Regression line
                                • Plotting time series
                                  • Writing functions
                                    • A function for computing Clustered Standard Errors
                                      • Miscellaneous hints
                                      • Acknowledgements