r introduction and descriptive statistics · 2020-06-11 · introduction and descriptive statistics...

R

Introduction and descriptive statistics

tutorial 1


what is R

R is a free software programming language and software environment for statistical computing and graphics. (Wikipedia)

R is open source.


what is R

R is an object oriented programming language.

Everything in R is an object.

R objects are stored in memory, and are acted upon by functions (and operators).


how to get R

Homepage: http://www.r-project.org/

CRAN: Comprehensive R Archive Network

how to edit R

Editor: RStudio, http://www.rstudio.com


RStudio

http://www.rstudio.com

how R works


using R as a calculator


Users type expressions to the R interpreter.

R responds by computing and printing the answers.


arithmetic operators

type operator action performed

arithmetic

results in numeric value(s)

+ addition

- subtraction

* multiplication

/ division

^ raise to power


logical operators

type operator action performed

comparison

results in logical

value(s):

TRUE FALSE

< less than

> greater than

== equal to

!= not equal to

<= less than or equal to

>= greater than or equal to

connectors

& boolean intersection operator (logical and)

| boolean union operator (logical or)


arithmetic operators

addition / subtraction

> 5 + 5

> 10 - 2

multiplication / division

> 10 * 10

> 25 / 5

power

> 3 ^ 2

> 2 ^ (-2)

Note: > 100 ^ (1/2)

is equivalent to

> sqrt(100)


logical operators

> 4 < 3

[1] FALSE

> 2^3 == 9

[1] FALSE

> (3 + 1) != 3

[1] TRUE

> (3 >= 1) & (4 == (3+1))

[1] TRUE

assignment


Values are stored by assigning them a name.

The statements

> z = 17

> z <- 17

> 17 -> z

all store the value 17 under the name z in the workspace. Assignment operators are: <- , = , ->


data types

There are three basic types or modes of variables:

numeric (numbers: integers, real)

logical (TRUE, FALSE)

character (text strings, in "")

The type is shown by the mode() function. Note: A general missing value indicator is NA.


data types

> a = 49 # numeric

> sqrt(a)

[1] 7

> mode(a)

[1] "numeric"

> a = "The dog ate my homework" # character

> a

[1] "The dog ate my homework"

> mode(a)

[1] "character"

> a = (1 + 1 == 3) # logical

> a

[1] FALSE

> mode(a)

[1] "logical"
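The note on the missing value indicator NA can be illustrated with a minimal sketch. The vector name v and its values are made up for illustration; the point is that NA propagates through calculations unless it is removed explicitly.

```r
# Hypothetical example vector with one missing value
v <- c(4, NA, 10)

mode(v)                 # "numeric": NA takes on the type of the vector
is.na(v)                # FALSE TRUE FALSE: flags the missing entry
mean(v)                 # NA: one missing value makes the result missing
mean(v, na.rm = TRUE)   # 7: mean of the non-missing values 4 and 10
```

Many summary functions accept na.rm = TRUE in the same way.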

data structures

Elements (numeric, logical, character) are combined in:

vectors: ordered sets of elements of one type

data.frames: ordered sets of vectors (different vector types)

matrices: ordered sets of vectors (all of one vector type)

lists: ordered sets of anything
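As a rough sketch, each of the four data structures (vector, matrix, data frame, list) can be created in one line. All object names and values here are invented for illustration.

```r
v <- c(1.5, 2, 3)                                     # vector: elements of one type
m <- matrix(1:6, nrow = 2, ncol = 3)                  # matrix: filled column by column
d <- data.frame(id = 1:3, label = c("a", "b", "c"))   # data frame: columns of different types
l <- list(v, m, "some text", TRUE)                    # list: anything, including other structures

is.vector(v)   # TRUE
dim(m)         # 2 3 (rows, columns)
dim(d)         # 3 2 (rows, columns)
length(l)      # 4
```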


creating vectors

> x = c(1, 3, 5, 7, 8, 9) # numerical vector

> x

[1] 1 3 5 7 8 9

> z = c("I","am","Ironman") # character vector

> z

[1] "I" "am" "Ironman"

> x = c(TRUE,FALSE,NA) # logical vector

> x

[1] TRUE FALSE NA

The function c( ) can combine several elements into vectors.

combining vectors

The function c( ) can be used to combine both vectors and elements into larger vectors.

> x = c(1, 2, 3, 4)

> c(x, 10)

[1] 1 2 3 4 10

> c(x, x)

[1] 1 2 3 4 1 2 3 4

In fact, R stores single elements like 10 as vectors of length one, so both arguments in the expressions above are vectors.
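A short sketch of two consequences: a single number has length one, and because a vector holds elements of one type only, c( ) converts mixed input to a common type. The name mixed is chosen here for illustration.

```r
length(10)                  # 1: a single number is a vector of length one

mixed <- c(1, "a", TRUE)    # numeric, character and logical combined
mixed                       # "1" "a" "TRUE": everything becomes character
mode(mixed)                 # "character"

c(1, TRUE, FALSE)           # 1 1 0: logical values become numbers
```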

sequences

A useful way of generating vectors is the sequence operator. The expression n1:n2 generates the sequence of integers ranging from n1 to n2.

> 1:15

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13

[14] 14 15

> 5:-5

[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5

> y = 1:11

> y

[1] 1 2 3 4 5 6 7 8 9 10 11
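The colon operator always moves in steps of one. For other step sizes, the base function seq() can be used; the following sketch (with invented object names) shows its by and length.out arguments.

```r
evens <- seq(2, 10, by = 2)          # 2 4 6 8 10: fixed step size
grid  <- seq(0, 1, length.out = 5)   # 0.00 0.25 0.50 0.75 1.00: fixed number of points
down  <- rev(1:5)                    # 5 4 3 2 1: reverse an existing sequence
```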


extracting elements

> x = c(1, 3, 5, 7, 8, 9)

> x[3] # extract 3rd position

[1] 5

> x[1:3] # extract positions 1-3

[1] 1 3 5

> x[-2] # without 2nd position

[1] 1 5 7 8 9

> x[x<7] # select values < 7

[1] 1 3 5

> x[x!=5] # select values not equal to 5

[1] 1 3 7 8 9
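Logical conditions can be combined with & and | inside the brackets. A small sketch using the same vector x; which() is a base function that returns the positions, rather than the values, that satisfy a condition.

```r
x <- c(1, 3, 5, 7, 8, 9)

x[x > 3 & x < 9]       # 5 7 8: both conditions must hold
x[x < 3 | x > 8]       # 1 9: at least one condition must hold
which(x > 6)           # 4 5 6: positions of the values greater than 6
x[c(1, length(x))]     # 1 9: first and last element
```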

data frame


Data frames provide a way of grouping a number of related vectors into a single data object. The function data.frame() takes a number of vectors of the same length and returns a single object containing all the variables.

df = data.frame(var1, var2, ...)


data frame

In a data frame the column labels are the vector names. Note: Vectors can be of different types in a data frame (numeric, logical, character).

Data frames can be created in a number of ways:

Binding together vectors by the function data.frame( ).

Reading in data from an external file.


data frame

> time = c("early","mid","late","late","early")

> type <- c("G", "G", "O", "O", "G")

> counts <- c(20, 13, 8, 34, 7)

> data <- data.frame(time,type,counts)

> data

time type counts

1 early G 20

2 mid G 13

3 late O 8

4 late O 34

5 early G 7

> fix(data)
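Rows and columns of a data frame can be selected with square brackets, in the form [rows, columns]. The following sketch rebuilds the example data frame under the hypothetical name obs (to avoid masking the base function data()) and extracts parts of it.

```r
obs <- data.frame(
  time   = c("early", "mid", "late", "late", "early"),
  type   = c("G", "G", "O", "O", "G"),
  counts = c(20, 13, 8, 34, 7)
)

obs$counts                        # 20 13 8 34 7: one column as a vector
obs[2, ]                          # second row, all columns
obs[obs$counts > 10, ]            # rows with counts above 10 (rows 1, 2 and 4)
obs[obs$type == "G", "counts"]    # 20 13 7: counts for type G only
nrow(obs)                         # 5
ncol(obs)                         # 3
```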


example data: low birth weight

name text variable type

low low birth weight of the baby nominal: 0 'no >=2500g' 1 'yes <2500g'

age age of mother continuous: years

lwt mother's weight at last period continuous: pounds

race ethnicity nominal: 1 'white' 2 'black' 3 'other'

smoke smoking status nominal: 0 'no' 1 'yes'

ptl premature labor discrete: number of

ht hypertension nominal: 0 'no' 1 'yes'

ui presence of uterine irritability nominal: 0 'no' 1 'yes'

ftv physician visits in first trimester discrete: number of

bwt birthweight of the baby continuous: g

The birthweight data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986.

Loading: after library(MASS), the data frame is available as birthwt. Functions that give an overview of a data frame:

dim(birthwt)

summary(birthwt)

head(birthwt)

str(birthwt)

extracting vectors


data$vectorlabel gives the vector named vectorlabel of the data frame named data. Elements are extracted from this vector as usual.

> birthwt$age

> birthwt$age[33]

> birthwt$age[1:10]


some functions in R

name function

summary(x) summary statistics of the elements of x

max(x) maximum of the elements of x

min(x) minimum of the elements of x

sum(x) sum of the elements of x

mean(x) mean of the elements of x

sd(x) standard deviation of the elements of x

median(x) median of the elements of x

quantile(x, probs=…) quantiles of the elements of x

sort(x) ordering the elements of x


some functions in R

> mean(birthwt$age)

[1] 23.2381

> max(birthwt$age)

[1] 45

> min(birthwt$age)

[1] 14
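The descriptive functions listed above can be tried on a small made-up vector whose results are easy to check by hand. The values below are invented for illustration.

```r
x <- c(2, 4, 4, 4, 5, 5, 7, 9)

sum(x)                                # 40
mean(x)                               # 5
median(x)                             # 4.5: average of the two middle values 4 and 5
sd(x)                                 # about 2.14: sqrt(32 / 7)
quantile(x, probs = c(0.25, 0.75))    # 4 and 5.5 with R's default quantile definition
summary(x)                            # minimum, quartiles, mean and maximum in one call
```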


getting help

To get help on the sd() function, type either of:

> help(sd)

> ?sd

sorting vectors


> help(sort)

> x=sort(birthwt$age, decreasing=FALSE)

> x[1:10]

[1] 14 14 14 15 15 15 16 16 16 16

> x=sort(birthwt$age, decreasing=TRUE)

> x[1:10]

[1] 45 36 36 35 35 34 33 33 33 32

> x[25] # 25th highest age

[1] 30

Sorting / ordering of data in vectors with the function sort()
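sort() returns the sorted values themselves. When a second vector has to be rearranged in the same way, the base function order(), which returns the sorting positions, is a common alternative. A minimal sketch with invented data:

```r
age  <- c(30, 19, 25)
name <- c("A", "B", "C")

idx <- order(age)    # 2 3 1: positions of the smallest, middle and largest age
age[idx]             # 19 25 30: the same result as sort(age)
name[idx]            # "B" "C" "A": names rearranged by increasing age
```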

graphics


R has extensive graphics facilities.

Graphic functions are differentiated in

high-level graphics functions

low-level graphics functions

The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.


high-level graphics

name function

plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)

hist(x) histogram of the frequencies of x

barplot(x) bar plot of the values of x; use horiz=TRUE for horizontal bars

dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)

pie(x) circular pie-chart

boxplot(x) box-and-whiskers plot

stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)

mosaicplot(x) mosaic plot from frequencies in a contingency table

qqnorm(x) quantiles of x with respect to the values expected under a normal law


high-level graphics

> hist(birthwt$age)

> boxplot(birthwt$age)


hands-on example

loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Load the dataframe into your workspace with the data("PimaIndiansDiabetes2") command. Get an overview with the functions dim, head.

Calculate the mean and median of the variable insulin. Remove NAs for the calculation with the na.rm = TRUE option in mean and median functions.

Plot a histogram of the variable insulin.

R

Graphics and probability theory

tutorial 2

graphics








high-level graphics

name function

plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)

hist(x) histogram of the frequencies of x

barplot(x) bar plot of the values of x; use horiz=TRUE for horizontal bars

dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)

pie(x) circular pie-chart

boxplot(x) box-and-whiskers plot

stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)

mosaicplot(x) mosaic plot from frequencies in a contingency table

qqnorm(x) quantiles of x with respect to the values expected under a normal law

plot function


The core R graphics command is plot(). This is an all-in-one function which carries out a number of actions:

It opens a new graphics window.

It plots the content of the graph (points, lines etc.).

It plots x and y axes and boxes around the plot and produces the axis labels and title.

….

plot function


Parameters in the plot() function are:

x: x-coordinate(s)

y: y-coordinates (optional, depends on x)

plot function

To plot points given by x and y coordinates, or two variables of a data set against each other (one on the x axis, the other on the y axis; called a scatterplot), type:

> a = c(1,2,3,4)

> b = c(4,4,0,5)

> plot(x=a,y=b)

> plot(a,b) # the same


plot function

To plot points given by x and y coordinates, or two variables of a data set against each other (one on the x axis, the other on the y axis; called a scatterplot), type:

> library(MASS)

> plot(x=birthwt$age,y=birthwt$lwt)

# lwt: mother's weight in pounds

> plot(x=birthwt$age[1:10],y=birthwt$lwt[1:10])

# first 10 mothers


plot function

Another example:

> a = seq(-5, +5, by=0.2)

# generates a sequence from -5 to +5 with increment 0.2

> a

[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2

...

[45] 3.8 4.0 4.2 4.4 4.6 4.8 5.0

> b = a^2 # squares all components of a

> plot(a,b)

plot function


Parameters in the plot() function are (see help(plot) and help(par)):

x: x-coordinate(s)

y: y-coordinates (optional, depends on x)

main, sub: title and subtitle

xlab, ylab: axes labels

xlim, ylim: range of values for x and y

type: type of plot

lty: type of lines

pch: plot symbol

cex: scale factor

col: color of points etc.

...

plot symbol / line type

plot symbol: pch=

line type: lty=

plot type: type=

"p" points

"l" lines

"b" both

"s" steps

"h" vertical lines

"n" nothing

…


plot function

> a = seq(-5, +5, by=0.2)

> b = a^2

> plot(a, b)

> plot(a,b,main="quadratic function")

> plot(a,b,main="quadratic function",cex=2)

> plot(a,b,main="quadratic function",col="blue")


plot function

> a = seq(-5, +5, by=0.2)

> b = a^2

> plot(a,b,main="quadratic function",type="l")

> plot(a,b,main="quadratic function",type="b")

> plot(a,b,main="quadratic function",pch=2)

[Figure: plot titled "quadratic function"; x axis: x, from -4 to 4; y axis: y, from 0 to 25]
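Several plot() parameters can be combined in a single call. The sketch below reuses the quadratic example and sets the plot type, symbol, colour, title, axis labels and axis ranges together. The first line is an addition for running the code as a script and is not needed in an interactive session.

```r
# When run non-interactively, draw to a null device instead of opening a window
if (!interactive()) pdf(NULL)

a <- seq(-5, 5, by = 0.2)
b <- a^2

plot(a, b,
     type = "b",                    # both points and lines
     pch  = 2,                      # plot symbol number 2
     col  = "blue",                 # colour of points and lines
     main = "quadratic function",   # title
     xlab = "x", ylab = "y",        # axis labels
     xlim = c(-4, 4),               # range shown on the x axis
     ylim = c(0, 25))               # range shown on the y axis
```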

probability theory, factorials


Binomial coefficients can be computed by choose(n,k):

> choose(8,5)

[1] 56
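The binomial coefficient is defined through factorials, $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$, so the result of choose() can be checked with the base function factorial():

```r
choose(8, 5)                                        # 56
factorial(8) / (factorial(5) * factorial(8 - 5))    # 56: 40320 / (120 * 6)
choose(8, 3)                                        # 56: choosing 5 of 8 leaves 3 of 8
```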


functions for random variables

Distributions can easily be evaluated or simulated in R. The functions are named such that the first letter states what the function calculates or simulates:

d = density function (probability function)

p = distribution function

q = quantile (inverse distribution function)

r = random number generation

The last part of the function name specifies the type of distribution, e.g.

binom = binomial distribution

norm = normal distribution

binomial distribution

Probability function:

• dbinom(x, size, prob)

x: k

size: n

prob: π

$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$
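The binomial probability function, f(k) = P(X = k) = choose(n, k) · π^k · (1 − π)^(n − k), can be checked against dbinom() directly. The parameter values below are arbitrary and chosen only for illustration.

```r
n    <- 10     # number of trials (argument size)
k    <- 3      # number of successes (argument x)
prob <- 0.2    # success probability π (argument prob)

by_formula <- choose(n, k) * prob^k * (1 - prob)^(n - k)
by_dbinom  <- dbinom(x = k, size = n, prob = prob)

all.equal(by_formula, by_dbinom)         # TRUE: both give the same probability
sum(dbinom(0:n, size = n, prob = prob))  # 1: probabilities of all outcomes add up to one
```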

normal distribution

Density function:

• dnorm(x, mean, sd)

mean: μ, sd: σ

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

normal distribution

Calculating the probability density function:

> dnorm(x=2, mean=6, sd=2)

[1] 0.02699548

[Figure: density curve; x axis: x, from 0 to 12; y axis: f(x), from 0.00 to 0.20]
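The value returned by dnorm() can be reproduced from the normal density formula f(x) = 1 / (σ · sqrt(2π)) · exp(−(x − μ)² / (2σ²)). A minimal sketch using the same numbers as in the example; the variable names are chosen here for illustration.

```r
x     <- 2
mu    <- 6    # mean
sigma <- 2    # standard deviation

by_formula <- 1 / (sigma * sqrt(2 * pi)) * exp(-(x - mu)^2 / (2 * sigma^2))
by_dnorm   <- dnorm(x, mean = mu, sd = sigma)

by_formula                        # 0.02699548
all.equal(by_formula, by_dnorm)   # TRUE
```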

normal distribution

Distribution function:

• pnorm(q, mean, sd)

q: b

$$F(b) = \int_{-\infty}^{b} f(x)\,dx$$

[Figure: density f(x) against x, with the point b marked on the x axis]

normal distribution

Distribution function for the N(10,25) distribution:

> pnorm(q=13, mean=10, sd=5)

[1] 0.7257469

[Figure: density f(x) against x, with the point 13 marked on the x axis]
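Since the distribution function is the integral of the density up to b, pnorm() can be compared with a numerical integration of dnorm(). The sketch uses the base function integrate(); the object names are invented for illustration.

```r
# Area under the N(10, 25) density from minus infinity to 13
area <- integrate(dnorm, lower = -Inf, upper = 13, mean = 10, sd = 5)$value
prob <- pnorm(q = 13, mean = 10, sd = 5)

area   # approximately 0.7257, obtained by numerical integration
prob   # 0.7257469

# Probability of falling between 7 and 13: difference of two distribution values
between <- pnorm(13, mean = 10, sd = 5) - pnorm(7, mean = 10, sd = 5)

# qnorm() is the inverse of pnorm() and returns the original point
back <- qnorm(prob, mean = 10, sd = 5)   # 13
```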

binomial distribution

Probability function:

> dbinom(x=5, size=50, prob=0.15)

# Probability of having exactly 5 successes in 50 independent observations/measurements with a success probability of 0.15 each

[1] 0.1072481

> dbinom(5, 50, 0.15) # the same

normal distribution


Plotting the density of a N(5,49) distribution:

> x_values=seq(-15, 25, by=0.5)

> y_values=dnorm(x_values, mean=5, sd=7)

> plot(x_values,y_values,type="l")


hands-on example

loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Make a scatter plot for the variables glucose and insulin. What are the possible realizations of a random variable X distributed according to Bin(4,0.85)? Calculate all possible values of the probability function of X. Plot the probability function of X with the possible realizations of X on the x axis and the corresponding values of the probability function on the y axis.

R

Random numbers and factors

tutorial 3

binom = binomial distribution, norm = normal distribution, t = t distribution


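binomial distribution

Generating random realizations:

• rbinom(n, size, prob)

n: number of samples to draw

size: n (number of trials in the probability function)

prob: π

output: number of successes

$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$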

normal distribution

• rnorm(n, mean, sd)

n: number of samples to draw

mean: μ, sd: σ

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$


t distribution


• qt(p, df)

p: quantile probability df: degrees of freedom

Quantiles:

> rbinom(n=1, size=50, prob=0.15)

# Generating one sample of 50 independent observations/measurements with a success probability of 0.15 each

[1] 14 # 14 successes in this simulation


[1] 7





# Generating 10 samples

[1] 14 10 6 12 8 6 7 10 5 9

# The number of successes for all samples




normal distribution


> values=rnorm(10, mean=0, sd=1)

> values

[1] -0.56047565 -0.23017749 1.55870831 0.07050839

0.12928774 1.71506499 0.46091621 -1.26506123

-0.68685285 -0.44566197

# 10 simulations from a N(0,1) distribution

> mean(values)

[1] 0.07462565
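Random numbers differ from run to run. To make a simulation reproducible, the state of the random number generator can be fixed with the base function set.seed() before drawing. A minimal sketch; the seed value 2020 is arbitrary.

```r
set.seed(2020)                           # fix the state of the random number generator
first <- rnorm(10, mean = 0, sd = 1)

set.seed(2020)                           # same seed again
second <- rnorm(10, mean = 0, sd = 1)

identical(first, second)                 # TRUE: the same seed gives the same numbers

# With many draws the sample mean moves close to the true mean 0
big <- rnorm(10000, mean = 0, sd = 1)
mean(big)
```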

t distribution

Quantiles:

• qt(p, df)

p: quantile probability

df: degrees of freedom

> qt(p=0.95,df=9)

[1] 1.833113

> qt(p=0.95,df=99)

[1] 1.660391

> qnorm(p=0.95,mean=0,sd=1)

[1] 1.644854

> qt(p=0.975,df=99)

[1] 1.984217

# 1.984217 = $t_{1-\alpha/2,\,n-1}$ for α=0.05, n=100
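The quantiles above suggest that the t quantile approaches the standard normal quantile as the degrees of freedom grow. A short sketch that makes this visible; the choice of degrees of freedom is arbitrary.

```r
dfs <- c(9, 99, 999, 1e6)

q_t <- qt(p = 0.95, df = dfs)   # one t quantile per value of df; starts at 1.833113 and decreases
q_z <- qnorm(p = 0.95)          # 1.644854

q_t - q_z                       # the differences shrink towards 0 as df increases
```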


object classes

All objects in R have a class. The class attribute allows R to treat objects differently (e.g. for summary() or plot()).

Possible classes are:

numeric

logical

character

list

matrix

data.frame

array

factor

The class is shown by the class() function.

factors


Categorical variables in R are often specified as factors.

Factors have a fixed number of categories, called levels.

summary(factor) displays the frequency of the factor levels.

Functions in R for creating factors:

factor(), as.factor()

levels() displays and sets levels.

factors


• factor(x,levels,labels)

• as.factor(x)

x: vector of data, usually with a small number of distinct values

levels: specifies the values (categories) of x

labels: labels the levels


> smoke = c(0,0,1,1,0,0,0,1,0,1)

> smoke

[1] 0 0 1 1 0 0 0 1 0 1

> summary(smoke)

Min. 1st Qu. Median Mean 3rd Qu. Max.

0.0 0.0 0.0 0.4 1.0 1.0

> class(smoke)

[1] "numeric"

factors


> smoke_new=factor(smoke)

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1

> summary(smoke_new)

0 1

6 4

> class(smoke_new)

[1] "factor"

factors


> smoke_new=factor(smoke,levels=c(0,1))

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1

> smoke_new=factor(smoke,levels=c(0,1,2))

> smoke_new

[1] 0 0 1 1 0 0 0 1 0 1

Levels: 0 1 2

> summary(smoke_new)

0 1 2

6 4 0

factors


> smoke_new=factor(smoke,levels=c(0,1),

labels=c("no", "yes"))

> smoke_new

[1] no no yes yes no no no yes no yes

Levels: no yes

> summary(smoke_new)

no yes

6 4

factors


> library(MASS)

> summary(birthwt$race)

Min. 1st Qu. Median Mean 3rd Qu. Max.

1.000 1.000 1.000 1.847 3.000 3.000

> race_new=as.factor(birthwt$race)

> summary(race_new)

1 2 3

96 26 67

> levels(race_new)

[1] "1" "2" "3"

> levels(race_new)=c("white","black","other")

> summary(race_new)

white black other

96 26 67
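One point that is easy to overlook: a factor stores its values as integer codes that point to the levels. Converting a factor directly to numbers therefore returns the codes, not the original values. A minimal sketch with an invented factor named dose:

```r
dose <- factor(c("10", "20", "10", "40"))

levels(dose)                      # "10" "20" "40"
as.numeric(dose)                  # 1 2 1 3: the internal level codes
as.numeric(as.character(dose))    # 10 20 10 40: the original values
table(dose)                       # frequency of each level
```

Converting to character first is the usual way to recover numeric values from a factor.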

factors


hands-on example

Sample 20 realizations of a N(0,1) distribution. Calculate the mean and standard deviation. What is the formula for the confidence interval for the mean when σ is unknown? For a 90% confidence interval and the above sample: What are the parameters α and n? What is the value of $t_{1-\alpha/2,\,n-1}$? Calculate the 90% confidence interval for our example.

R

Reading data from files,

frequency tables

tutorial 4

binom = binomial distribution, norm = normal distribution, t = t distribution

normal distribution

Quantiles:

• qnorm(p, mean, sd)

p: quantile probability

> qnorm(p=0.95, mean=0, sd=1)

[1] 1.644854

# = $z_{0.95}$

> qnorm(p=0.975, mean=0, sd=1)

[1] 1.959964

# = $z_{1-\alpha/2}$ for α=0.05


reading data: working directory

For reading or saving files, a simple file name identifies a file in the working directory. Files in other places can be specified by the path name.

getwd() gives the current working directory.

setwd("path") sets a specific directory as your working directory.

Use setwd("path") to load and save data in the directory of your choice.

reading data

The standard way of storing statistical data is to put them in a rectangular form with rows corresponding to observations and columns corresponding to variables.

Spreadsheets, e.g. Excel, are often used to store and manipulate data in this way.

The function read.table() can be used to read data which has been stored in this way.

The first argument to read.table() identifies the file to be read.

reading data

Optional arguments to read.table() can be used to change its behaviour.

Setting header=TRUE indicates to R that the first row of the data file contains names for each of the columns.

The argument skip= makes it possible to skip the specified number of lines at the top of the file.

The argument sep= can be used to specify the character which separates columns. (Use sep=";" for semicolon-separated csv files.)

The argument dec= can be used to specify the character used as decimal point.
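The effect of header=, sep= and dec= can be tried without an external data file. The sketch below writes a small semicolon-separated file with decimal commas to a temporary location and reads it back. The file contents and column names are invented for illustration.

```r
# Write a small example file: semicolons separate columns, commas mark decimals
path <- tempfile(fileext = ".csv")
writeLines(c("id;weight;group",
             "1;70,5;control",
             "2;82,0;infarct",
             "3;64,3;control"),
           path)

d <- read.table(path, sep = ";", dec = ",", header = TRUE)

str(d)           # 3 observations of 3 variables; weight is read as numeric
mean(d$weight)   # works because the decimal commas were interpreted correctly

unlink(path)     # remove the temporary file
```

Without dec="," the weight column would be read as text, and numeric summaries would fail.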


example data: infarct

(case/control study)

name label

nro identifier

grp group (string) 'control', 'infarct'

code coded group 0 'control', 1 'infarct'

sex sex 1 'male', 2 'female'

age age years

height body height cm

weight body weight kg

blood sugar blood sugar level mg/100ml

diabet diabetes 0 'no', 1 'yes'

chol cholesterol level mg/100ml

trigl triglyceride level mg/100ml

cig cigarettes number of


> setwd("C:/Users/Präsentation/MLS")

> mi = read.table("infarct data.csv")

Error in scan(file, what,...: line 2 did not have 2

elements # wrong separator

> mi = read.table("infarct data.csv",sep=";")

> summary(mi) # no variable names

> mi = read.table("infarct data.csv",sep=";",

header=TRUE)

> summary(mi) # with variable names



frequency tables

table(var1, var2) gives a table of the absolute frequencies of all combinations of var1 and var2 (frequency table, cross classification table, contingency table). var1 and var2 have to attain a finite number of values. var1 defines the rows, var2 the columns.

addmargins(table) adds the sums of rows and columns.

prop.table(table) gives the relative frequencies, overall or with respect to rows or columns.


frequency tables

> grp_sex=table(mi$grp,mi$sex)

> grp_sex

1 2

control 25 15

infarct 28 12

> addmargins(grp_sex)

1 2 Sum

control 25 15 40

infarct 28 12 40

Sum 53 27 80


frequency tables

> prop.table(grp_sex)

1 2

control 0.3125 0.1875

infarct 0.3500 0.1500

> prop.table(grp_sex,margin=1)

1 2

control 0.625 0.375

infarct 0.700 0.300 # rows sum to 1

> prop.table(grp_sex,margin=2)

1 2

control 0.4716981 0.5555556

infarct 0.5283019 0.4444444 # columns sum to 1
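The same three functions can be tried on two short made-up vectors, so that every count is easy to verify by eye. The variable names group and answer are invented for illustration.

```r
group  <- c("a", "a", "a", "b", "b", "b", "b", "a")
answer <- c("yes", "no", "yes", "yes", "yes", "no", "yes", "no")

tab <- table(group, answer)      # rows: group, columns: answer
tab                              # a: 2 no, 2 yes; b: 1 no, 3 yes

addmargins(tab)                  # adds row and column sums; the total is 8
prop.table(tab)                  # shares of the total, summing to 1
prop.table(tab, margin = 1)      # shares within each row
prop.table(tab, margin = 2)      # shares within each column
```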


hands-on example

Load the dataset from the file bdendo.csv into the workspace. Generate a table of the variables d (case-control status) and dur (categorical duration of oestrogen therapy). Generate a table of the variables d (case-control status) and agegr (age group). Compare the two tables.

R

Installing packages,

the package "pROC"

tutorial 5

R packages


R consists of a base level of functionality together with a set of contributed libraries which provide extended capabilities.

The key idea is that of a package which provides a related set of software components, documentation and data sets.

Packages can be installed into R. Installing into the system-wide library may need administrator rights; otherwise R can install into a personal library.


Package: pROC

Type: Package

Title: display and analyze ROC curves

Version: 1.7.1

Date: 2014-02-20

Encoding: UTF-8

Depends: R (>= 2.13)

Imports: plyr, utils, methods, Rcpp (>= 0.10.5)

Suggests: microbenchmark, tcltk, MASS, logcondens, doMC,

doSNOW

LinkingTo: Rcpp

Author: Xavier Robin, Natacha Turck, Alexandre Hainard,

Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez

and Markus Müller.

Maintainer: Xavier Robin

pROC – diagnostic testing

installing packages


You can install R packages using the install.packages() command.

> install.packages("pROC")

Installing package(s) into

‘C:/Users/Amke/Documents/R/win-library/2.15’

(as ‘lib’ is unspecified)

downloaded 827 Kb

package ‘pROC’ successfully unpacked and MD5 sums

checked

The downloaded binary packages are in

C:\Users\Amke\AppData\Local\Temp\RtmpUJPoia\downloaded_packages


Installing R packages using the menu:

using installed packages


When R is running, simply type:

> library(pROC)

This adds the R functions in the library to the search path. You can now use the functions and datasets in the package and inspect the documentation.


cite packages

To cite the package pROC in publications use:

> citation("pROC")

...

Xavier Robin, Natacha Turck, Alexandre Hainard,

Natalia Tiberti, Frédérique

Lisacek, Jean-Charles Sanchez and Markus Müller

(2011). pROC: an open-source

package for R and S+ to analyze and compare ROC

curves. BMC Bioinformatics, 12,

p. 77. DOI: 10.1186/1471-2105-12-77

<http://www.biomedcentral.com/1471-2105/12/77/>

...


package pROC

The main function is roc(response, predictor). It creates the values necessary for an ROC curve.

response: disease status (as provided by gold standard)

predictor: continuous test result

(to be dichotomized)

For an roc object the plot(roc_obj) function produces an ROC curve.


package pROC

The function coords(roc_obj,x,best.method,ret) calculates measures of test performance.

x: value for which measures are calculated (default: threshold); x="best" gives the optimal threshold

best.method: if x="best", the method to determine the best threshold (e.g. "youden")

ret: Measures calculated. One or more of "threshold", "specificity", "sensitivity", "accuracy", "tn" (true negative count), "tp" (true positive count), "fn" (false negative count), "fp" (false positive count), "npv" (negative predictive value), "ppv" (positive predictive value)

(default: threshold, specificity, sensitivity)

example data: aSAH (aneurysmal subarachnoid haemorrhage)

name label

gos6 Glasgow Outcome Score (GOS) at 6 months 1-5

outcome prediction of development 'good', 'poor' (to be diagnosed)

gender sex 'male', 'female'

age age years

wfns World Federation of Neurological Surgeons Score 1-5

s100b S100 calcium binding protein B μg/l (biomarker, continuous test result)

ndka Nucleoside diphosphate kinase A μg/l (biomarker, continuous test result)


> data(aSAH) # loads the data set "aSAH"

> head(aSAH)

> rocobj = roc(aSAH$outcome, aSAH$s100b)

> plot(rocobj)

> coords(rocobj, 0.55)

threshold specificity sensitivity

0.5500000 1.0000000 0.2682927

> coords(rocobj, x="best",best.method="youden")

threshold specificity sensitivity

0.2050000 0.8055556 0.6341463

# Youden threshold is 0.205, with the corresponding specificity and sensitivity

Measures of Test Performance

Outcomes of a diagnostic study for a dichotomous test result

                 disease present   disease absent
test positive    true positive     false positive
test negative    false negative    true negative

> coords(rocobj,x="best",best.method="youden",

ret=c("threshold","specificity","sensitivity",

"tn","tp","fn","fp"))

threshold specificity sensitivity

0.2050000 0.8055556 0.6341463

tn tp fn fp

58.0000000 26.0000000 15.0000000 14.0000000

                 disease present   disease absent
test positive    tp: 26            fp: 14
test negative    fn: 15            tn: 58
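The measures of test performance follow directly from the four counts reported by coords() above (tn = 58, tp = 26, fn = 15, fp = 14). The sketch below recomputes them in base R, without pROC; the variable names are chosen here for illustration.

```r
tp <- 26   # true positives
fn <- 15   # false negatives
fp <- 14   # false positives
tn <- 58   # true negatives

sensitivity <- tp / (tp + fn)                    # 0.6341463: diseased with a positive test
specificity <- tn / (tn + fp)                    # 0.8055556: non-diseased with a negative test
ppv         <- tp / (tp + fp)                    # 0.65: positive predictive value
npv         <- tn / (tn + fn)                    # negative predictive value, 58 / 73
accuracy    <- (tp + tn) / (tp + fn + fp + tn)   # share of correct test results, 84 / 113
youden      <- sensitivity + specificity - 1     # Youden index at this threshold
```

Sensitivity and specificity agree with the values printed by coords().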

R

Statistical testing 1

tutorial 6

statistical test functions


name function

t.test( ) Student's t-test

wilcox.test( ) Wilcoxon rank sum test and signed rank test

ks.test( ) Kolmogorov-Smirnov test

chisq.test( ) Pearson's chi-squared test for count data

mcnemar.test( ) McNemar test


One sample t test

The function t.test() performs different Student's t tests.

Parameters for the one sample t test are: t.test(x, mu, alternative)

x: numeric vector of values which shall be tested

(assumed to follow a normal distribution)

mu: reference value µ0

alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than µ0),

"greater" (alternative: expectation of x is larger than µ0)

Blood Sugar Level and Myocardial Infarction

A study was carried out to assess whether the expected blood sugar level (BSL) of patients with myocardial infarction µ is higher than the expected BSL of control individuals, namely µ0=100 mg/100ml.

H0: µ ≤ µ0   HA: µ > µ0



> mi = read.table("infarct data.csv",sep=";",

dec=",", header=TRUE)

> summary(mi$blood.sugar)

> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]

# Attention: two "="s!

# Extracts the blood sugar levels of only the cases.

> summary(bloods_infarct)

One sample t test


> t.test(bloods_infarct,mu=100,alternative="greater")

One Sample t-test

data: bloods_infarct

t = -0.7824, df = 39, p-value = 0.7807

alternative hypothesis: true mean is greater than 100

95 percent confidence interval:

90.14572 Inf

sample estimates:

mean of x

96.875

# Blood sugar level of infarct patients is not significantly higher than 100 mg/100ml.
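The test statistic and p-value reported by t.test() can be reproduced by hand, which shows what the function computes. The sketch uses a small invented sample, not the infarct data; pt() is the distribution function of the t distribution.

```r
x   <- c(102, 95, 110, 98, 105, 99, 93, 108)   # hypothetical measurements
mu0 <- 100                                     # reference value
n   <- length(x)

# t statistic: distance of the sample mean from mu0 in units of the standard error
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))

# One-sided p-value for the alternative "greater": upper tail of the t distribution
p_val <- pt(t_stat, df = n - 1, lower.tail = FALSE)

res <- t.test(x, mu = mu0, alternative = "greater")
res$statistic   # same value as t_stat
res$p.value     # same value as p_val
```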

One sample t test


hands-on example

Load the dataset from the file infarct data.csv into the workspace. Perform a two-sided one-sample t-test for cholesterol level in infarct patients. The reference value for the population is 180 mg/100ml. What is the result of the test?

R

Statistical testing 2

tutorial 7

statistical test functions


name function

t.test( ) Student's t-test

wilcox.test( ) Wilcoxon rank sum test and signed rank test

ks.test( ) Kolmogorov-Smirnov test

chisq.test( ) Pearson's chi-squared test for count data

mcnemar.test( ) McNemar test

The function t.test() performs different Student's t tests.

Parameters for the two sample t test are:

t.test(x, y, alternative, var.equal)

x, y: numeric vectors of values which shall be compared

(assumed to follow a normal distribution)

alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than expectation of y), "greater" (alternative: expectation of x is larger than expectation of y)

var.equal: Are the variances of x and y equal? (TRUE or FALSE (default); TRUE is the t test of the lecture)


Two sample t test

The function wilcox.test() performs the Wilcoxon rank sum test and the Wilcoxon signed rank test.

Parameters for the Wilcoxon rank sum test are: wilcox.test(x, y, alternative)

x, y: numeric vectors of values which shall be compared

(need not follow a normal distribution)

alternative: similar to t.test


Wilcoxon rank sum test

Blood Sugar Level and Myocardial Infarction

A case-control study was carried out to assess whether the expected blood sugar level (BSL) of patients with myocardial infarction µ1 is higher than the expected BSL of control individuals µ2.

H0: µ1 ≤ µ2   HA: µ1 > µ2



> mi = read.table("infarct data.csv", sep=";",

dec=",", header=TRUE)


> summary(mi$blood.sugar)

> summary(as.factor(mi$code))

> bloods_infarct=mi$blood.sugar[mi$code==1]

> bloods_control=mi$blood.sugar[mi$code==0]

# Extracts the blood sugar levels of the cases

# and of the controls.

Two sample t test


> t.test(bloods_infarct, bloods_control,

var.equal=TRUE, alternative="greater")

Two Sample t-test

data: bloods_infarct and bloods_control

t = 0.0305, df = 78, p-value = 0.4879

alternative hypothesis: true difference in means is greater than 0

95 percent confidence interval:

-13.39077 Inf

sample estimates:

mean of x mean of y

96.875 96.625

# Expected BSL of infarct patients is not significantly higher than expected BSL of controls.
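With var.equal=TRUE the two sample t test pools the variances of both groups. The sketch below reproduces the statistic by hand on two invented samples and contrasts it with the default var.equal=FALSE, which R reports as the Welch test.

```r
x <- c(98, 105, 110, 92, 101, 99)    # hypothetical group 1
y <- c(95, 97, 104, 90, 93)          # hypothetical group 2
n1 <- length(x)
n2 <- length(y)

# Pooled variance and the resulting t statistic
sp2    <- ((n1 - 1) * var(x) + (n2 - 1) * var(y)) / (n1 + n2 - 2)
t_stat <- (mean(x) - mean(y)) / sqrt(sp2 * (1 / n1 + 1 / n2))

pooled <- t.test(x, y, var.equal = TRUE)
welch  <- t.test(x, y)               # default: variances not assumed equal

pooled$statistic   # same value as t_stat
pooled$parameter   # n1 + n2 - 2 = 9 degrees of freedom
welch$parameter    # Welch degrees of freedom, at most n1 + n2 - 2
```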

Two sample t test


> wilcox.test(bloods_infarct, bloods_control,

alternative="greater")

Wilcoxon rank sum test with continuity correction

data: bloods_infarct and bloods_control

W = 867.5, p-value = 0.2576

alternative hypothesis: true location shift is greater

than 0

# The Wilcoxon test can be applied if the BSL does not

# follow a normal distribution. Then the t test is not

# valid.
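The statistic W reported by wilcox.test() for two samples counts the pairs in which a value of the first sample exceeds a value of the second (ties would count one half). A minimal sketch with invented data that contain no ties:

```r
x <- c(1.1, 2.5, 3.7, 4.2)   # hypothetical sample 1
y <- c(0.9, 2.0, 3.0)        # hypothetical sample 2

# Compare every x with every y and count the pairs with x > y
w_by_hand <- sum(outer(x, y, ">"))   # 9

res <- wilcox.test(x, y)
res$statistic                        # W = 9
```

Because only the ordering of the values enters, the result does not depend on the values following a normal distribution.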

Wilcoxon rank sum test


Pearson's chi-squared test

The function chisq.test() performs Pearson's chi-squared test for count data.

chisq.test(x)

x: n x m table (matrix) to be tested



name text variable type

low low birth weight of the baby nominal: 0 'no >=2500g' 1 'yes <2500g'

age age of mother continuous: years

lwt mother's weight at last period continuous: pounds

race ethnicity nominal: 1 'white' 2 'black' 3 'other'

smoke smoking status nominal: 0 'no' 1 'yes'

ptl premature labor discrete: number of

ht hypertension nominal: 0 'no' 1 'yes'

ui presence of uterine irritability nominal: 0 'no' 1 'yes'

ftv physician visits in first trimester discrete: number of

bwt birthweight of the baby continuous: g


> library(MASS)

> tab_bw_smok=table(birthwt$low, birthwt$smoke)

> tab_bw_smok

0 1

0 86 44

1 29 30

> chisq.test(tab_bw_smok)

Pearson's Chi-squared test with Yates'

continuity correction

data: tab_bw_smok

X-squared = 4.2359, df = 1, p-value = 0.03958

# The probability of having a baby with low birth

# weight is significantly higher for smoking mothers.
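The chi-squared statistic compares the observed counts with the counts expected under independence (row sum times column sum, divided by the total). The sketch below re-enters the 2 x 2 table printed above by hand, so that MASS is not required, and recomputes the statistic with and without Yates' continuity correction. The object names are chosen here for illustration.

```r
# Observed counts from the table above (rows: low, columns: smoke)
tab <- matrix(c(86, 29, 44, 30), nrow = 2,
              dimnames = list(low = c("0", "1"), smoke = c("0", "1")))

# Expected counts under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

x2_plain <- sum((tab - expected)^2 / expected)              # without correction
x2_yates <- sum((abs(tab - expected) - 0.5)^2 / expected)   # with Yates' correction

x2_yates                                    # about 4.236, as reported by chisq.test(tab)
chisq.test(tab)$statistic                   # the same value
chisq.test(tab, correct = FALSE)$statistic  # equals x2_plain
```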

Pearson's chi-squared test

hands-on example

loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Plot a histogram of the variable insulin. Compare the insulin values between cases and controls (variable diabetes) using an appropriate test.

R

Correlation and linear regression,

low level graphics

tutorial 8


Correlation

The function cor(x, y, method) computes the correlation between two paired random variables.

x, y: numeric vectors of values for which the correlation shall be calculated (must have the same length)

method: "pearson", "spearman" or "kendall"


Test of correlation

The function cor.test(x, y, alternative, method) tests for correlation between paired random variables.

x, y: numeric vectors of values for which the correlation shall be tested (must have the same length)

alternative:

"two.sided" (alternative: correlation coefficient ≠ 0,

default),

"less" (alternative: negative correlation),

"greater" (alternative: positive correlation)

method: "pearson", "spearman" or "kendall"


Linear regression (simple)

The function lm(formula, data) fits a linear model to data.

formula: y~x with y response variable and x explanatory variable (must have the same length)

data: optional; the data frame containing x and y, if they are not fully specified in the formula





> plot(x=mi$height, y=mi$weight)

> cor(mi$height, mi$weight, method="pearson")

[1] 0.6307697

> cor(mi$height, mi$weight, method="spearman")

[1] 0.6281738

Correlation


> cor.test(mi$height, mi$weight, method="pearson")

Pearson's product-moment correlation

data: mi$height and mi$weight
t = 7.1792, df = 78, p-value = 3.586e-10
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.4771865 0.7469643
sample estimates:
cor
0.6307697

# Significant correlation between body height and body weight

Correlation


> lm(mi$weight~mi$height)

Call:
lm(formula = mi$weight ~ mi$height)

Coefficients:
(Intercept)    mi$height
   -51.2910       0.7477

# Y = a + b × x + E
# with Y: body weight, x: body height, a = -51.29, b = 0.75

Linear regression

graphics

low-level graphics


Plots produced by high-level graphics facilities can be modified by low-level graphics commands.


low-level functions

name function

points(x, y) adds points (the option type= can be used)

lines(x, y) adds lines (the option type= can be used)

text(x, y, labels, ...) adds text given by labels at coordinates (x,y); a typical use is: plot(x, y, type="n"); text(x, y, names)

abline(a, b) draws a line of slope b and intercept a

abline(h=y) draws a horizontal line at ordinate y

abline(v=x) draws a vertical line at abscissa x

rect(x1, y1, x2, y2) draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively

polygon(x, y) draws a polygon with coordinates given by x and y

title( ) adds a title and optionally a sub-title
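
The functions in the table above can be combined as in the following sketch. The values of x and y are made up for illustration only; each low-level call adds one element to the plot opened by plot().

```r
x <- 1:10
y <- c(2, 4, 3, 6, 7, 6, 9, 8, 11, 10)    # made-up values for illustration

plot(x, y, type = "n")                     # empty plot: axes only
points(x, y, pch = 19)                     # add the points
lines(x, y, lty = 2)                       # connect them with a dashed line
abline(h = mean(y), col = "grey")          # horizontal line at the mean of y
abline(v = 5, col = "grey")                # vertical line at x = 5
rect(1, 2, 3, 4, border = "red")           # rectangle from (1, 2) to (3, 4)
polygon(c(7, 9, 8), c(2, 2, 4), col = "lightblue")   # filled triangle
text(x = 8, y = 5, labels = "triangle")    # label at coordinates (8, 5)
title("Low-level graphics functions")      # add a title
```

Low-level functions only work if a plot already exists, which is why plot() is called first.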


> plot(x=mi$height, y=mi$weight)

> abline(a=-51.29, b=0.75, col="blue")

# Adds the regression line to the scatter plot.

> title("Regression of weight and height")

> text(x=185, y=65, labels="Kieler Woche", col="green")
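
Instead of typing rounded coefficients by hand, abline() also accepts a fitted lm object and takes intercept and slope from it. A minimal sketch, assuming the built-in cars dataset as illustrative data (the name fit is not from the tutorial):

```r
# Built-in 'cars' data, used here only for illustration
fit <- lm(dist ~ speed, data = cars)

plot(x = cars$speed, y = cars$dist)
abline(fit, col = "blue")    # regression line taken directly from 'fit'
title("Regression of stopping distance on speed")
```

This avoids rounding errors and keeps the line in sync with the model if the data change.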

low-level functions


hands-on example

Load the dataset from the file correlation.csv into the workspace. Calculate the Pearson correlation coefficient between the variables x and y and test whether this coefficient is significantly different from 0. Generate a scatter plot.

R

Regression models

tutorial 9

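Linear regression (simple)

The function lm(formula, data) fits a linear model to data.

formula: y~x with y the response variable and x the explanatory variable (both must have the same length)

data: optional; the dataframe containing x and y, if they are not given as full vectors in the formula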

Linear regression (multiple)

The function lm(formula, data) fits a linear model to data.

formula: y~x1+x2+…+xk with y the response variable and x1,…,xk the explanatory variables (all must have the same length)

data: optional; the dataframe containing x1,…,xk and y, if they are not given as full vectors in the formula
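
A minimal sketch of a multiple linear regression, assuming the built-in mtcars dataset as illustrative data (the choice of mpg as response and wt, hp, qsec as explanatory variables is arbitrary and not part of the tutorial):

```r
# Multiple linear regression on the built-in 'mtcars' data (illustrative)
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# Coefficient table: estimates, standard errors, t values and p-values
summary(fit)$coefficients
```

The coefficient table has one row for the intercept and one row per explanatory variable.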


Generalised linear model

The function glm(formula, family) fits a generalised linear model to data.

formula: y~x1+x2+…+xk with y the response variable and x1,…,xk the explanatory variables (all must have the same length)

family: specifies the error distribution and link function; choose family=binomial for logistic regression
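
A minimal sketch of a logistic regression with glm(), assuming the built-in mtcars dataset as illustrative data: the binary variable am (0/1) is the response and wt the explanatory variable. These names are not part of the tutorial data.

```r
# Logistic regression on the built-in 'mtcars' data (illustrative)
fit <- glm(am ~ wt, data = mtcars, family = binomial)

coef(fit)        # coefficients on the log-odds scale
exp(coef(fit))   # the same coefficients expressed as odds ratios

# Fitted probabilities of am = 1 for each observation
p <- predict(fit, type = "response")
```

With family=binomial the default link is the logit, so exponentiating a coefficient gives the odds ratio per unit increase of the corresponding variable.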

name label

nro identifier

code coded group 0 'control', 1 'infarct'

age age years

> model_mi=glm(mi$code~mi$sex+mi$age+mi$height+mi$weight+
      mi$blood.sugar+mi$diabet+mi$chol+mi$trigl+mi$cig,
      family=binomial)

> summary(model_mi)

Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)    -34.60297   12.51757  -2.764 0.005704 **
mi$sex           0.23048    0.90885   0.254 0.799810
mi$age           0.10734    0.04161   2.580 0.009883 **
mi$height        0.14930    0.07838   1.905 0.056799 .
mi$weight       -0.11508    0.06304  -1.826 0.067916 .
mi$blood.sugar  -0.02246    0.01399  -1.605 0.108425
mi$diabet        2.05732    2.15947   0.953 0.340743
mi$chol          0.07294    0.02188   3.334 0.000855 ***
mi$trigl        -0.01936    0.01227  -1.578 0.114638
mi$cig           0.07686    0.04695   1.637 0.101603
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

> model_mi=glm(mi$code~mi$age+mi$chol, family=binomial)

> summary(model_mi)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.13858    3.78005  -4.269 1.96e-05 ***
mi$age        0.08404    0.03255   2.582 0.009827 **
mi$chol       0.05564    0.01569   3.546 0.000391 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

# Model after backward selection
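
The slides do not state how the backward selection was carried out. One possible automated variant is the function step(), which removes variables as long as the AIC decreases; this is a different criterion from dropping the least significant variable by p-value and may lead to a different reduced model. A sketch on the built-in mtcars dataset (illustrative data and variable choice, not from the tutorial), shown here for lm; step() is used in the same way on glm objects:

```r
# Full model on the built-in 'mtcars' data (illustrative)
full <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)

# AIC-based backward elimination; trace = 0 suppresses the step-by-step log
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # variables remaining in the reduced model
```

step() needs the model to be fitted with the data argument (as above) so that it can refit the model without the dropped variables.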



hands-on example

Load the package with library(mlbench) and the dataframe with data(PimaIndiansDiabetes2). Perform a linear regression with the variable insulin as response and the variables glucose, pressure, mass and triceps as explanatory variables. Apply a backward selection to generate a reduced model.

R

Survival analysis

tutorial 10


Survival object

Before performing the analysis, a survival object has to be created with the function Surv(time, event).

time: time since the start of the study (survival time)

if the event occurred: time of the event

if no event occurred: last observation time

event:

1: event

0: no event

Important: has to be numeric.
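
A minimal sketch of creating a survival object. The follow-up times and event indicators are made up for illustration; the package survival is assumed to be installed.

```r
library(survival)

# Made-up follow-up data for five subjects (illustrative only)
time  <- c(5, 8, 12, 3, 9)   # time of the event or last observation time
event <- c(1, 0, 1, 1, 0)    # 1: event, 0: no event (censored)

s <- Surv(time = time, event = event)
s   # censored observations are printed with a trailing "+"
```

The survival object stores time and event status together, so the functions that follow only need this one object as response.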


Survival curves

The function survfit(formula, data) creates an estimated survival curve, which can afterwards be drawn with plot().

formula: Let y be a Surv object.

y~1 for a Kaplan-Meier curve

y~x for several Kaplan-Meier curves stratified by x
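
A sketch of both formula variants, assuming the dataset aml that ships with the survival package (variables time, status and the group variable x); it is used here only for illustration and is not the tutorial's data.

```r
library(survival)

# One Kaplan-Meier curve for all observations
km_all <- survfit(Surv(time, status) ~ 1, data = aml)

# One Kaplan-Meier curve per level of the group variable x
km_group <- survfit(Surv(time, status) ~ x, data = aml)

plot(km_group, lty = 1:2, xlab = "time", ylab = "survival probability")
legend("topright", legend = names(km_group$strata), lty = 1:2)
```

Here the Surv object is created directly inside the formula, which allows the variables to be looked up in the dataframe given by data.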



Log-Rank Test

The function survdiff(formula, rho, data) tests if there is a difference between two or more survival curves.

formula: y~x with

y: Surv object

x: group or stratifying variable

rho: a scalar parameter that controls the type of test.

rho=0 (default) for the Log-Rank Test (Modification of test in lecture)
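
A sketch of the Log-Rank Test, again assuming the aml dataset from the survival package as illustrative data. The p-value is recomputed from the test statistic with pchisq(), with degrees of freedom equal to the number of groups minus one.

```r
library(survival)

# Log-Rank Test comparing the survival curves of the two groups in x
lr <- survdiff(Surv(time, status) ~ x, data = aml, rho = 0)

lr$chisq   # chi-squared test statistic

# p-value from the chi-squared distribution (df = number of groups - 1)
p_value <- pchisq(lr$chisq, df = length(lr$n) - 1, lower.tail = FALSE)
p_value
```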



example data: survival

therapy: two chemotherapies, C1 and C2

time: if death occurred: time of death; if no death occurred: last observation time

event: 1: death, 0: no death

> install.packages("survival")

> library(survival)

> cancer=read.table("survival.csv", dec=",", sep=";", header=TRUE)

> head(cancer)

> surv_object=Surv(time=cancer$time, event=cancer$event)

> curve=survfit(surv_object~1)

> summary(curve)

> plot(curve)

# One Kaplan-Meier curve for both therapies combined (with confidence bands)

Survival analysis


> surv_object=Surv(time=cancer$time, event=cancer$event)

> curve=survfit(surv_object~cancer$therapy)

> summary(curve)

> plot(curve, lty=1:2)

# factor() is needed if therapy was read as character (the default since R 4.0.0)

> legend("topright", levels(factor(cancer$therapy)), lty=1:2)

# Two Kaplan-Meier curves, one for each therapy group (without confidence bands)

Survival analysis


> survdiff(surv_object~cancer$therapy, rho=0)

Call:
survdiff(formula = surv_object ~ cancer$therapy, rho = 0)

                   N Obs Expected (O-E)^2/E (O-E)^2/V
cancer$therapy=C1 10   6     4.07     0.919      1.56
cancer$therapy=C2 10   6     7.93     0.471      1.56

Chisq= 1.6 on 1 degrees of freedom, p= 0.211

# No significant difference between the survival functions of the two therapies

Survival analysis


hands-on example

Load the package with library(survival); the dataframe is called retinopathy. In this dataframe, the variable futime is the time variable and the variable status is the event variable. Plot a Kaplan-Meier curve.