r introduction and descriptive statistics · 2020-06-11 · introduction and descriptive statistics...
TRANSCRIPT
R
Introduction and descriptive statistics
tutorial 1
2
what is R
R is a free software programming language and software environment for statistical computing and graphics. (Wikipedia)
R is open source.
3
what is R
R is an object oriented programming language.
Everything in R is an object.
R objects are stored in memory, and are acted upon by functions (and operators).
4
Homepage CRAN: Comprehensive R Archive Network
http://www.r-project.org/
how to get R
5
Editor RStudio
6
http://www.rstudio.com
how to edit R
8
RStudio
http://www.rstudio.com
how R works
9
using R as a calculator
10
Users type expressions to the R interpreter.
R responds by computing and printing the answers.
11
arithmetic operators
type operator action performed
arithmetic
results in numeric value(s)
+ addition
- subtraction
* multiplication
/ division
^ raise to power
12
logical operators
type operator action performed
comparison
results in logical
value(s):
TRUE FALSE
< less than
> greater than
== equal to
!= not equal to
<= less than or equal to
>= greater than or equal to
connectors
& boolean intersection operator (logical and)
| boolean union operator (logical or)
13
arithmetic operators
power
> 3 ^ 2
> 2 ^ (-2)
Note: > 100 ^ (1/2)
is equivalent to
> sqrt(100)
addition / subtraction
> 5 + 5
> 10 - 2
multiplication / division
> 10 * 10
> 25 / 5
14
logical operators
> 4 < 3
[1] FALSE
> 2^3 == 9
[1] FALSE
> (3 + 1) != 3
[1] TRUE
> (3 >= 1) & (4 == (3+1))
[1] TRUE
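Comparison operators and connectors also work element by element on whole vectors. The following is a minimal sketch; the vector x is an illustrative example, not data from the tutorial.

```r
x <- c(1, 3, 5, 7, 8, 9)        # illustrative vector
x <= 5                          # TRUE TRUE TRUE FALSE FALSE FALSE
in_range <- (x > 2) & (x < 8)   # logical and, element by element
outside  <- (x < 2) | (x > 8)   # logical or, element by element
sum(in_range)                   # TRUE counts as 1, so this counts the matches: 3
sum(outside)                    # 2
```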
assignment
15
Values are stored by assigning them a name.
The statements
> z = 17
> z <- 17
> 17 -> z
all store the value 17 under the name z in the workspace. Assignment operators are: <- , = , ->
16
data types
There are three basic types or modes of variables:
numeric (numbers: integers, real)
logical (TRUE, FALSE)
character (text strings, in "")
The type is shown by the mode() function. Note: A general missing value indicator is NA.
17
data types
> a = 49 # numeric
> sqrt(a)
[1] 7
> mode(a)
[1] "numeric"
> a = "The dog ate my homework" # character
> a
[1] "The dog ate my homework"
> mode(a)
[1] "character"
> a = (1 + 1 == 3) # logical
> a
[1] FALSE
> mode(a)
[1] "logical"
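The missing value indicator NA can occur in vectors of any type and propagates through most calculations. A short sketch with a made-up vector v shows how to detect and skip missing values:

```r
v <- c(4, NA, 10)        # illustrative vector with one missing value
is.na(v)                 # FALSE TRUE FALSE
mean(v)                  # NA: missing values propagate
mean(v, na.rm = TRUE)    # 7: the NA is removed before averaging
mode(v)                  # "numeric": the NA takes the type of its vector
```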
18
Elements: numeric, logical, character in
vectors ordered sets of elements of one type
data.frames ordered sets of vectors (different vector types)
matrices ordered sets of vectors (all of one vector type)
lists ordered sets of anything.
data structures
19
creating vectors
> x = c(1, 3, 5, 7, 8, 9) # numerical vector
> x
[1] 1 3 5 7 8 9
> z = c("I","am","Ironman") # character vector
> z
[1] "I" "am" "Ironman"
> x = c(TRUE,FALSE,NA) # logical vector
> x
[1] TRUE FALSE NA
The function c( ) can combine several elements into vectors.
combining vectors
20
The function c( ) can be used to combine both vectors and elements into larger vectors.
> x = c(1, 2, 3, 4)
> c(x, 10)
[1] 1 2 3 4 10
> c(x, x)
[1] 1 2 3 4 1 2 3 4
In fact, R stores elements like 10 as vectors of length one, so that both arguments in the expression above are vectors
sequences
21
A useful way of generating vectors is using the sequence operator. The expression n1:n2 generates the sequence of integers ranging from n1 to n2.
> 1:15
[1] 1 2 3 4 5 6 7 8 9 10 11 12 13
[14] 14 15
> 5:-5
[1] 5 4 3 2 1 0 -1 -2 -3 -4 -5
> y = 1:11
> y
[1] 1 2 3 4 5 6 7 8 9 10 11
22
extracting elements
> x = c(1, 3, 5, 7, 8, 9)
> x[3] # extract 3rd position
[1] 5
> x[1:3] # extract positions 1-3
[1] 1 3 5
> x[-2] # without 2nd position
[1] 1 5 7 8 9
> x[x<7] # select values < 7
[1] 1 3 5
> x[x!=5] # select values not equal to 5
[1] 1 3 7 8 9
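Conditions can be combined inside the brackets, and which() returns positions instead of values. A small sketch using the same vector x as in the slide:

```r
x <- c(1, 3, 5, 7, 8, 9)
x[x > 2 & x < 8]     # values between 2 and 8: 3 5 7
which(x > 6)         # positions (not values) where the condition holds: 4 5 6
x[c(-1, -2)]         # drop the first two positions: 5 7 8 9
length(x)            # number of elements: 6
```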
data frame
23
Data frames provide a way of grouping a number of related vectors into a single data object. The function data.frame() takes a number of vectors of the same length and returns a single object containing all the variables.
df = data.frame(var1, var2, ...)
24
data frame
In a data frame the column labels are the vector names. Note: Vectors can be of different types in a data frame (numeric, logical, character).
Data frames can be created in a number of ways:
Binding together vectors by the function data.frame( ).
Reading in data from an external file.
25
data frame
> time = c("early","mid","late","late","early")
> type <- c("G", "G", "O", "O", "G")
> counts <- c(20, 13, 8, 34, 7)
> data <- data.frame(time,type,counts)
> data
time type counts
1 early G 20
2 mid G 13
3 late O 8
4 late O 34
5 early G 7
> fix(data)
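Rows and columns of a data frame can be selected with square brackets, in the same way as vector elements. The sketch below rebuilds the data frame from the slide; the selections are illustrative.

```r
time   <- c("early", "mid", "late", "late", "early")
type   <- c("G", "G", "O", "O", "G")
counts <- c(20, 13, 8, 34, 7)
data   <- data.frame(time, type, counts)

data[2, ]                              # second row, all columns
data[, "counts"]                       # one column as a vector
big <- data[data$counts > 10, ]        # rows with counts above 10
nrow(big)                              # 3
mean(data$counts[data$type == "G"])    # mean count for type "G": 40/3
```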
26
example data: low birth weight
name text variable type
low low birth weight of the baby nominal: 0 'no >=2500g' 1 'yes <2500g'
age age of mother continuous: years
lwt mother's weight at last period continuous: pounds
race ethnicity nominal: 1 'white' 2 'black' 3 'other'
smoke smoking status nominal: 0 'no' 1 'yes'
ptl premature labor discrete: number of
ht hypertension nominal: 0 'no' 1 'yes'
ui presence of uterine irritability nominal: 0 'no' 1 'yes'
ftv physician visits in first trimester discrete: number of
bwt birthweight of the baby continuous: g
The birthweight data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Mass during 1986.
27
example data: low birth weight
loading: library(MASS), the data frame is called birthwt. Functions giving an overview of a data frame:
dim(birthwt)
summary(birthwt)
head(birthwt)
str(birthwt)
extracting vectors
28
data$vectorlabel gives the vector named vectorlabel of the data frame named data. Extracting elements from this vector is done as usual.
> birthwt$age
> birthwt$age[33]
> birthwt$age[1:10]
29
some functions in R
name function
summary(x) summary statistics of the elements of x
max(x) maximum of the elements of x
min(x) minimum of the elements of x
sum(x) sum of the elements of x
mean(x) mean of the elements of x
sd(x) standard deviation of the elements of x
median(x) median of the elements of x
quantile(x, probs=…) quantiles of the elements of x
sort(x) ordering the elements of x
30
some functions in R
> mean(birthwt$age)
[1] 23.2381
> max(birthwt$age)
[1] 45
> min(birthwt$age)
[1] 14
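The functions quantile() and sd() from the table can be tried on a small vector. This is a sketch with made-up values; the last line recomputes the standard deviation by hand, assuming the usual sample formula with divisor n - 1.

```r
x <- c(1, 3, 5, 7, 8, 9)                        # illustrative data
q <- quantile(x, probs = c(0.25, 0.5, 0.75))    # named vector: 25%, 50%, 75%
q
median(x)                                       # equals the 50% quantile
sd(x)                                           # sample standard deviation
sqrt(sum((x - mean(x))^2) / (length(x) - 1))    # same value computed by hand
```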
31
getting help
To get help on the sd() function you can type either of
> help(sd)
> ?sd
sorting vectors
32
> help(sort)
> x=sort(birthwt$age, decreasing=FALSE)
> x[1:10]
[1] 14 14 14 15 15 15 16 16 16 16
> x=sort(birthwt$age, decreasing=TRUE)
> x[1:10]
[1] 45 36 36 35 35 34 33 33 33 32
> x[25] # 25th highest age
[1] 30
Sorting / ordering of data in vectors with the function sort()
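While sort() returns the sorted values, the related base function order() returns the positions that would sort a vector, which is useful for rearranging a second vector or a whole data frame accordingly. A sketch with made-up values:

```r
age    <- c(30, 19, 25)                    # illustrative values
weight <- c(150, 110, 130)
ord <- order(age)                          # positions that sort age: 2 3 1
age[ord]                                   # same as sort(age): 19 25 30
weight[ord]                                # weights rearranged to match: 110 130 150
df <- data.frame(age, weight)
df[order(df$age, decreasing = TRUE), ]     # data frame sorted by age, oldest first
```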
graphics
33
R has extensive graphics facilities.
Graphics functions are divided into
high-level graphics functions
low-level graphics functions
The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.
34
high-level graphics
name function
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)
hist(x) histogram of the frequencies of x
barplot(x) bar plot of the values of x; use horiz=TRUE for horizontal bars
dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)
pie(x) circular pie-chart
boxplot(x) box-and-whiskers plot
stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)
mosaicplot(x) mosaic plot from frequencies in a contingency table
qqnorm(x) quantiles of x with respect to the values expected under a normal law
35
high-level graphics
> hist(birthwt$age)
> boxplot(birthwt$age)
36
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Load the dataframe into your workspace with the data("PimaIndiansDiabetes2") command. Get an overview with the functions dim, head.
Calculate the mean and median of the variable insulin. Remove NAs for the calculation with the na.rm = TRUE option in mean and median functions.
Plot a histogram of the variable insulin.
R
Graphics and probability theory
tutorial 2
graphics
38
R has extensive graphics facilities.
Graphics functions are divided into
high-level graphics functions
low-level graphics functions
The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.
39
high-level graphics
name function
plot(x, y) bivariate plot of x (on the x-axis) and y (on the y-axis)
hist(x) histogram of the frequencies of x
barplot(x) bar plot of the values of x; use horiz=TRUE for horizontal bars
dotchart(x) if x is a data frame, plots a Cleveland dot plot (stacked plots line-by-line and column-by-column)
pie(x) circular pie-chart
boxplot(x) box-and-whiskers plot
stripchart(x) plot of the values of x on a line (an alternative to boxplot() for small sample sizes)
mosaicplot(x) mosaic plot from frequencies in a contingency table
qqnorm(x) quantiles of x with respect to the values expected under a normal law
plot function
40
The core R graphics command is plot(). This is an all-in-one function which carries out a number of actions:
It opens a new graphics window.
It plots the content of the graph (points, lines etc.).
It plots x and y axes and boxes around the plot and produces the axis labels and title.
….
plot function
41
Parameters in the plot() function are:
x x-coordinate(s)
y y-coordinates (optional, depends on x)
42
plot function
To plot points with x and y coordinates or two random variables for a data set (one on the x axis, the other on the y axis; called a scatterplot) , type:
> a = c(1,2,3,4)
> b = c(4,4,0,5)
> plot(x=a,y=b)
> plot(a,b) # the same
43
plot function
To plot points with x and y coordinates or two random variables for a data set (one on the x axis, the other on the y axis; called a scatterplot), type:
> library(MASS)
> plot(x=birthwt$age,y=birthwt$lwt)
# lwt: mothers weight in pounds
> plot(x=birthwt$age[1:10],y=birthwt$lwt[1:10])
# first 10 mothers
44
plot function
Another example:
> a = seq(-5, +5, by=0.2)
# generates a sequence from -5 to +5 with increment 0.2
[1] -5.0 -4.8 -4.6 -4.4 -4.2 -4.0 -3.8 -3.6 -3.4 -3.2
...
[45] 3.8 4.0 4.2 4.4 4.6 4.8 5.0
> b = a^2 # squares all components of a
> plot(a,b)
plot function
45
Parameters in the plot() function are (see help(plot) and help(par)):
x x-coordinate(s)
y y-coordinates (optional, depends on x)
main, sub title and subtitle
xlab, ylab axes labels
xlim, ylim range of values for x and y
type type of plot
lty type of lines
pch plot symbol
cex scale factor
col color of points etc.
...
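Several of these parameters can be combined in one call. The following sketch uses the quadratic example from the previous slide; the particular values for sub, xlim, ylim, lty, pch and cex are arbitrary choices for illustration.

```r
a <- seq(-5, 5, by = 0.2)
b <- a^2
plot(a, b,
     main = "quadratic function",        # title
     sub  = "y = x^2",                   # subtitle
     xlab = "x", ylab = "y",             # axis labels
     xlim = c(-4, 4), ylim = c(0, 16),   # visible ranges of the axes
     type = "b",                         # points joined by lines
     lty  = 2,                           # dashed line
     pch  = 16,                          # filled circle as plot symbol
     cex  = 0.8,                         # slightly smaller symbols
     col  = "blue")                      # color of points and lines
```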
plot symbol / line type
46
plot symbol: pch=
line type: lty=
plot type: type=
"p" points
"l" lines
"b" both
"s" steps
"h" vertical lines
"n" nothing
…
47
plot function
> a = seq(-5, +5, by=0.2)
> b = a^2
> plot(a, b)
> plot(a,b,main="quadratic function")
> plot(a,b,main="quadratic function",cex=2)
> plot(a,b,main="quadratic function",col="blue")
48
plot function
> a = seq(-5, +5, by=0.2)
> b = a^2
> plot(a,b,main="quadratic function",type="l")
> plot(a,b,main="quadratic function",type="b")
> plot(a,b,main="quadratic function",pch=2)
[Figure: plot titled "quadratic function"; x-axis labelled x (ticks from -4 to 4), y-axis labelled y (ticks from 0 to 25)]
probability theory, factorials
49
Binomial coefficients can be computed by choose(n,k):
> choose(8,5)
[1] 56
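The binomial coefficient is defined through factorials, n! / (k! (n-k)!), which can be checked with the base function factorial(). A brief sketch:

```r
n <- 8; k <- 5
choose(n, k)                                        # 56
factorial(n) / (factorial(k) * factorial(n - k))    # 56, from n! / (k! (n-k)!)
choose(n, k) == choose(n, n - k)                    # TRUE: symmetry of the coefficient
```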
50
functions for random variables
Distributions can be easily calculated or simulated using R. The functions are named such that the first letter states what the function calculates or simulates
d = density function (probability function)
p = distribution function
q = quantile (inverse distribution)
r = random number generation
and the last part of the name of the function specifies the type of distribution, e.g.
binom = binomial distribution
norm = normal distribution
binomial distribution
51
• dbinom(x, size, prob)
x: k
size: n
prob: π
Probability function:
$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$
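The binomial probability function, f(k) = choose(n,k) π^k (1-π)^(n-k), can be evaluated by hand and compared with dbinom(). In this sketch the variable p stands for π, and the values n = 50, k = 5, p = 0.15 are chosen for illustration.

```r
n <- 50; k <- 5; p <- 0.15                         # p plays the role of π
by_hand <- choose(n, k) * p^k * (1 - p)^(n - k)    # formula evaluated directly
by_hand
dbinom(x = k, size = n, prob = p)                  # same value
sum(dbinom(0:n, size = n, prob = p))               # probabilities over all k sum to 1
```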
normal distribution
52
• dnorm(x, mean, sd)
mean: μ
sd: σ
Density function:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
normal distribution
53
Calculating the probability density function:
> dnorm(x=2, mean=6, sd=2)
[1] 0.02699548
[Figure: density f(x) plotted against x, for x from 0 to 12]
normal distribution
54
• pnorm(q, mean, sd)
q: b
Distribution function:
$$F(b) = \int_{-\infty}^{b} f(x)\,dx$$
[Figure: density f(x) plotted against x, with b marked on the x-axis]
55
[Figure: density f(x) plotted against x, with 13 marked on the x-axis]
normal distribution
Distribution function:
N(10,25) distribution
> pnorm(q=13, mean=10, sd=5)
[1] 0.7257469
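Because pnorm() gives P(X <= q), probabilities of intervals and upper tails follow by subtraction, and qnorm() inverts the calculation. A sketch for the N(10,25) distribution; the lower bound 7 is an arbitrary illustrative choice.

```r
p_upper <- pnorm(q = 13, mean = 10, sd = 5)    # P(X <= 13)
p_lower <- pnorm(q = 7,  mean = 10, sd = 5)    # P(X <= 7)
p_upper - p_lower                              # P(7 < X <= 13)
1 - p_upper                                    # P(X > 13)
qnorm(p_upper, mean = 10, sd = 5)              # inverse of pnorm: 13
pnorm((13 - 10) / 5)                           # same as p_upper after standardizing
```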
> dbinom(x=5, size=50, prob=0.15)
# Probability of having exactly 5 successes in 50
# independent observations/measurements with a success
# probability of 0.15 each
[1] 0.1072481
> dbinom(5, 50, 0.15) # the same
binomial distribution
56
Probability function:
normal distribution
57
Plotting the density of a N(5,49) distribution:
> x_values=seq(-15, 25, by=0.5)
> y_values=dnorm(x_values, mean=5, sd=7)
> plot(x_values,y_values,type="l")
58
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Make a scatter plot for the variables glucose and insulin. What are the possible realizations of a random variable X distributed according to Bin(4,0.85)? Calculate all possible values of the probability function of X. Plot the probability function of X with the possible realizations of X on the x axis and the corresponding values of the probability function on the y axis.
R
Random numbers and factors
tutorial 3
60
functions for random variables
Distributions can be easily calculated or simulated using R. The functions are named such that the first letter states what the function calculates or simulates
d = density function (probability function)
p = distribution function
q = quantile (inverse distribution)
r = random number generation
and the last part of the name of the function specifies the type of distribution, e.g.
binom = binomial distribution
norm = normal distribution
t = t distribution
binomial distribution
61
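• rbinom(n, size, prob)
n: number of samples to draw
size: number of trials (n in the probability function)
prob: success probability π
output: number of successes
Generating random realizations from the distribution with probability function
$$f(k) = P(X = k) = \binom{n}{k}\,\pi^{k}\,(1-\pi)^{n-k}$$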
normal distribution
62
• rnorm(n, mean, sd)
n: number of samples to draw
Generating random realizations from the distribution with density
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}$$
t distribution
63
• qt(p, df)
p: quantile probability
df: degrees of freedom
Quantiles:
> rbinom(n=1, size=50, prob=0.15)
# Generating one sample of 50 independent
# observations/measurements with a success probability
# of 0.15 each
[1] 14 # 14 successes in this simulation
> rbinom(n=1, size=50, prob=0.15)
[1] 7
binomial distribution
64
Generating random realizations:
> rbinom(n=10, size=50, prob=0.15)
# Generating 10 samples
[1] 14 10 6 12 8 6 7 10 5 9
# The number of successes for all samples
binomial distribution
65
Generating random realizations:
normal distribution
66
> values=rnorm(10, mean=0, sd=1)
> values
[1] -0.56047565 -0.23017749 1.55870831 0.07050839
0.12928774 1.71506499 0.46091621 -1.26506123
-0.68685285 -0.44566197
# 10 simulations from a N(0,1) distribution
> mean(values)
[1] 0.07462565
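Random draws differ from run to run unless the random number generator is initialized with set.seed(). The sketch below uses the arbitrary seed 42 to show that the same seed reproduces the same draws.

```r
set.seed(42)                             # arbitrary seed, chosen for illustration
first  <- rnorm(10, mean = 0, sd = 1)
set.seed(42)                             # same seed again
second <- rnorm(10, mean = 0, sd = 1)
identical(first, second)                 # TRUE: the same draws are reproduced
mean(rnorm(10000))                       # close to 0 for a large sample
```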
t distribution
67
• qt(p, df)
p: quantile probability
df: degrees of freedom
Quantiles:
> qt(p=0.95,df=9)
[1] 1.833113
> qt(p=0.95,df=99)
[1] 1.660391
> qnorm(p=0.95,mean=0,sd=1)
[1] 1.644854
> qt(p=0.975,df=99)
[1] 1.984217
# = $t_{1-\alpha/2,\,n-1}$ for α=0.05, n=100
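The t quantile with probability 1 - α/2 and n - 1 degrees of freedom is the factor used in the confidence interval for a mean with unknown σ, mean ± t · sd / sqrt(n). A minimal sketch with a made-up sample and α = 0.05, compared against the interval reported by t.test():

```r
x <- c(4.1, 5.3, 6.0, 4.8, 5.6, 5.1, 4.4, 5.9)   # made-up sample
n <- length(x)
alpha <- 0.05
t_quant <- qt(p = 1 - alpha / 2, df = n - 1)      # t quantile for a 95% interval
half_width <- t_quant * sd(x) / sqrt(n)
ci <- c(mean(x) - half_width, mean(x) + half_width)
ci
t.test(x)$conf.int                                # same interval from t.test()
```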
68
object classes
All objects in R have a class. The class attribute allows R
to treat objects differently (e.g. for summary() or plot()).
Possible classes are:
numeric
logical
character
list
matrix
data.frame
array
factor
The class is shown by the class() function.
factors
69
Categorical variables in R are often specified as factors.
Factors have a fixed number of categories, called levels.
summary(factor) displays the frequency of the factor levels.
Functions in R for creating factors:
factor(), as.factor()
levels() displays and sets levels.
factors
70
• factor(x,levels,labels)
• as.factor(x)
x: vector of data, usually with a small number of distinct values
levels: specifies the values (categories) of x
labels: labels the levels
71
> smoke = c(0,0,1,1,0,0,0,1,0,1)
> smoke
[1] 0 0 1 1 0 0 0 1 0 1
> summary(smoke)
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.0 0.0 0.0 0.4 1.0 1.0
> class(smoke)
[1] "numeric"
factors
72
> smoke_new=factor(smoke)
> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1
> summary(smoke_new)
0 1
6 4
> class(smoke_new)
[1] "factor"
factors
73
> smoke_new=factor(smoke,levels=c(0,1))
> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1
> smoke_new=factor(smoke,levels=c(0,1,2))
> smoke_new
[1] 0 0 1 1 0 0 0 1 0 1
Levels: 0 1 2
> summary(smoke_new)
0 1 2
6 4 0
factors
74
> smoke_new=factor(smoke,levels=c(0,1),
labels=c("no", "yes"))
> smoke_new
[1] no no yes yes no no no yes no yes
Levels: no yes
> summary(smoke_new)
no yes
6 4
factors
75
> library(MASS)
> summary(birthwt$race)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.000 1.000 1.847 3.000 3.000
> race_new=as.factor(birthwt$race)
> summary(race_new)
1 2 3
96 26 67
> levels(race_new)
[1] "1" "2" "3"
> levels(race_new)=c("white","black","other")
> summary(race_new)
white black other
96 26 67
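One common pitfall when working with factors is converting them back to numbers: as.numeric() returns the internal level codes, not the original values. A sketch with a made-up factor f:

```r
f <- factor(c(10, 20, 10, 30))   # illustrative factor with numeric-looking levels
levels(f)                        # "10" "20" "30"
as.numeric(f)                    # 1 2 1 3: internal level codes, not the values
as.numeric(as.character(f))      # 10 20 10 30: the original values
table(f)                         # frequencies per level, like summary(f)
```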
factors
76
hands-on example
Sample 20 realizations of a N(0,1) distribution. Calculate mean and standard deviation. What is the formula for the confidence interval for the mean for unknown σ? For a 90% confidence interval and the above sample: What are the parameters α and n? What is the value of $t_{1-\alpha/2,\,n-1}$? Calculate the 90% confidence interval for our example.
R
Reading data from files,
frequency tables
tutorial 4
78
functions for random variables
Distributions can be easily calculated or simulated using R. The functions are named such that the first letter states what the function calculates or simulates
d = density function (probability function)
p = distribution function
q = quantile (inverse distribution)
r = random number generation
and the last part of the name of the function specifies the type of distribution, e.g.
binom = binomial distribution
norm = normal distribution
t = t distribution
79
normal distribution
Quantiles:
• qnorm(p, mean, sd)
p: quantile probability
> qnorm(p=0.95,mean=0,sd=1)
[1] 1.644854
# = $z_{0.95}$
> qnorm(p=0.975,mean=0,sd=1)
[1] 1.959964
# = $z_{1-\alpha/2}$ for α=0.05
80
reading data: working directory
For reading or saving files, a simple file name identifies a file in the working directory. Files in other places can be specified by the path name.
getwd() gives the current working directory.
setwd("path") sets a specific directory as your
working directory.
Use setwd("path") to load and save data in the
directory of your choice.
The standard way of storing statistical data is to put them in a rectangular form with rows corresponding to observations and columns corresponding to variables.
Spreadsheets are often used to store and manipulate data in this way, e.g. EXCEL.
The function read.table() can be used to read
data which has been stored in this way.
The first argument to read.table() identifies the
file to be read.
reading data
81
reading data
82
Optional arguments to read.table() which can be
used to change its behaviour.
Setting header=TRUE indicates to R that the first row
of the data file contains names for each of the columns.
The argument skip= makes it possible to skip the
specified number of lines at the top of the file.
The argument sep= can be used to specify a character which separates columns. (Use sep=";" for semicolon-separated csv files and sep="," for comma-separated ones.)
The argument dec= can be used to specify a character
as decimal point.
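The arguments header, skip, sep and dec can be tried without an external data file by writing a small file first. The following sketch uses a temporary file with invented contents (a comment line to skip, semicolons as separators and decimal commas); the file and column names are hypothetical.

```r
path <- tempfile(fileext = ".csv")           # temporary file, hypothetical contents
writeLines(c("exported by some tool",        # a line to be skipped
             "id;height;weight",             # header row with column names
             "1;172,5;70,2",
             "2;165,0;58,9"),
           path)
d <- read.table(path, header = TRUE, sep = ";", dec = ",", skip = 1)
str(d)                                       # height and weight are read as numbers
mean(d$height)                               # 168.75
unlink(path)                                 # remove the temporary file
```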
83
example data: infarct
(case/control study)
name label
nro identifier
grp group (string) 'control', 'infarct'
code coded group 0 'control', 1 'infarct'
sex sex 1 'male', 2 'female'
age age years
height body height cm
weight body weight kg
blood sugar blood sugar level mg/100ml
diabet diabetes 0 'no', 1 'yes'
chol cholesterol level mg/100ml
trigl triglyceride level mg/100ml
cig cigarettes number of
84
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv")
Error in scan(file, what,...: line 2 did not have 2
elements # wrong separator
> mi = read.table("infarct data.csv",sep=";")
> summary(mi) # no variable names
> mi = read.table("infarct data.csv",sep=";",
header=TRUE)
> summary(mi) # with variable names
example data: infarct
85
frequency tables
table(var1, var2) gives a table of the absolute frequencies of all combinations of var1 and var2 (frequency table, cross classification table, contingency table). var1 and var2 have to attain a finite number of values. var1 defines the rows, var2 the columns.
addmargins(table) adds the sums of rows and columns.
prop.table(table) gives the relative frequencies, overall or with respect to rows or columns.
86
frequency tables
> grp_sex=table(mi$grp,mi$sex)
> grp_sex
1 2
control 25 15
infarct 28 12
> addmargins(grp_sex)
1 2 Sum
control 25 15 40
infarct 28 12 40
Sum 53 27 80
87
frequency tables
> prop.table(grp_sex)
1 2
control 0.3125 0.1875
infarct 0.3500 0.1500
> prop.table(grp_sex,margin=1)
1 2
control 0.625 0.375
infarct 0.700 0.300 # rows sum to 1
> prop.table(grp_sex,margin=2)
1 2
control 0.4716981 0.5555556
infarct 0.5283019 0.4444444 # columns sum to 1
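The effect of the margin argument can be verified by summing the resulting proportions. Since the infarct file may not be at hand, this sketch rebuilds vectors with the same cell counts as in the table above; the vector names grp and sex are chosen for illustration.

```r
grp <- rep(c("control", "infarct"), times = c(40, 40))
sex <- c(rep(c(1, 2), times = c(25, 15)),    # controls: 25 coded 1, 15 coded 2
         rep(c(1, 2), times = c(28, 12)))    # infarct:  28 coded 1, 12 coded 2
tab <- table(grp, sex)
addmargins(tab)
rowSums(prop.table(tab, margin = 1))         # each row sums to 1
colSums(prop.table(tab, margin = 2))         # each column sums to 1
sum(prop.table(tab))                         # overall proportions sum to 1
```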
88
hands-on example
Load the dataset from the file bdendo.csv into the workspace. Generate a table of the variables d (case-control status) and dur (categorical duration of oestrogen therapy). Generate a table of the variables d (case-control status) and agegr (age group). Compare the two tables.
R
Installing packages,
the package "pROC"
tutorial 5
R packages
90
R consists of a base level of functionality together with a set of contributed libraries which provide extended capabilities.
The key idea is that of a package which provides a related set of software components, documentation and data sets.
Packages can be installed into R. Depending on where the package library is located, this may need administrator rights.
91
Package: pROC
Type: Package
Title: display and analyze ROC curves
Version: 1.7.1
Date: 2014-02-20
Encoding: UTF-8
Depends: R (>= 2.13)
Imports: plyr, utils, methods, Rcpp (>= 0.10.5)
Suggests: microbenchmark, tcltk, MASS, logcondens, doMC,
doSNOW
LinkingTo: Rcpp
Author: Xavier Robin, Natacha Turck, Alexandre Hainard,
Natalia Tiberti, Frédérique Lisacek, Jean-Charles Sanchez
and Markus Müller.
Maintainer: Xavier Robin
pROC – diagnostic testing
installing packages
92
You can install R packages using the install.packages() command.
> install.packages("pROC")
Installing package(s) into
‘C:/Users/Amke/Documents/R/win-library/2.15’
(as ‘lib’ is unspecified)
downloaded 827 Kb
package ‘pROC’ successfully unpacked and MD5 sums checked
The downloaded binary packages are in
C:\Users\Amke\AppData\Local\Temp\RtmpUJPoia\downloaded_packages
93
Installing R packages using the menu:
94
using installed packages
95
When R is running, simply type:
> library(pROC)
This adds the R functions in the library to the search path. You can now use the functions and datasets in the package and inspect the documentation.
96
cite packages
To cite the package pROC in publications use:
> citation("pROC")
...
Xavier Robin, Natacha Turck, Alexandre Hainard,
Natalia Tiberti, Frédérique
Lisacek, Jean-Charles Sanchez and Markus Müller
(2011). pROC: an open-source
package for R and S+ to analyze and compare ROC
curves. BMC Bioinformatics, 12,
p. 77. DOI: 10.1186/1471-2105-12-77
<http://www.biomedcentral.com/1471-2105/12/77/>
...
97
package pROC
The main function is roc(response, predictor). It creates the values necessary for an ROC curve.
response: disease status (as provided by gold standard)
predictor: continuous test result
(to be dichotomized)
For an roc object the plot(roc_obj) function produces an ROC curve.
98
package pROC
The function coords(roc_obj,x,best.method,ret) calculates measures of test performance.
x: value at which the measures are calculated (by default interpreted as a threshold); x="best" gives the optimal threshold
best.method: if x="best", the method to determine the best threshold (e.g. "youden")
ret: Measures calculated. One or more of "threshold", "specificity", "sensitivity", "accuracy", "tn" (true negative count), "tp" (true positive count), "fn" (false negative count), "fp" (false positive count), "npv" (negative predictive value), "ppv" (positive predictive value)
(default: threshold, specificity, sensitivity)
99
example data: aSAH
name label
gos6 Glasgow Outcome Score (GOS) at 6 months 1-5
outcome prediction of development 'good', 'poor' (to be diagnosed)
gender sex 'male', 'female'
age age years
wfns World Federation of Neurological Surgeons Score 1-5
s100b S100 calcium binding protein B μg/l (biomarker, continuous test result)
ndka Nucleoside diphosphate kinase A μg/l (biomarker, continuous test result)
aneurysmal subarachnoid haemorrhage
100
> data(aSAH) # loads the data set "aSAH"
> head(aSAH)
> rocobj = roc(aSAH$outcome, aSAH$s100b)
> plot(rocobj)
> coords(rocobj, 0.55)
threshold specificity sensitivity
0.5500000 1.0000000 0.2682927
> coords(rocobj, x="best",best.method="youden")
threshold specificity sensitivity
0.2050000 0.8055556 0.6341463
# Youden threshold is 0.205, with the corresponding specificity and sensitivity
package pROC
Measures of Test Performance
Outcomes of a diagnostic study for a dichotomous test result
                  test result positive   test result negative
disease present   true positive          false negative
disease absent    false positive         true negative
102
> coords(rocobj,x="best",best.method="youden",
ret=c("threshold","specificity","sensitivity",
"tn","tp","fn","fp"))
threshold specificity sensitivity
0.2050000 0.8055556 0.6341463
tn tp fn fp
58.0000000 26.0000000 15.0000000 14.0000000
package pROC
                  test result positive   test result negative
disease present   tp: 26                 fn: 15
disease absent    fp: 14                 tn: 58
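Sensitivity and specificity follow directly from the four counts returned by coords(), which gives a quick consistency check. The sketch below uses only base R; the predictive values and the Youden index (sensitivity + specificity - 1) are computed with their standard definitions.

```r
tp <- 26; fn <- 15; fp <- 14; tn <- 58     # counts from the coords() output
sensitivity <- tp / (tp + fn)              # diseased with positive test: 26 / 41
specificity <- tn / (tn + fp)              # non-diseased with negative test: 58 / 72
ppv <- tp / (tp + fp)                      # positive predictive value
npv <- tn / (tn + fn)                      # negative predictive value
youden <- sensitivity + specificity - 1    # Youden index
c(sensitivity = sensitivity, specificity = specificity, youden = youden)
```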
R
Statistical testing 1
tutorial 6
statistical test functions
104
name function
t.test( ) Student's t-test
wilcox.test( ) Wilcoxon rank sum test and signed rank test
ks.test( ) Kolmogorov-Smirnov test
chisq.test( ) Pearson's chi-squared test for count data
mcnemar.test( ) McNemar test
105
One sample t test
The function t.test() performs different Student's t tests.
Parameters for the one sample t test are t.test(x,mu,alternative)
x: numeric vector of values which shall be tested
(assumed to follow a normal distribution)
mu: reference value µ0
alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than µ0),
"greater" (alternative: expectation of x is larger than µ0)
Blood Sugar Level and Myocardial Infarction
H0: µ ≤ µ0 HA: µ > µ0
A study was carried out to assess whether the expected blood sugar level (BSL) of patients with myocardial
infarction µ is higher than the expected BSL of control
individuals, namely µ0=100 mg/100ml.
107
example data: infarct
(case/control study)
name label
nro identifier
grp group (string) 'control', 'infarct'
code coded group 0 'control', 1 'infarct'
sex sex 1 'male', 2 'female'
age age years
height body height cm
weight body weight kg
blood sugar blood sugar level mg/100ml
diabet diabetes 0 'no', 1 'yes'
chol cholesterol level mg/100ml
trigl triglyceride level mg/100ml
cig cigarettes number of
108
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv",sep=";",
dec=",", header=TRUE)
> summary(mi$blood.sugar)
> summary(as.factor(mi$code))
> bloods_infarct=mi$blood.sugar[mi$code==1]
# Attention: two "="s!
# Extracts the blood sugar levels of only the cases.
> summary(bloods_infarct)
One sample t test
109
>t.test(bloods_infarct,mu=100,alternative="greater")
One Sample t-test
data: bloods_infarct
t = -0.7824, df = 39, p-value = 0.7807
alternative hypothesis: true mean is greater than 100
95 percent confidence interval:
90.14572 Inf
sample estimates:
mean of x
96.875
# Blood sugar level of infarct patients is not
# significantly higher than 100 mg/100ml.
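The t statistic and the one-sided p-value reported by t.test() can be recomputed from the standard one sample formula t = (mean - µ0) / (sd / sqrt(n)). A sketch with made-up measurements, since the infarct file may not be available:

```r
x <- c(96, 104, 88, 110, 92, 99, 101, 85)       # made-up measurements
mu0 <- 100                                      # reference value
n <- length(x)
t_stat <- (mean(x) - mu0) / (sd(x) / sqrt(n))   # t statistic by hand
p_greater <- 1 - pt(t_stat, df = n - 1)         # p-value for alternative "greater"
res <- t.test(x, mu = mu0, alternative = "greater")
c(by_hand = t_stat, t.test = unname(res$statistic))
c(by_hand = p_greater, t.test = res$p.value)
```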
One sample t test
110
hands-on example
Load the dataset from the file infarct data.csv into the workspace. Perform a two-sided one-sample t-test for cholesterol level in infarct patients. The reference value for the population is 180 mg/100ml. What is the result of the test?
R
Statistical testing 2
tutorial 7
statistical test functions
112
name function
t.test( ) Student's t-test
wilcox.test( ) Wilcoxon rank sum test and signed rank test
ks.test( ) Kolmogorov-Smirnov test
chisq.test( ) Pearson's chi-squared test for count data
mcnemar.test( ) McNemar test
The function t.test() performs different Student's t tests.
Parameters for the two sample t test are:
t.test(x, y, alternative, var.equal)
x, y: numeric vectors of values which shall be compared
(assumed to follow a normal distribution)
alternative: "two.sided" (two sided alternative, default), "less" (alternative: expectation of x is less than expectation of y), "greater" (alternative: expectation of x is larger than expectation of y)
var.equal: Are the variances of x and y assumed to be equal? TRUE gives the t test of the lecture (pooled variance); FALSE (default) gives the Welch test, which does not assume equal variances.
113
Two sample t test
The function wilcox.test() performs the Wilcoxon rank sum test and the Wilcoxon signed rank test.
Parameters for the Wilcoxon rank sum test are: wilcox.test(x, y, alternative)
x, y: numeric vectors of values which shall be compared
(need not follow a normal distribution)
alternative: similar to t.test
114
Wilcoxon rank sum test
Blood Sugar Level and Myocardial Infarction
H0: µ1 ≤ µ2 HA: µ1 > µ2
A case-control study was carried out to assess whether the expected blood sugar level (BSL) of patients with
myocardial infarction µ1 is higher than the expected BSL of control individuals µ2.
116
example data: infarct
(case/control study)
name label
nro identifier
grp group (string) 'control', 'infarct'
code coded group 0 'control', 1 'infarct'
sex sex 1 'male', 2 'female'
age age years
height body height cm
weight body weight kg
blood sugar blood sugar level mg/100ml
diabet diabetes 0 'no', 1 'yes'
chol cholesterol level mg/100ml
trigl triglyceride level mg/100ml
cig cigarettes number of
117
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
dec=",", header=TRUE)
> summary(mi$blood.sugar)
> summary(as.factor(mi$code))
> bloods_infarct=mi$blood.sugar[mi$code==1]
> bloods_control=mi$blood.sugar[mi$code==0]
# Extracts the blood sugar levels of the cases
# and of the controls.
Two sample t test
118
> t.test(bloods_infarct, bloods_control,
var.equal=TRUE, alternative="greater")
Two Sample t-test
data: bloods_infarct and bloods_control
t = 0.0305, df = 78, p-value = 0.4879
alternative hypothesis: true difference in means is
greater than 0
95 percent confidence interval:
-13.39077 Inf
sample estimates:
mean of x mean of y
96.875 96.625
# Expected BSL of infarct patients is not
# significantly higher than expected BSL of controls.
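The argument var.equal changes which two sample test is carried out, which is visible in the degrees of freedom. The sketch below uses two made-up groups and compares the pooled-variance test with the default Welch test.

```r
x <- c(96, 104, 88, 110, 92, 99)            # made-up group 1
y <- c(90, 95, 102, 87, 93, 98, 91)         # made-up group 2
pooled <- t.test(x, y, var.equal = TRUE)    # classical two sample t test
welch  <- t.test(x, y, var.equal = FALSE)   # the default
pooled$parameter    # df = length(x) + length(y) - 2 = 11
welch$parameter     # Welch df, usually not an integer and at most 11 here
welch$method        # name of the test that was carried out
```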
Two sample t test
119
> wilcox.test(bloods_infarct, bloods_control,
alternative="greater")
Wilcoxon rank sum test with continuity correction
data: bloods_infarct and bloods_control
W = 867.5, p-value = 0.2576
alternative hypothesis: true location shift is greater
than 0
# The Wilcoxon test can be applied if the BSL does not
# follow a normal distribution. Then the t test is not
# valid.
Wilcoxon rank sum test
120
Pearson's chi-squared test
The function chisq.test() performs a Pearson's chi-squared test for count data.
chisq.test(x)
x: n x m table (matrix) to be tested
121
example data: low birth weight
name text variable type
low low birth weight of the baby nominal: 0 'no >=2500g' 1 'yes <2500g'
age age of mother continuous: years
lwt mother's weight at last period continuous: pounds
race ethnicity nominal: 1 'white' 2 'black' 3 'other'
smoke smoking status nominal: 0 'no' 1 'yes'
ptl premature labor discrete: number of
ht hypertension nominal: 0 'no' 1 'yes'
ui presence of uterine irritability nominal: 0 'no' 1 'yes'
ftv physician visits in first trimester discrete: number of
bwt birthweight of the baby continuous: g
122
> library(MASS)
> tab_bw_smok=table(birthwt$low, birthwt$smoke)
> tab_bw_smok
0 1
0 86 44
1 29 30
> chisq.test(tab_bw_smok)
Pearson's Chi-squared test with Yates'
continuity correction
data: tab_bw_smok
X-squared = 4.2359, df = 1, p-value = 0.03958
# The probability of having a baby with low birth
# weight is significantly higher for smoking mothers.
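The test statistic compares observed counts with the counts expected under independence, row total × column total / n. This sketch enters the table from the output above as a matrix and recomputes the uncorrected statistic by hand; correct = FALSE switches off Yates' continuity correction.

```r
tab <- matrix(c(86, 29, 44, 30), nrow = 2,    # filled column by column
              dimnames = list(low = c("0", "1"), smoke = c("0", "1")))
res <- chisq.test(tab, correct = FALSE)       # without continuity correction
res$expected                                  # expected counts under independence
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
by_hand <- sum((tab - expected)^2 / expected) # Pearson statistic by hand
by_hand
res$statistic                                 # same value
chisq.test(tab)$statistic                     # with the correction, as in the output above
```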
Pearson's chi-squared test
123
hands-on example
loading: library(mlbench), the dataframe is called PimaIndiansDiabetes2. Plot a histogram of the variable insulin. Compare the insulin values between cases and controls (variable diabetes) using an appropriate test.
R
Correlation and linear regression,
low level graphics
tutorial 8
125
Correlation
The function cor(x, y, method) computes the correlation between two paired random variables.
x, y: numeric vectors of values for which the correlation shall be calculated (must have the same length)
method: "pearson", "spearman" or "kendall"
126
Test of correlation
The function cor.test(x, y, alternative, method) tests for correlation between paired random variables.
x, y: numeric vectors of values for which the correlation shall be tested (must have the same length)
alternative:
"two.sided" (alternative: correlation coefficient ≠ 0,
default),
"less" (alternative: negative correlation),
"greater" (alternative: positive correlation)
method: "pearson", "spearman" or "kendall"
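As a small illustration of both functions, the sketch below uses the data set women that ships with R (height and weight of 15 women). It is only a stand-in for the tutorial data; the object names x, y and res are chosen here.

```r
# 'women' ships with R (package datasets); used only as a stand-in.
x <- women$height
y <- women$weight

# Correlation coefficients for the three available methods
sapply(c("pearson", "spearman", "kendall"),
       function(m) cor(x, y, method = m))

# One-sided test of positive correlation
res <- cor.test(x, y, alternative = "greater", method = "pearson")
res$estimate   # sample correlation coefficient
res$p.value    # p-value of the test
res$conf.int   # confidence interval (reported for Pearson only)
```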
127
Linear regression (simple)
The function lm(formula, data) fits a linear model to data.
formula: y~x with y response variable and x explanatory variable (must have the same length)
data: the dataframe containing x and y; optional if the formula already names the vectors in full (e.g. df$y~df$x)
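A minimal sketch of the data argument, again on the built-in data set women as a stand-in for the tutorial data. Writing the formula with plain column names plus data= keeps the coefficient names short and lets predict() work with new values.

```r
# Built-in data set used as a stand-in for the tutorial data
fit <- lm(weight ~ height, data = women)

coef(fit)      # intercept a and slope b
summary(fit)   # standard errors, t tests, R-squared

# Predicted weight for new heights (values chosen for illustration)
predict(fit, newdata = data.frame(height = c(60, 70)))
```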
128
example data: infarct
(case/control study)
name          label                  coding / unit
nro           identifier
grp           group (string)         'control', 'infarct'
code          coded group            0 'control', 1 'infarct'
sex           sex                    1 'male', 2 'female'
age           age                    years
height        body height            cm
weight        body weight            kg
blood sugar   blood sugar level      mg/100ml
diabet        diabetes               0 'no', 1 'yes'
chol          cholesterol level      mg/100ml
trigl         triglyceride level     mg/100ml
cig           cigarettes             number of
129
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
dec=",", header=TRUE)
> plot(x=mi$height, y=mi$weight)
> cor(mi$height, mi$weight, method="pearson")
[1] 0.6307697
> cor(mi$height, mi$weight, method="spearman")
[1] 0.6281738
Correlation
130
Correlation [figure: scatter plot of body weight against body height]
131
> cor.test(mi$height,mi$weight,method="pearson")
Pearson's product-moment correlation
data: mi$height and mi$weight
t = 7.1792, df = 78, p-value = 3.586e-10
alternative hypothesis: true correlation is not
equal to 0
95 percent confidence interval:
0.4771865 0.7469643
sample estimates:
cor
0.6307697
# Significant correlation between body height and
# body weight
Correlation
132
> lm(mi$weight~mi$height)
Call:
lm(formula = mi$weight ~ mi$height)
Coefficients:
(Intercept) mi$height
-51.2910 0.7477
# Y = a + b × x + E
# with Y: body weight, x: body height,
# a = -51.29, b = 0.75
Linear regression
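To read the fitted line as a formula, the coefficients can be plugged in by hand. The sketch below copies the two estimates from the output above; the function name expected_weight is chosen here for illustration.

```r
# Coefficients copied from the lm() output above
a <- -51.2910   # intercept
b <- 0.7477     # slope: kg per cm of body height

# Fitted regression line: expected weight (kg) for a given height (cm)
expected_weight <- function(height_cm) a + b * height_cm

expected_weight(180)                          # 83.295
expected_weight(181) - expected_weight(180)   # equals the slope b
```

Each additional centimetre of body height corresponds to about 0.75 kg more expected body weight.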
graphics
133
R has extensive graphics facilities.
Graphics functions are divided into
high-level graphics functions
low-level graphics functions
The quality of the graphs produced by R is often cited as a major reason for using it in preference to other statistical software systems.
low-level graphics
134
Plots produced by high-level graphics facilities can be modified by low-level graphics commands.
135
low-level functions
name                      function
points(x, y)              adds points (the option type= can be used)
lines(x, y)               adds lines (the option type= can be used)
text(x, y, labels, ...)   adds text given by labels at coordinates (x,y); a typical use is: plot(x, y, type="n"); text(x, y, names)
abline(a, b)              draws a line of slope b and intercept a
abline(h=y)               draws a horizontal line at ordinate y
abline(v=x)               draws a vertical line at abscissa x
rect(x1, y1, x2, y2)      draws a rectangle whose left, right, bottom, and top limits are x1, x2, y1, and y2, respectively
polygon(x, y)             draws a polygon with coordinates given by x and y
title( )                  adds a title and optionally a sub-title
136
> plot(x=mi$height, y=mi$weight)
> abline(a=-51.29, b=0.75, col="blue")
# Adds the regression line to the scatter plot.
> title("Regression of weight and height")
> text(x=185, y=65, labels="Kieler Woche",
col="green")
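Instead of typing rounded coefficients, a fitted lm object can be passed to abline() directly. The sketch below shows this on the built-in data set women (a stand-in for the tutorial data) and adds a few more low-level elements.

```r
fit <- lm(weight ~ height, data = women)   # built-in stand-in data

plot(women$height, women$weight, xlab = "height", ylab = "weight")
abline(fit, col = "blue")                   # regression line taken from the model
abline(h = mean(women$weight), lty = 2)     # horizontal line at the mean weight
abline(v = mean(women$height), lty = 2)     # vertical line at the mean height
points(mean(women$height), mean(women$weight),
       pch = 19, col = "red")               # the point of means lies on the line
title("Regression of weight on height")
```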
low-level functions
137
[Figure: scatter plot of weight against height with the blue regression line, the title, and the green text label "Kieler Woche"]
138
hands-on example
Load the dataset from the file correlation.csv into the workspace. Calculate the Pearson correlation coefficient between the variables x and y and test whether this coefficient is significantly different from 0. Generate a scatter plot.
R
Regression models
tutorial 9
140
Linear regression (simple)
The function lm(formula, data) fits a linear model to data.
formula: y~x with y response variable and x explanatory variable (must have the same length)
data: the dataframe containing x and y; optional if the formula already names the vectors in full (e.g. df$y~df$x)
141
Linear regression (multiple)
The function lm(formula, data) fits a linear model to data.
formula: y~x1+x2+…+xk with y response variable and
x1,…,xk explanatory variables
(must have the same length)
data: the dataframe containing x1,…,xk and y; optional if the formula already names the vectors in full
142
Generalised linear model
The function glm(formula, family) fits a generalised linear model to data.
formula: y~x1+x2+…+xk with y response variable and
x1,…,xk explanatory variables
(must have the same length)
family: specifies the error distribution and the link function; choose family=binomial for logistic regression
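A minimal logistic regression sketch on the built-in data set mtcars, where am is a 0/1 variable. The data set is only a stand-in for a binary response and has nothing to do with the infarct study.

```r
# 'mtcars' ships with R; am (0/1) serves as a stand-in binary response.
model <- glm(am ~ wt, data = mtcars, family = binomial)

summary(model)$coefficients   # estimates on the log-odds scale

# exp() of the coefficients: odds ratio per unit of wt
# (and the baseline odds for the intercept)
exp(coef(model))

# Predicted probabilities for two illustrative values of wt
prob <- predict(model, newdata = data.frame(wt = c(2.5, 3.5)),
                type = "response")
prob
```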
143
example data: infarct
(case/control study)
name          label                  coding / unit
nro           identifier
grp           group (string)         'control', 'infarct'
code          coded group            0 'control', 1 'infarct'
sex           sex                    1 'male', 2 'female'
age           age                    years
height        body height            cm
weight        body weight            kg
blood sugar   blood sugar level      mg/100ml
diabet        diabetes               0 'no', 1 'yes'
chol          cholesterol level      mg/100ml
trigl         triglyceride level     mg/100ml
cig           cigarettes             number of
144
> setwd("C:/Users/Präsentation/MLS")
> mi = read.table("infarct data.csv", sep=";",
dec=",", header=TRUE)
> model_mi=glm(mi$code~mi$sex+mi$age+
mi$height+mi$weight+mi$blood.sugar+mi$diabet
+mi$chol+mi$trigl+mi$cig,family=binomial)
> summary(model_mi)
Generalised linear model
145
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -34.60297 12.51757 -2.764 0.005704 **
mi$sex 0.23048 0.90885 0.254 0.799810
mi$age 0.10734 0.04161 2.580 0.009883 **
mi$height 0.14930 0.07838 1.905 0.056799 .
mi$weight -0.11508 0.06304 -1.826 0.067916 .
mi$blood.sugar -0.02246 0.01399 -1.605 0.108425
mi$diabet 2.05732 2.15947 0.953 0.340743
mi$chol 0.07294 0.02188 3.334 0.000855 ***
mi$trigl -0.01936 0.01227 -1.578 0.114638
mi$cig 0.07686 0.04695 1.637 0.101603
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Generalised linear model
146
> model_mi=glm(mi$code~mi$age+mi$chol,family=binomial)
> summary(model_mi)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -16.13858 3.78005 -4.269 1.96e-05 ***
mi$age 0.08404 0.03255 2.582 0.009827 **
mi$chol 0.05564 0.01569 3.546 0.000391 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# Model after backward selection
Generalised linear model
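The slides do not show how the backward selection was carried out. One automated option in R is step(), sketched below on the birthwt data from MASS used earlier. Note the assumption: step() selects by AIC, not by p-values, so it may keep different variables than a manual p-value based procedure. The choice of explanatory variables here is for illustration only.

```r
library(MASS)   # provides the birthwt data; step() itself is in stats

full <- glm(low ~ age + lwt + smoke + ht + ui,
            data = birthwt, family = binomial)

# Backward elimination by AIC (not by p-values); trace = 0 hides the log
reduced <- step(full, direction = "backward", trace = 0)

formula(reduced)   # terms that were kept
AIC(full)
AIC(reduced)       # the same as or lower than the AIC of the full model

# p-value based alternative: likelihood-ratio test for dropping each term
drop1(full, test = "Chisq")
```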
147
hands-on example
Load the package with library(mlbench); the dataframe is called PimaIndiansDiabetes2. Perform a linear regression with the variable insulin as response and the variables glucose, pressure, mass and triceps as explanatory variables. Apply backward selection to generate a reduced model.
R
Survival analysis
tutorial 10
149
Survival object
Before any analysis, a survival object has to be created with the function Surv(time, event).
time: survival time, measured since the start of the study
if the event occurred: time of the event
if no event occurred: last observation time
event:
1: event
0: no event
Important: has to be numeric.
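A minimal sketch of building a survival object. The five patients below are hypothetical and serve only to show the coding; the names time, event, status_chr, s and s2 are chosen here.

```r
library(survival)

# Hypothetical follow-up data for five patients (illustration only)
time  <- c(5, 8, 12, 3, 9)    # time since start of study
event <- c(1, 0, 1, 1, 0)     # 1 = event observed, 0 = no event (censored)

s <- Surv(time = time, event = event)
s          # censored observations are printed with a trailing "+"
class(s)   # "Surv"

# A character coding is converted to 0/1 before calling Surv()
status_chr <- c("dead", "alive", "dead", "dead", "alive")
s2 <- Surv(time, as.numeric(status_chr == "dead"))
```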
150
Survival curves
The function survfit(formula, data) estimates a survival curve. Plot the result with plot().
formula: Let y be a Surv object.
y~1 for a Kaplan-Meier curve
y~x for several Kaplan-Meier curves stratified by x
data: the dataframe containing x and y; optional if the formula already names the vectors in full
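Both formula variants can be tried on the data set aml, which ships with the survival package (time, status with 1 = event, and the group variable x). This is a sketch on stand-in data, not the tutorial's survival.csv.

```r
library(survival)

# 'aml' ships with the survival package: time, status (1 = event,
# 0 = censored) and x (Maintained / Nonmaintained chemotherapy).
km_all   <- survfit(Surv(time, status) ~ 1, data = aml)   # one curve
km_group <- survfit(Surv(time, status) ~ x, data = aml)   # one curve per group

km_group   # n, number of events and median survival per group

plot(km_group, lty = 1:2, xlab = "time", ylab = "survival probability")
legend("topright", legend = names(km_group$strata), lty = 1:2)
```

Taking the legend labels from names(km_group$strata) keeps them in the same order as the plotted curves.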
151
Log-Rank Test
The function survdiff(formula, rho, data) tests if there is a difference between two or more survival curves.
formula: y~x with
y: Surv object
x: group or stratifying variable
rho: a scalar parameter that controls the type of test.
rho=0 (default) for the Log-Rank Test (Modification of test in lecture)
data: the dataframe containing x and y; optional if the formula already names the vectors in full
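A sketch of the test on the aml data shipped with the survival package (stand-in data). The p-value is recomputed from the chi-squared statistic so that it is available as a number; the names res and p_value are chosen here.

```r
library(survival)

# Log-rank test on the 'aml' data shipped with the survival package
res <- survdiff(Surv(time, status) ~ x, data = aml, rho = 0)
res

# p-value from the chi-squared statistic; df = number of groups - 1
p_value <- pchisq(res$chisq, df = length(res$n) - 1, lower.tail = FALSE)
p_value

# rho = 1 gives the Peto & Peto modification of the Gehan-Wilcoxon test,
# which puts more weight on early differences
survdiff(Surv(time, status) ~ x, data = aml, rho = 1)
```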
152
example data: survival
name      label
therapy   two chemotherapies: C1 and C2
time      if death occurred: time of death; if no death occurred: last observation time
event     1: death, 0: no death
153
> install.packages("survival")
> library(survival)
> setwd("C:/Users/Präsentation/MLS")
> cancer=read.table("survival.csv",dec=",",sep=";",
header=TRUE)
> head(cancer)
> surv_object=Surv(time=cancer$time,
event=cancer$event)
> curve=survfit(surv_object~1)
> summary(curve)
> plot(curve)
# One Kaplan-Meier curve for both therapies combined
# (with confidence bands)
Survival analysis
154
Survival analysis [figure: Kaplan-Meier curve for both therapies combined, with confidence bands]
155
> surv_object=Surv(time=cancer$time,
event=cancer$event)
> curve=survfit(surv_object~cancer$therapy)
> summary(curve)
> plot(curve,lty=1:2)
> legend("topright",levels(factor(cancer$therapy)),lty=1:2)
# Two Kaplan-Meier curves, one for each therapy
# group (without confidence bands)
Survival analysis
156
Survival analysis [figure: two Kaplan-Meier curves, one for each therapy group]
157
> survdiff(surv_object~cancer$therapy,rho=0)
Call:
survdiff(formula = surv_object ~ cancer$therapy, rho = 0)
N Obs Expected (O-E)^2/E (O-E)^2/V
cancer$therapy=C1 10 6 4.07 0.919 1.56
cancer$therapy=C2 10 6 7.93 0.471 1.56
Chisq= 1.6 on 1 degrees of freedom, p= 0.211
# No significant difference between the
# survival functions of the two therapies
Survival analysis
158
hands-on example
Load the package with library(survival); the dataframe is called retinopathy. In this dataframe, futime is the time variable and status is the event variable. Plot a Kaplan-Meier curve.