
Upload: nicholas-bruce

Post on 28-Dec-2015


Introduction to R, Statistics, and the grammar of graphics Thomas INGICCO

E. Delacroix, Dante et Virgile aux Enfers (Dante and Virgil in Hell)

… but above all, it is a language with its own grammar, made of:

Classes – how you present your data
- Vector – series of values of 1 dimension
- Matrix – series of values of 2 dimensions
- Array – series of values of n dimensions
- Data Frame – series of values in columns
- List – series of objects
- Table – contingency table

# Array
ee <- array(1:4, dim=c(2, 3, 2))
ee <- array(1:4, c(2, 3, 2))
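As a minimal sketch of what the array call above produces, using base R only: the four values are recycled to fill all twelve cells, column by column, and layers or single cells can be read back with three indices.

```r
# array() fills the cells column by column and recycles the data
# when it is shorter than the number of cells (here 4 values for 12 cells).
ee <- array(1:4, dim = c(2, 3, 2))

dim(ee)       # the three dimensions: 2 rows, 3 columns, 2 layers
length(ee)    # total number of cells: 12
ee[, , 1]     # first layer, a 2 x 3 matrix
ee[2, 3, 2]   # one cell (row 2, column 3, layer 2): 4
```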


# A 10 x 2 matrix of coordinates, filled row by row
Data3d1 <- matrix(c(0.72,100.32,0.75,100.36,0.77,100.32,0.81,100.32,0.77,100.29,0.77,100.24,0.73,100.28,0.7,100.26,0.7,100.3,0.67,100.33), 10, 2, byrow=TRUE)
colnames(Data3d1) <- c("x", "y")
rownames(Data3d1) <- paste("Lan", 1:10, sep="")
t(Data3d1) # Transposition

Data3d2 <- Data3d1

# Stack the two matrices as the two layers of a 10 x 2 x 2 array
array(cbind(Data3d1, Data3d2), dim=c(10, 2, 2))
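The stacking step above can be checked on a smaller case. The sketch below uses two hypothetical 3 x 2 matrices (`m1` and `m2` are illustrative names, not objects from the course) and verifies that each layer of the resulting array is one of the original matrices.

```r
# Two small matrices with the same shape (hypothetical values)
m1 <- matrix(1:6, nrow = 3, ncol = 2)
m2 <- m1 * 10

# cbind() puts the columns side by side; array() then cuts them into layers
stacked <- array(cbind(m1, m2), dim = c(3, 2, 2))

stacked[, , 1]   # same values as m1
stacked[, , 2]   # same values as m2
```

This works because R stores matrices column by column, so the first six values of `cbind(m1, m2)` are exactly the cells of `m1`.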


# List
ff <- list(aa, bb, cc, dd)


# Table
hh <- table(gg)
hh <- table(gg, dd[1:6,11])


hhh <- data.frame(gg, dd[1:6,11])
colnames(hhh) <- c("gg","Lip") # Rename the columns
hhhh <- table(hhh)

data.frame(gg, na.omit(dd[1:6,11])) # Function na.omit drops missing values
data.frame(gg, na.omit(dd[1:7,11]))

dim(hhhh) # Number of rows and columns
dimnames(hhhh)


margin.table(hhhh) # Calculate the margins (grand total)
margin.table(hhhh, 1) # Row totals
margin.table(hhhh, 2) # Column totals

hhhh[3,] <- c(1000,2000) # Replace row 3

cbind(hhhh,hhh) # Concatenate the columns of two tables

t(hhhh) # Transposition
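Since `dd` is not shown in this transcript, here is a self-contained sketch of the same table operations. The `gg` vector is the one defined in the course; the `lip` vector is a hypothetical stand-in for `dd[1:6,11]`.

```r
gg  <- rep(c("Everted", "Round", "Flat"), c(1, 2, 3))
lip <- c("Thin", "Thick", "Thin", "Thick", "Thick", "Thin")  # hypothetical values

tab <- table(gg, lip)   # contingency table: rim shape by lip type

dim(tab)                # 3 rows, 2 columns
margin.table(tab)       # grand total
margin.table(tab, 1)    # row totals (one per rim shape)
margin.table(tab, 2)    # column totals (one per lip type)
t(tab)                  # transposed table, 2 rows and 3 columns
```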


# Factor
gg <- rep(c("Everted", "Round", "Flat"), c(1,2,3))
is.vector(gg)
is.character(gg)
gg1 <- factor(gg)
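To see what the conversion to a factor changes, the following sketch (base R only, reusing the `gg` vector above) inspects the levels and the integer codes that R stores behind the labels.

```r
gg  <- rep(c("Everted", "Round", "Flat"), c(1, 2, 3))
gg1 <- factor(gg)

is.character(gg)   # TRUE: gg is a plain character vector
is.factor(gg1)     # TRUE: gg1 is a categorical variable
levels(gg1)        # the categories, sorted alphabetically by default
as.integer(gg1)    # the integer code stored for each value
table(gg1)         # number of observations per category
```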

Data table: individuals (rows) by variables (columns)

         1     …   j     …   p
   1     x11   …   x1j   …   x1p
   …     …         …         …
   i     xi1   …   xij   …   xip
   …     …         …         …
   n     xn1   …   xnj   …   xnp

FAM      GEN        SP    ID           UNW   LNW    MTW    ATW    ANW   MDW    ADW    TL      NH     AGE
Gibbons  Hylobates  H.sp  1880_1167_D  7.11  7.74   10.99  9.26   8.16  9.42   9.59   188,31  10.37  A
Gibbons  Hylobates  H.sp  1880_1167_G  6.12  8.53   11.3   9.29   8.54  9.5    9.42   187,5   10.13  A
Gibbons  Hylobates  H.sp  1880_1170_D  6.18  9.72   10.81  8.91   7.69  8.05   8.78   177,24  8.94   A
Gibbons  Hylobates  H.sp  1880_1170_G  6.44  10.09  10.68  8.96   9.07  8.05   8.69   177,59  9.29   A
Gibbons  Hylobates  H.sp  1901_102_D   6.31  11.69  15.19  11.79  9.26  11.83  11.6   206,69  11.49  A
Gibbons  Hylobates  H.sp  1901_102_G   7.14  11.13  14.93  11.68  9.06  11.76  11.3   205,32  11.49  A

Different kinds of data

Continuous quantitative variable: length of dog calcaneum

{67.0 54.7 7.0 48.5 14.0 17.2 20.7 13.0 43.4 40.2 38.9 54.5 59.8 48.3 22.9 11.5 34.4 35.1 38.7 30.8 30.6 43.1 56.8 40.8 41.8 42.5 31.0 31.7 30.2 25.9 49.2 37.0 35.9 15.0 30.2 7.2 36.2 45.5 7.8 33.4 36.1 40.2 42.7 42.5 16.2 39.0 35.0 37.0 31.4 37.6 39.9 36.2 42.8 46.4 24.7 49.1 46.0 35.9 7.8 48.2 15.2 32.5 44.7 42.6 38.8 17.4 40.8 29.1 14.6 59.2}

Discrete quantitative variable: number of flakes per context

{1 0 3 3 0 0 1 1 0 0 1 1 0 2 2 1 0 1 0 0 1 3 0 0 0 2 0 2 5 0 0 0 0 1 1 0 0 0 1 0 0 1 4 0 2 2 1 2 2 2 1 1 0 2 0 0 1 0 4 2 0 0 2 3 1 1 1 0 0 1 0 0 2 0 0 0 2 2 0 0 1 0 2 2 0 1 0 3 3 0 2 0 2 2 3 0 3 1 0 0}

Qualitative variable: colour of the pot

{black, red, black, red, brown, brown, black, grey, red, black}
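One possible way to hold these three kinds of data in R is sketched below. It uses only the first few values of each series listed above; the object names (`calcaneum`, `flakes`, `colour`) are illustrative.

```r
# Continuous quantitative: first calcaneum lengths, stored as a numeric vector
calcaneum <- c(67.0, 54.7, 7.0, 48.5, 14.0, 17.2)

# Discrete quantitative: first ten flake counts, stored as integers
flakes <- c(1L, 0L, 3L, 3L, 0L, 0L, 1L, 1L, 0L, 0L)

# Qualitative: pot colours, stored as a factor
colour <- factor(c("black", "red", "black", "red", "brown",
                   "brown", "black", "grey", "red", "black"))

class(calcaneum)    # "numeric"
class(flakes)       # "integer"
class(colour)       # "factor"

summary(calcaneum)  # a numeric summary suits continuous data
table(flakes)       # counts per value suit discrete data
table(colour)       # counts per category suit qualitative data
```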

Descriptive and inferential statistics

Position parameters:
- Mean
- Mode
- Median

Dispersion parameters:
- Standard deviation
- Variance
- Maximum, minimum
- Coefficient of variation
- Covariance
- Coefficient of correlation


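Mean: we add up all the measurements and divide by the number of measurements.

$$\bar{X} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

sum(DATA[1:49,6]) / length(DATA[1:49,6]) # Mean computed by hand
mean(DATA[1:49,6]) # Same result with the built-in function
colMeans(DATA[1:49,6:11]) # Means of several columns at once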
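Because `DATA` is not included in this transcript, here is a self-contained sketch of the same three calls. The vector `x` holds the first five calcaneum lengths listed earlier; the matrix `m` is a hypothetical example.

```r
x <- c(67.0, 54.7, 7.0, 48.5, 14.0)   # first five calcaneum lengths

sum(x) / length(x)   # mean by hand: 38.24
mean(x)              # same value with the built-in function

# Column means of a small hypothetical matrix
m <- cbind(a = c(1, 2, 3), b = c(10, 20, 30))
colMeans(m)          # one mean per column
```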


Example: you are told that you have a serious illness for which the mean survival period is six months… Statistics interest you!


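The mode is the most frequent value.

The median splits the ordered sample in two: half of the values lie below it and half above it.

median(DATA[1:49,6])
quantile(DATA[1:49,6]) # Minimum, quartiles and maximum

min(DATA[1:49,6])
max(DATA[1:49,6])
range(DATA[1:49,6]) # Minimum and maximum together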
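Base R has no built-in function for the statistical mode (its `mode()` function reports the storage type of an object instead). A minimal sketch of one way to obtain it, using `table()` on the first ten flake counts listed earlier, alongside the median and range:

```r
flakes <- c(1, 0, 3, 3, 0, 0, 1, 1, 0, 0)   # first ten flake counts

# Mode: count each distinct value, then keep the most frequent one(s)
counts <- table(flakes)
mode_value <- as.numeric(names(counts)[as.vector(counts) == max(counts)])
mode_value          # 0, which occurs five times

median(flakes)      # 0.5: midway between the 5th and 6th ordered values
quantile(flakes)    # minimum, quartiles and maximum
range(flakes)       # 0 and 3
```

If several values tie for the highest count, `mode_value` contains all of them.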


Variance:
- We calculate the difference between every value and the mean
- We square this difference
- We sum the squared differences
- And we divide by the number of values minus one

$$\sigma^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2$$

The variance is (up to the n - 1 correction) the mean of the squared differences from the mean.


$$\sigma = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{X})^2}$$

The standard deviation is the square root of the variance.
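The variance and standard deviation formulas can be followed step by step in R. This is a minimal sketch on the first five calcaneum lengths listed earlier; it checks the hand computation against the built-in `var()` and `sd()`, which both use the n - 1 denominator.

```r
x <- c(67.0, 54.7, 7.0, 48.5, 14.0)   # first five calcaneum lengths

deviations <- x - mean(x)             # difference between every value and the mean
squared    <- deviations^2            # squared differences
variance   <- sum(squared) / (length(x) - 1)   # sum, divided by n - 1

variance          # variance by hand
var(x)            # same value with the built-in function

sqrt(variance)    # standard deviation by hand
sd(x)             # same value with the built-in function
```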


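$$cv = \frac{\sigma}{\bar{X}}$$

The coefficient of variation expresses the standard deviation relative to the mean, so it has no unit.

It makes it possible to compare the dispersion of two variables. Problem: when the mean is close to zero, it becomes useless.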
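Both properties can be illustrated with a short sketch, again on the first five calcaneum lengths. The rescaled vector `y` and the shifted vector `w` are hypothetical constructions used only to make the point.

```r
x <- c(67.0, 54.7, 7.0, 48.5, 14.0)   # first five calcaneum lengths

cv_x <- sd(x) / mean(x)

# Changing the unit (for example multiplying by 10) leaves the cv unchanged
y <- x * 10
cv_y <- sd(y) / mean(y)

# Shifting the data so that the mean is close to zero makes the cv explode
w <- x - mean(x) + 0.001
cv_w <- sd(w) / mean(w)

c(cv_x = cv_x, cv_y = cv_y, cv_w = cv_w)
```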


To measure the difference from the mean in standard-deviation units, we use:

$$z = \frac{x - \bar{X}}{\sigma}$$

This is the centred-reduced (standardized) variable, with mean = 0 and variance = 1.
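A minimal sketch of this standardization in R, on the first five calcaneum lengths, checks that the result has mean 0 and variance 1 and that base R's `scale()` gives the same values.

```r
x <- c(67.0, 54.7, 7.0, 48.5, 14.0)   # first five calcaneum lengths

z <- (x - mean(x)) / sd(x)   # centre, then divide by the standard deviation

mean(z)   # 0, up to floating-point rounding
var(z)    # 1

# scale() centres and reduces in one call; as.vector() drops its matrix shape
z_scaled <- as.vector(scale(x))
```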


Covariance measures the degree of dependence between two variables: do the values of each measurement drift away from the centre of gravity independently, or do they drift away together? If x and y are independent, then the covariance is equal to 0.

$$S_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{X}_x)(y_i - \bar{X}_y)}{n-1}$$

- We multiply each x-deviation from the mean by its associated y-deviation
- We sum these products
- We divide by the number of values minus one

So covariance is the averaged sum of the cross products.


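# Covariance by hand: (sum of x*y minus sum of x times sum of y over n), divided by n - 1
( sum(DATA[1:49,6] * DATA[1:49,7]) - prod(sum(DATA[1:49,6]),sum(DATA[1:49,7])) / length(DATA[1:49,6]) ) / ( length(DATA[1:49,6])-1 )

# Same result with the built-in function
cov(DATA[1:49,6],DATA[1:49,7])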
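The hand computation can be checked on a small case. The sketch below uses two hypothetical vectors `x` and `y` (not taken from `DATA`) and compares the definition by cross products, the shortcut form and the built-in `cov()`, all with the n - 1 denominator.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical values
y <- c(2, 4, 5, 4, 5)   # hypothetical values
n <- length(x)

# Definition: sum of the cross products of deviations, divided by n - 1
cov_def <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)

# Shortcut form: no explicit deviations needed
cov_short <- (sum(x * y) - sum(x) * sum(y) / n) / (n - 1)

cov_def      # 1.5
cov_short    # 1.5
cov(x, y)    # 1.5
```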


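Pearson's correlation coefficient differs from the covariance in that it has no unit and is bounded between -1 and 1.

$$r_{xy} = \frac{S_{xy}}{S_x \times S_y}$$

where S_x and S_y are the standard deviations of x and y.

cov(DATA[1:49,6],DATA[1:49,7]) / (sd(DATA[1:49,6]) * sd(DATA[1:49,7])) # Correlation by hand
cor(DATA[1:49,6],DATA[1:49,7]) # Same result with the built-in function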
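A short sketch on two hypothetical vectors (not taken from `DATA`) illustrates both claims: the ratio of the covariance to the product of the standard deviations matches `cor()`, and changing the unit of one variable changes the covariance but not the correlation.

```r
x <- c(1, 2, 3, 4, 5)   # hypothetical values
y <- c(2, 4, 5, 4, 5)   # hypothetical values

r_hand <- cov(x, y) / (sd(x) * sd(y))   # correlation by hand
r_hand
cor(x, y)                               # same value with the built-in function

# Rescaling x (for example metres to centimetres) multiplies the covariance...
cov(x * 100, y)
# ...but leaves the correlation unchanged
cor(x * 100, y)
```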