
PRESENTING AND SUMMARIZING DATA. CONFIDENCE INTERVALS

Jesús Piedrafita Arilla

Departament de Ciència Animal i dels Aliments

Experimental Design and Statistical Methods

Workshop

http://cran.r-project.org/index.html

Items

• Types of variables
• Numerical methods for presenting data: means, variance, skewness, kurtosis
• Estimation
– Point estimates
– Interval estimates
• Distributions: normal, t, chi-square
• Starting with R
– Website
– Objects, workspace
– Commands
  • Expressions
  • Assignments
• First operations in R
– Creating a vector
– Descriptive statistics
– Distributions
– Pie and bar charts
• Script

Variables

• Quantitative (numerical)

– Continuous (adult weight, percent of a fatty acid, … )

– Discrete -countable, finite or infinite- (number of colonies, litter size)

• Qualitative (categorical or classification)

– Ordinal (calving ease score, panel score)

– Nominal (gender, coat colour, …)

• Bar charts are preferable to pie charts.

Variable: Set of observations of a particular character

Data: Values of a variable

Summary of numerical methods for presenting data

Descriptive statistics

Measures of central tendency: arithmetic mean, median, mode

Measures of variability: range, variance, standard deviation, coefficient of variation

Measures of the shape of a distribution: skewness, kurtosis

Measures of relative position: percentiles, quartiles (Q1, Q2, Q3), z-values

Descriptive statistics attempt to describe the distribution of the data.

Measures of central tendency

Arithmetic mean

$\bar{y} = \frac{\sum_i y_i}{n}$ or $\bar{y} = \sum_i f_i y_i$

The second formula is for grouped data, $f_i$ being the proportion of each value $y_i$.

Median: value that is in the middle when observations are sorted from the smallest to the largest. Unlike the mean, it is robust to the presence of extreme values.

Mode: value among the observations that has the highest frequency
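The three measures can be compared on a small example. The following is a minimal sketch in R, assuming a hypothetical vector of litter sizes (not data from the workshop); it checks that the grouped-data formula gives the same mean as the plain one and shows how an extreme value moves the mean but not the median.

```r
# Hypothetical litter sizes with one extreme value (illustration only)
y <- c(8, 10, 10, 11, 11, 11, 12, 30)

# Plain arithmetic mean: sum of the values divided by n
mean_plain <- sum(y) / length(y)

# Grouped-data version: each distinct value weighted by its proportion f_i
f <- table(y) / length(y)          # proportion of each distinct value
values <- as.numeric(names(f))     # the distinct values themselves
mean_grouped <- sum(as.numeric(f) * values)

# The median is not pulled towards the extreme value 30
med <- median(y)

# R has no built-in function for the statistical mode; take the most frequent value
mode_value <- values[which.max(f)]

print(c(mean = mean_plain, grouped = mean_grouped, median = med, mode = mode_value))
```

Note that the helper names (mean_plain, mean_grouped, mode_value) are chosen for this sketch only.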

Measures of variability

Sample variance of n observations: $s^2 = \frac{\sum_i (y_i - \bar{y})^2}{n - 1}$. More variance indicates more dispersion.

Corrected sum of squares: $SS_{yy} = \sum_i (y_i - \bar{y})^2 = \sum_i y_i^2 - \frac{\left(\sum_i y_i\right)^2}{n}$

Range: difference between the maximum and the minimum values in a set of observations. Strongly affected by extreme values.

Sample standard deviation: $s = \sqrt{s^2}$. It keeps the unit of measurement of the raw data. Both the variance and the standard deviation are affected by extreme values.

Coefficient of variation: a relative, dimensionless measure of variability: $CV = \frac{s}{\bar{y}} \, (100\%)$
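These definitions can be checked step by step against R's built-in functions. A minimal sketch, assuming a short hypothetical vector of daily gains:

```r
# Hypothetical daily gains in kg/day (illustration only)
y <- c(1.2, 1.5, 1.4, 1.8, 1.6)
n <- length(y)
ybar <- mean(y)

# Corrected sum of squares, by definition and by the computational shortcut
ss_def <- sum((y - ybar)^2)
ss_short <- sum(y^2) - sum(y)^2 / n

# Sample variance (divisor n - 1), standard deviation and coefficient of variation
s2 <- ss_def / (n - 1)
s <- sqrt(s2)
cv <- s / ybar * 100

# var() and sd() use the same n - 1 divisor
print(c(s2 = s2, var = var(y), s = s, sd = sd(y), cv = cv))
```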

Degrees of freedom

The concept of degrees of freedom (df) is central to the principle of estimating statistics of populations from samples of them. In short, think of df as a mathematical restriction that we need to put in place when we calculate an estimate of one statistic from an estimate of another.

Let us see an example. Normal distributions need only two parameters (mean and standard deviation) for their definition. The population values of mean and standard deviation are referred to as $\mu$ and $\sigma$, respectively, and the sample estimates are $\bar{y}$ and $s$.

In order to estimate $\sigma$, we must first have estimated $\mu$. Thus, $\mu$ is replaced by $\bar{y}$ in the formula for $s$. At this point, we need to apply the restriction that the deviations must sum to zero. Thus, the degrees of freedom are n-1.

When this principle of restriction is applied to regression and analysis of variance, the general result is that you lose one degree of freedom for each parameter estimated prior to estimating the (residual) standard deviation.

Another way of thinking about the restriction principle behind degrees of freedom is to imagine contingencies. For example, imagine you have four numbers (a, b, c and d) that must add up to a total of m; you are free to choose the first three numbers at random, but the fourth must be chosen so that it makes the total equal to m. Thus your degrees of freedom are 3.
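The restriction that the deviations sum to zero can be seen directly. A minimal sketch with four hypothetical observations: once the first n - 1 deviations from the sample mean are known, the last one is fixed.

```r
# Four hypothetical observations (illustration only)
y <- c(4, 7, 9, 12)
dev <- y - mean(y)            # deviations from the sample mean

# The deviations always sum to zero (up to rounding error)
total <- sum(dev)

# So the last deviation is determined by the other n - 1
last_dev <- -sum(dev[-length(dev)])

print(dev)
print(c(sum = total, last = last_dev))
```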

Measures of the shape of a distribution

Skewness: measure of asymmetry of a frequency distribution. It is 0 for a symmetric distribution.

$sk = \frac{n}{(n-1)(n-2)} \sum_i \left(\frac{y_i - \bar{y}}{s}\right)^3$

Kurtosis: measure of flatness or steepness of a distribution, or a measure of the heaviness of the tails of a distribution. It is 0 for a normal distribution.

$kt = \frac{n(n+1)}{(n-1)(n-2)(n-3)} \sum_i \left(\frac{y_i - \bar{y}}{s}\right)^4 - \frac{3(n-1)^2}{(n-2)(n-3)}$

[Figures: example distributions with positive (+) and negative (-) skewness and kurtosis]
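Base R has no built-in skewness or kurtosis function. The following is a minimal sketch that assumes the common bias-corrected sample formulas (taken here to be the ones intended on this slide); the function names skewness_sample and kurtosis_sample and the data are hypothetical.

```r
# Sample skewness: n / ((n-1)(n-2)) * sum of cubed standardized deviations
skewness_sample <- function(y) {
  n <- length(y)
  z <- (y - mean(y)) / sd(y)
  n / ((n - 1) * (n - 2)) * sum(z^3)
}

# Sample (excess) kurtosis: 0 is the reference value of a normal distribution
kurtosis_sample <- function(y) {
  n <- length(y)
  z <- (y - mean(y)) / sd(y)
  n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * sum(z^4) -
    3 * (n - 1)^2 / ((n - 2) * (n - 3))
}

sk_sym   <- skewness_sample(c(1, 2, 3, 4, 5))       # symmetric data
sk_right <- skewness_sample(c(1, 1, 2, 2, 3, 10))   # long right tail
kt_flat  <- kurtosis_sample(c(1, 2, 3, 4, 5))       # flat, evenly spread data

print(c(sk_sym = sk_sym, sk_right = sk_right, kt_flat = kt_flat))
```

The symmetric vector gives a skewness of zero, the vector with a long right tail gives a positive value, and the evenly spread vector gives a negative kurtosis.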

Measures of the relative position

Percentiles: the percentile value (p) of an observation $y_i$ in a data set means that 100p% of the observations are smaller than $y_i$ and 100(1-p)% of the observations are greater than $y_i$.

Quartiles: the 25% (Q1 or lower quartile), 50% (Q2 or median) and 75% (Q3 or upper quartile) percentiles.

z-value: deviation of an observation from the mean, expressed in standard deviation units: $z_i = \frac{y_i - \bar{y}}{s}$

IQR: interquartile range, Q3 - Q1. Little affected by extreme values (outliers).
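A minimal sketch of these measures in R, using a short hypothetical vector: z-values are computed from the definition, and the interquartile range is obtained both from the quartiles and from the built-in IQR() function.

```r
# Hypothetical daily gains in kg/day (illustration only)
y <- c(1.2, 1.5, 1.4, 1.8, 1.6, 1.3, 1.7)

# z-values: deviations from the mean in standard deviation units
z <- (y - mean(y)) / sd(y)

# Quartiles Q1, Q2 (median) and Q3, using R's default quantile method
q <- quantile(y, probs = c(0.25, 0.50, 0.75))

# Interquartile range computed by hand as Q3 - Q1
iqr_manual <- unname(q[3] - q[1])

print(round(z, 3))
print(q)
print(c(iqr_manual = iqr_manual, IQR = IQR(y)))
```

By construction the z-values have mean 0 and standard deviation 1, whatever the original units.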

Why R?

• Pros

– Free software for Statistical Analysis

– Powerful to manage data and draw graphics

– Many complementary packages available

– Programming allows a better understanding of statistical methods and R procedures

– Large internet community (websites, forums, …)

• Cons

– Programming makes the analysis slower. Some Java-based graphical interfaces (for example Deducer) can be used

– Treating random effects is more complicated

Some interesting websites

• The Comprehensive R Archive Network – http://cran.r-project.org

• Cookbook for R – http://www.cookbook-r.com

• Quick R – http://www.statmethods.net

• R Statistics UCLA – http://statistics.ats.ucla.edu/stat/r

• Bioconductor – http://manuals.bioinformatics.ucr.edu/home/R_BioCondManual

• R-bloggers – http://www.r-bloggers.com


Starting with R

R is open-source software available at: http://www.r-project.org

It is an integrated suite for data manipulation, calculation and graphical display.

Carefully follow the installation instructions (Windows 32-bit or 64-bit, Mac).

Starting with R (2)

• The entities that R creates and manipulates are called OBJECTS:

– Scalars: numbers, characters, logicals (booleans), factors.

– Vectors, matrices, lists.

– Functions.

– Ad hoc objects.

• All the objects are saved in a WORKSPACE.

• During an R session all the objects are kept in memory and can be saved for later sessions.

• It is recommended to use separate workspaces for different analyses.

• Workspaces are loaded and saved with the instructions load and save.image (also available in the menu).

Starting with R (2b)

Later we will see another way of working (through SCRIPTS), which in my opinion is better than saving and loading workspaces.

Starting with R (3)

• Two types of commands:

– Expressions: the result is shown on the screen and is not saved.

> 2+2
[1] 4

– Assignments: the result is saved in an object and nothing is shown on the screen.

> a <- 2+2
> a
[1] 4

a <- 2+2 indicates that we are assigning the result of 2+2 to the object a.

An alternative is 2+2 -> a.

Note that to retrieve the result of 2+2 we have to type a, followed by ENTER.

ENTER executes the command.

Starting with R (4)

• R is case sensitive and distinguishes between capital and lowercase letters.

# Two different objects

> b <- 3

> b

[1] 3

> B<-6

> B

[1] 6

The # symbol introduces comments, which are not executed.

> b <- 3 indicates that we are assigning the number 3 to the object b.

Note that in the second assignment, B<-6, there are no spaces between B, <-, and 6. Spaces around <- are optional.

Do not use c as the name of an object! c() is the function used to create vectors.

A first dataset (distribution)

Imagine we have records of the average daily gain (ADG) during fattening of a group of bulls of the Bruna dels Pirineus beef breed.

We are going to create a vector and save it in the object ADG. Remember that c() creates a vector:

> ADG<-c(1.99, 1.72, 1.95, 1.67, 1.51, 1.32, 1.39, 1.64,
+ 1.78, 1.50, 1.43, 1.37, 1.60, 1.58, 1.76, 1.57, 1.81,
+ 1.21, 1.45, 1.58, 1.58, 1.68, 1.61, 1.61, 1.78, 1.95,
+ 1.63, 1.68, 1.71, 1.74, 1.69, 1.68, 1.36, 1.30, 1.35,
+ 1.24, 1.38, 1.32)
> ADG
[1] 1.99 1.72 1.95 1.67 1.51 1.32 1.39 1.64 1.78 1.50 1.43 1.37 1.60 1.58 1.76 1.57 1.81 1.21 1.45 1.58 1.58 1.68 1.61 1.61 1.78 1.95 1.63 1.68 1.71 1.74 1.69 1.68 1.36 1.30
[35] 1.35 1.24 1.38 1.32

Note that the first 34 values were printed on the same line of the screen.

First calculations (1)

We can compute several statistics:

> sum(ADG)      # Sum of all values
[1] 60.12
> length(ADG)   # Number of records in the sample (n)
[1] 38
> min(ADG)      # Sample minimum
[1] 1.21
> max(ADG)      # Sample maximum
[1] 1.99
> range(ADG)    # Sample range (lower and upper values)
[1] 1.21 1.99
> mean(ADG)     # Sample mean
[1] 1.582105
> median(ADG)   # Sample median (central value of the distribution)
[1] 1.605
> var(ADG)      # Sample variance
[1] 0.03962248
> sd(ADG)       # Sample standard deviation
[1] 0.199054

Note that before doing any calculation we have to define a dataset.

First calculations (1b)

When there are missing values (NA in R), use the following:

> sum(ADG,na.rm=TRUE)      # Sum of all non-missing values
[1] 60.12
> sum(!is.na(ADG))         # Number of non-missing records in the sample (n)
[1] 38
> min(ADG,na.rm=TRUE)      # Sample minimum
[1] 1.21
> max(ADG,na.rm=TRUE)      # Sample maximum
[1] 1.99
> range(ADG,na.rm=TRUE)    # Sample range (lower and upper values)
[1] 1.21 1.99
> mean(ADG,na.rm=TRUE)     # Sample mean
[1] 1.582105
> median(ADG,na.rm=TRUE)   # Sample median (central value of the distribution)
[1] 1.605
> var(ADG,na.rm=TRUE)      # Sample variance
[1] 0.03962248
> sd(ADG,na.rm=TRUE)       # Sample standard deviation
[1] 0.199054
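The ADG vector has no missing values, so the effect of na.rm is not visible in the output above. A minimal sketch with a hypothetical vector that does contain an NA shows why the argument matters:

```r
# Hypothetical records with one missing value (illustration only)
y <- c(1.5, NA, 1.7, 1.6)

# Without na.rm the result is itself missing
m_default <- mean(y)

# With na.rm = TRUE the NA is dropped before computing
m_clean <- mean(y, na.rm = TRUE)

# length() counts the NA; sum(!is.na()) counts only the valid records
n_all <- length(y)
n_valid <- sum(!is.na(y))

print(c(m_default = m_default, m_clean = m_clean, n_all = n_all, n_valid = n_valid))
```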

First calculations (2)

More statistics:

> quantile(ADG)   # Quantiles. Remember that the treatment of NAs also applies
0% 25% 50% 75% 100%
1.210 1.400 1.605 1.705 1.990
> IQR(ADG)        # Q3 - Q1
[1] 0.305
> summary(ADG)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.210 1.400 1.605 1.582 1.705 1.990
> CV<-sd(ADG)/mean(ADG)*100   # Coefficient of variation (not a built-in function, so we define it ourselves)
> CV
[1] 12.58159

Statistical inference

Drawing conclusions from data while taking into account the inherent random variation.

1. It is the second step, after the description of the data.

2. We want to extrapolate to the population what we observe in a sample.

3. We need to assume a particular data distribution:

• Normal: adult weight, loin muscle area, average daily gain, …

• Bernoulli: ill vs. not ill.

• Poisson: number of microorganisms in a microscope field.

4. In inferential statistics we estimate (obtain an approximate value of) the true value of a parameter (a mean, for example) through an adequate statistic (the sample mean, for example).

5. There are many contexts in which inference is desirable, and there are many approaches to performing inference.

6. Some methods do not need to assume a distribution: non-parametric methods.

Parameters and statistics

Usually the parameters of a distribution are denoted by Greek letters, whereas the corresponding statistics are denoted by Latin letters. The next table includes some examples:

                     Parameter (population)   Statistic (sample)
Mean                 $\mu$                    $\bar{y}$
Variance             $\sigma^2$               $s^2$
Standard deviation   $\sigma$                 $s$
Proportion           $\pi$                    $p$

Estimator: an equation that allows us to estimate a parameter.

Estimate: the value obtained.

Estimation of parameters

1. Point estimation: a single value is obtained as an estimate of the parameter.

2. Interval estimation: we calculate an interval that, with a stated level of confidence, contains the true value of the parameter.

So far we have presented some point estimators of several parameters.

In practice, when we work with the unknown parameter of the population, in addition to the point estimate we are usually interested in an interval (confidence interval, CI) that gives an idea of the uncertainty of the estimate.

We will present the way to construct intervals through some classical examples. The procedure is based upon the distribution of the statistic.

Normal distribution

A random variable Y is normally distributed, $Y \sim N(\mu, \sigma^2)$, if its p.d.f. is

$f(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \, e^{-\frac{(y-\mu)^2}{2\sigma^2}}$

[Figures: portrait of Gauss; normal p.d.f. curves; standard normal distribution with standard deviation bands]

http://en.wikipedia.org/wiki/Carl_Friedrich_Gauss

http://en.wikipedia.org/wiki/Normal_distribution
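The normal p.d.f. can be written out term by term and compared with R's built-in dnorm(). A minimal sketch; the function name normal_pdf and the parameter values are chosen for illustration only.

```r
# Normal density written from its definition
normal_pdf <- function(y, mu, sigma) {
  1 / sqrt(2 * pi * sigma^2) * exp(-(y - mu)^2 / (2 * sigma^2))
}

# Grid of values and illustrative parameters (assumed, not estimated)
y <- seq(1.0, 2.2, by = 0.1)
manual <- normal_pdf(y, mu = 1.58, sigma = 0.2)

# dnorm() takes the standard deviation, not the variance
builtin <- dnorm(y, mean = 1.58, sd = 0.2)

print(all.equal(manual, builtin))
```

Note that dnorm(), pnorm() and qnorm() are parameterized by the standard deviation, whereas the notation $N(\mu, \sigma^2)$ uses the variance.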

Applications of the normal distribution

1. Sometimes we have to know whether a given sample is normally distributed before we can apply a certain test to it.

2. Knowing whether a sample is normally distributed may confirm or reject certain underlying hypotheses about the nature of the factors affecting the phenomenon studied. If a variable is normally distributed, we can think that the causal factors affecting this variable are additive, independent and of equal variance.

• Skewness may suggest some type of selection.

• Bimodality may indicate a mixture of observations from two populations.

• In many cases, a transformation of a non-normal variable brings the distribution of the transformed variable to normality.

3. If we assume a given distribution to be normal, we may make predictions and test given hypotheses based upon this assumption.

One application of the standard normal

Remember that we defined the z-value as follows:

$z_i = \frac{y_i - \bar{y}}{s}$

where $z_i$ follows a standard normal distribution if the observations are normally distributed.

Imagine we have a value of 1.75 kg/day for ADG in the Bruna breed. If the distribution is normal, we can compute the probability of having a value lower than this one (the CDF) using R commands:

> z<-(1.75-1.582)/0.199;z
[1] 0.8442211
> pnorm(z)
[1] 0.8007271

The complement of pnorm(z), i.e. 1-pnorm(z), is the probability of having a value greater than 1.75, in this case about 0.2.
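The standardization step can also be left to pnorm() itself, which accepts the mean and standard deviation as arguments. A minimal sketch using the same rounded values as above (1.582 and 0.199):

```r
# Probability below 1.75 via an explicit z-value
p_below_z <- pnorm((1.75 - 1.582) / 0.199)

# Same probability, letting pnorm() standardize internally
p_below_direct <- pnorm(1.75, mean = 1.582, sd = 0.199)

# Upper-tail probability without computing the complement by hand
p_above <- pnorm(1.75, mean = 1.582, sd = 0.199, lower.tail = FALSE)

print(c(below = p_below_direct, above = p_above))
```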

Standard error of the mean

Suppose we want to measure the mean ADG of BP bulls (as we have in fact done). Usually we do not have the entire population, only a sample. Let us assume that we take repeated samples with replacement of size n (in our case n = 38) from that entire population, which is normally distributed.

For each sample we will obtain a different but similar mean, for example 1.58, 1.60, 1.64, 1.53, 1.59, … and so on. It can be shown that these estimated means are normally distributed around the population mean $\mu$ with standard deviation $\sigma / \sqrt{n}$. This quantity is the standard error of the mean, and it is used to compute confidence intervals (C.I.).

[Figure: distribution of the entire population compared with the narrower distribution of the estimated means]
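The repeated sampling described above can be imitated by simulation. A minimal sketch, assuming a normal population with illustrative values for the mean and standard deviation (they are assumptions for this example, not estimates from the workshop data):

```r
set.seed(1)                      # make the simulation reproducible

# Assumed population parameters and sample size
mu <- 1.58
sigma <- 0.2
n <- 38

# Draw 10000 samples of size n and keep the mean of each one
means <- replicate(10000, mean(rnorm(n, mean = mu, sd = sigma)))

# The standard deviation of those means should be close to sigma / sqrt(n)
sd_of_means <- sd(means)
sem_theory <- sigma / sqrt(n)

print(c(simulated = sd_of_means, theory = sem_theory))
```

The simulated value only approximates the theoretical one; the agreement improves as the number of replicates grows.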

CI: mean of a normal, variance known

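If we fix a confidence level $1 - \alpha$ (for example 95%), the true mean $\mu$ is found in the interval:

$\bar{y} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \le \mu \le \bar{y} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$

Confidence limits: $\left( \bar{y} - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \; \bar{y} + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}} \right)$

Interval length: $2 \, z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}$

$z_{1-\alpha/2}$ = 1.96 for a 95% confidence level:

> qnorm(0.975)
[1] 1.959964

qnorm gives the quantile (z-value) of the c.d.f. of a standard normal.

Note that $\sigma / \sqrt{n}$ is the standard error, i.e., the standard deviation of the distribution of means under (infinitely) repeated sampling of size n.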

CI: mean of a normal, variance known (example)

Assuming that the estimated variance for ADG is the true variance, and that ADG is normally distributed, the 95% confidence interval of the mean is:

$(1.582 - 1.96 \times 0.032, \; 1.582 + 1.96 \times 0.032) = (1.519, \; 1.645)$

With interval length:

$2 \times 1.96 \times 0.032 = 0.12544$
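The same interval can be computed in R. A minimal sketch that starts from the summary values printed earlier for ADG (mean 1.582105, standard deviation 0.199054, n = 38) and treats the standard deviation as if it were the known population value, as the example assumes:

```r
# Summary values of ADG taken from the earlier output
ybar <- 1.582105
sigma <- 0.199054    # treated as known, as assumed in this example
n <- 38

sem <- sigma / sqrt(n)    # standard error of the mean
z <- qnorm(0.975)         # z-value for a 95% confidence level

ci <- c(lower = ybar - z * sem, upper = ybar + z * sem)
ci_length <- 2 * z * sem

print(round(ci, 3))
print(ci_length)
```

Because unrounded values of the standard error and z are used, the interval length differs slightly from the hand calculation with 1.96 and 0.032.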

CI: mean of a normal, variance unknown (1)

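This is the common case, and is similar to the previous one but uses the t-distribution instead of the normal distribution:

$\bar{y} - t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \le \mu \le \bar{y} + t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}}$

Confidence limits: $\left( \bar{y} - t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}}, \; \bar{y} + t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}} \right)$

The length of the interval is: $2 \, t_{1-\alpha/2,\,n-1} \frac{s}{\sqrt{n}}$

If we use the data of the average daily gain in beef, we have

$(1.582 - 2.03 \times 0.032, \; 1.582 + 2.03 \times 0.032) = (1.517, \; 1.647)$

where 2.03 is the t value with 37 df and $\alpha/2$ = 0.025, and the interval length is $2 \times 2.03 \times 0.032 = 0.13$.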

t distribution (1)

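A random variable T follows a t distribution, $T \sim t_{n-1}$, if its p.d.f. is

$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi} \, \Gamma\left(\frac{\nu}{2}\right)} \left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$, with $\nu = n - 1$

Note that the t-distribution approaches the normal distribution as $\nu$ increases.

[Figure: p.d.f. of the t-distribution for $\nu$ = 1 and $\nu$ = 30 (red and green lines) against the normal distribution (blue line); portrait of Gosset]

http://en.wikipedia.org/wiki/Student%27s_t-distribution

http://en.wikipedia.org/wiki/William_Sealy_Gosset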
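Both statements on this slide can be checked numerically. A minimal sketch: the density is written out with R's gamma() function and compared with dt(), and the 97.5% quantile is computed for increasing degrees of freedom to see it approach the normal value. The function name t_pdf and the chosen degrees of freedom are for illustration only.

```r
# Student t density written from its definition, with nu degrees of freedom
t_pdf <- function(t, nu) {
  gamma((nu + 1) / 2) / (sqrt(nu * pi) * gamma(nu / 2)) *
    (1 + t^2 / nu)^(-(nu + 1) / 2)
}

t_grid <- seq(-3, 3, by = 0.5)
manual <- t_pdf(t_grid, nu = 37)
builtin <- dt(t_grid, df = 37)
print(all.equal(manual, builtin))

# 97.5% quantiles for increasing degrees of freedom, and the normal limit
q_t <- sapply(c(1, 5, 30, 1000), function(nu) qt(0.975, df = nu))
q_norm <- qnorm(0.975)
print(q_t)
print(q_norm)
```

The quantiles decrease towards the normal value as the degrees of freedom grow, which is why the t-based interval is wider than the z-based one for small samples.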

t distribution (2)

Central t distribution

Let z be a standard normal random variable ($\mu = 0$ and $\sigma = 1$), and let $\chi^2$ be an independent chi-square random variable with $\nu$ degrees of freedom. Then

$t = \frac{z}{\sqrt{\chi^2 / \nu}}$

is a random variable with a Student t distribution with $\nu$ degrees of freedom.

Non-central t distribution

We can also have a normal random variable y with mean $\mu$ and $\sigma = 1$; then

$t = \frac{y}{\sqrt{\chi^2 / \nu}}$

follows a non-central t distribution. This distribution is defined by $\nu$ degrees of freedom and the noncentrality parameter (included in y).

Chi-square distribution

The chi-square distribution (also chi-squared or $\chi^2$-distribution) with k degrees of freedom is the distribution of a sum of the squares of k independent standard normal random variables. It is one of the most widely used probability distributions in inferential statistics, e.g. in hypothesis testing or in the construction of confidence intervals.

The best-known situations in which the chi-square distribution is used are the common chi-square tests for goodness of fit of an observed distribution to a theoretical one, and for the independence of two criteria of classification of qualitative data.

The chi-square distribution is a special case of the gamma distribution, with p.d.f.:

$f(x) = \frac{1}{2^{k/2} \, \Gamma(k/2)} \, x^{k/2 - 1} e^{-x/2}$, for $x > 0$

http://en.wikipedia.org/wiki/Chi_square_distribution
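The two descriptions of the chi-square distribution can be checked in R. A minimal sketch with k = 4 chosen for illustration: the density is compared with dchisq() and with a gamma density (shape k/2, rate 1/2), and the "sum of squared standard normals" definition is imitated by simulation.

```r
k <- 4                              # degrees of freedom (illustrative choice)
x <- seq(0.5, 10, by = 0.5)

# Chi-square density written out explicitly
chisq_pdf <- x^(k / 2 - 1) * exp(-x / 2) / (2^(k / 2) * gamma(k / 2))

same_as_dchisq <- all.equal(chisq_pdf, dchisq(x, df = k))
same_as_gamma <- all.equal(dchisq(x, df = k),
                           dgamma(x, shape = k / 2, rate = 1 / 2))
print(c(same_as_dchisq, same_as_gamma))

# Sum of squares of k independent standard normals, repeated many times
set.seed(1)
sums <- replicate(10000, sum(rnorm(k)^2))
print(mean(sums))                   # should be close to k, the chi-square mean
```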

CI: mean of a normal, variance unknown (2)

In R we can compute $t_{1-\alpha/2,\,n-1}$ as:

> qt(0.975,length(ADG)-1)
[1] 2.026192

If we use the data of the average daily gain in beef, we have

> YBAR<-mean(ADG)
> YBAR
[1] 1.582105
> SEM<-sd(ADG)/sqrt(length(ADG))
> SEM
[1] 0.03229081
> UCL<-YBAR+SEM*qt(0.975,length(ADG)-1)
> UCL
[1] 1.647533
> LCL<-YBAR-SEM*qt(0.975,length(ADG)-1)
> LCL
[1] 1.516678
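The built-in t.test() function reports the same kind of interval. A minimal sketch with a short hypothetical vector (so that the example is self-contained) comparing the step-by-step calculation with the conf.int element returned by t.test():

```r
# Hypothetical daily gains in kg/day (illustration only)
y <- c(1.2, 1.5, 1.4, 1.8, 1.6, 1.3, 1.7)
n <- length(y)

# Step-by-step interval, as in the calculation above
sem <- sd(y) / sqrt(n)
tq <- qt(0.975, df = n - 1)
ci_manual <- c(mean(y) - tq * sem, mean(y) + tq * sem)

# Same interval from the one-sample t-test output
ci_ttest <- as.numeric(t.test(y, conf.level = 0.95)$conf.int)

print(ci_manual)
print(ci_ttest)
```

With the ADG vector defined earlier in the session, t.test(ADG)$conf.int should reproduce the LCL and UCL values shown above.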

CI: Interpretation

If we took all possible samples of size 38 of average daily gain in the Bruna dels Pirineus beef breed, and made the above calculations for each of them, approximately 95% of the intervals found would contain $\mu$.

We do not know whether the interval we have found contains $\mu$ or not, because this parameter is unknown (in fact it is what we are looking for), but we are 95% confident that it does.
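This interpretation can be illustrated by simulation. A minimal sketch, assuming a normal population with illustrative parameter values (assumptions for this example): many samples of size 38 are drawn, a 95% t-interval is computed for each, and the proportion of intervals containing the true mean is recorded.

```r
set.seed(1)                      # make the simulation reproducible

# Assumed population parameters and sample size
mu <- 1.58
sigma <- 0.2
n <- 38

# For each simulated sample, check whether its 95% interval contains mu
covers <- replicate(2000, {
  y <- rnorm(n, mean = mu, sd = sigma)
  half <- qt(0.975, df = n - 1) * sd(y) / sqrt(n)
  (mean(y) - half <= mu) && (mu <= mean(y) + half)
})

coverage <- mean(covers)         # proportion of intervals that contain mu
print(coverage)
```

The proportion should be close to 0.95, although any single interval either contains the true mean or does not.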

Creating and executing a script (1)

Usually we do not work by typing the commands we want to execute into the R console. Instead, we use scripts.

A script is a sequence of commands that we write in the R editor and then save in R format.

It is important to work in a particular directory where we will save both the scripts and the data needed to run the calculations.

To find out which directory you are in (both in Windows and Mac):

getwd()

To move to a new directory in Windows you can use the Change directory option in the File ("Archivo") menu of the console.

If you are a Mac user, you can type and execute in the console something similar to:

setwd("/Users/Documents/DEME/scripts-data")

Creating and executing a script (2)

Once in the working directory, to create a new script go to File and then click the New Script button (New Document on Macs).

To open a previously saved script, click Open Script and then select the desired script (for example sdescriptive.R).

To execute the script in Windows, select with the cursor the line or lines to be executed and press the icon with the green arrow.

On Macs, select the line or lines (or put the cursor at the end of an executable line) and then press cmd+enter.

A script to describe the data

Note that > is not necessary at the beginning of each line of a script.

Pie charts in R (1)

Let us construct a pie chart describing calving ease in the Bruna P. breed:

> #Simple piechart
> SLICES <- c(2345,350,47,13)
> LBSL <- c("1. Easy", "2. Light assist.", "3. Strong assist.",
+ "4. Vet. assist.")
> pie(SLICES, labels = LBSL, main="PIE CHART OF CALVING EASE")

Note that c() is the usual way to create a vector; in this case SLICES is the name assigned to it in the second line. In the third line, we give the names of the calving ease categories in another vector: LBSL.

Note that we will use capital letters to name our variables. Lower case letters will be reserved for R commands and text within quotes.

Note also that when a command spans two or more lines, the second and subsequent lines start with +.

[Figure: pie chart titled "PIE CHART OF CALVING EASE" with slices 1. Easy, 2. Light assist., 3. Strong assist., 4. Vet. assist.]

Pie charts in R (2)

We can choose the colours and add percentages:

> #Pie chart with percentages and another set of colors
> SLICES <- c(2345,350,47,13)
> LBLS <- c("1. Easy","2. Light assist.","3. Strong assist.",
+ "4. Vet. assist.")
> PCT <- round(SLICES/
+ sum(SLICES)*100,1)
> LBLS <- paste(LBLS, PCT)
> LBLS <- paste(LBLS,"%",sep="")
> COLORS <- c("green","blue",
+ "yellow","maroon")
> pie(SLICES, labels = LBLS,
+ col=COLORS,
+ main="PIE CHART OF CALVING EASE")

[Figure: pie chart titled "PIE CHART OF CALVING EASE" with labels 1. Easy 85.1%, 2. Light assist. 12.7%, 3. Strong assist. 1.7%, 4. Vet. assist. 0.5%]

More pie chart variants can be found on the internet or in some books, for example 3D pie charts, rainbow colours, etc.

Bar chart in R

> COUNTS <- c(2345,350,47,13)
> barplot(COUNTS, main="Calving ease in beef cattle",
+ ylab="Number of calvings per category",
+ names.arg = c("1. Easy","2. Light ass.","3. Str ass.","4. Vet ass."),
+ border="blue", density=c(20,30,40,50))

Note that it is easier to visualize the distribution in this format.

[Figure: bar chart titled "Calving ease in beef cattle"; x axis: 1. Easy, 2. Light ass., 3. Str ass., 4. Vet ass.; y axis: Number of calvings per category, from 0 to 2000]