data management and statistical analysis - descriptive statistics

Leilani A. Nora

Assistant Scientist

Descriptive Statistics

Introduction to R: Data Manipulation and Statistical Analysis

DATA FRAME : data.serial

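• Consider a serial data set with 3 Sites, 3 Treatments, 4 Reps, and a response variable Y. Site C has observations for Treatments 1 and 2 only, and Y contains two missing values (NA).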

Site  Trt  Rep  Y
A     1    1    3
A     1    2    6
A     1    3    8
A     1    4    5
A     2    1    4
A     2    2    4
A     2    3    6
A     2    4    9
A     3    1    7
A     3    2    4
A     3    3    2
A     3    4    4
B     1    1    3
B     1    2    6
B     1    3    5
B     1    4    NA
B     2    1    7
B     2    2    0
B     2    3    8
B     2    4    2
B     3    1    5
B     3    2    7
B     3    3    4
B     3    4    4
C     1    1    8
C     1    2    NA
C     1    3    8
C     1    4    6
C     2    1    5
C     2    2    4
C     2    3    4
C     2    4    7
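All later examples use `data.serial`, but the slides do not show how it was created. The sketch below is one possible way to build it by hand from the table; generating the design columns with `rep()` and storing Site as a factor are assumptions made for illustration.

```r
# Hand-built copy of the data in the table (an assumption: the slides
# do not show how data.serial was actually read in).
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,     # Site A
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,    # Site B
           8, NA, 8, 6, 5, 4, 4, 7)                # Site C
)

str(data.serial)           # structure: 32 observations of 4 variables
table(data.serial$Site)    # number of rows per Site
```

In practice the same data frame would more likely be read from a file, for example with `read.csv()`.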

SUMMARY STATISTICS

• R contains all the basic tools for calculating summary statistics.

• cor() and cov() calculate correlations and covariances, respectively

• mean(), median(), sum(), var(), min(), max(), and range() are all self-explanatory

• mad() calculates the median absolute deviation (scaled by 1.4826 by default)

• quantile() computes quantiles of the data at the requested probabilities

• summary() is discussed on the next slide
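As a brief illustration of the functions listed above, the sketch below applies a few of them to the Y values from the data table. The vector `Y` is typed in by hand here so that the example runs on its own. Most of these functions return NA when the data contain missing values unless `na.rm=TRUE` is given.

```r
# Y values from the data table, typed in by hand for a standalone example
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

mean(Y)                      # NA, because Y contains missing values
mean(Y, na.rm = TRUE)        # mean of the non-missing values
median(Y, na.rm = TRUE)
range(Y, na.rm = TRUE)       # minimum and maximum
q <- quantile(Y, probs = c(0.25, 0.50, 0.75), na.rm = TRUE)
q

# mad() is the MEDIAN absolute deviation; by default it is multiplied
# by 1.4826 so that it estimates sd for normally distributed data.
mad.scaled <- mad(Y, na.rm = TRUE)
mad.raw    <- mad(Y, constant = 1, na.rm = TRUE)

# The MEAN absolute deviation has to be computed by hand.
y.obs <- Y[!is.na(Y)]
mean.abs.dev <- mean(abs(y.obs - mean(y.obs)))
```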

SUMMARY STATISTICS : summary()

• Used to obtain descriptive statistics for a data frame or a specific variable.

• For a numeric variable, the output consists of the minimum, first quartile, median, mean, third quartile, maximum, and the count of NA's.

• Ex1. To obtain summary statistics for the variable Y

> summary(data.serial$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   4.000   5.000   5.167   7.000   9.000   2.000

• Ex2. To obtain summary statistics for all the columns of a data frame

> summary(data.serial)
 Site        Trt             Rep             Y
 A:12   Min.   :1.000   Min.   :1.00   Min.   :0.000
 B:12   1st Qu.:1.000   1st Qu.:1.75   1st Qu.:4.000
 C: 8   Median :2.000   Median :2.50   Median :5.000
        Mean   :1.875   Mean   :2.50   Mean   :5.167
        3rd Qu.:2.250   3rd Qu.:3.25   3rd Qu.:7.000
        Max.   :3.000   Max.   :4.00   Max.   :9.000
                                       NA's   :2.000
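The result of summary() on a numeric variable is a named vector, so individual statistics can be picked out by name. The following is a small sketch with the Y values typed in by hand; the object name `s` is arbitrary.

```r
# Y values from the data table, typed in by hand for a standalone example
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

s <- summary(Y)
names(s)         # labels of the statistics returned by summary()
s[["Mean"]]      # extract the mean by name
s[["NA's"]]      # extract the number of missing values
```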

SUMMARY STATISTICS : length()

• Used to obtain the number of elements of a variable, say Y. Missing values (NA) are included in the count.

> length(data.serial$Y)

[1] 32
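Because length() counts missing values too, the number of usable observations has to be obtained separately. A minimal sketch, with the Y values typed in by hand:

```r
# Y values from the data table, typed in by hand for a standalone example
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

n.total   <- length(Y)         # all elements, including NA: 32
n.obs     <- sum(!is.na(Y))    # non-missing values only: 30
n.missing <- sum(is.na(Y))     # missing values: 2
```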

SUMMARY STATISTICS : var() and sd()

• var() is used to obtain the variance of Y

> Y.VAR <- var(data.serial$Y, na.rm=TRUE)
> Y.VAR
[1] 4.488506

• sd() is used to obtain the standard deviation of Y

> Y.STD <- sd(data.serial$Y, na.rm=TRUE)
> Y.STD
[1] 2.118609
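To make explicit what var() and sd() compute, the sketch below reproduces both by hand using the sample variance with divisor n - 1, where n is the number of non-missing values. The names `var.manual` and `sd.manual` are introduced only for this illustration.

```r
# Y values from the data table, typed in by hand for a standalone example
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

y.obs <- Y[!is.na(Y)]            # same effect as na.rm = TRUE
n     <- length(y.obs)

# Sample variance: sum of squared deviations divided by (n - 1)
var.manual <- sum((y.obs - mean(y.obs))^2) / (n - 1)
sd.manual  <- sqrt(var.manual)   # standard deviation is its square root

all.equal(var.manual, var(Y, na.rm = TRUE))   # TRUE
all.equal(sd.manual,  sd(Y, na.rm = TRUE))    # TRUE
```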

SUMMARY STATISTICS : tapply()

• tapply() applies a function to a variable separately within each (non-empty) group

> tapply(X, INDEX, FUN)

X – an object, typically a vector

INDEX – a list of factors, each of the same length as X

FUN – the function to be applied

• Ex1. To obtain a separate summary of Y for each Site

> tapply(data.serial$Y, data.serial$Site, summary)
$A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   4.000   4.500   5.167   6.250   9.000

$B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   3.500   5.000   4.636   6.500   8.000   1.000

$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    4.0     4.5     6.0     6.0     7.5     8.0     1.0

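• Ex2. To obtain a separate standard deviation of Y for each Site. The extra argument na.rm=TRUE is passed on to sd(); without it, Sites B and C would return NA because they contain missing values.

> tapply(data.serial$Y, data.serial$Site, sd, na.rm=TRUE)
       A        B        C
2.081666 2.377929 1.732051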


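• Ex3. To obtain a separate mean of Y for each Site x Trt combination. na.rm=TRUE is passed on to mean(); the NA for Site C, Trt 3 remains because that combination has no observations.

> tapply(data.serial$Y,
         list(data.serial$Site, data.serial$Trt),
         mean, na.rm=TRUE)
         1    2    3
A 5.500000 5.75 4.25
B 4.666667 4.25 5.00
C 7.333333 5.00   NA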
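The FUN argument of tapply() can also be a user-written function. As a sketch, the code below counts the non-missing values of Y per Site and per Site x Trt cell, which shows where the missing values and the empty cell are. The data frame is rebuilt by hand so the example runs on its own, and the helper name `count.obs` is hypothetical.

```r
# Hand-built copy of the data table (an assumption for this example)
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# Hypothetical helper: number of non-missing values in a vector
count.obs <- function(v) sum(!is.na(v))

# Non-missing observations per Site
n.site <- tapply(data.serial$Y, data.serial$Site, count.obs)
n.site

# Non-missing observations per Site x Trt cell; the empty cell
# (Site C, Trt 3) is filled with NA because FUN is never called for it
n.cell <- tapply(data.serial$Y,
                 list(data.serial$Site, data.serial$Trt),
                 count.obs)
n.cell
```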

SUMMARY STATISTICS : doBy Package

• The doBy package is used to calculate groupwise summary statistics in a simple way, much in the spirit of PROC SUMMARY in the SAS system.

SUMMARY STATISTICS : summaryBy()

• Used for calculating quantities such as the mean and variance of a variable for each combination of two or more factors.

• Usage

> summaryBy(formula, data, FUN=mean,
            keep.names=FALSE, order=TRUE, na.rm=TRUE, ...)

# formula – a formula object, say Y~Site

# data – a data frame

# FUN – a list of functions to be applied

# keep.names – logical; if TRUE and there is only ONE function in FUN, the variables in the output have the same names as the variables in the input

# order – logical; if TRUE the resulting data frame is ordered according to the variables on the right-hand side of the formula

• Ex1. To obtain Site x Trt summary of means for Y

> library(doBy)

> summaryBy(Y~Site+Trt, data=data.serial,

na.rm=TRUE)

Site Trt Y.mean

1 A 1 5.500000

2 A 2 5.750000

3 A 3 4.250000

4 B 1 4.666667

5 B 2 4.250000

6 B 3 5.000000

7 C 1 7.333333

8 C 2 5.000000


• Ex2. To obtain Site x Trt summary of minimum, mean,

maximum, variance and standard deviation of Y using

predefined functions.

> summaryBy(Y~Site+Trt, data=data.serial,

FUN=c(min, mean, max, var, sd), na.rm=TRUE)


Site Trt Y.min Y.mean Y.max Y.var Y.sd

1 A 1 3 5.500000 8 4.333333 2.081666

2 A 2 4 5.750000 9 5.583333 2.362908

3 A 3 2 4.250000 7 4.250000 2.061553

4 B 1 3 4.666667 6 2.333333 1.527525

5 B 2 0 4.250000 8 14.916667 3.862210

6 B 3 4 5.000000 7 2.000000 1.414214

7 C 1 6 7.333333 8 1.333333 1.154701

8 C 2 4 5.000000 7 2.000000 1.414214
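For readers without the doBy package installed, a similar table of Site x Trt means can be sketched with aggregate() from base R. This is an alternative technique, not part of doBy. The data frame is rebuilt by hand so the example runs on its own, and the name `site.trt.means` is arbitrary.

```r
# Hand-built copy of the data table (an assumption for this example)
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# The formula interface of aggregate() drops rows with missing Y by
# default (na.action = na.omit), so na.rm = TRUE is not needed here.
site.trt.means <- aggregate(Y ~ Site + Trt, data = data.serial, FUN = mean)

# Sort by Site, then Trt, to match the row order of the summaryBy() output
site.trt.means <- site.trt.means[order(site.trt.means$Site,
                                       site.trt.means$Trt), ]
site.trt.means
```

Only the eight Site x Trt combinations that occur in the data appear in the result, as in the summaryBy() output.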

HISTOGRAM

DENSITY PLOT

> hist(data.serial$Y, main='Histogram of Y', col='yellow2',
       border='tomato1',
       freq=FALSE, xlab="Y Class",
       ylab="Probability", xlim=c(0, 20))

# freq – logical; if FALSE, probability densities are plotted so that the histogram has a total area of one.
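To see what freq=FALSE means numerically, hist() can be called with plot=FALSE, which returns the bin breaks, counts, and densities without drawing anything. The sketch below, with Y typed in by hand, checks that the bar areas (density times bin width) add up to one.

```r
# Y values from the data table, typed in by hand for a standalone example
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

h <- hist(Y, plot = FALSE)     # missing values are dropped automatically

h$breaks                       # bin boundaries chosen by hist()
h$counts                       # frequencies (what freq = TRUE would plot)
h$density                      # densities (what freq = FALSE plots)

# Area of each bar is density * bin width; the areas sum to one
total.area <- sum(h$density * diff(h$breaks))
total.area
```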

DENSITY PLOT: seq()

• seq(from, to, length.out) generates a regular sequence; here, 100 equally spaced values from 0 to 20.

> x <- seq(from=0, to=20, length.out=100)
> x
 [1]  0.0000000  0.2020202  0.4040404  0.6060606  0.8080808  1.0101010
 [7]  1.2121212  1.4141414  1.6161616  1.8181818  2.0202020  2.2222222
 . . .
[97] 19.3939394 19.5959596 19.7979798 20.0000000
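The spacing of the sequence follows from the end points and the number of values: with 100 values there are 99 intervals, so each step is 20/99. The sketch below checks this and contrasts it with the alternative of fixing the step size through the `by` argument; the object names are arbitrary.

```r
# 100 equally spaced values from 0 to 20: the step is (20 - 0) / 99
x <- seq(from = 0, to = 20, length.out = 100)
step <- x[2] - x[1]
step                     # about 0.2020202

# Alternatively, fix the step size and let R work out the length
x.half <- seq(from = 0, to = 20, by = 0.5)
length(x.half)           # 41 values: 0, 0.5, ..., 20
```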

dnorm(x, mean, sd)

• dnorm() is used to obtain the normal probability density at x, given the values of mean and sd.

> y <- dnorm(x,
             mean(data.serial$Y, na.rm=TRUE),
             sd(data.serial$Y, na.rm=TRUE))
> y
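dnorm() returns density values, not probabilities. As a sketch, the code below compares dnorm() with the normal density formula written out by hand and checks that the density integrates to one. The Y values are typed in by hand, and the names `m`, `s`, and `y.formula` are introduced only for this illustration.

```r
# Y values from the data table, typed in by hand for a standalone example
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

m <- mean(Y, na.rm = TRUE)
s <- sd(Y, na.rm = TRUE)
x <- seq(from = 0, to = 20, length.out = 100)

y <- dnorm(x, mean = m, sd = s)

# Normal density written out: exp(-(x - m)^2 / (2 s^2)) / (s * sqrt(2 pi))
y.formula <- exp(-(x - m)^2 / (2 * s^2)) / (s * sqrt(2 * pi))
all.equal(y, y.formula)        # TRUE

# The area under the whole density curve is one
area <- integrate(dnorm, lower = -Inf, upper = Inf, mean = m, sd = s)$value
area
```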

DENSITY PLOT : lines()

> lines(x, y)

HISTOGRAM WITH DENSITY PLOT: mtext()

• mtext(text, side=3, ...) displays text in the margin of the plot; the default side=3 places it on top

> mtext("Fitting to a normal distribution")

# text – a character expression specifying the text to be written

# side – the side of the plot on which to display the text

1 – bottom 2 – left

3 – top 4 – right

CASE1. HISTOGRAM WITH DENSITY PLOT

> hist(RF$RLD0, main='Histogram of RLD0',
       col='plum4', border='black', breaks=5,
       xlab="RLD0 Class",
       ylab="Probability",
       freq=FALSE,
       xlim=c(0, 20))
> x <- seq(from=0, to=20, length.out=100)
> x
> y <- dnorm(x,
             mean(data.serial$Y, na.rm=TRUE),
             sd(data.serial$Y, na.rm=TRUE))
> lines(x, y)
> mtext("Fitting to a normal distribution")

HISTOGRAM WITH DENSITY PLOT: lines(), dnorm(), and mtext()

[Figure: "Histogram of Y with Density plot"; x-axis: Y class (0 to 10); y-axis: Probability (0.00 to 0.20); margin text: "Fitting to a normal distribution"]

BOXPLOT

• Ex1. To obtain a boxplot of Y with other graphics parameters

> boxplot(data.serial$Y,
          boxwex=0.35,
          main="Boxplot of Y",
          xlab="Y",
          horizontal=TRUE)

# boxwex – controls the width of the box

# horizontal – logical; if TRUE, the boxplot is drawn horizontally

[Figure: "Boxplot of Y"; horizontal boxplot; x-axis: Y (0 to 8)]

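BOXPLOT : boxplot()

• To obtain a separate boxplot of Y for each Site, either of the following calls draws one box per Site

> boxplot(split(data.serial$Y, data.serial$Site))

> boxplot(Y~Site, data=data.serial)

[Figure: boxplots of Y for Sites A, B, and C; y-axis from 0 to 8]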
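The numbers behind each box can be inspected by calling boxplot() with plot=FALSE, which returns the statistics instead of drawing them. The sketch below rebuilds the data frame by hand so it runs on its own; the object name `bp` is arbitrary.

```r
# Hand-built copy of the data table (an assumption for this example)
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(1:3, each = 4), rep(1:3, each = 4), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

bp <- boxplot(Y ~ Site, data = data.serial, plot = FALSE)

bp$names    # group labels, one per box
bp$n        # non-missing observations per group
# One column per Site; rows are lower whisker, lower hinge, median,
# upper hinge, and upper whisker
bp$stats
```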

THANK YOU! ☺☺☺☺

Please do Exercise C
