Data Management and Statistical Analysis - Descriptive Statistics


Upload: vivay-salazar

Post on 22-Nov-2014


Page 1: Data Management and Statistical Analysis - Descriptive Statistics

Leilani A. Nora

Assistant Scientist

Descriptive Statistics

Introduction to R: Data Manipulation and Statistical Analysis

DATA FRAME : data.serial

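• Consider a serialized data set with 3 Sites, up to 3 Treatments per Site, 4 reps per Treatment, and a response variable Y. Site C has only Treatments 1 and 2, and two values of Y are missing (NA).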

Site Trt Rep Y
A    1   1   3
A    1   2   6
A    1   3   8
A    1   4   5
A    2   1   4
A    2   2   4
A    2   3   6
A    2   4   9
A    3   1   7
A    3   2   4
A    3   3   2
A    3   4   4
B    1   1   3
B    1   2   6
B    1   3   5
B    1   4   NA
B    2   1   7
B    2   2   0
B    2   3   8
B    2   4   2
B    3   1   5
B    3   2   7
B    3   3   4
B    3   4   4
C    1   1   8
C    1   2   NA
C    1   3   8
C    1   4   6
C    2   1   5
C    2   2   4
C    2   3   4
C    2   4   7
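The table above can be typed into R directly. This is a minimal sketch that assumes Site is stored as a factor and Trt and Rep as numbers, which matches the summary() output shown later in the slides; the slides themselves do not show how data.serial was created.

```r
# Build data.serial from the table: 32 rows, two missing values in Y.
data.serial <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(rep(1:3, each = 4), times = 2), rep(1:2, each = 4)),
  Rep  = rep(1:4, times = 8),
  Y    = c(3, 6, 8, 5,   4, 4, 6, 9,   7, 4, 2, 4,    # Site A, Trt 1-3
           3, 6, 5, NA,  7, 0, 8, 2,   5, 7, 4, 4,    # Site B, Trt 1-3
           8, NA, 8, 6,  5, 4, 4, 7)                  # Site C, Trt 1-2
)

str(data.serial)   # check column types and number of rows
```

Checking the structure with str() right after data entry is a quick way to catch a miscounted row or a column read as the wrong type.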

SUMMARY STATISTICS

• R contains all the basic tools for calculating summary statistics.

• cor() and cov() calculate correlations and covariances

• mean(), median(), sum(), var(), min(), max(), range() are all self-explanatory

• mad() calculates the median absolute deviation (by default scaled by the constant 1.4826)

• quantile() computes various quantiles of the data

• summary() will be discussed on the next slide
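A short sketch of the functions listed above, applied to the Y values of data.serial (retyped here as a plain vector so the snippet runs on its own). Because Y contains missing values, most of these functions return NA unless na.rm=TRUE is supplied.

```r
# Y values of data.serial, in table order (two NAs).
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

mean(Y)                   # NA: missing values propagate
mean(Y, na.rm = TRUE)     # 5.166667
median(Y, na.rm = TRUE)   # 5
range(Y, na.rm = TRUE)    # 0 9
quantile(Y, probs = c(0.25, 0.75), na.rm = TRUE)   # 4 and 7

# mad() is the MEDIAN absolute deviation, scaled by 1.4826 by default.
mad(Y, na.rm = TRUE)                 # 1.4826
mad(Y, constant = 1, na.rm = TRUE)   # 1 (unscaled)
```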

SUMMARY STATISTICS : summary()

• Used to obtain descriptive statistics of a data frame or of a specific variable.

• Ex1. To obtain summary statistics for the variable Y

> summary(data.serial$Y)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   4.000   5.000   5.167   7.000   9.000   2.000

• The output gives the minimum, first quartile, median, mean, third quartile, maximum, and the count of NA's.

SUMMARY STATISTICS : summary()

• Ex2. To obtain summary statistics for all the columns of a data frame

> summary(data.serial)
 Site        Trt             Rep             Y
 A:12   Min.   :1.000   Min.   :1.00   Min.   :0.000
 B:12   1st Qu.:1.000   1st Qu.:1.75   1st Qu.:4.000
 C: 8   Median :2.000   Median :2.50   Median :5.000
        Mean   :1.875   Mean   :2.50   Mean   :5.167
        3rd Qu.:2.250   3rd Qu.:3.25   3rd Qu.:7.000
        Max.   :3.000   Max.   :4.00   Max.   :9.000
                                       NA's   :2.000

SUMMARY STATISTICS : length()

• Used to obtain the number of data points of a variable, say Y. Missing values (NA) are included in the count.

> length(data.serial$Y)
[1] 32
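Since length() also counts missing values, the number of usable observations has to be counted separately. A minimal sketch, with the Y values retyped as a plain vector so it runs on its own:

```r
# Y values of data.serial, in table order (two NAs).
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

length(Y)         # 32: all entries, including NA
sum(is.na(Y))     # 2:  number of missing values
sum(!is.na(Y))    # 30: number of non-missing values
```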

SUMMARY STATISTICS : var() and sd()

• var() is used to obtain the variance of Y

> Y.VAR <- var(data.serial$Y, na.rm=TRUE)
> Y.VAR
[1] 4.488506

• sd() is used to obtain the standard deviation of Y

> Y.STD <- sd(data.serial$Y, na.rm=TRUE)
> Y.STD
[1] 2.118609
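As a check on what var() and sd() compute, the sample variance can be worked out by hand: the sum of squared deviations from the mean divided by n - 1, with the standard deviation as its square root. The sketch below retypes the Y values as a plain vector; the names y, n and v.manual are illustrative only.

```r
# Y values of data.serial, in table order (two NAs).
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

y <- Y[!is.na(Y)]          # drop missing values, as na.rm=TRUE does
n <- length(y)             # 30 non-missing observations
v.manual <- sum((y - mean(y))^2) / (n - 1)

v.manual                                          # 4.488506
all.equal(v.manual, var(Y, na.rm = TRUE))         # TRUE
all.equal(sqrt(v.manual), sd(Y, na.rm = TRUE))    # TRUE
```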

SUMMARY STATISTICS : tapply()

• tapply() applies a function to a variable separately within each (non-empty) group

> tapply(X, INDEX, FUN)

X – an object, typically a vector

INDEX – list of factors, each of the same length as X

FUN – function to be applied

• Ex1. To obtain separate summary statistics of Y for each Site

> tapply(data.serial$Y, data.serial$Site, summary)
$A
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  2.000   4.000   4.500   5.167   6.250   9.000

$B
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
  0.000   3.500   5.000   4.636   6.500   8.000   1.000

$C
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
    4.0     4.5     6.0     6.0     7.5     8.0     1.0

SUMMARY STATISTICS : tapply()

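• Ex2. To obtain the standard deviation of Y separately for each Site. Because Y has missing values in Sites B and C, na.rm=TRUE must be passed on to sd(); otherwise those two results would be NA.

> tapply(data.serial$Y, data.serial$Site, sd, na.rm=TRUE)
       A        B        C
2.081666 2.377929 1.732051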

SUMMARY STATISTICS : tapply()

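• Ex3. To obtain the mean of Y separately for each Site x Trt combination. Here too na.rm=TRUE is passed on to mean(). The C x 3 cell is NA because Site C has no observations for Treatment 3.

> tapply(data.serial$Y,
         list(data.serial$Site, data.serial$Trt),
         mean, na.rm=TRUE)
         1    2    3
A 5.500000 5.75 4.25
B 4.666667 4.25 5.00
C 7.333333 5.00   NA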
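The same Site x Trt means can also be obtained in "long" format with base R's aggregate(), shown here as an alternative sketch rather than something the slides use. With the formula interface, rows with a missing Y are dropped by default (na.action = na.omit), and empty combinations such as C x 3 simply do not appear. The data frame d and the result res are illustrative names.

```r
# Rebuild the example data so the snippet is self-contained.
d <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(rep(1:3, each = 4), times = 2), rep(1:2, each = 4)),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# One row per observed Site x Trt combination, with the mean of Y.
res <- aggregate(Y ~ Site + Trt, data = d, FUN = mean)
res
```

The result has eight rows, one for each combination that actually occurs in the data.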

SUMMARY STATISTICS : doBy Package

• The doBy package is used to calculate groupwise summary statistics in a simple way, much in the spirit of PROC SUMMARY of the SAS system.

summaryBy()

• Used for calculating quantities like the mean and variance of a variable for each level of a factor or for each combination of two or more factors.

SUMMARY STATISTICS : summaryBy()

• Usage

> summaryBy(formula, data, FUN=mean, keep.names=FALSE, order=TRUE, na.rm=TRUE, ...)

# formula – a formula object, say Y~Site

# data – a data frame

# FUN – a function or a list of functions to be applied

# keep.names – logical; if TRUE and there is only ONE function in FUN, the variables in the output will have the same names as the variables in the input

# order – logical; if TRUE, the resulting data frame is ordered according to the variables on the right-hand side of the formula

# na.rm – not an argument of summaryBy() itself; it is passed on through ... to the functions in FUN so that missing values are removed

• Ex1. To obtain Site x Trt summary of means for Y

> library(doBy)
> summaryBy(Y~Site+Trt, data=data.serial, na.rm=TRUE)
  Site Trt   Y.mean
1    A   1 5.500000
2    A   2 5.750000
3    A   3 4.250000
4    B   1 4.666667
5    B   2 4.250000
6    B   3 5.000000
7    C   1 7.333333
8    C   2 5.000000

SUMMARY STATISTICS : summaryBy()

• Ex2. To obtain a Site x Trt summary of the minimum, mean, maximum, variance and standard deviation of Y using predefined functions.

> summaryBy(Y~Site+Trt, data=data.serial,
            FUN=c(min, mean, max, var, sd), na.rm=TRUE)
  Site Trt Y.min   Y.mean Y.max     Y.var     Y.sd
1    A   1     3 5.500000     8  4.333333 2.081666
2    A   2     4 5.750000     9  5.583333 2.362908
3    A   3     2 4.250000     7  4.250000 2.061553
4    B   1     3 4.666667     6  2.333333 1.527525
5    B   2     0 4.250000     8 14.916667 3.862210
6    B   3     4 5.000000     7  2.000000 1.414214
7    C   1     6 7.333333     8  1.333333 1.154701
8    C   2     4 5.000000     7  2.000000 1.414214
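For readers without the doBy package installed, a comparable table can be sketched in base R with split() and sapply(). This is an alternative under stated assumptions, not the doBy implementation: the names d, groups and stats are illustrative, and the row labels ("A.1", "B.1", ...) differ from the Site and Trt columns that summaryBy() returns.

```r
# Rebuild the example data so the snippet is self-contained.
d <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Trt  = c(rep(rep(1:3, each = 4), times = 2), rep(1:2, each = 4)),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# Split Y by Site x Trt; drop = TRUE discards the empty C x 3 cell.
groups <- split(d$Y, list(d$Site, d$Trt), drop = TRUE)

# Apply several statistics to each group after removing NAs.
stats <- t(sapply(groups, function(y) {
  y <- y[!is.na(y)]
  c(min = min(y), mean = mean(y), max = max(y), var = var(y), sd = sd(y))
}))

round(stats, 6)
```

The values should agree with the summaryBy() output above, for example a standard deviation of 3.862210 for Site B, Treatment 2.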

HISTOGRAM

DENSITY PLOT

> hist(data.serial$Y, main='Histogram of Y',
       col='yellow2', border='tomato1',
       freq=FALSE, xlab="Y Class",
       ylab="Probability", xlim=c(0, 20))

# freq – logical; if FALSE, probability densities are plotted so that the histogram has a total area of one.

DENSITY PLOT: seq()

• seq(from, to, length) generates a regular sequence; here, 100 equally spaced values from 0 to 20.

> x <- seq(from=0, to=20, length=100)
> x
 [1]  0.0000000  0.2020202  0.4040404  0.6060606  0.8080808  1.0101010
 [7]  1.2121212  1.4141414  1.6161616  1.8181818  2.0202020  2.2222222
. . .
[97] 19.3939394 19.5959596 19.7979798 20.0000000

dnorm(x, mean, sd)

• dnorm() is used to obtain the normal probability density at x, given the values of mean and sd.

> y <- dnorm(x,
             mean(data.serial$Y, na.rm=TRUE),
             sd(data.serial$Y, na.rm=TRUE))
> y

DENSITY PLOT : lines()

> lines(x, y)
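The hist(), seq(), dnorm() and lines() steps can be put together in one self-contained sketch. The Y values are retyped as a plain vector, and the object h (the value returned by hist()) is an illustrative addition used only to confirm that the bars have a total area of one when freq=FALSE.

```r
# Y values of data.serial, in table order (two NAs; hist() ignores them).
Y <- c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
       3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
       8, NA, 8, 6, 5, 4, 4, 7)

# Histogram on the density scale.
h <- hist(Y, freq = FALSE, main = "Histogram of Y",
          xlab = "Y Class", ylab = "Probability", xlim = c(0, 20))

# Normal density with the sample mean and sd, evaluated on a grid.
x <- seq(from = 0, to = 20, length.out = 100)
y <- dnorm(x, mean = mean(Y, na.rm = TRUE), sd = sd(Y, na.rm = TRUE))

# Overlay the curve on the existing histogram.
lines(x, y)

# Total area of the bars: bar height times bar width, summed.
area <- sum(h$density * diff(h$breaks))
area
```

Because both the bars and the curve are on the density scale, they can be compared directly; with freq=TRUE the curve would appear flat against the counts.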

HISTOGRAM WITH DENSITY PLOT: mtext()

> mtext("Fitting to a normal distribution")

• mtext(text, side=3, ...) displays text in the margin of the plot; by default (side=3) it is written on top.

# text – a character expression specifying the text to be written

# side – the side of the plot on which to display the text: 1 – bottom, 2 – left, 3 – top, 4 – right

CASE1. HISTOGRAM WITH DENSITY PLOT

> hist(RF$RLD0, main='Histogram of RLD0',
       col='plum4', border='black', breaks=5,
       xlab="RLD0 Class",
       ylab="Probability",
       freq=FALSE,
       xlim=c(0, 20))
> x <- seq(from=0, to=20, length=100)
> x
> y <- dnorm(x,
             mean(RF$RLD0, na.rm=TRUE),
             sd(RF$RLD0, na.rm=TRUE))
> lines(x, y)
> mtext("Fitting to a normal distribution")

HISTOGRAM WITH DENSITY PLOT: lines(), dnorm(), and mtext()

[Figure: "Histogram of Y with Density plot"; x-axis: Y class (0 to 10); y-axis: Probability (0.00 to 0.20); margin text: "Fitting to a normal distribution"]

BOXPLOT

• Ex1. To obtain a boxplot of Y with other graphics parameters

> boxplot(data.serial$Y,
          boxwex=0.35,
          main="Boxplot of Y",
          xlab="Y",
          horizontal=TRUE)

# boxwex – controls the width of the box

# horizontal – logical; if TRUE, the boxplot is drawn horizontally

[Figure: horizontal boxplot titled "Boxplot of Y"; x-axis: Y (0 to 8)]

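BOXPLOT : boxplot()

• To obtain side-by-side boxplots of Y for each Site, either of the following calls can be used:

> boxplot(split(data.serial$Y, data.serial$Site))

> boxplot(Y~Site, data=data.serial)

[Figure: boxplots of Y for Sites A, B, and C; y-axis: 0 to 8]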
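The numbers behind a boxplot can be inspected without drawing it by calling boxplot() with plot=FALSE. A minimal sketch, with the data rebuilt under the illustrative name d; b$stats holds, per Site, the lower whisker, lower hinge, median, upper hinge and upper whisker.

```r
# Rebuild the example data so the snippet is self-contained.
d <- data.frame(
  Site = factor(rep(c("A", "B", "C"), times = c(12, 12, 8))),
  Y    = c(3, 6, 8, 5, 4, 4, 6, 9, 7, 4, 2, 4,
           3, 6, 5, NA, 7, 0, 8, 2, 5, 7, 4, 4,
           8, NA, 8, 6, 5, 4, 4, 7)
)

# Compute the boxplot statistics per Site without plotting.
b <- boxplot(Y ~ Site, data = d, plot = FALSE)

b$names        # group labels: "A" "B" "C"
b$n            # non-missing observations per group: 12 11 7
b$stats[3, ]   # medians per group: 4.5 5.0 6.0
b$stats        # full 5 x 3 matrix of box and whisker positions
```

The medians agree with the per-Site summary() output obtained earlier with tapply().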

THANK YOU! ☺☺☺☺

Please do Exercise C