statistics primer

Statistics Primer

Xiayu (Stacy) Huang

Bioinformatics Shared Resource

Email: bsr_help@sanfordburnham.org

Sanford | Burnham Medical Research Institute

Outline

Overview of basic statistics

• Introduction

• Descriptive statistics

• Inferential statistics

Most common statistical test and its applications

• T test

• Power analysis using t test

What is statistics?

The American Statistical Association (ASA) website defines statistics as the science of collection, analysis, interpretation and presentation of data

Using statistics to make decisions can be a double-edged sword

• In the 1980s, Marriott conducted an extensive survey of potential customers about their attitudes toward current hotel offerings. After analyzing the data, the company launched Courtyard by Marriott, which has been a huge success

• Coca-Cola performed a major consumer study in 1985 and, based on the results, decided to reformulate Coke, its flagship drink. After a huge public outcry, Coca-Cola had to backtrack and bring the original formulation back to market

History of statistics

•17th-18th century

Jakob Bernoulli: Bernoulli number, Bernoulli trial, Bernoulli process

Thomas Bayes: Bayes theorem

•19th century

Carl Friedrich Gauss: Gaussian distribution

•20th century

Karl Pearson: Pearson correlation, Chi-square distribution

William Gosset: Student’s t

Ronald Aylmer Fisher: ANOVA, maximum likelihood

Why is statistics important to biologists?

• Designing experiments

• Analyzing biological data and understanding analysis results

• Preparing manuscripts and grant applications

How many replicates for my microarray experiment?

No replicates = no statistics?

Identifying outliers, normalization/transformation, statistical test, etc.

Study Scheme

Study Hypothesis

Design Study

Conduct Study and Collect data

Data Analysis

Make Conclusions

Summarizing data using Descriptive Statistics

Hypothesis Testing Using Inferential Statistics

Choose Statistical Test

Compute test statistic

Compute p-value

Compare p-value and α

Branches of statistics

Descriptive statistics (summary statistics)

• Summarize data graphically or numerically

• Lead to hypothesis generation

Inferential statistics

• Distinguish true differences from random variation

• Allow hypothesis testing

Types of data

Type of data: examples

Qualitative: gender, genotype, tumor location

Qualitative or quantitative: performance, grade of toxicity, disease stage

Quantitative: age, array intensities

Descriptive statistics—central tendency

Mean—average

Median—middle value of sorted data

Mode—most frequently observed value

Example: age (13 values)

Mean = (24 + 27 + … + 24)/13 = 24.8

Median is the middle value of the sorted ages

Mode is 24, with a frequency of 3
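The three measures above can be checked with Python's standard `statistics` module. This is a minimal sketch: the 13 individual ages are not recoverable from the slide text, so the `ages` list below is a hypothetical set chosen only to be roughly consistent with the slide's summary (13 values, mean about 24.8, mode 24 observed 3 times).

```python
import statistics

# Hypothetical ages (NOT the slide's original data): 13 values chosen to be
# roughly consistent with the summary numbers quoted on the slide.
ages = [24, 27, 22, 23, 25, 22, 24, 26, 23, 28, 25, 29, 24]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequently observed value

print(f"mean   = {mean_age:.1f}")
print(f"median = {median_age}")
print(f"mode   = {mode_age} (observed {ages.count(mode_age)} times)")
```

If several values tie for the highest frequency, `statistics.multimode` returns all of them instead of a single mode.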

Descriptive statistics—dispersion

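Sample variance ($s^2$) and standard deviation ($s$), e.g. for age:

$$s^2 = \frac{(22-\text{mean})^2 + (22-\text{mean})^2 + \dots + (29-\text{mean})^2}{n-1} = 4.84, \qquad s = \sqrt{s^2} = 2.2$$

Values beyond two standard deviations from the mean can be considered “outliers” (> mean + 2s = 24.8 + 2 × 2.2 = 29.2 or < mean - 2s = 24.8 - 2 × 2.2 = 20.4)

Standard error of the mean (SEM), e.g. for age:

$$\text{SEM} = \frac{s}{\sqrt{n}} = \frac{2.2}{\sqrt{13}} = 0.61$$

Range = highest value - lowest value = 29 - 22 = 7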
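A short sketch of the dispersion measures on this slide, using the standard library only. The `ages` list is hypothetical (the slide's raw values are not in the text); it was picked so that the results land close to the slide's s ≈ 2.2, SEM ≈ 0.61 and range 7, not to reproduce them exactly.

```python
import math
import statistics

# Hypothetical ages (NOT the slide's original data), 13 values from 22 to 29.
ages = [24, 27, 22, 23, 25, 22, 24, 26, 23, 28, 25, 29, 24]

n = len(ages)
mean_age = statistics.mean(ages)
variance = statistics.variance(ages)  # sample variance: divides by n - 1
sd = statistics.stdev(ages)           # square root of the sample variance
sem = sd / math.sqrt(n)               # standard error of the mean
value_range = max(ages) - min(ages)

# "Two standard deviations" rule of thumb for flagging possible outliers.
low, high = mean_age - 2 * sd, mean_age + 2 * sd
outliers = [a for a in ages if a < low or a > high]

print(f"s^2 = {variance:.2f}, s = {sd:.2f}, SEM = {sem:.2f}, range = {value_range}")
print(f"outlier limits: {low:.1f} to {high:.1f}, flagged: {outliers}")
```

With these illustrative values nothing is flagged, because the largest age (29) sits just inside the upper limit.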

Descriptive statistics-data distribution

Histogram (x: bin, y: frequency)

• Graphical representation showing the distribution of data

• Summary graph showing how many data points fall in various ranges

Frequency table (histogram / frequency distribution)

Bin      Frequency
20-22    2
22-24    5
24-26    3
26-28    2
28-30    1

Percentage table (histogram / probability distribution)

Bin      Proportion
20-22    0.155
22-24    0.38
24-26    0.23
26-28    0.155
28-30    0.08
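The frequency and percentage tables can be built by counting how many values fall in each bin. The sketch below makes two assumptions: the `ages` list is hypothetical (the raw ages are not in the slide text), and each bin includes its upper edge, i.e. (low, high]. With these illustrative ages, that convention gives the same counts as the frequency table above.

```python
# Hypothetical ages (NOT the slide's original data).
ages = [24, 27, 22, 23, 25, 22, 24, 26, 23, 28, 25, 29, 24]
edges = [20, 22, 24, 26, 28, 30]
bins = list(zip(edges[:-1], edges[1:]))

# Assumption: a value equal to a bin's upper edge belongs to that bin.
counts = [sum(1 for a in ages if low < a <= high) for low, high in bins]
proportions = [c / len(ages) for c in counts]

# Text histogram: one '#' per observation.
for (low, high), c, p in zip(bins, counts, proportions):
    print(f"{low}-{high}  {c:2d}  {p:.3f}  {'#' * c}")
```

Dividing each count by the total number of observations turns the frequency distribution into proportions that sum to 1.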

Different data distributions

Descriptive statistics—data distribution

Approximately normal distribution, e.g. height of people, length of dogs

Right-skewed distribution, e.g. fold change (FC) of microarray data

Left-skewed distribution, e.g. distribution of age at retirement

Normal (or Gaussian) distribution

•Bell-shaped curve

•Symmetrical about the mean

•Mean, median and mode are equal (mean = median = mode)

•~68% of data points fall within 1 sd of the mean

•~95% of data points fall within 2 sd of the mean

•~99.7% of data points fall within 3 sd of the mean
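The 68/95/99.7 percentages for a normal distribution can be verified from its cumulative distribution function. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later):

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

# Fraction of a normal distribution lying within k standard deviations
# of the mean: CDF(k) - CDF(-k).
coverage = {k: std_normal.cdf(k) - std_normal.cdf(-k) for k in (1, 2, 3)}

for k, fraction in coverage.items():
    print(f"within {k} sd of the mean: {fraction:.1%}")
```

The same fractions hold for any mean and standard deviation, because the rule is stated in units of sd.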

Installing GraphPad Prism

You can install Prism on Institute-supplied computers, including home and personal computers.

http://graphpad.com/paasl/index.cfm?sitecode=burnhm

SERIAL NUMBERS:

Macintosh version: contact IT (support@sanfordburnham.org) to get the serial number

Windows version: contact IT (support@sanfordburnham.org) to get the serial number

Calculating descriptive statistics in Excel

Calculating descriptive statistics in Prism

Graphically displaying descriptive statistics

Histogram

Mean error bar plot

Line plot w/o error bar

Graphically displaying descriptive statistics in Prism

Mean error bar plot

Histogram and frequency distribution

Graphically displaying descriptive statistics in Prism

Group line plot

Group line plot with error bar

Group line plot without error bar

Choosing right measures of descriptive statistics

Normal distribution: summarize with mean and standard deviation

Skewed distribution: transform the data toward a normal distribution first (for example, a log transform for right-skewed data)
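One common way to handle right-skewed data is a log transform. The sketch below uses hypothetical intensity values (not taken from the slides) that were deliberately chosen as powers of two, so the effect is easy to see: on the raw scale the mean is pulled above the median by the large values, while on the log2 scale the two coincide.

```python
import math
import statistics

# Hypothetical right-skewed intensities (illustrative only).
raw = [1, 2, 2, 4, 4, 4, 4, 8, 8, 16]
logged = [math.log2(x) for x in raw]

raw_mean, raw_median = statistics.mean(raw), statistics.median(raw)
log_mean, log_median = statistics.mean(logged), statistics.median(logged)

print(f"raw scale:  mean = {raw_mean}, median = {raw_median}")
print(f"log2 scale: mean = {log_mean}, median = {log_median}")
```

Real data will rarely become perfectly symmetric, so it is still worth checking the histogram after transforming.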

Outline

Overview of basic statistics

• Brief Introduction

Most common statistical tests and their applications

• T test

• Power analysis using t test

Parametric

• Interval or ratio measurements

• Continuous variables

• Usually assume data are normally distributed

Nonparametric

• Ordinal or nominal measurements

• Discrete variables

• Make no assumption about how data are distributed

Inferential statistics-hypothesis

Null hypothesis (H0)

• e.g. new drug effect = old drug effect

• e.g. tumor growth of MT = tumor growth of WT

Alternative hypothesis (HA)

• is the opposite of the null hypothesis

• is generally the hypothesis that the researcher believes to be true

• e.g. new drug effect ≠ or > old drug effect

• e.g. tumor growth of MT ≠ or < tumor growth of WT

Inferential statistics-one and two sided tests

Hypothesis tests can be one or two sided (tailed)

One sided tests are directional:

H0 : new drug effect ≤ old drug effect

HA : new drug effect > old drug effect

Two sided tests are not directional:

H0 : new drug effect = old drug effect

HA : new drug effect ≠ old drug effect

Inferential statistics-type I and type II errors

                 “Actual situation”
“Measured”       No difference (H0)            Difference (HA)
No difference    Correct decision (TN), 1-α    Type II error (FN), β
Difference       Type I error (FP), α          Correct decision (TP), 1-β

Example: FOB screening (bowel cancer)

                 “Actual situation”
“Measured”       No difference (H0)    Difference (HA)
No difference    1820                  10
Difference       180                   20
Total            2000                  30

Correct decision (TN): 1-α = 1820/2000 = 0.91

Type I error (FP): α = 180/2000 = 0.09

Type II error (FN): β = 10/30 = 0.33

Correct decision (TP): 1-β = 20/30 = 0.67
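The error rates for the FOB screening example follow directly from the four counts in the table. A small sketch of that arithmetic (the variable names are illustrative):

```python
# Counts from the FOB screening (bowel cancer) example.
tn, fn = 1820, 10  # measured "no difference": true negatives, false negatives
fp, tp = 180, 20   # measured "difference": false positives, true positives

actual_no_difference = tn + fp  # 2000 people without the condition
actual_difference = fn + tp     # 30 people with the condition

alpha = fp / actual_no_difference  # type I error rate (false positive rate)
beta = fn / actual_difference      # type II error rate (false negative rate)
power = 1 - beta                   # true positive rate

print(f"alpha = {alpha:.2f}, 1 - alpha = {1 - alpha:.2f}")
print(f"beta  = {beta:.2f}, power (1 - beta) = {power:.2f}")
```

Note that each rate is computed within a column of the table: α uses only the people with no actual difference, β only those with an actual difference.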

Inferential statistics-type I and type II errors

• Control type I and type II errors

• Inverse relationship between type I and type II errors

• Make a choice about which error to control

• e.g. controlling type I error (FP) is more important than type II error (FN) for microarray data

• e.g. controlling type II error (FN) is more important than type I error (FP) for a cancer screening test

• How to choose type I and type II error rates for a statistical test?

• Common choices (α = 5%, β = 20%)

• Exploratory study (α = 10%, β = 10%)

• Confirmatory study (α = 1%, β = 10%)

Inferential Statistics-P-value

• The probability of observing a difference at least as large as the one seen, by chance alone, if the null hypothesis is true

• Computed from the test statistic

• The p-value is compared with the false positive rate (α) chosen in advance; it is not itself the false positive rate

• A p-value below the cutoff (α) is referred to as “statistically significant”
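To make the link between a test statistic, a p-value and α concrete, here is a minimal sketch. It assumes the test statistic follows a standard normal distribution under the null hypothesis (a z test, used here only because the standard library provides the normal CDF; a t test works the same way with the t distribution). The value `z = 1.96` is illustrative, not a result from the slides.

```python
from statistics import NormalDist

def p_values_from_z(z):
    """Return (one-sided upper-tail p, two-sided p) for a z statistic."""
    std_normal = NormalDist()
    one_sided = 1 - std_normal.cdf(z)             # HA: effect is greater
    two_sided = 2 * (1 - std_normal.cdf(abs(z)))  # HA: effect is different
    return one_sided, two_sided

alpha = 0.05
p_one, p_two = p_values_from_z(1.96)

print(f"one-sided p = {p_one:.4f}, two-sided p = {p_two:.4f}")
print("statistically significant (two-sided)?", p_two < alpha)
```

For the same statistic the two-sided p-value is twice the one-sided one, which is why the direction of the hypothesis should be fixed before looking at the data.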

Inferential Statistics-Power

Power (1-β, aka true positive rate (TP))

• Probability of detecting a scientifically significant difference when it does exist

Power depends on:

• Sample size (n)

• Standard deviation (s)

• Size of the difference you want to detect (δ)

• False positive rate (α)

Effect size: δ/s, which combines the standard deviation (s) and the size of the difference (δ)
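The dependence of power on sample size, effect size and α can be sketched with a normal approximation for a two-sided, two-sample comparison with equal group sizes. This is an approximation under stated assumptions, not the exact t-based calculation that dedicated power software performs: it uses z quantiles instead of t quantiles and ignores the negligible far tail, so it is slightly optimistic for small samples. The function name and arguments are illustrative.

```python
import math
from statistics import NormalDist

def approx_power(n_per_group, effect_size, alpha=0.05):
    """Approximate power of a two-sided two-sample test (normal approximation).

    effect_size is delta / s; n_per_group is the sample size in each group.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = effect_size * math.sqrt(n_per_group / 2)
    return std_normal.cdf(shift - z_crit)

for n in (5, 10, 16, 30):
    print(f"n = {n:2d} per group, delta/s = 1.0: power ~ {approx_power(n, 1.0):.2f}")
```

Power rises with larger n, larger δ/s, or a more lenient α, which is the trade-off listed above.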

Study scheme

Study Hypothesis

Design Study

Conduct Study and Collect data

Data Analysis

Make Conclusions

Calculating and Displaying Descriptive Statistics

Hypothesis Testing Using Inferential Statistics

Choose Statistical Test

Compute test statistic

Compute p-value

Compare p-value and α

How to choose an appropriate statistical test?

Type of data: quantitative, qualitative

Type of research question: association, correlation, comparison

Data structure: independent, paired, matched

Statistical test decision making tree

For qualitative or non-numerical data

For quantitative or numerical data

Two sample comparison

Multiple sample comparison

Relationship between variables

Outline

Overview of basic statistics

• Brief Introduction

Most common statistical test and its applications

• T test

• Power analysis using t test

Student’s t test

Guinness employee William Sealy Gosset published the 'Student's t-test' in 1908

Types of t test

One sample t test: tests whether a sample mean differs significantly from a given known mean

Unpaired t test: tests whether two independent sample means differ significantly

Paired t test: tests whether two dependent sample means differ significantly (e.g. the means before and after treatment for the same set of patients)
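A paired t test is a one sample t test applied to the within-subject differences, with a known mean of 0. The sketch below computes only the t statistic; the `pre` and `post` measurements are hypothetical values invented for illustration, and the function name is not from any library.

```python
import math
import statistics

def one_sample_t(values, known_mean):
    """t statistic for testing whether the mean of values equals known_mean."""
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return (statistics.mean(values) - known_mean) / standard_error

# Hypothetical measurements on the same five patients before and after treatment.
pre = [10, 12, 9, 11, 13]
post = [12, 13, 11, 12, 15]

differences = [after - before for before, after in zip(pre, post)]
paired_t = one_sample_t(differences, known_mean=0)
degrees_of_freedom = len(differences) - 1

print(f"paired t = {paired_t:.3f} with {degrees_of_freedom} degrees of freedom")
```

Because each patient serves as their own control, only the spread of the differences enters the standard error, not the spread between patients.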

Application of t test in biology

Proteomics experiment

Technical reps

Biological reps

You need at least two replicates in each condition to do a t test; otherwise the t test is invalid and you won’t have statistics

Microarray experiment

Two sample unpaired t test

Assumptions

• Data are approximately normally distributed

• The samples have been independently and randomly selected

• Similar variances between the groups being compared

Hypothesis (two sided or one sided)

Test statistic

$$t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p\sqrt{1/n_1 + 1/n_2}} \sim t_{n_1+n_2-2}, \qquad s_p^2 = \frac{(n_1-1)s_1^2 + (n_2-1)s_2^2}{n_1+n_2-2}$$

• $\bar{X}_1, \bar{X}_2$: sample means

• $\mu_1, \mu_2$: population means

• $s_1, s_2$: sample standard deviations

• $n_1, n_2$: sample sizes

• $s_p^2$: pooled sample variance

Sample data

1st Question to be answered:

Will the two treatments have different effect on patients’ remission time from cancer?

Patient  Treatment  Remission time from cancer (years)
1        Drug       7
2        Drug       5
3        Drug       2
4        Drug       8
5        Drug       3
6        Drug       4
7        Drug       10
8        Drug       7
9        Drug       4
10       Drug       9
11       Placebo    4
12       Placebo    3
13       Placebo    1
14       Placebo    6
15       Placebo    2
16       Placebo    4
17       Placebo    9
18       Placebo    5
19       Placebo    3
20       Placebo    8

Summarizing sample data using descriptive statistics

Hypothesis testing of sample data using inferential statistics

Step1: Choosing an appropriate statistical test

Step2: Performing statistical test in software

Step3: Making conclusions
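The three steps can also be followed in code. The sketch below applies a two sample unpaired t test with pooled variance to the remission times from the sample data slide, using only the Python standard library. The standard library has no t distribution, so the two-sided p-value is obtained here by numerically integrating the t density with Simpson's rule; that is a choice made for this sketch, not what Prism or Excel do internally. In everyday use, `scipy.stats.ttest_ind(drug, placebo, equal_var=True)` would be the usual shortcut.

```python
import math
import statistics

# Remission times (years) from the sample data slide.
drug = [7, 5, 2, 8, 3, 4, 10, 7, 4, 9]
placebo = [4, 3, 1, 6, 2, 4, 9, 5, 3, 8]

def pooled_t(x, y):
    """Two sample t statistic with pooled variance, and its degrees of freedom."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / df
    se = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    return (statistics.mean(x) - statistics.mean(y)) / se, df

def t_density(x, df):
    """Probability density of Student's t distribution."""
    norm = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return norm * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t, df, steps=2000):
    """Two-sided p-value: 1 minus the area between -|t| and |t| (Simpson's rule)."""
    a = abs(t)
    h = 2 * a / steps
    total = t_density(-a, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(-a + i * h, df)
    return 1 - total * h / 3

t_stat, df = pooled_t(drug, placebo)
p_value = two_sided_p(t_stat, df)
alpha = 0.05

print(f"mean drug = {statistics.mean(drug)}, mean placebo = {statistics.mean(placebo)}")
print(f"t = {t_stat:.3f}, df = {df}, two-sided p = {p_value:.3f}")
print("significant at alpha = 0.05?", p_value < alpha)
```

The conclusion step is the comparison of the p-value with α in the last line; the normality and equal-variance assumptions should still be checked separately.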

Two sample t test in Prism-normality check

Two sample t test in Prism

Two sample t test in excel

Power analysis using two sample t test

2nd question to be answered:

How many patients do we need in order to detect a significant difference between the two treatments?

N α β δ/s Test K:1 efficiency imbalance

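Sample size per group:

$$n = \frac{2 s^2 \left(t_{1-\alpha/2} + t_{1-\beta}\right)^2}{\delta^2} = \frac{2 \left(t_{1-\alpha/2} + t_{1-\beta}\right)^2}{(\delta/s)^2}$$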
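The sample-size question above can be approximated in a few lines. This sketch makes a simplifying assumption: it replaces the t quantiles with normal (z) quantiles, which avoids the iteration a t-based calculation needs (the t quantiles depend on n) and tends to give a slightly smaller n than dedicated power software. The function name and defaults are illustrative; the defaults follow the common choices α = 5%, β = 20%.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, beta=0.20):
    """Approximate patients needed per group for a two-sided two-sample test.

    effect_size is delta / s. Normal approximation; rounds up to a whole patient.
    """
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(1 - beta)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

for d in (0.5, 1.0, 1.5):
    print(f"delta/s = {d}: about {n_per_group(d)} patients per group")
```

Halving the effect size roughly quadruples the required sample size, because n scales with 1/(δ/s)².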

Power analysis of t test in G*power

Basic Statistics tools

Statistics software and packages:

1. Excel and add-ins: EZAnalyze, Analysis Toolpak

2. Prism (supported by our institute)

3. SPSS, Statistica (commercial)

4. SAS (commercial) and R

5. G*Power

Basic statistics books:

1. Intro Stats, SDSU, 2nd edition, Deveaux, Velleman, Bock

2. Choosing and Using Statistics: A Biologist's Guide

3. Introduction to Statistics for Biology

4. Biostatistical analysis, fifth edition, Jerrold H. Zar

Statistics videos:

1. http://www.microbiologybytes.com/maths/videos

2. http://www.youtube.com: descriptive statistics, basic statistics, install 2007 Excel data analysis add-ins…

Next.....

My presentation will be posted on website: http://bsrweb.burnham.org/

I am located in building 10, Office 2405, ext 3916

Feel free to come or call or send e-mail to ask questions (xyhuang@sanfordburnham.org)

Group email: bsr_help@sanfordburnham.org
