statistics primer
Post on 20-Jan-2016
Statistics Primer
Xiayu (Stacy) Huang
Bioinformatics Shared Resource
Email: bsr_help@sanfordburnham.org
Sanford | Burnham Medical Research Institute
Outline
Overview of basic statistics
• Introduction
• Descriptive statistics
• Inferential statistics
Most common statistical test and its applications
• T test
• Power analysis using t test
What is statistics?
On the American Statistical Association (ASA) website, statistics is defined as the science of collection, analysis, interpretation and presentation of data
Using statistics to make decisions can be a double-edged sword
• In the 1980s, Marriott conducted an extensive survey of potential customers on their attitudes toward current hotel offerings. After analyzing the data, the company launched Courtyard by Marriott, which has been a huge success
• Coca-Cola performed a major consumer study in 1985 and, based on the results, decided to reformulate Coke, its flagship drink. After a huge public outcry, Coca-Cola had to backtrack and bring the original formulation back to market
History of statistics
•17th-18th century
Jakob Bernoulli: Bernoulli number, Bernoulli trial, Bernoulli process
Thomas Bayes: Bayes theorem
•19th century
Carl Friedrich Gauss: Gaussian distribution
•20th century
Karl Pearson: Pearson correlation, chi-square distribution
William Gosset: Student’s t
Ronald Aylmer Fisher: ANOVA, maximum likelihood
Why is statistics important to biologists?
• Designing experiment
• Analyzing biological data and understanding analysis results
• Preparing manuscript and grant applications
How many DEGs (differentially expressed genes)?
How many replicates for my microarray experiment?
No replicates = no statistics?
Identifying outliers, normalization/transformation, statistical tests, etc.
Study Scheme
Study Hypothesis
Design Study
Conduct Study and Collect data
Data Analysis
Make Conclusions
Summarizing data using descriptive statistics
Hypothesis testing using inferential statistics
• Choose statistical test
• Compute test statistic
• Compute p-value
• Compare p-value and α
Branches of statistics
Descriptive statistics (summary statistics)
• Summarize data graphically or numerically
• Lead to hypothesis generation
Inferential statistics
• Distinguish true difference from random variation
• Allow hypothesis testing
Types of data
Data can be qualitative or quantitative
Examples
Qualitative: gender, genotype, tumor location
Qualitative or quantitative: performance, grade of toxicity, disease stage
Quantitative: age, array intensities
Descriptive statistics—central tendency
Mean—average
Median—middle value of sorted data
Mode—most frequently observed value
Example: age (n = 13)
Sorted values: 22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29
Mean = (24+27+…+24)/13 = 24.8
Median = 24 (the 7th of the 13 sorted values)
Mode = 24, with a frequency of 3
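The three measures of central tendency can be checked with a short Python sketch. It uses only the standard library `statistics` module and the 13 age values from the example; the variable names are illustrative and not part of the slides.

```python
import statistics

# The 13 ages from the example, in sorted order.
ages = [22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29]

mean_age = statistics.mean(ages)      # arithmetic average
median_age = statistics.median(ages)  # middle value of the sorted data
mode_age = statistics.mode(ages)      # most frequently observed value
mode_count = ages.count(mode_age)     # how often the mode occurs

print(f"mean   = {mean_age:.1f}")
print(f"median = {median_age}")
print(f"mode   = {mode_age} (frequency {mode_count})")
```

With an odd number of values the median is a single observed value; with an even number, `statistics.median` averages the two middle values.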
Descriptive statistics—dispersion
Range
Sample variance (s^2) / standard deviation (s)
Values beyond two standard deviations from the mean can be considered as “outliers” (>mean+2s=24.8+2x2.2=29.2 or <mean-2s=24.8-2x2.2=20.4)
Standard error of mean (SEM)
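Example: age (n = 13, mean = 24.8)
Range = highest value - lowest value = 29 - 22 = 7
\[ s^2 = \frac{(22-\text{mean})^2 + (22-\text{mean})^2 + \dots + (29-\text{mean})^2}{13-1} \approx 4.86 \]
\[ s = \sqrt{s^2} \approx 2.2 \]
\[ \text{SEM} = \frac{s}{\sqrt{n}} = \frac{2.2}{\sqrt{13}} = 0.61 \]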
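The dispersion measures for the age example can be reproduced with the sketch below. It assumes the sample (n-1) definitions of variance and standard deviation, which is what `statistics.variance` and `statistics.stdev` compute; the two-standard-deviation outlier rule is the rule of thumb stated on the slide, and the variable names are illustrative.

```python
import math
import statistics

ages = [22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29]
n = len(ages)

mean_age = statistics.mean(ages)
variance = statistics.variance(ages)   # sample variance, divides by n - 1
sd = statistics.stdev(ages)            # square root of the sample variance
sem = sd / math.sqrt(n)                # standard error of the mean
value_range = max(ages) - min(ages)    # highest value minus lowest value

# Rule of thumb from the slide: flag values beyond mean +/- 2 s.
lower, upper = mean_age - 2 * sd, mean_age + 2 * sd
outliers = [a for a in ages if a < lower or a > upper]

print(f"range = {value_range}, s^2 = {variance:.2f}, s = {sd:.2f}, SEM = {sem:.2f}")
print(f"outlier limits: {lower:.1f} to {upper:.1f}, outliers: {outliers}")
```

In this data set no value falls outside the limits, so the outlier list is empty.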
Descriptive statistics: data distribution
Histogram (x: bin, y: frequency)
• Graphical representation showing the distribution of data
• Summary graph showing how many data points fall in various ranges
Example: age, sorted values: 22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29
Frequency table (histogram / frequency distribution); each bin includes its upper edge
Bin Frequency
20-22 2
22-24 5
24-26 3
26-28 2
28-30 1
Relative frequency table (histogram / probability distribution)
Bin Relative frequency
20-22 0.154
22-24 0.385
24-26 0.231
26-28 0.154
28-30 0.077
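A minimal sketch of how the frequency and relative-frequency tables can be built from the age values. It assumes each bin excludes its lower edge and includes its upper edge (so 22 is counted in 20-22), which is the convention that reproduces the frequencies in the table; other tools may bin differently. Names are illustrative.

```python
ages = [22, 22, 23, 23, 24, 24, 24, 25, 25, 26, 27, 28, 29]
bin_edges = [20, 22, 24, 26, 28, 30]

counts = []
for low, high in zip(bin_edges, bin_edges[1:]):
    # Count values in the half-open interval (low, high].
    counts.append(sum(1 for a in ages if low < a <= high))

# Relative frequency = count divided by the total number of values.
proportions = [c / len(ages) for c in counts]

# Print a small text histogram: one '#' per observation.
for (low, high), c, p in zip(zip(bin_edges, bin_edges[1:]), counts, proportions):
    print(f"{low}-{high}  {c}  {p:.3f}  {'#' * c}")
```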
Descriptive statistics: data distribution
Different data distributions
• Approximately normal distribution, e.g. height of people, length of dogs
• Right-skewed distribution, e.g. fold change (FC) of microarray data
• Left-skewed distribution, e.g. distribution of age at retirement
Normal (or Gaussian) distribution
•Bell-shaped curve
•Symmetrical about mean
•Mean, median and mode are equal (mean = median = mode)
•~68% of data points fall within 1 sd of mean
•~95% of data points fall within 2 sd of mean
•~99.7% of data points fall within 3 sd of mean
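The 68/95/99.7 figures follow from the normal cumulative distribution function. The sketch below checks them with `statistics.NormalDist` from the Python standard library (Python 3.8+); the helper name `fraction_within` is illustrative.

```python
from statistics import NormalDist

# Standard normal distribution: mean 0, standard deviation 1.
std_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a normal population lying within k sd of the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} sd: {fraction_within(k):.4f}")
```

Because the result depends only on the distance from the mean in units of sd, the same fractions hold for any normal distribution, whatever its mean and standard deviation.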
Installing graphpad prism
You can install Prism on Institute supplied computers, including home and personal computers.
http://graphpad.com/paasl/index.cfm?sitecode=burnhm
SERIAL NUMBERS:
Macintosh version: contact IT (support@sanfordburnham.org) to get the serial number
Windows version: contact IT (support@sanfordburnham.org) to get the serial number
Calculating descriptive statistics in excel
Calculating descriptive statistics in prism
Graphically displaying descriptive statistics
Histogram
Mean error bar plot
Line plot w/o error bar
Graphically displaying descriptive statistics in Prism
Mean error bar plot
Histogram and frequency distribution
Graphically displaying descriptive statistics in Prism
Group line plot
Group line plot with error bar
Group line plot without error bar
Choosing right measures of descriptive statistics
Normal distribution Skewed distribution
Normal distribution: mean and standard deviation
Skewed distribution: transform data to normal distribution
Inferential statistics
Parametric
• Interval or ratio measurements
• Continuous variables
• Usually assume data are normally distributed
Nonparametric
• Ordinal or nominal measurements
• Discrete variables
• Make no assumption about how data are distributed
Inferential statistics-hypothesis
Null hypothesis (H0)
e.g. new drug effect = old drug effect; tumor growth of MT = tumor growth of WT
Alternative hypothesis (HA)
• is the opposite of the null hypothesis
• is generally the hypothesis that the researcher believes to be true
e.g. new drug effect ≠ or > old drug effect; tumor growth of MT ≠ or < tumor growth of WT
Inferential statistics-one and two sided tests
Hypothesis tests can be one or two sided (tailed)
One sided tests are directional:
H0 : new drug effect ≤ old drug effect
HA : new drug effect > old drug effect
Two sided tests are not directional:
H0 : new drug effect = old drug effect
HA : new drug effect ≠ old drug effect
Inferential statistics-type I and type II errors
Decision table (rows: measured result; columns: actual situation)
Measured    No difference (H0)             Difference (HA)
-           Correct decision (TN), 1-α     Type II error (FN), β
+           Type I error (FP), α           Correct decision (TP), 1-β
Example: FOB screening (bowel cancer)
Measured    No difference (H0)    Difference (HA)    Total
-           1820                  10                 1830
+           180                   20                 200
Total       2000                  30                 2030
Correct decision (TN): 1-α = 1820/2000 = 0.91
Type I error (FP): α = 180/2000 = 0.09
Type II error (FN): β = 10/30 = 0.33
Correct decision (TP): 1-β = 20/30 = 0.67
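The error rates in the FOB screening example can be recomputed from the four cell counts. The sketch below uses the counts from the slide; the variable names are illustrative. Each rate is divided by the total of the actual situation, not by the total of the measured result.

```python
# Cell counts from the FOB screening example.
tn, fn = 1820, 10   # measured negative: actually no difference / actually difference
fp, tp = 180, 20    # measured positive: actually no difference / actually difference

actual_negative = tn + fp   # 2000 cases with no real difference (H0 true)
actual_positive = fn + tp   # 30 cases with a real difference (HA true)

alpha = fp / actual_negative   # type I error rate (false positive rate)
beta = fn / actual_positive    # type II error rate (false negative rate)
specificity = 1 - alpha        # correct decision when H0 is true
power = 1 - beta               # correct decision when HA is true

print(f"alpha = {alpha:.2f}, 1 - alpha = {specificity:.2f}")
print(f"beta  = {beta:.2f}, power = 1 - beta = {power:.2f}")
```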
Inferential statistics-type I and type II errors
• Control type I and type II errors
• Inverse relationship between type I and type II errors (for a fixed sample size)
• Make a choice about which error to control
• e.g. controlling type I error (FP) is more important than type II error (FN) for microarray data
• e.g. controlling type II error (FN) is more important than type I error (FP) for a cancer screening test
• How to choose type I and type II error rates for a statistical test?
• Common choices (α = 5%, β = 20%)
• Exploratory study (α = 10%, β = 10%)
• Confirmatory study (α = 1%, β = 10%)
Inferential Statistics-P-value
• The probability of observing a difference at least as large as the one measured if the null hypothesis were true
• Computed from the test statistic
• The p-value is not itself the false positive rate; the cutoff α is the false positive rate accepted when the null hypothesis is true
• A p-value below the cutoff (α) is referred to as “statistically significant”
Inferential Statistics-Power
Power (1-β, aka true positive rate (TP))
• Probability of detecting a significant scientific difference when it does exist
Power depends on:
• Sample size (n)
• Standard deviation (s)
• Size of the difference you want to detect (δ)
• False positive rate (α)
Effect size = δ/s
Study scheme
Study Hypothesis
Design Study
Conduct Study and Collect data
Data Analysis
Make Conclusions
Calculating and Displaying Descriptive Statistics
Hypothesis Testing Using Inferential Statistics
Choose Statistical Test
Compute test statistic
Compute p-value
Compare p-value and α
Type of data
Quantitative Qualitative
Type of research question
Association Correlation Comparison
Data structure
Independent Paired Matched
How to choose an appropriate statistical test?
Statistical test decision making tree
For qualitative or non-numerical data
For quantitative or numerical data
Two sample comparison
Multiple sample comparison
Relationship between variables
Statistical test decision making tree
Student’s t test
Guinness employee William Sealy Gosset published the 'Student's t-test' in 1908
Types of t test
One sample t test: tests whether a sample mean differs significantly from a given known mean
Unpaired t test: tests whether the means of two independent samples differ significantly
Paired t test: tests whether the means of two dependent samples differ significantly (e.g. means of pre- and post-treatment measurements for the same set of patients)
Application of t test in biology
Proteomics experiment
WT MT
Technical reps
Biological reps
You need at least two replicates in each condition to do a t test; otherwise the t test is invalid and you won’t have statistics
Microarray experiment
WT MT
Two sample unpaired t test
Assumptions
• Data are approximately normally distributed
• The samples have been independently and randomly selected
• Similar variances between the groups being compared
Hypothesis (two sided or one sided)
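Test statistic
\[ t = \frac{(\bar{X}_1 - \bar{X}_2) - (\mu_1 - \mu_2)}{s_p \sqrt{1/n_1 + 1/n_2}} \sim t_{n_1 + n_2 - 2} \]
\[ s_p^2 = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} \]
\bar{X}_1, \bar{X}_2 -- sample means
\mu_1, \mu_2 -- population means
s_1, s_2 -- sample standard deviations
n_1, n_2 -- sample sizes
s_p^2 -- pooled sample variance
Two sided hypothesis:
\[ H_0: \mu_1 - \mu_2 = 0 \]
\[ H_A: \mu_1 - \mu_2 \neq 0 \]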
Sample data
1st question to be answered:
Do the two treatments have different effects on patients’ remission time from cancer?
Patient  Treatment  Remission time from cancer (years)
1   Drug     7
2   Drug     5
3   Drug     2
4   Drug     8
5   Drug     3
6   Drug     4
7   Drug     10
8   Drug     7
9   Drug     4
10  Drug     9
11  Placebo  4
12  Placebo  3
13  Placebo  1
14  Placebo  6
15  Placebo  2
16  Placebo  4
17  Placebo  9
18  Placebo  5
19  Placebo  3
20  Placebo  8
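As a cross-check on the software output, the unpaired two sample t statistic for this data set can be computed by hand. The sketch below is a minimal illustration, assuming equal variances (pooled variance) and a null hypothesis of no difference in means; it uses only the Python standard library, and the function name `pooled_t` is hypothetical. It returns the t statistic and degrees of freedom only; the p-value would still come from a t table or from software such as Prism or Excel.

```python
import math
import statistics

# Remission times (years) from the sample data table.
drug = [7, 5, 2, 8, 3, 4, 10, 7, 4, 9]
placebo = [4, 3, 1, 6, 2, 4, 9, 5, 3, 8]

def pooled_t(x, y):
    """Unpaired two sample t statistic with pooled variance (H0: equal means)."""
    n1, n2 = len(x), len(y)
    s1_sq = statistics.variance(x)   # sample variance of group 1
    s2_sq = statistics.variance(y)   # sample variance of group 2
    # Pooled variance: weighted average of the two sample variances.
    sp_sq = ((n1 - 1) * s1_sq + (n2 - 1) * s2_sq) / (n1 + n2 - 2)
    # Standard error of the difference between the two means.
    se = math.sqrt(sp_sq * (1 / n1 + 1 / n2))
    t = (statistics.mean(x) - statistics.mean(y)) / se
    return t, n1 + n2 - 2

t_stat, df = pooled_t(drug, placebo)
print(f"mean drug = {statistics.mean(drug):.1f}, mean placebo = {statistics.mean(placebo):.1f}")
print(f"t = {t_stat:.3f} with {df} degrees of freedom")
```

For a two sided test at α = 0.05 with 18 degrees of freedom, the tabulated critical value is about 2.10, so |t| must exceed that value for the difference to be called statistically significant.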
Summarizing sample data using descriptive statistics
Hypothesis testing of sample data using inferential statistics
Step1: Choosing an appropriate statistical test
Step2: Performing statistical test in software
Step3: Making conclusions
Statistical test decision making tree
Two sample t test in Prism-normality check
Two sample t test in Prism
Two sample t test in excel
Power analysis using two sample t test
2nd question to be answered:
How many patients do we need in order to detect a significant difference between the two treatments?
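Sample size (N) depends on: α, β, δ/s, the test used, and the K:1 allocation (efficiency, imbalance)
\[ n = \frac{2 s^2 (t_{1-\alpha/2} + t_{1-\beta})^2}{\delta^2} = \frac{2 (t_{1-\alpha/2} + t_{1-\beta})^2}{(\delta/s)^2} \]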
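A rough sample size estimate can be sketched in Python before opening G*Power. The sketch below is an approximation, not the slide's exact method: it swaps the t quantiles for standard normal quantiles (so no iteration over degrees of freedom is needed), assumes two equal-sized groups and a two sided test, and uses the effect size δ/s. The function name `n_per_group` and its defaults (α = 0.05, power = 0.80) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate patients per group for a two sided, two sample comparison.

    Normal approximation: n = 2 * (z_{1-alpha/2} + z_{1-beta})^2 / (delta/s)^2,
    where effect_size = delta/s and power = 1 - beta.
    """
    z = NormalDist()                       # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)     # quantile for the two sided alpha
    z_beta = z.inv_cdf(power)              # quantile for the desired power
    n = 2 * (z_alpha + z_beta) ** 2 / effect_size ** 2
    return math.ceil(n)                    # round up to whole patients

for d in (0.5, 1.0):
    print(f"effect size {d}: about {n_per_group(d)} patients per group")
```

Because the normal quantiles are slightly smaller than the corresponding t quantiles, an exact t-based calculation such as the one in G*Power typically gives a slightly larger n.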
Power analysis of t test in G*power
Basic statistics tools
Statistics software and packages:
1. Excel and add-ins: EZAnalyze, Analysis ToolPak
2. Prism (supported by our institute)
3. SPSS, Statistica (commercial)
4. SAS (commercial) and R
5. G*Power
Basic statistics books:
1. Intro Stats, SDSU, 2nd edition, De Veaux, Velleman, Bock
2. Choosing and Using Statistics: A Biologist's Guide
3. Introduction to Statistics for Biology
4. Biostatistical Analysis, fifth edition, Jerrold H. Zar
Statistics videos:
1. http://www.microbiologybytes.com/maths/videos
2. http://www.youtube.com: descriptive statistics, basic statistics, install 2007 Excel data analysis add-ins…
Next.....
My presentation will be posted on website: http://bsrweb.burnham.org/
I am located in building 10, Office 2405, ext 3916
Feel free to come or call or send e-mail to ask questions (xyhuang@sanfordburnham.org)
Group email: bsr_help@sanfordburnham.org