Introduction to Biostatistics and Bioinformatics: Experimental Design
![Page 1: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/1.jpg)
Introduction to Biostatistics and Bioinformatics
Experimental Design
![Page 2: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/2.jpg)
Experimental Design
Experimental Design by Christine Ambrosino, www.hawaii.edu/fishlab/Nearside.htm
![Page 3: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/3.jpg)
Experimental Design
Overcoming the threats that chance and bias pose to the validity of conclusions.
![Page 4: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/4.jpg)
Experimental Design
[Figure: a process turns inputs into outputs and is influenced by controllable factors and uncontrollable factors.]
![Page 5: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/5.jpg)
Experimental Design
• Recognition and statement of the problem (e.g. testing a specific hypothesis or open-ended discovery).
• Selecting a response variable.
• Choosing controllable factors and their range.
• Listing uncontrollable factors and estimating their effects.
• Choosing experimental design.
• Performing experiment.
• Statistical analysis of data.
• Designing the next experiment based on the results.
![Page 6: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/6.jpg)
Exploring the Parameter Space
[Figure: one factor at a time. Score plotted against Factor 1, Factor 2, and Factor 3 in turn.]
[Figure: 2-factor factorial design on Factor 1 and Factor 2 axes (4 experiments); 3-factor factorial design (8 experiments).]
k-factor factorial design: k factors require $2^k$ experiments.
For example, 7 factors: 128 experiments, 10 factors: 1,024 experiments
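The growth in the number of runs can be made concrete by enumerating a full two-level factorial design. This is a minimal sketch; the factor names and levels below are hypothetical and are not taken from the slides.

```python
from itertools import product


def full_factorial(factors):
    """Return every combination of levels for the given factors.

    `factors` maps a factor name to a tuple of its levels
    (two levels per factor gives a 2^k design).
    """
    names = list(factors)
    level_sets = [factors[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*level_sets)]


# Hypothetical factors, each at a low and a high level.
design = full_factorial({
    "temperature_C": (20, 37),
    "pH": (6.5, 7.5),
    "incubation_h": (1, 3),
})
print(len(design))  # 8 runs for 3 two-level factors
for run in design:
    print(run)
```

With 7 or 10 two-level factors the same function returns 128 or 1,024 runs, which is why full factorial designs quickly become impractical.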
![Page 7: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/7.jpg)
Randomization
• Statistical methods require that observations are independently distributed random variables. Randomization usually makes this assumption valid.
• Randomization guards against unknown and uncontrolled factors.
• Randomize with respect to analysis order, location, material etc.
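Randomizing the analysis order can be as simple as shuffling the sample list before the run. The sketch below assumes hypothetical sample labels; a fixed seed is used only so that the order can be recorded and reproduced.

```python
import random


def randomized_run_order(samples, seed=None):
    """Return the samples in a random order without modifying the input."""
    rng = random.Random(seed)
    order = list(samples)
    rng.shuffle(order)
    return order


# Hypothetical labels: five control and five treated samples.
samples = [f"control_{i}" for i in range(1, 6)] + [f"treated_{i}" for i in range(1, 6)]
run_order = randomized_run_order(samples, seed=42)
for position, sample in enumerate(run_order, start=1):
    print(position, sample)
```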
[Figure: measurements plotted against order of measurement, with no change in sensitivity during measurement. Not randomized: p = 0.19. Randomized: p = 0.32.]
![Page 8: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/8.jpg)
Randomization
[Figure: measurements plotted against order of measurement, not randomized vs. randomized.]
No change in sensitivity during measurement: p = 0.19 (not randomized), p = 0.32 (randomized).
Change in sensitivity during measurement: $p = 5.7 \times 10^{-6}$ and p = 0.20.
Standard deviations shown: 0.8, 0.8; 0.7, 0.9; 1.8, 1.3.
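Why a change in sensitivity matters can be illustrated with a small noise-free simulation. This is a sketch under stated assumptions: the drift is taken to be linear in run position, both groups have the same true value, and the group labels and drift size are hypothetical.

```python
import random
from statistics import mean


def apparent_difference(run_order, drift_per_run=0.1):
    """Difference in group means caused purely by instrument drift.

    `run_order` is a sequence of group labels ("A" or "B") in measurement
    order. Both groups have a true value of 0, so any difference between
    their means is an artifact of when they were measured.
    """
    readings = {"A": [], "B": []}
    for position, group in enumerate(run_order):
        # Assumed linear drift: the reading shifts with run position.
        readings[group].append(drift_per_run * position)
    return mean(readings["B"]) - mean(readings["A"])


blocked = ["A"] * 10 + ["B"] * 10   # all of A measured first, then all of B
shuffled = blocked[:]
random.Random(0).shuffle(shuffled)  # randomized measurement order

print(apparent_difference(blocked))   # drift shows up as a group difference
print(apparent_difference(shuffled))  # drift is spread over both groups
```

When the groups are measured one after the other, the drift is fully confounded with the group; randomizing the order spreads it across both groups so it adds variance instead of a systematic difference.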
![Page 9: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/9.jpg)
Blocking
Blocking is used to control for known and controllable factors.
Randomized Complete Block Design: minimizes the effect of variability associated with, e.g., location, operator, plant, batch, or time.
The Latin Square Design: minimizes the effect of variability associated with two independent factors.
The rows and columns represent two restrictions on randomization.

| | Instrument 1 | Instrument 2 | Instrument 3 | Instrument 4 |
|---|---|---|---|---|
| Operator 1 | Sample 1 | Sample 2 | Sample 3 | Sample 4 |
| Operator 2 | Sample 2 | Sample 3 | Sample 4 | Sample 1 |
| Operator 3 | Sample 4 | Sample 1 | Sample 2 | Sample 3 |
| Operator 4 | Sample 3 | Sample 4 | Sample 1 | Sample 2 |

| Instrument 1 | Instrument 2 | Instrument 3 | Instrument 4 |
|---|---|---|---|
| Sample 3 | Sample 3 | Sample 2 | Sample 1 |
| Sample 1 | Sample 4 | Sample 1 | Sample 4 |
| Sample 4 | Sample 2 | Sample 3 | Sample 2 |
| Sample 2 | Sample 1 | Sample 4 | Sample 3 |
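The defining property of a Latin square, that each sample appears exactly once in every row and every column, is easy to check in code. The sketch below verifies the operator-by-instrument layout shown above and builds a square by cyclic shifting; the function names are illustrative, and in practice rows, columns, and labels would also be randomized.

```python
def cyclic_latin_square(treatments):
    """Build an n x n Latin square by cyclically shifting the treatment list."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]


def is_latin_square(square):
    """True if every symbol appears exactly once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    if len(symbols) != n:
        return False
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all(set(col) == symbols for col in zip(*square))
    return rows_ok and cols_ok


# Rows are operators, columns are instruments, entries are sample numbers.
operator_by_instrument = [
    [1, 2, 3, 4],
    [2, 3, 4, 1],
    [4, 1, 2, 3],
    [3, 4, 1, 2],
]
print(is_latin_square(operator_by_instrument))  # True
print(cyclic_latin_square(["S1", "S2", "S3", "S4"]))
```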
![Page 10: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/10.jpg)
Replication
Replication is needed to estimate the variance in the measurements.
• Technical replicates (repeat measurements).
• Process replicates
• Biological replicates
![Page 11: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/11.jpg)
Uncertainty in Determining the Mean
[Figure: uncertainty in the mean for four distribution shapes (complex, normal, skewed, long tails) at sample sizes n=3, n=10, and n=100; one panel uses n=10, n=100, and n=1000.]
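How the uncertainty in the mean shrinks with the number of replicates can be sketched with a short simulation. This example assumes a standard normal distribution only; the other distribution shapes are not reproduced, and the repeat count and seed are arbitrary choices.

```python
import random
from statistics import mean, stdev


def spread_of_sample_mean(n, repeats=2000, seed=1):
    """Standard deviation of the sample mean over many simulated experiments.

    Each simulated experiment draws n values from a normal distribution
    with mean 0 and standard deviation 1.
    """
    rng = random.Random(seed)
    sample_means = [
        mean([rng.gauss(0.0, 1.0) for _ in range(n)]) for _ in range(repeats)
    ]
    return stdev(sample_means)


for n in (3, 10, 100):
    # Theory for independent draws: spread is about 1 / sqrt(n).
    print(n, round(spread_of_sample_mean(n), 3), round(n ** -0.5, 3))
```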
![Page 12: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/12.jpg)
| Before/After Treatment | Gradient Length | Date | Laboratory | Patient |
|---|---|---|---|---|
| Before | 3h | 2010/07/02 13:08 | 1 | 6 |
| Before | 3h | 2010/07/02 19:15 | 1 | 11 |
| Before | 3h | 2010/07/04 18:19 | 1 | 4 |
| Before | 3h | 2010/07/05 00:26 | 1 | 10 |
| Before | 3h | 2010/07/11 05:29 | 1 | 16 |
| Before | 3h | 2010/07/11 08:33 | 1 | 17 |
| Before | 3h | 2010/07/11 14:39 | 1 | 19 |
| Before | 3h | 2010/07/11 20:46 | 1 | 29 |
| Before | 3h | 2010/07/19 00:12 | 1 | 20 |
| Before | 3h | 2010/07/19 09:22 | 1 | 53 |
| Before | 3h | 2010/07/19 12:26 | 1 | 58 |
| Before | 3h | 2010/07/19 15:29 | 1 | 61 |
| Before | 3h | 2010/07/25 09:17 | 1 | 35 |
| Before | 3h | 2010/07/25 12:20 | 1 | 39 |
| After | 1h | 2011/02/20 10:49 | 1 | 4 |
| After | 1h | 2011/02/20 13:57 | 1 | 6 |
| After | 1h | 2011/02/20 17:05 | 1 | 11 |
| After | 1h | 2011/03/04 14:07 | 2 | 15 |
| After | 1h | 2011/03/04 15:47 | 2 | 16 |
| After | 1h | 2011/03/04 17:06 | 2 | 17 |
| After | 1h | 2011/03/04 18:25 | 2 | 19 |
| After | 1h | 2011/03/04 19:44 | 2 | 20 |
| After | 1h | 2011/03/04 21:03 | 2 | 29 |
| After | 1h | 2011/03/05 02:19 | 2 | 35 |
| After | 1h | 2011/03/05 03:39 | 2 | 39 |
| After | 1h | 2011/03/05 04:57 | 2 | 53 |
| After | 1h | 2011/03/07 00:35 | 2 | 65 |
| After | 1h | 2011/03/07 02:51 | 2 | 58 |
| Before | 3h | 2011/04/16 20:43 | 1 | 11 |
| After | 3h | 2011/04/21 04:54 | 1 | 10 |
| After | 3h | 2011/04/21 11:00 | 1 | 15 |
| After | 1h | 2011/04/22 08:20 | 1 | 17 |
| After | 1h | 2011/04/23 09:03 | 1 | 65 |
| Before | 3h | 2011/04/23 21:20 | 1 | 20 |

An example of bad experimental design
![Page 13: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/13.jpg)
Protein Identification and Quantitation by Mass Spectrometry
[Figure: workflow from samples to peptides to mass spectrometry; the spectrum (intensity vs. m/z) provides identity and quantity.]
![Page 14: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/14.jpg)
A proteomics example: no replicates vs. three replicates
[Figure: two panels, no replicates and three replicates, with axes Log 2 Sum Spectrum Count and Log 2 Spectrum Count Ratio.]
![Page 15: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/15.jpg)
Analytical Measurements: Precision and Accuracy
[Figure: two panels of Measured Concentration plotted against Theoretical Concentration.]
![Page 16: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/16.jpg)
Testing multiple hypotheses
• Is the concentration of calcium/calmodulin-dependent protein kinase type II different between the two samples?
• Which protein concentrations are different between the two samples?
$p = 2 \times 10^{-6}$
The p-value needs to be corrected to account for the fact that we perform many tests.
Bonferroni correction: multiply the p-value by the number of tests performed (n): $p_{\mathrm{corr}} = p_{\mathrm{uncorr}} \times n$
In this case, 3685 proteins are identified, so the Bonferroni-corrected p-value for calcium/calmodulin-dependent protein kinase type II is $p_{\mathrm{corr}} = 2 \times 10^{-6} \times 3685 = 0.007$
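The Bonferroni arithmetic above can be reproduced in a few lines. This is a minimal sketch; capping the corrected value at 1 is a common convention that the slides do not state, and the function name is illustrative.

```python
def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-corrected p-value: p times the number of tests, capped at 1."""
    return min(1.0, p_value * n_tests)


# Values from the example: p = 2e-6 for one protein, 3685 proteins tested.
p_corr = bonferroni_adjust(2e-6, 3685)
print(round(p_corr, 3))  # 0.007
```

A protein with an uncorrected p-value of 0.01 in the same experiment would be corrected to 1.0, so it would no longer count as significant.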
![Page 17: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/17.jpg)
Testing multiple hypotheses
The p-value distribution is uniform when testing differences between samples from the same distribution.
Normal distribution, sample size = 10
[Figure: histograms of the number of tests vs. p-value (0 to 1) for 100 tests, 1,000 tests, and 10,000 tests.]
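The uniform p-value distribution can be checked by simulation. The slides do not say which test produced the histograms; this sketch assumes a two-sided z-test with known standard deviation so that only the Python standard library is needed.

```python
import random
from statistics import NormalDist, mean


def two_sample_z_pvalue(a, b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming known sigma."""
    standard_error = sigma * (1.0 / len(a) + 1.0 / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def null_p_values(n_tests, sample_size=10, seed=0):
    """p-values from tests where both samples come from the same distribution."""
    rng = random.Random(seed)
    p_values = []
    for _ in range(n_tests):
        a = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        b = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
        p_values.append(two_sample_z_pvalue(a, b))
    return p_values


def histogram(p_values, bins=10):
    """Count p-values in equal-width bins between 0 and 1."""
    counts = [0] * bins
    for p in p_values:
        counts[min(int(p * bins), bins - 1)] += 1
    return counts


for n_tests in (100, 1000, 10000):
    print(n_tests, histogram(null_p_values(n_tests)))
```

With more tests the bin counts become relatively more even, and roughly 5% of the p-values fall below 0.05 even though no real difference exists.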
![Page 18: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/18.jpg)
Testing multiple hypotheses
The p-value distribution is uniform when testing differences between samples from the same distribution.
Normal distribution, sample size = 10
30 tests from a distribution with a different mean ($\mu_1 - \mu_2 \gg \sigma$)
[Figure: histograms of the number of tests vs. p-value (0 to 1) for 100 tests, 1,000 tests, and 10,000 tests.]
![Page 19: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/19.jpg)
Testing multiple hypotheses
Controlling for False Discovery Rate (FDR)
Normal distribution, sample size = 10
30 tests from a distribution with a different mean ($\mu_1 - \mu_2 \gg \sigma$)
[Figure: False Discovery Rate (0 to 1) plotted against p-value (0 to 1) for 100 tests, 1,000 tests, and 10,000 tests.]
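The slides do not name a specific procedure for controlling the FDR. One widely used option is the Benjamini-Hochberg step-up procedure, sketched here as an assumption about what might be applied; the p-values in the example are made up for illustration.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the indices of the hypotheses rejected at the given FDR level.

    Step-up rule: sort the p-values, find the largest rank k with
    p_(k) <= fdr * k / m, and reject the k smallest p-values.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, index in enumerate(order, start=1):
        if p_values[index] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


# Made-up p-values for ten tests.
example = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
print(benjamini_hochberg(example, fdr=0.05))  # [0, 1]
```

For comparison, a Bonferroni correction on the same ten p-values would reject only the first one, since 0.008 x 10 = 0.08 exceeds 0.05.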
![Page 20: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/20.jpg)
Testing multiple hypotheses
False Discovery Rate (FDR) and False Negative Rate (FNR)
Normal distribution, sample size = 10
100 tests, 30 of them from a distribution with a different mean
[Figure: False Discovery Rate and False Negative Rate (0 to 1) plotted against p-value (0 to 1) for $\mu_1 - \mu_2 = \sigma/2$, $\mu_1 - \mu_2 = \sigma$, and $\mu_1 - \mu_2 = 2\sigma$.]
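When the truth is known, as in a simulation, both rates can be computed directly at any p-value threshold. The slides do not define the rates, so this sketch assumes FDR = false positives / all positive calls and FNR = missed true effects / all true effects; the example values are made up.

```python
def fdr_and_fnr(p_values, is_true_effect, threshold):
    """Return (FDR, FNR) for calls made at p <= threshold.

    Assumed definitions: FDR = FP / (FP + TP), FNR = FN / (FN + TP).
    """
    tp = fp = fn = 0
    for p, real in zip(p_values, is_true_effect):
        called = p <= threshold
        if called and real:
            tp += 1
        elif called and not real:
            fp += 1
        elif real:
            fn += 1
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    fnr = fn / (tp + fn) if (tp + fn) else 0.0
    return fdr, fnr


# Made-up example: three real effects followed by three null tests.
p_values = [0.001, 0.01, 0.2, 0.03, 0.5, 0.04]
truth = [True, True, True, False, False, False]
for threshold in (0.005, 0.05, 0.3):
    print(threshold, fdr_and_fnr(p_values, truth, threshold))
```

Raising the threshold lowers the FNR but admits more false discoveries, which is the trade-off the two curves describe.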
![Page 21: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/21.jpg)
Sampling – Gaussian Peak
[Figure: Gaussian peak, Intensity plotted against Retention Time.]
![Page 22: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/22.jpg)
Sampling – Gaussian Peak
[Figure: Thresholds (90%), on a scale from 0.5 to 1.1, plotted against # of points, from 1 to 10.]
![Page 23: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/23.jpg)
Definition of a molecular signature
FDA calls them “in vitro diagnostic multivariate index assays”
A molecular signature is a computational or mathematical model that links high-dimensional molecular information to a phenotype or another response variable of interest.
![Page 24: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/24.jpg)
Uses of molecular signatures
1. Models of disease phenotype/clinical outcome
• Diagnosis
• Prognosis, long-term disease management
• Personalized treatment (drug selection, titration)
2. Biomarkers for diagnosis, or outcome prediction
• Make the above tasks resource efficient, and easy to use in clinical practice
3. Discovery of structure & mechanisms (regulatory/interaction networks, pathways, sub-types)
• Leads for potential new drug candidates
![Page 25: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/25.jpg)
Oncotype DX Breast Cancer Assay
• Developed by Genomic Health (www.genomichealth.com)
• 21-gene signature to predict whether a woman with localized, ER+ breast cancer is at risk of relapse
• Independently validated in thousands of patients
• More than 100,000 tests performed so far
• Price of the test is $4,175
• Not FDA approved but covered by most insurers, including Medicare
• Its sales in 2010 reached $170M and, with compound annual growth, are projected to hit $300M by 2015.
![Page 26: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/26.jpg)
EF Petricoin III, AM Ardekani, BA Hitt, PJ Levine, VA Fusaro, SM Steinberg, GB Mills, C Simone, DA Fishman, EC Kohn, LA Liotta, "Use of proteomic patterns in serum to identify ovarian cancer", Lancet 359 (2002) 572–77
![Page 27: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/27.jpg)
Check E., Proteomics and cancer: running before we can walk? Nature. 2004 Jun 3;429(6991):496-7.
![Page 28: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/28.jpg)
Example: OvaCheck
• Developed by Correlogic (www.correlogic.com)
• Blood test for the early detection of epithelial ovarian cancer
• Failed to obtain FDA approval
• Looks for subtle changes in patterns among the tens of thousands of proteins, protein fragments and metabolites in the blood
• Signature developed by genetic algorithm
• Significant artifacts in data collection & analysis questioned validity of the signature:
- Results are not reproducible
- Data collected differently for different groups of patients
http://www.nature.com/nature/journal/v429/n6991/full/429496a.html
![Page 29: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/29.jpg)
Main ingredients for developing a molecular signature
![Page 30: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/30.jpg)
Base-Line Characteristics
DF Ransohoff, "Bias as a threat to the validity of cancer molecular-marker research", Nat Rev Cancer 5 (2005) 142-9.
![Page 31: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/31.jpg)
How to Address Bias
DF Ransohoff, "Bias as a threat to the validity of cancer molecular-marker research", Nat Rev Cancer 5 (2005) 142-9.
![Page 32: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/32.jpg)
Experimental Design - Summary
• Chance and bias are threats to the conclusions drawn from experiments
• Controllable and uncontrollable factors
• Randomization to guard against unknown and uncontrolled factors
• Replication (technical, process, and biological replicates) is used to estimate error in measurement and yields a more precise estimate.
• Blocking to control for known and controllable factors
• Multiple testing
• Molecular markers
![Page 33: Introduction to Biostatistics and Bioinformatics Experimental Design](https://reader035.vdocument.in/reader035/viewer/2022062222/5697bfe01a28abf838cb354f/html5/thumbnails/33.jpg)
Experimental Design - Summary
• Use your domain knowledge: using a designed experiment is not a substitute for thinking about the problem.
• Keep the design and analysis as simple as possible.
• Recognize the difference between practical and statistical significance.
• Design iterative experiments.