Functional Genomics
MED 263: Bioinformatics Applications to Human Disease
Jason Young | Email: [email protected] | MED 263 | Winter 2015
What You Will Learn Today...
• Functional genomic methods for gene expression analysis
• Typical workflow for a gene expression study
• Aspects of microarray data analysis
• Kicic et al. (2010): Example of a differential expression microarray study
The Central Dogma of Biology
Genomics
Transcriptomics
Proteomics
Functional Genomics
“Fishing Expeditions”
Functional genomic studies are often labeled “descriptive” rather than “hypothesis-driven” research.
“Without speculation there is no good and original observation” - Charles Darwin
Use functional genomics data to generate hypotheses that can then be tested with further experimentation.
History of Transcript Analysis
Pre-Functional Genomics: One transcript at a time
- Northern blotting (1977)
- Reverse Transcriptase PCR (RT-PCR)
- RNase protection
* Highly quantitative; still essential for validating functional genomic gene expression results
Functional Genomics: Many transcripts at a time
- cDNA libraries (early 1990s)
- Serial Analysis of Gene Expression (SAGE) (mid 1990s)
- DNA Microarrays (late 1990s)
- RNA-Seq (late 2000s)
* Less quantitative, but provide a rapid, broad overview of genome-wide transcript abundance
cDNA Libraries
cDNA Library Construction
1. Isolate mRNA from organism, cell type, developmental stage, or physiological condition
2. Reverse transcribe to cDNA
3. Clone into a vector for propagation in bacteria
4. Sequence cDNA inserts to produce Expressed Sequence Tags (ESTs)
5. ESTs represent a sampling of the expression repertoire of the original samples
Shotgun Single-Pass Approach: Adams et al. 1991, 1993
Caveats of cDNA libraries and ESTs
1. Time consuming and laborious
2. Depth of sequencing of the library determines how well rare transcripts are represented (counting)
3. Incomplete transcripts often present (5’ end missing)
4. Clones can be used to express protein products in addition to measuring ESTs
Serial Analysis of Gene Expression (SAGE)
SAGE Library Construction
1. Isolate mRNA from organism, cell type, developmental stage, or physiological condition
2. Reverse transcribe to cDNA (with biotin tag)
3. Cleave with an anchoring enzyme (AE) and attach to beads
4. Divide into two pools and ligate distinct linkers (A & B)
5. Cleave using a blunt-end tagging enzyme (TE)
6. Perform blunt ligation to generate ditags
7. Concatenate, clone, and sequence
Velculescu et al. 1995
Caveats of SAGE libraries
1. Still time consuming and laborious (requires specialized skill), but shorter tags make SAGE more cost efficient than EST libraries
2. Only detects the 3’ end of transcripts and relies on the presence of an appropriately spaced AE site
3. Relies on counting like ESTs, although counts for genes expressed at low levels are difficult to reproduce
4. Like cDNA libraries, no knowledge of the genome sequence is needed to obtain tags (not true of microarrays)
DNA Microarrays
Sequences need to be known for probe design
Relies on hybridization rather than sequence counting
Fast: Can obtain genome-wide transcript levels in days
Comprehensive: Entire transcriptomes can be represented on one array
Flexible: Probes against any gene can be represented on a chip.
Affordable: Technology is >10 years old.
Types of DNA Microarrays
Spotted Array
• Generally 60-80 nucleotides
• Spotted mechanically
• Generally <10k features
• + Flexibility
• - Low density, reproducibility
• Dual color (intra-array)
In Situ Synthesized
• 25-80 nucleotides
• Generated using photolithography
• >1 million features, static
• + High density, reproducibility
• - Flexibility
• Single or dual color (inter- or intra-array)
Affymetrix / Nimblegen (Roche) / Agilent / Illumina
Dual-color dyes: Cy5 (Red), Cy3 (Green)
Affymetrix GeneChips
• Traditionally have dominated the market
• 11-20 distinct 25-nt probes measure expression of each gene
• Attempted to account for non-specific hybridization to perfect match (PM) probes using mismatch (MM) probes
Nimblegen (Roche) - Madison, WI
Illumina
23andMe
But Wait!?! Who uses microarrays anymore?!?
Microarrays vs NGS
Q1: Do you know what transcripts you’re looking for?
Yes: Microarrays / No: NGS
Q2: Do you have a lot of money to spend on experiments?
No: Microarrays / Yes: NGS
Q3: Do you want to rely on the most well-tested and developed methods?
(Supporting slides from: David M. Rocke, UC-Davis)
Yes: Microarrays / No: NGS
Other Microarray Applications
• Genotyping arrays
• Methylation arrays
• Target enrichment (pre-sequencing)
• Rapid pathogen detection in-the-field (Influenza sub-typing)
• Protein arrays (parallelized ELISA)
• Antibody arrays
• High-throughput standardized testing (drug development) ($$$)
Microarray Gene Expression Workflow
1. Experimental Design
2. RNA Isolation and Labeling
3. Hybridization
4. Preprocessing
5. Data Analysis
6. Biological Confirmation
Experimental Design
Define Biological Question and Samples Needed
• Tissue comparison. Ex. Regions of the brain
• Time course. Ex. Pathogen life cycle
• +/- Treatment. Ex. Drug treatment
Determine Appropriate Array Platform and Labeling Procedures
• Are arrays commercially available for your purpose?
• 1 or 2 color labeling needed? (2-color requires reverse labeling)
• Amount of material needed? (1 to 5 µg total RNA/sample)
• Make sure probes are randomized on an array
Plan entire workflow ahead of time to maximize experimental control
• Prepare a well-defined sample preparation procedure!
• Do all steps for samples in parallel if possible, from RNA isolation, to labeling, to hybridization, to scanning (same person and machine too).
RNA Isolation and Labeling
Isolate RNA
• Total RNA with Trizol
• Further isolation of mRNA if needed
• Assess quality of RNA
Agilent 2100 Bioanalyzer
• Calculates RNA Integrity Number (RIN)
• Examines the entire electrophoretic trace of the RNA sample, including the presence/absence of degradation products
RNA and Probe Preparation
Direct Labeling
Indirect Labeling
• Improved efficiency of nucleotide incorporation
Affymetrix Protocol: Indirect labeling w/ Amplification (1 color)
1. Reverse Transcription
2. In Vitro Transcription to produce cRNA (signal amplification)
Hybridization
Affymetrix Protocol
1. Pre-Hyb (10 min)
2. Hyb (16 hr)
3. Streptavidin-Phycoerythrin (SAPE)
4. anti-SA Ab-biotin (more signal amplification!)
5. SAPE
6. Scan
~3 days from RNA isolation to scan
RNA and Probe Preparation
Why all the signal amplification?
1-5 µg total RNA
Preprocessing
Goal: To remove the systematic bias in the data as completely as possible while preserving the variation in gene expression that occurs because of biologically relevant changes in transcription
Steps:
• Quantitation - Convert image into a series of numbers (image analysis) (.CEL files).
• Data import - Data must be collated from different formats housed in different files/databases.
• Quality assessment - Detects divergent measurements beyond the level of random fluctuations.
• Background adjustment - Adjustment of observed expression levels to account for non-specific hybridization (noise).
• Normalization - Allows arrays to be compared to one another by controlling for different efficiencies of reverse transcription, labeling, or hybridization reactions, physical problems with the arrays, reagent batch effects, and different laboratory conditions.
• Summarization - Combines multiple probe intensities for a particular gene to produce a single expression value for that gene.
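To make the summarization step concrete, here is a minimal sketch that collapses several probe intensities per gene into one value by taking the median of the log2 intensities. The gene names and intensities are invented, and the median is only one simple choice assumed here for illustration; it is not claimed to be the method of any particular platform or package.

```python
import math
from statistics import median

def summarize_probes(probe_intensities):
    """Collapse per-probe intensities into one log2 expression value per gene."""
    summary = {}
    for gene, intensities in probe_intensities.items():
        # Work on the log2 scale, then take the median across the gene's probes
        log_values = [math.log2(x) for x in intensities]
        summary[gene] = median(log_values)
    return summary

# Hypothetical probe intensities (already background adjusted and normalized)
probes = {
    "geneA": [256.0, 512.0, 1024.0],
    "geneB": [64.0, 64.0, 128.0, 32.0],
}
expression = summarize_probes(probes)
print(expression)
```

The median is used in this sketch because a single misbehaving probe then has little influence on the gene-level value.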
Quality Assessment
• First thing to do: obtain overview of array signal
• Box plots and histograms
• Identify outlier arrays by examining probe intensities across all arrays at once.
• Array “f” appears to stand out in box plot (Note: normalization can often correct this difference)
• Array “a” appears to have a bimodal distribution in the histogram, which usually indicates a spatial artifact, i.e. a large section of the array has abnormally high values.
[Figure: box plots of log intensity per array; histograms of log intensity]
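The box-plot check above can be approximated numerically. The sketch below compares each array's median log2 intensity with the median across all arrays and flags arrays that sit far away. The intensities are invented, and the threshold of 1.0 log2 units is an arbitrary assumption rather than a recommended cutoff.

```python
import math
from statistics import median

def flag_outlier_arrays(arrays, threshold=1.0):
    """Flag arrays whose median log2 intensity is far from the overall median."""
    array_medians = {
        name: median(math.log2(x) for x in values)
        for name, values in arrays.items()
    }
    overall = median(array_medians.values())
    flagged = [
        name for name, m in array_medians.items()
        if abs(m - overall) > threshold
    ]
    return array_medians, flagged

# Hypothetical raw probe intensities; array "f" is made deliberately brighter
arrays = {
    "a": [128.0, 256.0, 512.0],
    "b": [128.0, 256.0, 512.0],
    "c": [64.0, 256.0, 1024.0],
    "f": [1024.0, 2048.0, 4096.0],
}
array_medians, flagged = flag_outlier_arrays(arrays)
print(array_medians)
print(flagged)
```

A summary like this only catches overall shifts in brightness. A bimodal histogram or a spatial artifact still needs the plots or the raw image.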
Raw Image Inspection
Crop circles
Ring of fire
Full moon
Tricolor
Thumb print
Arcs
http://plmimagegallery.bmbolstad.com/
Background Adjustment
• Background noise distribution calculated using negative controls or empty spots
• Subtract background noise from raw probe intensities
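A minimal sketch of this subtraction, assuming the background level is estimated as the mean of the negative-control intensities. The values are invented, and the floor of 1.0 (so that later log transformation stays defined) is an added assumption, not something stated on the slide.

```python
from statistics import mean

def background_adjust(raw_intensities, negative_controls, floor=1.0):
    """Subtract an estimated background level from raw probe intensities."""
    background = mean(negative_controls)
    # Clamp at a small positive floor so adjusted values can still be logged
    return [max(x - background, floor) for x in raw_intensities]

raw = [150.0, 480.0, 95.0, 2300.0]   # hypothetical raw probe intensities
controls = [90.0, 110.0, 100.0]      # hypothetical negative-control spots
adjusted = background_adjust(raw, controls)
print(adjusted)  # [50.0, 380.0, 1.0, 2200.0]
```

The third probe falls below the estimated background and is clamped to the floor instead of becoming negative.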
Why do we need normalization?
• Some arrays are brighter than others.
• Not due to the biological data but to unavoidable experimental differences.
• Goal: Normalization aims to correct this kind of difference without altering the biological data so that cross-array analyses can be conducted (differential expression, etc.).
[Figure panels: Before Normalization, After Normalization]
Scatter Plots
Simple to compare inter-array expression, no normalization
Genes on the 45-degree line are expressed the same in both
1 - Higher expressed genes in Control
2 - Higher expressed genes in Downs
3 - Low expression genes in both
4 - High expression genes in both
Why Log Transformation?
• Experimentalists using microarrays are very often interested in fold change
• Log scale provides symmetry in expression ratios
• Example in raw ratio space: 2-fold up-regulation = 2, but 2-fold down-regulation = 0.5
• Without transformation, all down-regulated fold changes are compressed between 0 and 1

             t1    t2    t3
Raw Ratio    1     2     0.5
Log2 Ratio   0     1     -1
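The table can be reproduced in a few lines. The labels t1 to t3 come from the slide; the rest is a minimal illustration.

```python
import math

# Raw expression ratios from the slide's example
raw_ratios = {"t1": 1.0, "t2": 2.0, "t3": 0.5}

# On the log2 scale, 2-fold up and 2-fold down are symmetric around 0
log2_ratios = {label: math.log2(r) for label, r in raw_ratios.items()}
print(log2_ratios)  # {'t1': 0.0, 't2': 1.0, 't3': -1.0}
```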
MA Plots
MA plots are used to determine whether data need normalization and to assess whether the normalization worked (sideways scatter plot).
M = log fold change for gene x
A = average log intensity for gene x
• A local regression (LOESS) curve can be fitted to the scatter plot to summarize non-linear data.
• A LOESS curve that oscillates and/or has variability of M values greater than other arrays indicates an issue.
• Instead of 1-to-1 comparisons, each array can also be compared to a “synthetic” array calculated by taking probe-wise medians
[Figure: MA plots of Array1/Array2, before and after normalization]
Jason Young | Email: [email protected] | MED 263 | Winter 2015
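A minimal sketch of how the M and A values on this slide could be computed, including the probe-wise median “synthetic” array. The log base (2), the function names, and the intensities are assumptions for illustration, not taken from the lecture.

```python
import math
from statistics import median

def ma_values(array1, array2):
    """Return (M, A) lists for two arrays of positive probe intensities."""
    m_vals, a_vals = [], []
    for x1, x2 in zip(array1, array2):
        l1, l2 = math.log2(x1), math.log2(x2)  # log2 is an assumed convention
        m_vals.append(l1 - l2)                 # M: log fold change
        a_vals.append((l1 + l2) / 2)           # A: average log intensity
    return m_vals, a_vals

def synthetic_reference(arrays):
    """Probe-wise median across arrays, used as the 'synthetic' array."""
    return [median(probe) for probe in zip(*arrays)]

# Hypothetical intensities for four probes on three arrays.
arrays = [
    [100.0, 400.0, 1600.0, 6400.0],
    [200.0, 800.0, 3200.0, 12800.0],
    [100.0, 800.0, 1600.0, 6400.0],
]
reference = synthetic_reference(arrays)
for number, arr in enumerate(arrays, start=1):
    m, a = ma_values(arr, reference)
    print(f"array {number}: M = {[round(v, 2) for v in m]}")
```

An array whose M values sit consistently away from 0 (array 2 in this toy data) is the kind of pattern an MA plot is meant to expose.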
Normalization Strategies
Simplest idea:
• Calculate median expression from all arrays
• Do global normalization by multiplying all probes on an array by that array's normalization constant
Global Normalization
                                Array 1    Array 2
Median Expression               5,000      10,000
Normalization Factor            2          1
Normalized Median Expression    10,000     10,000
However...
Often there is a non-linear dependence on intensity
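The global scaling idea can be sketched as below. Choosing the largest array median as the common target is an assumption made to match the table (factor 2 for Array 1, factor 1 for Array 2); the function name and the probe intensities are hypothetical.

```python
from statistics import median

def global_scale(arrays, target=None):
    """Scale each array by one constant so that all medians match `target`."""
    medians = [median(a) for a in arrays]
    if target is None:
        target = max(medians)  # assumed choice of common target
    factors = [target / m for m in medians]
    scaled = [[x * f for x in a] for a, f in zip(arrays, factors)]
    return factors, scaled

# Hypothetical probe intensities with medians of 5,000 and 10,000.
array1 = [2500.0, 5000.0, 7500.0]
array2 = [4000.0, 10000.0, 30000.0]
factors, scaled = global_scale([array1, array2])
print(factors)                     # normalization factor per array
print([median(a) for a in scaled]) # medians after scaling
```

Because every probe on an array is multiplied by the same constant, this cannot correct an intensity-dependent (non-linear) bias.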
Normalization Strategies
Global Normalization - NGS
Before:
            Gene 1    Gene 2    Gene 3    Gene 4    Total Reads
Sample 1    10,000    100       150       200       10,450
Sample 2    20,000    10        150       200       20,360
After:
            Gene 1    Gene 2    Gene 3    Gene 4    Total Reads
Sample 1    14,742    147       221       294       15,405
Sample 2    15,133    8         113       151       15,405
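The “After” table is consistent with scaling each sample so that its total reads equal the mean total across samples (15,405). A small sketch under that assumption; the function name is hypothetical.

```python
def scale_to_mean_depth(counts):
    """Scale each sample so its total equals the mean total across samples."""
    totals = [sum(sample) for sample in counts]
    target = sum(totals) / len(totals)
    return [[c * target / t for c in sample]
            for sample, t in zip(counts, totals)]

counts = [
    [10000, 100, 150, 200],  # Sample 1 (total 10,450)
    [20000, 10, 150, 200],   # Sample 2 (total 20,360)
]
normalized = scale_to_mean_depth(counts)
for sample in normalized:
    print([round(c) for c in sample], round(sum(sample)))
```

Genes 3 and 4 have identical counts in both samples before scaling but differ afterwards, because Gene 1 dominates the totals.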
Normalization Strategies
Parametric methods:
Force distributions (not just medians) to be the same:
• Amaratunga and Cabrera (2001)
• Bolstad et al. (2003)
Use curve estimators such as splines to adjust for the effect:
• Li and Wong (2001)
• Colantuoni et al. (2002)
• Dudoit et al. (2002)
Adjustments based on additive/multiplicative model:
• Rocke and Durbin (2003)
• Huber et al. (2002)
• Cui et al. (2003)
Quantile Normalization (non-parametric):
• Bolstad et al. (2003)
• Every probe value on any one chip is mapped to the corresponding quantile of the standard distribution; hence quantile normalization
• The average of all available arrays can be used to form an average empirical distribution
• Simple and effective!
[Figure: quantile normalization across arrays, shown in steps I-VI]
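A minimal sketch of quantile normalization as described in the bullets above: sort each array, average across arrays at each rank to form the reference distribution, then give every probe the reference value for its rank. The function name and data are hypothetical, and ties are broken by probe order here, which is a simplification.

```python
def quantile_normalize(arrays):
    """Map each probe to the mean value observed at its rank across arrays."""
    n_probes = len(arrays[0])
    sorted_arrays = [sorted(a) for a in arrays]
    # Reference distribution: average of the k-th smallest value of each array.
    reference = [sum(s[k] for s in sorted_arrays) / len(arrays)
                 for k in range(n_probes)]
    normalized = []
    for a in arrays:
        order = sorted(range(n_probes), key=lambda i: a[i])  # probe indices by rank
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical intensities: three arrays, four probes each.
arrays = [
    [5, 2, 3, 4],
    [4, 1, 6, 2],
    [3, 4, 6, 8],
]
for row in quantile_normalize(arrays):
    print([round(v, 2) for v in row])
```

After this step every array has exactly the same set of values; only the assignment of values to probes differs.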
Normalization Strategies
[Figures: Before Normalization; After Normalization]
MAS5.0, RMA, GCRMA
MAS 5.0 (Microarray Suite - Affymetrix):
• Adjusts for background noise by subtracting the mismatch (MM) signal from the perfect match (PM) signal, but this is an over-adjustment.
• MM probes also detect specific signal, such that a third of all MM probes are brighter than their PM counterpart (MM signal = specific + non-specific binding).
RMA (Robust Multiarray Averaging):
• Increases precision but sacrifices some accuracy by using a background adjustment step that corrects PM probe intensities chip by chip but ignores MM intensities.
• Also uses quantile normalization.
GCRMA (GeneChip Robust Multiarray Averaging):
• Similar to RMA, but corrects background using the sequence data of probes to account for non-specific binding (NSB).
• MM probes are not ignored; improved precision and accuracy.
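To see why plain PM - MM subtraction over-adjusts, consider the toy probe pairs below. This is only an illustration of the subtraction step, not the MAS 5.0 algorithm itself; the intensities and function names are hypothetical.

```python
def pm_minus_mm(pm, mm):
    """Naive background adjustment: subtract MM from PM, probe pair by pair."""
    return [p - m for p, m in zip(pm, mm)]

def fraction_mm_brighter(pm, mm):
    """Fraction of probe pairs where the MM probe is brighter than the PM probe."""
    return sum(m > p for p, m in zip(pm, mm)) / len(pm)

# Hypothetical intensities for six probe pairs.
pm = [1200, 300, 850, 150, 2000, 400]
mm = [400, 350, 200, 180, 600, 100]

print(pm_minus_mm(pm, mm))
print(fraction_mm_brighter(pm, mm))
```

Pairs where MM exceeds PM give negative adjusted values, which cannot be log-transformed.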
Microarray Gene Expression Workflow
1. Experimental Design
2. RNA Isolation and Labeling
3. Hybridization
4. Preprocessing
5. Data Analysis
6. Biological Confirmation
How to Identify Differential Expression
1. Calculate expression ratio and rank order
Problems:
• What threshold? Background subtraction?
• Ex. Two-fold change threshold with a background of 50: the raw ratio 150/100 = 1.5 is not significant (NS), but the background-subtracted ratio 100/50 = 2 is significant (S)!
2. Percentage
Problems:
• What threshold? What is significant? There is always a top 5%.
3. Statistical Tests (t-test, ANOVA)
Null hypothesis: there is no difference in a gene’s expression between groups.
Example: 7 treated, 7 untreated cells
A p-value below a threshold x means that, if the null hypothesis were true, a difference at least this large would arise by chance less than x of the time (assumes a normal distribution).
Multiple Testing Problem:
• 1 gene, p = 0.05 means a 5% chance of seeing the difference by chance alone. (OK)
• 100 genes, p = 0.05 means about 5 would be false positives by chance alone. (!!!!)
Example: Assuming the 100 tests are statistically independent, the probability of obtaining at least one significant result by chance is 1 - 0.95^100 = 0.994
Solution:
• Filter out non-expressed genes to limit the number of tests
• Use a correction for multiple tests (False Discovery Rate-adjusted p-value)
• p/# tests (Bonferroni correction) - too stringent!
FDR = # false positives / # called significant
Benjamini-Hochberg procedure (1995) - produces an adjusted p-value
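The multiple-testing arithmetic above can be sketched as follows: the chance of at least one false positive across independent tests, the p/# tests correction (expressed here as adjusted p-values), and Benjamini-Hochberg adjusted p-values. Function names and the example p-values are hypothetical.

```python
def family_wise_error(alpha, n_tests):
    """P(at least one false positive) across independent tests at level alpha."""
    return 1 - (1 - alpha) ** n_tests

def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in input order."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * n / rank)
        adjusted[i] = running_min
    return adjusted

print(round(family_wise_error(0.05, 100), 3))

p_values = [0.001, 0.008, 0.039, 0.041, 0.2]  # hypothetical raw p-values
print([round(p, 5) for p in bonferroni(p_values)])
print([round(p, 5) for p in benjamini_hochberg(p_values)])
```

In this toy example the Bonferroni-adjusted values are never smaller than the Benjamini-Hochberg ones, which is the sense in which p/# tests is the more stringent correction.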
Clustering
Which genes are associated with each other or a particular state/condition?
Unsupervised (no prior knowledge used)
• Hierarchical (Trees)
• Non-hierarchical (K-means)
• Cluster 3.0: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm#ctv
1. Filter out genes (not expressed and/or stable)
2. Calculate distance between samples based on gene expression
• Euclidean
• Pearson
3. Cluster samples based on these distances
• Single (Maximum similarity)
• Complete (Minimum similarity)
• Average (Centroid) (Average similarity)
Supervised (prior knowledge used)
Many methods available (machine learning, etc.)...
Ex. Ontology-based Pattern Identification (OPI)
[Figure: Centroid Clustering]
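Steps 2 and 3 can be sketched as a small agglomerative clustering routine. This is not how Cluster 3.0 is implemented; it is a plain-Python illustration with hypothetical names and data. The "average" option here is the mean pairwise distance between clusters, which is one reading of the slide's Average (Centroid) entry.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation, so perfectly correlated profiles have distance 0."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1 - cov / (sx * sy)

LINKAGE = {
    "single": min,                             # closest pair of members
    "complete": max,                           # farthest pair of members
    "average": lambda d: sum(d) / len(d),      # mean pairwise distance
}

def hierarchical_cluster(profiles, n_clusters, distance=euclidean, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    combine = LINKAGE[linkage]
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = combine([distance(profiles[i], profiles[j])
                             for i in clusters[a] for j in clusters[b]])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical expression of three genes in four samples.
samples = [
    [1.0, 2.0, 3.0],
    [1.1, 2.1, 2.9],
    [8.0, 7.5, 9.0],
    [8.2, 7.4, 9.1],
]
print(hierarchical_cluster(samples, 2, linkage="average"))
```

Recording the order of merges, rather than stopping at a fixed number of clusters, is what produces the tree (dendrogram) of a hierarchical method.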
Biological Confirmation
Microarray gene expression must be confirmed using other experimental techniques.
• Northern Blot
• RT PCR
• qPCR
• Also functional confirmation
• mRNA != protein
Microarray Gene Expression Workflow
1. Experimental Design
2. RNA Isolation and Labeling
3. Hybridization
4. Preprocessing
5. Data Analysis
6. Biological Confirmation
7. Sharing of Data
Sharing of Data
Minimum Information About a Microarray Experiment (MIAME) (2001)
Kicic, et al., Decreased fibronectin production significantly contributes to dysregulated repair of asthmatic epithelium. Am J Respir Crit Care Med, 2010. 181(9): p. 889-98.
AIM: Identify differentially expressed genes between disease and control group
What differences in gene expression may be responsible for differences in phenotype?
a.k.a. A Fishing Expedition!
Methods
• Epithelial cells were collected by bronchial brushing and cultured, and then classified as healthy non-atopic (pAECHNA), healthy atopic (pAECHA), or atopic asthmatic (pAECAA).
• RNA from 16 hybridizations (9 pAECHNA, 7 pAECAA) was quantified, assessed for quality using an Agilent Bioanalyser, and processed for hybridization to Affymetrix Human Genome U133 Arrays.
• Data were normalized by GCRMA and differential gene expression between groups was assessed using LIMMA (supervised method).
• LIMMA: fits a linear model to the expression data of each gene and uses empirical Bayes to calculate a moderated t-statistic, which smooths the standard errors across genes, giving more reliable results.
For more information see the LIMMA user guide (http://www.bioconductor.org/packages/2.5/bioc/html/limma.html)
Note: atopic = caused by a hereditary predisposition towards developing certain hypersensitivity reactions, such as asthma.
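The moderated t-statistic idea can be sketched as below: each gene's sample variance is shrunk toward a prior variance before the t-statistic is formed. In LIMMA the prior variance and prior degrees of freedom are estimated from all genes; here they are passed in as hypothetical values, and the function name and numbers are illustrative only.

```python
import math

def moderated_t(effect, sample_var, df, unscaled_sd, prior_var, prior_df):
    """t-statistic using a variance shrunk toward a prior (empirical Bayes idea)."""
    # Weighted average of the prior variance and the gene's own variance.
    post_var = (prior_df * prior_var + df * sample_var) / (prior_df + df)
    return effect / (math.sqrt(post_var) * unscaled_sd)

# Two-group comparison with 9 and 7 arrays: 14 residual degrees of freedom.
unscaled_sd = math.sqrt(1 / 9 + 1 / 7)

# Hypothetical gene with a small log fold change and a very small variance.
ordinary = moderated_t(0.5, 0.01, 14, unscaled_sd, prior_var=0.09, prior_df=0)
moderated = moderated_t(0.5, 0.01, 14, unscaled_sd, prior_var=0.09, prior_df=4)
print(round(ordinary, 2), round(moderated, 2))
```

With prior_df = 0 the result is the ordinary t-statistic; with a non-zero prior, a gene whose variance happens to be tiny no longer gets an inflated statistic.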
Heatmap
• Figure 2. Differences in lower airway epithelial gene expression between healthy non-atopic children (HNA) and children with atopic asthma (AA). Differentially expressed genes based on false discovery rate of less than 0.25 and an absolute fold change of greater than or equal to 1.5 were arranged using unsupervised two-dimensional hierarchical clustering. Each column represents a differentially expressed gene and each row represents an individual subject. Colors represent fold change in each individual, with red indicating up-regulated genes and green indicating down-regulated genes with respect to the average of HNA subjects.
• Differentially regulated genes: 1612 (763 up, 848 down)
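The selection rule in the figure caption (false discovery rate below 0.25 and absolute fold change of at least 1.5) can be written as a simple filter. The gene labels and values below are hypothetical, and a signed fold-change convention (negative for down-regulation) is assumed.

```python
def passes_thresholds(fdr, fold_change, max_fdr=0.25, min_abs_fold=1.5):
    """True if a gene meets both the FDR and the absolute fold-change cutoffs."""
    return fdr < max_fdr and abs(fold_change) >= min_abs_fold

# Hypothetical results: (gene label, FDR, signed fold change).
results = [
    ("gene_a", 0.01, 2.3),
    ("gene_b", 0.30, 3.0),   # fails: FDR too high
    ("gene_c", 0.10, -1.8),
    ("gene_d", 0.05, 1.2),   # fails: fold change too small
]

selected = [(name, fc) for name, fdr, fc in results if passes_thresholds(fdr, fc)]
up = [name for name, fc in selected if fc > 0]
down = [name for name, fc in selected if fc < 0]
print(up, down)
```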
Conclusion
• Deposition of the extracellular matrix (ECM) is required to heal wounded epithelial cells.
• Kicic, et al. noted that fibronectin (FN1) was the only down-regulated ECM component in asthmatic epithelial cell samples and hypothesized this was the reason for their inability to heal wounds.
Practical 1: You will reanalyze the data from this study to see if you arrive at the same conclusions as the original authors. (R - http://www.bioconductor.org)
What You Learned Today...
• Functional genomic methods for gene expression analysis
• Typical workflow for a microarray gene expression study
• Aspects of microarray data analysis
• Kicic et al. (2010): Example of a differential expression microarray study
Evaluations!