John Bennett
International Rice Research Institute
Los Baños, Philippines
WUEMED – Drought Course
3. Microarrays: experimental design, statistical analysis and gene clustering
IRRI
3.1: Experimental overview
Photolithography for oligonucleotide synthesis
Feature | Applied Biosystems | Agilent | TaqMan Gene Expression Assays
Platforms | Human Genome Survey Microarray | Human Whole Genome Arrays | TaqMan Expression
Technology | Hybridization | Comparative Hybridization | 5' Nuclease Chemistry & Real Time PCR
Probe (bases) | 60mer | 60mer | TaqMan Primer & Probes
Substrate | Nylon | Glass slide | -
Deposition | Contact Spotting | In-situ Ink Jet Printing | -
Detection | One-color Chemilumin. | Two-color Cy3/Cy5 Fluorescence | One-color FAM Fluorescence
Software | 1700 Chemilumin. Microarray Analyzer Software v 1.1 | Feature Extraction A 7.5.1 | SDS 2.1
Total Probes | 33,096 probes | 44,000 probes | 1375 Selected Targets
Overview of the two microarray platforms and TaqMan® Gene Expression Assay based real-time PCR
Wang et al. (2006). Large scale real-time PCR validation on gene expression measurements from two commercial long-oligonucleotide microarrays. BMC Genomics 7: 59.
Management of library of cDNA clones with Biomek 2000 liquid transferring system
PCR and library replication use Biomek
Slide printing using GeneTAC printer
Microtiter plates Glass slides
3000 spots per slide
PCR products from >9000 genes
UV cross-linking after printing
Reverse transcription and labeling with fluorescent dyes
Smears showing labeled reverse transcripts
Un-incorporated dyes
GeneTAC Hyb station
Automated and manual hybridization chambers
Manual hyb chamber in water bath
Slide scanning with ScanArray
10K rice panicle cDNA library printed at IRRI
59K oligo array from BGI, Beijing
22K chips from Agilent
Images captured by scanner
Quantification: gridding
3.2: Experimental design
Steps in microarray analysis
Steps
• Biological experiment
• Sample collection
• RNA extraction
• RNA labeling
• Array printing
• Hybridization
• Scanning
• Data acquisition
• Data analysis
• Data interpretation
Sources of error
• Plant growth and stress conditions
• Tissue variation
• RNA quality
• Efficiency of labeling (esp. Cy3 vs Cy5)
• Reproducibility, pin effects
• Non-uniformity, background, cross-hybridization
• Variable scanner performance
• Inaccurate gridding
• Inconsistent background subtraction
• Faulty annotation
Types of replication
• Technical replication (on the same RNA sample, to gauge the effects of different arrays, hybridization conditions, etc.)
• Dye swap (to gauge the effect of using different dyes)
• Biological replication (different plants from the same treatment in the same experiment)
• Experimental replication (same experimental design but conducted at different times)
Log-transformed gene expression signal:
log(y) = µ + A + D + V + G + (AG) + (VG) + ε    (1)
where:
µ is the average expression signal
A is the array effect
D is the dye effect
V is the sample variety effect
G is the gene effect
(AG) is the combination of array and gene
(VG) is the combination of variety and gene
ε is independent noise.
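As a minimal sketch of how model (1) behaves under a dye swap, the Python example below builds noise-free log signals for a single gene on two arrays and averages the within-array differences between two varieties. The array and dye terms cancel, leaving V + (VG). All effect sizes and the variety labels "tolerant" and "sensitive" are invented for illustration and do not come from the slides.

```python
# Hypothetical effect sizes (log units) for one gene; none come from real data.
MU = 10.0                                               # mu: average signal
ARRAY = {1: 0.30, 2: -0.30}                             # A: array effect
DYE = {"Cy3": -0.20, "Cy5": 0.20}                       # D: dye effect
VARIETY = {"tolerant": 0.10, "sensitive": -0.10}        # V: variety effect
GENE = 1.50                                             # G: gene effect
ARRAY_GENE = {1: 0.05, 2: -0.05}                        # (AG)
VARIETY_GENE = {"tolerant": 0.80, "sensitive": -0.80}   # (VG)


def log_signal(array, dye, variety, noise=0.0):
    """Model (1): log(y) = mu + A + D + V + G + (AG) + (VG) + eps."""
    return (MU + ARRAY[array] + DYE[dye] + VARIETY[variety] + GENE
            + ARRAY_GENE[array] + VARIETY_GENE[variety] + noise)


# Dye swap: the two varieties exchange dyes between array 1 and array 2.
design = {
    1: {"tolerant": "Cy5", "sensitive": "Cy3"},
    2: {"tolerant": "Cy3", "sensitive": "Cy5"},
}


def within_array_difference(array):
    """Tolerant minus sensitive on one array; A and (AG) cancel here."""
    dyes = design[array]
    return (log_signal(array, dyes["tolerant"], "tolerant")
            - log_signal(array, dyes["sensitive"], "sensitive"))


diffs = [within_array_difference(a) for a in (1, 2)]
estimate = sum(diffs) / len(diffs)   # the dye effect cancels across the swap
expected = (VARIETY["tolerant"] - VARIETY["sensitive"]
            + VARIETY_GENE["tolerant"] - VARIETY_GENE["sensitive"])
print(diffs, estimate, expected)
```

Each single-array difference is biased by the dye effect in opposite directions, which is why the dye swap is listed as a separate type of replication.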
Random factors contributing to technical variance include:
• variation among replicate spots within a slide hybridization (corr > 95%)
• variation among replicate spots between slides (corr ~60-80%)
• variation introduced by scratches or dust or local hybridization effects
• variation introduced by subtraction of background from spot signal intensities
• variation introduced by tissue sampling
• variation introduced by RNA extraction
Sources of error
Systematic sources of variation include:
• different dyes (corr <60–80%): include dye swaps
• multiple print tips (print group effects): apply local data normalization
Unlike in the early days of microarray studies, most journals no longer accept manuscripts that lack adequate sampling.
• Replication requires more resources, but an appropriate experimental design can increase the efficiency of resource utilization and optimize statistical power.
• Reference and balanced are the two basic designs.
• In reference designs, all experimental samples are labelled with one dye and each co-hybridized with a common reference sample that is labelled with a second dye.
• In balanced designs such as loops, experimental samples are labelled with both dyes and hybridized to each other.
• For the same number of slides, twice the number of experimental samples can be included in a balanced design compared to a reference design, leading to improved precision and increased statistical power.
• Furthermore, error due to technical variability is highest for reference designs.
Reference and balanced designs
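The difference between the two designs can be sketched by listing which sample goes on which slide in which dye. The Python example below is an illustration only: the sample names, the "REF" label and the choice of dye for the reference are assumptions. With four slides, the reference design measures each experimental sample once, whereas the loop measures each sample twice, once in each dye.

```python
def reference_design(samples, reference="REF"):
    """One slide per sample: the sample in Cy5, the common reference in Cy3."""
    return [{"Cy3": reference, "Cy5": s} for s in samples]


def loop_design(samples):
    """Each sample is co-hybridized with the next one; the last closes the loop."""
    n = len(samples)
    return [{"Cy3": samples[i], "Cy5": samples[(i + 1) % n]} for i in range(n)]


def measurements_per_sample(slides, samples):
    """Count how often each experimental sample appears across all slides."""
    counts = {s: 0 for s in samples}
    for slide in slides:
        for sample in slide.values():
            if sample in counts:
                counts[sample] += 1
    return counts


samples = ["S1", "S2", "S3", "S4"]          # hypothetical experimental samples
ref_slides = reference_design(samples)
loop_slides = loop_design(samples)
ref_counts = measurements_per_sample(ref_slides, samples)
loop_counts = measurements_per_sample(loop_slides, samples)

for name, slides in (("reference", ref_slides), ("loop", loop_slides)):
    print(name, len(slides), "slides:", slides)
print(ref_counts, loop_counts)
```

In the reference design half of every slide is spent on the reference sample, which is the resource cost the bullets above refer to.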
Two treatments × two replicates × two dyes × dye swap ÷ two scan wavelengths = 4 slides
Simple two-treatment design
Churchill GA. 2002. Fundamentals of experimental design for cDNA microarrays. Nature Genetics 32: 490-495.
Design with and without reference samples
3.3: Statistical analysis
The TM4 suite of tools consists of four major applications:
1. Microarray Data Manager (MADAM)
2. TIGR_Spotfinder
3. Microarray Data Analysis System (MIDAS)
4. Multiexperiment Viewer (MeV)
Plus
5. A (MIAME*)-compliant MySQL database
Freely available at http://www.tigr.org/software
TM4 from The Institute for Genomic Research (TIGR)
*Minimal Information About a Microarray Experiment
Saeed et al. (2003). TM4: a free, open-source system for microarray data management and analysis. Biotechniques. 34: 374-378.
• After the spot intensity values are measured in TIGR Spotfinder, they must be normalized to help compensate for variability between slides and fluorescent dyes, as well as other systematic sources of error, by appropriately adjusting the measured array intensities.
• Data filtering can reduce the dataset by removing poor or questionable data. TIGR’s MIDAS, a Java application, provides an interface to design analysis protocols combining one or more normalization and filtering steps. MIDAS reads “.tav” files generated by TIGR Spotfinder or retrieved from the database via MADAM.
• Normalization modules include locally weighted linear regression [lowess] and total intensity normalization. These can be linked with filters, including low-intensity cutoff, intensity-dependent Z-score cutoffs, and replicate consistency trimming, creating a highly customizable method for preparing expression data for subsequent comparison and analysis. When the normalization and filtering steps are complete, MIDAS outputs the data in “.tav” format.
Data normalization and filtering via MIDAS
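The idea of chaining a normalization step with a filter can be sketched in a few lines of Python. This is not MIDAS code: the function names, the choice to rescale the red channel, the cutoff of 100 and the intensity values are all assumptions made for illustration. Total intensity normalization here rescales one channel so that both channels have the same summed intensity, and the low-intensity filter drops spots that are weak in either channel.

```python
import math


def total_intensity_normalize(red, green):
    """Rescale the red channel so both channels have the same total intensity."""
    factor = sum(green) / sum(red)
    return [r * factor for r in red], factor


def low_intensity_filter(red, green, cutoff):
    """Return indices of spots whose intensity exceeds the cutoff in both channels."""
    return [i for i, (r, g) in enumerate(zip(red, green))
            if r > cutoff and g > cutoff]


# Hypothetical background-subtracted spot intensities for five spots.
red = [1200.0, 90.0, 4000.0, 800.0, 2500.0]
green = [600.0, 40.0, 2100.0, 390.0, 1200.0]

red_norm, factor = total_intensity_normalize(red, green)
kept = low_intensity_filter(red_norm, green, cutoff=100.0)
log_ratios = {i: math.log2(red_norm[i] / green[i]) for i in kept}

print("scaling factor:", round(factor, 3))
print("spots kept:", kept)
print("log2 ratios:", {i: round(v, 3) for i, v in log_ratios.items()})
```

Before normalization every spot looks roughly two-fold up in the red channel; after rescaling the log ratios sit close to zero, which is what the method assumes for most genes.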
Quackenbush J. 2002. Microarray data normalization and transformation. Nature Genetics 32: 496-501.
Global versus local normalization. Most normalization algorithms, including lowess, can be applied either globally (to the entire data set) or locally (to some physical subset of the data). For spotted arrays, local normalization is often applied to each group of array elements deposited by a single spotting pen (sometimes referred to as a 'pen group' or 'subgrid').
• TIGR Spotfinder was designed for the rapid, reproducible, and computer-aided analysis of microarray images and the quantification of gene expression. It reads paired 16-bit TIFF image files generated by most microarray scanners.
• Automatic and manual grid adjustments help to ensure that each rectangular grid cell is centered on a spot. Spot intensities are calculated as an integral of non-saturated pixels. Local background is subtracted from each intensity value.
• These calculated intensities, along with each spot’s position on the array, spot area, background values, and quality control flags, are written to a TIGR ArrayViewer (“.tav”) file format, a Microsoft Excel® workbook, or the database.
• In noisy areas of the slide, the user may manually identify or discard spots. Quality-control views allow the user to assess systematic biases in the data.
TIGR Spotfinder for image analysis
• ANOVA log-transformed gene expression signal (Kerr et al., 2000)
• mixture models for gene effect (Lee et al., 2000)
• multiplicative model (not logarithm-transformed) (Yang et al., 2001; Sasik et al., 2002)
• ratio-distribution model (Chen et al., 1997, 2002)
• binary model (Shmulevich and Zhang, 2002)
• rank-based models not sensitive to noise distributions (Ben-Dor et al., 2000)
• replicates using mixed models (Wernisch et al., 2003)
• quantitative noise analysis (Tu et al., 2002; Fathallah-Shaykh et al., 2002)
• design of reverse dye microarrays (Dobbin et al., 2003).
Proposed models for statistical analysis of microarray expression data
Pan (2002) compared different microarray statistical analysis methods:
• log-linear ANOVA mixed model (Pan et al., 2001; based on Tusher et al., 2001)
• two-sample t-test (Devore & Peck, 1997)
• regression (Thomas et al., 2001)
Pan W. 2002. A comparative review of statistical methods for discovering differentially expressed genes in replicated microarray experiments. Bioinformatics 18: 546-554.
Quantification: spot quality
Spot scanning for control of spot quality
For 16 positions across each spot, determine intensity and calculate p value from t-test.
Spots with low p values are acceptable.
High p values could result from poor printing, damage, poor hybridization, or poor gridding.
LOWESS normalized data in GPR format
Distribution plot view
Data loaded into TMeV (TIGR) for statistical analysis
Comparison of biological replicates
Quackenbush J. 2002. Microarray data normalization and transformation. Nature Genetics 32: 496 – 501.
Lowess - Locally Weighted Linear Regression
[Figure: two panels plotting log10(R/G) against log10(R*G)]
http://www.ucl.ac.uk/oncology/MicroCore/HTML_resource/Norm_Lowess1.htm
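A locally weighted linear regression of log10(R/G) on log10(R*G) can be sketched in plain Python as follows. This is a simplified illustration, not the implementation used by MIDAS or described by Quackenbush: it uses tricube weights over the nearest neighbours, omits the robustness iterations of full lowess, and the function name, the `fraction` parameter and the simulated data are assumptions. The fitted trend is subtracted from each log ratio to remove intensity-dependent dye bias.

```python
import math


def lowess_fit(x, y, fraction=0.4):
    """Locally weighted linear fit of y on x, evaluated at every x value."""
    n = len(x)
    k = max(2, int(math.ceil(fraction * n)))      # points in each local window
    fitted = []
    for x0 in x:
        # The k nearest neighbours of x0 and the half-width of their window.
        nearest = sorted(range(n), key=lambda j: abs(x[j] - x0))[:k]
        width = max(abs(x[j] - x0) for j in nearest) or 1.0
        # Tricube weights: nearby points count most, the farthest counts zero.
        w = [(1 - (abs(x[j] - x0) / width) ** 3) ** 3 for j in nearest]
        sw = sum(w)
        mx = sum(wi * x[j] for wi, j in zip(w, nearest)) / sw
        my = sum(wi * y[j] for wi, j in zip(w, nearest)) / sw
        sxx = sum(wi * (x[j] - mx) ** 2 for wi, j in zip(w, nearest))
        sxy = sum(wi * (x[j] - mx) * (y[j] - my) for wi, j in zip(w, nearest))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))     # weighted line evaluated at x0
    return fitted


# Simulated data: a curved, intensity-dependent bias plus a small alternating wiggle.
log_product = [6.0 + 0.1 * i for i in range(41)]              # log10(R*G)
log_ratio = [0.2 * (xi - 8.0) ** 2 - 0.4 + 0.02 * (-1) ** i   # log10(R/G)
             for i, xi in enumerate(log_product)]

trend = lowess_fit(log_product, log_ratio, fraction=0.4)
normalized = [r - t for r, t in zip(log_ratio, trend)]

print("mean |log ratio| before:", sum(abs(v) for v in log_ratio) / len(log_ratio))
print("mean |log ratio| after: ", sum(abs(v) for v in normalized) / len(normalized))
```

For local normalization, the same function could be applied separately to the spots of each pen group rather than to the whole array.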
Quackenbush J. 2002. Microarray data normalization and transformation. Nature Genetics 32: 496 – 501.
Replicated determination of ratio of two treatments
[Figure: log2(A/B) from replicate 1 plotted against log2(A/B) from replicate 2]
What was wrong here?
Quackenbush J. 2002. Microarray data normalization and transformation. Nature Genetics 32: 496 – 501.
Intensity-dependent Z scores for identifying differential expression
[Figure: log10(R/G) plotted against log10(R*G), with spots classed as Z>2, 1<Z<2 and Z<1]
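One way to compute an intensity-dependent Z score is to compare each spot's log ratio with the mean and standard deviation of the spots closest to it in intensity. The Python sketch below does this with a sliding window over spots ranked by log10(R*G). The window size, the function name and the simulated data are assumptions for illustration, and centring on the local mean is a choice made here rather than something stated on the slide.

```python
import math


def intensity_dependent_z(log_product, log_ratio, window=11):
    """Z score of each spot relative to the spots nearest to it in intensity."""
    order = sorted(range(len(log_product)), key=lambda i: log_product[i])
    half = window // 2
    z = [0.0] * len(order)
    for rank, i in enumerate(order):
        lo = max(0, rank - half)
        hi = min(len(order), rank + half + 1)
        local = [log_ratio[j] for j in order[lo:hi]]   # includes the spot itself
        mean = sum(local) / len(local)
        sd = math.sqrt(sum((v - mean) ** 2 for v in local) / (len(local) - 1))
        z[i] = (log_ratio[i] - mean) / sd if sd > 0 else 0.0
    return z


# Simulated spots: log ratios scatter widely at low intensity, tightly at high.
pattern = [-1.0, 0.5, 0.0, -0.5, 1.0]
log_product = [6.0 + 0.1 * i for i in range(40)]
log_ratio = [(0.5 - 0.1 * (x - 6.0)) * pattern[i % 5]
             for i, x in enumerate(log_product)]
log_ratio[35] = 0.6          # one spot that stands out among bright spots

z = intensity_dependent_z(log_product, log_ratio)
print("low-intensity spot 0:  ratio", log_ratio[0], "Z", round(z[0], 2))
print("high-intensity spot 35: ratio", log_ratio[35], "Z", round(z[35], 2))
```

In this toy data set a log ratio of -0.5 at low intensity stays below |Z| = 2, while a log ratio of 0.6 among bright, low-scatter spots exceeds it, which is the point of making the cutoff intensity dependent.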
ExpressConverter
ExpressConverter is a file transformation tool that reads microarray data files in a variety of file formats and generates .mev or .tav files as output, for uploading microarray data to the database with MADAM and for analysis with MIDAS and MeV. The supported formats include GenePix, ImaGene, ScanArray, ArrayVision and Agilent files. Affymetrix data files cannot be converted with ExpressConverter, but can be loaded directly into MeV.
TM4 utilities: SlideMap and ExpressConverter
SlideMap
SlideMap.pm is a Perl module used for conversion of spots to wells and wells to spots. It is useful when the array is custom-printed from PCR products presented to the arrayer in microtiter plates. SlideMap currently supports several commercial arrayers and 'generic' arrayers.
FAQ: http://www.tm4.org/faq.html
Normalized and filtered expression files are analyzed from “.tav” files using TIGR MeV, which generates informative and interrelated displays of expression and annotation data from single or multiple experiments.
Analysis modules currently implemented in MeV include:
• hierarchical clustering (8)
• k-means clustering (18)
• self-organizing maps (15)
• principal components analysis (17)
• cluster affinity search technique (3)
• self-organizing trees (13)
• template matching
• between-groups tests (including t-tests)
• bootstrapping and jackknifing, which resample the dataset to generate consensus clusters.
Data analysis via TIGR MeV
3.4: Gene clustering
Datta S, Datta S. 2003. Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19: 459-466.
• At first a mainly visual analysis was used for clustering of genes into similar groups (e.g., DeRisi et al., 1997)
• Subsequently, simple sorting of expression ratios and some form of ‘correlation distance’ were used to identify genes (Spellman et al., 1998; Eisen et al., 1998).
• Datta & Datta (2003) compared six different clustering methods:
(i) Hierarchical clustering with correlation (e.g., UPGMA) (Eisen et al., 1998)
(ii) Clustering by K-means
(iii) Diana (divisive clustering)
(iv) Fanny (Fuzzy logic)
(v) Model-based clustering
(vi) Hierarchical clustering with partial least squares
They used microarray data of Chu et al. (1998) for yeast sporulation: 6118 genes, seven time points during the onset of sporulation (0-12 h) [http://cmgm.stanford.edu/pbrown/sporulation]
Comparison of clustering methods
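Method (i), hierarchical clustering on a correlation distance with average linkage (UPGMA), can be sketched in plain Python as below. This is an illustrative toy, not the procedure of Datta & Datta or Eisen et al.: the gene names and the seven-point expression profiles are invented, the distance is taken as 1 minus the Pearson correlation, and merging simply stops at a requested number of clusters instead of building a full dendrogram.

```python
import math


def pearson(a, b):
    """Pearson correlation between two expression profiles."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def average_linkage(profiles, n_clusters):
    """Agglomerative clustering with average linkage on 1 - correlation."""
    names = list(profiles)
    dist = {(a, b): 1.0 - pearson(profiles[a], profiles[b])
            for a in names for b in names if a != b}
    clusters = [frozenset([name]) for name in names]

    def cluster_distance(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        return sum(dist[(a, b)] for a in c1 for b in c2) / (len(c1) * len(c2))

    while len(clusters) > n_clusters:
        pairs = [(cluster_distance(c1, c2), i, j)
                 for i, c1 in enumerate(clusters)
                 for j, c2 in enumerate(clusters) if i < j]
        _, i, j = min(pairs)                      # closest pair of clusters
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


# Hypothetical log ratios for five genes at seven time points.
profiles = {
    "induced_a":   [0.0, 0.5, 1.0, 1.8, 2.5, 2.9, 3.0],
    "induced_b":   [0.1, 0.3, 0.9, 1.5, 2.0, 2.6, 2.8],
    "repressed_a": [0.0, -0.4, -1.1, -1.9, -2.2, -2.8, -3.0],
    "repressed_b": [0.2, -0.2, -0.8, -1.5, -2.1, -2.4, -2.9],
    "transient":   [0.0, 1.5, 2.8, 1.6, 0.2, -0.1, 0.0],
}

clusters = average_linkage(profiles, n_clusters=3)
for cluster in clusters:
    print(sorted(cluster))
```

Because the distance is based on correlation, genes are grouped by the shape of their profile over time rather than by the magnitude of the change.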
Chung et al. (2002). Molecular portraits and the family tree of cancer. Nature Genetics 32: 533-540.
Use of microarray data to cluster cancer types
Cluster analysis requires a suitable co-variable
Examples:
• Time (e.g., duration of treatment)
• Genotypes
• Stress level (e.g., salt concentration, temperature, water status)
• Any other suitable independent variable, or dummy independent variable, or co-variable
• Certain suitable combinations (e.g., temperature and water status)
Using FTSW (fraction of transpirable soil water) as the co-variable for cluster analysis
[Figure: NTR (axis 0.0 to 1.4) plotted against FTSW (axis 0.0 to 1.0)]
NTR = normalized transpiration rate
Each point on the curve represents a stage in stress development and can be related to changes in other physiological and molecular factors (such as photosynthesis and transcript levels).
Ermolaeva et al. (1998). Data management and analysis for gene expression arrays. Nature Genetics 20: 19-23.
Data management and analysis for gene expression arrays