gene expression introduction to gene expression arrays microarray data pre-processing introduction...

Gene expression

Introduction to gene expression arrays
Microarray data pre-processing
Introduction to RNA-seq
Deep sequencing applications
RNA-seq data pre-processing

Timeline of DNA Microarray Developments
1991: Photolithographic printing (Affymetrix)
1994: First cDNA collections are developed at Stanford
1995: Quantitative monitoring of gene expression patterns with a complementary DNA microarray
1996: Commercialization of arrays (Affymetrix)
1997: Genome-wide expression monitoring in S. cerevisiae (yeast)
2000: Portraits/signatures of cancer
2003: Introduction into clinical practice
2004: Whole human genome on one microarray
2006: All exons measured on one microarray
An old technology: some predict microarrays will be replaced by deep sequencing.
Currently much cheaper and faster than sequencing; widely used.
2005: first next-generation sequencing machine.

Basics of microarrays
They utilize the chemical binding between the four nucleotides: A --- T and C --- G.
The DNA structure is formed through this binding.
[Figure: DNA_Overview.png]

Basics of microarrays
AATTCAGCATGGGCACATGCCCGCG
TTAAGTCGTACCCGTGTACGGGCGC

Basics of microarrays
Two strategies:
(1) One sample on each array. The amount is calculated from the spot intensity.
(2) Two samples, differentially labeled, on each array. The relative amount is given by the ratio between the two fluorescence signals.
Workflow: amplified DNA segments -> fluorescence labeling -> hybridization on the array -> reading by photo scanner -> digitizing into fluorescence values -> quantifying the amount of each target sequence

Basics of microarrays
Gene expression arrays
[Figure: DNA (2 copies) -> mRNA (multiple copies) -> protein (multiple copies); gene structure with exon, intron, start codon and poly-A tail]
Protein amounts are what matter, but they are hard to measure.
mRNA amounts are easy to measure, and they are positively correlated with protein amounts.

Gene expression array --- Affymetrix
The Affymetrix platform is one of the most widely used.
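The base-pairing rule from the Basics of microarrays slides (A pairs with T, C pairs with G) can be checked on the two strands shown there. This is a minimal Python sketch; the names PAIR and complement are illustrative and do not come from the slides.

```python
# Watson-Crick pairing that hybridization relies on: A-T and C-G.
PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base complementary strand, in the same left-to-right order."""
    return "".join(PAIR[base] for base in strand)


# The two strands printed on the slide.
top = "AATTCAGCATGGGCACATGCCCGCG"
bottom = "TTAAGTCGTACCCGTGTACGGGCGC"

print(complement(top) == bottom)  # True
```

A probe with sequence `top` therefore binds a target with sequence `bottom`, which is the pairing the array measures.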
Gene expression arrays --- Affy
Here we use the U133 system for illustration.
Some 20 probes per gene;
Selected from the 3' end of the gene sequence;
Not necessarily evenly spaced --- sequence properties matter;
The probes are located at random locations on the chip.
TTAAGTCGTACCCGTGTACGGGCGC   Target sequence
AATTCAGCATGGGCACATGCCCGCG   Perfect match (PM) probe
AATTCAGCATGGACACATGCCCGCG   Mis-match (MM) probe

Gene expression array --- Affy
The hope was that mismatch probes won't bind the target sequence.

Microarray data
We are going to focus on pre-processing for now.
Downstream analyses are more in the realm of traditional statistics: multiple testing, clustering, classification.
They are common across different high-throughput techniques.

Microarray data
Issues:
Background level variation, caused by variations in the overall RNA concentration in the sample, the image reader, etc.
Within every probeset, each probe has a different sensitivity/specificity, caused by cross-hybridization, different chemical properties, etc.
Across chips, the fluorescence intensity-concentration response curve can differ, caused by variations in sample processing, the image reader, etc.

Affy data --- general strategy
Background correction (within chip)
Presence/absence call (within chip)
Normalization (across chips)
Probeset-level expression value (within chip)
Probeset-level statistical analysis (combining chips)

Affy data --- general strategy
There are many processing methods. The most popular include:
MAS 5.0 (Affymetrix): Flawed, but it comes with the Affymetrix software and is thus widely used by non-experts.
dChip (Cheng Li & Wing Wong): Good performance and versatile. Stand-alone Windows application. Can handle arrays other than expression arrays.
RMA (Rafael Irizarry et al.): Good performance. Easily used in R/Bioconductor.
Affy data --- RMA
Background correction
For each array, assumes that the observed PM intensity is a true signal plus background noise, $O = S + N$, where $S$ is exponential with rate $\lambda$ and $N$ is normal with mean $\mu$ and standard deviation $\sigma$.
[Figure: density of the observed intensity for lambda=1, mu=1, sigma=1 and for lambda=5, mu=1, sigma=1]

Affy data --- RMA
Background correction
For each array, estimate the parameters from the PM signal distribution:
Find the overall mode by kernel density estimation;
Find mu and sigma from the PM values lower than the overall mode (sample mean and sd);
Find lambda from the PM values higher than the overall mode (1/(sample mean minus the overall mode));
then adjust the PM readings (s is the PM signal; lambda is written as alpha in this expression):
$$E(S \mid O = s) = a + b\,\frac{\phi(a/b) - \phi((s-a)/b)}{\Phi(a/b) + \Phi((s-a)/b) - 1}, \qquad a = s - \mu - \sigma^2\alpha, \quad b = \sigma$$
where $\phi$ and $\Phi$ are the standard normal density and distribution function.
See the derivation here:

Affy data --- normalization
*** This is also relevant to other array platforms!
Goal: reduce chip effects, including non-linear effects.
Difficulty: the sample is different for each chip. We can't match a gene on chip A to the same gene on chip B and expect the two to have the same intensity.
[Figure labels: PM, MM]
Assumptions are therefore made about the overall distribution of the signals on each chip. For example:
Some house-keeping genes don't change;
The overall distribution of concentrations doesn't change.

Affy data --- normalization
Quantile normalization --- match the quantiles between two chips.
Assumes that the distribution of gene abundances is the same between samples.
$$x_{\text{norm}} = F_2^{-1}(F_1(x))$$
x: value in the chip to be normalized
$F_1$: distribution function in the chip to be normalized
$F_2$: distribution function in the reference chip
Nature Protocols 2, (2007)

Affy data --- RMA summary
Model fitting: median polish (robust against outliers).
Alternately remove the row medians and the column medians until convergence.
What remains is the residual; after subtracting the residual, the row and column medians are the estimates of the effects.

Affy data --- RMA summary
[Figure: worked example that removes the row medians and then the column medians, repeated over three passes until convergence; the table left at the end is the residual]
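The three RMA steps described above (background adjustment, quantile normalization against a reference chip, and median polish) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the Bioconductor implementation: the function names are made up for this example, the background parameters are passed in rather than estimated from a kernel density, the two chips are assumed to have the same number of probes, ties are not treated specially, and the median polish omits a separate overall term.

```python
import math
from statistics import median


def background_adjust(s, alpha, mu, sigma):
    """E[S | O = s] for O = S + N, S ~ Exp(alpha), N ~ Normal(mu, sigma^2)."""
    a = s - mu - sigma ** 2 * alpha
    b = sigma

    def pdf(z):  # standard normal density
        return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

    def cdf(z):  # standard normal distribution function
        return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

    num = pdf(a / b) - pdf((s - a) / b)
    den = cdf(a / b) + cdf((s - a) / b) - 1.0
    return a + b * num / den


def quantile_normalize(x, reference):
    """x_norm = F2^-1(F1(x)) using the empirical distributions of two chips."""
    if len(x) != len(reference):
        raise ValueError("this sketch assumes chips with the same number of probes")
    order = sorted(range(len(x)), key=lambda i: x[i])  # rank of each value in x
    ref_sorted = sorted(reference)                     # quantiles of the reference
    out = [0.0] * len(x)
    for rank, i in enumerate(order):
        out[i] = ref_sorted[rank]
    return out


def median_polish(table, max_iter=20, tol=1e-9):
    """Split table[i][j] into row effect + column effect + residual."""
    resid = [list(row) for row in table]
    n_rows, n_cols = len(resid), len(resid[0])
    row_eff, col_eff = [0.0] * n_rows, [0.0] * n_cols
    for _ in range(max_iter):
        moved = 0.0
        for i in range(n_rows):          # remove row medians
            m = median(resid[i])
            row_eff[i] += m
            resid[i] = [v - m for v in resid[i]]
            moved += abs(m)
        for j in range(n_cols):          # remove column medians
            m = median(resid[i][j] for i in range(n_rows))
            col_eff[j] += m
            for i in range(n_rows):
                resid[i][j] -= m
            moved += abs(m)
        if moved < tol:                  # nothing left to remove: converged
            break
    return row_eff, col_eff, resid


print(background_adjust(10.0, alpha=1.0, mu=1.0, sigma=1.0))
print(quantile_normalize([5, 1, 3], [10, 30, 20]))  # [30, 10, 20]
print(median_polish([[1, 2, 3], [2, 3, 4], [4, 5, 6]]))
```

In the last call the toy table is exactly additive, so the residual is zero everywhere and each entry equals its row effect plus its column effect. Real probe-level data would leave non-zero residuals, which is where the robustness of the median matters.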
Affy data --- RMA summary
* This reflects the assumption that probe effects have median zero.

Deep Sequencing
Method of the year 2007 by Nature Methods.
The name:
Next-generation sequencing
Deep sequencing
High-throughput sequencing
Second-generation sequencing
The key characteristics:
Massively parallel sequencing: the amount of data from a single run ~ the amount of data from the human genome project
The reads are short ~ a few hundred bases / read

Background
Potential impact:
The $1000 genome
Genome sequencing will become a regular medical procedure.
Personalized medicine
Predictive medicine
Ethical issues
For statisticians:
Data mining using hundreds of thousands of genomes
Finding rare SNPs/mutations associated with diseases
New methods to analyze epigenomics/transcriptomics data
Finding interventions to improve life quality

Background
The companies use different techniques. We use Illumina's as an example.
(http://seqanswers.com/forums/showthread.php?t=21)

Background
An incomplete list of some common platforms.
[Table not preserved. Source: Bioinformatics and Biology Insights 2015:9(s1)]

Background
Advantages:
Fast and cost effective.
No need to clone DNA fragments.
Drawbacks:
Short read length (platform dependent)
Some platforms have trouble with identical repeats
Non-uniform confidence in base calling within reads. Data are less reliable near the 3' end of each read.

Background
What deep sequencing can do:
[Figure not preserved. Source: Nat Methods Nov;6(11 Suppl):S2-5.]

Alignment and Assembly
Sequence the genome of a person? --- Alignment
Can rely on the existing human genome as a blueprint.
Align the short reads onto the existing human genome.
Need a few-fold coverage to cover most regions.
Sequence a whole new genome? --- Assembly
Overlaps are required to construct the genome.
The reads are short, so ~30-fold coverage is needed.
With 3 Gb of data per run, a new genome of human size (about 3 Gb) needs about 30 runs (3 Gb x 30-fold coverage = 90 Gb).

Whole genome/exome/transcriptome sequencing
Alignment
Finding novel exons.
Alternative splicing

RNA-Seq
Gene expression profiling to replace arrays?
Exon-specific abundance.

RNA-Seq
[Figure not preserved. Source: Genome Biology 2010, 11:220]

Alignment
Hash table-based alignment. Similar to BLAST in principle.
(1) Find potential locations;
(2) Local alignment.

Normalization
Genome Biology 2010, 11:220
RPKM: reads per kilobase of transcript per million mapped reads

Normalization
Normalization by ERCC (External RNA Controls Consortium) spike-in controls:
[Figure not preserved. Source: Nature Methods 12: 339-342 (2015)]

Sequence count models
Example: simple Poisson model for between-group testing, with
d_i: the sequencing depth of sample i;
a gene-level term for the expression level of gene g;
a gene-level term for the association of gene g with the covariate.
Cancer Informatics 2015:14(s1)

Sequence count models
The Poisson model doesn't allow overdispersion.
Negative binomial model: an additional gene-level dispersion term accounts for the sample-to-sample variability.
Methods like DESeq use the negative binomial distribution.
Cancer Informatics 2015:14(s1)

RNA-Seq vs. Array
Good agreement for genes expressed at medium level.
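Returning to the normalization and count-model slides, the RPKM definition and the overdispersion point can be made concrete with a few lines of Python. This is a hedged sketch: the function names and the numbers are invented for illustration, and the negative binomial variance uses the common mean + dispersion x mean^2 parameterization, which is an assumption about the slide's lost formula rather than a quote from it.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1_000
    millions = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions)


def poisson_variance(mean):
    """Under a Poisson model the variance is forced to equal the mean."""
    return mean


def negative_binomial_variance(mean, dispersion):
    """Assumed parameterization: variance = mean + dispersion * mean^2."""
    return mean + dispersion * mean ** 2


# Hypothetical gene: 500 reads on a 2 kb transcript, 10 million mapped reads.
print(rpkm(500, 2_000, 10_000_000))            # 25.0

# Same mean count, with and without extra sample-to-sample variability.
print(poisson_variance(100.0))                 # 100.0
print(negative_binomial_variance(100.0, 0.25)) # 2600.0
```

With the dispersion set to zero the negative binomial variance collapses to the Poisson variance, which is why the Poisson model cannot absorb the extra variability seen between biological replicates.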
