high throughput sequencing and expression analysis · 2014. 5. 24. · 5/24/2014 1...
5/24/2014
1
High‐throughput sequencing and expression analysis
Glossary for high‐throughput sequencing experiments
• Library
• Run
• Flowcell
• Lane (channel)
• Read
• Read length (25-400 bp)
• Coverage
• Depth
• Barcoding (multiplexing: run >1 sample in 1 lane)
• Paired-end
• Sequencing by synthesis (SBS)
• Sequencing by ligation (SBL)
• Emulsion PCR
• Single-molecule sequencing (no amplification, 3rd generation)
millions of sequences (reads)
Sequencing technologies
• Sanger (ABI, Life Technologies)
• 454 (Roche)
• Solexa (Illumina)
• Solid (Life Technologies)
• Polonator (Church Lab)
• HeliScope (Helicos)
• Pacific Biosciences SMRT
• Ion Torrent (Life Technologies)
• Complete Genomics
• Nanopore sequencing (IBM & Roche)
Metzker, Nature Rev Genet, 2010
Shendure, Ji, Nature Biotech, 2008
Mardis, Annu Rev Genomics Hum Genet, 2008
Sanger sequencing
• DNA is fragmented
• Fragments are cloned into a plasmid vector, transformed into bacteria, and grown (automated colony picking)
• Chain termination with fluorophore-labeled dideoxynucleoside triphosphates (ddNTPs)
• Fluorescent readout with capillary electrophoresis (up to 384 capillaries)
454 (Roche)
• First on the market (2004)
• Pyrosequencing
• Emulsion PCR
• ~400 bp reads
• 400,000 reads in parallel
• Homopolymers can be an issue
• One DNA molecule per bead
454 Pyrosequencing
PPi... Pyrophosphate
Agah A et al. Nucl. Acids Res. 2004
Solexa (Illumina)
1. Prepare genomic DNA sample
2. Attach DNA to surface
3. Bridge amplification
4. Fragments become double stranded
5. Denature double stranded DNA
6. Complete amplification
Solid (Life Technologies)
• Sequencing by ligation
• 2-base encoding (color space)
• => high accuracy
Solid (Life Technologies)
Must know first base
Requires an adjacent valid color change
Errors do not have compensatory color changes
HeliScope (Helicos)
• 1,000,000,000 reads/exp
• >100 sample preparation scalability
• No amplification
• Digital quantification
• 1 labeled base at a time
Pacific Biosciences (Single Molecule Real-Time, SMRT)
• Zero-mode waveguide (ZMW) cells
• Polymerase immobilized on a solid surface
• Read length >1 kb possible
• No amplification
• Phospholinked nucleotides (fluorophores)
Pacific Biosciences
• The activity of the DNA polymerase supports long read lengths (>1 kb)
• Circular sequencing could improve quality by reading the same base multiple times
Eid et al. Science, 2009
Ion Torrent
• Sequencing on a semiconductor device
• pH sensor
• Homopolymer issues
• Cheap chemistry
• Run time ~2 hours
• ~200 bp reads
• 11 million wells, >1 Gb (Ion 318™ Chip)
Rothberg et al., Nature, 2011
Summary of sequencing technologies
Platform | Amplification/Chemistry | Detection | Read length (bp) | Reads per lane / no. of lanes | Run time (d) | Gb per run | Machine cost (k$) | Cost per Mb
Sanger/ABI | PCR, dideoxy | Fluor. | 800 | 384 capil. | 0.01 | 0.1 | 110 | <5000
454/Roche | emPCR, synthesis | Lumin. | 400 (SE) | 1x10^6 | 0.35 | 0.5 | 500 | <10
Solexa/Illumina | Bridge ampl., synthesis | Fluor. | 100 (PE) | 50x10^6 / 8 (x2) | 4 | 500 | 540 | <2
Solid/Life Tech | emPCR, ligation | Fluor. | 75 (PE) | 40x10^6 / 8 (x2) | 7 | 300 | 595 | <2
Polonator | emPCR, ligation | Fluor. | 13 (PE) | 10x10^6 / 8 (x2) | 5 | 10 | 170 | <1
HeliScope/Helicos | No ampl., synthesis | Fluor. | 35 (SE) | 20x10^6 / 25 (x2) | 8 | 30 | 999 | <0.6
Pacific Biosciences SMRT | No ampl., synthesis | Fluor. | >1000 (SE) | 150.00 per SMRT cell | N/A | N/A | N/A | <1
Ion Torrent | emPCR, synthesis | pH | 200 (PE) | 11x10^6 per chip | <2h | 1 | 50 | <1
Complete Genomics | DNA nanoballs, ligation | Fluor. | Service; 40X human genome coverage, >90% of the full genome res.; 400 human genomes per month
* Values change rapidly, might be outdated, and depend on instrument type and version
Base calling (Phred score)
Phred quality score Q and base‐calling error probabilities P
$Q_{Phred} = -10 \log_{10} P$
$Q_{Solexa} = -10 \log_{10} \frac{P}{1 - P}$
For P=0.05 the quality score Q=13
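The two score definitions above can be checked numerically. The following is a minimal sketch; the function names are illustrative and not taken from any base-calling tool.

```python
import math

def phred_quality(p):
    """Phred quality score Q = -10 * log10(P) for an error probability P."""
    return -10.0 * math.log10(p)

def solexa_quality(p):
    """Solexa quality score Q = -10 * log10(P / (1 - P))."""
    return -10.0 * math.log10(p / (1.0 - p))

def phred_error_probability(q):
    """Invert the Phred definition: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

# The slide's example: P = 0.05 corresponds to Q = 13 after rounding.
print(round(phred_quality(0.05)))
```

The two scales nearly coincide for small P, because P / (1 - P) approaches P, and they differ mainly for low-quality bases.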
Base calling (FastQ format)
@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88
Quality scores are encoded in ASCII
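As an illustration of the ASCII encoding, the sketch below decodes a quality string into numeric scores. The offset is an assumption: 33 is used by the Sanger FASTQ variant and 64 by older Solexa/Illumina variants, and the slide does not say which applies. The record in the example is hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert each ASCII character to a quality score (assumed offset)."""
    return [ord(ch) - offset for ch in quality_string]

def parse_fastq_record(lines):
    """Split a 4-line FASTQ record into (name, sequence, quality string)."""
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    return header[1:], sequence, quality

# Hypothetical record, not taken from the slides.
record = ["@read1", "ACGT", "+", "II5!"]
name, seq, qual = parse_fastq_record(record)
scores = decode_qualities(qual)
print(name, seq, scores)
```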
Paired‐end sequencing
• Enables both ends of the DNA fragment to be sequenced
• Because the distance between the paired reads is known, alignment algorithms can use this information to map reads over repetitive regions and to detect insertions and deletions more precisely
Multiplex sequencing (barcodes)
Margolies, NHGRI
RNA expression profiling
• Northern blotting
– semi-quantitative
– few genes
• Real-time RT-PCR (qPCR)
– medium throughput
– 96/384 per run
• Microarray analysis
– high throughput
– 10,000-500,000 elements per chip
• RNA-seq
– high throughput
– deep sequencing (short reads, 25 bp)
Microarray technology
• Two‐color microarrays (Custom, Agilent)
– Spotted oligonucleotides
– Spotted cDNA
• One color microarrays (Affymetrix, ABI)
– In situ synthesized oligonucleotides
• Other types of microarrays
– Exon microarrays
– Tiling microarrays
– Protein (antibody) microarray
– Tissue microarrays
Array types
3' arrays, e.g. Affymetrix U133 Plus 2.0 array. Application: gene expression
Exon arrays, e.g. Affymetrix Exon 1.0 ST array. Application: alternative splicing, transcript expression
microRNA arrays, e.g. Exiqon, using locked nucleic acids (LNA). Application: mature microRNA profiling. Problem: short, very similar sequences
Tiling arrays (target genomic DNA), e.g. NimbleGen. Applications: ChIP-on-chip, array CGH
Two‐color microarrays
• DNA fragments corresponding to a known sequence are mechanically deposited onto a glass slide
• The fragments can be :
– Oligonucleotides, 60-80 nt long (60- to 80-mers)
– cDNA fragments from a library (varying lengths)
• Two samples of reverse‐transcribed mRNA are labelled with two different colors and co‐hybridized onto the slide
Labels
Experimental design
• Replicates
– Biological replicates (independent experiments)
– Technical replicates: replicated arrays (dye swap), replicated spots (not independent!)
• Reference design versus loop design
– Which reference sample?
• Constraints
– Sample material (e.g. biopsies): pooling
– Costs (custom 100 €, Affx 500 €, ABI 1k €)
Analytical pipeline
Image analysis and background correction
• Software available (GenePix, ImaGene, Agilent)
• Steps:
– Gridding: assigns coordinates and gene information to the different spots
– Segmentation: foreground vs. background
– Intensity extraction
• Background-corrected intensities (foreground minus background):
Gk = F532 mean - B532
Rk = F635 mean - B635
• Many other parameters: F635 % Sat., Flags, B532 SD, …
Normalization
• Removal of all sources of systematic non‐biological variability and the reduction of the random errors.
• Basic assumption is that most of the genes are not changing their expression during the studied process
• Amount of total RNA for both samples is the same
Scatterplot and histogram
Box plot
MA plot
M = log2(R/G)
A = log2(R*G)/2
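A minimal sketch of how M and A could be computed per spot from background-corrected red (R) and green (G) intensities. Returning None for non-positive intensities is an assumption of this sketch, since the logarithm is undefined there.

```python
import math

def ma_values(r, g):
    """Return (M, A) for one spot, or None if an intensity is not positive."""
    if r <= 0 or g <= 0:
        return None  # log undefined; such spots are usually filtered or flagged
    m = math.log2(r / g)        # log ratio
    a = math.log2(r * g) / 2.0  # average log intensity
    return m, a

# Hypothetical intensities: R is 4x G, so M = 2.
print(ma_values(8.0, 2.0))
```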
Intensity dependent normalization
• Apply a locally weighted polynomial regression for a fixed subset of genes in the neighborhood of every gene i (LOWESS).
• Weight function:
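The sketch below illustrates intensity-dependent normalization with a simplified LOWESS: for every spot, a weighted linear regression of M on A is fitted in a local window and the fitted value is subtracted from M. The tricube weight function, the `span` parameter, and the omission of robustness iterations are assumptions of this sketch; the weight function shown on the original slide is not preserved in the transcript.

```python
def tricube(u):
    """Tricube weight: (1 - |u|^3)^3 inside the window, 0 outside."""
    u = abs(u)
    return (1.0 - u ** 3) ** 3 if u < 1.0 else 0.0

def lowess_fit(a_values, m_values, span=0.4):
    """Fit M as a smooth function of A with local weighted linear regression."""
    n = len(a_values)
    k = min(n, max(2, int(round(span * n))))  # points in each local window
    fitted = []
    for a0 in a_values:
        # Window half-width: distance to the k-th nearest point (itself included).
        h = sorted(abs(a - a0) for a in a_values)[k - 1]
        h = h if h > 0 else 1e-12
        w = [tricube((a - a0) / h) for a in a_values]
        sw = sum(w)
        mean_a = sum(wi * a for wi, a in zip(w, a_values)) / sw
        mean_m = sum(wi * m for wi, m in zip(w, m_values)) / sw
        sxx = sum(wi * (a - mean_a) ** 2 for wi, a in zip(w, a_values))
        sxy = sum(wi * (a - mean_a) * (m - mean_m)
                  for wi, a, m in zip(w, a_values, m_values))
        slope = sxy / sxx if sxx > 1e-12 else 0.0
        fitted.append(mean_m + slope * (a0 - mean_a))
    return fitted

def lowess_normalize(a_values, m_values, span=0.4):
    """Subtract the intensity-dependent trend from the log ratios."""
    fit = lowess_fit(a_values, m_values, span)
    return [m - f for m, f in zip(m_values, fit)]
```

This pure-Python version is quadratic in the number of spots; for real arrays an optimized library implementation would be the practical choice.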
Differentially expressed genes
• Rank genes by
– log2 ratios (1 array)
– z-score: $z = (M - \mathrm{mean}) / SD$ (1 array)
– $\bar{z}$, average over biological replicates (1 group of n arrays)
– t-distribution
– moderated t-test
Comparison of different groups
• 2 groups
– difference: $d = \bar{M}_A - \bar{M}_B$
– t-test: $t = \dfrac{\bar{M}_A - \bar{M}_B}{s \sqrt{1/n_A + 1/n_B}}$
• 3 or more groups
– Analysis of variance (ANOVA): $F = \dfrac{MSA}{MSE}$
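A minimal sketch of the two-group comparison for one gene. It assumes that s is the pooled standard deviation of the two groups, which the slide does not state explicitly; the function name is illustrative.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Return (d, t) for the log ratios of one gene in two groups of arrays."""
    n_a, n_b = len(group_a), len(group_b)
    d = mean(group_a) - mean(group_b)
    # Pooled variance (assumption: equal variances in both groups).
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    s = math.sqrt(pooled_var)
    t = d / (s * math.sqrt(1.0 / n_a + 1.0 / n_b))
    return d, t

# Hypothetical log ratios for one gene.
print(pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]))
```

The statistic is compared against a t-distribution with n_A + n_B - 2 degrees of freedom to obtain a p-value.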
Assigning significance
• Threshold for M: 1 (2-fold change)
• Threshold for z: 1.5
• Threshold for p-value from different tests (z-test, t-test, ANOVA): p<0.05 is considered statistically significant
• Problem of multiple testing
• False discovery rate (FDR)
• Significance analysis of microarrays (SAM)
Multiple testing
With 1000 tests at a significance level of 0.05, about 50 false positives are expected to be declared significant when all null hypotheses are true.
To account for multiple testing, the following quantities are used (V: number of false positives, R: number of tests declared significant):
• Family-wise error rate (FWER): P(V>0)
• False discovery rate (FDR): E(V/R)
Methods to correct p‐values for multiple testing
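The methods shown on this slide are not preserved in the transcript. As an illustration, the sketch below implements two widely used corrections: Bonferroni, which controls the FWER, and Benjamini-Hochberg, which controls the FDR. Treating these as the intended methods is an assumption.

```python
def bonferroni(p_values):
    """FWER control: multiply each p-value by the number of tests (capped at 1)."""
    n = len(p_values)
    return [min(1.0, p * n) for p in p_values]

def benjamini_hochberg(p_values):
    """FDR control: adjusted p-value is the running minimum of p * n / rank."""
    n = len(p_values)
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * n / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical p-values for four genes.
p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p))
print(benjamini_hochberg(p))
```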
Significance analysis of microarrays (SAM)
$d(i) = \dfrac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$
• $s(i)$: gene-specific scatter
• $s_0$: small positive constant, calculated to minimize the coefficient of variation (CV)
• $d_E(i)$ is the average of $d_P(i)$ from permuted samples
• Identify genes that deviate from $d(i) = d_E(i)$ by more than a threshold
• These do not necessarily have the largest change in expression
• The threshold can be optimized using an estimate of the false discovery rate (FDR)
Affymetrix microarrays
Affymetrix chips
Pre‐processing of Affymetrix chips
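Additive-multiplicative error model
Affymetrix approach:
– Avgdiff: average over the probes j of $d_{ij} = PM_{ij} - MM_{ij}$
– MAS 5.0: $s_i = \mathrm{TukeyBiweight}_j(\log(PM_{ij} - CT_{ij}))$
Li, Wong approach (dChip):
– $PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$
Irizarry (RMA):
– $\log(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$
– left-hand side: signal; $a_i$: log of true abundance; $b_j$: probe effect; $\varepsilon_{ij}$: random error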
Normalization of Affymetrix chips
– Global normalization
– Spline smoothing
– Cyclic LOESS
– Quantile normalization
Remove intensity‐based bias
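Of the normalization methods listed above, quantile normalization is the simplest to sketch: every array is forced to have the same intensity distribution, namely the mean of the sorted values across arrays. This is a minimal illustration that breaks ties by order of appearance; production implementations handle ties and missing values more carefully.

```python
def quantile_normalize(arrays):
    """arrays: list of equally long intensity lists, one per chip."""
    n_probes = len(arrays[0])
    sorted_cols = [sorted(col) for col in arrays]
    # Reference distribution: mean across chips at each rank.
    rank_means = [sum(col[r] for col in sorted_cols) / len(arrays)
                  for r in range(n_probes)]
    normalized = []
    for col in arrays:
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]  # replace value by the rank's mean
        normalized.append(out)
    return normalized

# Two hypothetical chips with three probes each.
print(quantile_normalize([[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]))
```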
Summarizing the probes
– Many outliers
Methods
More than 30 different methods for pre-processing, normalization, and combined analysis were compared in a benchmark (Irizarry, Bioinformatics, 2006).
Methods that showed good performance:
• RMA (Robust multiarray average)
• GCRMA (RMA with adjustment for non-specific binding based on probe sequence information)
• VSN (Variance stabilization normalization)
• PLIER (Probe Logarithmic Error Intensity Estimate)
Density plot
MA plots of chip pairs
Before quantile normalization
After quantile normalization
Quality control
Relative Log Expression (RLE)
Normalized Unscaled Standard Error (NUSE)
RMA Norm data
VSN Norm data
Sanchez‐Cabo et al., in Bioinformatics for Omics Data (ed Mayer), 2011
Software for microarray normalization
R & Bioconductor:
– Open-source statistical software
– Widely used by the microarray community
– All functions implemented and packages available
Other tools
• ArrayNorm (Pieler R. et al. Bioinformatics. 2004)
Standalone Java application for two-color microarrays
• CARMAweb (Rainer J. et al. Nucleic Acids Res. 2006)
Web application based on Bioconductor packages for one- and two-color arrays and further analysis
• GEPAS, ArrayPipe, MIDAW, RACE, Expression Profiler
Transcriptome sequencing (RNAseq)
• Potential for surveying the entire transcriptome, including novel, unannotated regions.
• Helps to identify the expression and function of regulatory non-coding RNAs (e.g. lincRNA)
• Potential for determining gene structure and isoform-level expression using reads mapping to splice junctions.
• Potential for making better presence/absence calls on regions.
• No need to design probes
• More expensive than microarrays
Transcriptome sequencing (RNAseq)
Wang et al., Nature Rev Gen, 2009
Normalization
• Reads per kilobase of transcript per million mapped reads (RPKM)
Normalization:
• Quantile normalization
• TMM (trimmed mean of M values).
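RPKM, listed above, corrects a raw read count for both transcript length and library size. A minimal sketch follows; the function and parameter names are illustrative.

```python
def rpkm(read_count, transcript_length_bp, total_mapped_reads):
    """Reads per kilobase of transcript per million mapped reads."""
    kilobases = transcript_length_bp / 1000.0
    millions = total_mapped_reads / 1_000_000.0
    return read_count / (kilobases * millions)

# Hypothetical example: 500 reads on a 2 kb transcript, 10 million mapped reads.
print(rpkm(500, 2000, 10_000_000))
```

For paired-end data the same formula applied to fragments instead of reads gives FPKM.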
Analysis steps
1. Read mapping
2. Transcriptome reconstruction
3. Expression quantification
4. Differential expression analysis
Read mapping
Garber et al., Nature Methods, 2011
Spliced aligners
Garber et al., Nature Methods, 2011
Transcriptome reconstruction
Garber et al., Nature Methods, 2011
Cufflinks
Trapnell C, Nature Biotech, 2011
FPKM
fragments per kilobase of transcript per million fragments mapped (paired-end data)
Expression quantification and differential expression
Garber et al., Nature Methods, 2011
Expression quantification
Garber et al., Nature Methods, 2011
Problem of assigning reads to correct isoform
Probabilities can be estimated with the iterative expectation-maximization (EM) algorithm, which finds maximum likelihood estimates of parameters when the model depends on unobserved latent variables.
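The following toy sketch shows the EM idea for isoform assignment. Each read is described by the set of isoforms it is compatible with; the latent variable is the isoform the read truly came from. The sketch assumes equally long isoforms and ignores mapping qualities, which real tools do not; all names are illustrative.

```python
def em_isoform_proportions(read_compat, n_isoforms, n_iter=200):
    """Estimate isoform proportions from per-read compatibility sets."""
    theta = [1.0 / n_isoforms] * n_isoforms  # start from equal proportions
    for _ in range(n_iter):
        counts = [0.0] * n_isoforms
        # E-step: split each read among compatible isoforms by current theta.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: new proportions are the normalized expected counts.
        theta = [c / len(read_compat) for c in counts]
    return theta

# Hypothetical data: 6 reads unique to isoform 0, 2 unique to isoform 1,
# and 4 reads compatible with both.
reads = [(0,)] * 6 + [(1,)] * 2 + [(0, 1)] * 4
print(em_isoform_proportions(reads, 2))
```

In this example the ambiguous reads are shared in the ratio set by the unique reads, so the estimate converges to proportions of 0.75 and 0.25.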
Pachter L, 2011
Differential expression analysis
Garber et al., Nature Methods, 2011
Isoform and gene expression
Li, Dewey, Bioinformatics, 2011
Alternative splicing
Blencowe, http://www.utoronto.ca
Differential expression analysis for sequencing count data
• Counts are discrete, non-negative, and skewed: not (log-)normally distributed, but Poisson distributed
• Sequencing depth (coverage) varies between samples
• Normalization for library size
• Large dynamic range (0 ... 10^5) between genes
Anders, Huber, EMBL
A B C D
Gene1 1 23 2 6
Gene2 0 74 8 7
Gene3 33 4 14 8
Technical and biological replicates
• Counts for the same gene from different technical replicates have a variance equal to the mean (Poisson)
• Counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion), which can be estimated with a negative binomial model
• Technical replicates are not needed (variance = mean), but biological replicates are needed to estimate the variance (dispersion) and to draw conclusions about a larger population (as for any biological experiment)
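To make the overdispersion point concrete: under the negative binomial model the variance is commonly written as mean + dispersion × mean², so a dispersion of 0 recovers the Poisson case. The sketch below estimates the dispersion of one gene by the method of moments. This per-gene estimator is an illustrative assumption; packages such as edgeR and DESeq share information across genes instead.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion: (variance - mean) / mean^2, floored at 0."""
    values = [float(c) for c in counts]
    mu = mean(values)
    if mu == 0:
        return 0.0
    var = variance(values)  # sample variance across biological replicates
    return max(0.0, (var - mu) / mu ** 2)

# Hypothetical counts for one gene in four biological replicates.
print(moment_dispersion([10, 20, 30, 40]))
```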
Nagalakshmi et al. Science, 2008
Bioconductor packages
Testing for differential signal in sequencing count data:
Based on negative-binomial distribution:
• edgeR (Robinson, McCarthy, Smyth)
• DESeq (Anders, Huber)
• DEXSeq (Reyes, Anders, Huber) for differential exon usage
• baySeq (Hardcastle, Kelly)
Based on Poisson distribution:
• DEGSeq (Wang et al.)
Tools and standards
Fileformats and standards:
• FastQ format: sequence and corresponding quality levels
• GFF/GTF files (General Feature Format, Gene Transfer Format)
• Sequence Alignment/Map (SAM/BAM) format (SAMtools (C APIs), Picard (Java APIs))
• Short Read Archive (SRA)
Tools:
• Overview: http://ngslib.i-med.ac.at, Garber et al., Nature Methods, 2011
• Base calling tools: Phred, Alta-Cyclic, … (platform specific)
• RNA-seq software: ERANGE, TopHat, Cufflinks, …
• Mapping tools: Bowtie, BWA, Eland, MAQ, SOAP2, GSNAP
• Differentially expressed genes/isoforms: DESeq, DEGseq, DEXSeq, Cuffdiff, baySeq, edgeR
• Other: BEDTools, Galaxy, HTSeq
RNAseq pipeline
Wei Sun, University of North Carolina‐Chapel Hill
Reproducibility and sensitivity of RNAseq
Mortazavi et al., Nature Methods, 2008
How many reads are needed (depth)?
two mouse libraries (ES,EB) yeast
Wang et al., Nature Rev Gen, 2009
E.g. 20-40 million reads for human
Clustering
• Unsupervised or supervised (classification)
• Agglomerative: bottom-up approach, whereby single expression profiles are successively joined to form nodes.
• Divisive: top-down approach, whereby each cluster is successively split in the same fashion until each cluster consists of a single profile.
Inter vs. intra cluster distance
Methods for unsupervised clustering
• Hierarchical Clustering
• K‐means
• Self Organizing Maps
• Model‐based methods
• Trillions of others
Data format
Mean or median centering
Similarity distance measures
• Pearson correlation
• Pearson uncentered
• Pearson squared
• Cosine correlation
• Covariance
• Euclidean distance
• Average dot product
• Manhattan distance
• Chebyshev distance
• Mutual information
• Spearman rank
• Kendall’s tau
Similarity distance measures
• Pearson correlation: $-1 \le r \le 1$
• Euclidean distance
• Manhattan distance: $d_M = \sum_{i=1}^{n} |x_i - y_i|$
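The three measures highlighted above can be sketched in a few lines for two expression profiles x and y of equal length. The function names are illustrative, and the standard textbook definitions are used for the Pearson correlation and the Euclidean distance, whose formulas are not preserved in the transcript.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient r, between -1 and 1."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

# Two hypothetical profiles with the same shape but different magnitude.
x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
print(pearson(x, y), euclidean(x, y), manhattan(x, y))
```

The example shows why the choice matters: the profiles are perfectly correlated, yet their Euclidean and Manhattan distances are not zero. For clustering, 1 - r is a common way to turn the correlation into a distance.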
Rank order correlation
• Spearman's rank correlation: $\rho = 1 - \dfrac{6 \sum_i d_i^2}{n(n^2 - 1)}$, where $d_i$ are the differences in the ranks
• Kendall's tau: $\tau = \dfrac{n_c - n_d}{n(n-1)/2}$
$n_c$ … concordant pairs (ordered the same way)
$n_d$ … discordant pairs (ordered in the opposite way)
Mutual information
• Entropy (information content): $H(A) = -\sum_i p(x_i) \log_2 p(x_i)$
$x_i$: discretized gene expression level at condition i; $p(x_i)$: probability of this state occurring
• Mutual information: $MI(A,B) = H(A) + H(B) - H(A,B)$
$H(A,B)$ … joint entropy
$MI(A,B) = 0$ means that the joint profile carries no more information than the two profiles separately
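A minimal sketch of entropy and mutual information for two already discretized expression profiles. It assumes base-2 logarithms (information in bits) and estimates probabilities by relative frequencies; the function names are illustrative.

```python
import math
from collections import Counter

def entropy(states):
    """Shannon entropy (bits) of a sequence of discrete states."""
    n = len(states)
    return -sum((c / n) * math.log2(c / n) for c in Counter(states).values())

def mutual_information(a, b):
    """MI(A,B) = H(A) + H(B) - H(A,B), with H(A,B) from paired states."""
    joint = list(zip(a, b))
    return entropy(a) + entropy(b) - entropy(joint)

# Hypothetical profiles discretized into two levels (0 = low, 1 = high).
a = [0, 0, 1, 1]
b = [0, 1, 0, 1]
print(mutual_information(a, b))  # independent profiles
print(mutual_information(a, a))  # identical profiles
```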
Missing values
• Only elements represented in both vectors are used for the distance calculation
• The greatest problems occur if the distance is not independent of the number of vector elements n, as is the case for the Euclidean distance.
Potential solutions:
1) Replace all missing values with zeros
2) Replace them with the average of all available values (row average or column average)
3) Estimate values based on the nearest neighbor, or on a group of K nearest neighbors
4) Estimate values in other ways (e.g. SVD)
Hierarchical clustering
• Agglomerative (bottom-up), unsupervised
• Cluster genes or samples (or both = biclustering)
• Distances are encoded in a dendrogram (tree)
• Cut the tree to get clusters
• Pearson correlation is the usual distance measure
• Computationally intensive (correlation matrix)
1. Identify the clusters (items) with the closest distance
2. Join them into a new cluster
3. Compute the distances between clusters (items) (see linkage)
4. Return to step 1
6 clusters
15 clusters
Linkage
Single-linkage clustering: minimal distance
Complete-linkage clustering: maximal distance
Average-linkage clustering: calculated using the average distance (UPGMA); the average is taken over distances, not over expression values
Weighted pair-group average: like UPGMA, but weighted according to cluster size
Within-groups clustering: the average of the merged cluster is used instead of the cluster elements
Ward's method: smallest possible increase in the sum of squared errors
K-means
• Partitions n genes into k clusters, where k has to be predetermined
• Minimizes the variability within clusters and maximizes it between clusters
• Moderate memory and time consumption
1. Generate random points ("cluster centers") in n dimensions (the results depend on these seeds).
2. Compute the distance of each data point to each of the cluster centers.
3. Assign each data point to the closest cluster center.
4. Compute the new cluster center position as the average of the points assigned.
5. Return to step 2; stop when the cluster centers no longer move appreciably.
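The k-means steps can be sketched as follows. Two choices here are assumptions that deviate slightly from the slide: the initial centers are drawn from the data points rather than generated as random points, and iteration stops when the assignments no longer change.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def k_means(points, k, max_iter=100, seed=0):
    """Return (centers, assignment) for a list of equally long tuples."""
    rng = random.Random(seed)  # results depend on these seeds
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Steps 2 and 3: assign each point to its closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # nothing changed, so the centers will not move either
        assignment = new_assignment
        # Step 4: move each center to the average of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = [sum(v) / len(members) for v in zip(*members)]
    return centers, assignment

# Hypothetical 2D profiles forming two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = k_means(points, 2)
print(assignment)
```

Because the outcome depends on the seeds, it is common to repeat the run with several seeds and keep the solution with the smallest within-cluster variability.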
How to choose k
Figure of Merit (FOM)
Self organizing maps (SOM)
• Neural network approach
• Usually one- or two-dimensional map
• Hexagonal or rectangular net topology
• Moderate memory and time consumption
• Number of clusters has to be specified!
Self organizing maps (SOM)
1. Generate a simple (usually 2D) grid of nodes (x,y)
2. Map the nodes into n-dimensional expression vectors, initially at random, e.g. (x,y) -> [0 0 0 x 0 0 0 y 0 0 0 0 0]
3. For each data point P, change all node positions so that they move towards P. Closer nodes move more than far nodes.
4. Iterate for a maximum number of iterations, and then assess the position of all nodes.
$f_{i+1}(N) = f_i(N) + t(d(N, N_P), i) \cdot [P - f_i(N)]$
• $f_i(N)$ = position of node N at iteration i
• P = position of the current data point
• $P - f_i(N)$ = vector from N to P
• t = weighting factor or "learning rate"; it dictates how much N moves towards P
• $t(d(N, N_P), i) = 0.02\,T / (T + 100\,i)$ for $d(N, N_P)$ < cutoff radius, else 0; it decreases with the iteration and with the distance of N to P
• T = maximum number of iterations
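The update rule can be sketched directly. The following assumptions are not stated on the slide: N_P is taken to be the node whose vector is currently closest to P, d(N, N_P) is the Euclidean distance on the grid, the cutoff radius of 1.5 is arbitrary, and data points are presented in random order.

```python
import math
import random

def learning_rate(grid_distance, iteration, max_iter, cutoff=1.5):
    """t(d, i) = 0.02 * T / (T + 100 * i) inside the cutoff radius, else 0."""
    if grid_distance < cutoff:
        return 0.02 * max_iter / (max_iter + 100 * iteration)
    return 0.0

def update_node(f, p, t):
    """f_{i+1}(N) = f_i(N) + t * [P - f_i(N)]."""
    return [fj + t * (pj - fj) for fj, pj in zip(f, p)]

def train_som(data, grid_w, grid_h, max_iter, cutoff=1.5, seed=0):
    """Return a dict mapping grid coordinates (x, y) to expression vectors."""
    rng = random.Random(seed)
    dim = len(data[0])
    nodes = {(x, y): [rng.random() for _ in range(dim)]
             for x in range(grid_w) for y in range(grid_h)}
    for i in range(max_iter):
        p = data[rng.randrange(len(data))]
        # Assumed N_P: the node closest to P in expression space.
        winner = min(nodes, key=lambda g: sum((a - b) ** 2
                                              for a, b in zip(nodes[g], p)))
        for g in list(nodes):
            t = learning_rate(math.dist(g, winner), i, max_iter, cutoff)
            nodes[g] = update_node(nodes[g], p, t)
    return nodes
```

After training, each data point is assigned to its closest node, so the number of grid nodes fixes the number of clusters.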
Principal component analysis (PCA)
PCA is a data reduction technique that simplifies multidimensional data sets to a smaller number of dimensions (r<n).
Variables are summarized by linear combinations, the principal components. The origin of the coordinate system is moved to the center of the data (mean centering). The coordinate system is then rotated so that the first axis captures the maximum of the variance.
Subsequent principal components are orthogonal to the 1st PC. The first 2 PCs often already explain 80-90% of the variance.
This analysis can be done by a special matrix decomposition (singular value decomposition, SVD).
Singular value decomposition (SVD)
$X = U S V^T$ with $U U^T = V^T V = V V^T = I$
For mean-centered data X (variables in rows, samples in columns), the covariance matrix C can be calculated, up to a constant factor, as $X X^T$. The columns of U are the eigenvectors of $X X^T$, and the eigenvalues, defined by the characteristic equation $|C - \lambda I| = 0$, are the squares of the diagonal entries of S.
The input vectors are transformed into the principal component space by $Y = U^T X$, where the projection of sample i along the axis defined by the j-th PC is $y_{ji} = \sum_k u_{kj} x_{ki}$.
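A short derivation shows why the singular values measure the variance captured by each principal component. It assumes the convention that the mean-centered X holds variables in rows and samples in columns, that the projection is written as $Y = U^T X$, and that U and V have orthonormal columns ($U^T U = V^T V = I$).

```latex
\begin{align*}
Y &= U^T X = U^T U S V^T = S V^T \\
Y Y^T &= S V^T V S^T = S S^T = \mathrm{diag}(s_1^2, \dots, s_r^2) \\
\text{fraction of variance explained by PC } j &= \frac{s_j^2}{\sum_k s_k^2}
\end{align*}
```

The projected coordinates are therefore uncorrelated, and ranking the components by $s_j^2$ ranks them by explained variance.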
Correspondence analysis (CA)
• Correspondence analysis is an exploratory computational method for the study of associations between variables.
• The approach is a combination of using the χ2 statistic and singular value decomposition (SVD) similar to that for principal component analysis.
• Like principal component analysis, it displays a low‐dimensional projection of the data
• It does this, though, for two variables simultaneously, thus revealing associations between them
Gene expression terrain maps
Clustering vs. classification
• Clustering uses the primary data to group together measurements, with no information from other sources. Often called unsupervised machine learning. It is not biased by previous knowledge, but therefore needs a stronger signal to discover clusters.
• Classification uses known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and creates rules for associating the data with the groups of interest. Often called supervised machine learning. It uses previous knowledge, so it can detect a weaker signal, but it may be biased by wrong previous knowledge.
Methods for classification
• K-nearest neighbors
• Linear models
• Discriminant analysis
• Logistic regression
• Naïve Bayes
• Decision trees
• Random forests
• Support vector machines
Assess classifier performance
Confusion matrix: columns = truly in group1 / group2; rows = classified in group1 / group2
Receiver operating characteristic (ROC)
Patients with T4 values of 5 or less are considered to be hypothyroid.
Cross validation
Holdback cross-validation
K-fold cross-validation; leave-k-out cross-validation (LKOCV)
If k=1, leave-k-out is called leave-one-out cross-validation (LOOCV)
Bias-variance trade-off
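A minimal sketch of how the samples could be split for cross-validation. It generates k-fold splits by sample index; with as many folds as samples, each test set holds exactly one sample, which corresponds to LOOCV. The function name and the interleaved fold layout are illustrative choices, and in practice the samples would be shuffled first.

```python
def k_fold_splits(n_samples, n_folds):
    """Yield (train_indices, test_indices) for each of the n_folds folds."""
    indices = list(range(n_samples))
    # Interleaved folds: sample i goes to fold i % n_folds.
    folds = [indices[f::n_folds] for f in range(n_folds)]
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        yield train_idx, test_idx

# 3-fold cross-validation on 6 hypothetical samples.
for train_idx, test_idx in k_fold_splits(6, 3):
    print(train_idx, test_idx)
```

The classifier is trained on each training set and evaluated only on the corresponding held-out samples, so every sample is used for testing exactly once.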
Support vector machines (SVM)
Software
Saeed AI. et al. Biotechniques 2004
Sturn A. et al. Bioinformatics 2002
Commercial
Open source
Biological meaning of the gene sets
?
• Gene ontology terms
• Pathway mapping
• Linking to Pubmed abstracts or associated MESH terms
• Regulation by the same transcription factor (module)
• Protein families and domains
• Gene set enrichment analysis
• Over representation analysis
Gene Ontology
The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.
The three organizing principles of GO are
• cellular component (e.g. mitochondrion)
• biological process (e.g. lipid metabolism)
• molecular function (e.g. hydrolase activity)
Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn and a term name (e.g. fibroblast growth factor receptor binding).
URL: http://www.geneontology.org/
Directed acyclic graph (DAG)
• 2 relations: part_of, is_a
• different levels
Evidence codes for GO annotations
ISS Inferred from Sequence Similarity
IEP Inferred from Expression Pattern
IMP Inferred from Mutant Phenotype
IGI Inferred from Genetic Interaction
IPI Inferred from Physical Interaction
IDA Inferred from Direct Assay
RCA Inferred from Reviewed Computational Analysis
TAS Traceable Author Statement
NAS Non-traceable Author Statement
IC Inferred by Curator
ND No biological Data available
Gene Ontology Browser (Amigo)
Gene ontology functionality (Genesis)
cell cycle, mitosis, cytokinesis
nucleus
Cluster 05 includes many cell cycle genes
GO terms for gene sets
Overrepresentation analysis
m: genes in the gene universe (whole microarray)
g: all genes with the GO term
c: genes in the cluster (gene list)
i: genes in the cluster with the GO term
Over representation analysis (ORA)
• Fisher exact test for contingency table
• Hypergeometric test
• Example from http://genome.tugraz.at/ORA
$p = \dfrac{\binom{g}{i} \binom{m-g}{c-i}}{\binom{m}{c}}$
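A minimal sketch of the hypergeometric calculation with the symbols used above (m genes in the universe, g genes with the GO term, c genes in the cluster, i genes in the cluster with the GO term). Summing the probabilities for i or more annotated genes to obtain an enrichment p-value is an assumption of this sketch; it corresponds to a one-sided Fisher exact test.

```python
from math import comb

def hypergeom_pmf(m, g, c, i):
    """Probability of exactly i annotated genes in a cluster of size c."""
    return comb(g, i) * comb(m - g, c - i) / comb(m, c)

def enrichment_p_value(m, g, c, i):
    """Probability of observing i or more annotated genes by chance."""
    upper = min(g, c)
    return sum(hypergeom_pmf(m, g, c, j) for j in range(i, upper + 1))

# Hypothetical example: 6000 genes on the array, 100 with the GO term,
# a cluster of 50 genes containing 8 genes with the term.
print(enrichment_p_value(6000, 100, 50, 8))
```

When many GO terms are tested for the same gene list, the resulting p-values are subject to the multiple testing problem discussed earlier.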