
5/24/2014


High‐throughput sequencing and expression analysis

Glossary for high‐throughput sequencing experiments

• Library
• Run
• Flowcell
• Lane (channel)
• Read
• Read length (25‐400 bp)
• Coverage
• Depth
• Barcoding (run >1 sample in 1 lane, multiplexing)
• Paired‐end
• Sequencing by synthesis (SBS)
• Sequencing by ligation (SBL)
• Emulsion PCR
• Single molecule sequencing (no amplification, 3rd generation)

millions of sequences (reads)


Sequencing technologies

• Sanger (ABI, Life Technologies)
• 454 (Roche)
• Solexa (Illumina)
• Solid (Life Technologies)
• Polonator (Church Lab)
• HeliScope (Helicos)
• Pacific Biosciences SMRT
• Ion Torrent (Life Technologies)
• Complete Genomics
• Nanopore Sequencing (IBM & Roche)

Metzker, Nature Rev Genet, 2010
Shendure, Ji, Nature Biotech, 2008
Mardis, Annu Rev Genomics Hum Genet, 2008

Sanger sequencing

• DNA is fragmented

• Cloned to a plasmid vector, transform bacteria, growth (automated colony picking)

• Using fluorophore labeled dideoxy nucleoside triphosphates (ddNTPs) for chain termination

• Fluorescent readout with capillary electrophoresis (up to 384 capillaries)


454 (Roche)

• First on the market (2004)
• Pyrosequencing
• Emulsion PCR
• ~400 bp reads
• 400,000 reads in parallel
• Homopolymers can be an issue

• One DNA molecule per bead

454 Pyrosequencing

PPi... Pyrophosphate

Agah A et al. Nucl. Acids Res. 2004


Solexa (Illumina)

1. Prepare genomic DNA sample

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature double stranded DNA

6. Complete amplification


Solid (Life Technologies)

• Sequencing by ligation
• 2‐base encoding (color space)
• => high accuracy
• The first base must be known to decode the color sequence
• A true base change requires an adjacent valid color change
• Errors do not have compensatory color changes


HeliScope (Helicos)

• 1,000,000,000 reads per experiment
• >100 sample preparation scalability
• No amplification
• Digital quantification
• 1 labeled base at a time

Pacific Biosciences (Single Molecule Real‐Time, SMRT)

• Zero‐mode waveguide (ZMW) cells
• Polymerase immobilized on a solid surface
• Read length >1 kb possible
• No amplification
• Phospholinked nucleotides (fluorophores)


Pacific Biosciences

• Long read length activity (>1kb) of DNA polymerase

• Circular sequencing could improve quality by reading the same base multiple times

Eid et al. Science, 2009

Ion Torrent

• Sequencing on a semiconductor device
• pH sensor
• Homopolymer issues
• Cheap chemistry
• Run time ~2 hours
• ~200 bp reads
• 11 million wells, >1 Gb (Ion 318™ Chip)

Rothberg et al., Nature, 2011


Summary of sequencing technologies

Platform | Amplification / chemistry | Detection | Read length (bp) | Reads per lane / no. of lanes | Run time (d) | Gb per run | Machine cost (k$) | Cost per Mb
----------------------------------------------------------------------------------------------------------------------------------------------
Sanger/ABI | PCR, dideoxy | Fluor. | 800 | 384 capil. | 0.01 | 0.1 | 110 | <5000
454/Roche | emPCR, synthesis | Lumin. | 400 (SE) | 1×10^6 | 0.35 | 0.5 | 500 | <10
Solexa/Illumina | Bridge ampl., synthesis | Fluor. | 100 (PE) | 50×10^6 / 8 (×2) | 4 | 500 | 540 | <2
Solid/Life Tech | emPCR, ligation | Fluor. | 75 (PE) | 40×10^6 / 8 (×2) | 7 | 300 | 595 | <2
Polonator | emPCR, ligation | Fluor. | 13 (PE) | 10×10^6 / 8 (×2) | 5 | 10 | 170 | <1
HeliScope/Helicos | No ampl., synthesis | Fluor. | 35 (SE) | 20×10^6 / 25 (×2) | 8 | 30 | 999 | <0.6
Pacific Biosciences SMRT | No ampl., synthesis | Fluor. | >1000 (SE) | 150.00 per SMRT cell | N/A | N/A | N/A | <1
Ion Torrent | emPCR, synthesis | pH | 200 (PE) | 11×10^6 per chip | <2h | 1 | 50 | <1
Complete Genomics | DNA nanoballs, ligation | Fluor. | Service: 40X human genome coverage, >90% of the full genome res., 400 human genomes per month

* Values are changing rapidly, might be outdated, and depend on instrument type and version.

Base calling (Phred score)

Phred quality score Q and base‐calling error probability P:

$$Q_{\mathrm{Phred}} = -10 \log_{10} P \qquad Q_{\mathrm{Solexa}} = -10 \log_{10} \frac{P}{1 - P}$$

For P = 0.05 the Phred quality score is Q ≈ 13.
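The two quality definitions can be checked numerically. The following is a minimal sketch; the function names are illustrative and not taken from any base-calling tool.

```python
import math

def phred_quality(p):
    """Phred quality score: Q = -10 * log10(P)."""
    return -10.0 * math.log10(p)

def solexa_quality(p):
    """Solexa quality score: Q = -10 * log10(P / (1 - P))."""
    return -10.0 * math.log10(p / (1.0 - p))

def phred_error_probability(q):
    """Invert the Phred definition: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)

print(round(phred_quality(0.05), 2))   # 13.01
print(round(solexa_quality(0.05), 2))  # 12.79
```

The two scores are close for small P and diverge as P grows, because the Solexa score uses the odds P/(1 - P) instead of P.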


Base calling (FastQ format)

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

Quality scores are encoded in ASCII
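A FASTQ record has four lines: header, sequence, separator, and quality string. The sketch below decodes the quality string into numeric scores. The ASCII offset of 33 is an assumption (the common Sanger/Phred+33 convention); older Solexa/Illumina files use an offset of 64. The function name is hypothetical.

```python
def parse_fastq_record(lines, offset=33):
    """Split one 4-line FASTQ record and decode its quality string.

    offset=33 is an assumed encoding (Phred+33); pass 64 for older
    Solexa/Illumina files.
    """
    header, sequence, separator, quality = [line.strip() for line in lines]
    if not header.startswith("@") or not separator.startswith("+"):
        raise ValueError("not a FASTQ record")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality lengths differ")
    scores = [ord(ch) - offset for ch in quality]
    return header[1:], sequence, scores

record = [
    "@EAS54_6_R1_2_1_413_324",
    "CCCTTCTTGTCTTCAGCGTTTCTCC",
    "+",
    ";;3;;;;;;;;;;;;7;;;;;;;88",
]
name, seq, scores = parse_fastq_record(record)
print(name, len(seq), min(scores), max(scores))
```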

Paired‐end sequencing

• Enables both ends of the DNA fragment to be sequenced

• Because the distance between the two reads of a pair is known, alignment algorithms can use this information to map reads across repetitive regions and to resolve insertions and deletions more precisely


Multiplex sequencing (barcodes)

Margolies, NHGRI

RNA expression profiling

• Northern blotting
– semi‐quantitative
– few genes

• Real time RT‐PCR (qPCR)
– medium throughput
– 96/384 per run

• Microarray analysis
– high throughput
– 10,000 to 500,000 elements per chip

• RNA seq
– high throughput
– deep sequencing (short reads, 25 bp)


Microarray technology

• Two‐color microarrays (Custom, Agilent)

– Spotted oligonucleotides

– Spotted cDNA

• One color microarrays (Affymetrix, ABI)

– In situ synthesized oligonucleotides

• Other types of microarrays

– Exon microarrays

– Tiling microarrays

– Protein (antibody) microarray

– Tissue microarrays

Array types

3′ arrays
e.g. Affymetrix U133 Plus 2.0 array
Application: gene expression

Exon arrays
e.g. Affymetrix Exon 1.0 ST array
Application: alternative splicing, transcript expression

microRNA arrays
e.g. Exiqon, using locked nucleic acids (LNA)
Application: mature microRNA profiling
Problem: short, very similar sequences

Tiling arrays (target genomic DNA)
e.g. Nimblegen
Applications: ChIP on chip, array CGH


Two‐color microarrays

• DNA fragments corresponding to a known sequence are mechanically deposited onto a glass slide

• The fragments can be :

– Oligonucleotides, 60‐80 nt long (60‐ to 80‐mers)
– cDNA fragments from a library (varying lengths)

• Two samples of reverse‐transcribed mRNA are labelled with two different colors and co‐hybridized onto the slide



Labels



Experimental design

• Replicates
– Biological replicates (independent experiments)
– Technical replicates (not independent!): replicate arrays (dye swap), replicate spots

• Reference design versus loop design
– Which reference sample?

• Constraints
– Sample material (e.g. biopsies): pooling
– Costs (custom 100 €, Affx 500 €, ABI 1k €)

Analytical pipeline


Image analysis and background correction

• Software available (GenePix, ImaGene, Agilent)

• Steps:
– Gridding: assigns coordinates and gene information to the different spots
– Segmentation: foreground vs. background
– Intensity extraction

• Background‐corrected intensities of spot k (foreground minus background):

$$G_k = F_{532,\mathrm{mean}} - B_{532} \qquad R_k = F_{635,\mathrm{mean}} - B_{635}$$

• Many other parameters are reported: F635 % Sat., Flags, B532 SD, …

Normalization

• Removal of all sources of systematic non‐biological variability and the reduction of the random errors.

• Basic assumption is that most of the genes are not changing their expression during the studied process

• Amount of total RNA for both samples is the same


Scatterplot and histogram

Box plot


MA plot

M = log2(R/G)

A = log2(R*G)/2
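As a small illustration, M and A can be computed per spot from background-corrected red and green intensities. This is a sketch with made-up intensities; the function name is hypothetical.

```python
import math

def ma_values(red, green):
    """Return (M, A) lists with M = log2(R/G) and A = log2(R*G)/2."""
    m = [math.log2(r / g) for r, g in zip(red, green)]
    a = [0.5 * math.log2(r * g) for r, g in zip(red, green)]
    return m, a

# Illustrative intensities for three spots (not real data).
m, a = ma_values([400.0, 100.0, 800.0], [100.0, 100.0, 200.0])
print(m)  # [2.0, 0.0, 2.0]
```

M is the log ratio (a value of 2 is a 4-fold change), and A is the average log intensity, so plotting M against A shows whether the ratio depends on intensity.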

Intensity dependent normalization

• Apply a locally weighted polynomial regression for a fixed subset of genes in the neighborhood of every gene i (LOWESS).

• Weight function:


Differentially expressed genes

• Rank genes by
– log2 ratios M
– z‐score (1 array): $z = (M - \mathrm{mean}(M)) / \mathrm{SD}(M)$
– $\bar{z}$, the z‐score averaged over biological replicates (1 group of n arrays)
– t‐statistic (t‐distribution)
– moderated t‐test

Comparison of different groups

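• 2 groups
– difference: $d = \bar{M}_A - \bar{M}_B$
– t‐test: $t = \dfrac{\bar{M}_A - \bar{M}_B}{s^{*} \sqrt{1/n_A + 1/n_B}}$, where $s^{*}$ is the pooled standard deviation

• 3 or more groups
– Analysis of variance (ANOVA): $F = \dfrac{MSA}{MSE}$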
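For the two-group case, the difference of means and the t statistic with a pooled standard deviation can be sketched as follows. The pooled-variance form is an assumption about which t-test variant is meant, and the function name is hypothetical.

```python
import math
from statistics import mean, variance

def pooled_t_statistic(group_a, group_b):
    """Return (d, t) for two groups of log ratios, using a pooled variance."""
    n_a, n_b = len(group_a), len(group_b)
    d = mean(group_a) - mean(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    t = d / (math.sqrt(pooled_var) * math.sqrt(1.0 / n_a + 1.0 / n_b))
    return d, t

# Illustrative log ratios of one gene in two groups of three arrays.
d, t = pooled_t_statistic([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
print(d, round(t, 4))  # -1.0 -1.2247
```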


Assigning significance

• Threshold for M: 1 (2-fold change)

• Threshold for z: 1.5

• Threshold for p-value from different tests (z-test, t-test, ANOVA): p<0.05 is considered statistically significant

• Problem of multiple testing

• False discovery rate (FDR)

• Significance analysis of microarrays (SAM)

Multiple testing

With 1,000 tests at a significance level of 0.05, about 50 false positives are expected to be declared significant even if no gene is truly changed.

To account for multiple testing, the following error measures are used (V = number of false positives, R = number of tests declared significant):

• Family‐wise error rate (FWER): P(V > 0)

• False discovery rate (FDR): E(V/R)


Methods to correct p‐values for multiple testing
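Two widely used corrections are sketched here as examples: the Bonferroni correction, which controls the family-wise error rate, and the Benjamini-Hochberg procedure, which controls the false discovery rate. Choosing these two methods is an assumption; the slide does not name the methods it compares.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: p * m, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

p = [0.01, 0.04, 0.03, 0.005]
print(bonferroni(p))
print(benjamini_hochberg(p))
```

Genes whose adjusted p-value is below the chosen level (e.g. 0.05) are declared significant; the Benjamini-Hochberg values are never larger than the Bonferroni values.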

Significance analysis of microarrays (SAM)

$$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$$

• s(i): gene‐specific scatter

• s0: small positive constant, calculated to minimize the coefficient of variation (CV)

• dE(i) is the average of dP(i) from permuted samples

• Identify genes that deviate from d(i) = dE(i) by more than a threshold

• These do not necessarily have the largest change in expression

• The threshold can be optimized with an estimate of the false discovery rate (FDR)


Affymetrix microarrays







Affymetrix chips


Pre‐processing of Affymetrix chips

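Additive‐multiplicative error model (i = array, j = probe):

signal = log of true abundance + probe effect + random error

Affymetrix approach:

– Avgdiff: average of $d_{ij} = PM_{ij} - MM_{ij}$ over the probes of a probe set

– MAS 5.0: $s = \mathrm{TukeyBiweight}(\log(PM_{ij} - CT_{ij}))$

Li, Wong approach (dChip):

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \varepsilon_{ij}$$

Irizarry (RMA):

$$\log(PM_{ij} - BG) = a_i + b_j + \varepsilon_{ij}$$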

Normalization of Affymetrix chips

Remove intensity‐based bias:

– Global normalization

– Splines smooth

– Cyclic LOESS

– Quantile normalization

Summarizing the probes:

– Many outliers
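Quantile normalization forces every array to have the same intensity distribution: each value is replaced by the mean of all values that share its rank across arrays. The following is a minimal sketch that ignores ties; the function name and the toy values are illustrative.

```python
def quantile_normalize(columns):
    """Quantile-normalize arrays given as equally long lists (one per chip).

    Ties are not treated specially in this sketch.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Mean intensity at each rank across all arrays.
    rank_means = [sum(col[r] for col in sorted_cols) / len(columns)
                  for r in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

chips = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
print(quantile_normalize(chips))  # [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```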


Methods

More than 30 different methods for pre‐processing, normalization, and combined analysis were studied in a benchmark (Irizarry, Bioinformatics, 2006). Methods that showed good performance:

• RMA (Robust multiarray average)

• GCRMA (RMA with adjustment for non‐specific binding based on probe sequence information)

• VSN (Variance stabilization normalization)

• PLIER (Probe Logarithmic Error Intensity Estimate)

Density plot


MA plots of chip pairs

Before quantile normalization

After quantile normalization

Quality control

Relative Log Expression (RLE)

Normalized Unscaled Standard Error (NUSE)

RMA Norm data

VSN Norm data

Sanchez‐Cabo et al., in Bioinformatics for Omics Data (ed Mayer), 2011


Software for microarray normalization

R & Bioconductor:

– Open source statistical program

– Mostly used by the microarray community

– All functions implemented and packages available

Other tools

• ArrayNorm (Pieler R. et al. Bioinformatics. 2004): standalone Java application for two‐color microarrays

• CARMAweb (Rainer J. et al. Nucleic Acids Res. 2006): web application based on Bioconductor packages for one‐ and two‐color arrays and further analysis

• GEPAS, ArrayPipe, MIDAW, RACE, Expression Profiler

Transcriptome sequencing (RNAseq)

• Potential for surveying the entire transcriptome, including novel, un‐annotated regions.

• Helps to identify the expression and function of regulatory non‐coding RNAs (e.g. lincRNA)

• Potential for determining gene structure and isoform‐level expression using reads mapping to splice junctions.

• Potential for making better presence/absence calls on regions.

• More expensive than microarrays

• No need to design probes


Transcriptome sequencing (RNAseq)

Wang et al., Nature Rev Gen, 2009

Normalization

• For gene length and sequencing depth: reads per kilobase per million mapped reads (RPKM)

• Between samples:

– Quantile normalization

– TMM (trimmed mean of M values)
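RPKM divides the read count of a gene by its length in kilobases and by the number of mapped reads in millions. The function below is a minimal sketch; the name and the example numbers are illustrative.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# 500 reads on a 2 kb gene in a library of 20 million mapped reads.
print(rpkm(500, 2000, 20_000_000))  # 12.5
```

The same formula applied to fragments instead of reads gives FPKM for paired-end data.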

Analysis steps

1. Read mapping

2. Transcriptome reconstruction

3. Expression quantification

4. Differential expression analysis

Read mapping

Garber et al., Nature Methods, 2011


Spliced aligners


Transcriptome reconstruction



Cufflinks

Trapnell C, Nature Biotech, 2011

FPKM

fragments per kilobase of transcript per million fragments mapped (paired-end data)

Expression quantification and differential expression



Expression quantification


Problem of assigning reads to correct isoform

Probabilities can be estimated by an iterative expectation‐maximization (EM) algorithm, which finds maximum likelihood estimates of parameters when the model depends on unobserved latent variables.

Pachter L, 2011
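The idea of the EM iteration can be sketched on a toy problem. Each read is compatible with one or more isoforms; the E-step distributes each read over its compatible isoforms in proportion to the current abundance estimates, and the M-step re-estimates the abundances from these fractional counts. This sketch assumes equal isoform lengths and error-free reads, which real tools do not; all names are hypothetical.

```python
def em_isoform_abundance(read_compatibility, n_isoforms, n_iterations=100):
    """Estimate relative isoform abundances from ambiguous read assignments.

    read_compatibility: one collection of isoform indices per read.
    Assumes equal isoform lengths (a simplification for illustration).
    """
    theta = [1.0 / n_isoforms] * n_isoforms
    for _ in range(n_iterations):
        counts = [0.0] * n_isoforms
        # E-step: split each read among its compatible isoforms.
        for compatible in read_compatibility:
            total = sum(theta[k] for k in compatible)
            for k in compatible:
                counts[k] += theta[k] / total
        # M-step: abundances are the normalized fractional counts.
        theta = [c / len(read_compatibility) for c in counts]
    return theta

# 6 reads map to both isoforms, 3 only to isoform 0, 1 only to isoform 1.
reads = [(0, 1)] * 6 + [(0,)] * 3 + [(1,)]
theta = em_isoform_abundance(reads, n_isoforms=2)
print([round(t, 3) for t in theta])  # [0.75, 0.25]
```

The ambiguous reads end up split 3:1, following the evidence from the uniquely assigned reads.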


Differential expression analysis


Isoform and gene expression

Li, Dewey, Bioinformatics, 2011


Alternative splicing

Blencowe, http://www.utoronto.ca

Differential expression analysis for sequencing count data

• Discrete, non‐negative, skewed: not (log‐)normally distributed, but Poisson distributed

• Sequencing depth (coverage) varies between samples

• Normalization for library size

• Large dynamic range (0 ... 10^5) between genes

Anders, Huber, EMBL

A B C D

Gene1 1 23 2 6

Gene2 0 74 8 7

Gene3 33 4 14 8


Technical and biological replicates

• Counts for the same gene from different technical replicates have a variance equal to the mean (Poisson)

• Counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion); this can be modeled with a negative‐binomial model

• No need for technical replicates (variance = mean), but biological replicates are needed to estimate the variance (dispersion) and to draw conclusions for a larger population (as for any biological experiment)

Nagalakshmi et al. Science, 2008

Bioconductor packages

Testing for differential signal in sequencing count data:

Based on negative-binomial distribution:

• edgeR (Robinson, McCarthy, Smyth)
• DESeq (Anders, Huber)
• DEXSeq (Reyes, Anders, Huber) for differential exon usage
• baySeq (Hardcastle, Kelly)

Based on Poisson distribution:

• DEGSeq (Wang et al.)


Tools and standards

File formats and standards:

• FastQ format: sequence and corresponding quality levels
• GFF/GTF files (General Feature Format, Gene Transfer Format)
• Sequence Alignment/Map (SAM/BAM) format (SAMtools (C APIs), Picard (Java APIs))
• Short Read Archive (SRA)

Tools:

• Overview: http://ngslib.i‐med.ac.at, Garber et al., Nature Methods, 2011
• Base calling tools: Phred, Alta‐Cyclic, … (platform specific)
• RNA‐seq software: ERANGE, TopHat, Cufflinks …
• Mapping tools: Bowtie, BWA, Eland, MAQ, SOAP2, GSNAP
• Differentially expressed genes/isoforms: DESeq, DEGseq, DEXSeq, Cuffdiff, baySeq, edgeR
• Other: BEDTools, Galaxy, HTSeq

RNAseq pipeline

Wei Sun, University of North Carolina‐Chapel Hill


Reproducibility and sensitivity of RNAseq

Mortazavi et al., Nature Methods, 2008

How many reads are needed (depth)?

two mouse libraries (ES,EB) yeast

Wang et al., Nature Rev Gen, 2009

E.g. 20 to 40 million reads for human


Clustering

• Unsupervised or supervised (classification)

• Agglomerative: bottom‐up approach, whereby single expression profiles are successively joined to form nodes.

• Divisive: top‐down approach, whereby each cluster is successively split in the same fashion until each cluster consists of one single profile.

Inter vs. intra cluster distance


Methods for unsupervised clustering

• Hierarchical Clustering

• K‐means

• Self Organizing Maps

• Model‐based methods

• Trillions of others

Data format


Mean or median centering


Similarity distance measures

• Pearson correlation

• Pearson uncentered

• Pearson squared

• Cosine correlation

• Covariance

• Euclidean distance

• Average dot product

• Manhattan distance

• Chebychev distance

• Mutual information

• Spearman rank

• Kendall’s tau

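Similarity distance measures

• Pearson correlation: $r = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$, with $-1 \le r \le 1$

• Euclidean distance: $d_E = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$

• Manhattan distance: $d_M = \sum_{i=1}^{n} |x_i - y_i|$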
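The three most common measures are easy to compute directly for two expression profiles x and y. The following is a sketch using only the standard library; the function names are illustrative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient, between -1 and 1."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 4.0, 6.0, 8.0]
print(round(pearson(x, y), 6), round(euclidean(x, y), 3), manhattan(x, y))
```

The two profiles have the same shape but different magnitude, so the correlation is 1 while both distances are clearly non-zero; this is why correlation is often preferred for clustering expression profiles.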


Rank order correlation

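• Spearman’s rank correlation: $r_s = 1 - \dfrac{6 \sum_{i=1}^{n} d_i^2}{n (n^2 - 1)}$, where $d_i$ are the differences in the ranks

• Kendall’s tau: $\tau = \dfrac{n_c - n_d}{n (n - 1) / 2}$

– $n_c$ … concordant pairs (ordered the same way)
– $n_d$ … discordant pairs (ordered in the opposite way)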
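Both rank correlations can be computed with a few lines of code. This sketch assumes there are no tied values (the simple Spearman formula based on rank differences only holds in that case); the function names are illustrative.

```python
def ranks(values):
    """Rank positions (1 = smallest); assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        result[idx] = rank
    return result

def spearman(x, y):
    """Spearman rank correlation from the squared rank differences."""
    n = len(x)
    d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

def kendall_tau(x, y):
    """Kendall's tau from concordant and discordant pairs."""
    n = len(x)
    nc = nd = 0
    for i in range(n):
        for j in range(i + 1, n):
            s = (x[i] - x[j]) * (y[i] - y[j])
            if s > 0:
                nc += 1
            elif s < 0:
                nd += 1
    return (nc - nd) / (n * (n - 1) / 2)

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(spearman(x, y), kendall_tau(x, y))  # 0.8 0.6
```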

Mutual information

• Entropy (information content): $H(A) = -\sum_i p(x_i) \log_2 p(x_i)$

– $x_i$: discretized gene expression level at condition i
– $p(x_i)$: probability of this state occurring

• Mutual information: $MI(A,B) = H(A) + H(B) - H(A,B)$

– H(A,B) … joint entropy
– MI(A,B) = 0 means that the two profiles share no information: the joint entropy equals the sum of the two individual entropies
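Entropy and mutual information can be estimated from discretized profiles by counting how often each state occurs. This is a minimal sketch that takes already discretized values; the function names are illustrative, and probabilities are estimated as simple relative frequencies.

```python
import math
from collections import Counter

def entropy(states):
    """Shannon entropy (bits) of a sequence of discrete states."""
    n = len(states)
    return -sum((c / n) * math.log2(c / n) for c in Counter(states).values())

def mutual_information(a, b):
    """MI(A,B) = H(A) + H(B) - H(A,B) for two discretized profiles."""
    return entropy(a) + entropy(b) - entropy(list(zip(a, b)))

a = [0, 0, 1, 1]
b = [0, 0, 1, 1]   # identical to a: shares all its information
c = [0, 1, 0, 1]   # independent of a
print(mutual_information(a, b), mutual_information(a, c))  # 1.0 0.0
```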


Missing values

• Only elements represented in both vectors are used for the distance calculation

• The greatest problems occur if the distance is not independent of the number of vector elements n, as is the case for the Euclidean distance.

Potential solutions:

1) Put zeros in all missing values

2) Put average of all values that are available = row average or column average

3) Estimate values based on the nearest neighbor, or a group of K nearest neighbors

4) Estimate value in others ways (e.g. SVD)

Hierarchical clustering

• Agglomerative (bottom‐up), unsupervised
• Cluster genes or samples (or both = biclustering)
• Distances are encoded in a dendrogram (tree)
• Cut the tree to get clusters
• Pearson correlation is the usual similarity measure
• Computationally intensive (correlation matrix)

1. Identify the clusters (items) with the closest distance
2. Join them into a new cluster
3. Compute the distance between clusters (items) (see linkage)
4. Return to step 1

Example cuts of the tree: 6 clusters, 15 clusters


Linkage

Single‐linkage clustering: minimal distance

Complete‐linkage clustering: maximal distance

Average‐linkage clustering: calculated using the average distance (UPGMA); the average is taken over distances, not over expression values

Weighted pair‐group average: like UPGMA, but weighted according to cluster size

Within‐groups clustering: the average of the merged cluster is used instead of the cluster elements

Ward’s method: smallest possible increase in the sum of squared errors

K‐means

• Partitions n genes into k clusters, where k has to be predetermined

• K‐means clustering minimizes the variability within clusters and maximizes the variability between clusters

• Moderate memory and time consumption

1. Generate random points (“cluster centers”) in n dimensions (results depend on these seeds).

2. Compute the distance of each data point to each of the cluster centers.

3. Assign each data point to the closest cluster center.

4. Compute the new cluster center positions as the average of the points assigned.

5. Loop to (2); stop when the cluster centers no longer move appreciably.
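The k-means loop can be sketched in plain Python. One deviation from the description is labeled in the code: the initial centers are drawn from the data points themselves instead of from arbitrary random positions, which is a common seeding choice. The function names, the tolerance, and the toy data are assumptions for illustration.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iterations=100, tolerance=1e-9, seed=0):
    """Plain k-means; returns (centers, cluster index of each point)."""
    rng = random.Random(seed)
    # Assumption: seed the centers with k randomly chosen data points.
    centers = [list(p) for p in rng.sample(points, k)]
    assignment = []
    for _ in range(max_iterations):
        # Assign each point to its closest center.
        assignment = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        # Recompute each center as the mean of its assigned points.
        new_centers = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                new_centers.append([sum(v) / len(members) for v in zip(*members)])
            else:
                new_centers.append(centers[c])  # keep an empty cluster's center
        shift = max(squared_distance(o, n) for o, n in zip(centers, new_centers))
        centers = new_centers
        if shift <= tolerance:  # centers no longer move
            break
    return centers, assignment

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, assignment = kmeans(points, k=2)
print(assignment)
```

Because the result depends on the seeds, running the procedure several times with different values of `seed` and keeping the solution with the smallest within-cluster variability is a common safeguard.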


How to choose k

Figure of Merit (FOM)

Self organizing maps (SOM)

• Neural network approach

• Usually one‐ or two‐dimensional map

• Hexagonal or rectangular net topology

• Moderate memory and time consumption

• Number of clusters has to be specified!


1. Generate a simple (usually 2D) grid of nodes (x,y).

2. Map the nodes into n‐dimensional expression vectors (initially at random), e.g. (x,y) ‐> [0 0 0 x 0 0 0 y 0 0 0 0 0].

3. For each data point P, change all node positions so that they move towards P. Closer nodes move more than distant nodes.

4. Iterate for a maximum number of iterations, and then assess the position of all nodes.

$$f_{i+1}(N) = f_i(N) + t(d(N, N_P), i) \, [P - f_i(N)]$$

• f_i(N) = position of node N at iteration i

• P = position of the current data point

• P − f_i(N) = vector from N to P

• t = weighting factor or “learning rate”; it dictates how much N moves towards P and decreases with the iteration and with the distance of N to P

• t(d(N, N_P), i) = 0.02 T/(T + 100 i) for d(N, N_P) < cutoff radius, else 0

• T = maximum number of iterations
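One update step of the map can be sketched directly from the update rule. The code assumes that N_P denotes the node whose expression vector is currently closest to P and that d(N, N_P) is measured on the grid; both are interpretations, and all function names are hypothetical.

```python
import math

def learning_rate(grid_distance, iteration, max_iterations, cutoff_radius):
    """0.02 * T / (T + 100 * i) inside the cutoff radius, else 0."""
    if grid_distance < cutoff_radius:
        return 0.02 * max_iterations / (max_iterations + 100 * iteration)
    return 0.0

def som_update(node_vectors, node_grid, point, iteration, max_iterations,
               cutoff_radius):
    """Move all nodes near the winning node towards the data point."""
    # Assumption: N_P is the node closest to P in expression space.
    winner = min(range(len(node_vectors)),
                 key=lambda n: math.dist(node_vectors[n], point))
    updated = []
    for grid_pos, vector in zip(node_grid, node_vectors):
        rate = learning_rate(math.dist(grid_pos, node_grid[winner]),
                             iteration, max_iterations, cutoff_radius)
        updated.append([v + rate * (p - v) for v, p in zip(vector, point)])
    return updated

grid = [(0, 0), (1, 0), (5, 0)]
vectors = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
new_vectors = som_update(vectors, grid, [0.0, 0.5], iteration=0,
                         max_iterations=1000, cutoff_radius=2.0)
print(new_vectors)
```

In this example the winning node and its grid neighbor move 2% of the way towards the data point, while the distant node outside the cutoff radius stays where it is.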


Principal component analysis (PCA)

PCA is a data reduction technique that simplifies multidimensional data sets to a smaller number of dimensions (r<n).

Variables are summarized by linear combinations, the principal components. The origin of the coordinate system is moved to the center of the data (mean centering). The coordinate system is then rotated so that the first axis captures the maximum of the variance.

Subsequent principal components are orthogonal to the preceding ones and capture the maximum of the remaining variance. The first 2 PCs often explain a large share of the variance (e.g. 80‐90%).

This analysis can be done by a special matrix decomposition, the singular value decomposition (SVD).

Singular value decomposition (SVD)

$$X = U S V^T \quad \text{with} \quad U U^T = V^T V = V V^T = I$$

For mean‐centered data the covariance matrix C can be calculated (up to a constant factor) from $X X^T$. The columns of U are eigenvectors of $X X^T$; the eigenvalues, defined by the characteristic equation $|C - \lambda I| = 0$, are the squares of the diagonal entries of S.

Transformation of the input vectors into the principal component space can be described by Y = XU, where the projection of sample i along the axis is defined by the j‐th PC:


Correspondence analysis (CA)

• Correspondence Analysis is an explorative computational method for the study of associations between variables.

• The approach is a combination of using the χ2 statistic and singular value decomposition (SVD) similar to that for principal component analysis.

• Like principal component analysis, it displays a low‐dimensional projection of the data

• It does this, though, for two variables simultaneously, thus revealing associations between them


Gene expression terrain maps

Clustering vs. classification

• Clustering uses the primary data to group measurements together, with no information from other sources. It is often called unsupervised machine learning. It is not biased by previous knowledge, but therefore needs a stronger signal to discover clusters.

• Classification uses known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and creates rules for associating the data with the groups of interest. It is often called supervised machine learning. It uses previous knowledge, so it can detect a weaker signal, but it may be biased by wrong previous knowledge.


Methods for classification

• K‐nearest neighbors

• Linear Models

• Discriminant analysis

• Logistic Regression

• Naïve Bayes

• Decision Trees

• Random Forests

• Support Vector Machines

Assess classifier performance

Confusion matrix: rows = classified in group1 / group2, columns = truly in group1 / group2


Receiver operating characteristic (ROC)

Example: patients with T4 values of 5 or less are considered to be hypothyroid.

Cross validation

• Holdback cross‐validation

• K‐fold cross‐validation

• Leave‐k‐out cross‐validation (LKOCV); if k=1 it is called leave‐one‐out cross‐validation (LOOCV)

• Variance‐bias trade‐off
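The splitting logic of k-fold cross-validation can be sketched as a generator of training and test indices. Samples are assigned to folds in a simple interleaved way here; in practice they would usually be shuffled first. The function name is hypothetical.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for each of the k folds."""
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = sorted(i for f, fold in enumerate(folds)
                       if f != held_out for i in fold)
        yield train, test

for train, test in k_fold_indices(6, 3):
    print(train, test)
```

Each sample is used for testing exactly once. Setting the number of folds equal to the number of samples holds out one sample per round, which corresponds to leave-one-out cross-validation.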


Support vector machines (SVM)


Software

Saeed AI. et al. Biotechniques 2004

Sturn A. et al. Bioinformatics 2002

Commercial

Open source

Biological meaning of the gene sets?

• Gene ontology terms

• Pathway mapping

• Linking to Pubmed abstracts or associated MESH terms

• Regulation by the same transcription factor (module)

• Protein families and domains

• Gene set enrichment analysis

• Over representation analysis


Gene Ontology

The three organizing principles of GO are

• cellular component (e.g. mitochondrion)
• biological process (e.g. lipid metabolism)
• molecular function (e.g. hydrolase activity)

Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a term name (e.g. fibroblast growth factor receptor binding).

URL: http://www.geneontology.org/

The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

Directed acyclic graph (DAG)

• 2 relations: part_of, is_a
• different levels


Evidence codes for GO annotations

ISS Inferred from Sequence Similarity

IEP Inferred from Expression Pattern

IMP Inferred from Mutant Phenotype

IGI Inferred from Genetic Interaction

IPI Inferred from Physical Interaction

IDA Inferred from Direct Assay

RCA Inferred from Reviewed Computational Analysis

TAS Traceable Author Statement

NAS Non‐traceable Author Statement

IC Inferred by Curator

ND No biological Data available

Gene Ontology Browser (Amigo)


Gene ontology functionality (Genesis)

cell cycle, mitosis, cytokinesis

nucleus

Cluster 05 includes many cell cycle genes

GO terms for gene sets


Overrepresentation analysis

• m: genes in the gene universe (whole microarray)

• g: all genes with the GO term

• c: genes in the cluster (gene list)

• i: genes in the cluster with the GO term

Over representation analysis (ORA)

• Fisher exact test for contingency table

• Hypergeometric test

• Example from http://genome.tugraz.at/ORA

$$p = \frac{\binom{g}{i} \binom{m-g}{c-i}}{\binom{m}{c}}$$
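The hypergeometric probability can be evaluated with exact binomial coefficients. The sketch below uses m for the size of the gene universe, g for the genes with the GO term, c for the cluster size, and i for the genes in the cluster with the GO term. Summing the upper tail to obtain an over-representation p-value is the usual convention and is an assumption here; the function names and the example numbers are illustrative.

```python
from math import comb

def hypergeom_pmf(i, m, g, c):
    """Probability of exactly i annotated genes in a cluster of size c."""
    return comb(g, i) * comb(m - g, c - i) / comb(m, c)

def ora_p_value(i, m, g, c):
    """P(at least i annotated genes in the cluster) by chance."""
    upper = min(g, c)
    return sum(hypergeom_pmf(x, m, g, c) for x in range(i, upper + 1))

# Toy example: 20 genes on the array, 5 carry the GO term,
# the cluster has 4 genes, 2 of which carry the term.
print(round(hypergeom_pmf(2, 20, 5, 4), 4))  # 0.2167
print(round(ora_p_value(2, 20, 5, 4), 4))    # 0.2487
```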
