high throughput sequencing and expression analysis · 2014. 5. 24. · 5/24/2014 1...


Page 1: High throughput sequencing and expression analysis · 2014. 5. 24. · 5/24/2014 1 High‐throughput sequencing and expression analysis Glossary for high‐throughput sequencing experiments

5/24/2014

1

High‐throughput sequencing and expression analysis 

Glossary for high‐throughput sequencing experiments

• Library
• Run
• Flowcell
• Lane (channel)
• Read
• Read length (25-400 bp)
• Coverage
• Depth ("deepness")
• Barcoding (run more than one sample in one lane; multiplexing)
• Paired-end
• Sequencing by synthesis (SBS)
• Sequencing by ligation (SBL)
• Emulsion PCR
• Single-molecule sequencing (no amplification, 3rd generation)

millions of sequences (reads)


Sequencing technologies

• Sanger (ABI, Life Technologies)
• 454 (Roche)
• Solexa (Illumina)
• Solid (Life Technologies)
• Polonator (Church Lab)
• HeliScope (Helicos)
• Pacific Biosciences SMRT
• Ion Torrent (Life Technologies)
• Complete Genomics
• Nanopore Sequencing (IBM & Roche)

Metzker, Nature Rev Genet, 2010
Shendure, Ji, Nature Biotech, 2008
Mardis, Annu Rev Genomics Hum Genet, 2008

Sanger sequencing

• DNA is fragmented

• Fragments are cloned into a plasmid vector; bacteria are transformed and grown (automated colony picking)

• Fluorophore-labeled dideoxy nucleoside triphosphates (ddNTPs) are used for chain termination

• Fluorescent readout with capillary electrophoresis (up to 384 capillaries)


454 (Roche)

• First on the market (2004)
• Pyrosequencing
• Emulsion PCR
• ~400 bp reads
• 400,000 reads in parallel
• Homopolymers can be an issue

• One DNA molecule per bead

454 Pyrosequencing

PPi... Pyrophosphate

Agah A et al. Nucl. Acids Res. 2004


Solexa (Illumina)

1. Prepare genomic DNA sample

2. Attach DNA to surface

3. Bridge amplification

4. Fragments become double stranded

5. Denature double stranded DNA

6. Complete amplification


Solexa (Illumina)


Solid (Life Technologies)

• Sequencing by ligation
• 2-base encoding (color space)
• => high accuracy

Solid (Life Technologies)

• Must know the first base
• A true base change requires an adjacent valid color change
• Errors do not have compensatory color changes


HeliScope (Helicos)

• 1,000,000,000 reads per experiment
• >100 sample preparation scalability
• No amplification
• Digital quantification
• 1 labeled base at a time

Pacific Biosciences (Single Molecule Real-Time, SMRT)

• Zero-mode waveguide cells (ZMW)
• Polymerase immobilized on a solid surface
• Read length >1 kb possible
• No amplification
• Phospholinked nucleotides (fluorophores)


Pacific Biosciences

• Long read lengths (>1 kb) rely on the sustained activity of the DNA polymerase

• Circular sequencing could improve quality by reading the same base multiple times

Eid et al. Science, 2009 

Ion Torrent

• Sequencing on a semiconductor device
• pH sensor
• Homopolymer issues
• Cheap chemistry
• Run time ~2 hours
• ~200 bp reads
• 11 million wells, >1 Gb (Ion 318 Chip)

Rothberg et al., Nature, 2011


Summary of sequencing technologies

Platform | Amplification / chemistry | Detection | Read length (bp) | Reads per lane / no. of lanes | Run time (d) | Gb per run | Machine cost (k$) | Cost per Mb
---------|---------------------------|-----------|------------------|-------------------------------|--------------|------------|-------------------|------------
Sanger/ABI | PCR, dideoxy | Fluor. | 800 | 384 capil. | 0.01 | 0.1 | 110 | <5000
454/Roche | emPCR, synthesis | Lumin. | 400 (SE) | 1x10^6 | 0.35 | 0.5 | 500 | <10
Solexa/Illumina | Bridge ampl., synthesis | Fluor. | 100 (PE) | 50x10^6 / 8 (x2) | 4 | 500 | 540 | <2
Solid/Life Tech | emPCR, ligation | Fluor. | 75 (PE) | 40x10^6 / 8 (x2) | 7 | 300 | 595 | <2
Polonator | emPCR, ligation | Fluor. | 13 (PE) | 10x10^6 / 8 (x2) | 5 | 10 | 170 | <1
HeliScope/Helicos | No ampl., synthesis | Fluor. | 35 (SE) | 20x10^6 / 25 (x2) | 8 | 30 | 999 | <0.6
Pacific Biosciences SMRT | No ampl., synthesis | Fluor. | >1000 (SE) | 150.00 per SMRT cell | N/A | N/A | N/A | <1
Ion Torrent | emPCR, synthesis | pH | 200 (PE) | 11x10^6 per chip | <2 h | 1 | 50 | <1
Complete Genomics | DNA nanoballs, ligation | Fluor. | Service: 40X human genome coverage, >90% of the full genome res., 400 human genomes per month

* These values change rapidly, may be outdated, and depend on instrument type and version.

Base calling (Phred score) 

Phred quality score Q and base‐calling error probabilities P

$Q_{Phred} = -10 \log_{10} P$

$Q_{Solexa} = -10 \log_{10} \frac{P}{1 - P}$

For P=0.05 the quality score Q=13 
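The two quality definitions can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names are illustrative and not taken from any base-calling tool.

```python
import math

def phred_quality(p):
    """Phred quality score for a base-calling error probability p."""
    return -10 * math.log10(p)

def solexa_quality(p):
    """Solexa quality score, based on the odds p / (1 - p) instead of p."""
    return -10 * math.log10(p / (1 - p))

def error_probability(q):
    """Invert the Phred definition: error probability for quality score q."""
    return 10 ** (-q / 10)

# The slide's example: P = 0.05 gives a Phred score of about 13.
print(round(phred_quality(0.05)))
```

For small error probabilities the two scores are nearly identical; they differ noticeably only for low-quality bases.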


Base calling (FastQ format) 

@EAS54_6_R1_2_1_413_324
CCCTTCTTGTCTTCAGCGTTTCTCC
+
;;3;;;;;;;;;;;;7;;;;;;;88

Quality scores are encoded in ASCII
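As an illustration of the ASCII encoding, the sketch below converts a quality string into numeric scores. It assumes an offset of 33 (the Sanger/Phred+33 convention); older Solexa/Illumina files use an offset of 64, so the offset is a parameter here. The function name is hypothetical.

```python
def decode_qualities(quality_string, offset=33):
    """Convert an ASCII-encoded FASTQ quality string into integer scores.

    The offset of 33 is an assumption (Phred+33); pass offset=64 for
    files written in the older Phred+64 / Solexa+64 encodings.
    """
    return [ord(ch) - offset for ch in quality_string]

# The four lines of a FASTQ record: name, sequence, separator, qualities.
record = [
    "@EAS54_6_R1_2_1_413_324",
    "CCCTTCTTGTCTTCAGCGTTTCTCC",
    "+",
    ";;3;;;;;;;;;;;;7;;;;;;;88",
]
name, sequence, _, quality = record
scores = decode_qualities(quality)
```

With this offset the character `;` (ASCII 59) decodes to a score of 26 and `3` (ASCII 51) to 18.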

Paired‐end sequencing

• Enables both ends of the DNA fragment to be sequenced

• Because the distance between the two reads of a pair is known, alignment algorithms can use this information to map reads over repetitive regions and to detect insertions and deletions more precisely


Multiplex sequencing (barcodes)

Margolies, NHGRI

RNA expression profiling

• Northern blotting
– semi-quantitative
– few genes

• Real-time RT-PCR (qPCR)
– medium throughput
– 96/384 per run

• Microarray analysis
– high throughput
– 10,000-500,000 elements per chip

• RNA-seq
– high throughput
– deep sequencing (short reads, 25 bp)


Microarray technology

• Two‐color microarrays (Custom, Agilent)

– Spotted oligonucleotides

– Spotted cDNA

• One color microarrays (Affymetrix, ABI)

– In situ synthesized oligonucleotides

• Other types of microarrays

– Exon microarrays

– Tiling microarrays

– Protein (antibody) microarray

– Tissue microarrays

Array types

• 3' arrays, e.g. Affymetrix U133 Plus 2.0 array. Application: gene expression

• Exon arrays, e.g. Affymetrix Exon 1.0 ST array. Application: alternative splicing, transcript expression

• microRNA arrays, e.g. Exiqon using locked nucleic acids (LNA). Application: mature microRNA profiling. Problem: short, very similar sequences

• Tiling arrays (target genomic DNA), e.g. Nimblegen. Applications: ChIP-on-chip, array CGH


Two‐color microarrays

• DNA fragments corresponding to a known sequence are mechanically deposited onto a glass slide

•   The fragments can be :

– Oligonucleotides, 60-80 nucleotides long (60-80-mers)
– cDNA fragments from a library (varying lengths)

•   Two samples of reverse‐transcribed mRNA are labelled with two different colors and co‐hybridized onto the slide 

Two‐color microarrays


Labels

Two‐color microarrays


Experimental design

• Replicates

– Biological replicates (independent experiments)

– Technical replicates: replicate arrays (dye swap), replicate spots

• Reference design versus loop design

Which reference sample?

• Constraints

– Sample material (e.g. biopsies) → pooling

– Costs (custom 100 €, Affx 500 €, ABI 1k €)

not independent !!

Analytical pipeline


Image analysis and background correction

• Software available (GenePix, ImaGene, Agilent)

• Steps:

– Gridding: assigns coordinates and gene information to the different spots

– Segmentation: foreground vs. background

– Intensity extraction

Background-corrected intensities (foreground minus background) for spot k:

$G_k = F532_{mean} - B532$

$R_k = F635_{mean} - B635$

Many other parameters are reported as well: F635 % Sat., Flags, B532 SD, …

Normalization

• Removal of all sources of systematic non‐biological variability and the reduction of the random errors.

• Basic assumption is that most of the genes are not changing their expression during the studied process

• Amount of total RNA for both samples is the same


Scatterplot and histogram

Box plot


MA plot

M = log2(R/G)

A = log2(R*G)/2
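A minimal sketch of the M and A transformation for one spot, assuming background-corrected, strictly positive red (R) and green (G) intensities. The function name is illustrative.

```python
import math

def ma_values(red, green):
    """Return (M, A) for one spot from red and green intensities.

    M is the log2 ratio, A is the average log2 intensity.
    Both inputs must be positive, otherwise the logarithm is undefined.
    """
    m = math.log2(red / green)
    a = math.log2(red * green) / 2
    return m, a

m, a = ma_values(8.0, 2.0)   # a 4-fold red/green ratio gives M = 2
```

Plotting M against A for all spots makes intensity-dependent dye bias visible as a curved rather than horizontal point cloud.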

Intensity dependent normalization

• Apply a locally weighted polynomial regression for a fixed subset of genes in the neighborhood of every gene i (LOWESS).

• Weight function:


Differentially expressed genes

• Rank genes by

– log2 ratios (M)

– z-score for 1 array: $z = (M - \mathrm{mean}(M)) / SD$

– $\bar{z}$ for 1 group of n arrays: the z-score averaged over biological replicates

– t-distribution

– moderated t-test

Comparison of different groups

• 2 groups

– difference: $d = \bar{M}_A - \bar{M}_B$

– t-test: $t = \frac{\bar{M}_A - \bar{M}_B}{s^* \sqrt{1/n_A + 1/n_B}}$

• 3 or more groups

– Analysis of variance (ANOVA): $F = \frac{MSA}{MSE}$
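The two-group comparison can be sketched as follows. This assumes that the scatter term in the t statistic is the pooled standard deviation of the two groups (the classical equal-variance t-test); the function name is illustrative.

```python
import math
from statistics import mean, variance

def two_group_statistics(group_a, group_b):
    """Return (d, t) for the log ratios of one gene in two groups.

    d is the difference of the group means; t divides d by the pooled
    standard deviation times sqrt(1/n_A + 1/n_B). Equal variances in
    both groups are assumed.
    """
    n_a, n_b = len(group_a), len(group_b)
    d = mean(group_a) - mean(group_b)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    t = d / (math.sqrt(pooled_var) * math.sqrt(1 / n_a + 1 / n_b))
    return d, t

d, t = two_group_statistics([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

The p-value then follows from the t-distribution with n_A + n_B - 2 degrees of freedom, which is not computed in this sketch.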


Assigning significance

• Threshold for M: 1 (2-fold change)

• Threshold for z: 1.5

• Threshold for p-value from different tests (z-test, t-test, ANOVA): p<0.05 is considered statistically significant

• Problem of multiple testing

• False discovery rate (FDR)

• Significance analysis of microarrays (SAM)

Multiple testing

With 1000 tests at a significance level of 0.05, 50 false positives are expected to be declared significant.

To account for multiple testing, the following quantities are used (V: number of false positives, R: number of hypotheses declared significant):

• Family-wise error rate (FWER): p(V>0)

• False discovery rate (FDR): E(V/R)


Methods to correct p‐values for multiple testing
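As an illustration, here is a minimal sketch of two widely used corrections: Bonferroni (controls the FWER) and Benjamini-Hochberg (controls the FDR). These two are chosen as common examples and are not necessarily the methods listed on the original slide; the function names are illustrative.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR control).

    Each p-value is scaled by m / rank, then a running minimum from the
    largest p-value downwards keeps the adjusted values monotone.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda k: p_values[k])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

p = [0.01, 0.04, 0.03, 0.005]
fwer_adjusted = bonferroni(p)
fdr_adjusted = benjamini_hochberg(p)
```

Genes whose adjusted p-value falls below the chosen level (e.g. 0.05) are then reported as significant.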

Significance analysis of microarrays (SAM)

$d(i) = \frac{\bar{x}_I(i) - \bar{x}_U(i)}{s(i) + s_0}$

• $s(i)$: gene-specific scatter

• $s_0$: small positive constant, calculated to minimize the coefficient of variation (CV)

• $d_E(i)$ is the average of $d_P(i)$ from permuted samples

• Identify genes which deviate from $d(i) = d_E(i)$ by more than a threshold $\Delta$

• These do not necessarily have the largest change in expression

• $\Delta$ can be optimized with an estimate of the false discovery rate (FDR)


Affymetrix microarrays

Affymetrix microarrays


Affymetrix microarrays

Affymetrix microarrays


Affymetrix microarrays

Affymetrix chips


Pre‐processing of Affymetrix chips

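Additive-multiplicative error model (on the log scale): signal $\log(S)$ = log of true abundance + probe effect + random error

Affymetrix approach:

– AvgDiff: average over the probe pairs j of the differences $d_{ij} = PM_{ij} - MM_{ij}$

– MAS 5.0: $s = \mathrm{TukeyBiweight}(\log(PM_{ij} - CT_{ij}))$

Li, Wong approach (dChip):

$PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}$

Irizarry (RMA):

$\log(PM_{ij} - BG) = a_i + b_j + \epsilon_{ij}$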

Normalization of Affymetrix chips

– Global normalization

– Splines smooth

– Cyclic LOESS

– Quantiles normalization

Remove intensity‐based bias 
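Of the normalization methods listed, quantile normalization is the simplest to sketch: every chip is forced to have the same intensity distribution, namely the mean of the sorted intensities across chips. The sketch below assumes chips of equal length without missing values and ignores tie handling; the function name is illustrative.

```python
def quantile_normalize(chips):
    """Quantile-normalize a list of chips (equal-length lists of intensities).

    The reference distribution is the mean of the k-th smallest value
    across all chips; each value is replaced by the reference value of
    its rank within its own chip. Ties are broken by position, which is
    a simplification.
    """
    n = len(chips[0])
    sorted_chips = [sorted(chip) for chip in chips]
    reference = [sum(chip[k] for chip in sorted_chips) / len(chips)
                 for k in range(n)]
    normalized = []
    for chip in chips:
        order = sorted(range(n), key=lambda k: chip[k])
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized

normalized = quantile_normalize([[5, 2, 3], [4, 1, 6]])
# normalized == [[5.5, 1.5, 3.5], [3.5, 1.5, 5.5]]
```

After this step all chips share identical quantiles, so box plots and density plots of the chips coincide.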

Summarizing the probes

– Many outliers


Methods

More than 30 different methods for pre-processing, normalization and combined analysis were studied in a benchmark (Irizarry, Bioinformatics, 2006).

Methods that showed good performance:

• RMA (Robust Multi-array Average)

• GCRMA (RMA with adjustment for non-specific binding based on probe sequence information)

• VSN (Variance stabilization normalization)

• PLIER (Probe Logarithmic Intensity Error estimation)

Density plot


MA plots of chip pairs

Before quantile normalization

After  quantile normalization

Quality control

Relative Log Expression (RLE)

Normalized Unscaled Standard Error (NUSE)

RMA Norm data

VSN Norm data

Sanchez‐Cabo et al., in Bioinformatics for Omics Data (ed Mayer), 2011


R & Bioconductor:

– Open source statistical software

– Mostly used by the microarray community

– All functions implemented and packages available

Other tools

• ArrayNorm (Pieler R. et al. Bioinformatics. 2004)

Standalone Java application for two-color microarrays

• CARMAweb (Rainer J. et al. Nucleic Acids Res. 2006)

Web application based on Bioconductor packages for

one and two color arrays and further analysis

• GEPAS, ArrayPipe, MIDAW, RACE, Expression Profiler

Software for microarray normalization

• Potential for surveying the entire transcriptome, including novel, un‐annotated regions.

• Helps to identify the expression and function of regulatory non-coding RNAs (e.g. lincRNA)

• Potential for determining gene structure and isoform level expression using reads mapping to splice junctions.

• Potential for making better presence/absence calls on regions.

• More expensive than microarrays

• Don‘t need to design probes

Transcriptome sequencing (RNAseq)


Transcriptome sequencing (RNAseq)

Wang et al., Nature Rev Gen, 2009

Normalization

• Reads per kilobase per million (RPKM)

Normalization:

• Quantile normalization

• TMM (trimmed mean of M values).
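RPKM divides the read count of a gene by its length in kilobases and by the number of mapped reads in millions. A minimal sketch with an illustrative function name and made-up numbers:

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical example: 500 reads on a 2 kb gene in a library of
# 10 million mapped reads.
value = rpkm(500, 2000, 10_000_000)   # 25.0
```

RPKM corrects for gene length and library size only; methods such as TMM additionally address composition differences between libraries.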


1. Read mapping

2. Transcriptome reconstruction

3. Expression quantification

4. Differential expression analysis

Analysis steps

Read mapping

Garber et al., Nature Methods,  2011


Spliced aligners

Garber et al., Nature Methods,  2011

Transcriptome reconstruction

Garber et al., Nature Methods,  2011


Cufflinks

Trapnell C, Nature Biotech, 2011

FPKM

fragments per kilobase of transcript per million fragments mapped (paired-end data)

Expression quantification and differential expression

Garber et al., Nature Methods,  2011


Expression quantification

Garber et al., Nature Methods,  2011

Problem of assigning reads to correct isoform

The assignment probabilities can be estimated by the iterative Expectation-Maximization (EM) algorithm, which finds maximum likelihood estimates of parameters when the model depends on unobserved latent variables (here: the isoform each read originated from).
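The idea can be illustrated with a toy EM loop. This sketch is deliberately simplified and is not the model used by Cufflinks or RSEM: it assumes isoforms of equal length, ignores fragment-length and error models, and takes as input only the set of isoforms each read is compatible with. All names are illustrative.

```python
def em_isoform_abundance(read_compatibility, isoforms, n_iter=100):
    """Estimate relative isoform abundances from ambiguous read assignments.

    read_compatibility: one tuple per read listing the isoforms it maps to.
    E-step: split each read among its compatible isoforms in proportion
            to the current abundance estimates.
    M-step: new abundance = expected share of reads per isoform.
    """
    theta = {iso: 1.0 / len(isoforms) for iso in isoforms}
    for _ in range(n_iter):
        expected = {iso: 0.0 for iso in isoforms}
        for compatible in read_compatibility:
            total = sum(theta[iso] for iso in compatible)
            for iso in compatible:
                expected[iso] += theta[iso] / total
        theta = {iso: expected[iso] / len(read_compatibility)
                 for iso in isoforms}
    return theta

# Two reads unique to isoform A, one unique to B, two shared by both.
reads = [("A",), ("A",), ("A", "B"), ("A", "B"), ("B",)]
theta = em_isoform_abundance(reads, ["A", "B"])
```

In this example the shared reads are distributed according to the evidence from the unique reads, so the estimates converge to 2/3 for A and 1/3 for B.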

Pachter L, 2011


Differential expression analysis

Garber et al., Nature Methods,  2011

Isoform and gene expression

Li, Dewey, Bioinformatics, 2011


Alternative splicing

Blencowe, http://www.utoronto.ca

Differential expression analysis for sequencing count data

• Discrete, positive, skewed: not (log-)normally distributed, but Poisson distributed

• Sequencing depth (coverage) varies between samples

• Normalization for library size

• Large dynamic range (0 ... 10^5) between genes

Anders, Huber, EMBL

A B C D

Gene1 1 23 2 6

Gene2 0 74 8 7

Gene3 33 4 14 8


Technical and biological replicates

• Counts for the same gene from different technical replicates have a variance equal to the mean (Poisson)

• Counts for the same gene from different biological replicates have a variance exceeding the mean (overdispersion), which can be estimated with a negative-binomial model

• No need for technical replicates (variance = mean), but biological replicates are needed to estimate the variance (dispersion) and to draw conclusions for a greater population (as for any biological experiment)
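Overdispersion can be made concrete with the negative-binomial variance, variance = mean + dispersion × mean². The sketch below estimates the dispersion of one gene by the method of moments; this is a simplification for illustration, and packages such as edgeR or DESeq use more elaborate estimators that share information across genes. The function name is illustrative.

```python
from statistics import mean, variance

def dispersion_estimate(counts):
    """Method-of-moments dispersion for replicate counts of one gene.

    Solves variance = mean + dispersion * mean**2 for the dispersion and
    truncates at 0, which corresponds to the Poisson case.
    """
    mu = mean(counts)
    var = variance(counts)
    return max(0.0, (var - mu) / mu ** 2)

# Hypothetical biological replicates: variance (100) exceeds the mean (20).
alpha = dispersion_estimate([10, 20, 30])   # 0.2
```

A dispersion of 0.2 corresponds to a biological coefficient of variation of about 45% (the square root of the dispersion).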

Nagalakshmi et al. Science, 2008

Bioconductor packages 

Testing for differential signal in sequencing count data:

Based on negative-binomial distribution:

• edgeR (Robinson, McCarthy, Smyth)
• DESeq (Anders, Huber)
• DEXSeq (Reyes, Anders, Huber) for differential exon usage
• baySeq (Hardcastle, Kelly)

Based on Poisson distribution:

• DEGSeq (Wang et al.)


Tools and standards

File formats and standards:

• FastQ format: sequence and corresponding quality levels
• GFF/GTF files (General Feature Format, Gene Transfer Format)
• Sequence Alignment/Map (SAM/BAM) format (SAMtools (C APIs), Picard (Java APIs))
• Short Read Archive (SRA)

Tools:

• Overview: http://ngslib.i-med.ac.at, Garber et al., Nature Methods, 2011
• Base calling tools: Phred, Alta-Cyclic, … (platform specific)
• RNA-seq software: ERANGE, TopHat, Cufflinks, …
• Mapping tools: Bowtie, BWA, Eland, MAQ, SOAP2, GSNAP
• Differentially expressed genes/isoforms: DESeq, DEGseq, DEXSeq, Cuffdiff, baySeq, edgeR
• Other: BEDTools, Galaxy, HTSeq

RNAseq pipeline

Wei Sun, University of North Carolina‐Chapel Hill 


Reproducibility and sensitivity of RNAseq

Mortazavi., Nature Methods, 2008

How many reads are needed (depth)?

two mouse libraries (ES,EB) yeast

Wang et al., Nature Rev Gen, 2009

E.g. 20-40 million reads for human


Clustering

• Unsupervised or supervised (classification)

• Agglomerative: bottom-up approach, whereby single expression profiles are successively joined to form nodes.

• Divisive: top-down approach, whereby each cluster is successively split in the same fashion until each cluster consists of one single profile.

Inter vs. intra cluster distance


Methods for unsupervised clustering

• Hierarchical Clustering

•  K‐means

•  Self Organizing Maps

•  Model‐based methods

•  Trillions of others

Data format


Mean or median centering

Mean or median centering


Similarity distance measures

• Pearson correlation

• Pearson uncentered

• Pearson squared

• Cosine correlation

• Covariance

• Euclidean distance

• Average dot product

• Manhattan distance

• Chebychev distance

• Mutual information

• Spearman rank

• Kendall’s tau

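Similarity distance measures

• Pearson correlation, with $-1 \le r \le 1$:

$r = \frac{\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i - \bar{y})^2}}$

• Euclidean distance:

$d_E = \sqrt{\sum_{i=1}^{n}(x_i - y_i)^2}$

• Manhattan distance:

$d_M = \sum_{i=1}^{n}|x_i - y_i|$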
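The three measures (Pearson correlation, Euclidean distance, Manhattan distance) can be written down directly for two expression profiles x and y of equal length. A minimal standard-library sketch with illustrative function names:

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def euclidean(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    """Sum of the absolute differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

x, y = [1.0, 2.0, 3.0], [2.0, 4.0, 6.0]
r = pearson(x, y)   # profiles with the same shape correlate perfectly
```

Here the two profiles have a correlation of 1 although their Euclidean and Manhattan distances are non-zero, which shows that correlation compares the shape of profiles rather than their magnitude.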


Rank order correlation

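• Spearman's rank correlation

$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^2}{n(n^2 - 1)}$

where $d_i$ are the differences in the ranks

• Kendall's tau

$\tau = \frac{n_c - n_d}{n(n-1)/2}$

$n_c$ … concordant pairs (ordered the same way)
$n_d$ … discordant pairs (ordered in the opposite way)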

Mutual information

• Entropy (information content)

$H(A) = -\sum_{i} p(x_i) \log_2 p(x_i)$

$x_i$ … discretized gene expression level at condition i
$p(x_i)$ … probability of this state occurring

• Mutual information

$MI(A,B) = H(A) + H(B) - H(A,B)$

$H(A,B)$ … joint entropy

MI(A,B) = 0 means that the joint profile carries no more information than the two profiles separately.
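A minimal sketch of entropy and mutual information for discretized expression profiles, assuming probabilities are estimated by simple relative frequencies and MI is computed as H(A) + H(B) - H(A,B). Function names and example states are illustrative.

```python
import math
from collections import Counter

def entropy(states):
    """Shannon entropy (in bits) of a sequence of discrete states."""
    n = len(states)
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(states).values())

def mutual_information(profile_a, profile_b):
    """MI(A,B) = H(A) + H(B) - H(A,B) for two discretized profiles."""
    joint = list(zip(profile_a, profile_b))
    return entropy(profile_a) + entropy(profile_b) - entropy(joint)

a = ["low", "low", "high", "high"]
b = ["low", "high", "low", "high"]
mi_identical = mutual_information(a, a)     # 1 bit: fully informative
mi_independent = mutual_information(a, b)   # 0 bits: no shared information
```

Unlike correlation, mutual information also captures non-linear relationships, but its value depends on how the expression levels are discretized.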


Missing values

• Only elements represented in both vectors are used for the distance calculation 

• The greatest problems occur if the distance is not independent of the number of vector elements n, as is the case for the Euclidean distance.

Potential solutions:

1) Put zeros in all missing values

2) Put average of all values that are available = row average or column average

3) Estimate values based on the nearest neighbor, or a group of K nearest neighbors

4) Estimate values in other ways (e.g. SVD)

Hierarchical clustering

• Agglomerative (bottom up), unsupervised
• Cluster genes or samples (or both = biclustering)
• Distances are encoded in a dendrogram (tree)
• Cut the tree to get clusters
• Pearson correlation (usually used)
• Computationally intensive (correlation matrix)

1. Identify the clusters (items) with the closest distance
2. Join them into a new cluster
3. Compute the distance between clusters (items) (see linkage)
4. Return to step 1
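The four-step loop can be sketched in a few lines. This sketch makes two assumptions that differ from typical expression analyses: it uses Euclidean distance instead of Pearson correlation, and average linkage as the between-cluster distance. It also stops at a requested number of clusters rather than building the full dendrogram. All names are illustrative.

```python
import math
from itertools import combinations

def average_linkage(cluster_a, cluster_b, points):
    """Mean pairwise Euclidean distance between members of two clusters."""
    total = sum(math.dist(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))

def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain.

    Clusters are lists of indices into points.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > n_clusters:
        # Steps 1 and 3: find the pair of clusters with the smallest distance.
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: average_linkage(clusters[pair[0]],
                                                    clusters[pair[1]], points))
        # Step 2: join them (a < b, so deleting b keeps index a valid).
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
clusters = agglomerate(points, 3)
```

Recomputing all pairwise cluster distances in every round is simple but slow; practical implementations update a distance matrix instead.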

6 cluster 

15 cluster 


Linkage

• Single-linkage clustering: minimal distance

• Complete-linkage clustering: maximal distance

• Average-linkage clustering: calculated using the average distance (UPGMA); the average is taken over distances, not over expression values

• Weighted pair-group average: like UPGMA, but weighted according to cluster size

• Within-groups clustering: the average of the merged cluster is used instead of the cluster elements

• Ward's method: smallest possible increase in the sum of squared errors

• partition n genes into k  clusters, where k has to be  predetermined

• k-means clustering minimizes the variability within clusters and maximizes the variability between clusters

• Moderate memory and time consumption

K‐means

1. Generate random points ("cluster centers") in n dimensions (results depend on these seeds).

2. Compute the distance of each data point to each of the cluster centers.

3. Assign each data point to the closest cluster center.

4. Compute the new cluster center position as the average of the points assigned.

5. Loop to (2); stop when the cluster centers no longer move appreciably.
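The five steps translate almost directly into code. In this sketch the initial centers are drawn from the data points themselves (an assumption; the steps above only ask for random points), the distance is Euclidean, and "do not move very much" is implemented as a tolerance `tol`. All names are illustrative.

```python
import math
import random

def k_means(points, k, max_iter=100, tol=1e-6, seed=0):
    """Cluster points (tuples of numbers) into k groups.

    Returns (centers, groups), where groups[c] lists the points
    assigned to centers[c].
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 1: random seeds
    for _ in range(max_iter):
        groups = [[] for _ in range(k)]
        for p in points:                         # steps 2 and 3
            nearest = min(range(k), key=lambda c: math.dist(p, centers[c]))
            groups[nearest].append(p)
        new_centers = []
        for c, members in enumerate(groups):     # step 4
            if members:
                new_centers.append(tuple(sum(v) / len(members)
                                         for v in zip(*members)))
            else:
                new_centers.append(centers[c])   # keep empty clusters in place
        shift = max(math.dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:                          # step 5: stop criterion
            break
    return centers, groups

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, groups = k_means(points, k=2)
```

Because the result depends on the seeds, it is common to run the algorithm several times with different seeds and keep the solution with the smallest within-cluster variability.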


How to choose k

Figure of Merit (FOM)

• Neural network approach

• Usually one or two dimensional map

• Hexagonal or rectangular net topology

• Moderate memory and time consumption

• Number of clusters has to be specified!

Self organizing maps (SOM)


1. Generate a simple (usually) 2D grid of nodes (x,y)

2. Map the nodes into n-dimensional expression vectors (initially randomly, e.g. (x,y) -> [0 0 0 x 0 0 0 y 0 0 0 0 0])

3. For each data point P, change all node positions so that they move towards P. Closer nodes move more than far nodes.

4. Iterate for a maximum number of iterations, and then assess position of all nodes.

Self organizing maps (SOM)

Self organizing maps (SOM)

$f_{i+1}(N) = f_i(N) + t(d(N, N_P), i) \cdot [P - f_i(N)]$

• $f_i(N)$ = position of node N at iteration i

• P = position of the current data point

• $P - f_i(N)$ = vector from N to P

• t = weighting factor or "learning rate"; it dictates how much to move N towards P and decreases with the iteration and with the distance of N to P

• $t(d(N, N_P), i) = 0.02\,T / (T + 100\,i)$ for $d(N, N_P)$ < cutoff radius, else 0

• T = maximum number of iterations
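A single SOM update can be sketched as follows, using the learning-rate formula given above. The function and parameter names are illustrative; `grid_distance` stands for d(N, N_P), the distance on the map between node N and the node closest to the data point P.

```python
def learning_rate(grid_distance, iteration, max_iterations, cutoff_radius):
    """Weighting factor t: 0.02 * T / (T + 100 * i) inside the cutoff radius."""
    if grid_distance < cutoff_radius:
        return 0.02 * max_iterations / (max_iterations + 100 * iteration)
    return 0.0

def update_node(position, data_point, grid_distance, iteration,
                max_iterations, cutoff_radius):
    """Move one node towards the data point P by the fraction t."""
    t = learning_rate(grid_distance, iteration, max_iterations, cutoff_radius)
    return [f + t * (p - f) for f, p in zip(position, data_point)]

# At iteration 0 a node within the radius moves 2% of the way towards P.
new_position = update_node([0.0, 0.0], [1.0, 1.0], grid_distance=0,
                           iteration=0, max_iterations=1000, cutoff_radius=2)
```

A full SOM run repeats this update for every node and every data point over T iterations; nodes outside the cutoff radius stay where they are.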


Principal component analysis (PCA)

PCA is a data reduction technique that simplifies multidimensional data sets to a smaller number of dimensions (r<n).

Variables are summarized by linear combinations, the principal components. The origin of the coordinate system is moved to the center of the data (mean centering). The coordinate system is then rotated so that the first axis captures the maximum variance.

Subsequent principal components are orthogonal to the 1st PC. The first 2 PCs often explain 80-90% of the variance.

This analysis can be done by a special matrix decomposition (singular value decomposition SVD).

Singular value decomposition (SVD)

$X = U S V^T$ with $U U^T = V^T V = V V^T = I$

For mean-centered data the covariance matrix C is proportional to $X X^T$. The columns of U are the eigenvectors of $X X^T$, and the eigenvalues, defined by the characteristic equation $|C - \lambda I| = 0$, are proportional to the squares of the diagonal elements of S.

Transformation of the input vectors into the principal component space can be described by $Y = U^T X$, where the projection of sample i along the axis defined by the j-th PC is $y_{ji} = u_j^T x_i$.


Correspondence analysis (CA)

• Correspondence Analysis is an explorative computational method for the study of associations between variables.

• The approach is a combination of using the χ2 statistic and singular value decomposition (SVD) similar to that for principal component analysis.

• Like principal component analysis, it displays a low‐dimensional projection of the data

• It does this, though, for two variables simultaneously, thus revealing associations between them

Correspondence analysis (CA)


Gene expression terrain maps

Clustering vs. classification

• Clustering uses the primary data to group together measurements, with no information from other sources. Often called unsupervised machine learning. It is not biased by previous knowledge, but therefore needs a stronger signal to discover clusters.

• Classification uses known groups of interest (from other sources) to learn the features associated with these groups in the primary data, and creates rules for associating the data with the groups of interest. Often called supervised machine learning. It uses previous knowledge, so it can detect a weaker signal, but may be biased by wrong previous knowledge.


• K-nearest neighbors

• Linear Models

• Discriminant analysis

• Logistic Regression

• Naïve Bayes

• Decision Trees

• Random Forests

• Support Vector Machines

Methods for classification

Assess classifier performance

Confusion matrix: columns give the true class (truly in group1 / group2), rows give the predicted class (classified in group1 / group2).


Receiver operating characteristic (ROC)

Patients with T4 values of 5 or less are considered to be hypothyroid.

Cross validation

• Holdback cross-validation

• K-fold cross-validation and leave-k-out cross-validation (LKOCV)

• If k = 1 sample is left out, it is called leave-one-out cross-validation (LOOCV)

• Variance-bias trade-off


Support vector machines (SVM)

Support vector machines (SVM)


Software

Saeed AI. et al. Biotechniques 2004  

Sturn A. et al. Bioinformatics 2002 

Commercial

Open source

Biological meaning of the gene sets

?

• Gene ontology terms

• Pathway mapping

• Linking to Pubmed abstracts or associated MESH terms

• Regulation by the same transcription factor (module)

• Protein families and domains

• Gene set enrichment analysis

• Over representation analysis


Gene Ontology

The three organizing principles of GO are 

• cellular component (e.g. mitochondrion)
• biological process (e.g. lipid metabolism)
• molecular function (e.g. hydrolase activity)

Each entry in GO has a unique numerical identifier of the form GO:nnnnnnn, and a term name (e.g. fibroblast growth factor receptor binding).

URL: http://www.geneontology.org/

The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism.

Directed acyclic graph (DAG)

2 relations: part_of, is_a

different levels


ISS Inferred from Sequence Similarity

IEP Inferred from Expression Pattern

IMP Inferred from Mutant Phenotype

IGI Inferred from Genetic Interaction

IPI Inferred from Physical Interaction

IDA Inferred from Direct Assay

RCA Inferred from Reviewed Computational Analysis

TAS Traceable Author Statement

NAS Non‐traceable Author Statement

IC Inferred by Curator

ND No biological Data available

Evidence code for GO annotations 

Gene Ontology Browser (AmiGO)


Gene ontology functionality (Genesis) 

cell cycle, mitosis, cytokinesis

nucleus

Cluster 05 includes many cell cycle genes

GO terms for gene sets


Overrepresentation analysis

Notation (from the Venn diagram):

• m: gene universe (whole microarray)
• g: all genes with the GO term
• c: genes in the cluster (gene list)
• i: genes in the cluster with the GO term

Over representation analysis (ORA)

• Fisher exact test for contingency table

• Hypergeometric test

• Example from http://genome.tugraz.at/ORA

$p = \frac{\binom{g}{i}\binom{m-g}{c-i}}{\binom{m}{c}}$
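The hypergeometric probability can be evaluated with exact integer binomials. The sketch below uses m for the size of the gene universe, g for the number of genes with the GO term, c for the cluster size and i for the number of cluster genes with the term. It also adds the one-sided over-representation p-value, i.e. the probability of observing i or more annotated genes, which is an assumption about how the test is applied. Function names and numbers are illustrative.

```python
from math import comb

def hypergeom_probability(i, m, g, c):
    """Probability of exactly i annotated genes in a cluster of size c."""
    return comb(g, i) * comb(m - g, c - i) / comb(m, c)

def ora_p_value(i, m, g, c):
    """One-sided p-value: probability of i or more annotated genes."""
    return sum(hypergeom_probability(k, m, g, c)
               for k in range(i, min(g, c) + 1))

# Toy numbers: universe of 10 genes, 4 with the GO term,
# cluster of 3 genes of which 2 carry the term.
p_exact = hypergeom_probability(2, 10, 4, 3)   # 0.3
p_over = ora_p_value(2, 10, 4, 3)
```

Because many GO terms are tested for each cluster, the resulting p-values should be corrected for multiple testing.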