Next Generation Sequencing, Tiling Arrays and - Rätsch Lab

Next Generation Sequencing, Tiling Arrays and Predictive Sequence Analysis for Transcriptome Analysis

Gunnar Rätsch

Friedrich Miescher Laboratory

Max Planck Society, Tübingen, Germany

9th Course in Bioinformatics and Systems Biology for Molecular Biologists (March 24, 2009)

© Gunnar Rätsch (FML, Tübingen), Methods for Transcriptome Analysis, Bertinoro Systems Biology

http://www.fml.mpg.de


Introduction

Discovery of Nuclein (Friedrich Miescher, 1869)

Discovery of nuclein:

from lymphocyte & salmon

“multi-basic acid” (≥ 4)

Tübingen, around 1869

“If one ... wants to assume that a single substance ... is the specific cause of fertilization, then one should undoubtedly first and foremost consider nuclein” (Miescher, 1874)


Transcriptome Analysis: What is encoded on the genome and how is it processed?

Then we can (try to) understand:

Differences of active components between conditions/organisms?

What changes when perturbing the biological system?

How to get the transcriptome?

1. Infer transcriptome from genomic DNA

2. Measure properties of transcriptome

3. Combine predictions with measurements


Transcription & RNA Processing

Newly synthesized pre-mRNA is capped. [CBP20 & CBP80: cap-binding proteins]

Introns are spliced from pre-mRNA. [U1, U2, U4-6: spliceosome; SF1, U2AF, SR proteins: splicing factors]

A polyA tail is added to the 3' terminus of pre-mRNA.

[Bergkessel et al., 2009]


RNA Transcripts

Protein-coding mRNAs

Noncoding RNAs

Structural RNAs (e.g. rRNAs, tRNAs, ...)

Small RNAs (e.g. miRNAs, endogenous siRNAs, ...)

Antisense / promoter-associated transcripts

...


Computational Gene Finding: Labeling the Genome

[Figure: gene structure from DNA via pre-mRNA (cap, polyA) and mRNA to protein; genomic regions labeled intergenic and genic, with 5' UTR, exons, introns and 3' UTR]

Given a piece of DNA sequence

Predict protein-coding mRNAs

Less well developed for non-coding RNAs


Experimental Characterization of the Transcriptome

DNA Microarrays

Oligonucleotide probes immobilized on a glass slide hybridize to complementary labeled target RNA.

cDNA Sequencing

[Wikipedia]

http://commons.wikimedia.org/wiki/File:Radioactive_Fluorescent_Seq.jpg


Key Research Questions

Characterize an organism’s full complement of genes

⇒ Find new (possibly noncoding) genes

⇒ Compare genes among organisms

Characterize transcript isoforms

⇒ Find new alternative splice forms / transcript ends

Monitor transcriptome changes between tissues or in response to environmental changes (e.g. stress)

⇒ Identify significant expression changes

Understand transcriptome regulation

⇒ Knock-out / knock-down analysis of regulators

Identify regulated targets with significant expression changes

⇒ Identify binding sites used in regulation (e.g. ChIP-on-chip)


Roadmap

1. Computational Gene Finding
    Identification of Genomic Signals
    Learning to Predict mRNA Transcripts

2. Whole-genome Tiling Arrays
    Technology and Limitations
    Identification of Expression Differences
    De Novo Transcript Discovery

3. Next-generation Sequencing
    Technology & Limitations
    Assembly & Read Mapping

4. Extensions
    Quantification of Transcripts
    ChIP-on-Chip Studies


Computational Gene Finding Basics

DNA

Protein

Given a piece of DNA sequence

Predict proteins (or non-coding RNAs)

Given a piece of DNA sequence

Predict the correct corresponding label sequence with labels “intergenic”, “exon”, “intron”, “5’ UTR”, etc.


Hidden Markov Models

[Figure: gene structure (DNA, pre-mRNA, mRNA, protein) with the segment types intergenic, genic, 5' UTR, exon, intron and 3' UTR used as states]

Model sequence content:

One state per segment type

Allow only plausible transitions

Content statistics at each state

Derived from known genes

Prediction:

Given DNA, find most likely state sequences

Focuses on “content”

Weak models for “signals”


Joint probability of a DNA sequence x and a state sequence y of length L:

$$p(x, y) = \prod_{i=1}^{L-1} p(x_i \mid y_i)\, p(y_{i+1} \mid y_i)$$
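The HMM factorization p(x, y) on this slide can be evaluated directly for a toy model. The following is a minimal sketch, not the lecture's implementation: the two states, all probability values and the function name `log_joint` are assumptions chosen only for illustration.

```python
import math

# Hypothetical two-state model; the numbers are illustrative, not from the slides.
EMIT = {
    "exon":   {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "intron": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}
TRANS = {
    "exon":   {"exon": 0.9, "intron": 0.1},
    "intron": {"exon": 0.1, "intron": 0.9},
}

def log_joint(x, y):
    """log p(x, y) = sum_{i=1}^{L-1} [log p(x_i | y_i) + log p(y_{i+1} | y_i)]."""
    assert len(x) == len(y)
    total = 0.0
    for i in range(len(x) - 1):
        total += math.log(EMIT[y[i]][x[i]])        # content term p(x_i | y_i)
        total += math.log(TRANS[y[i]][y[i + 1]])   # transition term p(y_{i+1} | y_i)
    return total

# A GC-rich stretch scores higher under the (hypothetical) exon labeling.
gc_exon = log_joint("GCG", ["exon"] * 3)
gc_intron = log_joint("GCG", ["intron"] * 3)
```

Working in log space avoids numerical underflow for long sequences; prediction then amounts to searching for the labeling y with the highest value.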


[Figure: genomic signals along a gene (DNA, pre-mRNA, mRNA, protein): transcription start site (TSS), translation initiation site (TIS), splice donor (Don) and acceptor (Acc) sites, stop codon, polyA/cleavage site]

Computational Gene Finding Identification of Genomic Signals

Example: Splice Site Recognition

CT...GTCGTA...GAAGCTAGGAGCGC...ACGCGT...GA

≈ 150 nucleotide window around the dimer

True splice sites vs. potential splice sites

True sites: fixed window around a true splice site

Decoy sites: all other consensus sites

⇒ Millions of labeled instances from EST databases

Basic idea:

For instance, exploit that exons have higher GC content, or that specific motifs appear near splice sites.

In practice: use one feature per possible substring (e.g. of length ≤ 20) at all positions:

150 · (4^1 + ... + 4^20) ≈ 2 · 10^14 features

[Sonnenburg et al., 2007]
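The feature count quoted on this slide, 150 · (4^1 + ... + 4^20), can be checked with a few lines of arithmetic. This is only a sanity check; the function name and its parameters are illustrative assumptions.

```python
def n_substring_features(window=150, max_len=20, alphabet=4):
    """One feature per possible substring of length 1..max_len at each window position."""
    return window * sum(alphabet ** k for k in range(1, max_len + 1))

count = n_substring_features()
print(f"{count:.2e}")  # prints 2.20e+14, i.e. roughly 2 * 10^14
```

A feature space of this size cannot be enumerated explicitly, which is why such substring features are normally handled implicitly.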


Results on Splice Site Recognition

auPRC (%) for acceptor (Acc) and donor (Don) sites:

                Worm          Fly           Cress         Fish          Human
                Acc    Don    Acc    Don    Acc    Don    Acc    Don    Acc    Don
Markov Chain    92.1   90.0   80.3   78.5   87.4   88.2   63.6   62.9   16.2   26.0
SVM             95.9   95.3   86.7   87.5   92.2   92.9   86.6   86.9   54.4   56.9

[Sonnenburg, Schweikert, Philips, Behr, Rätsch, 2007]


Example: Predictions in UCSC Browser

[Figure: UCSC Genome Browser tracks with signal predictions: TSS, TIS, donor, acceptor, stop, polyA, cleave]

[Schweikert et al., 2009]

Based on known genes, learn how to combine predictions for accurate gene structure prediction

Computational Gene Finding Learning to Predict mRNA Transcripts

Discriminative Gene Prediction (simplified)

[Rätsch, Sonnenburg, Srinivasan, Witte, Müller, Sommer, Schölkopf, 2007]

Simplified model: score for splice form $y = \{(p_j, q_j)\}_{j=1}^{J}$:

$$F(y) := \underbrace{\sum_{j=1}^{J-1} S_{GT}(f_j^{GT}) + \sum_{j=2}^{J} S_{AG}(f_j^{AG})}_{\text{splice signals}} + \underbrace{\sum_{j=1}^{J-1} S_{L_I}(p_{j+1} - q_j) + \sum_{j=1}^{J} S_{L_E}(q_j - p_j)}_{\text{segment lengths}}$$

Tune free parameters (in functions $S_{GT}$, $S_{AG}$, $S_{L_E}$, $S_{L_I}$) by solving a linear program using a training set with known splice forms


Results using mGene

Most accurate ab initio method in the nGASP genome annotation challenge (C. elegans) [Coghlan et al., 2008]

Validation of gene predictions for C. elegans: [Schweikert et al., 2009]

                        No. of genes   No. of genes analyzed   Frac. of genes w/ expression
New genes               2,197          57                      ≈ 42%
Missing unconf. genes   205            24                      ≈ 8%

Annotation of other nematode genomes: [Schweikert et al., 2009]

Genome        Genome size [Mbp]   No. of genes   No. exons/gene (mean)   mGene accuracy   Best other accuracy
C. remanei    235.94              31503          5.7                     96.6%            93.8%
C. japonica   266.90              20121          5.3                     93.3%            88.7%
C. brenneri   453.09              41129          5.4                     93.1%            87.8%
C. briggsae   108.48              22542          6.0                     87.0%            82.0%

http://www.wormbase.org/wiki/index.php/NGASP


mGene.web: Gene Finding for Everybody ;-) (Schweikert et al., 2009)

http://mgene.org/webservice


Limitations/Extensions

Gene finding accuracy still far from perfect

Misses genes, predicts incorrect gene models

Does not (yet) predict alternative transcripts

Cannot predict when transcripts are expressed/modified/degraded ...

Need experimental data for condition specific transcriptomes.

Then we can learn to predict (hopefully).


Whole-genome Tiling Arrays

From Genome to Proteins etc.

[Figure: from DNA via pre-mRNA (cap, polyA) and mRNA to protein, with TIS, stop and polyA/cleavage signals]

Directly measure the transcriptome

Whole genome tiling arrays

Transcriptome sequencing (Sanger or Next Generation Sequencing)

From Transcriptome Measurements to Proteins etc.

[Figure: mRNA (cap, polyA), protein and DNA]

Whole-genome Tiling Arrays Technology and Limitations

Whole-genome Tiling Arrays

[Figure: tiling probes along the genome; labels: 25 nt, ~35 nt]

Whole-genome, quantitative measurements of expression

Allows cost-effective analysis of many conditions (replicates)

Hybridization data is noisy, analysis challenging

Variants: exon arrays, exon junction arrays

See Mockler et al. [2005], Yazaki et al. [2007] for comprehensive reviews

[Figure: hybridization intensity of tiling probes (25 nt, ~35 nt) for a hybridizing RNA transcript; example of a hybridizing mRNA transcript with an exon skip]

Tiling Array Analysis Challenges (I)

Repeats cause cross-hybridization

⇒ Discard tiling probes with high sequence similarity to >1 location in the genome

[Figure: repeats in the genome structure cause cross-hybridization]

Tiling Array Analysis Challenges (II)

Hybridization intensity exhibits a probe sequence bias

[Figure: median intensity (log2) and frequency [%] as a function of probe GC content]

Sequence-normalization approaches:

Rescaling by probe GC content [Samanta et al., 2006]

Rescaling using genomic DNA hybridization [David et al., 2006]:

$$n_{ij} = \frac{x_{ij} - b_j(y_i)}{y_i}$$

for probe i on replicate array j with RNA hybridization signal $x_{ij}$, to obtain the normalized signal $n_{ij}$; the DNA hybridization signal $y_i$ is transformed into an RNA background signal by $b_j$, estimated from intergenic probes.

Regression techniques [Royce et al., 2007b, Zeller et al., 2008c]

[Figure: 90th intensity percentile (log2) for each nucleotide (A, C, G, T) by position in probe]

Tiling Array Analysis Challenges (III)

Transcript normalization assumes constant transcript intensities $\bar{y}_i$ (median estimate) [Zeller et al., 2008c]

Learns intensity deviation from transcript intensity: $\delta_i := y_i - \bar{y}_i$

Takes probe sequence $x_i$ (positional information on mono-, di- and tri-mer occurrence) as input for regression.

Models probe sequence effect depending on $y_i$: $f(x_i, y_i) \approx \delta_i$

[Figure: log-intensity along a transcript: observed intensity vs. transcript intensity for annotated exonic and intronic probes; fold difference δ between observed and transcript intensity]


[Figure: log-intensity range split into levels with functions f_1(x), ..., f_q(x), ..., f_Q(x)]

Discretize y into Q = 20 quantiles and estimate Q independent functions $f_1(x), \ldots, f_Q(x)$

Linear regression: $f_q(x) = w_q^\top x$

Whole-genome Tiling Arrays Identification of Expression Differences

Identification of Expression Changes

1. Map tiling probes to annotated transcripts (define probe sets)

2. Use standard microarray tools to analyze gene expression

Gene expression values are typically computed using robust “summarization methods” that account for probe noise [e.g. Irizarry et al., 2003]

Significant expression changes are typically identified with a statistical test. Results have to be corrected for multiple testing [e.g. Storey and Tibshirani, 2003]

Advantages of tiling arrays:

Annotations change; only remapping is needed to obtain expression measurements for the latest annotation.

Expression can be measured per exon, not only per gene.

Expression can be measured for introns (⇒ detect retention).


Detection of Alternative Transcripts (I)

[Figure: hybridization intensity (log2) vs. position on Chr V [kb] (2993.5 to 2995.0) for the annotated transcripts AT5G09660.1, AT5G09660.2 and AT5G09660.3, showing tissue-specific intron retention. Arabidopsis tissues: roots, seedlings, young leaves, senescing leaves, stems, veg. shoot meristems, infl. shoot meristems, inflorescences, flowers, fruits, clv3-7 inflorescences]

Goal: Identify exon/intron segments that are differentially spliced in the analyzed samples.

Detection of Alternative Transcripts (II)

[Figure: hybridization intensity (log2) vs. position on Chr IV [kb] (6719.5 to 6721.5) for the annotated transcript AT4G10970.1 and an EST-based isoform, showing partial intron retention. Arabidopsis tissues: roots, seedlings, young leaves, senescing leaves, stems, veg. shoot meristems, infl. shoot meristems, inflorescences, flowers, fruits, clv3-7 inflorescences]

Goal: Identify exon/intron segments that show different intensities than other exons/introns in at least one analyzed sample.


Detecting Alternative Exons

Fit a gene expression model to exon array data [Irizarry et al., 2003]:

$$x_{ik} = g_k + p_i + \varepsilon_{ik}$$

with RNA hybridization signal $x_{ik}$, gene-wide expression effect $g_k$ in sample k, effect $p_i$ of probe i, and error terms $\varepsilon_{ik}$.

Detect alternatively spliced exons as outliers [Purdom et al., 2008] from large residuals $\varepsilon_{i'k'}$ for alternative exon probes i' in sample k'.

Test exon junction probes for different transcript isoforms for differential expression using e.g. the Kruskal-Wallis test [Sugnet et al., 2006].

More sophisticated methods use classification techniques. [Eichner, 2008, Eichner et al., 2009]
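The additive model x_ik = g_k + p_i + ε_ik and the residual-based outlier idea can be sketched on toy data. Note the swap: robust fits such as median polish are common for this model, whereas this sketch uses a plain least-squares fit (row and column means) to stay short. The function name, the constraint sum(p) = 0 and the data are assumptions for illustration.

```python
def additive_fit(x):
    """Least-squares fit of x[i][k] = g[k] + p[i] + eps[i][k] with sum(p) == 0.

    x is a list of probes (rows), each a list of log-intensities per sample.
    """
    n_probes, n_samples = len(x), len(x[0])
    grand = sum(map(sum, x)) / (n_probes * n_samples)
    g = [sum(row[k] for row in x) / n_probes for k in range(n_samples)]  # sample effects
    p = [sum(row) / n_samples - grand for row in x]                      # probe effects
    eps = [[x[i][k] - g[k] - p[i] for k in range(n_samples)]
           for i in range(n_probes)]
    return g, p, eps

# Toy data (hypothetical): probe 2 is unusually low in sample 1 only.
x = [
    [8.0, 9.0, 8.5],
    [7.0, 8.0, 7.5],
    [9.0, 7.0, 9.5],
    [8.5, 9.5, 9.0],
]
g, p, eps = additive_fit(x)
cells = [(a, b) for a in range(len(x)) for b in range(len(x[0]))]
outlier = max(cells, key=lambda ab: abs(eps[ab[0]][ab[1]]))  # largest |residual|
```

In this toy example the largest residual singles out probe 2 in sample 1, which is the pattern one would inspect as a candidate alternative exon.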


Whole-genome Tiling Arrays De Novo Transcript Discovery

Discovery of Expressed Transcripts

De novo transcript identification (segmentation) is needed to re-annotate expressed genes.

Desired segmentation into intergenic, intronic, and exonic regions.

Transfrag Method / Affymetrix TARs

1. Identify “positive probes” in local neighborhood. Smooth data locally, across replicates (pseudomedian [Royce et al., 2007a]: median of all pairwise averages of probe signals within a sliding window). Two approaches:

define an ad hoc threshold on smoothed signal intensity (e.g. 90th signal percentile) [Kampa et al., 2004]

estimate a threshold from negative bacterial control probes to adjust an empirical false discovery rate [He et al., 2007]

2. Combine positive probes into “transfrags” in case of a run of consecutive positive probes (minRun) interrupted by a limited number of negative probes (maxGap) [Bertone et al., 2004, Kampa et al., 2004]

Problem: Manual parameter “tuning”
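The two transfrag steps (pseudomedian smoothing, then joining positive probes with minRun/maxGap) can be sketched as follows. This is an illustrative sketch under stated assumptions: all function names are hypothetical, minRun and maxGap are counted in probes rather than nucleotides, and the pairwise averages include each value paired with itself.

```python
from itertools import combinations_with_replacement
from statistics import median

def pseudomedian(values):
    """Median of all pairwise averages (pairs i <= j, an assumption)."""
    return median((a + b) / 2 for a, b in combinations_with_replacement(values, 2))

def smooth(signal, half_window):
    """Pseudomedian within a sliding window centred on each probe."""
    return [pseudomedian(signal[max(0, i - half_window): i + half_window + 1])
            for i in range(len(signal))]

def call_transfrags(positive, min_run, max_gap):
    """Join positive probes into transfrags.

    Gaps of at most max_gap negative probes are bridged; a transfrag is kept
    if it spans at least min_run probes. Returns (start, end) index pairs,
    end inclusive.
    """
    frags, start, last_pos = [], None, None
    for i, is_pos in enumerate(positive):
        if not is_pos:
            continue
        if start is None:
            start = i
        elif i - last_pos - 1 > max_gap:          # gap too long: close the fragment
            if last_pos - start + 1 >= min_run:
                frags.append((start, last_pos))
            start = i
        last_pos = i
    if start is not None and last_pos - start + 1 >= min_run:
        frags.append((start, last_pos))
    return frags

positive = [0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 1]
frags = call_transfrags(positive, min_run=3, max_gap=1)
```

Both `min_run` and `max_gap` (and the threshold that defines `positive`) have to be chosen by hand, which is the tuning problem the slide points out.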


Dynamic Programming Segmentation

Model intensities as piecewise constant function [Huber et al., 2006]:

$$x_{ij} = \mu_s + \varepsilon_{ij} \quad \text{for } t_s \le i < t_{s+1}$$

for probe i on replicate array j with RNA hybridization signal $x_{ij}$; segment boundaries $t_s$ and $t_{s+1}$. $\mu_s$ is the mean signal of the s-th segment, $\varepsilon_{ij}$ are error terms.

Minimize the sum of squared residuals:

$$G(t_1, \ldots, t_S) = \sum_{s=1}^{S} \sum_{j=1}^{R} \sum_{i=t_s}^{t_{s+1}-1} (x_{ij} - \mu_s)^2$$

where S is the number of segments and R the number of replicates.

[Figure: signal vs. position, comparing the segmentation with a sliding window]

The optimal segmentation can be computed in $O(n^2)$ time using dynamic programming [Huber et al., 2006].

Problem: S has to be specified by the user
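The piecewise-constant model can be fitted with a textbook dynamic program over segment boundaries. The sketch below is an assumption-laden illustration, not the cited implementation: the function name is hypothetical, replicates are pooled per probe, and this simple version runs in O(S · n²) time.

```python
def segment(x, n_segments):
    """Optimal piecewise-constant segmentation by dynamic programming.

    x[i] is the list of replicate signals for probe i. Returns the start
    indices of the segments and the minimal residual sum of squares.
    """
    n = len(x)
    # Prefix sums (replicates pooled) give O(1) segment costs.
    cnt, s1, s2 = [0], [0.0], [0.0]
    for reps in x:
        cnt.append(cnt[-1] + len(reps))
        s1.append(s1[-1] + sum(reps))
        s2.append(s2[-1] + sum(v * v for v in reps))

    def cost(a, b):
        """Sum of squared residuals around the mean for probes a..b-1."""
        m, s = cnt[b] - cnt[a], s1[b] - s1[a]
        return (s2[b] - s2[a]) - s * s / m

    inf = float("inf")
    # best[s][b]: minimal cost of splitting probes 0..b-1 into s segments.
    best = [[inf] * (n + 1) for _ in range(n_segments + 1)]
    back = [[0] * (n + 1) for _ in range(n_segments + 1)]
    best[0][0] = 0.0
    for s in range(1, n_segments + 1):
        for b in range(s, n + 1):
            for a in range(s - 1, b):
                c = best[s - 1][a] + cost(a, b)
                if c < best[s][b]:
                    best[s][b], back[s][b] = c, a
    # Trace back the segment starts from the end of the sequence.
    starts, b = [], n
    for s in range(n_segments, 0, -1):
        b = back[s][b]
        starts.append(b)
    return starts[::-1], best[n_segments][n]

signal = [[0.0], [0.1], [-0.1], [3.0], [3.1], [2.9], [0.0], [0.1]]
starts, rss = segment(signal, n_segments=3)
```

As on the slide, the number of segments is an input here; choosing it automatically would need an extra model-selection criterion.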


Hidden Markov Models

[Figure: two-state HMM with states S (non-expressed) and E (expressed)]

Learn to label each probe given its hybridization signal and local context [Ji and Wong, 2005a, Du et al., 2006]

Train transition and emission probabilities on annotated genes

Explicit intron model [Zeller et al., 2008c]

Q discrete expression levels [Zeller et al., 2008c]

[Figure: HMM state diagrams. Two-state model (S non-expressed, E expressed) with transition scores φ(S,S), φ(S,E), φ(E,S), φ(E,E) and emission scores g_S(x), g_E(x); extension with intergenic (S), exonic (E) and intronic (I) states; extension with Q discrete expression levels (states E_1, ..., E_Q and I_1, ..., I_Q)]

Parametrization and Decoding

Parametrization θ:

Log transition probabilities $\phi(k, l)$ between states k and l

Log emission probabilities $g_k(x)$ in state k for (discretized) hybridization signal x

Scoring a sequence of hybridization signals x with a given labeling π and parametrization θ:

$$F_\theta(x, \pi) = \sum_{p=1}^{|\pi|} \left[ g_{\pi_p}(x_p) + \phi(\pi_{p-1}, \pi_p) \right]$$

S denotes the set of states, |π| the length of π

Decoding to obtain the best-scoring labeling for x: $\operatorname{argmax}_{\pi} F_\theta(x, \pi)$ (Viterbi decoding [Durbin et al., 1998])
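Viterbi decoding for this additive score can be sketched in a few lines. The code below is a minimal illustration: the function name, the toy emission and transition scores, and the choice to omit a transition term at the first position (the slide leaves π_0 unspecified) are all assumptions.

```python
def viterbi(x, states, g, phi):
    """Return the labeling maximizing F(x, pi) and its score.

    g[k](x_p) is the log emission score of state k, phi[(k, l)] the log
    transition score from k to l. No transition term is used at p = 1.
    """
    score = {k: g[k](x[0]) for k in states}
    back = []
    for x_p in x[1:]:
        prev, score, ptr = score, {}, {}
        for l in states:
            k_best = max(states, key=lambda k: prev[k] + phi[(k, l)])
            score[l] = prev[k_best] + phi[(k_best, l)] + g[l](x_p)
            ptr[l] = k_best
        back.append(ptr)
    last = max(states, key=lambda k: score[k])
    path = [last]
    for ptr in reversed(back):       # follow back-pointers to the start
        path.append(ptr[path[-1]])
    return path[::-1], score[last]

# Hypothetical two-state model: S (non-expressed, signal near 6), E (expressed, near 10).
states = ["S", "E"]
g = {"S": lambda v: -abs(v - 6.0), "E": lambda v: -abs(v - 10.0)}
phi = {("S", "S"): -0.1, ("S", "E"): -2.3, ("E", "S"): -2.3, ("E", "E"): -0.1}

path_block, _ = viterbi([6, 6, 10, 10, 10, 6], states, g, phi)
path_spike, score_spike = viterbi([6, 6, 9, 6, 6], states, g, phi)
```

With these toy scores a sustained block of high signal is labeled E, while a single high probe is absorbed into S because two state switches cost more than one poor emission. This is the "local context" effect of the transition scores.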


Training an HMM

Training sequences: signals $x^i$ and labels $\pi^i$ for i = 1, ..., n.

Log transition probabilities [Durbin et al., 1998]:

$$\phi(k, l) = \log\left(\frac{A_{k,l}}{\sum_{l'} A_{k,l'}}\right) \quad \text{for all state pairs } (k, l) \in S^2$$

counting observed transitions:

$$A_{k,l} = \sum_{i=1}^{n} \sum_{p=1}^{|\pi^i|} [[\pi^i_p = k \wedge \pi^i_{p+1} = l]]$$

Log emission probabilities [Durbin et al., 1998] for piecewise constant $g_k$ with L levels (ranging from $t_l$ to $t_{l+1}$):

$$g_{k,l} = \log\left(\frac{E_l}{\sum_{l'} E_{l'}}\right)$$

counting discrete signal values:

$$E_l = \sum_{i=1}^{n} \sum_{p=1}^{|\pi^i|} [[\pi^i_p = k \wedge t_l < x^i_p \le t_{l+1}]]$$

HMMs can also be (re-)trained in an unsupervised fashion [e.g. Munch et al., 2006]
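Supervised training by counting can be written down almost literally. The sketch below uses hypothetical function names and toy data; it assumes every signal lies in the interval (t_0, t_L], and it returns -inf for events that were never observed (no pseudocounts).

```python
import math
from bisect import bisect_left
from collections import Counter

def _log_normalize(counts):
    """log(c / sum(counts)) per entry, -inf for zero counts."""
    total = sum(counts)
    return [math.log(c / total) if c > 0 else float("-inf") for c in counts]

def train_transitions(labelings, states):
    """phi[(k, l)] = log(A[k,l] / sum_l' A[k,l']) from observed transitions."""
    A = Counter()
    for pi in labelings:
        for k, l in zip(pi, pi[1:]):
            A[(k, l)] += 1
    phi = {}
    for k in states:
        row = _log_normalize([A[(k, l)] for l in states])
        phi.update({(k, l): v for l, v in zip(states, row)})
    return phi

def train_emissions(signals, labelings, states, thresholds):
    """g[k][l] = log(E_l / sum_l' E_l'); level l covers thresholds[l] < x <= thresholds[l+1]."""
    E = {k: [0] * (len(thresholds) - 1) for k in states}
    for x, pi in zip(signals, labelings):
        for x_p, k in zip(x, pi):
            E[k][bisect_left(thresholds, x_p) - 1] += 1
    return {k: _log_normalize(E[k]) for k in states}

# Toy training set (hypothetical): S = non-expressed, E = expressed.
labelings = ["SSEES", "SEEES"]
signals = [[6, 7, 11, 7, 6], [5, 11, 12, 13, 7]]
phi = train_transitions(labelings, "SE")
g = train_emissions(signals, labelings, "SE", thresholds=[4, 8, 14])
```

In practice one would usually add pseudocounts so that unseen transitions or signal levels do not receive a score of -inf.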


Hidden Markov SVMs

Enforce a large margin (cf. gene finding) between the correct labeling $\pi^{(i)}$ and any other labeling $\pi \ne \pi^{(i)}$:

$$F_\theta(x^{(i)}, \pi^{(i)}) - F_\theta(x^{(i)}, \pi) \gg 0 \quad \forall \pi \ne \pi^{(i)},\ \forall i = 1, \ldots, n$$

[Figure: learned scores as a function of hybridization signal (5 to 14) for the states S, E and I]

Method Comparison

[Figure: precision [%] vs. recall [%] curves for HMM, HM-SVM and transfrags (two panels)]

Recall: Proportion of annotated exons/introns covered by predictions.

Precision: Proportion of predictions covered by annotated exons/introns.


Whole-genome Tiling Arrays Differential TARs

Identification of Differential TARs

[Figure: tiling array signal under salt stress vs. mock control along annotated genes, with experimental validation by RT-PCR]

Apply a statistical test for significant expression change to the signal from transcriptionally active regions (TARs) defined by previous segmentation [Zeller et al., 2009].


Next-generation Sequencing Technology & Limitations

Sequencing Techniques & Applications

Applications of DNA/RNA sequencing:

De novo genome sequencing

Genome resequencing

Transcriptome sequencing

Methylation analysis

Sequencing Technology

Capillary/Sanger sequencing

Pyrosequencing (Roche/454)

SOLiD sequencing (ABI)

Flow cell sequencing (Illumina)

Single molecule sequencing (Nanopores, etc.)

} Next Generation Sequencing

Illumina Sequencing

Solexa released a sequencing machine in 2006

Fragment sizes from 28 to 75

Probes are fixed to a glass plate (“flow cell”)

Reagents are directed through the flow cell

(see Movie)

Illumina Sequencing

Flow cell preparation

Bridge amplification

Synthesize second strand

Denature to single-stranded samples

After several cycles, clusters are ready for sequencing

Sequence the fragments (see Movie)

[Ossowski, 2007]

[Figure: sequencing the fragments: laser, camera, image analysis]

SOLiD Sequencing

Sequencing by ligation: Fragments ligated to “beads”

PCR, beads enriched with fragments, ends of the templates modified to allow for an attachment to the slide

Beads are deposited onto a glass slide

Di-base probes compete for ligation to the sequencing primer


SOLiD Sequencing - Color Space

4 fluorescent dyes for 16 possible 2-mers

Reverse, complement and reverse complement are always of the same color
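The symmetry property of the color code can be verified mechanically. The sketch below writes the di-base code as an XOR of 2-bit base codes (A=0, C=1, G=2, T=3), a common way to express it; the mapping of color numbers to actual dyes and all function names are assumptions for illustration.

```python
BASES = "ACGT"  # 2-bit codes: A=0, C=1, G=2, T=3

def color(dimer):
    """Color (0-3) of a 2-mer: XOR of the 2-bit codes of its two bases."""
    return BASES.index(dimer[0]) ^ BASES.index(dimer[1])

def complement(seq):
    """Base-wise complement (A<->T, C<->G)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))

def encode(seq):
    """Color-space encoding: first base followed by one color per adjacent pair."""
    return seq[0] + "".join(str(color(seq[i:i + 2])) for i in range(len(seq) - 1))

dimers = [a + b for a in BASES for b in BASES]
# Each dimer shares its color with its reverse, complement and reverse complement.
symmetric = all(
    color(d) == color(d[::-1]) == color(complement(d)) == color(complement(d)[::-1])
    for d in dimers
)
encoded = encode("ATGGCA")
```

Because each color stands for four different 2-mers, a color sequence can only be translated back to bases when the first base is known, which is why `encode` keeps it.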


Overview / Extensions

Technology       Read length    Output/run     Run time
Illumina GA II   40-75 bp       ≈ 6-20 Gbp     5-8 d
ABI SOLiD        35-50 bp       ≈ 6-15 Gbp     6-7 d
Roche/454        300-500 bp     ≈ 100 Mbp      7 h
Sanger           1000 bp        ≈ 67 kbp       1 h

There are several extensions available:

mate-pair / paired-end

bisulfite treatment

multiplexing: 8 × 12 = 96 samples per flow cell

[Sutskever, 2008]

Page 121: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

bisulfite treatment

[Figure: read pair CCGTTT / TATTTTTT; labels: 75, 75, 4K]

[Sutskever, 2008]

Page 124: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

Next-generation Sequencing Assembly & Read Mapping

Short Reads Analysis - Methods

Given read data, the following analysis steps are possible:

Assembly

Mapping/Alignments

[Figure: short reads (e.g. CCGTTT, TATTTTTT, AGATAAA, CTCTGTA) and the assembled genome ACGTACCGTTTGACTCTAGTATCTTCTAGTAGATATTTTTTTTTTAGATAAAA; labels: Reads, Assembled genome, "magic"]

[Sutskever, 2008]

Problem: 100 million reads of short length

⇒ Big computational challenge

Page 128: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

Short Reads Analysis - Problems

Experiment leads to millions of reads of 36-75 nt

Reads may have a position-wise varying quality

Quality corresponds to error probability:

$q = -10 \cdot \log_{10}\left(\frac{p}{1-p}\right)$

Example: If we have an error probability of $10^{-3}$ per base, then the quality is approximately 30
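The quality formula can be evaluated directly. The following is a minimal sketch of the definition above and its inverse; the function names are illustrative.

```python
import math


def quality_from_error(p):
    """Quality as defined above: q = -10 * log10(p / (1 - p))."""
    if not 0.0 < p < 1.0:
        raise ValueError("error probability must lie strictly between 0 and 1")
    return -10.0 * math.log10(p / (1.0 - p))


def error_from_quality(q):
    """Inverse of quality_from_error: p = 1 / (1 + 10^(q / 10))."""
    return 1.0 / (1.0 + 10.0 ** (q / 10.0))


for p in (1e-1, 1e-2, 1e-3):
    print(f"p = {p:g}  ->  q = {quality_from_error(p):.2f}")
```

For small p the odds p / (1 - p) are close to p itself, which is why an error probability of 10^-3 gives a quality very close to 30.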

Page 129: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

Short Reads Assembly

Read assembly problem: For a set of reads stemming from a reference genome, find maximally overlapping parts in order to reconstruct the genomic sequence

Classical assembly: ⇒ Too inefficient for short reads

1 Overlap phase: Every read is compared with every other read and the overlap graph is computed

2 Layout phase: Pairs are determined that position every read in the assembly

3 Consensus phase: Multi-alignment of all the placed reads is produced to obtain the final sequence

New techniques: Plethora of tools available (EULER, VELVET, SHARCGS, SSAKE/VCAKE, ...)

Idea: de Bruijn Graphs
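To see why the overlap phase does not scale, here is a naive sketch of it. The three toy reads, the minimum overlap of 3 and the function names are made up for illustration, and the reads are assumed to be distinct; real overlap-layout-consensus assemblers use indexing rather than this brute-force comparison.

```python
from itertools import permutations


def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b` (0 if shorter than min_len)."""
    for length in range(min(len(a), len(b)), max(min_len, 1) - 1, -1):
        if a[-length:] == b[:length]:
            return length
    return 0


def overlap_graph(reads, min_len=3):
    """Compare every read with every other read and keep the overlapping ordered pairs."""
    graph = {}
    for a, b in permutations(reads, 2):  # n * (n - 1) ordered pairs
        length = overlap(a, b, min_len)
        if length > 0:
            graph[(a, b)] = length
    return graph


toy_reads = ["ACGTAC", "GTACCG", "ACCGTT"]  # hypothetical reads
for (a, b), length in overlap_graph(toy_reads).items():
    print(f"{a} -> {b} (overlap {length})")
```

Three reads already need 6 comparisons; with 100 million reads the same loop would visit about 10^16 ordered pairs.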

Page 133: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

Short Reads Assembly - de Bruijn Graphs

Example:

[Figure: binary de Bruijn graph with the nodes 000, 001, 010, 011, 100, 101, 110, 111]

Reads are mapped as a path in the graph

Number of reads does not influence number of nodes

⇒ Use de Bruijn graphs to solve the problem

[Wikipedia]
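The binary example graph can be generated in a few lines. This is a sketch of the textbook construction: every word of length n is a node, and an edge leads from a word to each word obtained by dropping its first symbol and appending a new one. The function name is illustrative.

```python
from itertools import product


def de_bruijn_graph(alphabet, n):
    """Map every length-n word to its successors (drop the first symbol, append one)."""
    return {
        "".join(word): ["".join(word[1:]) + symbol for symbol in alphabet]
        for word in product(alphabet, repeat=n)
    }


graph = de_bruijn_graph("01", 3)
for node in sorted(graph):
    print(node, "->", ", ".join(graph[node]))
```

The node set is fixed by the alphabet and the word length alone, which is the reason the number of reads does not influence the number of nodes.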

Page 134: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

de Bruijn Graphs

Example: de Bruijn graph for two reads:

1. TAGACTTATTGACCA
2. TAGACTTATTGCC.....

[Figure: de Bruijn graph; extracted node labels: TAGAC, AGACT, ACTGA, CTGAT, TGATT, GATTG, then two branches: ATTGA, TTGAC, TGACC, GACCA and ATTGC, TTGCC]

[Zerbino and Birney, 2008]

Nodes represent k-mers smaller than read length

A k-mer can refer to thousands of reads containing it

Read errors or ambiguities lead to branching of paths

Each node also stores the reverse complement
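A minimal sketch of how reads become paths in such a graph: every k-mer of a read is a node, and consecutive k-mers of the same read are joined by an edge. The two reads are copied as printed above and k = 5 is taken from the node labels; the function name is illustrative. Unlike Velvet, this sketch neither pairs nodes with their reverse complement nor compresses linear chains.

```python
from collections import defaultdict


def build_graph(reads, k):
    """Nodes are k-mers; an edge links two k-mers that are adjacent in some read."""
    successors = defaultdict(set)
    counts = defaultdict(int)  # number of times each k-mer was observed
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for kmer in kmers:
            counts[kmer] += 1
        for left, right in zip(kmers, kmers[1:]):
            successors[left].add(right)
    return successors, counts


reads = ["TAGACTTATTGACCA", "TAGACTTATTGCC"]
successors, counts = build_graph(reads, k=5)

# A node with more than one successor is where the two read paths diverge.
branches = {node: sorted(nxt) for node, nxt in successors.items() if len(nxt) > 1}
print(len(counts), "nodes; branching at:", branches)
```

The shared prefix of the two reads contributes each k-mer only once (with a count of 2), and the graph branches only where the reads differ.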

Page 138: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

Short Reads Assembly - Example

Consider a constructed de Bruijn graph

Unconnected nodes

Ambiguous paths

Erroneous edges

[Figure: de Bruijn graph with the nodes A, B, B', C, C', X, E]

Read off genome sequence (if everything goes well ;-)

[Zerbino and Birney, 2008]

Page 139: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

[Figure, successive builds of the same graph: node X disappears first, then B' and C', leaving A, B, C, E]

Page 143: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab

Results for Velvet

[…] Velvet uses slightly more memory, it is significantly faster and produces larger contigs, without mis-assembly. Furthermore, it covers a large area of the genome with high precision.

We also tried using SHARCGS (Dohm et al. 2007) and EULER (Pevzner et al. 2001) but were not able to make these programs work with our data sets. This is probably due to differences in the expected input, particularly in terms of coverage depth and read length.

Discussion

We have developed Velvet, a novel set of de Bruijn graph-based sequence assembly methods for very short reads that can both remove errors and, in the presence of read pair information, resolve a large number of repeats. With unpaired reads, the assembly is broken when there is a repeat longer than the k-mer length. With the addition of short reads in read pair format, many of these repeats can be resolved, leading to assemblies similar to draft status in bacteria and reasonably long (∼5 kb) SCSCs in eukaryotic genomes.

For the latter genomes, the short read contigs will probably have to be combined with long reads or other sequencing strategies such as BAC or fosmid pooling. Simulations of Breadcrumb produced virtually identical N50 lengths on both a continuous 5-Mb region and a discontinuous 5-Mb region made up of random 150-kb BACs, with twofold variation in BAC concentration (data not shown). This approach would then require merging local assemblies.

Sequence connected supercontigs have considerably more information than gapped supercontigs, in that the sequence content separating the definitive contigs is an unresolved graph. One can easily imagine methods that can exclude the presence of a novel sequence in the SCSC completely by considering the potential paths in the unresolved sequence regions, in contrast to traditional supercontigs, where one can never make such a claim. In addition, the unresolved regions will often be dispersed repeats, and as such the classification of such regions as repeats is more important than their sequence content for many applications.

It is important to emphasize that assembly is not a solved problem, in particular with very short reads, and there will continue to be considerable algorithmic improvements. Velvet can already convert high-coverage very short reads into reasonably sized contigs with no additional information. With additional paired read information to resolve small repeats, almost complete genomes can be assembled. We believe the Velvet framework will provide a rich set of different algorithmic options tailored to different tasks and thus provide a platform for cheap de novo sequence assemblies, eventually for all genomes.

Methods

Velvet parameters

Velvet was implemented in C and tested on a 64-bit Linux machine.

The results of Velvet are very sensitive to the parameter k as mentioned previously. The optimum depends on the genome, the coverage, the quality, and the length of the reads. One approach consists in testing several alternatives in parallel and picking the best.

Another method consists in estimating the expected number X of times a unique k-mer in a genome of length G is observed in a set of n reads of length l. We can link this number to the traditional value of coverage, noted C, with the relations:

$$E(X) = \frac{n\,(l - k + 1)}{G - k + 1} \approx \frac{n}{G}\,(l - k + 1) = C\,\frac{l - k + 1}{l}$$
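The relation between read coverage C and expected k-mer coverage E(X) can be tabulated for a few values of k. This is a sketch with hypothetical numbers (a 5-Mb genome, 35 bp reads, 50× coverage); the function names are illustrative, and C = n·l / G follows from the last equality in the relation above.

```python
def expected_kmer_coverage(n, l, k, G):
    """Exact form: E(X) = n (l - k + 1) / (G - k + 1)."""
    return n * (l - k + 1) / (G - k + 1)


def kmer_coverage_from_C(C, l, k):
    """Approximation: E(X) ≈ C (l - k + 1) / l, with C = n l / G."""
    return C * (l - k + 1) / l


# Hypothetical example values, not taken from the paper.
G, l, C = 5_000_000, 35, 50
n = C * G // l  # number of reads needed for coverage C

for k in (21, 25, 31):
    exact = expected_kmer_coverage(n, l, k, G)
    approx = kmer_coverage_from_C(C, l, k)
    print(f"k = {k}: exact {exact:.2f}, approx {approx:.2f}")
```

The larger k is relative to the read length l, the fewer k-mers each read contributes, so the k-mer coverage drops well below the nominal read coverage.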

Figure 6. Breadcrumb performance on simulated data sets. As in Figure 3, we sampled 5-Mb DNA sequences from four different species (E. coli, S. cerevisiae, C. elegans, and H. sapiens, respectively) and generated 50× read sets. The horizontal lines represent the N50 reached at the end of Tour Bus (see Fig. 3) (broken black line) and after applying a 4× coverage cutoff (broken red line). Note how the difference in N50 between the graph of perfect reads and that of erroneous reads is significantly reduced by this last cutoff. (Black curves) The results after the basic Breadcrumb algorithm; (red curves) the results after super-contigging.

Table 3. Comparison of short read assemblers on experimental Streptococcus suis Solexa reads

Assembler    No. of contigs   N50       Average error rate   Memory   Time           Seq. Cov.
Velvet 0.3   470              8661 bp   0.02%                2.0G     2 min 57 sec   97%
SSAKE 2.0    265              1727 bp   0.20%                1.7G     1 h 47 min     16%
VCAKE 1.0    7675             1137 bp   0.64%                1.8G     4 h 25 min     134%
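The N50 column is not defined on the slide. Under the usual definition (the contig length at which the contigs, taken from longest to shortest, first account for at least half of all assembled bases), it can be computed as sketched below. The contig lengths in the example are made up.

```python
def n50(contig_lengths):
    """Contig length at which the running total (longest first) reaches half of all bases."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("no contigs given")


print(n50([8, 5, 3, 2, 2]))  # 5
```

A higher N50 means that more of the assembly lies in long contigs, which is how the 8661 bp of Velvet compares with the shorter contigs of the other two assemblers in the table.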

Source: "Short read de novo assembly using de Bruijn graphs", Genome Research, p. 827 (www.genome.org); downloaded from genome.cshlp.org on March 21, 2009, published by Cold Spring Harbor Laboratory Press

Considerably shorter fragments for larger genomes

Page 144: Next Generation Sequencing, Tiling Arrays and - R¤tsch Lab