machine learning methods for rna-seq-based transcriptome … · machine learning methods for...

Post on 19-Aug-2020

11 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Machine Learning Methods for RNA-seq-based

Transcriptome Reconstruction

Gunnar Ratsch

Friedrich Miescher Laboratory

Max Planck Society, Tubingen, Germany

StatSeq Workshop, Gent (Mai 20, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 1 / 28

Motivation

Discovery of the Nuclein(Friedrich Miescher, 1869) fml

Discovery of Nuclein:from lymphocyte & salmon

“multi-basic acid” (≥ 4)

Tubingen, around 1869

“If one . . . wants to assume that a single substance . . . is the specificcause of fertilization, then one should undoubtedly first and foremostconsider nuclein” (Miescher, 1874)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 2 / 28

Motivation

Discovery of the Nuclein(Friedrich Miescher, 1869) fml

Discovery of Nuclein:from lymphocyte & salmon

“multi-basic acid” (≥ 4)

Tubingen, around 1869

“If one . . . wants to assume that a single substance . . . is the specificcause of fertilization, then one should undoubtedly first and foremostconsider nuclein” (Miescher, 1874)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 2 / 28

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28

Motivation

Learning about the Transcriptome What is encoded on the genome and how is it processed? fml

Computational Point of View

How to learn to predict what these processes accomplish?

How well can we predict it from the available information?

Biological View

What can we not predict yet? What is missing?

Can we derive a deeper understanding of these processes?c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 3 / 28

Motivation

Gene-by-Gene vs. Genome-wide Analysesfml

Cloning individual genes during the last 30 years revealedimportant hints to the complex structure of eukaryotic genes

[ENCODE Project Consortium (2004). Science 306:636]

Comprehensive understanding of gene regulation necessarily hasto include the entire organism’s gene collection as well as itsintergenic regions.

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 4 / 28

Motivation

Sequencing Genomesfml

Concerted effort to sequence genomes of model organisms:

E. coli 4.5 Mb (1997)S. cerevisiae 6.0 Mb (1997)C. elegans 98 Mb (1998)Human 3 Gb (2000)

Estimated cost: $2.7 billion in 1991 dollarsEstimated time in 1990: 15 years

In 2003, NHGRI committed to develop next-generationsequencing technologies to lower the cost of 30x a humangenome (∼100 Gbp):

$100,000 genome$1,000 genome

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 5 / 28

Motivation

Sequencing Genomesfml

Concerted effort to sequence genomes of model organisms:

E. coli 4.5 Mb (1997)S. cerevisiae 6.0 Mb (1997)C. elegans 98 Mb (1998)Human 3 Gb (2000)

Estimated cost: $2.7 billion in 1991 dollarsEstimated time in 1990: 15 years

In 2003, NHGRI committed to develop next-generationsequencing technologies to lower the cost of 30x a humangenome (∼100 Gbp):

$100,000 genome$1,000 genome

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 5 / 28

Motivation

New Sequencing Technologiesfml

short many

long few

SoLID (2007)

Genome Analyzer (2007)

454 Sequencing (2005)

SMRT (2010)

[Slide provided by Ali Mortazavi]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 6 / 28

Motivation

The Encyclopedia of DNA Elementsfml

How to make sense of the genome?

Which nucleotides are functional?What is their function?

National Human Genome Research Institute started the Encodeproject to annotate human and model organism genomes:

2004: Human 1%2007: Human whole-genome2008: Drosophila and C. elegans2010: Mouse

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 7 / 28

Motivation

The Encyclopedia of DNA Elementsfml

How to make sense of the genome?

Which nucleotides are functional?What is their function?

National Human Genome Research Institute started the Encodeproject to annotate human and model organism genomes:

2004: Human 1%2007: Human whole-genome2008: Drosophila and C. elegans2010: Mouse

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 7 / 28

Motivation

Sequencing for Functional Genomicsfml

[Wold & Myers, Nature Methods, 2007]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 8 / 28

Motivation

contiguous reads

splice-crossing reads

enriched regions

novel transfrags

density on known exons

Map reads

Aggregate and identify

Analyze

binding sources

motif finding

associated genes

novel gene models

novel splice isoforms

expression levels

differential expression

ChIP-seq RNA-seq discovery

RNA-seq quantification

RNA-seq, ChIP-seq, and external data Integrate

de novo transcript assembly

Info

rmat

ion

extra

ctio

n

[Pep

keet

al.,

2009

]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 9 / 28

RNA-Seq Pipeline

Deep RNA Sequencing (RNA-Seq)fml

RNA-Seq allows . . .

High-throughputtranscriptomemeasurements

Qualitative studies

Quantitative studies athigh resolution

pre-mRNAintron

exon

mRNAshort reads

junction reads

reference genomeFigure adapted from Wikipedia

Goal: Obtain complete transcriptome for further analyses

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 10 / 28

RNA-Seq Pipeline

Deep RNA Sequencing (RNA-Seq)fml

RNA-Seq allows . . .

High-throughputtranscriptomemeasurements

Qualitative studies

Quantitative studies athigh resolution

pre-mRNAintron

exon

mRNAshort reads

junction reads

reference genomeFigure adapted from Wikipedia

Goal: Obtain complete transcriptome for further analyses

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 10 / 28

RNA-Seq Pipeline

RNA-Seq Pipeline Overviewfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 11 / 28

RNA-Seq Pipeline

RNA-Seq Pipeline Overviewfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 11 / 28

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

More info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

ACCGTCGCGCGCGT...TCGGCG...AGAACGCT

TCGCGCGCAACGTCGC CGCG GCGC CGCG GCGC CGCA GCAA CAAC AACG

...matchingk-mers

DNA

Read

k-mers

...

More info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

ACCGTCGCGCGCGT...TCGGCG...AGAACGCT

TCGCGCGCAACGTCGC CGCG GCGC CGCG GCGC CGCA GCAA CAAC AACG

...matchingk-mers

DNA

Read

k-mers

...

More info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28

PALMapper Read Alignment Overview

Step 1: PALMapper Read Alignment(PALMapper = QPALMA + GenomeMapper) fmlGenomeMapper for (unspliced) read mapping:

Alignments based on GenomeMapper developed in Tubingen forthe 1001 plant genome project [Schneeberger et al., 2009]

k-mer based index, well suited for smaller genomes

QPALMA for spliced read alignments:

GenomeMapper identifies seed regions

Spliced alignments by QPALMA [De Bona et al., 2008]

ACCGTCGCGCGCGT...TCGGCG...AGAACGCT

TCGCGCGCAACG

DNA

ReadMore info: http://fml.mpg.de/raetsch/suppl/palmapper

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 12 / 28

PALMapper Read Alignment Overview

QPALMA: Extended Smith-Waterman Scoringfml

Classical scoring f : Σ× Σ→ R

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28

PALMapper Read Alignment Overview

QPALMA: Extended Smith-Waterman Scoringfml

Classical scoring f : Σ× Σ→ R

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28

PALMapper Read Alignment Overview

QPALMA: Extended Smith-Waterman Scoringfml

Classical scoring f : Σ× Σ→ R

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28

PALMapper Read Alignment Overview

QPALMA: Extended Smith-Waterman Scoringfml

Quality scoring f : (Σ× R)× Σ→ R [De Bona et al., 2008]

Source of information

Sequence matches

Computational splicesite predictions

Intron length model

Read qualityinformation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 13 / 28

PALMapper Read Alignment Overview

QPALMA RNA-Seq Read Alignmentfml

Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality [De Bona et al., 2008]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 14 / 28

PALMapper Read Alignment Overview

QPALMA RNA-Seq Read Alignmentfml

Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality

Error vs. intron position

[De Bona et al., 2008]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 14 / 28

PALMapper Read Alignment Overview

QPALMA RNA-Seq Read Alignmentfml

Generate set of artificially spliced readsGenomic reads with quality informationGenome annotation for artificially splicing the readsUse 10, 000 reads for training and 30, 000 for testing

Alig

nmen

t Er

ror

Rat

e

SmithW14.19%

Intron9.96% 1.94% 1.78%

Intron+Splice

Intron+Splice+Quality

8nt12nt12nt8ntError vs. intron position

[De Bona et al., 2008]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 14 / 28

PALMapper Read Alignment Overview

Step 2: Transcript Predictionfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

A. Coverage segmentation algorithm mTiM for general transcripts(no coding bias/assumption)

B. Extension of the mGene gene finding system to use NGS datafor protein coding transcript prediction (mGene.ngs)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 15 / 28

Next Generation Gene Finding Idea

Computational Gene Finding Labeling the Genome fml

DNA

pre-mRNA

mRNA

Protein

polyAcap

TSS

SpliceDonor

SpliceAcceptor

SpliceDonor

SpliceAcceptor

TIS Stop

polyA/cleavage

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 16 / 28

Next Generation Gene Finding Idea

Computational Gene Finding Labeling the Genome fml

DNA

pre-mRNA

mRNA

Protein

polyAcap

TSS Donor Acceptor Donor Acceptor

TIS Stop

polyA/cleavage

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 16 / 28

Next Generation Gene Finding Idea

Computational Gene Finding Labeling the Genome fml

DNA

pre-mRNA

mRNA

Protein

polyAcap

TSS Donor Acceptor Donor Acceptor

TIS Stop

polyA/cleavage

TSS TIS cleaveStop

Don Acc

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 16 / 28

Next Generation Gene Finding Idea

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

STEP 1: SVM Signal Predictions

genomic position

genomic position

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28

Next Generation Gene Finding Idea

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28

Next Generation Gene Finding Idea

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

Wrong gene model

large margin

F(x,y)

transform features

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28

Next Generation Gene Finding Idea

mGene-based Transcript Predictionfml

acc

don

tss

tis

stop

True gene model 2 3 4 5

Wrong gene model

large margin

F(x,y)

STEP 1: SVM Signal Predictions

STEP 2: Integration

genomic position

genomic position

transform features

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 17 / 28

Next Generation Gene Finding Modeling Uncertainty

Learning to use Expression Measurementsfml

Two approaches:

Heuristic to incorporate ESTs/reads/tiling array measurementsto refine predictions

Directly use evidence during learning to learn to appropriatelyweight its importance

Exon Level Transcript LevelSN SP F SN SP F

ab initio 82.3 82.6 82.5 43.1 49.5 46.1ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9RNA-Seq trainedRNA-Seq/ESTs trained

Gene prediction in C. elegans (CDS evaluation)

Behr et al., in prep., 2010

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 18 / 28

Next Generation Gene Finding Modeling Uncertainty

Learning to use Expression Measurementsfml

Two approaches:

Heuristic to incorporate ESTs/reads/tiling array measurementsto refine predictions

Directly use evidence during learning to learn to appropriatelyweight its importance

Exon Level Transcript LevelSN SP F SN SP F

ab initio 82.3 82.6 82.5 43.1 49.5 46.1ESTs heuristic 85.3 84.7 85.0 49.5 56.4 52.7ESTs trained 84.8 85.8 85.3 50.5 57.8 53.9RNA-Seq trained 84.6 84.9 84.8 49.1 55.2 52.0RNA-Seq/ESTs trained 84.7 86.9 85.8 50.3 60.5 54.9

Gene prediction in C. elegans (CDS evaluation)

Behr et al., in prep., 2010

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 18 / 28

Next Generation Gene Finding Results

Preliminary Evaluation (C. elegans)fml

CDS (precision+recall)/2

expression percentiles [%]

mGene ab initiomGene.ngs

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 19 / 28

Next Generation Gene Finding Results

Preliminary Evaluation (C. elegans)fml

CDS (precision+recall)/2

expression percentiles [%]

mTiMmGene ab initiomGene.ngs

10 20 30 40 50 60 70 80 90 1000

0.1

0.2

0.3

0.4

0.5

0.6

0.7

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 19 / 28

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alternative Transcripts: Spliced reads for splicing graphs:

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alternative Transcripts: Spliced reads for splicing graphs:

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alternative Transcripts: Spliced reads for splicing graphs:

Transcript prediction(result of mGene/mTIM)

Spliced reads (result of Palmapper)

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alternative Transcripts: Spliced reads for splicing graphs:

Transcript prediction(result of mGene/mTIM)

Spliced reads (result of Palmapper)

Infered edges

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28

Next Generation Gene Finding Alternative Transcripts

Digestionfml

mTiM and mGene.ngs predict single transcripts

mTiM exploits “uniformity” of read coverage among exons ofsame transcript

mGene.ngs uses more assumptions on structure of transcripts

Alternative Transcripts: Spliced reads for splicing graphs:

Transcript prediction(result of mGene/mTIM)

Spliced reads (result of Palmapper)

Infered edges

Paths through splicing graph define alternative transcripts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 20 / 28

Next Generation Gene Finding Alternative Transcripts

Prediction of Alternative Transcriptsfml

Combine gene finding with prediction of alternative splicingMachine learning challenge:

Input: DNA sequenceOutput: All possible transcripts (not just one)

First step: Predict “simple splicing graphs”

Conceptually works . . .But, only with clean data we train the model!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28

Next Generation Gene Finding Alternative Transcripts

Prediction of Alternative Transcriptsfml

Combine gene finding with prediction of alternative splicing

Machine learning challenge:

Input: DNA sequenceOutput: All possible transcripts (not just one)

First step: Predict “simple splicing graphs”

TSS TIS cleaveStop

Don Acc

A-Don A-AccExonskip

Conceptually works . . .

But, only with clean data we train the model!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28

Next Generation Gene Finding Alternative Transcripts

Prediction of Alternative Transcriptsfml

Combine gene finding with prediction of alternative splicing

Machine learning challenge:

Input: DNA sequenceOutput: All possible transcripts (not just one)

First step: Predict “simple splicing graphs”

TSS TIS cleaveStop

Don Acc

A-Don A-AccExonskip

Conceptually works . . .

But, only with clean data we train the model!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28

Next Generation Gene Finding Alternative Transcripts

Prediction of Alternative Transcriptsfml

Combine gene finding with prediction of alternative splicing

Machine learning challenge:

Input: DNA sequenceOutput: All possible transcripts (not just one)

First step: Predict “simple splicing graphs”

TSS TIS cleaveStop

Don Acc

A-Don A-AccExonskip

Conceptually works . . .

But, only with clean data we train the model!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 21 / 28

Transcript Quantitation with rQuant

RNA-Seq Pipeline Overviewfml

PALMapper

mGene.ngs

mTim Segmentation

rQuant

Short Reads Transcripts w/ QuantitationAlignm

ents

Transcripts

Read alignment Transcript finding Quantitation

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 22 / 28

Transcript Quantitation with rQuant Biases

RNA-Seq Biases and Quantitationfml

Sequencer short sequence reads

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

fragmentation

sequence library

RT

fragmen-tation

RT

exonic reads

genome

transcripts

junction reads

aligned reads

Biases due to . . .

cDNA library construction

Sequencing

Read mapping

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

80

(average over annotated

transcripts of length ≈1kb for the

C. elegans SRX001872 dataset)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 23 / 28

Transcript Quantitation with rQuant Biases

RNA-Seq Biases and Quantitationfml

Sequencer short sequence reads

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

AAAAAAA

mRNA

fragmentation

sequence library

RT

fragmen-tation

RT

exonic reads

genome

transcripts

junction reads

aligned reads

Biases due to . . .

cDNA library construction

Sequencing

Read mapping

relative transcript position 5’ -> 3’

read

cov

erag

e 0

20

40

60

80

(average over annotated

transcripts of length ≈1kb for the

C. elegans SRX001872 dataset)c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 23 / 28

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

A

read

cov

erag

e

Short transcript

relative transcript position 5’ -> 3’

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri )

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

A

B

read

cov

erag

e

Short transcript

relative transcript position 5’ -> 3’

read

cov

erag

e

Long transcript

relative transcript position 5’ -> 3’

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri )

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

A

B

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri )

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

read

cov

erag

e

genome position 5’ -> 3’

Mixture of transcripts

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

MA

B

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri )

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28

Transcript Quantitation with rQuant Approach

rQuant – Basic Ideafml

read

cov

erag

e

genome position 5’ -> 3’

Mixture of transcripts

expected

observed

read

cov

erag

e

genome position 5’ -> 3’

Short transcript

read

cov

erag

e

genome position 5’ -> 3’

Long transcript

MA

B R

Mi = wAAi + wBBi ⇒ minwA,wB

∑i ` (Mi , Ri )

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 24 / 28

Transcript Quantitation with rQuant Approach

Galaxy-based Web Services for NGS Analysesfml

Galaxy-based web service http://galaxy.fml.mpg.de

PALMapper http://fml.mpg.de/raetsch/suppl/palmapper

mGene http://mgene.org/web

mTIM http://fml.mpg.de/raetsch/suppl/mtim (in prep.)

rQuant http://fml.mpg.de/raetsch/suppl/rquant/web

[Ratsch et al., in preparation, 2010]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 25 / 28

Transcript Quantitation with rQuant Approach

contiguous reads

splice-crossing reads

enriched regions

novel transfrags

density on known exons

Map reads

Aggregate and identify

Analyze

binding sources

motif finding

associated genes

novel gene models

novel splice isoforms

expression levels

differential expression

ChIP-seq RNA-seq discovery

RNA-seq quantification

RNA-seq, ChIP-seq, and external data Integrate

de novo transcript assembly

Info

rmat

ion

extra

ctio

n

[Pep

keet

al.,

2009

]

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 26 / 28

Transcript Quantitation with rQuant Approach

Gene Finding + Auxiliary Data[Behr et al., 2009] fml

Is there other information that may be used by cellular processes thatcan improve prediction results?

Preliminary study: (A. thaliana) transcript level (SN + SP)/2

1 mGene (ab initio) . . . 73.3%

2 . . . with DNA methylation (1 tissue) 76.1%

3 . . . with Nucleosome position predictions 78.0%

4 . . . with RNA secondary structure predictions 76.7%

In progress:

Study effect of other information sources for gene prediction

Ideally, consider measurements from the same sample

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28

Transcript Quantitation with rQuant Approach

Gene Finding + Auxiliary Data[Behr et al., 2009] fml

Is there other information that may be used by cellular processes thatcan improve prediction results?

Preliminary study: (A. thaliana) transcript level (SN + SP)/2

1 mGene (ab initio) . . . 73.3%

2 . . . with DNA methylation (1 tissue) 76.1%

3 . . . with Nucleosome position predictions 78.0%

4 . . . with RNA secondary structure predictions 76.7%

In progress:

Study effect of other information sources for gene prediction

Ideally, consider measurements from the same sample

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28

Transcript Quantitation with rQuant Approach

Gene Finding + Auxiliary Data[Behr et al., 2009] fml

Is there other information that may be used by cellular processes thatcan improve prediction results?

Preliminary study: (A. thaliana) transcript level (SN + SP)/2

1 mGene (ab initio) . . . 73.3%

2 . . . with DNA methylation (1 tissue) 76.1%

3 . . . with Nucleosome position predictions 78.0%

4 . . . with RNA secondary structure predictions 76.7%

In progress:

Study effect of other information sources for gene prediction

Ideally, consider measurements from the same sample

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28

Transcript Quantitation with rQuant Approach

Gene Finding + Auxiliary Data[Behr et al., 2009] fml

Is there other information that may be used by cellular processes thatcan improve prediction results?

Preliminary study: (A. thaliana) transcript level (SN + SP)/2

1 mGene (ab initio) . . . 73.3%

2 . . . with DNA methylation (1 tissue) 76.1%

3 . . . with Nucleosome position predictions 78.0%

4 . . . with RNA secondary structure predictions 76.7%

In progress:

Study effect of other information sources for gene prediction

Ideally, consider measurements from the same sample

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28

Transcript Quantitation with rQuant Approach

Gene Finding + Auxiliary Data[Behr et al., 2009] fml

Is there other information that may be used by cellular processes thatcan improve prediction results?

Preliminary study: (A. thaliana) transcript level (SN + SP)/2

1 mGene (ab initio) . . . 73.3%

2 . . . with DNA methylation (1 tissue) 76.1%

3 . . . with Nucleosome position predictions 78.0%

4 . . . with RNA secondary structure predictions 76.7%

In progress:

Study effect of other information sources for gene prediction

Ideally, consider measurements from the same sample

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 27 / 28

Summary

Summary and Conclusionsfml

Methods:

1 PALMapper: Splice site predictions improve alignments

2 mTiM: Identifies transcripts specific to experiment

3 mGene: Best for annotation; also finds lowly expressed genes

4 rQuant: Models library prep., sequencing, alignment biases

Unsolved problems:

1 Better prediction of alternative transcripts (without RNA-seq?)

2 Differential expression detection [Stegle et al., 2010]

3 Multi-modal modelling to handle the many different ways genesare processed

4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28

Summary

Summary and Conclusionsfml

Methods:

1 PALMapper: Splice site predictions improve alignments

2 mTiM: Identifies transcripts specific to experiment

3 mGene: Best for annotation; also finds lowly expressed genes

4 rQuant: Models library prep., sequencing, alignment biases

Unsolved problems:

1 Better prediction of alternative transcripts (without RNA-seq?)

2 Differential expression detection [Stegle et al., 2010]

3 Multi-modal modelling to handle the many different ways genesare processed

4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28

Summary

Summary and Conclusionsfml

Methods:

1 PALMapper: Splice site predictions improve alignments

2 mTiM: Identifies transcripts specific to experiment

3 mGene: Best for annotation; also finds lowly expressed genes

4 rQuant: Models library prep., sequencing, alignment biases

Unsolved problems:

1 Better prediction of alternative transcripts (without RNA-seq?)

2 Differential expression detection [Stegle et al., 2010]

3 Multi-modal modelling to handle the many different ways genesare processed

4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28

Summary

Summary and Conclusionsfml

Methods:

1 PALMapper: Splice site predictions improve alignments

2 mTiM: Identifies transcripts specific to experiment

3 mGene: Best for annotation; also finds lowly expressed genes

4 rQuant: Models library prep., sequencing, alignment biases

Unsolved problems:

1 Better prediction of alternative transcripts (without RNA-seq?)

2 Differential expression detection [Stegle et al., 2010]

3 Multi-modal modelling to handle the many different ways genesare processed

4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28

Summary

Summary and Conclusionsfml

Methods:

1 PALMapper: Splice site predictions improve alignments

2 mTiM: Identifies transcripts specific to experiment

3 mGene: Best for annotation; also finds lowly expressed genes

4 rQuant: Models library prep., sequencing, alignment biases

Unsolved problems:

1 Better prediction of alternative transcripts (without RNA-seq?)

2 Differential expression detection [Stegle et al., 2010]

3 Multi-modal modelling to handle the many different ways genesare processed

4 Data integration; very promising for detecting missingcomponents, for instance, in gene prediction systems

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 28 / 28

Summary

Acknowledgementsfml

Fabio De Bona

Alignments

Jonas Behr

Gene finding

Georg Zeller

Segmentation

Regina Bohnert

Quantitation

Help with Slides

Jonas Behr (FML)

Georg Zeller (EMBL)

Regina Bohnert (FML)

Ali Mortazavi (Caltech)

Funding by DFG &Max Planck Society.

Thank you for your attention!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 29 / 28

Summary

Acknowledgementsfml

Fabio De Bona

Alignments

Jonas Behr

Gene finding

Georg Zeller

Segmentation

Regina Bohnert

Quantitation

Help with Slides

Jonas Behr (FML)

Georg Zeller (EMBL)

Regina Bohnert (FML)

Ali Mortazavi (Caltech)

Funding by DFG &Max Planck Society.

Thank you for your attention!

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 29 / 28

Summary

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `(∑

t w (t)p(t)i , Ri

)2 Optimise profile weights: minp

∑i `(∑

t w (t)p(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

99,960 100,368 100,776 101,184 101,592

70

60

50

40

30

20

10

0

read

cou

nts

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28

Summary

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `(∑

t w (t)p(t)i , Ri

)2 Optimise profile weights: minp

∑i `(∑

t w (t)p(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

99,960 100,368 100,776 101,184 101,592

27.20

70

60

50

40

30

20

10

0

read

cou

nts

27.20

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28

Summary

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `(∑

t w (t)p(t)i , Ri

)2 Optimise profile weights: minp

∑i `(∑

t w (t)p(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

99,960 100,368 100,776 101,184 101,592

4.80

27.20

4.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28

Summary

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `(∑

t w (t)p(t)i , Ri

)2 Optimise profile weights: minp

∑i `(∑

t w (t)p(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28

Summary

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `(∑

t w (t)p(t)i , Ri

)2 Optimise profile weights: minp

∑i `(∑

t w (t)p(t)i , Ri

)3 Repeat 1. and 2. until convergence.

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

38.00

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28

Summary

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `(∑

t w (t)p(t)i , Ri

)2 Optimise profile weights: minp

∑i `(∑

t w (t)p(t)i , Ri

)3 Repeat 1. and 2. until convergence.

distance to transcript start/end

pro!

le w

eigh

t

0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

6.00

4.80

27.20

6.004.80

70

60

50

40

30

20

10

0

read

cou

nts

27.20

38.00

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28

Summary

rQuant – Iterative Algorithmfml

1 Optimise transcript weights: minw

∑i `(∑

t w (t)p(t)i , Ri

)2 Optimise profile weights: minp

∑i `(∑

t w (t)p(t)i , Ri

)3 Repeat 1. and 2. until convergence.

distance to transcript start/end

pro!

le w

eigh

t

0 50 200 450 800 1250 1800 2450 "2450 "1800 "1250 "800 "450 "200 "50 0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

gene AT1G01240chromosome 1, forward strand

AT1G01240.1 +110 1,178

625

AT1G01240.2 +190 1,178

436

AT1G01240.3 +1,941

99,960 100,368 100,776 101,184 101,592

4.95

8.35

23.99

70

60

50

40

30

20

10

0

read

cou

nts

8.354.95

23.99

37.29

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 30 / 28

Summary Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28

Summary Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28

Summary Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28

Summary Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28

Summary Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28

Summary Results

rQuant Evaluation Ifml

rQuant: Position-wise with profiles (estimating library and mapping bias)

compared to

Position-wise, without profiles

Segment-wise, without profiles (e.g., Jiang and Wong [2009] )

Segment-wise, with profiles (e.g. Flux Capacitor [Sammeth, 2009a])

Estimate transcript abundances

Using simulated data for A. thaliana (Flux Simulator [Sammeth, 2009b])

Subset of alternatively spliced genes

Evaluation: Spearman correlation between

Simulated RNA expression level and

Predicted transcript weights

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 31 / 28

Summary Results

rQuant Evaluation IIfml

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

(Bohnert et al., submitted, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 32 / 28

Summary Results

rQuant Evaluation IIfml

Position-wise, w/o pro!les

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

Segment-wise, w/ pro!les

(Bohnert et al., submitted, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 32 / 28

Summary Results

rQuant Evaluation IIfml

rQuantPosition-wise,

w/ pro!lesPosition-wise, w/o pro!les

0

0.1

0.2

0.3

0.4

0.5

0.6

0.7

0.8

0.9

Spea

rman

Cor

rela

tion

Across transcripts

Within genes (mean)

Segment-wise, w/o pro!les

Segment-wise, w/ pro!les

(Bohnert et al., submitted, 2010)

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 32 / 28

Summary Results

References I

J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski,K. Schneeberger, D. Weigel, and G. Ratsch. Rna-seq and tiling arrays for improved genefinding. Oral presentation at the CSHL Genome Informatics Meeting, September 2008. URLhttp:

//www.fml.tuebingen.mpg.de/raetsch/lectures/RaetschGenomeInformatics08.pdf.

J. Behr, G. Schweikert, J. Cao, F. De Bona, G. Zeller, S. Laubinger, S. Ossowski,K. Schneeberger, D. Weigel, and G. Ratsch. Rna-seq and tiling arrays for improved genefinding. Presented at the CSHL Genome Informatics Meeting, July 2009.

RM Clark, G Schweikert, C Toomajian, S Ossowski, G Zeller, P Shinn, N Warthmann, TT Hu,G Fu, DA Hinds, H Chen, KA Frazer, DH Huson, B Scholkopf, M Nordborg, G Ratsch,JR Ecker, and D Weigel. Common sequence polymorphisms shaping genetic diversity inarabidopsis thaliana. Science, 317(5836):338–342, 2007. ISSN 1095-9203 (Electronic). doi:10.1126/science.1138632.

F. De Bona, S. Ossowski, K. Schneeberger, and G. Ratsch. Qpalma: Optimal splicedalignments of short sequence reads. Bioinformatics, 24:i174–i180, 2008.

Hui Jiang and Wing Hung Wong. Statistical inferences for isoform expression in RNA-Seq.Bioinformatics, 25(8):1026–1032, April 2009.

Shirley Pepke, Barbara Wold, and Ali Mortazavi. Computation for chip-seq and rna-seq studies.Nat Methods, 6(11 Suppl):S22–32, Nov 2009. doi: 10.1038/nmeth.1371.

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 33 / 28

Summary Results

References II

G. Ratsch and S. Sonnenburg. Accurate splice site detection for Caenorhabditis elegans. InK. Tsuda B. Schoelkopf and J.-P. Vert, editors, Kernel Methods in Computational Biology.MIT Press, 2004.

G. Ratsch, S. Sonnenburg, and B. Scholkopf. RASE: recognition of alternatively spliced exonsin C. elegans. Bioinformatics, 21(Suppl. 1):i369–i377, June 2005.

M. Sammeth. The Flux Capacitor. Website, 2009a. http://flux.sammeth.net/capacitor.html.

M. Sammeth. The Flux Simulator. Website, 2009b. http://flux.sammeth.net/simulator.html.

Korbinian Schneeberger, Jorg Hagmann, Stephan Ossowski, Norman Warthmann, SandraGesing, Oliver Kohlbacher, and Detlef Weigel. Simultaneous alignment of short readsagainst multiple genomes. Genome Biol, 10(9):R98, Jan 2009. doi:10.1186/gb-2009-10-9-r98. URL http://genomebiology.com/2009/10/9/R98.

Gabriele Schweikert, Alexander Zien, Georg Zeller, Jonas Behr, Christoph Dieterich,Cheng Soon Ong, Petra Philips, Fabio De Bona, Lisa Hartmann, Anja Bohlen, Nina Kruger,Soren Sonnenburg, and Gunnar Ratsch. mgene: Accurate svm-based gene finding with anapplication to nematode genomes. Genome Research, 2009. URLhttp://genome.cshlp.org/content/early/2009/06/29/gr.090597.108.full.pdf+html.Advance access June 29, 2009.

S. Sonnenburg, G. Ratsch, A. Jagota, and K.-R. Muller. New methods for splice-siterecognition. In Proc. International Conference on Artificial Neural Networks, 2002.

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 34 / 28

Summary Results

References III

Soren Sonnenburg, Alexander Zien, and Gunnar Ratsch. ARTS: Accurate Recognition ofTranscription Starts in Human. Bioinformatics, 22(14):e472–480, 2006.

O. Stegle, P. Drewe, R. Bohnert, K. Borgwardt, and G. Ratsch. Statistical tests for detectingdifferential rna-transcript expression from read counts. Technical report, Nature Preceedings,2010.

G Zeller, RM Clark, K Schneeberger, A Bohlen, D Weigel, and G Ratsch. Detectingpolymorphic regions in arabidopsis thaliana with resequencing microarrays. Genome Res, 18(6):918–929, 2008. ISSN 1088-9051 (Print). doi: 10.1101/gr.070169.107.

A. Zien, G. Ratsch, S. Mika, B. Scholkopf, T. Lengauer, and K.-R. Muller. Engineering SupportVector Machine Kernels That Recognize Translation Initiation Sites. BioInformatics, 16(9):799–807, September 2000.

c© Gunnar Ratsch (FML, Tubingen) Methods for Transcriptome Analysis StatSeq Workshop, Gent 35 / 28

top related