twain: a new tool for parallel gene finding (and other gene finders) mihaela pertea william majoros...

TWAIN: a new tool for parallel gene finding

(and other gene finders)

Mihaela Pertea

William Majoros

Steven Salzberg

First, some background…

Genomes completed and published by TIGR and our collaborators, 1995-present

Organism                       Reference
Arabidopsis thaliana           Lin et al., Nature 402: 761-8 (2000)
Archaeoglobus fulgidus         Klenk et al., Nature 390: 364-370 (1997)
Bacillus anthracis Ames        Read et al., Nature 423: 81-86 (2003)
Bacillus anthracis Florida     Read et al., Science 296, 2028-33 (2002)
Borrelia burgdorferi           Fraser et al., Nature 390: 580-586 (1997)
Brucella suis                  Paulsen et al., PNAS 99 (2002)
Caulobacter crescentus         Nierman et al., PNAS 98 (2001)
Chlamydia pneumoniae           Read et al., Nucl. Acids Res. 28, (2000)
Chlamydia muridarum            Read et al., Nucl. Acids Res. 28, (2000)
Chlamydophila caviae           Read et al., Nucl. Acids Res. 31, (2003)
Chlorobium tepidum             Eisen et al., PNAS 99: 9509-9514 (2002)
Coxiella burnetii RSA 493      Seshadri et al., PNAS 100: 5455-60 (2003)
Deinococcus radiodurans        White et al., Science 286 (1999)
Enterococcus faecalis          Paulsen et al., Science 299: 2071-2074 (2003)
Haemophilus influenzae         Fleischmann et al., Science 269, (1995)
Helicobacter pylori            Tomb et al., Nature 388: 539-547 (1997)
Methanococcus jannaschii       Bult et al., Science 273: 1058-1073 (1996)
Mycobacterium tuberculosis     Fleischmann et al., J. Bact. 184, (2002)
Mycoplasma genitalium          Fraser et al., Science 270: 397-403 (1995)
Neisseria meningitidis         Tettelin et al., Science 287 (2000)
Oryza sativa (rice) chr 10     Wing et al., Science 300: 1566-1569 (2003)
Plasmodium falciparum          Gardner et al., Nature 419: 531-534 (2002)
Plasmodium yoelii              Carlton et al., Nature 419: 512-519 (2002)
Porphyromonas gingivalis       Nelson et al., J. Bact., in revision.
Pseudomonas putida             Nelson et al., Envir. Microbiol. (2002)
Shewanella oneidensis          Heidelberg et al., Nat. Biotech. 20 (2002)
Streptococcus agalactiae       Tettelin et al., PNAS 99 (2002)
Streptococcus pneumoniae       Tettelin et al., Science 293 (2001)
Sulfolobus islandicus virus    Arnold et al., Virology 15: 252-66 (2000)
Thermotoga maritima            Nelson et al., Nature 399: 323-329 (1999)
Treponema pallidum             Fraser et al., Science 281: 375-388 (1998)
Vibrio cholerae                Heidelberg et al., Nature 406, (2000)

Genomes in progress or recently completed

Fibrobacter succinogenes; Prevotella intermedia; Pseudomonas fluorescens; Silicibacter pomeroyi DSS-3; Streptococcus agalactiae A909; Streptococcus gordonii; Streptococcus mitis; Streptococcus pneumoniae 670; Acidobacterium capsulatum; Bacillus anthracis A01055; Bacillus anthracis A0402; Bacillus anthracis Ames 0581; Burkholderia thailandensis; Campylobacter coli RM2228; Campylobacter upsaliensis RM3195; Clostridium perfringens SM101; Epulopiscium fishelonii; Hyphomonas neptunium; Listeria monocytogenes F6854; Listeria monocytogenes H7858; Mycoplasma arthritidis; Mycoplasma capricolum; Myxococcus xanthus; Prevotella ruminicola; Pyrococcus furiosus; Verrucomicrobium spinosum; Actinomyces naeslundii

Bacillus anthracis A0071; Bacillus anthracis Kruger B; Erwinia chrysanthemi; Gemmata obscuriglobus; Mycobacterium tuberculosis; Ruminococcus albus; Streptococcus sobrinus; Aspergillus fumigatus; Brugia malayi; Coccidioides immitis; Cryptococcus neoformans; Entamoeba histolytica; Oryza sativa Chromosome 3 & 10; Plasmodium vivax; Schistosoma mansoni; Solanum spp.; Tetrahymena thermophila; Toxoplasma gondii; Theileria parva; Trichomonas vaginalis; Trypanosoma brucei; Trypanosoma cruzi

Acidithiobacillus ferrooxidans; Bacillus anthracis Kruger B; Burkholderia mallei; Clostridium perfringens ATCC13124; Dehalococcoides ethenogenes; Desulfovibrio vulgaris; Ehrlichia chaffeensis; Ehrlichia sennetsu; Geobacter sulfurreducens; Listeria monocytogenes; Methylococcus capsulatus; Mycobacterium avium 104; Mycobacterium smegmatis; Pseudomonas syringae; Staphylococcus aureus; Staphylococcus epidermidis; Treponema denticola; Wolbachia sp.; Anaplasma phagocytophila; Bacillus cereus 10987; Bacteroides forsythes; Brucella ovis; Baumannia cicadellinicola; Campylobacter jejuni; Carboxydothermus hydrogenoformans; Colwellia sp. 34H; Dichelobacter nodosus

Anatomy of a Genome Sequencing Project

Shotgun sequencing -> Genome Assembly -> Annotation -> Data release -> Downstream research

Library construction

Colony picking

Template preparation

Sequencing reactions

Base calling

Sequence files

Assembler->Genome scaffold

Ordered contig set

Gap closure, sequence editing

Re-assembly

ONE ASSEMBLY!

(per molecule)

Combinatorial PCR, POMP

Gene finding

Homology searches

Function assignments

Metabolic pathways, Gene families

Comparative genomics

Transcriptional/translational regulatory elements, Repetitive sequences

Publication, www.tigr.org

LIMS entry point

Microarray studies

Vaccine, drug development

Human disease studies

Gene Finding

• Gene finding plays an ever-larger role in high-speed DNA sequencing projects
– 1000s of genes generated each week at a high-throughput sequencing facility
• Separate gene finders are needed for every organism
– Training on organism X, finding genes on Y, generates inferior results
– Bootstrapping problem: training data is hard to find
• Prokaryotic – “easy”: bacteria, viruses, archaea have
– high gene density
– no introns
• Eukaryotic – hard
– low gene density
– many introns

GLIMMER: A Bacterial Gene Finder

• GLIMMER 2.0: released late 1999
• > 2000 sites worldwide (Open Source)
• Also handles Archaea, viruses, others
• Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999; Pertea et al., Nature 2000; Pertea and Salzberg, Plant Mol Biol 2001; Majoros et al., NAR, 2003

Web site and code:

http://www.tigr.org/software

Bacterial gene finding, pre-Glimmer: Uniform Markov Models

• Use conditional probability of a sequence position given previous k positions in the sequence, e.g. ACCGAT
• Fixed, kth-order model: bigger k's yield better models (as long as data is sufficient).
• Probability (score) of sequence s1 s2 s3 … sn is:

$$\prod_{i=1}^{n} P(s_i \mid s_{i-k} \ldots s_{i-1})$$

Uniform Markov Models

• Advantages:
– Easy to train. Count frequencies of (k+1)mers in training data.
– Easy to assign a score to a sequence.
• Disadvantages:
– (k+1)mers can be undersampled; i.e., occur too infrequently in training data.
– Choosing a single value of k may not be the best way to model the data
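The fixed-order model described on these slides can be sketched in a few lines of Python. This is only an illustration, not GLIMMER code: the function names, the add-one smoothing (pseudo) and the choice to skip the first k positions are assumptions made here. Training counts (k+1)-mers, and scoring sums the log of each conditional probability, which is the log of the product given above.

```python
import math
from collections import defaultdict

def train_markov(seqs, k):
    """Count (k+1)-mers as counts[context][next_base] (hypothetical helper)."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1
    return counts

def log_score(seq, counts, k, pseudo=1.0, alphabet="ACGT"):
    """Sum of log P(s_i | previous k bases) for i >= k.

    Add-one smoothing is an assumption of this sketch; it keeps unseen
    (k+1)-mers from producing log(0).
    """
    total = 0.0
    for i in range(k, len(seq)):
        ctx = counts.get(seq[i - k:i], {})
        num = ctx.get(seq[i], 0) + pseudo
        den = sum(ctx.values()) + pseudo * len(alphabet)
        total += math.log(num / den)
    return total

model = train_markov(["ACGTACGTACGT"], k=2)
print(log_score("ACGTACGT", model, 2) > log_score("AAAAAAAA", model, 2))
```

With a larger k the contexts become rarer in the training data, which is the undersampling problem listed under Disadvantages.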

Glimmer: Interpolated Markov Models

Use a linear combination of 8 different Markov chains; for example: c8 P(g|atcagtta) + c7 P(g|tcagtta) + … + c1 P(g|a) + c0 P(g)

where c0 + c1 + … + c8 = 1

Equivalent to interpolating the results of multiple Markov chains

Score of a sequence is the product of interpolated probabilities of bases in the sequence
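A minimal sketch of the interpolation idea follows. It assumes fixed, caller-supplied weights that sum to 1 and a uniform fallback for unseen contexts; both are simplifications chosen for this example, and the function names are hypothetical. It is not how GLIMMER itself sets its weights.

```python
from collections import defaultdict

def count_contexts(seqs, max_order):
    """counts[k][context][base] for every order k = 0..max_order."""
    counts = [defaultdict(lambda: defaultdict(int)) for _ in range(max_order + 1)]
    for s in seqs:
        for i in range(len(s)):
            for k in range(min(i, max_order) + 1):
                counts[k][s[i - k:i]][s[i]] += 1
    return counts

def interpolated_prob(base, context, counts, weights):
    """Return sum_k c_k * P_k(base | last k bases of context).

    weights[k] plays the role of c_k; the weights are fixed here (assumption).
    """
    assert abs(sum(weights) - 1.0) < 1e-9
    assert len(context) >= len(weights) - 1
    p = 0.0
    for k, c in enumerate(weights):
        ctx = context[len(context) - k:]
        seen = counts[k].get(ctx, {})
        n = sum(seen.values())
        # Unseen context: fall back to a uniform 1/4 (assumption of this sketch).
        p += c * (seen.get(base, 0) / n if n else 0.25)
    return p

counts = count_contexts(["ACGTACGTACGT"], max_order=2)
print(interpolated_prob("G", "AC", counts, [0.2, 0.3, 0.5]))
```

Because each order's distribution sums to 1 and the weights sum to 1, the interpolated values over A, C, G, T also sum to 1, so the per-base values can be multiplied into a sequence score as the slide describes.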

IMM’s vs. Fixed-Order Models

• Performance:
– IMM should always do at least as well as fixed-order.
  • E.g., even if kth-order model is correct, it can be simulated by (k+1)st-order
– Our results support this.
• IMM can be used as fixed-order model.

How GLIMMER Works

Three separate programs:
• long-orfs: automatically extract long open reading frames that do not overlap other long orfs.
• IMM model builder. Takes any kind of sequence data.
• Gene predictor. Takes genome sequence and finds all the genes.
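The first step, extracting long non-overlapping ORFs to use as training data, can be illustrated with the sketch below. It is not the long-orfs program: it scans only the forward strand, accepts only ATG starts, and the function names and min_len parameter are assumptions for illustration.

```python
STOPS = {"TAA", "TAG", "TGA"}

def find_orfs(seq, min_len):
    """Return (start, end) of ATG..stop ORFs on the forward strand, end exclusive."""
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:
                    orfs.append((start, i + 3))
                start = None
    return orfs

def non_overlapping(orfs):
    """Keep only ORFs that overlap no other ORF in the list."""
    return [a for a in orfs
            if not any(a != b and a[0] < b[1] and b[0] < a[1] for b in orfs)]

seq = "ATGAAATAG" + "CCC" + "ATGCCCCCCTGA"
print(non_overlapping(find_orfs(seq, min_len=9)))
```

The sequences of the surviving ORFs would then be passed to the model builder as training data.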

GLIMMER 2.0’s Performance

Organism         Genes Annotated   Genes Found      Additional Genes
H. influenzae    1738              1720 (99.0%)     250 (14%)
M. genitalium    483               480 (99.4%)      81 (17%)
M. jannaschii    1727              1721 (99.7%)     221 (13%)
H. pylori        1590              1550 (97.5%)     293 (18%)
E. coli          4269              4158 (97.4%)     824 (19%)
B. subtilis      4100              4030 (98.3%)     586 (14%)
A. fulgidus      2437              2404 (98.6%)     274 (11%)
B. burgdorferi   853               843 (99.3%)      62 (7%)
T. pallidum      1039              1014 (97.6%)     180 (17%)
T. maritima      1877              1854 (98.8%)     190 (10%)

GLIMMER on “known” genes

Organism         Genes Annotated   Known Genes   Correct Predictions
H. influenzae    1738              1501          1496 (99.7%)
M. genitalium    483               478           476 (99.6%)
M. jannaschii    1727              1259          1256 (99.8%)
H. pylori        1590              1092          1084 (99.3%)
E. coli          4269              2656          2632 (99.1%)
B. subtilis      4100              1249          1231 (98.6%)
A. fulgidus      2437              1799          1786 (99.3%)
B. burgdorferi   853               601           600 (99.8%)
T. pallidum      1039              755           747 (98.9%)
T. maritima      1877              1504          1493 (99.3%)
Average                                          (99.3%)

Speed
• Training for 2 Megabase genome: < 30 sec (on a Linux desktop)
• Find all genes in 2Mb genome: < 30 sec

Impact: GLIMMER has been used for:
• B. anthracis (anthrax) (TIGR)
• B. burgdorferi (Lyme disease), T. pallidum (syphilis) (TIGR)
• C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF)
• T. maritima, D. radiodurans, M. tuberculosis, V. cholerae, S. pneumoniae, C. trachomatis, C. pneumoniae, N. meningitidis (TIGR)
• X. fastidiosa (Brazilian consortium)
• Plasmodium falciparum (malaria) [GlimmerM]
• Arabidopsis thaliana (model plant) [GlimmerM]
• and many others: viruses, simple eukaryotes, more bacteria

Eukaryotic gene finding

• Much harder

• Overall accuracy usually below 50%
– Human (mammalian) gene finding is hardest
– very long introns, and lots of them

• Leading methods: HMMs, GHMMs

• New ideas needed

• New opportunity: use sequence of related species

GlimmerHMM

[State diagram: Intergenic; Initial Exon; Exon0, Exon1, Exon2; I0, I1, I2; Terminal Exon; Exon Sngl]

GlimmerHMM: results on Arabidopsis thaliana

              Nucl             Exon             Gene
              Sn   Sp   Acc    Sn   Sp   Acc    Sn   Sp   Acc
GlimmerHMM    95   99   97     71   78   74.5   33   32   32.5
Genscan+      93   99   96     74   81   77.5   35   35   35

• Train data set: 3237 genes
• Test data set: 809 non-homologous genes
• All genes confirmed by full-length Arabidopsis cDNAs
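The Acc values in the table are consistent with Acc = (Sn + Sp) / 2, for example (71 + 78) / 2 = 74.5 for GlimmerHMM at the exon level. The sketch below computes the three numbers for exons, assuming the usual gene-finding convention that an exon counts as correct only if both boundaries match, that Sn is the fraction of annotated exons predicted, and that Sp is the fraction of predicted exons that are annotated. The function name and the coordinates are made up for illustration.

```python
def exon_accuracy(predicted, annotated):
    """Return (Sn, Sp, Acc) for exons given as (start, end) tuples."""
    pred, true = set(predicted), set(annotated)
    tp = len(pred & true)                 # exons with both boundaries correct
    sn = tp / len(true) if true else 0.0  # share of annotated exons found
    sp = tp / len(pred) if pred else 0.0  # share of predictions that are right
    return sn, sp, (sn + sp) / 2

predicted = [(1, 10), (20, 30), (40, 50), (60, 70)]
annotated = [(1, 10), (20, 30), (45, 50)]
print(exon_accuracy(predicted, annotated))
```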

Exonomy: a generalized HMM

Program     Nucleotide accuracy   Exon spec   Exon sens   Whole-gene accuracy
Unveil      94%                   75%         74%         46%
Exonomy     95%                   63%         61%         42%
GlimmerM    93%                   71%         71%         44%
Genscan     94%                   80%         75%         27%

Arabidopsis test results, 300 genes (Majoros et al., 2003)

Aspergillus species experiments

Training data:
– 589 Genbank genomic sequences containing 625 genes that have the phrase ‘complete cds’ in their description
– 1166 introns inferred from spliced alignments of ESTs to a recent genome assembly

Test data:
– 85 genes for Aspergillus fumigatus manually curated and with strong protein evidence

Gene Finding in A. fumigatus

[Figure: prediction tracks for GlimmerHMM, Unveil, GlimmerM, Phat and Exonomy, compared with the “Truth” track]

Aspergillus fumigatus test results

[Bar chart, 0 to 100: NuclAcc, ExonAcc and GeneAcc for Exonomy, Phat, GlimmerM, Unveil and GlimmerHMM]

Example: D. melanogaster vs. D. pseudoobscura (alignment generated by MUMmer/Promer)

D. melanogaster chr 2L

annotated genes

amino acid matches

Ortholog Detection in TWAIN
• Promer/MUMmer to identify conserved regions
• Individual gene finder to predict coding regions separately in each genome
• Combine these two types of evidence with protein sequence homology

Species 1

Species 2

Run TWAIN on these

TWAIN approach: Premise

Instead of independently choosing the optimal gene models for two conserved regions, we want to find the pair of nearly optimal gene models which produce the most similar proteins.

Build Parse Graphs

• Parse graph: keep N highest scoring ORFs according to individual gene finder
• Parse graphs are built without regard for synteny
• Nodes are: start, stop, donor, acceptor sites

Align Parse Graphs

• The two parse graphs are aligned using a global alignment algorithm on gene structures.

• Optimal alignment corresponds to the best pair of orthologous gene predictions.

Gene Alignment in TWAIN

Ideally, each cell links back to the “optimal” predecessor

A cell with a diagonal link to its left denotes homologous signals

A cell with a horizontal or vertical link to its left indicates that a signal in one species is not present in the other
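The matrix-and-backlink idea can be illustrated with a plain global alignment over two lists of signals. This is a deliberate simplification and not TWAIN's algorithm: TWAIN aligns parse graphs holding many candidate gene structures, whereas this sketch aligns one linear signal list per species, and the scores (match, mismatch, gap) and the function name are arbitrary assumptions.

```python
def align_signals(a, b, match=2, mismatch=-3, gap=-1):
    """Globally align two signal lists; return (score, list of aligned pairs).

    A pair (x, y) is a diagonal link (homologous signals); (x, None) or
    (None, y) is a vertical or horizontal link (signal present in one species only).
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0], back[i][0] = i * gap, "vertical"
    for j in range(1, m + 1):
        score[0][j], back[0][j] = j * gap, "horizontal"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            options = [(diag, "diagonal"),
                       (score[i - 1][j] + gap, "vertical"),
                       (score[i][j - 1] + gap, "horizontal")]
            # Each cell links back to its best-scoring predecessor.
            score[i][j], back[i][j] = max(options, key=lambda o: o[0])
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "diagonal":
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif move == "vertical":
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    return score[n][m], pairs[::-1]

# Species 1 has an intron (GT ... AG) that species 2 lacks.
print(align_signals(["ATG", "GT", "AG", "TAG"], ["ATG", "TAG"]))
```

In the example the start and stop codons are linked diagonally, while the donor and acceptor of the extra intron are linked vertically, which corresponds to the intron-insertion case.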

Some Examples

Intron insertion Exon insertion Multiple insertions

Pair HMM equivalence

E1,E2 I1,I2

Pair HMM Equivalence

Intron insertion

E1,E2p

I1,--

E1,E2p

Orthogonal vs. Oblique Linking

The oblique (red) alignment matches up the two introns

The orthogonal alignment denotes coding regions that have shifted across introns

Dynamic Programming Optimizations

• We only look back to cells left and below the current cell, and only those having an edge in both parse graphs to the current cell
• Depending on the Promer alignments we might “cut corners” to improve performance

Scoring Model

$$\operatorname*{argmax}_{\phi_1, \phi_2} \; P_1(\phi_1)\, P_2(\phi_2)\, \frac{P(\text{align})}{P(\neg\,\text{align})}$$

where:
• P_i(φ_i) is the probability that sequence i has parse φ_i
• P(align) is the probability that an alignment PHMM generates this pair of proteins (evaluated by the forward algorithm)

P(align)

MRNDCACQEGHLINRFPDNAR
||||| || ||||
MRNDCTCQRGHLIATG..................................................................TAG
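The selection step of the scoring model can be sketched as an argmax over pairs of candidate parses. This is a toy illustration under stated assumptions: the candidate lists, their probabilities, and the align_log_odds callback (standing in for the alignment term of the formula, which TWAIN evaluates with a pair HMM) are all hypothetical, and the search here is brute force rather than dynamic programming.

```python
import math
from itertools import product

def best_parse_pair(parses1, parses2, align_log_odds):
    """Pick the pair of parses maximizing log P1 + log P2 + alignment log-odds.

    parses1, parses2: lists of (parse_id, probability) from each gene finder.
    align_log_odds(id1, id2): log-odds that the two predicted proteins align
    (caller-supplied stand-in; hypothetical).
    """
    def total(pair):
        (id1, p1), (id2, p2) = pair
        return math.log(p1) + math.log(p2) + align_log_odds(id1, id2)
    (id1, _), (id2, _) = max(product(parses1, parses2), key=total)
    return id1, id2

parses1 = [("a1", 0.6), ("a2", 0.4)]
parses2 = [("b1", 0.7), ("b2", 0.3)]
odds = {("a2", "b2"): 3.0}  # only this pair yields similar proteins
print(best_parse_pair(parses1, parses2, lambda x, y: odds.get((x, y), 0.0)))
```

Here the individually best parses are a1 and b1, but the pair (a2, b2) wins because its proteins align much better, which is the premise of choosing nearly optimal gene models that produce the most similar proteins.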

Partial Alignments

not penalized

• Evaluation of a cell in the alignment matrix depends in many cases on the alignment of partial proteins up to this point in the parse graph (e.g., at a GT-GT cell).
• Because a terminal portion of a partial sequence may be matched later, we do not penalize for insertions/deletions at the right end of the alignment

Pair HMM results

available within a few weeks….

Priority organisms

• Human-mouse gene finding not very high-impact
– lots of ancillary data gives better evidence
– most genes now known
– nonetheless, this problem is getting all the attention
• Countless other species really need gene finders:
– Brugia malayi (causes lymphatic filariasis)
– Toxoplasma gondii
– Schistosoma mansoni (Schistosomiasis)
– Entamoeba histolytica (50 million cases/year)
– Tetrahymena thermophila (model organism)
– Plants: potato, maize, sorghum
– Mammals: chimp, dog, cow, pig

Acknowledgements

GLIMMER: Arthur Delcher, Simon Kasif, Owen White

GlimmerM, GlimmerHMM: Mihaela Pertea

Exonomy, Unveil: Bill Majoros

TWAIN: Mihaela Pertea, Bill Majoros

Funding support:
National Institutes of Health (NLM)
National Science Foundation (CISE, BIO)

Software downloads: http://www.tigr.org/software
