twain: a new tool for parallel gene finding (and other gene finders) mihaela pertea william majoros...
Post on 21-Dec-2015
TWAIN: a new tool for parallel gene finding
(and other gene finders)
Mihaela Pertea
William Majoros
Steven Salzberg
First, some background…
Genomes completed and published by TIGR and our collaborators, 1995-present
Organism                       Reference
Arabidopsis thaliana           Lin et al., Nature 402: 761-8 (2000)
Archaeoglobus fulgidus         Klenk et al., Nature 390: 364-370 (1997)
Bacillus anthracis Ames        Read et al., Nature 423: 81-86 (2003)
Bacillus anthracis Florida     Read et al., Science 296: 2028-33 (2002)
Borrelia burgdorferi           Fraser et al., Nature 390: 580-586 (1997)
Brucella suis                  Paulsen et al., PNAS 99 (2002)
Caulobacter crescentus         Nierman et al., PNAS 98 (2001)
Chlamydia pneumoniae           Read et al., Nucl. Acids Res. 28 (2000)
Chlamydia muridarum            Read et al., Nucl. Acids Res. 28 (2000)
Chlamydophila caviae           Read et al., Nucl. Acids Res. 31 (2003)
Chlorobium tepidum             Eisen et al., PNAS 99: 9509-9514 (2002)
Coxiella burnetii RSA 493      Seshadri et al., PNAS 100: 5455-60 (2003)
Deinococcus radiodurans        White et al., Science 286 (1999)
Enterococcus faecalis          Paulsen et al., Science 299: 2071-2074 (2003)
Haemophilus influenzae         Fleischmann et al., Science 269 (1995)
Helicobacter pylori            Tomb et al., Nature 388: 539-547 (1997)
Methanococcus jannaschii       Bult et al., Science 273: 1058-1073 (1996)
Mycobacterium tuberculosis     Fleischmann et al., J. Bact. 184 (2002)
Mycoplasma genitalium          Fraser et al., Science 270: 397-403 (1995)
Neisseria meningitidis         Tettelin et al., Science 287 (2000)
Oryza sativa (rice) chr 10     Wing et al., Science 300: 1566-1569 (2003)
Plasmodium falciparum          Gardner et al., Nature 419: 531-534 (2002)
Plasmodium yoelii              Carlton et al., Nature 419: 512-519 (2002)
Porphyromonas gingivalis       Nelson et al., J. Bact., in revision
Pseudomonas putida             Nelson et al., Envir. Microbiol. (2002)
Shewanella oneidensis          Heidelberg et al., Nat. Biotech. 20 (2002)
Streptococcus agalactiae       Tettelin et al., PNAS 99 (2002)
Streptococcus pneumoniae       Tettelin et al., Science 293 (2001)
Sulfolobus islandicus virus    Arnold et al., Virology 15: 252-66 (2000)
Thermotoga maritima            Nelson et al., Nature 399: 323-329 (1999)
Treponema pallidum             Fraser et al., Science 281: 375-388 (1998)
Vibrio cholerae                Heidelberg et al., Nature 406 (2000)
Genomes in progress or recently completed
Fibrobacter succinogenes
Prevotella intermedia
Pseudomonas fluorescens
Silicibacter pomeroyi DSS-3
Streptococcus agalactiae A909
Streptococcus gordonii
Streptococcus mitis
Streptococcus pneumoniae 670
Acidobacterium capsulatum
Bacillus anthracis A01055
Bacillus anthracis A0402
Bacillus anthracis Ames 0581
Burkholderia thailandensis
Campylobacter coli RM2228
Campylobacter upsaliensis RM3195
Clostridium perfringens SM101
Epulopiscium fishelonii
Hyphomonas neptunium
Listeria monocytogenes F6854
Listeria monocytogenes H7858
Mycoplasma arthritidis
Mycoplasma capricolum
Myxococcus xanthus
Prevotella ruminicola
Pyrococcus furiosus
Verrucomicrobium spinosum
Actinomyces naeslundii
Bacillus anthracis A0071
Bacillus anthracis Kruger B
Erwinia chrysanthemi
Gemmata obscuriglobus
Mycobacterium tuberculosis
Ruminococcus albus
Streptococcus sobrinus
Aspergillus fumigatus
Brugia malayi
Coccidioides immitis
Cryptococcus neoformans
Entamoeba histolytica
Oryza sativa Chromosome 3 & 10
Plasmodium vivax
Schistosoma mansoni
Solanum spp.
Tetrahymena thermophila
Toxoplasma gondii
Theileria parva
Trichomonas vaginalis
Trypanosoma brucei
Trypanosoma cruzi
Acidithiobacillus ferrooxidans
Bacillus anthracis Kruger B
Burkholderia mallei
Clostridium perfringens ATCC13124
Dehalococcoides ethenogenes
Desulfovibrio vulgaris
Ehrlichia chaffeensis
Ehrlichia sennetsu
Geobacter sulfurreducens
Listeria monocytogenes
Methylococcus capsulatus
Mycobacterium avium 104
Mycobacterium smegmatis
Pseudomonas syringae
Staphylococcus aureus
Staphylococcus epidermidis
Treponema denticola
Wolbachia sp.
Anaplasma phagocytophila
Bacillus cereus 10987
Bacteroides forsythes
Brucella ovis
Baumannia cicadellinicola
Campylobacter jejuni
Carboxydothermus hydrogenoformans
Colwellia sp. 34H
Dichelobacter nodosus
Anatomy of a Genome Sequencing Project
Pipeline stages: Shotgun sequencing, Genome assembly, Annotation, Data release, Downstream research
• Shotgun sequencing: Library construction, Colony picking, Template preparation, Sequencing reactions, Base calling, Sequence files
• Genome assembly: Assembler->Genome scaffold, Ordered contig set, Gap closure and sequence editing (Combinatorial PCR, POMP), Re-assembly, ONE ASSEMBLY! (per molecule)
• Annotation: Gene finding, Homology searches, Function assignments, Metabolic pathways, Gene families, Comparative genomics, Transcriptional/translational regulatory elements, Repetitive sequences
• Data release: Publication, www.tigr.org
• Downstream research: Microarray studies; Vaccine, drug development; Human disease studies
• Other diagram label: LIMS entry point
Gene Finding
• Gene finding plays an ever-larger role in high-speed DNA sequencing projects
– 1000’s of genes generated each week at a high-throughput sequencing facility
• Separate gene finders are needed for every organism
– Training on organism X, finding genes on Y, generates inferior results
• Bootstrapping problem: training data is hard to find
• Prokaryotic – “easy”: bacteria, viruses, archaea have
– high gene density
– no introns
• Eukaryotic – hard
– low gene density
– many introns
GLIMMER: A Bacterial Gene Finder
• GLIMMER 2.0: released late 1999
• > 2000 sites worldwide (Open Source)
• Also handles Archaea, viruses, others
• Refs: Salzberg et al., NAR, 1998, Genomics 1999; Delcher et al., NAR, 1999; Pertea et al., Nature 2000; Pertea and Salzberg, Plant Mol Biol 2001; Majoros et al., NAR, 2003
Web site and code:
http://www.tigr.org/software
Bacterial gene finding, pre-Glimmer: Uniform Markov Models
• Use the conditional probability of a sequence position given the previous k positions in the sequence, e.g. ACCGAT
• Fixed, kth-order model: bigger k’s yield better models (as long as data is sufficient).
• Probability (score) of sequence s1 s2 s3 … sn is:
$$\prod_{i=1}^{n} P(s_i \mid s_{i-k} \ldots s_{i-1})$$
• Advantages:
– Easy to train. Count frequencies of (k+1)mers in training data.
– Easy to assign a score to a sequence.
• Disadvantages:
– (k+1)mers can be undersampled; i.e., occur too infrequently in training data.
– Choosing a single value of k may not be the best way to model the data.
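The training and scoring steps of a fixed, kth-order model can be sketched in a few lines of Python. This is only an illustration of the idea on the slide, not code from GLIMMER or any earlier gene finder: the function names `train_markov` and `log_score`, the add-one pseudocount, and the choice to skip the first k positions (which lack a full context) are all assumptions made here.

```python
import math
from collections import Counter

def train_markov(sequences, k, alphabet="acgt", pseudocount=1.0):
    """Count (k+1)-mers and return a function P(base | previous k bases)."""
    kmer_counts = Counter()     # counts of context + next base
    context_counts = Counter()  # counts of the k-base contexts
    for seq in sequences:
        for i in range(k, len(seq)):
            kmer_counts[seq[i - k:i + 1]] += 1
            context_counts[seq[i - k:i]] += 1

    def prob(context, base):
        # The pseudocount (an assumption) avoids zero probabilities
        # for undersampled (k+1)-mers.
        num = kmer_counts[context + base] + pseudocount
        den = context_counts[context] + pseudocount * len(alphabet)
        return num / den

    return prob

def log_score(seq, k, prob):
    """Sum of log P(s_i | s_{i-k} ... s_{i-1}) over positions with a full context."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

if __name__ == "__main__":
    model = train_markov(["acgacgacg"], k=2)
    print(model("ac", "g"))              # probability of g after context ac
    print(log_score("acgacg", 2, model))
```

Working in log space keeps the product of many small probabilities from underflowing on long sequences.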
Glimmer: Interpolated Markov Models
• Use a linear combination of Markov chains of orders 0 through 8; for example:
c8 P(g|atcagtta) + c7 P(g|tcagtta) + … + c1 P(g|a) + c0 P(g)
where c0 + c1 + … + c8 = 1
Equivalent to interpolating the results of multiple Markov chains
Score of a sequence is the product of interpolated probabilities of bases in the sequence
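A minimal sketch of the interpolation, assuming one fixed weight per model order. GLIMMER itself derives its weights from the training data, so the fixed `weights` list, the add-one smoothing, and the names `train`, `imm_prob`, and `imm_log_score` are illustrative assumptions rather than the program's actual method.

```python
import math
from collections import Counter

def train(sequences, k, alphabet="acgt"):
    """Return P(base | previous k bases) with add-one smoothing (an assumption)."""
    kmers, contexts = Counter(), Counter()
    for seq in sequences:
        for i in range(k, len(seq)):
            kmers[seq[i - k:i + 1]] += 1
            contexts[seq[i - k:i]] += 1
    return lambda ctx, base: (kmers[ctx + base] + 1) / (contexts[ctx] + len(alphabet))

def imm_prob(context, base, models, weights):
    """Weighted sum of the order-0..K models; models[j] is the order-j chain."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    if len(context) < len(models) - 1:
        raise ValueError("context is shorter than the highest model order")
    total = 0.0
    for order, (model, weight) in enumerate(zip(models, weights)):
        suffix = context[len(context) - order:]  # last `order` bases of the context
        total += weight * model(suffix, base)
    return total

def imm_log_score(seq, models, weights):
    """Log of the product of interpolated probabilities along the sequence."""
    k = len(models) - 1
    return sum(math.log(imm_prob(seq[i - k:i], seq[i], models, weights))
               for i in range(k, len(seq)))

if __name__ == "__main__":
    training = ["atcagttagatcagttag"]
    models = [train(training, k) for k in range(3)]  # orders 0, 1, 2
    weights = [0.2, 0.3, 0.5]                        # hypothetical fixed weights
    print(imm_prob("ca", "g", models, weights))
    print(imm_log_score("atcagtta", models, weights))
```

Because each component model is a probability distribution over the next base and the weights sum to 1, the interpolated value is itself a valid probability.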
IMM’s vs. Fixed-Order Models
• Performance:
– IMM should always do at least as well as fixed-order.
  • E.g., even if kth-order model is correct, it can be simulated by (k+1)st-order
– Our results support this.
• IMM can be used as fixed-order model.
How GLIMMER Works
Three separate programs:
• long-orfs: automatically extract long open reading frames that do not overlap other long orfs.
• IMM model builder. Takes any kind of sequence data.
• Gene predictor. Takes genome sequence and finds all the genes.
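The first step, collecting long non-overlapping ORFs as bootstrap training data, can be illustrated with a simplified sketch. It scans only the forward strand and uses a hypothetical minimum length; the real long-orfs program and its criteria are not reproduced here, and the names `find_orfs` and `non_overlapping` are invented for this example.

```python
STOPS = {"taa", "tag", "tga"}

def find_orfs(seq, min_len=90):
    """Return half-open (start, end) spans of forward-strand ORFs: atg ... stop."""
    seq = seq.lower()
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "atg":
                start = i                       # first start codon in this open frame
            elif start is not None and codon in STOPS:
                if i + 3 - start >= min_len:    # keep only long ORFs
                    orfs.append((start, i + 3))
                start = None
    return orfs

def non_overlapping(orfs):
    """Keep only the ORFs whose span overlaps no other ORF in the list."""
    keep = []
    for idx, (a_start, a_end) in enumerate(orfs):
        clash = any(
            jdx != idx and a_start < b_end and b_start < a_end
            for jdx, (b_start, b_end) in enumerate(orfs)
        )
        if not clash:
            keep.append((a_start, a_end))
    return keep

if __name__ == "__main__":
    genome = "atg" + "aaa" * 2 + "taa" + "c" + "atg" + "ccc" * 2 + "tag"
    print(non_overlapping(find_orfs(genome, min_len=12)))
```

Long ORFs that overlap nothing else are very likely to be real genes, which is what makes them usable as training data before any annotation exists.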
GLIMMER 2.0’s Performance
Organism         Genes annotated   Annotated genes found   Additional genes
H. influenzae    1738              1720 (99.0%)            250 (14%)
M. genitalium    483               480 (99.4%)             81 (17%)
M. jannaschii    1727              1721 (99.7%)            221 (13%)
H. pylori        1590              1550 (97.5%)            293 (18%)
E. coli          4269              4158 (97.4%)            824 (19%)
B. subtilis      4100              4030 (98.3%)            586 (14%)
A. fulgidus      2437              2404 (98.6%)            274 (11%)
B. burgdorferi   853               843 (99.3%)             62 (7%)
T. pallidum      1039              1014 (97.6%)            180 (17%)
T. maritima      1877              1854 (98.8%)            190 (10%)
GLIMMER on “known” genes
Organism         Genes annotated   Known genes   Correct predictions
H. influenzae    1738              1501          1496 (99.7%)
M. genitalium    483               478           476 (99.6%)
M. jannaschii    1727              1259          1256 (99.8%)
H. pylori        1590              1092          1084 (99.3%)
E. coli          4269              2656          2632 (99.1%)
B. subtilis      4100              1249          1231 (98.6%)
A. fulgidus      2437              1799          1786 (99.3%)
B. burgdorferi   853               601           600 (99.8%)
T. pallidum      1039              755           747 (98.9%)
T. maritima      1877              1504          1493 (99.3%)
Average                                          (99.3%)
Speed
• Training for 2 Megabase genome: < 30 sec (on a Linux desktop)
• Find all genes in 2Mb genome: < 30 sec
Impact: GLIMMER has been used for:
• B. anthracis (anthrax) (TIGR)
• B. burgdorferi (Lyme disease), T. pallidum (syphilis) (TIGR)
• C. pneumoniae (pneumonia) (Berkeley/Stanford/UCSF)
• T. maritima, D. radiodurans, M. tuberculosis, V. cholerae, S. pneumoniae, C. trachomatis, C. pneumoniae, N. meningitidis (TIGR)
• X. fastidiosa (Brazilian consortium)
• Plasmodium falciparum (malaria) [GlimmerM]
• Arabidopsis thaliana (model plant) [GlimmerM]
• and many others: viruses, simple eukaryotes, more bacteria
Eukaryotic gene finding
• Much harder
• Overall accuracy usually below 50%
– Human (mammalian) gene finding is hardest
– very long introns, and lots of them
• Leading methods: HMMs, GHMMs
• New ideas needed
• New opportunity: use sequence of related species
GlimmerHMM
[Figure: GlimmerHMM state diagram. States: Intergenic; Initial Exon; Exon0, Exon1, Exon2; I0, I1, I2; Terminal Exon; Exon Sngl]
GlimmerHMM: results on Arabidopsis thaliana
              Nucl                Exon                Gene
              Sn    Sp    Acc     Sn    Sp    Acc     Sn    Sp    Acc
GlimmerHMM    95    99    97      71    78    74.5    33    32    32.5
Genscan+      93    99    96      74    81    77.5    35    35    35
• Train data set: 3237 genes
• Test data set: 809 non-homologous genes
• All genes confirmed by full-length Arabidopsis cDNAs
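The Acc columns in these results are consistent with the mean of sensitivity (Sn) and specificity (Sp). The sketch below assumes that definition, along with the usual gene-finding convention that Sn = TP / (TP + FN) and Sp = TP / (TP + FP); the slide does not state either formula, so treat both as assumptions.

```python
def sensitivity(tp, fn):
    """Fraction of true features that were predicted: TP / (TP + FN)."""
    return tp / (tp + fn)

def specificity(tp, fp):
    """Fraction of predictions that are correct: TP / (TP + FP)."""
    return tp / (tp + fp)

def accuracy(sn, sp):
    """Assumed definition of Acc: the mean of Sn and Sp."""
    return (sn + sp) / 2

if __name__ == "__main__":
    # Exon-level Sn and Sp for GlimmerHMM, taken from the results above.
    print(accuracy(71, 78))
```

Applying `accuracy` to each Sn/Sp pair reported for GlimmerHMM and Genscan+ reproduces the Acc values shown.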
Exonomy: a generalized HMM
Program     Nucleotide accuracy   Exon spec   Exon sens   Whole-gene accuracy
Unveil      94%                   75%         74%         46%
Exonomy     95%                   63%         61%         42%
GlimmerM    93%                   71%         71%         44%
Genscan     94%                   80%         75%         27%
Arabidopsis test results, 300 genes (Majoros et al., 2003)
Aspergillus species experiments
• Training data:
– 589 Genbank genomic sequences containing 625 genes that have the phrase ‘complete cds’ in their description
– 1166 introns inferred from spliced alignments of ESTs to a recent genome assembly
• Test data:
– 85 genes for Aspergillus fumigatus manually curated and with strong protein evidence
Gene Finding in A. fumigatus
[Figure: gene predictions from GlimmerHMM, Unveil, GlimmerM, Phat, and Exonomy shown against the “Truth” annotation]
Aspergillus fumigatus test results
[Figure: bar chart, scale 0 to 100, of NuclAcc, ExonAcc, and GeneAcc for Exonomy, Phat, GlimmerM, Unveil, and GlimmerHMM]
Example: D. melanogaster vs. D. pseudoobscura (alignment generated by MUMmer/Promer)
[Figure: D. melanogaster chr 2L, showing annotated genes and amino acid matches]
Ortholog Detection in TWAIN
• Promer/MUMmer to identify conserved regions
• Individual gene finder to predict coding regions separately in each genome
• Combine these two types of evidence with protein sequence homology
[Figure: conserved regions in Species 1 and Species 2; run TWAIN on these]
TWAIN approach: Premise
Instead of independently choosing the optimal gene model for each of two conserved regions, we want to find the pair of nearly optimal gene models that produces the most similar proteins.
Build Parse Graphs
• Parse graph: keep N highest scoring ORFs according to individual gene finder
• Parse graphs are built without regard for synteny
• Nodes are: start, stop, donor, acceptor sites
Align Parse Graphs
• The two parse graphs are aligned using a global alignment algorithm on gene structures.
• Optimal alignment corresponds to the best pair of orthologous gene predictions.
Gene Alignment in TWAIN
Ideally, each cell links back to the “optimal” predecessor
A cell with a diagonal link to its left denotes homologous signals
A cell with a horizontal or vertical link to its left indicates that a signal in one species is not present in the other
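The link types described above can be illustrated with a small global-alignment sketch over two lists of signals. It is a deliberate simplification: each species is reduced to a single linear list of signal types instead of a full parse graph, and the `match` and `gap` scores are arbitrary placeholders, not TWAIN's scoring model. The function name `align_signals` is hypothetical.

```python
def align_signals(a, b, match=2.0, gap=-1.0):
    """Globally align two signal lists; return (score, list of linked pairs)."""
    n, m = len(a), len(b)
    neg_inf = float("-inf")
    score = [[neg_inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    score[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            # Diagonal link: the same signal type in both species (homologous).
            if i and j and a[i - 1] == b[j - 1]:
                cand = score[i - 1][j - 1] + match
                if cand > score[i][j]:
                    score[i][j], back[i][j] = cand, "diag"
            # Vertical link: a signal in species 1 with no counterpart.
            if i:
                cand = score[i - 1][j] + gap
                if cand > score[i][j]:
                    score[i][j], back[i][j] = cand, "up"
            # Horizontal link: a signal in species 2 with no counterpart.
            if j:
                cand = score[i][j - 1] + gap
                if cand > score[i][j]:
                    score[i][j], back[i][j] = cand, "left"
    # Follow the stored predecessor links back from the final cell.
    pairs, i, j = [], n, m
    while i or j:
        move = back[i][j]
        if move == "diag":
            pairs.append((a[i - 1], b[j - 1]))
            i, j = i - 1, j - 1
        elif move == "up":
            pairs.append((a[i - 1], None))
            i -= 1
        else:
            pairs.append((None, b[j - 1]))
            j -= 1
    return score[n][m], pairs[::-1]

if __name__ == "__main__":
    species1 = ["ATG", "GT", "AG", "TAG"]  # gene with one intron
    species2 = ["ATG", "TAG"]              # single-exon ortholog
    print(align_signals(species1, species2))
```

In this toy case the donor (GT) and acceptor (AG) of species 1 are left unpaired, which corresponds to an intron present in one species and absent in the other.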
Some Examples
[Figure panels: Intron insertion, Exon insertion, Multiple insertions]
Pair HMM equivalence
[Figure: pair HMM states (E1,E2) and (I1,I2)]
Pair HMM Equivalence
[Figure: intron insertion as a pair HMM path: (E1,E2p), (I1,--), (E1,E2p)]
Orthogonal vs. Oblique Linking
The oblique (red) alignment matches up the two introns
The orthogonal alignment denotes coding regions that have shifted across introns
Dynamic Programming Optimizations
• We only look back to cells left of and below the current cell, and only to those having an edge in both parse graphs to the current cell
• Depending on the Promer alignments we might “cut corners” to improve performance
Scoring Model
$$\operatorname*{argmax}_{\phi_1,\,\phi_2}\; P_1(\phi_1)\, P_2(\phi_2)\, \frac{P(\text{align})}{P(\neg\,\text{align})}$$
where:
• P_i(φ_i) is the probability that sequence i has parse φ_i
• P(align) is the probability that an alignment PHMM generates this pair of proteins (evaluated by the forward algorithm)
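The search for the best pair of parses can be sketched as an exhaustive argmax over candidate parses from each species, working in log space. Everything here is illustrative: parses are reduced to (protein, probability) tuples, and `toy_align_log_score` is a hypothetical stand-in for the pair-HMM alignment term, not the forward algorithm that TWAIN uses.

```python
import math
from itertools import product

def best_parse_pair(parses1, parses2, align_log_score):
    """Return the protein pair maximizing log P1 + log P2 + alignment term."""
    best, best_score = None, -math.inf
    for (prot1, p1), (prot2, p2) in product(parses1, parses2):
        s = math.log(p1) + math.log(p2) + align_log_score(prot1, prot2)
        if s > best_score:
            best, best_score = (prot1, prot2), s
    return best, best_score

def toy_align_log_score(x, y):
    """Hypothetical similarity term: log of a smoothed identity fraction."""
    same = sum(a == b for a, b in zip(x, y))
    return math.log((same + 1) / (max(len(x), len(y)) + 1))

if __name__ == "__main__":
    # Each list holds (predicted protein, parse probability) for one species.
    parses1 = [("MRNDCA", 0.55), ("MRNDCACQ", 0.45)]
    parses2 = [("MKK", 0.55), ("MRNDCTCQ", 0.45)]
    print(best_parse_pair(parses1, parses2, toy_align_log_score))
```

With these made-up numbers the selected pair is not the individually most probable parse in either species, which is the situation the TWAIN premise is designed to handle.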
P(align)
MRNDCACQEGHLINRFPDNAR
||||| || ||||
MRNDCTCQRGHLIATG...TAG
Partial Alignments
[Figure: partial alignment, with the unmatched right end labeled “not penalized”]
• Evaluation of a cell in the alignment matrix depends in many cases on the alignment of partial proteins up to this point in the parse graph (e.g., at a GT-GT cell).
• Because a terminal portion of a partial sequence may be matched later, we do not penalize for insertions/deletions at the right end of the alignment
Pair HMM results
available within a few weeks….
Priority organisms
• Human-mouse gene finding not very high-impact
– lots of ancillary data gives better evidence
– most genes now known
– nonetheless, this problem is getting all the attention
• Countless other species really need gene finders:
– Brugia malayi (causes lymphatic filariasis)
– Toxoplasma gondii
– Schistosoma mansoni (Schistosomiasis)
– Entamoeba histolytica (50 million cases/year)
– Tetrahymena thermophila (model organism)
– Plants: potato, maize, sorghum
– Mammals: chimp, dog, cow, pig
Acknowledgements
GLIMMER: Arthur Delcher, Simon Kasif, Owen White
GlimmerM, GlimmerHMM: Mihaela Pertea
Exonomy, Unveil: Bill Majoros
TWAIN: Mihaela Pertea, Bill Majoros
Funding support: National Institutes of Health (NLM), National Science Foundation (CISE, BIO)
Software downloads: http://www.tigr.org/software