From ESTs to partial genomes
The Environmental Genomics Thematic Programme Data Centre
Alasdair Anthony, Ralf Schmid, James Wasmuth, John Parkinson and Mark Blaxter
Nematode Genomics, Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT
INTRODUCTION
● Expressed sequence tags (ESTs) offer a low-cost approach to gene discovery
● For a range of non-model organisms, ESTs represent the only sequence information available
● Using these data to create 'partial genomes' means the data can be interpreted in a genomic context
● To facilitate the creation of partial genomes, we have created a suite of software tools designed to form a complete EST pipeline
● The first tool in the pipeline, trace2dbest, processes raw chromatograms into high quality sequence objects
● These sequences are then used to build a partial genome, using the PartiGene tool
● The partial genome is held in an SQL database, which can be made accessible through the web
● A further software tool, prot4EST, provides robust translation of the error-prone sequences
SUMMARY
● The PartiGene process has been used to create several species-specific databases, including nembase (http://www.nematodes.org) and lumbribase (http://www.earthworms.org)
● The software is freely available under a GNU license at http://nema.cap.ed.ac.uk/PartiGene
● The software is under continued development: SimiTri (a tool allowing phylogenetic comparisons) is due to be integrated into the pipeline soon. An additional module, annot8er, is also under development
Raw Chromatogram
acatcgaatcgatacatgACGTAGCAGATCAGTACATGATACACGTCGTCGTCTGCATGCTTGCCACGTCCAGTTTGGCCATTAGTACGCCCGCTGACCTGACTCTGACCATTGACCACTGATGTCCATGATTccatgacatcttgatcgtgatcga
Base Calling (phred [1])
TYPE: EST
STATUS: New
CONT_NAME: Blaxter ML
CITATION:
Expressed Sequence Tags from the humus earthworm L. rubellus
LIBRARY: Earthworm Lambda Zap Express Library
EST#: Lr_adE_01H01_T3
CLONE: Lr_adE_01H01
SOURCE:
PCR_F: T3
PCR_B: T7PL
PLATE: 01
ROW: H
COLUMN: 01
SEQ_PRIMER: T3
P_END: 5'
HIQUAL_START: 1
HIQUAL_STOP: 478
DNA_TYPE: cDNA
PUBLIC:
PUT_ID: gb|AAA74396.1| cytochrome c oxidase subunit IV
COMMENT:
Sequencing was performed in Edinburgh
SEQUENCE:
CCAACACCGTCATGTCCGGAGACACGACCATGTTCCCAGGTATCGCCGATCGTATGCAGAAGGAGATCACGAGCATGGCTCCAAGCACGATGAAGATCAAGATCATCGCTCCACCCGAGCGCAAGTACTCCGTATGGATCGGTGGGTCCATCCTGGCTTCCCTGTCCACCTTCCAGCAGATGTGGATCAGCAAGCAGGAGTACGACGAGTCCGGCCCATCCATCGTCCACAGGAAGTGCTTCTAAATGCACCGCCGACAACGAGTTACCAAGGGCGACAGAAAGAACCCGCTAACGCGAGCACACACACGCAAGCAAACACACAGCGTGCACGTACATACAACATCACACAACCCATCTCTATGACTCACACACCTTTTCAACCGAACTTTATCCAAATTACGCAAACCGAAGTTTCGATTTTATTTCGTCCTTGTGGACACAAAAGTAATTTAAAAATCTCTGTACGCCTTAATTTGAGGCTATAGTTTGCTTTTGTAACTTAAGGCGATCACAGATTCTAGATGCAATCGTGACTTTATATTTTACGATTTAT
||
Trimming
Library Details
Library ID : Lr_adE
Adapter1 : GCACGAG
Adapter2 :
Vector file : /usr/software/trace2dbest/vector.seq
Contact name : Blaxter ML
Citation : Expressed Sequence Tags from the humus earthworm L. rubellus
Library Name : Earthworm Lambda Zap Express Library
Sequencing Primer : T3
Pcr primer - forward : T3
Pcr primer - reverse : T7PL
Library description : Sequencing was performed in Edinburgh
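As a rough illustration of how library details and a trimmed read might be combined into a dbEST-style record, here is a minimal Python sketch. It is not the trace2dbest implementation: the function name `format_dbest_record`, the dictionary keys, and the rule that the EST name is the clone name plus the sequencing primer are all assumptions made for this example, and only a subset of the fields shown in the record above is written.

```python
def format_dbest_record(library, clone, seq_primer, sequence,
                        hiqual_start, hiqual_stop):
    """Return a dbEST-style EST record as text.

    Hypothetical helper: field names are copied from the example record,
    but the selection of fields and the EST naming rule are assumptions.
    """
    lines = [
        "TYPE: EST",
        "STATUS: New",
        f"CONT_NAME: {library['contact_name']}",
        "CITATION:",
        library["citation"],
        f"LIBRARY: {library['library_name']}",
        f"EST#: {clone}_{seq_primer}",      # assumed naming rule
        f"CLONE: {clone}",
        f"PCR_F: {library['pcr_forward']}",
        f"PCR_B: {library['pcr_reverse']}",
        f"SEQ_PRIMER: {seq_primer}",
        f"HIQUAL_START: {hiqual_start}",
        f"HIQUAL_STOP: {hiqual_stop}",
        "DNA_TYPE: cDNA",
        "COMMENT:",
        library["description"],
        "SEQUENCE:",
        sequence,
        "||",                               # record terminator
    ]
    return "\n".join(lines)


# Library details taken from the example on this poster.
library = {
    "contact_name": "Blaxter ML",
    "citation": "Expressed Sequence Tags from the humus earthworm L. rubellus",
    "library_name": "Earthworm Lambda Zap Express Library",
    "pcr_forward": "T3",
    "pcr_reverse": "T7PL",
    "description": "Sequencing was performed in Edinburgh",
}
record = format_dbest_record(library, "Lr_adE_01H01", "T3",
                             "CCAACACCGTCATGTCC", 1, 17)
```

Keeping the library details in one dictionary mirrors the idea that they are entered once per library and reused for every read.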
[Figure: trace2dbest. High quality sequence + cDNA library information → dbEST EST file]
[Figure: prot4EST flowchart. Partial genome sequences are screened by BLASTN against an RNA database (sequence similarity, E<e-65), BLASTX against mitochondrially encoded proteins (E<e-8) and BLASTX against SWISS-PROT (E<e-8; parse results, join and extend HSPs). Sequences with no match are passed to ESTScan and DECoder (length and quality filters); sequences that fail the filters are translated by identifying the longest ORF from a six-frame translation (>= 30 residues long). Output: peptide prediction]
● Poor sequence quality, uncertain identification of the coding region, and frame-shifts make EST translation problematic
● prot4EST integrates current translation solutions: BLASTX, DECoder [3] and ESTScan [4]
● Fully compatible with PartiGene
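One of the translation approaches named on this poster is identifying the longest ORF from a six-frame translation, with a minimum length of 30 residues. The Python sketch below illustrates that idea only; it is not the prot4EST code. It assumes the standard genetic code, treats an "ORF" as the longest stop-free stretch in any frame (no start codon is required, since ESTs are often partial), and marks codons containing ambiguous bases as X. All function names are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate complete codons from position 0; unknown codons become X."""
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def longest_orf(dna, min_length=30):
    """Return the longest stop-free peptide over all six frames.

    Returns None if the best candidate is shorter than min_length residues.
    """
    dna = dna.upper()
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            # Stop codons translate to "*", so splitting yields open stretches.
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) > len(best):
                    best = segment
    return best if len(best) >= min_length else None
```

For example, `longest_orf("ATGAAATAG")` returns `None`, because no frame yields 30 residues without a stop codon.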
PartiGene
1 Collate sequences
● Sequences downloaded from public database (dbEST)
2 Cluster
● Sequences clustered on the basis of similarity (BLAST) using CLOBB [2]
3 Assemble
● Clusters assembled to form contigs using phrap (Green, P. unpublished)
4 Partial genome
● Gene A, Gene B, Gene C
5 Annotation
6 Web front ends
● Example PartiGene HTML results output
● Nembase was created using PHP to submit queries to the PartiGene database
● PartiGene represents the core of the partial genome creation process
● All ESTs from a particular species are clustered and assembled to form putative gene objects
● These genes can then be annotated and the information presented as a web-based resource
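The clustering step groups ESTs that share sequence similarity. As a simplified illustration, the Python sketch below performs single-linkage clustering with a union-find structure over pairs of sequences that are assumed to have already been reported as similar (for example by a BLAST search). This is not the CLOBB algorithm, which handles incremental updates and other cases not modelled here; the function name and input format are assumptions made for the example.

```python
def cluster_by_similarity(sequence_ids, similar_pairs):
    """Group sequence IDs into clusters by single linkage.

    sequence_ids:  list of all EST identifiers
    similar_pairs: iterable of (id_a, id_b) pairs judged similar
                   (assumed to come from an upstream similarity search)
    """
    parent = {sid: sid for sid in sequence_ids}

    def find(sid):
        # Follow parent links to the cluster representative,
        # shortening the path as we go.
        while parent[sid] != sid:
            parent[sid] = parent[parent[sid]]
            sid = parent[sid]
        return sid

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a     # merge the two clusters

    clusters = {}
    for sid in sequence_ids:
        clusters.setdefault(find(sid), []).append(sid)
    return sorted(clusters.values(), key=lambda members: members[0])


ids = ["est1", "est2", "est3", "est4"]
hits = [("est1", "est2"), ("est2", "est3")]
clusters = cluster_by_similarity(ids, hits)
# est1, est2 and est3 end up in one cluster; est4 remains a singleton.
```

Each resulting cluster would then be passed to an assembler to build contigs, corresponding to step 3 above.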
● trace2dbest is an interactive utility for processing raw EST data
● The base-calling program phred [1] is used to produce a quality-scored sequence
● trace2dbest then performs a series of trimming steps
● cross_match is used to identify leading and trailing vector sequence
● Next, user-defined leader and adapter sequences are trimmed
● Poly(A) tails are identified based on user-defined parameters and trimmed
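Two of the trimming steps above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the trace2dbest code: the adapter is removed only when the read starts with an exact copy of it, a poly(A) tail is taken to be the first run of at least `min_run` consecutive A residues (everything from that point on is discarded), and the default of 12 is an arbitrary stand-in for the user-defined parameter. Vector screening with cross_match is not modelled.

```python
import re


def trim_leading_adapter(sequence, adapter):
    """Remove the adapter if the read starts with it (exact match only)."""
    if adapter and sequence.upper().startswith(adapter.upper()):
        return sequence[len(adapter):]
    return sequence


def trim_poly_a(sequence, min_run=12):
    """Cut the read at the first run of at least min_run A residues."""
    match = re.search("A{%d,}" % min_run, sequence.upper())
    return sequence[:match.start()] if match else sequence


# Adapter1 from the library details on this poster, followed by an insert
# and an artificial poly(A) tail.
read = "GCACGAG" + "CCAACACCGTCATGTCCG" + "A" * 15 + "GGT"
trimmed = trim_poly_a(trim_leading_adapter(read, "GCACGAG"))
```

A real pipeline would also tolerate mismatches in the adapter and take base quality into account when choosing the cut points.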
● Translation (prot4EST)
● BLAST
● Under development
● Putative location
● Functional prediction
● Structure prediction
● Domain identification
Acknowledgments: The authors would like to thank Ann Hedley and the rest of the Environmental Genomics Data Centre team for their help. The project is funded by NERC.
References:
1. Ewing, B. & Green, P. (1998) Base-calling of automated sequencer traces using phred. Genome Res. 8, 175-194
2. Parkinson, J., Guiliano, D.B. & Blaxter, M. (2002) Making sense of EST sequences by CLOBBing them. BMC Bioinformatics. 3, 31
3. Fukunishi, Y. & Hayashizaki, Y. (2001) Amino-acid translation for cDNA with frame-shift error. Physiol. Genomics. 5, 81-87
4. Iseli, C., Jongeneel, C.V. & Bucher, P. (1999) ESTScan: A program for detecting, evaluating and reconstructing potential coding regions in EST sequences. ISMB 7, 138-158