
Post on 06-Jan-2016

From ESTs to partial genomes. The Environmental Genomics Thematic Programme Data Centre. Alasdair Anthony, Ralf Schmid, James Wasmuth, John Parkinson and Mark Blaxter. Nematode Genomics, Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT.


INTRODUCTION
● Expressed sequence tags (ESTs) offer a low-cost approach to gene discovery
● For a range of non-model organisms, ESTs represent the only sequence information available
● Using these data to create 'partial genomes' means the data can be interpreted in a genomic context
● To facilitate the creation of partial genomes, we have created a suite of software tools designed to form a complete EST pipeline
● The first tool in the pipeline, trace2dbest, processes raw chromatograms into high-quality sequence objects
● These sequences are then used to build a partial genome, using the PartiGene tool
● The partial genome is held in an SQL database, which can be made accessible through the web
● A further software tool, prot4EST, provides robust translation of the error-prone sequences

SUMMARY
● The PartiGene process has been used to create several species-specific databases, including nembase (http://www.nematodes.org) and lumbribase (http://www.earthworms.org)
● The software is freely available under a GNU license at http://nema.cap.ed.ac.uk/PartiGene
● The software is under continued development: SimiTri (a tool allowing phylogenetic comparisons) is due to be integrated into the pipeline soon, and an additional module, annot8er, is also under development

Raw Chromatogram

acatcgaatcgatacatgACGTAGCAGATCAGTACATGATACACGTCGTCGTCTGCATGCTTGCCACGTCCAGTTTGGCCATTAGTACGCCCGCTGACCTGACTCTGACCATTGACCACTGATGTCCATGATTccatgacatcttgatcgtgatcga

Base Calling (PHRED [1])

TYPE: EST
STATUS: New
CONT_NAME: Blaxter ML
CITATION:
Expressed Sequence Tags from the humus earthworm L. rubellus
LIBRARY: Earthworm Lambda Zap Express Library
EST#: Lr_adE_01H01_T3
CLONE: Lr_adE_01H01
SOURCE:
PCR_F: T3
PCR_B: T7PL
PLATE: 01
ROW: H
COLUMN: 01
SEQ_PRIMER: T3
P_END: 5'
HIQUAL_START: 1
HIQUAL_STOP: 478
DNA_TYPE: cDNA
PUBLIC:
PUT_ID: gb|AAA74396.1| cytochrome c oxidase subunit IV
COMMENT:
Sequencing was performed in Edinburgh
SEQUENCE:
CCAACACCGTCATGTCCGGAGACACGACCATGTTCCCAGGTATCGCCGATCGTATGCAGA
AGGAGATCACGAGCATGGCTCCAAGCACGATGAAGATCAAGATCATCGCTCCACCCGAGC
GCAAGTACTCCGTATGGATCGGTGGGTCCATCCTGGCTTCCCTGTCCACCTTCCAGCAGA
TGTGGATCAGCAAGCAGGAGTACGACGAGTCCGGCCCATCCATCGTCCACAGGAAGTGCT
TCTAAATGCACCGCCGACAACGAGTTACCAAGGGCGACAGAAAGAACCCGCTAACGCGAG
CACACACACGCAAGCAAACACACAGCGTGCACGTACATACAACATCACACAACCCATCTC
TATGACTCACACACCTTTTCAACCGAACTTTATCCAAATTACGCAAACCGAAGTTTCGAT
TTTATTTCGTCCTTGTGGACACAAAAGTAATTTAAAAATCTCTGTACGCCTTAATTTGAG
GCTATAGTTTGCTTTTGTAACTTAAGGCGATCACAGATTCTAGATGCAATCGTGACTTTA
TATTTTACGATTTAT
||
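A record of this kind, laid out with one `FIELD:` tag per line and terminated by `||`, is straightforward to read back into a dictionary. The sketch below is only an illustration: the function name `parse_dbest_record` is hypothetical, it is not part of trace2dbest, and it assumes that field tags are upper-case tokens without spaces and that untagged lines continue the previous field.

```python
def parse_dbest_record(text: str) -> dict:
    """Parse one 'FIELD: value' record into a dict (illustrative sketch only)."""
    fields = {}
    key = None
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "||":  # skip blanks and the record terminator
            continue
        head, sep, rest = line.partition(":")
        # Assumption: a tag is an upper-case token such as TYPE, PCR_F or EST#.
        is_tag = sep and " " not in head and head.replace("_", "").replace("#", "").isupper()
        if is_tag:
            key = head
            fields[key] = rest.strip()
        elif key is not None:
            # Continuation line: sequence lines are joined without spaces.
            joiner = "" if key == "SEQUENCE" else " "
            fields[key] = (fields[key] + joiner + line).strip()
    return fields


record = """TYPE: EST
CITATION:
Expressed Sequence Tags from the humus earthworm L. rubellus
EST#: Lr_adE_01H01_T3
SEQUENCE:
CCAACACC
GTCATGTC
||"""
parsed = parse_dbest_record(record)
print(parsed["EST#"], len(parsed["SEQUENCE"]))
```

A value line that itself begins with an upper-case token followed by a colon would be misread as a new field, so a production parser would check tags against the known list of dbEST field names.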

Trimming

Library Details

Library ID : Lr_adE
Adapter1 : GCACGAG
Adapter2 :
Vector file : /usr/software/trace2dbest/vector.seq
Contact name : Blaxter ML
Citation : Expressed Sequence Tags from the humus earthworm L. rubellus
Library Name : Earthworm Lambda Zap Express Library
Sequencing Primer : T3
Pcr primer - forward : T3
Pcr primer - reverse : T7PL
Library description : Sequencing was performed in Edinburgh

[Figure: trace2dbest workflow. Labels: High quality sequence + cDNA library information; trace2dbest; dbEST; EST file]

[Figure: prot4EST flowchart. Steps: Partial Genome Sequences; BLASTN against RNA database; BLASTX against mitochondrially encoded proteins; BLASTX against SWISSProt; Join and extend HSPs; Run ESTScan; Run DECoder; Parse results; length and quality filters; Identify longest ORF from six frame translation (>= 30 residues long); Peptide prediction. Branch labels: no match (x3); fails filters (x2); sequence similarity (E < e-8, E < e-8, E < e-65)]

From ESTs to partial genomes

Alasdair Anthony, Ralf Schmid, James Wasmuth, John Parkinson and Mark Blaxter

Nematode Genomics, Institute of Cell, Animal and Population Biology, University of Edinburgh, EH9 3JT

● Poor sequence quality, identification of coding region and frame-shifts make EST translation problematic

● prot4EST integrates current translation solutions: BLASTX, DECoder [3] and ESTScan [4]

● Fully compatible with PartiGene
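The last step named in the prot4EST flowchart, identifying the longest ORF from a six-frame translation and keeping it only if it is at least 30 residues long, can be sketched as follows. This is a minimal illustration and not prot4EST's code: the function names are hypothetical, the standard genetic code is assumed, and an "ORF" is taken here to be the longest stop-free stretch in any frame (no start codon is required, since ESTs are often partial).

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (NCBI table 1 layout).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq: str) -> str:
    """Translate from position 0; unknown codons (e.g. containing N) become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def longest_orf(seq: str, min_len: int = 30):
    """Return the longest stop-free peptide over all six frames, or None if too short."""
    seq = seq.upper()
    best = ""
    for strand in (seq, reverse_complement(seq)):
        for frame in range(3):
            for segment in translate(strand[frame:]).split("*"):
                if len(segment) > len(best):
                    best = segment
    return best if len(best) >= min_len else None


example = "ATG" + "GCT" * 35 + "TAA"
peptide = longest_orf(example)
print(len(peptide) if peptide else None)
```

Because every frame on both strands is searched, the peptide returned is not necessarily from the forward reading frame; it is simply the longest stretch found anywhere.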

PartiGene

1 Collate sequences

dbEST

● Sequences downloaded from public database

2 Cluster

● Sequences clustered on the basis of similarity (BLAST) using CLOBB [2]

3 Assemble

● Clusters assembled to form contigs using phrap (Green, P. unpublished)

4 Partial genome

Gene A

Gene B

Gene C

5 Annotation

Example PartiGene HTML results output

Nembase was created using php to submit queries to the PartiGene database

6 Web front ends

● PartiGene represents the core of the partial genome creation process

● All ESTs from a particular species are clustered and assembled to form putative gene objects

● These genes can then be annotated and the information presented as a web based resource
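The idea of grouping ESTs into clusters on the basis of pairwise similarity can be illustrated with a simple single-linkage grouping. This is not CLOBB's algorithm, which is described in reference 2; it is only a sketch that assumes a list of EST identifiers and a precomputed list of pairs judged similar (for example from BLAST hits passing some threshold). All names are hypothetical.

```python
def cluster_by_hits(ids, hits):
    """Group identifiers so that any two linked by a chain of hits share a cluster."""
    parent = {i: i for i in ids}

    def find(x):
        # Walk up to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two clusters

    clusters = {}
    for i in ids:
        clusters.setdefault(find(i), []).append(i)
    return sorted(sorted(members) for members in clusters.values())


ests = ["est1", "est2", "est3", "est4", "est5", "est6"]
similar = [("est1", "est2"), ("est2", "est3"), ("est4", "est5")]
clusters = cluster_by_hits(ests, similar)
print(clusters)
```

Each resulting cluster would then be passed to an assembler to build contigs; ESTs with no hits remain as singleton clusters.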

● trace2dbest is an interactive utility for processing raw EST data

● The basecalling program phred is used to produce a quality-scored sequence

● trace2dbest then performs a series of trimming steps

● cross_match is used to identify leading and trailing vector sequence

● Next, user-defined leader and adapter sequences are trimmed

● Poly(A) tails are identified based on user-defined parameters and trimmed
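The adapter and poly(A) trimming steps can be sketched as below. This is an illustrative approximation, not trace2dbest's implementation: the function names are hypothetical, the adapter is matched only as an exact prefix of the read, and the "user defined parameter" for poly(A) detection is assumed to be a minimum run length (`min_run`), after which the rest of the read is discarded. The adapter value is the Adapter1 entry from the library details shown on the poster.

```python
def trim_adapter(seq: str, adapter: str) -> str:
    """Remove a leading adapter (case-insensitive) if the read starts with it."""
    if adapter and seq.upper().startswith(adapter.upper()):
        return seq[len(adapter):]
    return seq


def trim_poly_a(seq: str, min_run: int = 10) -> str:
    """Cut the read at the first run of at least `min_run` consecutive A's."""
    idx = seq.upper().find("A" * min_run)
    return seq if idx == -1 else seq[:idx]


def trim_read(seq: str, adapter: str = "GCACGAG", min_run: int = 10) -> str:
    """Apply adapter trimming, then poly(A) trimming."""
    return trim_poly_a(trim_adapter(seq, adapter), min_run)


raw = "GCACGAG" + "CCAACACCGT" + "A" * 15 + "GGTT"
print(trim_read(raw))
```

Real reads contain sequencing errors, so a practical tool would tolerate mismatches in both the adapter and the poly(A) run; exact matching is used here only to keep the sketch short.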

● Translation (prot4EST)
● BLAST
● Under development:
  ● Putative location
  ● Functional prediction
  ● Structure prediction
  ● Domain identification

RNA sequences

Acknowledgments: The authors would like to thank Ann Hedley and the rest of the Environmental Genomics Data Centre team for their help. The project is funded by NERC.

References:
1. Ewing, B. & Green, P. (1998) Base-calling of automated sequencer traces using phred. Genome Res. 8, 175-194
2. Parkinson, J., Guiliano, D.B. & Blaxter, M. (2002) Making sense of EST sequences by CLOBBing them. BMC Bioinformatics. 3, 31
3. Fukunishi, Y. & Hayashizaki, Y. (2001) Amino-acid translation for cDNA with frame-shift error. Physiol. Genomics. 5, 81-87
4. Iseli, C., Jongeneel, C.V. & Bucher, P. (1999) ESTScan: A Program for detecting, evaluating and reconstructing potential coding regions in EST sequences. ISMB7, 138-158

The Environmental Genomics Thematic Programme Data Centre