pipeline for functional annotation of novel …accelrys.com/.../pdf/hyseq_pipeline_annot.pdf ·...


Author: donhan

Post on 09-Mar-2018


  • [Figure: Structure Search Results page, shown for example sequences HS00222 and HS00945. Labeled fields: sequence identifier; query sequence length; total number of hits for each sequence; structure method; start, end & range of model sequence; Psi-Blast score & link; 3D model scores; SeqFold score; model sequence % identity & similarity; PDB template origin; PDB functional annotation; PDB active site annotation; SCOP & PDB public database links.]

    Acknowledgements

    We would like to thank Sue Andarmani and Ling Jiang (web interface), Savita Jayaram (Structure Plus), Kiran Mukhyala (structure analysis tools), Ami Gavali (SQL database), and Ivan Labat for their excellent work and contributions.

    Disclaimer: Sequence and structure data are only representations of the real data.

    References

    Sánchez, R., Šali, A., PNAS 95 (1998) 13597-13602.

    Fischer, D., Eisenberg, D., Theor. Chem. Acc. 101 (1999) 57.

    Fischer, D., Eisenberg, D., Protein Sci. 5 (1996) 947-955.

    Lüthy, R., Bowie, J., Eisenberg, D., Nature 356 (1992) 83-85.

    Kitson, D., et al., Briefings in Bioinform. In press.

    Abstract

    We have created a high-throughput, integrated pipeline of structure analysis protocols for over 10,000 of Hyseq's proprietary protein sequences. The pipeline combines 3D structure prediction and functional annotation (GeneAtlas™, Accelrys Inc., San Diego), parsing and data-mining programs, an SQL structure database, and several structural analysis programs, all accessible through an in-house web interface. It has yielded significant structure hits (over 100,000) and 3D models for many of our novel protein sequences. After storing the hit information in the database, we data-mine the hits by keyword and analyze template-model structure pairs for individual novel proteins. The results are used to aid functional annotation of our sequences by structural homology, including identification of active site residues; to interpret and verify sequence-based annotation; and to rapidly target novel genes to appropriate assays.

    High-throughput 3D structure prediction from novel gene sequences has created new opportunities for us to discover biopharmaceuticals that act through novel mechanisms.
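The keyword data-mining step described in the abstract can be illustrated with a small sketch. This is not the Hyseq implementation: the `Hit` record, its field names, the template labels, and the annotation strings are all hypothetical. Only the sequence identifiers HS00222 and HS00945 come from the poster's screenshots.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """One structure hit for a query sequence (hypothetical record layout)."""
    sequence_id: str   # query sequence identifier, e.g. "HS00222"
    template: str      # placeholder template label, not a real PDB ID
    method: str        # "model" or "seqfold" (assumed method names)
    annotation: str    # functional annotation text (invented for this example)


def search_by_keyword(hits, keyword):
    """Return hits whose annotation contains the keyword, ignoring case."""
    needle = keyword.lower()
    return [h for h in hits if needle in h.annotation.lower()]


def count_hits_per_sequence(hits):
    """Return the total number of hits for each query sequence."""
    return dict(Counter(h.sequence_id for h in hits))


# Invented example data.
HITS = [
    Hit("HS00222", "tmplA", "model", "serine protease; catalytic triad"),
    Hit("HS00222", "tmplB", "seqfold", "protein kinase domain"),
    Hit("HS00945", "tmplC", "model", "Protein kinase, ATP-binding site"),
]

kinase_hits = search_by_keyword(HITS, "KINASE")
hit_counts = count_hits_per_sequence(HITS)
```

Matching on the annotation text rather than on scores mirrors the "by keyword(s) or sequence ID" search shown in the user interface. A production system would more plausibly run this query in the SQL database than in application code.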

    PIPELINE FOR FUNCTIONAL ANNOTATION OF NOVEL PROTEINS BY STRUCTURAL HOMOLOGY

    Dana Haley-Vicente* and Nancy Mize

    Hyseq Pharmaceuticals Inc., 675 Almanor Ave., Sunnyvale, CA 94086

    * Currently at Accelrys, 9685 Scranton Rd., San Diego, CA 92121

    USER INTERFACE

    [Figure: 3D Protein Structure Search page. Labels: searching database by keyword(s) or sequence ID; filtering for best hits; search & print fields options.]

    [Figure: 3D Viewer Links. Labels: query sequence model; PDB template structure.]

    [Figure: Structure Analysis page. Label: alignment analysis.]

    [Figure: Individual GeneAtlas™ 3D Model Report (HS01Project). Labels: query sequence & template alignment; secondary structure annotation.]

    [Figure: Individual GeneAtlas™ SeqFold Report (HS01Project). Label: statistical analysis.]

    PIPELINE & DATABASE

    [Figure: The Protein Structure Pipeline. Flowchart boxes: Cloning; Sequencing; Protein Sequences (Projects); GeneAtlas™; Template Search (Psi-Blast); Sequence / Template Alignment (Psi-Blast, PDB95); Model Generation (MODELER); Model Evaluation (Profiles-3D/Verify & PMF); Threading (SeqFold); Model Annotation; Structure Plus (Parser & Filter Program); Structure Database; Datamining; 3D Active Site Annotation; 3D Alignment & Statistical Analysis Tools.]
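The "Parser & Filter Program" box in the pipeline can be sketched as follows, purely as an illustration. The poster does not show the GeneAtlas report format or the filter criteria, so the tab-delimited layout, the column names, the threshold values, and the assumption that higher scores are better are all invented for this example.

```python
import csv
import io

# Hypothetical tab-delimited export; the real report format is not given in the poster.
SAMPLE_REPORT = (
    "seq_id\ttemplate\tmethod\tscore\tpct_identity\n"
    "HS00222\ttmplA\tmodel\t0.82\t41.0\n"
    "HS00222\ttmplB\tseqfold\t0.35\t18.5\n"
    "HS00945\ttmplC\tmodel\t0.67\t29.0\n"
)


def parse_report(text):
    """Parse the tab-delimited report into a list of typed dictionaries."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {
            "seq_id": row["seq_id"],
            "template": row["template"],
            "method": row["method"],
            "score": float(row["score"]),
            "pct_identity": float(row["pct_identity"]),
        }
        for row in reader
    ]


def filter_hits(rows, min_score=0.5, min_identity=25.0):
    """Keep rows that meet both thresholds (arbitrary illustrative cutoffs)."""
    return [
        r for r in rows
        if r["score"] >= min_score and r["pct_identity"] >= min_identity
    ]


parsed = parse_report(SAMPLE_REPORT)
kept = filter_hits(parsed)
```

Separating parsing from filtering keeps the thresholds adjustable without touching the parser, which matters when the same hits are re-filtered for different annotation questions.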

    [Figure: Hyseq's Relational Structure Database. Schema labels: project data; individual sequence data; method type; individual hit data; active site data for query sequence models; active site & template PDB structure data; join to other databases.]
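One way to picture a relational layout with the groups named in the database figure (project, sequence, method, hit, and template data) is the in-memory SQLite sketch below. The table names, column names, template labels, and all inserted values are assumptions made for illustration. The poster does not give the actual schema, and the assignment of HS00222 and HS00945 to HS01Project is invented here.

```python
import sqlite3

# Hypothetical schema loosely following the labels in the database figure.
SCHEMA = """
CREATE TABLE project  (project_id TEXT PRIMARY KEY);
CREATE TABLE sequence (seq_id TEXT PRIMARY KEY,
                       project_id TEXT REFERENCES project(project_id),
                       length INTEGER);
CREATE TABLE method   (method_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE template (template_id TEXT PRIMARY KEY,
                       annotation TEXT, active_site TEXT);
CREATE TABLE hit      (hit_id INTEGER PRIMARY KEY,
                       seq_id TEXT REFERENCES sequence(seq_id),
                       template_id TEXT REFERENCES template(template_id),
                       method_id INTEGER REFERENCES method(method_id),
                       score REAL);
"""


def build_demo_db():
    """Create an in-memory database filled with invented example rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO project VALUES ('HS01Project')")
    conn.executemany("INSERT INTO sequence VALUES (?, ?, ?)", [
        ("HS00222", "HS01Project", 310),
        ("HS00945", "HS01Project", 452),
    ])
    conn.executemany("INSERT INTO method VALUES (?, ?)",
                     [(1, "3D model"), (2, "SeqFold")])
    conn.executemany("INSERT INTO template VALUES (?, ?, ?)", [
        ("tmplA", "serine protease", "H57 D102 S195"),
        ("tmplC", "protein kinase", "K72 D166"),
    ])
    conn.executemany(
        "INSERT INTO hit (seq_id, template_id, method_id, score) VALUES (?, ?, ?, ?)",
        [("HS00222", "tmplA", 1, 0.82),
         ("HS00945", "tmplC", 1, 0.67),
         ("HS00945", "tmplA", 2, 0.21)])
    return conn


def hits_for_keyword(conn, keyword):
    """Join hits to sequences, templates, and methods; match annotation text."""
    sql = """
        SELECT s.seq_id, t.template_id, m.name, h.score
        FROM hit AS h
        JOIN sequence AS s ON s.seq_id = h.seq_id
        JOIN template AS t ON t.template_id = h.template_id
        JOIN method   AS m ON m.method_id = h.method_id
        WHERE t.annotation LIKE ?
        ORDER BY h.score DESC
    """
    return conn.execute(sql, (f"%{keyword}%",)).fetchall()


conn = build_demo_db()
protease_rows = hits_for_keyword(conn, "protease")
```

Keeping template annotation and active site data in their own table means each template is stored once and joined to many hits, which is one plausible reading of the "active site & template PDB structure data" group in the figure.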