
SLRITools Project: Providing a Platform for Bioinformatics Research

Michel Dumontier

Bioinformatics Technology Conference
February 3 - 6, 2003

Michel Dumontier – SLRITools Project

SLRITools Outline

• Introduction
• Open-source
• Toolkit Foundation
• Toolkit Projects
• Future Prospects


Christopher W.V. Hogue Lab: An Engineering Approach Towards Cellular Simulation

[Diagram: layered architecture for cellular simulation. Labels as extracted:]
• Whole Cell Visualization
• GRID Computing Layer
• Modular Cell Simulation Software Layer
• Data Access Layer
• Cell Geometry; Molecules; Interactions; Reactions; Kinetics, PTMs; Initial Conditions
• SeqHound; BIND
• microscopy; NCBI/EBI/DDBJ; PDB; Expression, Concentration, Localization/distributions; Proteomics/Genomics


SLRITools

Purpose
• Make freely available our sequence and structure manipulation and analysis infrastructure and tool software to the greater benefit of the Bioinformatics community


SLRITools

Description
• Mainly C-based cross-platform toolkit for dealing with biological information, especially protein structure/function.
• Extends the freely available NCBI C/C++ Toolkits and forms the basis for a number of powerful applications
• GPL/LGPL/PAL licenses
• Currently hosted at http://sourceforge.net/projects/slritools
• Training tutorials http://bioinfo.mshri.on.ca/tkcourse/
• Canadian Bioinformatics Workshops http://bioinformatics.ca


SLRITools

Projects
• SLRI lib - common library that extends NCBI Toolkit

• SeqHound - Sequence and Structure Database Management System

• BIND - Biomolecular Interaction Network Database

• Text Indexer - ASN.1 indexer

• NBLAST – Cluster variant of BLAST for NxN comparisons

• Kangaroo – Regular expression search of DNA/protein/CDR


Hogue Lab - Source Code

[Diagram: software stack. Labels as extracted: BIND, SeqHound, MoBiDiCK, TraDES, database, NCBI C, NCBI C++, SLRI]

• 2.6M lines of source code, 160 person-years of work
• 450,000 lines of source code, 22 person-years of work
• Industry standard: 65 lines/day

http://ncbi.nlm.nih.gov/IEB/

http://sourceforge.net/projects/slritools
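
The figures on this slide relate line counts to effort through the quoted industry standard of 65 lines/day. A minimal sketch of that arithmetic follows, assuming 250 working days per person-year; that number is an assumption, since the slide does not state it, and the function name and defaults are illustrative only.

```python
def person_years(lines_of_code, lines_per_day=65, working_days_per_year=250):
    """Estimate effort in person-years from a source line count.

    lines_per_day comes from the slide; working_days_per_year is an
    assumption made for this sketch.
    """
    return lines_of_code / lines_per_day / working_days_per_year


if __name__ == "__main__":
    for count in (2_600_000, 450_000):
        print(count, round(person_years(count), 1))
```

Under that assumption 2.6M lines works out to 160 person-years, which matches the slide. The same assumption gives roughly 27.7 person-years for 450,000 lines, so the 22 person-year figure presumably rests on a different basis.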


Going Open Source

• Subject to the Intellectual Property Policy of Mt. Sinai Hospital

• Does the software have the potential to improve patient care?

• Does the software have economic benefits that will fund new research and development?

• Patents, Licenses & Publications


Software Licenses

Stage 1) “Not Released”
– “No license”
– Internal use only
– Protects commercial interest of MSH
• distributedfolding

Stage 2) “Free to Academics”
– Executables provided free, source upon request
– Publication
– Companies must license from MSH
• MCODE, TRADES, SSSF

Stage 3) “Public Use License”
– GNU General Public License
• SeqHound, BIND Data Manager, BIND specification
– Perl Artistic License/GNU Lesser General Public License
• SeqHound Remote Interfaces for BioPERL/C, C++ API

[Diagram labels: MSH Board subcommittee on commercialization; SLRI Industrial Liaison; Tech Transfer Office; Patent IP]


Open Source Issues

• Software Releases
• Support


SLRITools Foundation

• National Center for Biotechnology Information (NCBI)

• NCBI Toolbox - Information Engineering Branch
– http://www.ncbi.nlm.nih.gov/IEB/
– GenBank, Entrez, BLAST, Sequin, OMIM, RefSeq

1. Data Model – An explicit, complete data model of biological sequences, structures, bibliographic data, and associated annotations

2. Data Encoding - A formal specification and encoding rules. The telecommunications standard, ASN.1, has been used for this. Recently it has been mapped to a similar language, XML. Provides automatic code generators.


SLRITools Foundation II

3. Programming Libraries
– Originally written in a portable dialect of C. Recently a new generation is being written in C++.
– Compiled and occasionally tested on over 14 operating systems
• Linux, HPUX, MacOS 9/X, Irix, Solaris, Windows 3.1/95/NT/2000/XP, BeOS, QNX, alpha, BSD, AIX, parisc-Linux, Sony PlayStation2 Linux
• 16/32/64 bit hardware
– Open Source – Free License
– ftp://ftp.ncbi.nih.gov/toolbox/


SeqHound

• SeqHound is a sequence and structure database management system that inherits the NCBI data model and mirrors the NCBI core biological sequence and structure information

• Why did we develop SeqHound?
– Too many hits to NCBI server -> banned IP!
– Data transmission & network connection issues
– Generate more sophisticated API to access data currently only available within the NCBI
– Faster, local or remote access with a variety of programming languages
– Provide functionality necessary to retrieve specialized subsets of sequences, structures and structural domains.


SeqHound

[Diagram: SeqHound architecture. Labels as extracted:]
• Content: Nucleic Acids, Proteins, 3D Structures, Domains, PubMed Links, Taxonomy Identifiers, Coding Regions, Genome Sets, Redundancy, Neighbors, GO Annotation, LocusLink, Fielded Text Index, Medline, XML/DB2
• Formats: GFF, FASTA, Clustal, PDB, XML, ASN.1
• Pipeline: NCBI ftp site; database fill; database update; databases
• Built on: NCBI toolkit, bzip library, CodeBase
• SeqHound API: local API, remote API server, www server
• Applications

http://seqhound.mshri.on.ca

150+ functions

Daily Updated


SeqHound Resources

• SeqHound is accessible via

– http://seqhound.mshri.on.ca
– Simple web interface (under development)
– C, C++, Java (new!), Perl remote API or an optimized local API. (->SOAP?)

• Timeline
– Redundant fail-over server mid-summer
– Concurrent with Bioperl release

• Freely available article published in BMC Bioinformatics 2002, 3:32

• http://www.biomedcentral.com/1471-2105/3/32/


BIND

Biomolecular Interaction Network Database

Motivation:
• Massive influx of biomolecular interaction data requires repository, standards and access

Goals:
• Provide a standard, comprehensive and integrated interaction resource to the scientific community
• Define protein function and mechanisms
• Recover and integrate biomolecular interaction knowledge (backfilling)
• Discover new knowledge through data mining


Result:
• Database to archive and exchange molecular assembly information
• Describes
– Interactions
– Complexes
– Pathways
• BIND has an extensive data model, GNU software tools and is based on the NCBI toolkit.
• Recently funded for a 3 year effort at 25M CDN
– CIHR (1M), OGI/Genome Canada (12.5M), Ontario R&D Challenge fund (5.2M)
– IBM, MDS Proteomics and Foundry Networks
– Sun

http://bind.ca


BIND Data Policies

GenBank Policy
– BIND data is freely available for any purpose

Direct Submission
– Submitters cannot limit the intended use of submitted BIND data
– Submitters have the right to edit/alter their records over time
– Suggestions made by a third party will be forwarded by us to the submitters to seek approval for any changes or corrections

Availability
– ftp://ftp.bind.ca – ASN.1/XML data+specification


Browsing BIND http://bind.ca


Visually Navigating BIND


Molecular Complex Detection (MCODE)

• Assume densely connected regions of a heterogeneous interaction network represent molecular complexes

• MCODE finds densely connected regions of a graph

• Weight nodes by local density (scoring function)
• From highest weighted node, recursively add neighbours above threshold score to complex

• Evaluation (Yeast):
– 88/221 CellZome hand annotated complexes
– 64/208 MIPS complexes (166 predicted)

• 200 complexes predicted in 15,143 protein interactions from yeast

Published: BMC Bioinformatics 2003, 4:2
• http://www.biomedcentral.com/1471-2105/4/2
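
The two steps named on this slide (weight nodes by local density, then grow a complex outward from the highest-weighted node) can be sketched roughly as below. This is a simplified illustration, not the published MCODE scoring function: the weight used here (neighbourhood edge density times degree), the `threshold_fraction` parameter, and all function names are assumptions made for the example, and the "recursive" expansion is written with an explicit stack.

```python
from itertools import combinations


def node_weight(graph, node):
    """Stand-in weight: edge density of the node's neighbourhood times its degree."""
    members = graph[node] | {node}
    if len(members) < 2:
        return 0.0
    edges = sum(1 for a, b in combinations(members, 2) if b in graph[a])
    possible = len(members) * (len(members) - 1) / 2
    return edges / possible * len(graph[node])


def find_complexes(graph, threshold_fraction=0.8):
    """Seed from the highest-weighted unseen node and add neighbours whose
    weight is at least threshold_fraction of the seed's weight."""
    weights = {n: node_weight(graph, n) for n in graph}
    seen, complexes = set(), []
    for seed in sorted(graph, key=lambda n: (-weights[n], n)):
        if seed in seen:
            continue
        cutoff = weights[seed] * threshold_fraction
        members, stack = set(), [seed]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            members.add(node)
            stack.extend(m for m in graph[node]
                         if m not in seen and weights[m] >= cutoff)
        if len(members) > 1:  # ignore singletons
            complexes.append(members)
    return complexes


# Toy undirected graph (hypothetical data): a 4-clique A-D with a tail D-E-F.
graph = {
    "A": {"B", "C", "D"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"}, "E": {"D", "F"}, "F": {"E"},
}

if __name__ == "__main__":
    print(find_complexes(graph))
```

On the toy graph the clique is reported as one complex and the sparse tail is left out, which is the behaviour the slide describes for densely connected regions.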


[Figure: 9-core from ~15,000 yeast interactions. Labels: Dense Fibrillar Center, Fibrillar Center, Granular Component]
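
For readers unfamiliar with the term, a k-core is the largest subgraph in which every node has at least k neighbours inside that subgraph; the figure shows the case k = 9. A minimal sketch of the usual peeling procedure follows. The function name and the toy graph are illustrative, and this is not taken from the lab's code.

```python
def k_core(graph, k):
    """Return the node set of the k-core of an undirected graph.

    graph maps each node to the set of its neighbours. Nodes with fewer
    than k remaining neighbours are removed repeatedly until none are left.
    """
    remaining = {node: set(nbrs) for node, nbrs in graph.items()}
    changed = True
    while changed:
        changed = False
        low = [n for n, nbrs in remaining.items() if len(nbrs) < k]
        for node in low:
            # Detach the node from its surviving neighbours, then drop it.
            for nbr in remaining.pop(node):
                remaining[nbr].discard(node)
            changed = True
    return set(remaining)


# Toy undirected graph (hypothetical data): a 4-clique A-D with a tail D-E-F.
toy = {
    "A": {"B", "C", "D"}, "B": {"A", "C", "D"}, "C": {"A", "B", "D"},
    "D": {"A", "B", "C", "E"}, "E": {"D", "F"}, "F": {"E"},
}

if __name__ == "__main__":
    print(sorted(k_core(toy, 3)))
```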


FAST = “parallel” RPS BLAST

Used to spot domain similarities in a protein interaction cluster

Server-generated scalable FLASH graphics: zoomable, printable.

Follow up by zooming in on FASTA-formatted sequences to see domain superposition and links to SMART/PFAM


Nucleic Acids Res. 2003 Jan 1;31(1):248-50


NBLAST

Description:
• NBLAST is a cluster computer variant of BLAST
• It performs the minimum number of sequence comparisons and stores sequence alignments and the list of similar sequences (neighbours) as binary ASN.1 (XML)
• NBLAST is written in C using the NCBI C Toolkit.
• Separate function and database layers

Accessibility: via SeqHound
• http://seqhound.mshri.on.ca Neighbours DB (codebase)
• ftp://ftp.mshri.on.ca/pub/nblast

Published: BMC Bioinformatics 2002, 3:13
• http://www.biomedcentral.com/1471-2105/3/13/

Ookpik cluster: CFI/ORDCF funded.
• 216 P-III 450, 64 GB, 1.2 TB disk
• NBLAST, RPS-BLAST, TRADES, MoBiDiCK

http://bioinfo.mshri.on.ca/yac/ http://sourceforge.net/projects/slritools/
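
One way to read "the minimum number of sequence comparisons" for an NxN job is that each unordered pair of sequences is compared once, giving N(N-1)/2 comparisons instead of N². That reading is an assumption here, and the sketch below only enumerates the pairs; it does not run BLAST, and the function names are illustrative.

```python
from itertools import combinations


def unique_pairs(sequence_ids):
    """List each unordered pair of sequence identifiers exactly once."""
    return list(combinations(sequence_ids, 2))


def pair_count(n):
    """Number of unordered pairs among n sequences: n(n-1)/2."""
    return n * (n - 1) // 2


if __name__ == "__main__":
    ids = ["seqA", "seqB", "seqC", "seqD"]  # hypothetical identifiers
    print(unique_pairs(ids))
    print(pair_count(len(ids)))
```

On a cluster, a pair list like this could be split into chunks and handed to separate nodes, since no pair depends on another.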


Kangaroo

Description:
• Kangaroo is implemented to facilitate a wide range of queries with no restriction on the length or complexity of the query expression
• Uses regular expressions
• Search DNA, protein, or coding region
• Web-based form and results
• Links to SeqHound

Accessibility:
• http://bioinfo.mshri.on.ca/kangaroo currently supports searches on 10 organisms (including human, mouse)

Published: BMC Bioinformatics 2002, 3:20 http://www.biomedcentral.com/1471-2105/3/20
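
To give a feel for regular-expression searching over sequences, here is a small sketch using Python's standard `re` module. It is not Kangaroo's implementation or query syntax; the function name and the example sequences are made up for illustration. The pattern is the N-glycosylation motif N-{P}-[ST]-{P} written as a regular expression.

```python
import re


def find_motif(pattern, sequences):
    """Return (name, start, matched_text) for every non-overlapping match."""
    regex = re.compile(pattern)
    hits = []
    for name, seq in sequences.items():
        for match in regex.finditer(seq):
            hits.append((name, match.start(), match.group()))
    return hits


# Hypothetical protein fragments, one-letter amino acid codes.
proteins = {
    "example1": "MKNGSAPLNPTQ",
    "example2": "AAAA",
}

# N, then any residue except P, then S or T, then any residue except P.
N_GLYCOSYLATION = r"N[^P][ST][^P]"

if __name__ == "__main__":
    print(find_motif(N_GLYCOSYLATION, proteins))
```

Start positions are 0-based, as is conventional in Python; sequence databases usually report 1-based coordinates, so add one when comparing against them.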


Summary

• Robust tools and services based on the NCBI data model

• Flexible licensing

Future Prospects
• BIND/SeqHound Web Services (SOAP)
• SeqHound
– Web Interface
– InterPro|COG
• Larger & more sophisticated BIND (JAVA)
• Grid Engine & Cell Simulation


Christopher W.V. Hogue Lab Projects/Graduate Students

• BIND
– Gary Bader
– Doron Betel

• SeqHound
– Katerina Michalickova

• Protein Folding/CASP Predictions
– Howard Feldman

• Species Specific Protein Scoring Functions
– Michel Dumontier

• Cell Simulation/Systems Biology
– Adrian Heilbut
– Ken Lau

• FPGA Hardware Database Search Engines
– Ruth Isserlin


BIND = “Blueprint Initiative”

• Database Curation
– Vicki Lay
– Susan Moore
– Brigitte Tuekam
– Cheryl Wolting

• Software Engineering
– Neil Bahroos
– Ian Donaldson
– Marc Dumontier
– Vladimir Grytsan
– Hao Lieu
– Greg Pintile
– John Salama

• Administration
– Eric Andrade
– Marianne Rukavina
– Sue Sroka
– Greg Van Volkenburg

• IT
– Greg Clark
– Edward Lee