Creating a new language to support open innovation
Presentation given on 19 August 2013 at a BioBriefings meeting of the BioMelbourne Network (http://www.biomelbourne.org/events/view/289) in Melbourne, Australia.
Michael Hucka, Ph.D.
Department of Computing + Mathematical Sciences
California Institute of Technology
Pasadena, CA, USA
BioBriefing – BioMelbourne Network, Australia, August 2013
Email: [email protected] Twitter: @mhucka
Outline
Background and introduction
The Systems Biology Markup Language (SBML)
Complementary efforts: MIRIAM and SED-ML
COMBINE: the Computational Modeling in Biology Network
Conclusion
Background and introduction
Research today: experimentation, computation, cogitation
“The nature of systems biology”, Bruggeman & Westerhoff, Trends Microbiol. 15 (2007).
Large-scale integrative models are growing
Many models have traditionally been published this way
Problems:
• Errors in printing
• Missing information
• Dependencies on implementation
• Outright errors
• Can be a huge effort to recreate
Is it enough to communicate the model in a paper?
Experiences from BioModels Database
BioModels Database:
• Public database of published computational models in biology
• Many models are curated – i.e., made to work & annotated
- If a model is not available in electronic form, the curators encode it from the paper
Their experiences?
• Vast majority of models encoded directly from the publication did not work as published
- Often (not always) due to common errors – typos, omissions
• Success rate improved in recent years thanks to more people providing their models in electronic formats
More is needed to make computational results reproducible
http://biomodels.net/biomodels
Is it enough to make your (software X) code available?
It’s vital for good science:
• Someone with access to the same software can try to run it, understand it, verify the computational results, build on them, etc.
• Opinion: you should always do this in any case
But it’s still not ideal for communication of scientific results:
• What if they don’t have access to the same software?
• What if they don’t want to use that software?
• What if they want to use a different conceptual framework?
• And how will people be able to relate the model to other work?
Different tools ⇒ different interfaces & languages
The Systems Biology Markup Language (SBML)
SBML: a lingua franca for software
Format for representing computational models of biological processes
• Data structures + usage principles + serialization to XML
• (Mostly) Declarative, not procedural—not a scripting language
Neutral with respect to modeling framework
• E.g., ODE, stochastic systems, etc.
Important: software reads/writes SBML, not humans
<Beginning of SBML model definition>
List of function definitions
List of unit definitions
List of compartments
List of molecular species
List of parameters
List of rules
List of reactions
List of events
<End of SBML model definition>
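The "list of ..." layout above can be sketched in a few lines of Python using only the standard library. This is an illustrative skeleton, not valid SBML: the container element names and the namespace are taken from SBML Level 3 Version 1 Core as best understood here, the model id is hypothetical, and a real model would include only the containers it actually fills (empty containers are not allowed by the specification). In practice a library such as libSBML would do this work.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 3 Version 1 Core (assumed target version).
SBML_NS = "http://www.sbml.org/sbml/level3/version1/core"

# Container names, in the order the slide lists them.
CONTAINERS = [
    "listOfFunctionDefinitions",
    "listOfUnitDefinitions",
    "listOfCompartments",
    "listOfSpecies",
    "listOfParameters",
    "listOfRules",
    "listOfReactions",
    "listOfEvents",
]


def build_skeleton(model_id):
    """Return an <sbml> element whose model holds one empty container per list."""
    ET.register_namespace("", SBML_NS)
    sbml = ET.Element(f"{{{SBML_NS}}}sbml", {"level": "3", "version": "1"})
    model = ET.SubElement(sbml, f"{{{SBML_NS}}}model", {"id": model_id})
    for name in CONTAINERS:
        ET.SubElement(model, f"{{{SBML_NS}}}{name}")
    return sbml


def container_names(xml_text):
    """Read the XML back and report the containers found inside <model>."""
    root = ET.fromstring(xml_text)
    model = root.find(f"{{{SBML_NS}}}model")
    return [child.tag.split("}", 1)[1] for child in model]


# "example_model" is a hypothetical id used only for illustration.
skeleton_xml = ET.tostring(build_skeleton("example_model"), encoding="unicode")
names = container_names(skeleton_xml)
print(names)
```

The round trip (write, then read back) mirrors the point of the slide: the format is meant to be produced and consumed by software.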
SBML = Systems Biology Markup Language
The raw SBML
Core SBML concepts are fairly simple
The process is central
• Literally called a “reaction” in SBML
• Participants are pools of entities (biochemical species)
Models can further include:
• Compartments
• Other constants & variables
• Discontinuous events
• Other, explicit math
• Unit definitions
• Annotations
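Because a reaction is a declarative description of a process acting on pools of entities, software is free to interpret the same description under different modeling frameworks. The sketch below is a toy illustration of that idea and not how any particular SBML tool works: the dictionary layout, the rate constant, and the initial amounts are all hypothetical. It reads one reaction, A -> B with rate k * A, first as an ordinary differential equation (forward Euler) and then as a stochastic process (Gillespie's direct method with a single reaction channel).

```python
import random

# Hypothetical declarative description of one process: A -> B at rate k * A.
reaction = {"reactant": "A", "product": "B", "k": 0.5}
initial = {"A": 100, "B": 0}


def simulate_ode(reaction, initial, t_end, dt=0.001):
    """Deterministic reading: forward-Euler integration of dA/dt = -k * A."""
    state = {name: float(value) for name, value in initial.items()}
    for _ in range(int(round(t_end / dt))):
        flux = reaction["k"] * state[reaction["reactant"]] * dt
        state[reaction["reactant"]] -= flux
        state[reaction["product"]] += flux
    return state


def simulate_stochastic(reaction, initial, t_end, seed=0):
    """Stochastic reading: Gillespie's direct method, one reaction channel."""
    rng = random.Random(seed)
    state = dict(initial)
    t = 0.0
    while state[reaction["reactant"]] > 0:
        propensity = reaction["k"] * state[reaction["reactant"]]
        t += rng.expovariate(propensity)  # waiting time to the next event
        if t > t_end:
            break
        state[reaction["reactant"]] -= 1  # one molecule of A is converted
        state[reaction["product"]] += 1   # into one molecule of B
    return state


ode_state = simulate_ode(reaction, initial, t_end=2.0)
ssa_state = simulate_stochastic(reaction, initial, t_end=2.0)
print("ODE:", ode_state)
print("Stochastic:", ssa_state)
```

Both functions consume the same `reaction` description; only the interpretation differs, which is the sense in which a declarative format can stay neutral about the framework.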
SBML is now widely used
Dozens of journals accept models in SBML format
100’s of software tools available today
1000’s of models available in SBML format today
[Chart over 2001–2012, vertical axis 0–300; 254+ today]
Contents of BioModels Database
Contents today:
• 142,000+ pathway models (converted from KEGG)
• 460+ hand-curated quantitative models
• 460+ non-curated quantitative models
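[Pie chart of database contents by biological process. Categories: signal transduction, metabolic process, multicellular organismal process, rhythmic process, cell cycle, homeostatic process, response to stimulus, cell death, localization, others (e.g., developmental process). Slice labels as extracted, not matched to categories: 8%, 2%, 3%, 6%, 6%, 7%, 8%, 9%, 24%, 27%]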
Database data from 2013
Free software libraries – libSBML
Reads, writes, validates SBML
Can check & convert units
Written in portable C++
Runs on Linux, Mac, Windows
APIs for C, C++, C#, Java, Octave, Perl, Python, R, Ruby, MATLAB
Well documented API
Open-source (LGPL)
http://sbml.org/Software/libSBML
Free software libraries – JSBML
Pure Java implementation
API is compatible with libSBML but more Java-like
Functionality is subset of libSBML
Open source (LGPL)
http://sbml.org/Software/JSBML
Evolution of SBML continues
Today: SBML Level 3
• Level 3 Core provides framework for common models
• Level 3 packages add additional constructs to the Core
Level 3 package | What it enables | Status
Hierarchical model composition | Models containing submodels | ✔
Flux balance constraints | Constraint-based models | ✔
Qualitative models | Petri net models, Boolean models | ✔
Graph layout | Diagrams of models | ✔
Multicomponent/state species | Entities w/ structure; also rule-based models | draft
Spatial | Nonhomogeneous spatial models | draft
Graph rendering | Diagrams of models | draft
Groups | Arbitrary grouping of components | draft
Distributions | Numerical values as statistical distributions | in dev
Arrays & sets | Arrays or sets of entities | in dev
Dynamic structures | Creation & destruction of components | in dev
Annotations | Richer annotation syntax |
National Institute of General Medical Sciences (USA)
European Molecular Biology Laboratory (EMBL)
JST ERATO Kitano Symbiotic Systems Project (Japan) (to 2003)
JST ERATO-SORST Program (Japan)
ELIXIR (UK)
Beckman Institute, Caltech (USA)
Keio University (Japan)
International Joint Research Program of NEDO (Japan)
Japanese Ministry of Agriculture
Japanese Ministry of Educ., Culture, Sports, Science and Tech.
BBSRC (UK)
National Science Foundation (USA)
DARPA IPTO Bio-SPICE Bio-Computation Program (USA)
Air Force Office of Scientific Research (USA)
STRI, University of Hertfordshire (UK)
Molecular Sciences Institute (USA)
SBML funding sources over the past 13+ years
Complementary efforts: MIRIAM and SED-ML
COMBINE efforts cover different facets of modeling
[Diagram with three axes]
• Model representation level: mathematical semantics, biological semantics, visual interpretation
• Model type: discrete stochastic entities, continuous lumped parameter, state transition, mean field approximation
• Model life-cycle: model creation, model annotation, model analysis, numerical results
...
Concept due to Nicolas Le Novère
Modelers want to use their own conventions
Low info content
No standard identifiers
Raw models alone are insufficient
Need standard schemes for machine-readable annotations
• Identify entities
• Mathematical semantics
• Links to other data resources
• Authorship & pub. info
Addresses 2 general areas of annotation needs:
MIRIAM is not specific to SBML
MIRIAM (Minimum Information Requested In the Annotation of Models)
Requirements for reference correspondence
Scheme for encoding annotations
Annotations for attributing model creators & sources
Annotations for referring to external data resources
Example of a problem that can be solved with annotations
http://www.ebi.ac.uk/chebi
Low info content
Known by different names – do you want to write all of them into your model?
salicylic acid
MIRIAM annotations for external references
Goal: link model constituents to corresponding entities in bioinformatics resources (e.g., databases, controlled vocabularies)
• Supports:
- Precise identification of model constituents
- Discovery of models that concern the same thing
- Comparison of model constituents between different models
MIRIAM approach avoids putting data content directly in the model
• Instead, it points at external resources that contain the data
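As a rough sketch of what "pointing at external resources" can look like when embedded in a model, the Python below builds an RDF annotation in the style used for MIRIAM annotations in SBML. Several things here are assumptions rather than statements from the talk: the `bqbiol:is` qualifier and its namespace, the hypothetical metaid `meta_salicylic_acid`, and the use of `CHEBI:16914` as the ChEBI entry for salicylic acid, which should be confirmed against ChEBI before reuse. The output is not validated against the SBML specification.

```python
import xml.etree.ElementTree as ET

# Namespaces assumed for RDF and the BioModels.net biology qualifiers.
RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
BQBIOL_NS = "http://biomodels.net/biology-qualifiers/"


def miriam_annotation(meta_id, resource_uris, qualifier="is"):
    """Build an <annotation> that links a model element to external resources.

    The model itself stores only URIs; the data stay in the external resource.
    """
    ET.register_namespace("rdf", RDF_NS)
    ET.register_namespace("bqbiol", BQBIOL_NS)
    annotation = ET.Element("annotation")
    rdf = ET.SubElement(annotation, f"{{{RDF_NS}}}RDF")
    description = ET.SubElement(
        rdf, f"{{{RDF_NS}}}Description", {f"{{{RDF_NS}}}about": f"#{meta_id}"}
    )
    relation = ET.SubElement(description, f"{{{BQBIOL_NS}}}{qualifier}")
    bag = ET.SubElement(relation, f"{{{RDF_NS}}}Bag")
    for uri in resource_uris:
        ET.SubElement(bag, f"{{{RDF_NS}}}li", {f"{{{RDF_NS}}}resource": uri})
    return annotation


# Hypothetical metaid; the ChEBI identifier is an assumption to be verified.
annotation = miriam_annotation(
    "meta_salicylic_acid", ["http://identifiers.org/chebi/CHEBI:16914"]
)
annotation_xml = ET.tostring(annotation, encoding="unicode")
print(annotation_xml)
```

Two models that annotate a species with the same URI can then be matched by software even if the modelers chose different names for it.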
How do we create globally unique identifiers consistently?
Long story short: developed by the Le Novère group at the EBI
• Resource identifiers (URIs) combine 2 parts:
- namespace: identifies a dataset
- entity identifier: identifies a datum within the dataset
• There’s a registry for namespaces: MIRIAM Registry
- Allows people & software to use same namespace identifiers
• There’s a URI resolution service: MIRIAM Resources & identifiers.org
- Allows people & software to take a given identifier and figure out what it points to
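The two-part scheme can be illustrated with a small Python sketch. The URL layout (`http://identifiers.org/<namespace>/<entity identifier>`) and the namespace string `biomodels.db` are assumptions based on how identifiers.org URIs were commonly written around the time of this talk; the entity identifier is the BioModels Database model mentioned later in these slides. The function names are hypothetical helpers, not part of any library.

```python
# Base of the resolution service (assumed URL layout).
IDENTIFIERS_ORG = "http://identifiers.org"


def make_uri(namespace, entity_id):
    """Combine a dataset namespace and an entity identifier into one URI."""
    return f"{IDENTIFIERS_ORG}/{namespace}/{entity_id}"


def split_uri(uri):
    """Recover (namespace, entity identifier) from a URI built by make_uri."""
    prefix = IDENTIFIERS_ORG + "/"
    if not uri.startswith(prefix):
        raise ValueError(f"not an identifiers.org URI: {uri}")
    namespace, _, entity_id = uri[len(prefix):].partition("/")
    if not namespace or not entity_id:
        raise ValueError(f"missing namespace or entity identifier: {uri}")
    return namespace, entity_id


# "biomodels.db" is assumed to be the registered namespace for BioModels Database.
model_uri = make_uri("biomodels.db", "BIOMD0000000319")
print(model_uri)
print(split_uri(model_uri))
```

Keeping the namespace in a shared registry is what lets different tools produce the same string for the same thing.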
Another problem: software can’t read figure legends
?
BIOMD0000000319 in BioModels Database
Decroly & Goldbeter, PNAS, 1982
SED-ML = Simulation Experiment Description Markup Language
Application-independent format
• Captures procedures, algorithms, parameter values
Can be used for
• Simulation experiments encoding parametrizations & perturbations
• Simulations using more than one model and/or method
• Data manipulations to produce plot(s)
http://sedml.org
Simulation
Model
Task
Data generators
Reports
Efforts like SED-ML improve reproducibility of publications
Waltemath et al., BMC Sys Bio 5, 2011.
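To make the Model / Simulation / Task parts of the diagram concrete, here is a minimal Python sketch that assembles a SED-ML-like document. It is a sketch under stated assumptions and has not been validated against the SED-ML schema: the element and attribute names follow SED-ML Level 1 Version 1 as best recalled, the ids and time settings are hypothetical, `KISAO:0000019` is assumed to denote the CVODE solver, and the model source reuses the BioModels Database entry named on the earlier slide. Data generators and reports are left out because they require MathML.

```python
import xml.etree.ElementTree as ET

# Namespace assumed for SED-ML Level 1 Version 1.
SEDML_NS = "http://sed-ml.org/"


def q(name):
    """Qualify an element name with the SED-ML namespace."""
    return f"{{{SEDML_NS}}}{name}"


def build_experiment(model_source):
    """Describe one task: run one simulation setup on one model."""
    ET.register_namespace("", SEDML_NS)
    root = ET.Element(q("sedML"), {"level": "1", "version": "1"})

    # Simulation: what procedure to run (hypothetical time-course settings).
    simulations = ET.SubElement(root, q("listOfSimulations"))
    simulation = ET.SubElement(simulations, q("uniformTimeCourse"), {
        "id": "sim1", "initialTime": "0", "outputStartTime": "0",
        "outputEndTime": "100", "numberOfPoints": "1000",
    })
    ET.SubElement(simulation, q("algorithm"), {"kisaoID": "KISAO:0000019"})

    # Model: which model to use, referenced rather than copied.
    models = ET.SubElement(root, q("listOfModels"))
    ET.SubElement(models, q("model"), {
        "id": "model1", "language": "urn:sedml:language:sbml",
        "source": model_source,
    })

    # Task: binds the simulation to the model.
    tasks = ET.SubElement(root, q("listOfTasks"))
    ET.SubElement(tasks, q("task"), {
        "id": "task1", "modelReference": "model1", "simulationReference": "sim1",
    })
    return root


experiment = build_experiment("urn:miriam:biomodels.db:BIOMD0000000319")
experiment_xml = ET.tostring(experiment, encoding="unicode")
print(experiment_xml)
```

The point of the separation is that the same model can be reused across tasks with different simulation settings, and the whole description travels independently of any one simulator.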
COMBINE: the Computational Modeling in Biology Network
Need interoperable formats, but developing them is not easy
Need people with a diverse set of knowledge & skills
• Scientific needs
• Technical implementation skills
• Practical experience
Need to manage multiple phases of a standardization effort
• Creation
• Evolution
• Support
This is just for the specification of the standards, to say nothing of the necessary software and other infrastructure!
Realizations about the state of affairs in late-2000’s
• Many standardization efforts overlapped, but lacked coordination
• Efforts were inventing their own processes from scratch
• Many individual meetings meant more travel for many people
• Limited and fragile funding didn’t support solid, coherent base
COMBINE = Computational Modeling in Biology Network
• Coordinate standards development
• Develop common procedures & tools (but not impose them!)
• Coordinate meetings
• Provide a recognized voice
Motivations for the creation of COMBINE
Standardization efforts represented in COMBINE today
BioPAX
Qualifiers
GPML
COMBINE Standards
Associated Standardization Efforts
Related Standardization Efforts
Those are the products of successful, open collaborations!
Examples of community organization
Two main annual meetings, plus ad hoc workshops
• COMBINE meeting: status updates, presentations, outreach
- Next COMBINE: Paris, Sep 16–20, 2013
• HARMONY: Hackathon on Resources for Modeling in Biology
- Software development, interoperability hacking
COMBINE 2012, Toronto
COMBINE 2011, Heidelberg
What motivates people to do this?
Solving a problem for yourself/your closed group is easier and quicker
• So what are these people getting out of it?
Some advantages of an open, community-oriented approach:
• Finding better solutions they wouldn’t find alone
- Arguments (discussions) lead to realizations & better solutions
• Contributions to science – publications, peer recognition
• Support of a standard makes their software more desirable
• Sense of community involvement
Some admitted disadvantages:
• Agreement takes time – progress can be very slow
• Solutions may include features you didn’t plan on, or need
COMBINE is open to all—and COMBINE needs you!
http://co.mbine.org
Current coordinators:
• Nicolas Le Novère, Mike Hucka, Falk Schreiber, Gary Bader
Conclusion
Time it well
• Too early and too late are bad
Start with actual stakeholders
• Address real needs, not perceived ones
Start with small team of dedicated developers
• Can work faster, more focused; also avoids “designed-by-committee”
Engage people constantly, in many ways
• Electronic forums, email, electronic voting, surveys, hackathons
Make the results free and open-source
• Makes people comfortable knowing it will always be available
Be creative about seeking funding
Some things we (maybe?) got right with SBML
Not waiting for implementations before freezing specifications
• Sometimes finalized specification before implementations tested it
- Especially bad when we failed to do a good job
‣ E.g., “forward thinking” features, or “elegant” designs
Not formalizing the development process sufficiently
• Especially early in the history, did not have a very open process
Not resolving intellectual property issues from the beginning
• Industrial users ask “who has the right to give any rights to this?”
Some things we certainly got wrong
This work was made possible thanks to a great community
Original PI’s: John C. Doyle, Hiroaki Kitano
SBML Team: Mike Hucka, Sarah Keating, Frank Bergmann, Lucian Smith, Andrew Finney, Herbert Sauro, Hamid Bolouri, Ben Bornstein, Bruce Shapiro, Akira Funahashi, Akiya Juraku, Ben Kovitz
SBML Editors: Mike Hucka, Nicolas Le Novère, Sarah Keating, Frank Bergmann, Lucian Smith, Chris Myers, Stefan Hoops, Sven Sahle, James Schaff, Darren Wilkinson
BioModels DB: Nicolas Le Novère, Henning Hermjakob, Camille Laibe, Chen Li, Lukas Endler, Nico Rodriguez, Marco Donizelli, Viji Chelliah, Mélanie Courtot, Harish Dharuri
And a huge thanks to many others in the COMBINE community
[Photo: Attendees at SBML 10th Anniversary Symposium, Edinburgh, 2010]
SBML http://sbml.org
BioModels Database http://biomodels.net/biomodels
MIRIAM http://biomodels.net/miriam
identifiers.org http://identifiers.org
SED-ML http://biomodels.net/sed-ml
SBO http://biomodels.net/sbo
SBGN http://sbgn.org
COMBINE http://co.mbine.org
URLs
I’d like your feedback!
You can use this anonymous form:
http://tinyurl.com/mhuckafeedback