Creating a new language to support open innovation

Michael Hucka, Ph.D.

Department of Computing + Mathematical Sciences

California Institute of Technology

Pasadena, CA, USA

BioBriefing – BioMelbourne Network, Australia, August 2013

Email: [email protected] Twitter: @mhucka

http://cshlccb.wordpress.com/2012/08/03/michael-hucka-california-institute-of-technology-the-systems-biology-markup-language-sbml-model-databases-and-translation/





Outline

Background and introduction

The Systems Biology Markup Language (SBML)

Complementary efforts: MIRIAM and SED-ML

COMBINE: the Computational Modeling in Biology Network

Conclusion

Research today: experimentation, computation, cogitation

“The nature of systems biology,” Bruggeman & Westerhoff, Trends Microbiol. 15 (2007).

Large-scale integrative models are growing

Many models have traditionally been published this way

Problems:

• Errors in printing

• Missing information

• Dependencies on implementation

• Outright errors

• Can be a huge effort to recreate

Is it enough to communicate the model in a paper?

Experiences from BioModels Database

BioModels Database:

• Public database of published computational models in biology

• Many models are curated – i.e., made to work & annotated

- If not available in electronic form, they encode it from the paper

Their experiences?

• Vast majority of models encoded directly from the publication did not work as published

- Often (not always) due to common errors – typos, omissions

• Success rate improved in recent years thanks to more people providing their models in electronic formats

More is needed to make computational results reproducible





http://biomodels.net/biomodels





Is it enough to make your (software X) code available?

It’s vital for good science:

• Someone with access to the same software can try to run it, understand it, verify the computational results, build on them, etc.

• Opinion: you should always do this in any case


But it’s still not ideal for communication of scientific results:

• What if they don’t have access to the same software?

• What if they don’t want to use that software?

• What if they want to use a different conceptual framework?

• And how will people be able to relate the model to other work?

Different tools ⇒ different interfaces & languages


SBML: a lingua franca for software

Format for representing computational models of biological processes

• Data structures + usage principles + serialization to XML

• (Mostly) Declarative, not procedural—not a scripting language

Neutral with respect to modeling framework

• E.g., ODE, stochastic systems, etc.

Important: software reads/writes SBML, not humans

<Beginning of SBML model definition>

List of function definitions

List of unit definitions

List of compartments

List of molecular species

List of parameters

List of rules

List of reactions

List of events

<End of SBML model definition>
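The list structure above can be sketched in a few lines of Python. This is only an illustration built with the standard library's xml.etree module, not libSBML. The element names and namespace follow SBML Level 3 Version 1 Core naming; the model id is hypothetical. The output is a bare skeleton and would not pass SBML validation as written, since real models need populated lists and further attributes.

```python
import xml.etree.ElementTree as ET

# Namespace of SBML Level 3 Version 1 Core.
SBML_NS = "http://www.sbml.org/sbml/level3/version1/core"

# Top-level lists, in the order shown on the slide.
LIST_NAMES = [
    "listOfFunctionDefinitions",
    "listOfUnitDefinitions",
    "listOfCompartments",
    "listOfSpecies",
    "listOfParameters",
    "listOfRules",
    "listOfReactions",
    "listOfEvents",
]


def build_skeleton(model_id):
    """Return an <sbml> element holding one model with empty top-level lists."""
    sbml = ET.Element("sbml", {"xmlns": SBML_NS, "level": "3", "version": "1"})
    model = ET.SubElement(sbml, "model", {"id": model_id})
    for name in LIST_NAMES:
        ET.SubElement(model, name)  # left empty: illustration only
    return sbml


skeleton = build_skeleton("example_model")  # hypothetical model id
print(ET.tostring(skeleton, encoding="unicode"))
```

In practice, software would use a library such as libSBML or JSBML to read and write these structures rather than assembling XML by hand.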

SBML = Systems Biology Markup Language

The raw SBML

Core SBML concepts are fairly simple

The process is central

• Literally called a “reaction” in SBML

• Participants are pools of entities (biochemical species)

Models can further include:

• Compartments

• Other constants & variables

• Discontinuous events

• Other, explicit math

• Unit definitions

• Annotations
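To make the "reaction" concept concrete, here is a minimal Python sketch of reactions acting on species pools. All names are hypothetical and nothing here is SBML syntax or a libSBML call. It assumes mass-action kinetics and an explicit Euler step, which is just one way to interpret a reaction network; a stochastic simulator could consume the same reaction description.

```python
from dataclasses import dataclass


@dataclass
class Reaction:
    """A process that consumes reactant pools and produces product pools."""
    reactants: dict        # species name -> stoichiometry
    products: dict         # species name -> stoichiometry
    rate_constant: float   # mass-action constant (assumed kinetic law)


def rate(reaction, amounts):
    """Mass-action rate: k times the product of reactant amounts."""
    value = reaction.rate_constant
    for name, stoich in reaction.reactants.items():
        value *= amounts[name] ** stoich
    return value


def euler_step(reactions, amounts, dt):
    """Advance all species pools by one explicit Euler step of size dt."""
    change = {name: 0.0 for name in amounts}
    for r in reactions:
        v = rate(r, amounts)
        for name, stoich in r.reactants.items():
            change[name] -= stoich * v
        for name, stoich in r.products.items():
            change[name] += stoich * v
    return {name: amounts[name] + dt * change[name] for name in amounts}


# Hypothetical one-reaction model: A -> B with k = 0.5
reactions = [Reaction(reactants={"A": 1}, products={"B": 1}, rate_constant=0.5)]
amounts = {"A": 10.0, "B": 0.0}
for _ in range(100):  # integrate from t = 0 to t = 1
    amounts = euler_step(reactions, amounts, dt=0.01)
print(amounts)
```

The reaction list is pure description; the choice of simulation method lives entirely in the code that interprets it, which is the sense in which the format stays neutral about the modeling framework.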

SBML is now widely used

Dozens of journals accept models in SBML format

Hundreds of software tools available today

Thousands of models available in SBML format today

[Figure: chart by year, 2001–2012, vertical axis 0–300; 254+ today]

Contents of BioModels Database

Contents today:

• 142,000+ pathway models (converted from KEGG)

• 460+ hand-curated quantitative models

• 460+ non-curated quantitative models

[Figure: pie chart of models by category: signal transduction, metabolic process, multicellular organismal process, rhythmic process, cell cycle, homeostatic process, response to stimulus, cell death, localization, others (e.g., developmental process); slices range from 2% to 27%]

Database data from 2013

Free software libraries – libSBML

Reads, writes, validates SBML

Can check & convert units

Written in portable C++

Runs on Linux, Mac, Windows

APIs for C, C++, C#, Java, Octave, Perl, Python, R, Ruby, MATLAB

Well documented API

Open-source (LGPL)

http://sbml.org/Software/libSBML





Free software libraries – JSBML

Pure Java implementation

API is compatible with libSBML but more Java-like

Functionality is a subset of libSBML

Open source (LGPL)

http://sbml.org/Software/JSBML





Evolution of SBML continues

Today: SBML Level 3

• Level 3 Core provides framework for common models

• Level 3 packages add additional constructs to the Core

Level 3 package                 What it enables                                 Status

Hierarchical model composition  Models containing submodels                     ✔

Flux balance constraints        Constraint-based models                         ✔

Qualitative models              Petri net models, Boolean models                ✔

Graph layout                    Diagrams of models                              ✔

Multicomponent/state species    Entities w/ structure; also rule-based models   draft

Spatial                         Nonhomogeneous spatial models                   draft

Graph rendering                 Diagrams of models                              draft

Groups                          Arbitrary grouping of components                draft

Distributions                   Numerical values as statistical distributions   in dev

Arrays & sets                   Arrays or sets of entities                      in dev

Dynamic structures              Creation & destruction of components            in dev

Annotations                     Richer annotation syntax

SBML funding sources over the past 13+ years

• National Institute of General Medical Sciences (USA)

• European Molecular Biology Laboratory (EMBL)

• JST ERATO Kitano Symbiotic Systems Project (Japan) (to 2003)

• JST ERATO-SORST Program (Japan)

• ELIXIR (UK)

• Beckman Institute, Caltech (USA)

• Keio University (Japan)

• International Joint Research Program of NEDO (Japan)

• Japanese Ministry of Agriculture

• Japanese Ministry of Educ., Culture, Sports, Science and Tech.

• BBSRC (UK)

• National Science Foundation (USA)

• DARPA IPTO Bio-SPICE Bio-Computation Program (USA)

• Air Force Office of Scientific Research (USA)

• STRI, University of Hertfordshire (UK)

• Molecular Sciences Institute (USA)


COMBINE efforts cover different facets of modeling

[Figure: facets of modeling along three axes. Model representation level: mathematical semantics, biological semantics, visual interpretation. Model type: discrete stochastic entities, continuous lumped parameter, state transition, mean field approximation, ... Model life-cycle: model creation, model annotation, model analysis, numerical results. Concept due to Nicolas Le Novère.]

Modelers want to use their own conventions


No standard identifiers


Low info content


Raw models alone are insufficient

Need standard schemes for machine-readable annotations

• Identify entities

• Mathematical semantics

• Links to other data resources

• Authorship & pub. info




MIRIAM (Minimum Information Requested In the Annotation of Models)

Addresses 2 general areas of annotation needs:

• Requirements for reference correspondence

• Scheme for encoding annotations

- Annotations for attributing model creators & sources

- Annotations for referring to external data resources

MIRIAM is not specific to SBML


Example of a problem that can be solved with annotations

http://www.ebi.ac.uk/chebi







Known by different names – do you want to write all of them into your model?

salicylic acid

MIRIAM annotations for external references

Goal: link model constituents to corresponding entities in bioinformatics resources (e.g., databases, controlled vocabularies)

• Supports:

- Precise identification of model constituents

- Discovery of models that concern the same thing

- Comparison of model constituents between different models

MIRIAM approach avoids putting data content directly in the model

• Instead, it points at external resources that contain the data

How do we create globally unique identifiers consistently?

Long story short: developed by the Le Novère group at the EBI

• Resource identifiers (URIs) combine 2 parts:

- Namespace: identifies a dataset

- Entity identifier: identifies a datum within the dataset

• There’s a registry for namespaces: MIRIAM Registry

- Allows people & software to use same namespace identifiers

• There’s a URI resolution service: MIRIAM Resources & identifiers.org

- Allows people & software to take a given identifier and figure out what it points to
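The two-part scheme can be illustrated with a short Python sketch. It assumes the URI layout base/namespace/entity-identifier for identifiers.org; the helper names are hypothetical, and the entity identifier used below is a placeholder rather than a real database record.

```python
from urllib.parse import urlparse

# Resolver base; the layout <base>/<namespace>/<entity id> is an assumption.
IDENTIFIERS_ORG = "http://identifiers.org"


def make_uri(namespace, entity_id):
    """Combine a namespace (dataset) and an entity identifier (datum)."""
    return f"{IDENTIFIERS_ORG}/{namespace}/{entity_id}"


def split_uri(uri):
    """Recover (namespace, entity identifier) from a URI built by make_uri."""
    path = urlparse(uri).path.lstrip("/")
    namespace, _, entity_id = path.partition("/")
    if not namespace or not entity_id:
        raise ValueError(f"not a two-part resource URI: {uri}")
    return namespace, entity_id


# "chebi" names the dataset; "CHEBI:00000" is a placeholder, not a real entry.
uri = make_uri("chebi", "CHEBI:00000")
print(uri)
print(split_uri(uri))
```

A model annotation then only needs to store the URI; names, synonyms, and other data stay in the external resource.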

Another problem: software can’t read figure legends

?

BIOMD0000000319 in BioModels Database

Decroly & Goldbeter, PNAS, 1982

SED-ML = Simulation Experiment Description ML

Application-independent format

• Captures procedures, algorithms, parameter values

Can be used for

• Simulation experiments encoding parametrizations & perturbations

• Simulations using more than one model and/or method

• Data manipulations to produce plot(s)

http://sedml.org

[Figure: main parts of a SED-ML description: Model, Simulation, Task, Data generators, Reports]
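How the five parts refer to one another can be sketched with plain Python data classes. This is not the SED-ML schema or any SED-ML library: the class and field names are illustrative assumptions, and "X" is a made-up variable name. The model source reuses the BioModels identifier mentioned earlier.

```python
from dataclasses import dataclass, field


@dataclass
class Model:
    id: str
    source: str  # reference to the model file or database entry


@dataclass
class Simulation:
    id: str
    start_time: float
    end_time: float
    number_of_points: int


@dataclass
class Task:
    id: str
    model_id: str       # which model to run
    simulation_id: str  # which simulation setup to apply


@dataclass
class DataGenerator:
    id: str
    task_id: str   # task whose results are post-processed
    variable: str  # name of the model variable to extract


@dataclass
class Report:
    id: str
    data_generator_ids: list


@dataclass
class Experiment:
    models: list = field(default_factory=list)
    simulations: list = field(default_factory=list)
    tasks: list = field(default_factory=list)
    data_generators: list = field(default_factory=list)
    reports: list = field(default_factory=list)


def dangling_references(exp):
    """Return (owner id, missing id) pairs for broken cross-references."""
    model_ids = {m.id for m in exp.models}
    sim_ids = {s.id for s in exp.simulations}
    task_ids = {t.id for t in exp.tasks}
    gen_ids = {g.id for g in exp.data_generators}
    missing = []
    for t in exp.tasks:
        if t.model_id not in model_ids:
            missing.append((t.id, t.model_id))
        if t.simulation_id not in sim_ids:
            missing.append((t.id, t.simulation_id))
    for g in exp.data_generators:
        if g.task_id not in task_ids:
            missing.append((g.id, g.task_id))
    for r in exp.reports:
        for gid in r.data_generator_ids:
            if gid not in gen_ids:
                missing.append((r.id, gid))
    return missing


experiment = Experiment(
    models=[Model("model1", "BIOMD0000000319")],
    simulations=[Simulation("sim1", 0.0, 100.0, 1000)],
    tasks=[Task("task1", "model1", "sim1")],
    data_generators=[DataGenerator("gen1", "task1", "X")],  # "X" is made up
    reports=[Report("report1", ["gen1"])],
)
print(dangling_references(experiment))
```

Keeping the task separate from both the model and the simulation setup is what lets one description cover several models or methods.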



Efforts like SED-ML improve reproducibility of publications

Waltemath et al., BMC Sys Bio 5, 2011.


Need interoperable formats, but developing them is not easy

Need people with a diverse set of knowledge & skills

• Scientific needs

• Technical implementation skills

• Practical experience

Need to manage multiple phases of a standardization effort

• Creation

• Evolution

• Support

This is just for the specification of the standards, to say nothing of the necessary software and other infrastructure!

Motivations for the creation of COMBINE

Realizations about the state of affairs in the late 2000s:

• Many standardization efforts overlapped, but lacked coordination

• Efforts were inventing their own processes from scratch

• Many individual meetings meant more travel for many people

• Limited and fragile funding didn’t support solid, coherent base

COMBINE = Computational Modeling in Biology Network

• Coordinate standards development

• Develop common procedures & tools (but not impose them!)

• Coordinate meetings

• Provide a recognized voice

Standardization efforts represented in COMBINE today

BioPAX

Qualifiers

GPML

COMBINE Standards

Associated Standardization Efforts

Related Standardization Efforts

Those are the products of successful, open collaborations!

Examples of community organization

Two main annual meetings, plus ad hoc workshops

• COMBINE meeting: status updates, presentations, outreach

- Next COMBINE: Paris, Sep 16–20, 2013

• HARMONY: Hackathon on Resources for Modeling in Biology

- Software development, interoperability hacking

COMBINE 2012, Toronto

COMBINE 2011, Heidelberg

What motivates people to do this?

Solving a problem for yourself/your closed group is easier and quicker

• So what are these people getting out of it?

Some advantages of an open, community-oriented approach:

• Finding better solutions they wouldn’t find alone

- Arguments, or rather discussions, lead to realizations & better solutions

• Contributions to science – publications, peer recognition

• Support of a standard makes their software more desirable

• Sense of community involvement

Some admitted disadvantages:

• Agreement takes time – progress can be very slow

• Solutions may include features you didn’t plan on, or need

COMBINE is open to all—and COMBINE needs you!

http://co.mbine.org

Current coordinators:

• Nicolas Le Novère, Mike Hucka, Falk Schreiber, Gary Bader







Conclusion

Some things we (maybe?) got right with SBML

Time it well

• Too early and too late are bad

Start with actual stakeholders

• Address real needs, not perceived ones

Start with small team of dedicated developers

• Can work faster, more focused; also avoids “designed-by-committee”

Engage people constantly, in many ways

• Electronic forums, email, electronic voting, surveys, hackathons

Make the results free and open-source

• Makes people comfortable knowing it will always be available

Be creative about seeking funding

Some things we certainly got wrong

Not waiting for implementations before freezing specifications

• Sometimes finalized specification before implementations tested it

- Especially bad when we failed to do a good job

‣ E.g., “forward thinking” features, or “elegant” designs

Not formalizing the development process sufficiently

• Especially early in the history, did not have a very open process

Not resolving intellectual property issues from the beginning

• Industrial users ask “who has the right to give any rights to this?”

This work was made possible thanks to a great community

Original PI’s: John C. Doyle, Hiroaki Kitano

SBML Team: Mike Hucka, Sarah Keating, Frank Bergmann, Lucian Smith, Andrew Finney, Herbert Sauro, Hamid Bolouri, Ben Bornstein, Bruce Shapiro, Akira Funahashi, Akiya Juraku, Ben Kovitz

SBML Editors: Mike Hucka, Nicolas Le Novère, Sarah Keating, Frank Bergmann, Lucian Smith, Chris Myers, Stefan Hoops, Sven Sahle, James Schaff, Darren Wilkinson

BioModels DB: Nicolas Le Novère, Henning Hermjakob, Camille Laibe, Chen Li, Lukas Endler, Nico Rodriguez, Marco Donizelli, Viji Chelliah, Mélanie Courtot, Harish Dharuri

And a huge thanks to many others in the COMBINE community

[Photo: attendees at SBML 10th Anniversary Symposium, Edinburgh, 2010]

URLs

SBML: http://sbml.org

BioModels Database: http://biomodels.net/biomodels

MIRIAM: http://biomodels.net/miriam

identifiers.org: http://identifiers.org

SED-ML: http://biomodels.net/sed-ml

SBO: http://biomodels.net/sbo

SBGN: http://sbgn.org

COMBINE: http://co.mbine.org


I’d like your feedback!

You can use this anonymous form:

http://tinyurl.com/mhuckafeedback


