
Science 291(5507): 16 Feb 2001

“We will release the entire consensus human genome sequence freely to researchers…. We will place no restrictions on how scientists can use this data….”

- JC Venter, US House science subcommittee testimony, 2000

“…archival data sets must be deposited with the appropriate data bank…” - Science, instructions to authors

“We are willing to be flexible…. In this domain, change is everywhere” - B. Jasny, D. Kennedy, editors of Science; 2001

“Sharing publication-related data and materials: responsibilities of authorship in the life sciences”

National Academies Press, http://www.nap.edu/catalog/10613.html

[T]he general principle [is] that the publication of scientific information is intended to move science forward....

[T]he act of publishing is a quid pro quo in which authors receive credit and acknowledgement in exchange for disclosure of their scientific findings.

An author’s obligation is not only to release data and materials to enable others to verify or replicate published findings... but also to provide them in a form on which other scientists can build with further research.

All members of the scientific community – whether working in academia, government, or a commercial enterprise – have equal responsibility....

National Research Council commissions study: 2001

Q: Are there community standards?

1665, Philosophical Transactions of the Royal Society; Henry Oldenburg, editor

rejects “amplifications, digressions, and swellings of style bringing all things as near the Mathematical plainness, as they can; and preferring the language of Artizans, Countrymen, and Merchants, before that, of Wits and Scholars.”

“for the relief of those either too indolent or too occupied to read whole books. It is a means of becoming learned with little trouble.”

1665, Journal des sçavans; Denis de Sallo, editor

Principle 1. Authors should include in their publications the data, algorithms, and other information that is central or integral to the publication – that is, whatever is necessary to support the major claims of the paper...

Principle 2. If central or integral information cannot be included in the publication for practical reasons (for example, because a dataset is too large), it should be made freely (without restriction on its use for research purposes, and at no cost) and readily accessible through other means (for example, on line).

Moreover, when necessary to enable future research, integral information should be made available in a form that enables it to be manipulated, analyzed, and combined with other scientific data.

Principle 3. If publicly accessible repositories for data have been agreed on by a community of researchers and are in general use, the relevant data should be deposited in one of these repositories by the time of publication.

Q: Shouldn’t authors have the right to share publication-related data or materials with academic investigators only?

“This view must be rejected as an artificial taxonomy. There should be a single scientific community that operates under a single set of principles.... There is no clear line between ‘for-profit sector’ and ‘academic’ research.”

Bioinformatics’ glass house

• We rely on data access: sequences, microarrays, etc.; and not just to look -- we need to be able to integrate, repackage, and redistribute data to create new databases and tools

• Full-text access is an essential next step: the Googleization of the biomedical literature will be driven by bioinformatics

• But our field is one of the prime offenders in data access. Published software is routinely treated as proprietary or academics-only.

Reasons why bioinformatics software is proprietary

• “Why should I give my work away?”

• “This is the only way I can fund my software development.”

• “My technology transfer office told me I have to license it this way.”

http://www.cu.edu/techtransfer

• Computational biologists should expect published software to be available as open source, enabling one to verify and to build on the code.

• The user community expects to have access to robust research tools.

• Providing robust tools is a commercial activity.

• But unlike instruments or enzymes, software is easy to copy and distribute. Open source, free software contaminates commercial business models.

• As a result, bioinformatics software is in a terrible state; commercialization is essentially failing.

But how should robust software tools be supported?

A part of an answer: HMMER’s dual license strategy

• Open source version is under a viral license, the GNU General Public License (GPL)
  – http://hmmer.wustl.edu/

• Commercial versions available under separate (non-GPL) license
  – http://otm.wustl.edu; WashU Office of Technology Management
  – currently five licensed codebase forks to software companies
  – [and two possible infringers]

• Works because we (WashU) own the copyright; we can license under different licenses if we wish.

• The GPL blocks at exactly the right point: proprietary use by anyone, whether academic or commercial


“Enough. I will no longer respond to any letters on optics.” Isaac Newton, circa 1669;

to Phil. Trans. editor Henry Oldenburg, following letters in response to Newton’s first (only) article

The black hole of user support: HMMER went too far

• It is not feasible for an academic lab to support robust software tools.

• No publications or tenure forthcoming for software maintenance.

• It is crazy to have large pharma/ag bioinformatics groups relying on spare time, late-night, wine-soused efforts of single academics.

• A handoff to commercial-quality software engineering must happen.

The new strategy: Zero support

• Open source software released at point of publication

• Increased crossreferencing between source code and publication: we’re treating the software as supplementary material, not a product.

• No upgrades, no support, no user questions

• Still dual licensed; commercial partnering encouraged

• Software has a weird dynamic that instruments, enzymes, and other research tools don’t have: version 2.0 is built on version 1.0

• A forked commercial version falls rapidly into obsolescence as the next generation algorithms go into the academic code (example: GCG)

• The user faces a choice: do you want the latest, best algorithms? Or do you want user support?

• Somehow, the industrialized tool and the academic testbed need to be in the same code repository.

But Plan Zero won’t work either

• Plan A: A new startup open source company?
  – A model right in town: Object Computing, www.ociweb.com

• Plan B: An institute/center/nonprofit entity?
  – The NIH Road Map provides funds for National Centers for Biomedical Computing

• Plan C: Partner with a broad-based hardware and services company?
  – IBM Life Sciences; www.ibm.com

Dreaming of a world without forks


• 47% of geneticists say they have been denied a request for published data or material

• 28% say denials prevented them from confirming published research

• 12% say they had denied someone else’s request for published material [excuses: too much effort (80%); protecting career of junior staff (64%); protecting their own future work (53%)]

Campbell et al, “Data withholding in academic genetics”, JAMA 287:473 2002
