Page 1: Is that a scientific report or just some cool pictures from the lab? Reproducibility and computational chemistry

Is that a scientific report or just some cool pictures from the lab? Reproducibility and computational chemistry

Gregory Landrum Ph.D.

NIBR IT

Novartis Institutes for BioMedical Research Basel

2013 CADD Gordon Conference, Mount Snow VT 23 July, 2013

Publishing…

Scientific publications have at least two goals: (i) to announce a result and (ii) to convince readers that the result is correct. Mathematics papers are expected to contain a proof complete enough to allow knowledgeable readers to fill in any details. Papers in experimental science should describe the results and provide a clear enough protocol to allow successful repetition and extension.

Mesirov, J. P. Accessible Reproducible Research. Science 327, 415–416 (2010).

Outline

§  Reproducibility?

§  Requirements for reproducibility of published research

§  Practical aspects

Landrum, G. A. & Stiefl, N. Is that a scientific publication or an advertisement? Reproducibility, source code and data in the computational chemistry literature. Future Medicinal Chemistry 4, 1885–1887 (2012).

Reproducibility

http://en.wikipedia.org/wiki/Reproducibility

Reproducibility is the ability of an entire experiment or study to be reproduced, either by the researcher or by someone else working independently. It is one of the main principles of the scientific method.

Reproducibility

An author’s central obligation is to present an accurate and complete account of the research performed, absolutely avoiding deception, including the data collected or used, as well as an objective discussion of the significance of the research. Data are defined as information collected or used in generating research conclusions. The research report and the data collected should contain sufficient detail and reference to public sources of information to permit a trained professional to reproduce the experimental observations.

ACS “Ethical Guidelines to Publication of Chemical Research”

Reproducibility

Experimental reproducibility is the coin of the scientific realm. The extent to which measurements or observations agree when performed by different individuals defines this important tenet of the scientific method. The formal essence of experimental reproducibility was born of the philosophy of logical positivism or logical empiricism, which purports to gain knowledge of the world through the use of formal logic linked to observation. A key principle of logical positivism is verificationism, which holds that every truth is verifiable by experience. In this rational context, truth is defined by reproducible experience, and unbiased scientific observation and determinism are its underpinnings. … The assumption that objectively true scientific observations must be reproducible is implicit, yet direct tests of reproducibility are rarely found in the published literature. This lack of published evidence of reproducibility stems from the limited appeal of studies reproducing earlier work to most funding bodies and to most editors. Furthermore, many readers of scientific journals— especially of higher-impact journals—assume that if a study is of sufficient quality to pass the scrutiny of rigorous reviewers, it must be true; this assumption is based on the inferred equivalence of reproducibility and truth described above.

Loscalzo, J. Irreproducible Experimental Results: Causes, (Mis)interpretations, and Consequences. Circulation 125, 1211–1214 (2012).

If it’s not reproducible science?

“Let me show you some cool pictures from my lab…”

Requirements for Reproducibility

thanks to Martin Stahl for the picture

A great start

(1) Wherever possible, source code should be provided for new computational methods. The source code can be a reference implementation of a method or algorithm and does not need to include a graphical interface. If it is not possible to release the source code for a new method, authors should provide a sufficient justification. Reviewers and editors will then consider this explanation. Any paper that does not comply with the reproducibility guidelines will include this explanation when published. In cases where it is not possible to release code due to intellectual property or other limitations, an executable version of the new method should be readily accessible. Commercial products should provide time limited licenses to facilitate evaluation and comparison of published methods.

(2) Any chemical structures and data mentioned in the paper should be made available in a commonly used (SDF or SMILES) format. Distribution of data in pdf format is not sufficient.

(3) Any publications that employ commercial or open-source software should include scripts or parameter files as well as data files that will enable others to easily reproduce the work.

(4) A clear easy to follow description of any new method should be a key criterion during the review process. Wherever possible, a paper should contain a simple worked example that demonstrates the application of the method. Parameter values and intermediate results for example compounds should be included as part of the supporting material.

(5) Reviewers should put particular emphasis on the reproducibility of the method described in a manuscript. Each reviewer should evaluate the description of the method, as well as the presence of associated code, data, or executables, to ensure that the results can be independently reproduced.

Walters, W. P. Modeling, Informatics, and the Quest for Reproducibility. J. Chem. Inf. Model. (2013). doi:10.1021/ci400197w

Requirements for Reproducibility

§ Data used

§ Code/algorithm description

§ Results

Requirements for Reproducibility: Data

As a condition of publication, authors must agree to make available all data necessary to understand and assess the conclusions of the manuscript to any reader of Science. Data must be included in the body of the paper or in the supplementary materials, where they can be viewed free of charge by all visitors to the site. Certain types of data must be deposited in an approved online database, including DNA and protein sequences, microarray data, crystal structures, and climate records.

http://www.sciencemag.org/site/feature/contribinfo/faq/index.xhtml#data_faq

Requirements for Reproducibility: Data

§  This is a no-brainer, right?

§  Unless the data are completely unprocessed (or the processing is part of the detailed method description/code), it’s better to include the actual data

§  “Ligands from PDB structures X, Y, and Z” is probably not good enough

§  For sources like ChEMBL, a version number and SQL to grab the data are probably adequate
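The last point can be sketched in a few lines. This is only an illustration, not a query against the real ChEMBL schema: the in-memory table `activities`, its columns, and the version string `chembl_NN` are placeholders invented for the example. The idea is simply to store the exact SQL and the database release next to the extracted rows.

```python
import json
import sqlite3

# Hypothetical stand-in for a local copy of a versioned database.
# Table and column names are invented for illustration only.
DB_VERSION = "chembl_NN"  # placeholder: record the real release you used
QUERY = """
SELECT molecule_id, smiles, value_nm
FROM activities
WHERE target_id = ?
ORDER BY molecule_id
"""


def extract_with_provenance(conn, target_id):
    """Run QUERY and bundle the rows with everything needed to rerun it."""
    rows = conn.execute(QUERY, (target_id,)).fetchall()
    return {
        "source_version": DB_VERSION,
        "sql": QUERY.strip(),
        "parameters": [target_id],
        "rows": [list(r) for r in rows],
    }


# Build a toy database so the example runs without any external data.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activities "
    "(molecule_id TEXT, smiles TEXT, value_nm REAL, target_id TEXT)"
)
conn.executemany(
    "INSERT INTO activities VALUES (?, ?, ?, ?)",
    [
        ("mol-1", "CCO", 1200.0, "target-A"),
        ("mol-2", "c1ccccc1", 35.0, "target-A"),
        ("mol-3", "CC(=O)O", 870.0, "target-B"),
    ],
)

record = extract_with_provenance(conn, "target-A")
print(json.dumps(record, indent=2))
```

Depositing a record like this with the paper gives readers both the data as used and the means to regenerate it from the stated release.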

Requirements for Reproducibility: Data

Goodman, L., Lawrence, R. & Ashley, K. Data-set visibility: Cite links to data in reference lists. Nature 492, 356 (2012).

A huge amount of work goes into creating data sets. It is crucial that these data, big or small, should be more prominently linked to their associated research articles as standard practice. To achieve this, data can be cited directly in a publication's reference section using a permanent identifier such as a digital object identifier (DOI; see, for example, go.nature.com/vnyidi and go.nature.com/zdfbcl). So far, however, only very few journals do this. Publishers, funders, researchers and institutions all need to recognize that data sets constitute a valuable scholarly resource. Authors should be credited for these career-making contributions. Enhanced data-set visibility would also benefit referees and readers by raising standards of data analysis, promoting more detailed review, encouraging data curation and boosting reproducibility and data reuse.

Requirements for Reproducibility: Data

§  What about chemical structures?
   •  a table with drawings of molecules?
   •  names instead of structures?

§  Why not include the structures in a machine-readable format?

This expanded use of electronic resources offers an excellent opportunity to make chemical information more accessible and user-friendly to readers of scientific papers. To take advantage of these opportunities, we have developed several online features that expand the usefulness of chemical compound information for Nature Chemical Biology readers … In all original research papers, compounds that are relevant to the background or results of the paper are assigned a bolded, Arabic numeral that serves as a unique identifier for the compound. Each numerical abbreviation in the HTML and PDF versions of the article is linked to a Compound Data page, which shows the structure and the IUPAC or common name of the chemical compound. From there, readers can download a ChemDraw file of the compound…To provide readers with rapid access to all of the chemical compounds discussed in an article, we feature a Compound Data Index page, which is accessible from the Compound Data page, the table of contents entry for the paper, and the navigation tools on the right side of the Nature Chemical Biology website.

http://www.nature.com/nchembio/journal/v3/n6/full/nchembio0607-297.htm
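For authors whose journal offers no such feature, supplying structures in machine-readable form still takes little effort. As a rough sketch, the snippet below writes a compound table to a SMILES file (one structure per line, followed by an identifier) using only the Python standard library. The compound numbers and the tab-separated layout are illustrative assumptions; a cheminformatics toolkit could write SDF from the same table.

```python
import io

# Illustrative compound table: (compound number in the paper, SMILES).
compounds = [
    ("1", "CCO"),                    # ethanol
    ("2", "c1ccccc1"),               # benzene
    ("3", "CC(=O)Oc1ccccc1C(=O)O"),  # aspirin
]


def write_smiles(records, handle):
    """Write one 'SMILES<tab>identifier' line per compound."""
    for ident, smiles in records:
        handle.write(f"{smiles}\t{ident}\n")


def read_smiles(handle):
    """Read the file back into (identifier, SMILES) pairs."""
    out = []
    for line in handle:
        line = line.rstrip("\n")
        if line:
            smiles, ident = line.split("\t")
            out.append((ident, smiles))
    return out


# An in-memory buffer stands in for a supplementary file on disk.
buffer = io.StringIO()
write_smiles(compounds, buffer)
buffer.seek(0)
round_trip = read_smiles(buffer)
print(buffer.getvalue())
```

The round trip is the point: a reader can load the file and get back exactly the structures the paper refers to, which a table of drawings cannot guarantee.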

Requirements for Reproducibility: Chemical Data

From Nature Chemical Biology

Requirements for Reproducibility: Chemical Data

From Nature Chemistry

Huigens, R. W., et al. A ring-distortion strategy to construct stereochemically complex and structurally diverse compounds from natural products. Nature Chemistry 5, 195–202 (2013). doi:10.1038/nchem.1549

It’s not always easy

Data Sets. For this study we arbitrarily chose 18 Merck data sets shown in Table 1. These include a mix of on-target data sets and ADME data sets. Some data sets are so large (>100,000) that we randomly selected a smaller subset of compounds (50,000) to expedite the study. It is useful to use proprietary data sets for two reasons:

1.  We wanted data sets which are realistically large and have a realistic level of noise but are not as noisy as high-throughput data sets.

2.  Time-splitting requires dates of testing, and these are almost impossible to find in public domain data sets.

Chen, B., Sheridan, R. P., Hornak, V. & Voigt, J. H. Comparison of Random Forest and Pipeline Pilot Naïve Bayes in Prospective QSAR Predictions. J. Chem. Inf. Model. 52, 792–803 (2012).
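Time-splitting, as mentioned in the excerpt above, means training on compounds tested before a cutoff and evaluating on those tested afterwards, which is why test dates are needed. The sketch below shows the idea on made-up records; the field layout, the dates, and the 75% training fraction are assumptions for illustration, not the protocol used by Chen et al.

```python
from datetime import date

# Made-up assay records: (compound id, date tested, measured value).
records = [
    ("cmpd-4", date(2011, 3, 2), 6.1),
    ("cmpd-1", date(2009, 1, 15), 5.2),
    ("cmpd-3", date(2010, 11, 30), 7.4),
    ("cmpd-2", date(2009, 6, 8), 4.9),
]


def time_split(rows, train_fraction=0.75):
    """Sort by test date; the earliest rows train, the latest rows test."""
    ordered = sorted(rows, key=lambda r: r[1])
    n_train = int(len(ordered) * train_fraction)
    return ordered[:n_train], ordered[n_train:]


train_rows, test_rows = time_split(records)
print([r[0] for r in train_rows], [r[0] for r in test_rows])
```

Without the date column the split cannot be reproduced, which is the difficulty the authors describe for public data sets.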

Requirements for Reproducibility: Code

Stahl, M. & Bajorath, J. Computational Medicinal Chemistry. J. Med. Chem. 54, 1–2 (2011).

Computational methods must be described in sufficient detail for the reader to reproduce the results.

Requirements for Reproducibility: Code

Ince, D. C., Hatton, L. & Graham-Cumming, J. The case for open computer programs. Nature 482, 485–488 (2012).

We argue that, with some exceptions, anything less than the release of source programs is intolerable for results that depend on computation. The vagaries of hardware, software and natural language will always ensure that exact reproducibility remains uncertain, but withholding code increases the chances that efforts to reproduce results will fail.

Requirements for Reproducibility: Code

Data and materials availability All data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science. All computer codes involved in the creation or analysis of data must also be available to any reader of Science. After publication, all reasonable requests for data and materials must be fulfilled. Any restrictions on the availability of data, codes, or materials, including fees and original data obtained from other sources (Materials Transfer Agreements), must be disclosed to the editors upon submission.

http://www.sciencemag.org/site/feature/contribinfo/prep/gen_info.xhtml#dataavail

Requirements for Reproducibility: Code

An inherent principle of publication is that others should be able to replicate and build upon the authors' published claims. Therefore, a condition of publication in a Nature journal is that authors are required to make materials, data and associated protocols promptly available to readers without undue qualifications. Any restrictions on the availability of materials or information must be disclosed to the editors at the time of submission. Any restrictions must also be disclosed in the submitted manuscript, including details of how readers can obtain materials and information. If materials are to be distributed by a for-profit company, this must be stated in the paper.

http://www.nature.com/authors/policies/availability.html

In the meantime, researchers must, when they are arranging the commercialization of their work, bear in mind the implications that these deals may have on their freedom to publish to the standards that the community is entitled to expect.

http://www.nature.com/nature/journal/v442/n7098/full/442001a.html

Requirements for Reproducibility: Code

§  “Black box” code sharing: installing the software on a publicly accessible server, or providing executables for people to test

§  Does this help with reproducibility?

§  Doesn’t demonstrate that the implementation corresponds to the algorithm description

§  Not cut and dried.

The Recomputation Manifesto

From Ian Gent, University of St. Andrews

1.  Computational experiments should be recomputable for all time

2.  Recomputation of recomputable experiments should be very easy

3.  It should be easier to make experiments recomputable than not to

4.  Tools and repositories can help recomputation become standard

5.  The only way to ensure recomputability is to provide virtual machines

6.  Runtime performance is a secondary issue

http://www.software.ac.uk/blog/2013-07-09-recomputation-manifesto http://arxiv.org/pdf/1304.3674v1.pdf
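A virtual machine is the manifesto's answer. A much weaker but cheap complement is to record the software environment alongside the results, so that a reader at least knows what was run. The sketch below uses only the Python standard library; the choice of fields and the package names passed in are assumptions, and a record like this does not make an experiment recomputable on its own.

```python
import json
import platform
import sys
from importlib import metadata


def describe_environment(packages=()):
    """Collect interpreter, OS, and (where installed) package versions."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return {
        "python": sys.version.split()[0],
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "packages": versions,
    }


# The second name is deliberately bogus to show the "not installed" case.
env = describe_environment(packages=["numpy", "surely-not-installed-pkg"])
print(json.dumps(env, indent=2))
```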

Requirements for Reproducibility: Results

§  Including the actual results is even more of a no-brainer, right?

Homology Models of Human All-Trans Retinoic Acid Metabolizing Enzymes CYP26B1 and CYP26B1 Spliced Variant

Homology models of CYP26B1 (cytochrome P450RAI2) and CYP26B1 spliced variant were derived using the crystal structure of cyanobacterial CYP120A1 as template for the model building. The quality of the homology models generated were carefully evaluated, and the natural substrate all-trans-retinoic acid (atRA), several tetralone-derived retinoic acid metabolizing blocking agents (RAMBAs), and a well-known potent inhibitor of CYP26B1 (R115866) were docked into the homology model of full-length cytochrome P450 26B1. The results show that in the model of the full-length CYP26B1, the protein is capable of distinguishing between the natural substrate (atRA), R115866, and the tetralone derivatives. The spliced variant of CYP26B1 model displays a reduced affinity for atRA compared to the full-length enzyme, in accordance with recently described experimental information.

This paper, presenting two new homology models, does not include either model. Unfortunately, I didn’t have to search long to find this example.

Requirements for Reproducibility: Results

§ This is the primary output of the research

§ Helps dampen some of the arguments about statistics

§ Need the unprocessed data

§ All of it
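One way to act on these points, sketched under assumptions: write every per-compound result to a plain CSV file and derive any summary statistic from that file, so that readers can recompute the statistic or choose a different one. The column names and numbers below are invented for illustration.

```python
import csv
import io
import math

# Invented per-compound results: (compound id, observed, predicted).
results = [
    ("cmpd-1", 5.0, 5.5),
    ("cmpd-2", 6.0, 5.5),
    ("cmpd-3", 7.0, 8.0),
    ("cmpd-4", 4.0, 4.0),
]


def write_results(rows, handle):
    """Write the full, unaggregated results with a header row."""
    writer = csv.writer(handle)
    writer.writerow(["compound_id", "observed", "predicted"])
    writer.writerows(rows)


def rmse_from_csv(handle):
    """Recompute the root-mean-square error from the deposited file."""
    reader = csv.DictReader(handle)
    errors = [float(r["observed"]) - float(r["predicted"]) for r in reader]
    return math.sqrt(sum(e * e for e in errors) / len(errors))


# An in-memory buffer stands in for a supplementary file on disk.
buffer = io.StringIO()
write_results(results, buffer)
buffer.seek(0)
rmse = rmse_from_csv(buffer)
print(f"RMSE = {rmse:.3f}")
```

Because the statistic is computed from the deposited file rather than reported alone, a reader who prefers a different measure can calculate it from the same rows.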

How are we doing?

§  Survey of recent publications:
   •  Everything in JCIM vol 52 #10
   •  Everything in JCAMD vol 26 #10
   •  Journal of Cheminformatics from July 2012-Nov 4 2012

§  Big differences between journals

§  Plenty of room for improvement

§  Analysis is presence/absence of full results

Journal    Type of paper   Count   Full Data   Partial Data   Missing Data   Code?
JCIM       Method          13      6           3              4              1
JCIM       Non-method      16      10          3              3              0
JCAMD      Method          3       3           0              0              0
JCAMD      Non-method      4       0           3              1              0
JChemInf   Method          12      7           3              3              8
JChemInf   Non-method      3       0           0              0              0

Practical considerations

§  Where to put the data and code?
   •  Supplementary material
   •  Code-sharing sites (sourceforge.net, google code, github)
   •  Data sharing: Zenodo/Labarchives.com
   •  A hybrid: Figshare

§  Considerations:
   •  It needs to still be there 5+ years from now
   •  Having a solid connection to the original paper is good
   •  Others have to actually be able to do something with it
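Whichever location is chosen, a checksum manifest is one small step toward the considerations above: it lets a reader years later confirm that the files they downloaded are the ones the paper used. The sketch below hashes files with SHA-256 from the Python standard library; the file names and the manifest layout are illustrative assumptions, not a requirement of any of the services listed.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=65536):
    """Hash a file in chunks so large data sets do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(directory):
    """Map each file's path (relative to the directory) to its SHA-256 digest."""
    root = Path(directory)
    return {
        str(p.relative_to(root)): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


# Two small made-up files stand in for a paper's deposited data.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "structures.smi").write_bytes(b"CCO\t1\n")
    (Path(tmp) / "results.csv").write_bytes(b"compound_id,value\n1,5.0\n")
    manifest = build_manifest(tmp)

for name, digest in manifest.items():
    print(digest, name)
```

Quoting the manifest, or a hash of it, in the paper itself strengthens the connection between the publication and the deposited files.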

Some stuff to look at

§  vagrant (virtual box configuration and provisioning): http://www.vagrantup.com/

§  openshift (cloud-based application deployment): https://www.openshift.com/

§  wakari (ipython in the cloud): https://wakari.io/

Tools for reproducible research: Knime

§  Open-source workflow tool

§  Strong data manipulation and mining capabilities

§  Data and results can be stored with the workflow.

Tools for reproducible research: IPython notebook

§  Python session running in a browser
   •  Tab completion
   •  Access to docstrings

§  Text formatting options available for including discussion or capturing mathematics (access to LaTeX for formatting math)

§  Captures all data transformations and displays output

§  Tight integration with matplotlib

Here’s a cool picture from my lab. … and here’s how you can make it too.

Pat’s not completely off the hook

Walters, W. P. Modeling, Informatics, and the Quest for Reproducibility. J. Chem. Inf. Model. (2013). doi:10.1021/ci400197w

No data

No code

No algorithm description

Results only as a figure

Acknowledgements

§  NIBR:
   •  Nik Stiefl (GDC/CADD)
   •  Nikolas Fechner (NIBR IT/IS Sigma)
   •  Sereina Riniker (NIBR IT/IS Sigma)

§  Matthias Rarey

§  Pat Walters

Perhaps the biggest barrier to reproducible research is the lack of a deeply ingrained culture that simply requires reproducibility for all scientific claims.

Peng, R. D. Reproducible Research in Computational Science. Science 334, 1226–1227 (2011).