Tim Osborn: Research integrity: integrity of the published record
DESCRIPTION
Tim Osborn, Reader, University of East Anglia
TRANSCRIPT
11/04/23 Wellcome Collection Conference Centre, 13 September 2011 slide 1
Research Integrity Conference: The importance of good data management
Climate research data and research integrity
Dr Tim Osborn
Climatic Research Unit
School of Environmental Sciences
University of East Anglia
JISC Research Integrity Conference: the Importance of Good Data Management
13 September 2011
Integrity of the published research record
Why is it important for climate research and why now?
(Of course it’s always been important and not just for this discipline)
The global warming issue:
Scientifically challenging
Politically, socially and economically contentious
High stakes (economic and non-economic)
Under intense scrutiny
Climate change hacked emails controversy
The integrity of our research was severely questioned
What role did research data issues (management, sharing, etc.) play in this?
Need to distinguish research integrity from perceptions of research integrity
These issues probably played a rather small role
Our research data and the research record were preserved
We “created” very little raw data and we have an excellent record in preserving and publishing for re-use our derived data
Instead, the perception of doubt arose very much more from the contents of the hacked emails and their interpretation
Climate change hacked emails controversy
Improved research data management and sharing would have made little difference to the attacks on our integrity
None to our critics; perhaps a small difference to the cross-over into the mainstream media
Nevertheless, there are areas where we can improve and we received some criticism in these areas
The climate science community as a whole should improve
Data sharing for openness, for re-use
Improved data management for preserving workflows and linking articles to analysis to data (e.g. JISC ACRID)
Managing and sharing research data: why should we improve?
Supports reproducibility (necessary) and repeatability (desirable)
Maintains (actual and perceived) integrity of research
Essential because high-stake decisions must be informed by sound scientific assessment
Supports further exploration of scientific findings
Scientific findings that are not clear cut (e.g. in the vicinity of statistical significance) are more sensitive to variations in data, methodological choices, assumptions, etc.
Supports data re-use for other studies
We are data poor (despite > 10,000 TB) relative to the complexity of the climate system
Estimated numbers of climate change articles:
Total > 100,000
Just 2009 > 13,000, which is > 1 per hour
Grieneisen & Zhang (2011) doi: 10.1038/nclimate1093
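The "> 1 per hour" figure follows from simple arithmetic on the 2009 count. A minimal check, assuming the slide's lower bound of 13,000 articles and a 365-day year (the variable names are illustrative only):

```python
# Lower bound quoted on the slide for climate change articles published in 2009.
ARTICLES_2009 = 13_000

# 2009 was not a leap year, so it had 365 days.
HOURS_IN_2009 = 365 * 24

# Average publication rate over the year.
articles_per_hour = ARTICLES_2009 / HOURS_IN_2009

# Equivalent average gap between consecutive articles.
minutes_between_articles = 60 / articles_per_hour

print(f"{articles_per_hour:.2f} articles per hour")                   # 1.48 articles per hour
print(f"one article every {minutes_between_articles:.0f} minutes")    # one article every 40 minutes
```

Because 13,000 is itself a lower bound, the true rate would be somewhat higher than this estimate.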
Sharing climate data: some challenges
Data volume is already large (> 10,000 TB)
Projected to grow tenfold by end of this decade
Overpeck et al. (2011) doi: 10.1126/science.1197869
Sharing climate data: some limitations
Data with non-disclosure agreements
Formal or informal agreements
Holding back for future exploitation
Controlling use, getting recognition
Time and resources
Costs may be obvious, benefits may be unrealised
Standards, meta-data and software increase the value in re-use, but can increase the time needed
Non-disclosure agreements: real or excuse?
Example 1: UK climate data
Data sets must not be passed on to third parties under any circumstances... Once the project work using the data has been completed, copies of the datasets held by the end user should be deleted... The introduction of sanctions against individuals or Departments may be considered if breaches occur.
http://badc.nerc.ac.uk/conditions/ukmo_agreement.html
Non-disclosure agreements: real or excuse?
Example 2: Global precipitation data
One of the most widely used analyses of variations in precipitation across the global land surface is “based on the complete GPCC monthly rainfall station data-base (the largest monthly precipitation station database of the world with data from ca. 85,000 different stations)... Corresponding to international agreement, station data provided by Third Parties are protected.”
http://gpcc.dwd.de
Non-disclosure agreements: real or excuse?
Informal agreements exist too
Especially with newly collected data provided in advance of its formal publication
These agreements with colleagues, and the consequences of breaching them, are genuine (regardless of what the ICO might decide if tested under FOI/EIR legislation!)
Holding back data for future exploitation
Traditionally, climate data themselves aren’t published
Instead, a journal article is published reporting findings arising from some analysis of the data
Provides a citable outcome for which the scientist gains credit
This could take many months to a few years
Because publishable findings may only arise from extensive analysis of the data or from a collection of multiple records, and the article has to go through the peer-review system
In the meantime, the data may have been shared and used under non-disclosure restrictions
Ways forward…1
Providing data (and other materials) with a publication to allow it to be reproduced (or perhaps repeated)
E.g. supplementary online materials
Seen as a burden for all 13,000 climate change articles per year
Co-benefits must be evident to make this worthwhile
Citation and data re-use
Potential proliferation of identical (or perhaps not!) copies of datasets
Better to provide a unique identifier to existing data that have been used, rather than a copy of the data
Ways forward…2
Data publication
Newly collected (observed, simulated, derived) datasets published in their own right, not as part of a scientific paper
Meta-data and other accompanying information
Could shorten the lag from data collection to data publication, with much lighter-touch peer review
Citable (e.g. DOI) allows due credit
Identifiable (long-lasting URI) allows unique identification
Should be unique – updates or modifications to the data should have a separate unique identifier (how to link between versions – considered in our JISC ACRID project)
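One way to picture "a separate unique identifier per version, with links between versions" is a small registry in which each version records the identifier it supersedes. The sketch below is purely illustrative: the class names, the identifier strings and the use of a SHA-256 checksum are assumptions made for this example, not the approach taken in the JISC ACRID project or by any particular data centre.

```python
import hashlib
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class DatasetVersion:
    identifier: str            # hypothetical string standing in for a DOI or long-lasting URI
    checksum: str              # SHA-256 of the data, so identical copies can be recognised
    supersedes: Optional[str]  # identifier of the previous version, or None for the first


class DatasetRegistry:
    """Hypothetical in-memory registry; a real archive would persist these records."""

    def __init__(self) -> None:
        self._versions: Dict[str, DatasetVersion] = {}

    def register(self, identifier: str, data: bytes,
                 supersedes: Optional[str] = None) -> DatasetVersion:
        # Each identifier may be issued once only: modified data needs a new one.
        if identifier in self._versions:
            raise ValueError(f"identifier already in use: {identifier}")
        if supersedes is not None and supersedes not in self._versions:
            raise KeyError(f"unknown previous version: {supersedes}")
        version = DatasetVersion(identifier, hashlib.sha256(data).hexdigest(), supersedes)
        self._versions[identifier] = version
        return version

    def history(self, identifier: str) -> List[str]:
        # Follow the 'supersedes' links back to the first version.
        chain: List[str] = []
        current: Optional[str] = identifier
        while current is not None:
            chain.append(current)
            current = self._versions[current].supersedes
        return chain


registry = DatasetRegistry()
v1 = registry.register("example-dataset/v1", b"1850,0.10\n")
v2 = registry.register("example-dataset/v2", b"1850,0.12\n",
                       supersedes="example-dataset/v1")
print(registry.history("example-dataset/v2"))
# ['example-dataset/v2', 'example-dataset/v1']
```

An article citing "example-dataset/v1" would then keep pointing at exactly the data it used, while a reader can still discover that a later version exists.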
Preferred data archives…1
Storing data with publisher, linked directly to article
Useful (not essential) for a strong link between article and data
Not ideal for long term preservation, large datasets, tools for exploring data, searches of databases etc.
Not ideal for re-use
University archiving possible, but similar disadvantages
Discipline-specific, dedicated data centres are preferable
E.g. World Data Center system (http://www.icsu-wds.org/)
WDC-Climate, WDC-Paleoclimate, BADC, BODC, ITRDB, CMIP5
Preferred data archives…2
Sub-discipline specific archives superior to broader archives
More generalised approaches provide a steeper barrier for submission (e.g. describing all environmental data sets via one standard meta-data model – very large model, much to learn etc.)
Approaches tailored to sub-disciplines avoid irrelevant structures, formats, meta-data
Sometimes expertise is needed rather than extra meta-data
Summary points
Improved data sharing and links to published findings are needed across the climate science community, to increase the pace of knowledge creation and to support the integrity of published work
New approaches to publishing newly constructed datasets should be encouraged and adopted where possible
Bringing benefits of citations, credit and unique identification
Published articles should identify data used, preferably via citation/identification of already published data rather than providing a further copy of the data
Subject-specific data archives are preferred, offering better support for data re-use
Other issues (non-disclosure agreements, time and resources) need to be considered – benefits must be clear to encourage them to be overcome
Global warming issue: high stakes
Easy contexts for decision making:
Cost of reducing GHGs low, adverse impact of not doing so is high
Cost of reducing GHGs high, adverse impact of not doing so is low
Decision making in the actual context is much harder:
Significantly reducing GHGs may prove difficult with moderate to high costs
Net effects of not reducing GHGs are very uncertain and could range from fairly moderate to very severe adverse impact
Time and resources
Must not mistake reluctance to commit time and resources for a desire to avoid disclosure
There is a real cost involved
Standards, meta-data and software increase the value in re-use, but can increase the time needed
The answer is not simply to obtain funding
Even with specific funding, unless the benefits of sharing data and meta-data are clear there will be pressure to do things with more obvious benefits