research data, or: how i learned to stop worrying and love ... · research data, or: how i learned...
TRANSCRIPT
Research Data, or: How I Learned to Stop Worrying and Love the PolicyRDMF14:(Research(Data((and)(Systems
York,&9th November&2015
Dr(Torsten(Reimer
Scholarly&Communications&Officer
Imperial&College&London
[email protected] /&@torstenreimer
http://orcid.org/0000H0001H8357H9422
Why are we here?
Why we fight – Compliance! Really?
“Well%compliance%is%really%important,%yes%that's%the%whole%reason%we%are%doing%it%really.%I%mean%to%comply%with%Research%Council%guidelines%yes.%I%am%not%saying%the%whole%reason%butthat's%the%main%driver,%yes.”10.1371/journal.pone.0114734
There%are%issues%with%RCUK/EPSRC%policy:• costMbenefit%analysis,%anyone?• expensive/issues%around%funding• enough%support/incentive%for%culture%change?• fine%in%theory,%but%is%it%workable%in%practice?
But…
Blame funders, or blame ourselves (hedgehog and hare)?
It&seems&wherever&we&go,&the&funders&have&
already&been&there:&HEFCE&open&
access&policyR&EPSRC&data&policy…
Are&the&funders&too&fast?&Or&we&too&slow?
Imagine&the§or&had&agreed&on&best&
practice&years&ago&– and&implemented&
it&in&a&sensible&way!
So, why are we here again? No really, why?
Data Science hub and KPMG Data Observatory
Data Science hub and KPMG Data Observatory launch (04 Nov)
"At%a%research%intensive%university%like%Imperial%it%is%hard%to%do%anything%that%doesn't%involve%data.“
James&Stirling,&Provost
"Data%is%at%the%heart%of%the%human%condition."
Joanna&Shields,&UK&Minister&for&Internet&
Safety&and&Security
Considering these statements you’d think everyone, especially Imperial, would have RDM all sorted, wouldn’t you?
… and yet we are losing research data
“In their parents' attic, in boxes in the garage, or stored on now-defunct floppy disks — these are just some of the inaccessible places in which scientists have admitted to keeping their old research data.”http://www.nature.com/news/scientists-losing-data-at-a-rapid-rate-1.14416
Isn’t research meant to be reproducible?
The(results(of(only(6(out(53(‘landmark’(studies(were(found(reproducible.
Drug&development:&Raise&standards&for&preclinical&cancer&research.&DOI:&doi:10.1038/483531a
“Several%recent%publications%suggested%that%the%seminal%findings%from%academic%laboratories%could)only)be)reproduced)11–50%)of%the%time.%The%lack%of%data%reproducibility%likely%contributes%to%the%difficulty%in%rapidly%developing%new%drugs%and%biomarkers%that%significantly% impact%the%lives%of%patients%with%cancer%and%other%diseases.”
A&Survey&on&Data&Reproducibility& in&Cancer&Research&Provides&Insights&into&Our&Limited&Ability&to&Translate&Findings& from&the&Laboratory&to&the&Clinic.&DOI:&10.1371/journal.pone.0063221
Shouldn’t the public too be allowed to play with data?
(This is in our own interest!)
RDM systems landscape
Case for a national infrastructure?
Currently,&~100&UK&institutions&spend&effort&to&define&and&implement
an&RDM&infrastructure&(storage,&workflows,&interfaces,&metadata,&
compliance,&monitoring,&business&model&etc.).&Some&aspects
have&to&be&local,&but…
…imagine&a&national&research&data&infrastructure&(say&for&data
publishing&and&preservation),&run&by&RCUK:
• Economies&of&scale
• No&issues&with&funding
• Just&one&system&to&interface&with
• Increased&visibility/discoverability
• Solution&would&by&default&be&compliant
• No&commercial&“ownership”&of&public&data
However, past experience suggests…
However, past experience suggests…
One RDM system to rule them all?
• Is&community&track&record&actually
better&than&funders’?
• Jisc offers&components,&but&have&we
found&right&model&for&collaboration&
(supplier?&leader?&partner?)?
• Commercial&solutions&exist– trust?
Should&they&define&our&infrastructure?
⇒ Funders&set&policyR&3rd&parties&
infrastructure&– we’ve&been&too&slow&again!
⇒ However,&is&one&system&actually&suitable&
(redundancy,&competition,&disciplines&etc.)?
Until&the&one&solution&emerges&(if&ever),&we&should:
• consider&defining&minimum&requirements&(metadata,
identifiers,&embargoes)&for&3rd party&solutions?
• use&a&flexible&approach&that&enables&us&to&learn&and&change
Imperial College London(From funder policy to) institutional strategy
Imperial College London
• Seven&London&campuses
• Four&Faculties:&Engineering,
Medicine,&Natural&Sciences
and&Business&School
• Ranked&3rd in&Europe&/&8th in&the
world&(THE&2015H16&rankings)
• Net&income&(2014):&£855m,&incl.
£351m&research&grants&and&contracts
• ~15,000&students,&~7,400&staff,&incl.&~3,900&academic&&&research&staff
• Staff&publish&10H12,000&scholarly&articles&per&year
• Largest(data(traffic(into(Janet(network(of(all(UK(universities
Process of policy development
• 2014:&Draft&policy:&“Statement&of&Strategic&Aims”
• Lack&of&reliable&data&(on&data&storage&needs&(scale)&in&particular)
• Concerns&about&cost&of&maintaining&infrastructure
• Concerns&about&uncertainties&and&changing&market&/&policy&landscape
• Decision:&reHthink&approach&– more&costHeffective,&based&on&better&data
• Approach:&RDM&Green&Shoots&and&RDM&Investigation
• Funded&by&ViceHProvost&(Research)
• Green&Shoots:&6&bottomHup,&academic&projects&(2nd half&of&2014)
• RDM&investigation&(Oct&2014HJan&2015)
• Online&survey&(academicsR&390&responses)
• ~40&interviews&(academics)
• Workshops&(academics&&&data&managers)
RDM Green Shoots
• Haystack&– a&computational&molecular&data¬ebook&(Dr&Mike&Bearpark,&Chemistry)
• Imperial&College&Healthcare&Tissue&Bank&&&&&&&&&&&&&&(Prof.&Gerry&Thomas,&Surgery&&&Cancer)
• Integrated&RuleHbased&Data&Management&System&for&
Genome&Sequencing&Data&(Dr&Michael&Mueller,&Medicine)
• RDM&in&Computational&and&Experimental&Molecular&
Sciences&(Prof.&Henry&Rzepa,&Chemistry)
• RDM:&Where&software&meets&data&(Dr&Gerard&Gorman&&&Dr Matthew&Piggott,&Earth&Science&&&Engineering)
• Time&Series&(Dr&Nick&Jones,&Mathematics)
Idea
• Provide&a&platform&and&technology&which&automatically&connects&researchers&through&their&timeHseries&data,&models&and&analysis&methods
Achievements
• Online&interdisciplinary&collection&of&timeHseries&data&and&timeHseries&analysis&code
• Functionality&to&automatically&profile&time&series
• Functionality&to&automatically&profile&time&series&algorithms
• Functionality&to&use&these&profiles&to&place&a&user’s&work&in&the&context&of&others
RDM&Benefits
• Incentivises&data&sharing&by&allowing&data&comparison&– increases&discoverability&of&an&academic’s&data&plus&increases&likelihood&of&finding&other&relevant&data
• Resource&also&available&to&general&public&
More&Information
• http://www.compHengine.org/timeseries/
Example project: Time Series
Online survey – where does active data live?
0 10 20 30 40 50 60 70 80
College&computer
External/portable&storage
Cloud&storage
Personal&computer
Departmental/group&storage
College&H&drive
ICT¢ral&storage
Use(of(different( types(of(storage( in(%
Online survey – growth of data volume
0 5 10 15 20 25 30
>&1&PB
100&TB&– 1&PB
10&TB&– 100&TB
1&TB&– 10&TB
100&GB&– 1&TB
10&GB&– 100&GB
<&10&GB
Research(group(data(storage(needs(in(%
Now
In&2&years
Findings (best practice)
• RDM&principles&are&considered&to&be&sound&but¬&fully&practised
• Sharing&publiclyHfunded&data&accepted&in&principle&but&some&question&
value&and&cost
• Concerns&about&(metadata)&effort&to&make&shared&data&discoverable
• Metadata&schemas&are¬&yet&widely&available&across&disciplines
• AutoHgenerate&metadata&where&possible
• Consensus&that&RDM&training&for&PhDs&is&vital
(also&to&ensure&data&loss&when&they&leave)
Findings (data)
• 60O100%(of(grant(required(to(reOgenerate(data(used(in(publications
• %&of&data&that&needs&retaining&to&support&publications:&~60%
• Data&storage&capacity&will&have&to&grow&significantly
• Concerns&around&backHup&and&archiving,&esp.&considering&data&volume
• Popularity&of&cloud&services&(as&opposed&to&College&storage)
⇒ Researchers&want&selfHadministered,&secure,&responsive&solution
for&data&sharing,&storing&and&archivingR&open&APIs&preferred
(“Yes&[storage]&is&really&important.&Basically,&whenever&we&have&been&out&
to&talk&to&researchers,&that's&the&thing&they&have&latched&on&to&and&want&to&
talk&about&the&most.”&10.1371/journal.pone.0114734)
Conclusions / policy implementation principles
• Provide&platformHindependent,&flexible&data&storage
• Embed&RDM&training&into&PhD&progression
• Where&available,&uses&existing&workflows:
• Symplectic Elements:&metadata&management
• Spiral&(DSpace):&public&(metadata)&catalogue
• Additional&infrastructure:
• use&external&resources
• no&longHterm&commitment
• as&flexible&as&possible
• costHeffective
Reesult: Imperial College RDM Policy
“Imperial&College&London&is&committed&to&
promoting&the&highest(standards(of(
academic(research,&including&excellence&in&
research&data&management.&This&includes&a&
robust&digital&curation&infrastructure&that&
supports&open(data(access and&protects&
confidential&data.&The&College&acknowledges&
legal,ðical&and&commercial&constraints(on(
data(sharing(and&the&need&to&preserve&the&
academic&entitlement&to&publication.”
“Principal(Investigators(have(overall(
responsibility(for&the&effective&management&
of&research&data&generated&within&or&obtained&
for&their&research,&including&by&their&research&
groups.&The&Library and&ICT will&provide&
training,&guidance&and&services&to&support
PIs.”http://imperial.ac.uk/researchHdataHmanagement
Building a flexible RDM infrastructure
Research(Project
Data:(BoxSoftware:(GitHub
Data/software(still(needed Delete
External(repositoryInternal(Storage
Elements
Spiral
Creates(data/software
Project(ends
no
yes
Metadata,(manualor(automatic
Can(it(be(published(or(embargoed(externally?
yesno
Metadata,(manualor(automatic
Can(metadata(be(published?
Library(reviews
yes
Summarising RDM in 6 steps
1. Make(a(data(management(plan: use&DMPOnline
2. Store(your(data(management(plan(centrally:(use&InfoEd
3. Store(your(live(data(securely(and(safely:(use&Box
4. Store(your(final(data((and/or(code)(for(10+(years,
making(it(publicly(available:(use&Zenodo
5. Tell(the(College(where(your(data((and/or(code)(is
published(or(stored:(use&Symplectic
6. Reference(your(funding(and(your(data(in(the(
publications(it(underpins:(tell&your&publisher
Box – Data storage, sharing and syncing
RollHout&across&College:
• unlimited&data&
storage
• online&access,&easy&
sharing,&data&syncing
• file&viewers&included
• backup,&data&remains&
even&when&staff&leave
• machine&learning&
tools&to&describe&data
• API
Infrastructure summary
• Flexible,&can&react&to&market&/&policy&changes
• Components&can&be&exchanged,&no&additional
inHhouse&infrastructure
• Make&a&start,&collect&data,&learn&– change&as&required
• Preservation&infrastructure&needs&further&work&
(discussions&with&Arkivum about&‘framework’&for&
costing&into&grants)&– how&much&do&we&need
to&retain&beyond&published&data?
• It&isn’t&perfect,&but&we&can&make&a&start
“In, through … and beyond”
RDM policy with research software requirements
“3.6.7 Cost&Effectiveness&– where&computerHgenerated&data&may&be&
reliably&recreated&at&a&cost&less&than&that&of&storing&raw&output&data,&&
then&the&inputs&and&humanHreadable&outputs&of&the&relevant&
programme&may&be&stored&instead&along&with&a&reference&to&or©&of&
the&software&version&used.”
“3.7& If&software&is&developed&as&part&of&a&research&project,&Principal&
Investigators&must&archive(the(particular(version(of(the(software(
used&to&generate&or&analyse&the&data&in&a&repository&and&inform(the(
Library(of(its(location,&taking&account&of&the&points&raised&in&3.5&
above.&Principal&Investigators&are&encouraged&to&follow&the&
Sustainability(and(Preservation(Framework(of(the(Software(
Sustainability(Institute.”
Treat software as valuable research output
PyRDM&Green&Shoots&project
Zenodo integrates&with&GitHub
College&survey&on&distributed&version&control
Software&Sustainability&Institute&– I&a&fellow
ORCID – Open Researcher and Contributor ID
• Emerging&global&standard&for&identifying&authors&of&academic&outputs
• The&College&created&ORCID&iDs for&academics&staff&in&late&2014
(now&2,088&of&3,200&iDs claimed,&~1,500&linked&in&Elements)
• Imperial&hosted&launch&of&Jisc ORCID&consortium&with
50 UK&universities&in&September&2015
http://www.imperial.ac.uk/orcid
Towards automating RDM reporting with ORCID
Author&links&ORCID&
with&CRIS
…shares&ORCID&iD
with&repository
…publishes&datasetDataCite DOI&linked&to&
ORCID&iD
CRIS&pulls&metadata&
from&ORCID&/&DataCite /&Repository
But: is the external metadata likely to be complete “enough”?
Useful infrastructure makes compliance a by-product
• One&workflow&for&data&generation,&publishing,&reporting&and&curation
• Link&data&generation&directly&to&storage&(log&into&facility,&data&“at&your&
desk”&before&you&are&out&of&the&“lab”)
• (HSS&colleagues&– “facility”&can&also&be&a&book&scanner
• Automate&reporting&and&generating&/&sharing&of&metadata
Facilities&write&
(meta)&data&into&
Box
Data&processed&
/&analysed&from&Box
MachineHlearning&
adds&metadata
Publish&to&repository&
from&Box,&with&
reference
Metadata&directly&or&
indirectly&(ORCID)&
to&CRISS
Make data useful for us, not just for external re-use
Now&that&we&get&data,&shouldn’t&we&analyse&it?
Add&value&by:
• connect&researchers&who&have&similar&data&interests
• connect&researchers&to&relevant&data
• present&data&in&a&way&that’s&suitable&for&public&reuse
• develop&data&analytics&and&knowledge&transfer&service
• collect&impact&information&on&data
• Let’s%make%a%start%and%learn%from%doing,%from%actual%data• Think%about%where%we%can%coordinate%(3rd party%requirements)• It%is%early%stages,%take%a%flexible%approach• Don’t%wait%for%funders,%interpret%policies%in%a%useful%way%and%lead
=>)If)we)lead)instead)of)following)there)will)be)fewer)unpleasant)surprises)to)deal)with!
Research%Data,%or:%How%I%Learned%to%Stop%Worrying%and%Love%the%Policy
Image Credit (note NC licence!)
1. https://en.wikipedia.org/wiki/File:Dr._Strangelove_H
_Group_Captain_Lionel_Mandrake.png public&domain
2. https://it.wikipedia.org/wiki/Why_We_Fight#/media/File:Why_We_Fight
_title.jpg public&domain
3. https://commons.wikimedia.org/wiki/File:Hase_und_Igel_%281%29.jpg
public&domain
4. https://www.flickr.com/photos/jdhancock/4617759902/CH3PO&vs.&Data&
(137/365),&by&JD&Hancock,&CC&BY&2.0
5. https://en.wikipedia.org/wiki/One_Ring#/media/File:Unico_Anello.png
public&domain&
6. https://www.flickr.com/photos/dinnerseries/14994148089/OXO&tools,&
by&Didriks,&CC&BY&2.0
7. https://www.flickr.com/photos/albertovo5/3908190631/How&I&Learned&
To&Stop&Worrying...,&by&hjhipster,&CC&BY&NC&2.0