Big Data and the Future of (Physics) Publishing
Anita de Waard, VP Research Data Collaborations
Elsevier RDM Services
Columbia University, June 2, 2017
Present
Data is becoming distributed
Michael Tuts:
Ideas are becoming distributed
Kirk Borne:
Tools are becoming distributed
Mike Hildreth:
• Preserved workflows can be used to compare new models with a published analysis
• Reinterpretation possible with full detector simulation, analysis chain
• “Folding” rather than “Unfolding” like in HEPData
Data is becoming distributed
Many sources, formats, owners, types: global, interconnected
Ideas are becoming distributed
Computers make hypotheses, too*; citizen science/MOOCs enable ubiquitous access to knowledge
Tools are becoming distributed
Easy to create networks of tools to run anywhere (Docker, Jupyter Notebook collections, etc.)
* http://ieeexplore.ieee.org/abstract/document/7515118/: Computer-Aided Discovery: Toward Scientific Insight Generation with Machine Support
Data
Tool
Article
Researcher
Towards Networked Knowledge:
Science Can Now Scale With the Network!
https://en.wikipedia.org/wiki/Metcalfe%27s_law
http://spectrum.ieee.org/computing/networks/metcalfes-law-is-wrong
• Metcalfe's Law: The value of a (telecommunications) network is proportional to the square of the number of connected users of the system (n^2).
• Reed’s Law: The value is proportional to 2^n (more precisely, to the 2^n - n - 1 possible subgroups of n users).
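To get a feel for how differently the two laws scale, here is a minimal sketch in Python. It is an illustration only, not part of the talk: the function names are hypothetical, the proportionality constants are assumed to be 1, and Reed's Law is taken in its common subgroup-counting form, 2^n - n - 1.

```python
def metcalfe_value(n: int) -> int:
    """Metcalfe's Law: value grows with the square of the number of users.

    Assumption: proportionality constant of 1.
    """
    return n * n


def reed_value(n: int) -> int:
    """Reed's Law: count the possible subgroups of two or more users.

    Of the 2^n subsets of n users, drop the empty set (1) and the
    n single-user sets, leaving 2^n - n - 1.
    """
    return 2 ** n - n - 1


if __name__ == "__main__":
    # Compare the two growth rates for a few network sizes.
    for n in (2, 5, 10, 20):
        print(f"n={n:>2}  Metcalfe={metcalfe_value(n):>4}  Reed={reed_value(n):>8}")
```

Even at n = 20 the subgroup count is in the millions while n^2 is only 400, which is the sense in which science "can now scale with the network".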
Crisis # 1: Reproducibility/Scientific Integrity
Richard Feynman on Scientific Integrity:
• If you're doing an experiment, you should report
everything that you think might make it
invalid - not only what you think is right about it.
• Details that could throw doubt on your
interpretation must be given, if you know
them.
• If you make a theory, for example, and advertise
it, or put it out, then you must also put down all
the facts that disagree with it, as well as those
that agree with it.
• When you have put a lot of ideas together to
make an elaborate theory, you want to make
sure, when explaining what it fits, that those
things it fits are not just the things that gave
you the idea for the theory; but that the finished
theory makes something else come out right,
in addition.
http://calteches.library.caltech.edu/51/2/CargoCult.htm
http://theconversation.com/the-science-reproducibility-crisis-and-what-can-be-done-about-it-74198
Crisis # 2: Not Enough Brains To Interpret All This!
https://www.aps.org/programs/education/statistics/
https://www.insidehighered.com/news/2013/10/03/departments-under-threat-few-majors-physicists-say-value-isnt-reflected-numbers
[Figure: % to Temporary Residents (0% to 60%), 1995 to 2013; series: Doctoral Degrees, Master's Degrees, Bachelor's Degrees]
[Figure: Physics and STEM (dual axes: 0 to 14,000 and 0 to 350,000), 1965 to 2015]
To paraphrase Remy the Rat (Ratatouille):
‘Not everyone can be a great scientist, but a great
scientist can come from anywhere’
Crisis # 3: Diminishing Trust/Funding
Networked Knowledge To The Rescue!
1. Reproducibility:
Disconnect creation of data from
interpretation to prevent confirmation bias
2. Lack of brains:
Making data and tools available to the
planet allows interested outsiders to help
explore new interpretations; support
tutoring
3. Diminishing trust/funding:
Putting datasets in multiple places and
allowing many different parties to
participate helps make systems
sustainable
Data
Journal
Inst. Data
Repository(ies)
Lab
ELN(s)
Data search
Data Management
Plans
Metadata, methods & protocols
ready for preservation and
publishing
Link to article
Journal
Publish data
(under embargo)
Secure
discoverability
in & outside
the institution
Find Topic
Identify gaps
Plan & Fund
Discover data, people, methods & protocols
Collect, analyze & visualize
Store, preserve & share
Publish
Prepare, reproduce, re-use & benchmark
Domain-specific
Repositories
Primary research data lifecycle
Integrate RDM and
monitor outputs
So How Do You Publish A Network?
More About Our Collaborations And Tools:
https://www.rd-alliance.org/
The Research Data Alliance (RDA) builds the social and technical bridges that enable open sharing of data.
http://www.nationaldataservice.org/
Links existing data archiving and sharing efforts together with a common set of tools.
http://www.scholix.org/
A framework for exchanging information about links between literature and data.
https://www.force11.org/
A community of scholars, librarians, and others that helps facilitate the change toward improved knowledge creation and sharing.
https://ec.europa.eu/research/openscience/index.cfm?pg=open-science-cloud
A blueprint for cloud-based services and data infrastructure to ensure science, business and public services reap the benefits of the big data revolution.
https://www.hivebench.com/
An Electronic Lab Notebook that helps prepare, conduct and analyze experiments virtually.
https://datasearch.elsevier.com/#/
Search for research data across domains and repositories.
https://data.mendeley.com/
A secure cloud-based repository, making it easy to share, access and cite data.
https://www.elsevier.com/authors/author-services/research-elements
Research Elements: Publish data, software, materials and methods in brief, citable articles.
A service to support research librarians in tracking data sharing and use across campus.
In Summary:
• As tools, software and data become distributed, science experiences the network effect
• This can solve three crises facing science:
  • Detaching observation from interpretation combats issues with reproducibility
  • Opening up data and tools can draw new minds to scientific reasoning
  • Redundant storage and delivery systems and new players in cyberinfrastructure relieve dependencies on (US) gov’t funds
• “Networked science publishing” involves:
  • Adapting to and being interoperable with many different platforms, technologies, and scholarly habits of practice
  • Collaborating with others (institutions, funders, etc.) to develop knowledge ecosystems
  • Complying with/helping develop new standards, in multi-stakeholder platforms
Anita de Waard, [email protected], June 2, 2017