Big data – no big deal for curation?
Graham Pryor, Associate Director, UK Digital Curation Centre
Eduserv Symposium 2012: Big Data, Big Deal?
This work is licensed under a Creative Commons Attribution 2.5 UK: Scotland License
Because good research needs good data
Big data – big deal or same deal?
“What need the bridge much broader than the flood?
The fairest grant is the necessity.
Look, what will serve is fit…”
Much Ado About Nothing, Act 1 Scene 1
Eduserv Symposium 2012 – speakers’ research areas
• Operating Systems & Networking
• Computer and Network Security
• Distributed Systems
• Mobile Computing
• Wireless Networking
• Software Engineering
• High performance compute clusters
• Cloud and grid technologies
• Effective management of large clusters and cluster file-systems
• Very large database systems (architecture, management and application optimization)
The Digital Curation Centre
• a consortium comprising units from the Universities of Bath (UKOLN), Edinburgh (DCC Centre) and Glasgow (HATII)
• launched 1st March 2004 as a national centre for solving challenges in digital curation that could not be tackled by any single institution or discipline
• funded by JISC to build capacity, capability and skills in research data management across the UK HEI community
• awarded additional HEFCE funding 2011/13 for
– the provision of support to national cloud services
– targeted institutional development
Three perspectives
Scale and complexity
– Volume and pace
– Infrastructure
– Open science
Policy
– Funders
– Institutions
– Ethics & IP
Management
– Storage
– Incentives
– Costs & Sustainability
Challenges of scale and complexity
• Globally, >100,000 neuroscientists study the CNS, generating massive, intricate and highly interrelated datasets
• Analysts require access to these data to develop algorithms, models and schemata that characterise the underlying system
• Resources and actors are rarely collocated and are therefore difficult to combine.
• The virtual laboratory is a federation of server nodes that allows distributed data to be stored local to acquisition
• Analysis codes can be uploaded and executed on the nodes so that derived datasets need not be transported over low-bandwidth connections (a minimal sketch of this pattern follows this slide)
• Data and analysis codes are described by structured metadata, providing an index for search, annotation and audit over workflows leading to scientific outcomes
• Users access the distributed resources through a web portal emulating a PC desktop
http://www.carmen.org.uk/
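The “move the code to the data” pattern described above can be shown in a minimal, self-contained sketch. Everything here is hypothetical and in-memory: the Node class, its methods and the example recording are invented for illustration, not CARMEN’s actual portal or node API.

# A federated node stores datasets local to acquisition, executes
# uploaded analysis code, and keeps structured metadata for search
# and audit. Only the small derived result crosses the network.

class Node:
    def __init__(self, name):
        self.name = name
        self.datasets = {}   # dataset_id -> raw samples (stay local)
        self.metadata = {}   # dataset_id -> descriptive metadata

    def store(self, dataset_id, samples, **metadata):
        self.datasets[dataset_id] = samples
        self.metadata[dataset_id] = metadata

    def run(self, dataset_id, analysis):
        # Execute the uploaded analysis on the node, then record the
        # outcome in the metadata so workflows can be audited later.
        result = analysis(self.datasets[dataset_id])
        self.metadata[dataset_id].setdefault("derived", []).append(
            {"analysis": analysis.__name__, "result": result})
        return result

# A lab stores a recording on its local node...
node = Node("lab-node-01")
node.store("rec-042", [0.1, 0.4, 0.35, 0.9],
           species="rat", region="hippocampus")

# ...and a remote analyst ships a small function to the data.
def mean_amplitude(samples):
    return sum(samples) / len(samples)

print(node.run("rec-042", mean_amplitude))  # 0.4375, not the raw data
print(node.metadata["rec-042"])             # searchable audit trail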
But this is only talking terabytes…
Big data? – The Large Hadron Collider
• Predicted annual generation of around 15 petabytes (15 million gigabytes) of data
• Would need >1,700,000 dual layer DVDs
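As a quick check on that figure: a dual-layer DVD holds roughly 8.5 GB, so 15 PB ≈ 15,000,000 GB ÷ 8.5 GB per disc ≈ 1,765,000 discs, consistent with the slide’s “>1,700,000”.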
Searching for the Higgs Boson
Big data – the GridPP solution
“With GridPP you need never have those data processing blues again…”
http://www.gridpp.ac.uk/about
With the Large Hadron Collider running at CERN the grid is being used to process the accompanying data deluge. The UK grid is contributing more than the equivalent of 20,000 PCs to this worldwide effort.
Crowd sourcing for the LHC
Home and office computer users can sign up to the LHC@home project (based at Queen Mary, University of London), which makes use of idle CPU time. So far, 40,000 users in more than 100 countries have contributed the equivalent of 3000 years on a single computer to the project.
Yet… Data Preservation in High Energy Physics?
Data from high-energy physics (HEP) experiments are collected with significant financial and human effort and are in many cases unique. At the same time, HEP has no coherent strategy for data preservation and re-use, and many important and complex data sets are simply lost.
David M. South, on behalf of the ICFA DPHEP Study Group, arXiv:1101.3186v1 [hep-ex]
Big data in genomics
These studies are generating valuable datasets which, due to their size and complexity, need to be skilfully managed…
There’s a bigger deal than big data…
Socio-technical management perspectives
Information systems perspectives
Research practice perspectives
1.
• Identify drivers and champions
• Analyse stakeholders, issues
• Identify capability gaps
• Assess costs, benefits, risks
2.
• Inventory data assets
• Profile norms, roles, values
• Identify capability gaps
• Analyse current workflows
3.
• Produce feasible, desirable changes
• Evaluate fitness for purpose
Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
The DCC – building capacity and capability through targeted institutional development
• 18 institutional engagements, 14 roadshows
• advice and assistance in strategy and policy
• use of curation tools for audit and planning
• training and skills transfer
Why do we do this?
1. Reports that researchers are often unaware of threats and opportunities
“Departments don’t have guidelines or norms for personal back-up and researcher procedure, knowledge and diligence varies tremendously. Many have experienced moderate to catastrophic data loss”
Incremental Project Report, June 2010
Why do we do this?
1. Reports that researchers are often unaware of threats and opportunities
2. There is a lack of clarity in terms of skills availability and acquisition
…researchers are reluctant to adopt new tools and services unless they know someone who can recommend or share knowledge about them. Support needs to be based on a close understanding of the researchers’ work, its patterns and timetables.
Why do we do this?
1. Reports that researchers are often unaware of threats and opportunities
2. There is a lack of clarity in terms of skills availability and acquisition
3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders
EPSRC expects all those institutions it funds
• to have developed a roadmap aligning their policies and processes with EPSRC’s nine expectations by 1st May 2012
• to be fully compliant with each of those expectations by 1st May 2015
• to recognise that compliance will be monitored and non-compliance investigated, and that failure to share research data could result in the imposition of sanctions
Why do we do this?
1. Reports that researchers are often unaware of threats and opportunities
2. There is a lack of clarity in terms of skills availability and acquisition
3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders
4. …and legislators
Rules and regulations…
Compliance
• Data Protection Act 1998: Rights, Exemptions, Enforcement
• Freedom of Information Act 2000: Climategate, Tree Rings, Tobacco and… (what’s next?)
• Computer Misuse Act 1990: etc. etc. etc…
Why do we do this?
1. Reports that researchers are often unaware of threats and opportunities
2. There is a lack of clarity in terms of skills availability and acquisition
3. Many institutions are unprepared to meet the increasingly prescriptive demands of funders
4. …and legislators
5. The advantages from planning, openness and sharing are not understood
Open to all? Case studies of openness in research
Choices are made according to context, with degrees of openness reached according to:
• The kinds of data to be made available
• The stage in the research process
• The groups to whom data will be made available
• The terms and conditions on which data will be provided
Default position of most:
• YES to protocols, software, analysis tools, methods and techniques
• NO to making research data content freely available to everyone
After all, where is the incentive?
Angus Whyte, RIN/NESTA, 2010
DCC Institutional Engagements
http://www.dcc.ac.uk/community/institutional-engagements
Adapted from Developing Research Data Management Capabilities by Whyte et al, DCC, 2012
Main institutional concerns
– Compliance
– Asset management
– Cost benefits
– Incentivisation
– Complexity of the data environment
And big data? There has been no mention yet of any specific challenge from big data but…
Institutions are providing resources to work on big data, both equipment and people, and more importantly…
…the issues central to effective data management are common across the data spectrum, irrespective of size
Some current institutional engagements
Assessing needs
RDM roadmaps
Piloting tools e.g. DataFlow
Policy development
Policy implementation
Support offered by the DCC
The DCC support team helps institutions to assess needs, make the case, and develop support and services:
• RDM policy development
• Customised Data Management Plans
• DAF & CARDIO assessments
• Guidance and training
• Workflow assessment
• Advocacy to senior management
• Institutional data catalogues
• Pilot RDM tools
• …and support policy implementation
Four DCC Tools
Your Data as Assets: DAF
• What are the characteristics of your research data assets?
– Number?
– Scale?
– Complexity?
– Dependencies?
– Liabilities?
• Why do researchers act the way they do with respect to data?
• Which data do they need to undertake productive research?
DMP Online is a web-based data management planning tool that allows you to build and edit plans according to the requirements of the major UK funders. The tool also contains helpful guidance and links for researchers and other data professionals.
http://www.dcc.ac.uk/dmponline
CARDIO
An online tool for departments or research groups to identify their current data management capabilities and to chart coordinated pathways to future enhancement via a dedicated knowledge base. CARDIO emphasises a collaborative, consensus-driven approach, and enables benchmarking with other groups and institutions.
http://cardio.dcc.ac.uk/
DRAMBORA is an audit methodology and tool for identifying and planning for the management of risks which may threaten the availability and/or usability of content in a digital repository or archive (see the sketch below).
http://www.repositoryaudit.eu
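To make the risk-audit idea concrete, here is a minimal Python sketch of a DRAMBORA-style risk register. The specific risks, the 1–6 scoring scales and the severity = probability × impact ranking are illustrative assumptions, not DRAMBORA’s published registers or formulae.

# Illustrative sketch of a DRAMBORA-style risk register: each identified
# risk gets probability and impact scores, and risks are ranked by
# severity so mitigation can be planned for the worst first.
# All risks and scales below are invented for illustration.

from dataclasses import dataclass

@dataclass
class Risk:
    description: str
    probability: int  # assumed scale: 1 (rare) .. 6 (almost certain)
    impact: int       # assumed scale: 1 (negligible) .. 6 (catastrophic)

    @property
    def severity(self) -> int:
        # Assumed ranking rule: severity = probability x impact
        return self.probability * self.impact

register = [
    Risk("Storage media failure before refresh cycle", 3, 5),
    Risk("File formats become obsolete and unreadable", 4, 4),
    Risk("Loss of metadata linking content to provenance", 2, 6),
]

for risk in sorted(register, key=lambda r: r.severity, reverse=True):
    print(f"severity {risk.severity:2d}: {risk.description}")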
So, big data – no big deal for curation?
• Yes, it’s big
• It’s also very complex
• There is no single technology solution
• Issues of human infrastructure are possibly a bigger challenge
• But for big data aficionados the technology challenges are big enough
Data Management – infrastructure
and data storage challenges...
The case for cloud computing in genome informatics. Lincoln D Stein, May 2010
Scalability
Cost-effectiveness
Security (privacy and IPR)
Robustness and resilience
Low entry barrier
Ease-of-use
Data-handling / transfer / analysis capabilities
Help desk: 0131 651 1239
info@dcc.ac.uk
www.dcc.ac.uk