
Fran Berman, Data and Society, CSCI 4370/6370

Data and Society Lecture 3: Data-driven Science

2/12/16


Announcements

• Section 1 Exam February 26.

– Practice test given out at the end of lecture today.


Today (2/12/16)

• Any questions about Lecture 2?

• Lecture 3: Data-driven Science

– Some history

– IT-enabled Applications

– Data and Computing at SDSC

– Guiding the future of data Science

• Break

• Data Roundtable (L2)


Section Theme | Date | First "half" | Second "half"

Section 1: The Data Ecosystem -- Fundamentals
| January 29 | Class introduction; Digital data in the 21st Century (L1) | Data Roundtable / Fran |
| February 5 | Data Stewardship and Preservation (L2) | L1 Data Roundtable / 5 students |
| February 12 | Data-driven Science (L3) | L2 Data Roundtable / 5 students | ← We are here
| February 19 | Future infrastructure – Internet of Things (L4) | L3 Data Roundtable / 5 students |
| February 26 | Section 1 Exam | L4 Data Roundtable / 5 students |

Section 2: Data and Innovation – How has data transformed science and society?
| March 4 | Paper assignment description | Section 1 Data Roundtable / 5 students |
| March 11 | Data and Health: Phil Bourne guest lecture (L5) | Section 2 Data Roundtable / 5 students |
| March 18 | Spring Break / no class | |
| March 25 | Data and Entertainment (L6) | L5 Data Roundtable / 5 students |
| April 1 | Big Data Applications (L7) | L6 Data Roundtable / 5 students |

Section 3: Data and Community – Social infrastructure for a data-driven world
| April 8 | Data in the Global Landscape (L8); Section 2 paper due | L7 Data Roundtable / 5 students |
| April 15 | Digital Rights (L9) | L8 Data Roundtable / 5 students |
| April 22 | Bulent Yener Guest Lecture, Data Security (L10) | L9 Data Roundtable / 5 students |
| April 29 | Digital Governance and Ethics (L11) | L10 Data Roundtable / 5 students |
| May 6 | Section 3 Exam | L11 Data Roundtable / 5 students |


Lecture 3: Data and Computing


What is the potential impact of global warming?

How will natural disasters affect urban centers?

What therapies can be used to cure or control cancer?

Can we accurately predict market outcomes?

What plants work best for biofuels?

Is there life on other planets?

Modeling, simulation, and analysis are critical tools in addressing science and societal challenges.


Computational science: an increasing focus in the '80s and '90s (data issues often in the background …)

• Many reports in 80’s and early 90’s focused on the potential of information technologies (primarily computers and high-speed networks) to address key scientific and societal challenges

• First federal “Blue Book” in 1992 focused on key computational problems including

– Weather forecasting

– Cancer genes

– Predicting new superconductors

– Aerospace vehicle design

– Air pollution

– Energy conservation and turbulent combustion

– Microsystems design and packaging

– Earth's biosphere

– Broader education resources


Enabling IT: Increasing focus on a broader spectrum of resources

[Diagram: application classes plotted along three resource axes (COMPUTE: more FLOPS; DATA: more BYTES; NETWORK: more BW):
• Home, Lab, Campus, Desktop Applications
• Compute-intensive HPC Applications
• Data-intensive and Compute-intensive HPC applications
• Compute-intensive Grid, Distributed, and Cloud Applications
• Data-oriented Grid, Distributed, and Cloud Applications
• Data-intensive applications]

More key resources:
• Software
• Human resources (workforce)

'80s, '90s +: Computational Science; '90s, '00s +: Informatics; '00s, '10s +: Data Science


In the beginning … The Branscomb Pyramid, circa 1993

Branscomb Pyramid provides a framework to associate computational power with community use.

Original Branscomb Committee Report (“From Desktop to TeraFlop”) at http://www.csci.psu.edu/docs/branscomb.txt


The Branscomb Pyramid, circa 2016

[Pyramid, top to bottom:
• Leadership Class (PF, EF)
• Large-scale campus/commercial resources, Center supercomputers (TF, PF)
• Small-scale Campus/Commercial Clusters (TF)
• Small-scale devices and personal computers (MF, GF)
Opportunities for innovation at all levels …]

Prefix scale: Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
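As a quick aid for reading the pyramid's MF/GF/TF/PF/EF labels, here is a small sketch (illustrative only; the helper name and example values are not from the slide) that converts a raw FLOP/s figure to the appropriate SI prefix from the table above:

def si_prefix(flops):
    # walk the prefix ladder from the table above, largest first
    for exp, name in [(24, "Yotta"), (21, "Zetta"), (18, "Exa"), (15, "Peta"),
                      (12, "Tera"), (9, "Giga"), (6, "Mega"), (3, "Kilo")]:
        if flops >= 10 ** exp:
            return f"{flops / 10 ** exp:.1f} {name}flop/s"
    return f"{flops} flop/s"

print(si_prefix(1.17e12))   # "1.2 Teraflop/s" (the 1993 Top500 SUM, from a later slide)
print(si_prefix(10.5e15))   # "10.5 Petaflop/s"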


Also in 1993: The Top500 List created to rank supercomputers

• TOP500 list ranks and details the 500 most powerful supercomputers in the world

• Most powerful = performance on the LINPACK benchmark.

• Rankings provide invaluable statistics on supercomputer trends by country, vendor, sector, processor characteristics, etc.

• List compiled by Hans Meuer of University of Mannheim, Jack Dongarra of University of Tennessee, and Erich Strohmaier and Horst Simon of NERSC / LBNL. List comes out in November and June each year.

http://top500.org/


What the Top500 List measures

Rmax and Rpeak values are in TFlops

• Computers assessed based on their performance on the LINPACK Benchmark – calculating the solution to a dense system of linear equations.

– User may scale the size of the problem and optimize the software in order to achieve the best performance for a given machine

– Algorithm used must conform to LU factorization with partial pivoting (operation count for the algorithm must be 2/3 n^3 + O(n^2) double precision floating point operations)

• Rpeak values calculated using the advertised clock rate of the CPU. (theoretical performance)

• Rmax = maximal LINPACK performance achieved (actual performance)
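To make the operation count concrete, here is a minimal sketch (assuming NumPy, whose dense solver calls LAPACK's LU factorization with partial pivoting; the problem size n is illustrative, not a Top500 configuration) that times a dense solve and converts the nominal count into a GFlop/s figure:

import time
import numpy as np

n = 2000                                  # illustrative problem size
A = np.random.rand(n, n)
b = np.random.rand(n)

t0 = time.perf_counter()
x = np.linalg.solve(A, b)                 # dense solve via LU factorization with partial pivoting
elapsed = time.perf_counter() - t0

flops = (2.0 / 3.0) * n**3 + 2.0 * n**2   # nominal LINPACK operation count
print(f"~{flops / elapsed / 1e9:.1f} GFlop/s achieved on the n={n} solve")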

Rensselaer CCI Blue Gene Q on current Top500 list (November 2015):

• 97th most powerful supercomputer in the world

• 30th most powerful Academic supercomputer in the world

• 5th most powerful Academic supercomputer in the US


Performance Development (Slide courtesy of Jack Dongarra)

[Chart: Top500 performance, 1993-2011, on a log scale from 100 Mflop/s to 100 Pflop/s, with three curves: SUM (all 500 systems), N=1 (fastest system), and N=500 (slowest listed system). In 1993: N=1 = 59.7 GFlop/s, SUM = 1.17 TFlop/s, N=500 = 400 MFlop/s. By 2011: N=1 = 10.5 PFlop/s, SUM = 74 PFlop/s, N=500 = 51 TFlop/s. The N=500 curve trails N=1 by roughly 6-8 years. For comparison: Jack's laptop (12 Gflop/s) and Jack's iPad2 & iPhone 4s (1.02 Gflop/s).]
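A back-of-the-envelope reading of the chart (a sketch; the endpoint values are taken from the chart summary above) shows the growth rate the curves imply:

# N=1 grew from 59.7 GFlop/s (1993) to 10.5 PFlop/s (2011)
ratio = 10.5e15 / 59.7e9
annual = ratio ** (1 / (2011 - 1993))
print(f"~{annual:.2f}x per year")   # ~1.96x: the list-topping system roughly doubled in speed yearly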


Fast forward to 2016: development of coordinated / integrated research information infrastructure is needed on all axes to drive application innovation

[Diagram: applications positioned along the three resource axes (COMPUTE: more FLOPS; DATA: more BYTES; NETWORK: more BW):
• Protein analysis and modeling of function and structures
• Storage and analysis of data from the CERN Large Hadron Collider
• Development of biofuels
• Cosmology
• Seti@home, MilkyWay@Home, BOINC
• Real-time disaster response]


Data has emerged as the “4th Paradigm” for research and discovery

Experimental methods

Theoretical modeling

Computational methods

Data analysis


Increasing Federal Expectations for Data

Data Management Plans:

• Grantee specification of plans for use, stewardship and preservation of data from their projects.

• Mandatory part of proposal process for increasing number of federal agencies

Agency Plans for Public Access of Research Data and Publications

• Relevant to federal R&D agencies with over $100M in annual research expenditures (NSF, NIH, DOE, etc.)

• Agencies required to provide strategies for increasing / enhancing discoverability, access, dissemination, reproducibility, stewardship, preservation


Multi-paradigm applications

• Terashake Earthquake Simulation

• Large Hadron Collider Data Analysis


Earthquake Simulation


Background:

• The Earth is constantly evolving through the movement of "plates".

• In plate tectonics, the Earth's outer shell (lithosphere) is posited to consist of seven large and many smaller moving plates.

• As the plates move, their boundaries collide, spread apart, or slide past one another, resulting in geological processes, typically at plate boundaries: earthquakes and tsunamis, volcanoes, and the development of mountains.


Why Earthquake Simulations are Important

Terrestrial earthquakes damage homes, buildings, bridges, highways

Tsunamis come from earthquakes in the ocean

• If we understand how earthquakes happen, we can

– Predict which places might be hardest hit

– Reinforce bridges and buildings to increase safety

– Prepare police, fire fighters and doctors in high-risk areas to increase their effectiveness

• Information technologies drive more accurate earthquake simulation


The magnitude 6.7 Northridge, California earthquake of 1994 caused an estimated $20B in damage.

Major Earthquakes on the San Andreas Fault, 1680-present: 1906 (M 7.8), 1857 (M 7.8), 1680 (M 7.7), … ?

What would be the impact of an earthquake on the lower San Andreas fault?


Simulation decomposition strategy leverages parallel high performance computers

– Southern California partitioned into “cubes” then mapped onto processors of high performance computer

– Data choreography used to move data in and out of memory during processing

Builds on data and models from the Southern California Earthquake Center; the kinematic source (from the Denali earthquake) focuses on the segment from Cajon Creek to Bombay Beach
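As a rough illustration of the partitioning, here is a hypothetical sketch in Python (not SCEC's actual code; the 600x300x80 km dimensions and 240-way split are taken from the TeraShake figures elsewhere in this lecture):

import itertools

domain_km = (600, 300, 80)   # TeraShake domain (km)
procs = (10, 6, 4)           # 10 * 6 * 4 = 240 processes, one sub-cube each

def owned_block(coords, domain, procs):
    # (start, stop) range in km that one process owns on each axis
    return [(d * c // p, d * (c + 1) // p)
            for d, c, p in zip(domain, coords, procs)]

blocks = [owned_block(c, domain_km, procs)
          for c in itertools.product(*(range(p) for p in procs))]
print(len(blocks), blocks[0])   # 240 [(0, 60), (0, 50), (0, 20)]
# Each process time-steps the wave equation on its own cube, exchanging
# boundary ("ghost") planes with neighboring cubes at every step.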


TeraShake Simulation

Simulation of a magnitude 7.7 earthquake on the lower San Andreas Fault in Southern California

• Physics-based dynamic source model – simulation of mesh of 1.8 billion cubes with spatial resolution of 200 m

• Simulated first 3 minutes of a magnitude 7.7 earthquake, 22,728 time steps of 0.011 second each

• Simulation for TeraShake 1 and 2 simulations generated 45+ TB data
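The 1.8-billion-cube figure follows directly from the domain size and resolution; a quick arithmetic check (domain dimensions taken from the PetaShake comparison later in the lecture):

# 600 km x 300 km x 80 km domain sampled at 200 m resolution
nx, ny, nz = 600_000 // 200, 300_000 // 200, 80_000 // 200
print(nx, ny, nz, nx * ny * nz)   # 3000 1500 400 1800000000 -> the 1.8 billion cubes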


Under the surface

Behind the Scenes: TeraShake Data Choreography

[Diagram: resources must support a complicated orchestration of computation and data movement:
• 47 TB of output data for 1.8 billion grid points
• Continuous I/O at 2 GB/sec
• 240 procs on SDSC DataStar for 5 days, with 1 TB of main memory
• "Fat nodes" of DataStar with 256 GB of memory used for pre-processing and post-run visualization
• 10-20 TB of data archived per day
• Parallel file system plus "data parking" storage
• Finer resolution simulations require even more resources; TeraShake scaled to run on petascale architectures]
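The choreography pattern of overlapping computation with staging data off the machine can be sketched as a producer/consumer pipeline (hypothetical names; illustrative only, not the production workflow):

import queue
import threading

done = queue.Queue()

def archiver():
    # background thread: drain finished snapshots to archival ("parking") storage
    while True:
        snapshot = done.get()
        if snapshot is None:
            break
        print(f"archiving {snapshot}")   # stand-in for a GPFS-to-SAM-QFS transfer

t = threading.Thread(target=archiver)
t.start()

for step in range(5):                    # stand-in for thousands of solver time steps
    # ... compute one time step and write its snapshot to the parallel file system ...
    done.put(f"snapshot_{step:05d}.dat")

done.put(None)                           # tell the archiver no more snapshots are coming
t.join()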


TeraShake and Data

• Data Management
– 10 terabytes moved per day during execution over 5 days
– Derived data products registered into the SCEC digital library (total SCEC library had 168 TB)

• Data Post-processing:
– Movies of seismic wave propagation
– Seismogram formatting for interactive on-line analysis
– Derived data: velocity magnitude, displacement vector field, cumulative peak maps, statistics used in visualizations

TeraShake Resources

Computers and Systems
• 80,000 hours on IBM Power4 (DataStar)
• 256 GB p690 node used for testing; p655 nodes used for the production run; TeraGrid used for porting
• 30 TB GPFS global parallel file system
• Run-time 100 MB/s data transfer from GPFS to SAM-QFS
• 27,000 hours of post-processing for high resolution rendering

People
• 20+ people for IT support
• 20+ people in domain research

Storage
• SAM-QFS archival storage
• HPSS backup
• SRB collection with 1,000,000 files


TeraShake at Petascale – better prediction accuracy creates greater resource demands

Estimated figures for a simulated 240-second period, 100-hour run-time:

| | TeraShake domain (600x300x80 km^3) | PetaShake domain (800x400x100 km^3) |
| Fault system interaction | NO | YES |
| Inner scale | 200 m | 25 m |
| Resolution of terrain grid | 1.8 billion mesh points | 2.0 trillion mesh points |
| Magnitude of earthquake | 7.7 | 8.1 |
| Time steps | 20,000 (.012 sec/step) | 160,000 (.0015 sec/step) |
| Surface data | 1.1 TB | 1.2 PB |
| Volume data | 43 TB | 4.9 PB |

Information courtesy of the Southern California Earthquake Center. (Tera = 10^12; Peta = 10^15)
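The jump from 1.8 billion to 2.0 trillion mesh points is consistent with the table's domain and resolution changes, as a quick check shows:

volume_ratio = (800 * 400 * 100) / (600 * 300 * 80)      # ~2.22x larger domain
resolution_ratio = (200 / 25) ** 3                       # 8x finer on each axis: 512x points
print(1.8e9 * volume_ratio * resolution_ratio / 1e12)    # ~2.05 -> the ~2.0 trillion mesh points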


Application Evolution

• TeraShake → PetaSHA, PetaShake, CyberShake, etc. at SCEC

• Evolving applications improving resolution, models, simulation accuracy, scope of results, etc.

• PetaSHA foci:

– Create a hierarchy of simulations for the 10 most probable large (M>7) ruptures in southern California

– Validation of earthquake simulations using well-recorded regional events (M<=6.7) and assimilation of regional waveform data into community velocity models

– Validation of hazard curves and extension of maps to higher frequencies and more extensive geographic coverage, creating a rich new database for earthquake scientists and engineers


Large Hadron Collider Data Analysis


The Large Hadron Collider (LHC)

• LHC is the world's most powerful particle collider.

• LHC's goal is to allow physicists to test the predictions of different theories of particle physics and high-energy physics (in particular the properties of the Higgs Boson) and the large family of new particles predicted by supersymmetric theories.

• LHC contains seven detectors, each designed for a different kind of research. LHC was built near Geneva between 1998 and 2008 in collaboration with over 10,000 scientists and engineers from over 100 countries.

• LHC lies in a 17-mile-circumference tunnel beneath the France-Switzerland border.

• LHC collisions produce tens of PBs of data per year.
– A subset of the data is analyzed by a distributed grid of 170+ computing centers in 36 countries.

A collider is a type of particle accelerator with two directed beams of particles. In particle physics, colliders are used as a research tool: they accelerate particles to very high kinetic energies and let them impact other particles. Analysis of the byproducts of these collisions gives scientists good evidence of the structure of the subatomic world and the laws of nature governing it. Many of these byproducts are produced only by high-energy collisions, and they decay after very short periods of time; thus many of them are hard or nearly impossible to study in other ways.

Information from Jamie Shiers and Wikipedia


Higgs and Beyond

"A major goal of Run 1 of the LHC was to find evidence of the Higgs boson…" Its discovery was announced on July 4, 2012. After a prolonged downtime to prepare for running at almost twice the energy of Run 1, the LHC restarted in 2015 and will run (mainly during the summer months) for another 3 years before yet another upgrade. Major goals of Run 2 are to search "beyond the standard model", including searches for Dark Energy and Dark Matter.

Information from Jamie Shiers, CERN


Worldwide LHC Computing Grid

Image from http://wlcg.web.cern.ch/


Data: Outlook for HL-LHC

• The LHC – including all foreseen upgrades – will run until circa 2040. By that time, between 10 and 100 EB of data will have been gathered.

• These data (the uninteresting stuff has already been discarded) should be preserved for a number of decades.

• Very rough estimate of new RAW data per year of running, using a simple extrapolation of current data volume scaled by the output rates.

• To be added: derived data (ESD, AOD), simulation, user data…

• At least 0.5 EB / year (x 10 years of data taking)

[Chart: projected RAW data per year, 0-450 PB, for CMS, ATLAS, ALICE, and LHCb across Run 1 through Run 4, with a "We are here!" marker.]

Slide adapted from Jamie Shiers, CERN


LHC – Stewardship and Preservation Challenges

• Significant volumes of high energy physics data are thrown away “at birth” – i.e. via very strict filters (aka triggers) before writing to storage. To a first approximation, all remaining data needs to be preserved for a few decades.

– LHC data particularly valuable as reproducibility of experiments is tremendously expensive and almost impossible to achieve

• Tier 0 and 1 sites currently provide bit preservation at scale

– Data more usable and accessible when services coupled with bit preservation

– Tier 0 and Tier 1 sites are in the process of "self certification" according to ISO 16363.

• CERN is also developing an advanced Data Management Plan that will be updated roughly annually, based on "an intelligent super-set" of the Horizon 2020, DoE, and NSF guidelines – with a few of its own for good measure.

Slide adapted from Jamie Shiers, CERN
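The effect of a trigger can be sketched in a few lines (purely illustrative; a hypothetical one-stage filter with an arbitrary threshold, whereas real triggers are multi-stage hardware and software systems): keep only events passing a strict cut, so only a tiny fraction is ever written to storage.

import random

random.seed(1)

def passes_trigger(value):
    # stand-in for a strict physics filter (e.g., an energy threshold)
    return value > 0.999

kept = sum(passes_trigger(random.random()) for _ in range(1_000_000))
print(f"kept {kept:,} of 1,000,000 simulated events ({kept / 1e6:.3%})")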


Post-collision

(Slide: David South, "Data Preservation and Long Term Analysis in HEP," CHEP 2012, May 21-25, 2012)

After the collisions have stopped

• Finish the analyses! But then what do you do with the data?
– Until recently, there was no clear policy on this in the HEP community
– It's possible that older HEP experiments have in fact simply lost the data

• Data preservation, including long-term access, is generally not part of the planning, software design or budget of an experiment
– So far, HEP data preservation initiatives have in the main been planned not by the original collaborations, but rather driven by the effort of a few knowledgeable people

• The conservation of tapes is not equivalent to data preservation!
– "We cannot ensure data is stored in file formats appropriate for long term preservation"
– "The software for exploiting the data is under the control of the experiments"
– "We are sure most of the data are not easily accessible!"

Slide adapted from Jamie Shiers, CERN


Cyberinfrastructure

"Cyberinfrastructure" (aka e-Infrastructure)

• Cyberinfrastructure is the organized aggregate of information technologies coordinated to address problems in science and society

• Cyberinfrastructure components:
– Digital data
– Computers
– Wireless and wireline networks
– Personal digital devices
– Scientific instruments
– Storage
– Software
– Sensors
– People …

"If infrastructure is required for an industrial economy, then we could say that cyberinfrastructure is required for a knowledge economy."

NSF Final Report of the Blue Ribbon Advisory Panel on Cyberinfrastructure ("Atkins Report", 2003)


Cyberinfrastructure (CI) an emerging national focus in 2000 and beyond

• Publication of the Atkins report accelerated CI as a critical national focus within federal R&D investments and especially at NSF

• CI elevated from a division within the CISE directorate to an Office within NSF (Atkins became first OCI Director)

• San Diego Supercomputer Center (SDSC), a pioneer in data-intensive computing, focused on leadership in data cyberinfrastructure and data-enabled applications

Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf


Building Data Cyberinfrastructure at SDSC, 2001 - 2009


SDSC in a Nutshell:

• 1985 – 2004: NSF supercomputer / cyberinfrastructure center hosted at UCSD. 2004 – present: UCSD center with national impact.

• Multi-faceted facility focused on cyberinfrastructure-enabled applications, high performance computing, data stewardship

– Several hundred research-, technology-, and production-systems-focused staff

– Home to 100+ national research projects, allocated national machines, research data infrastructure

– Funded by NSF, Department of Energy, NIH, DHS, Library of Congress, National Archives and Records Administration, UC system, etc.

Data Cyberinfrastructure at the San Diego Supercomputer Center (SDSC), 2001-2009


SDSC Strategic Focus in the 2000’s : Support for Data-oriented science from the small to the large-scale

• Special needs at the extremes of size, number, timeframe, capability, and SW support …

• Data Cyberinfrastructure should support:
– Petabyte-sized collections
– 100 PetaByte archives
– Collections which must be preserved 100 years or more
– Data-oriented simulation, analysis, and modeling at 10-100X university/research-lab-level capacities
– Professional data services, software, and curation beyond what is feasible in university, campus, and research lab facilities


SDSC Data Cyberinfrastructure / Selected Projects

[Diagram: SDSC Data CI hub with spokes: data portals, data visualization, data management, HPC data, data analytics, data services, data storage (Data Oasis), and data preservation.]


SDSC Data Central

• One of the first general-purpose programs of its kind to support research and community data collections and databases

• Data Central was available without charge to the scientific community and provided a facility to store, manage, analyze, mine, share and publish data collections, enabling access and collaboration in the broader scientific community

• Project led by Natasha Balac at SDSC

Who could apply

• Open to researchers affiliated with US educational institutions

• Proposals were merit-reviewed quarterly by Data Allocations Committee

Types of Allocations:

• Expedited Allocations

– 1 TB or less of disk & tape 1st year

– 5 GB Database 1st year

– Yearly review

• Medium Allocations

– Under 30 TB

• Large Allocations

– Larger than 30 TB


DataCentral Allocated Collections included:

• Seismology: 3D Ground Motion Collection for the LA Basin
• Atmospheric Sciences: 50-year Downscaling of Global Analysis over California Region
• Earth Sciences: NEXRAD Data in Hydrometeorology and Hydrology
• Elementary Particle Physics: AMANDA data
• Biology: AfCS Molecule Pages
• Biomedical Neuroscience: BIRN
• Networking: Backbone Header Traces
• Networking: Backscatter Data
• Biology: Bee Behavior
• Biology: Biocyc (SRI)
• Art: C5 landscape Database
• Geology: Chronos
• Biology: CKAAPS
• Biology: DigEmbryo
• Earth Science Education: ERESE
• Earth Sciences: UCI ESMF
• Earth Sciences: EarthRef.org
• Earth Sciences: ERDA
• Earth Sciences: ERR
• Biology: Encyclopedia of Life
• Life Sciences: Protein Data Bank
• Geosciences: GEON
• Geosciences: GEON-LIDAR
• Geochemistry: Kd
• Biology: Gene Ontology
• Geochemistry: GERM
• Networking: HPWREN
• Ecology: HyperLter
• Networking: IMDC
• Biology: Interpro Mirror
• Biology: JCSG Data
• Government: Library of Congress Data
• Geophysics: Magnetics Information Consortium data
• Education: UC Merced Japanese Art Collections
• Geochemistry: NAVDAT
• Earthquake Engineering: NEESIT data
• Education: NSDL
• Astronomy: NVO
• Government: NARA
• Anthropology: GAPP
• Neurobiology: Salk data
• Seismology: SCEC TeraShake
• Seismology: SCEC CyberShake
• Oceanography: SIO Explorer
• Networking: Skitter
• Astronomy: Sloan Digital Sky Survey
• Geology: Sensitive Species Map Server
• Geology: SD and Tijuana Watershed data
• Oceanography: Seamount Catalogue
• Oceanography: Seamounts Online
• Biodiversity: WhyWhere
• Ocean Sciences: Southeastern Coastal Ocean Observing and Prediction Data
• Structural Engineering: TeraBridge
• Various: TeraGrid data collections
• Biology: Transporter Classification Database
• Biology: TreeBase
• Geoscience: Tsunami Data
• Education: ArtStor
• Biology: Yeast regulatory network
• Biology: Apoptosis Database
• Cosmology: LUSciD


Focus of Computing Procurements: Supercomputers that supported Data-Intensive Applications

• A balanced system to support data-oriented applications requires a trade-off of flops and other key system characteristics

• Balanced system provides support for tightly-coupled and strong I/O applications

– Grid platforms not a strong option
– Data local to computation
– I/O rates exceed WAN capabilities
– Continuous and frequent I/O is latency intolerant

• Scalability is key

– Need high-bandwidth and large-capacity local parallel file systems

– Need large-capacity flexible “parking” storage for post-processing

– Need high-bandwidth and large-scale archival storage

• Application performance determines the best configuration

[Plot: DoD applications plotted by spatial locality (x-axis) vs. temporal locality (y-axis), with benchmark reference points Linpack, RandomAccess, STREAM, and Overflow marked for an IBM Power3; stride-0 data ignored. Compute-oriented and data-oriented applications occupy different regions of the locality space.]
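Why locality matters can be seen with a small experiment (a sketch assuming NumPy; exact timings vary by machine): stride-1 access runs far faster than random access over the same data, which is what separates STREAM-like from RandomAccess-like applications in the plot above.

import time
import numpy as np

rng = np.random.default_rng(0)
a = rng.random(20_000_000)
idx = rng.integers(0, a.size, a.size)   # a random index stream over the same array

t0 = time.perf_counter()
s_seq = a.sum()                         # stride-1, cache-friendly (STREAM-like)
t1 = time.perf_counter()
s_rnd = a[idx].sum()                    # random gather (RandomAccess-like)
t2 = time.perf_counter()

print(f"sequential: {t1 - t0:.3f}s   random: {t2 - t1:.3f}s")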


Data Activities at SDSC in 2016

• Data Oasis – 4 PB of storage for users of SDSC Gordon and Comet in XSEDE

• Data-focused projects:

– NSF Big Data Hub West

– Health Cyberinfrastructure -- Cancer Data Infrastructure

– Big Data Benchmarking

– Data and Information Virtualization

– Data Integration

– Data Modeling

– Graph Analytics

– Spatial Data

– Predictive Analytics

– Time Series and Streaming Data

– Visualization, etc.


Data Science – an emerging field

• Slides based on slides by Rob Rutenbar for the NSF CISE AC Data Science Subcommittee

• [Wikipedia] Data Science is an interdisciplinary field about processes and systems to extract knowledge or insights from data in various forms, either structured or unstructured, which is a continuation of some of the data analysis fields such as statistics, data mining, and predictive analytics, similar to Knowledge Discovery in Databases (KDD).


How do we develop a roadmap for data science?

• Curriculum & pedagogy – Who teaches foundations? Where? To whom?

• Research – Who is doing data science research? Who is doing research enabled by data science? How can this be fostered/facilitated? What are the programmatic needs?

• Infrastructure – What is needed to support data science?

• Verticals – Disruptions of the end-to-end computing stack

• Ethics & society – What informs use and collection of data? Social impacts?


How do we connect the dots in data science?

Data Science

Education and training

• Should help create data-literate citizenry

• Should prepare students for data-driven jobs in all sectors

• Should prepare students for jobs in data science in all sectors

Research and innovation

• Should support cutting edge research on key problems in the understanding, use and supporting environments for data

• Should support integration of data as driver for innovation in all fields

Infrastructure

• Should support data education and training

• Should support data research at sufficient scale

• Should provide stewardship and preservation of valuable data

• Should support state-of-the art data services

Social Policy and Governance

• Should provide basis for appropriate social policy and regulation around the appropriate and ethical use of data


Foundational question to guide data science as it emerges: What kind of discipline is data science?

• Statistics?

• Machine learning?

• Hybrid discipline?

– X-analytics

– X-informatics (e.g. data expansion of bio-informatics)

– data + X (e.g. data equivalent of computational biology)

• Discipline on its own?

Maybe all of the above, but … what mechanisms can be used to grow and invest in the discipline to maximize its potential?


Org? Where Does Data Science Live, Grow?

[Diagram: alternative organizational homes for data science (DS): inside CS; inside Stat; spanning Stat and CS; a freestanding DataSci unit alongside CS and Stat; or distributed as DS efforts across departments A, B, C, …, Z.]

Where does foundational pedagogy get developed? Aimed at what audiences?


NSF CISE Current Data Science Investment Strategy

• To develop new techniques and technologies to derive knowledge from data

• To manage, curate, and serve data to domain research communities

• For a growing, emerging discipline

• To support interdisciplinary science and communities


Future Investment Strategies: Proposed Focus of the NSF CISE Advisory Committee on Data Science

• What is data science?

• Where is data science? (i.e. within higher education)

• Who is data science? (i.e. within areas and professional occupations in the workforce)

• Why is data science a priority?

• PREPARING A WORKFORCE FOR DATA SCIENCE-ENABLED JOBS
– What do we need to teach and how?
• Recommendations for curriculum development programs
• Recommendations for graduate training and education programs
• Recommendations for public-private data science internships

• ACCELERATING THE STATE OF THE ART OF DATA SCIENCE
– What are the big problems in data science? What are the big problems that data science enables?
• Recommendations for NRC study on data science grand challenges
• Recommendations for CISE programs that encourage research in data science grand challenges
• Recommendations for joint CISE and other directorate (or agency) programs that encourage research in data science grand challenges
• Recommendations for joint CISE and private sector programs on data science grand challenges

• SUPPORTING DATA SCIENCE AND DATA-ENABLED RESEARCH AND EDUCATION
– What kind of infrastructure is needed to support successful data science research and education?
• Recommendations for sustainable and reliable at-scale infrastructure and data collections
• Recommendations for public-private partnerships to support data science infrastructure
• Recommendations for broad training and education in using data science infrastructure and resources (data literacy …)


Lecture Materials (not already on slides)

• Southern California Earthquake Center, http://www.scec.org/

• Atkins Report: http://www.nsf.gov/cise/sci/reports/atkins.pdf

• LHC, www.wikipedia.com

• Worldwide LHC Computing Grid website, http://wlcg-public.web.cern.ch/tier-centres

• San Diego Supercomputer Center, www.sdsc.edu


Break


Data Round Table


Two Weeks: L4 Roundtable February 26

• "How the Internet of Things got hacked," Wired, December 2015, http://www.wired.com/2015/12/2015-the-year-the-internet-of-things-got-hacked/ (Theo B)

• “The Measured Life,” MIT Technology Review, June 21, 2011, http://www.technologyreview.com/featuredstory/424390/the-measured-life/ (Brenda T)

• “GM and Lyft are building a network of self-driving cars,” Wired, January 4, 2016, http://www.wired.com/2016/01/gm-and-lyft-are-building-a-network-of-self-driving-cars/ (Chris P)

• “Hijackers remotely kill a jeep on the highway – with me in it,” Wired, July 2, 2015, http://www.wired.com/2015/07/hackers-remotely-kill-jeep-highway/ (TK W)

• “Robot doctors, online lawyers and automated architects: the future of the professions?”, The Guardian, June 15, 2014, http://www.theguardian.com/technology/2014/jun/15/robot-doctors-online-lawyers-automated-architects-future-professions-jobs-technology (Sri I)


Next week: L3 Data Roundtable for February 19

• “Scientists Say that a Neptune-Sized Planet Lurks Beyond Pluto,” Science, January 20, 2016, http://www.sciencemag.org/news/2016/01/feature-astronomers-say-neptune-sized-planet-lurks-unseen-solar-system (Kiana M.)

• “Birdwatchers help Science fill gaps in the Migratory Story,” NYTimes, January 30, 2016, http://www.nytimes.com/2016/01/29/science/bird-watchers-help-science-fill-gaps-in-the-migratory-story.html?rref=collection%2Fsectioncollection%2Fscience&action=click&contentCollection=science&region=rank&module=package&version=highlights&contentPlacement=8&pgtype=sectionfront (Amelia GB)

• “NCAR announces powerful new supercomputer for advanced atmospheric, geosciences modeling,” Scientific Computing, January 11, 2016, http://www.scientificcomputing.com/news/2016/01/ncar-announces-powerful-new-supercomputer-advanced-atmospheric-geosciences-modeling (Kienan K.)

• “Scientists Use Stargazing Technology in the Fight against Cancer,” Time, February, 2013 http://healthland.time.com/2013/02/27/scientists-use-stargazing-technology-in-the-fight-against-cancer/ (Courtney T.)

• “Digital Keys for Unlocking the Humanities’ Riches”, the New York Times, November 16, 2010, http://www.nytimes.com/2010/11/17/arts/17digital.html?pagewanted=all&_r=0 (Jessica J.)


Today: Readings for Lecture 2 Data Roundtable

• “Got Data? A Guide to Digital Preservation in the Information Age,” CACM (December, 2008) http://www.cs.rpi.edu/~bermaf/CACM08.pdf (Caitlin C.)

• “A Digital Life,” Scientific American (March, 2007) http://www.scientificamerican.com/article/a-digital-life/ (Rob R.)

• “Thirteen Ways of Looking at … Digital Preservation,” D-Lib Magazine (August, 2004), http://www.dlib.org/dlib/july04/lavoie/07lavoie.html (Amreen A.)

• “The Lost NASA Tapes: Restoring Lunar Images after 40 Years in the Vault”, ComputerWorld (June, 2009), http://www.computerworld.com/article/2525935/computer-hardware/the-lost-nasa-tapes--restoring-lunar-images-after-40-years-in-the-vault.html?page=2 (Jordan D.)

• “Preserving the Internet”, CACM (January 2016), http://cacm.acm.org/magazines/2016/1/195738-preserving-the-internet/fulltext (Dan L.)