![Page 1: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/1.jpg)
Clouds and Commons for the Data Intensive Science Community
Robert Grossman University of Chicago
Open Cloud Consortium
June 8, 2015
2015 NSF Open Science Data Cloud PIRE Workshop
Amsterdam
![Page 2: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/2.jpg)
Collect data and distribute files via ftp and apply data mining
Make data available via open APIs and apply data science
2000
2010-2015
2020-2025
???
Grids and federated computation
![Page 3: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/3.jpg)
1. Data Commons
![Page 4: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/4.jpg)
We have a problem …
The commoditization of sensors is creating an explosive growth of data
It can take weeks to download large geo-spatial datasets
Analyzing the data is more expensive than producing it
There is not enough funding for every researcher to house all the data they need
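A rough back-of-the-envelope check of the "weeks to download" claim can be sketched as follows. The dataset sizes, link speeds, and the `efficiency` factor are illustrative assumptions, not figures from the talk.

```python
# Hypothetical helper: estimate how long a bulk download takes.
# Sizes, link speeds and efficiency below are assumed for illustration only.

TB = 10**12  # bytes in a terabyte (decimal)
PB = 10**15  # bytes in a petabyte (decimal)


def transfer_days(size_bytes: float, link_bits_per_sec: float,
                  efficiency: float = 1.0) -> float:
    """Days needed to move size_bytes over a link.

    efficiency is the fraction of the nominal link rate that is
    actually sustained end to end (an assumption, 0 < efficiency <= 1).
    """
    if not 0 < efficiency <= 1:
        raise ValueError("efficiency must be in (0, 1]")
    seconds = size_bytes * 8 / (link_bits_per_sec * efficiency)
    return seconds / 86_400  # seconds per day


if __name__ == "__main__":
    for label, size, rate in [
        ("100 TB over 1 Gbps", 100 * TB, 1e9),
        ("100 TB over 1 Gbps at 50% efficiency", 100 * TB, 1e9),
        ("1 PB over 10 Gbps", 1 * PB, 10e9),
    ]:
        eff = 0.5 if "50%" in label else 1.0
        print(f"{label}: {transfer_days(size, rate, eff):.1f} days")
```

Under these assumptions, even a fully utilized 1 Gbps link needs more than a week for 100 TB, which is consistent with the motivation for co-locating compute with the data.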
![Page 5: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/5.jpg)
Data Commons
Data commons co-locate data, storage and computing infrastructure, and commonly used tools for analyzing and sharing data to create a resource for the research community.
Source: Interior of one of Google’s data centers, www.google.com/about/datacenters/
![Page 6: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/6.jpg)
The Tragedy of the Commons
Source: Garrett Hardin, The Tragedy of the Commons, Science, Volume 162, Number 3859, pages 1243-1248, 13 December 1968.
Individuals, when they act independently following their self-interests, can deplete a common resource, contrary to a whole group's long-term best interests.
Garrett Hardin
![Page 7: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/7.jpg)
www.opencloudconsortium.org
• U.S.-based not-for-profit corporation with international partners.
• Manages cloud computing infrastructure to support scientific research: Open Science Data Cloud, OCC/NASA Project Matsu, & OCC/NOAA Data Commons.
• Manages cloud computing infrastructure to support medical and health care research: Biomedical Data Commons.
• Manages cloud computing testbeds: Open Cloud Testbed.
![Page 8: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/8.jpg)
What Scale?
• New data centers are sometimes divided into “pods,” which can be built out as needed.
• A reasonable scale for what is needed for a commons is one of these pods (“cyberpod”).
• Let’s use the term “datapod” for the analytic infrastructure that scales to a cyberpod.
• Think of this as the scale-out of a database.
• Think of this as 5-40+ racks.
Pod A Pod B
![Page 9: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/9.jpg)
experimental science
simulation science
data science
1609 30x
1670 250x
1976 10x-100x
2004 10x-100x
![Page 10: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/10.jpg)
Core Data Commons Services
• Digital IDs
• Metadata services
• High performance transport
• Data export
• Pay for compute with images/containers containing commonly used tools, applications and services, specialized for each research community
![Page 11: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/11.jpg)
Cloud 1
Cloud 2
Cloud 3
Data Commons 1
Data Commons 2
Commons provide data to other commons and to clouds
Research projects producing data
Research scientists at research center A downloading data
Research scientists at research center B
Research scientists at research center C
Community develops open source software stacks for commons and clouds
![Page 12: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/12.jpg)
Complex statistical models over small data that are highly manual and update infrequently.
Simpler statistical models over large data that are highly automated and update frequently.
memory databases
GB TB PB
W KW MW
datapods
cyber pods
![Page 13: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/13.jpg)
Is More Different? Do New Phenomena Emerge at Scale in Biomedical Data?
Source: P. W. Anderson, More is Different, Science, Volume 177, Number 4047, 4 August 1972, pages 393-396.
![Page 14: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/14.jpg)
2. OCC Data Commons
![Page 15: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/15.jpg)
matsu.opensciencedatacloud.org
OCC-NASA Collaboration, 2009 - present
![Page 16: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/16.jpg)
• Public-private data collaborative announced April 21, 2015 by US Secretary of Commerce Pritzker.
• AWS, Google, IBM, Microsoft and Open Cloud Consortium will form five collaborations.
• We will develop an OCC/NOAA Data Commons.
![Page 17: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/17.jpg)
University of Chicago biomedical data commons developed in collaboration with the OCC.
![Page 18: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/18.jpg)
Data Commons Architecture
Object storage (permanent)
Scalable lightweight workflow
Community data products (data harmonization)
Data submission portal and APIs
Data portal and open APIs for data access
Co-located “pay for compute”
Digital ID Service & Metadata Service
Devops supporting virtual machines and containers
![Page 19: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/19.jpg)
3. Scanning Queries over Commons and the Matsu Wheel
![Page 20: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/20.jpg)
What is Project Matsu?
Matsu is an open source project for processing satellite imagery to support earth sciences researchers using a data commons.
Matsu is a joint project between the Open Cloud Consortium and NASA’s EO-1 Mission (Dan Mandl, Lead).
![Page 21: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/21.jpg)
All available L1G images (2010-now)
![Page 22: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/22.jpg)
NASA’s Matsu Mashup
![Page 23: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/23.jpg)
1. Open Science Data Cloud (OSDC) stores Level 0 data from EO-1 and uses an OpenStack-based cloud to create Level 1 data.
2. OSDC also provides OpenStack resources for the Namibia Flood Dashboard developed by Dan Mandl’s team.
3. Project Matsu uses Hadoop applications to run analytics nightly and to create tiles with OGC-compliant WMTS.
![Page 24: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/24.jpg)
Amount of data retrieved
Number of queries
mashup
re-analysis
“wheel”
row-oriented
column-oriented
done by staff
self-‐service by community
![Page 25: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/25.jpg)
Spectral anomaly detected: Nishinoshima active volcano, December 2014
![Page 26: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/26.jpg)
4. Data Peering for Research Data
![Page 27: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/27.jpg)
Tier 1 ISPs “Created” the Internet
![Page 28: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/28.jpg)
Amount of data retrieved
Number of queries
Number of sites
download data
data peering
![Page 29: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/29.jpg)
Cloud 1
Data Commons 1
Data Commons 2
Data Peering
• Tier 1 Commons exchange data for the research community at no charge.
![Page 30: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/30.jpg)
Three Requirements for Data Peering Between Data Commons
Two Research Data Commons with a Tier 1 data peering relationship agree as follows:
1. To transfer research data between them at no cost beyond the fixed cost of a cross-connect.
2. To peer with at least two other Tier 1 Research Data Commons at 10 Gbps or higher.
3. To support Digital IDs (of a form to be determined by mutual agreement) so that a researcher using infrastructure associated with one Tier 1 Research Data Commons can access data transparently from any of the Tier 1 Research Data Commons that holds the desired data.
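The three requirements above can be read as a checklist that a commons either meets or does not. A minimal sketch of such a check follows; the `Commons` class, its field names, and the example values are all hypothetical and only illustrate the three conditions as stated on the slide.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Commons:
    """Hypothetical description of one research data commons."""
    name: str
    charges_for_transfer: bool = False   # requirement 1: no per-transfer cost
    supports_digital_ids: bool = False   # requirement 3: agreed Digital IDs
    # Tier 1 peer name -> link speed in Gbps (requirement 2)
    peer_links_gbps: Dict[str, float] = field(default_factory=dict)


def unmet_requirements(c: Commons, min_peers: int = 2,
                       min_gbps: float = 10.0) -> List[str]:
    """Return the Tier 1 peering requirements that c does not yet satisfy."""
    problems = []
    if c.charges_for_transfer:
        problems.append("charges for research data transfer")
    fast_peers = [p for p, gbps in c.peer_links_gbps.items() if gbps >= min_gbps]
    if len(fast_peers) < min_peers:
        problems.append(
            f"peers with {len(fast_peers)} commons at >= {min_gbps:g} Gbps, "
            f"needs {min_peers}"
        )
    if not c.supports_digital_ids:
        problems.append("does not support Digital IDs")
    return problems


if __name__ == "__main__":
    a = Commons("commons-a", supports_digital_ids=True,
                peer_links_gbps={"commons-b": 10, "commons-c": 100})
    b = Commons("commons-b", peer_links_gbps={"commons-a": 10})
    print(a.name, unmet_requirements(a))
    print(b.name, unmet_requirements(b))
```

The thresholds are parameters rather than constants so the same check could be reused if the peers agree on different terms.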
![Page 31: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/31.jpg)
5. Requirements and Challenges for Data Commons
![Page 32: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/32.jpg)
Cyber Pods
• New data centers are sometimes divided into “pods,” which can be built out as needed.
• A reasonable scale for what is needed for biomedical clouds and commons is one (or more) of these pods.
• Let’s use the term “cyber pod” for a portion of a data center whose cyber infrastructure is dedicated to a particular project.
Pod A Pod B
![Page 33: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/33.jpg)
The 5P Requirements
• Permanent objects
• Software stacks that scale to cyber Pods
• Data Peering
• Portable data
• Support for Pay for compute
![Page 34: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/34.jpg)
Requirement 1: Permanent Secure Objects
• How do I assign Digital IDs and key metadata to open access and “controlled access” data objects and collections of data objects to support distributed computation of large datasets by communities of researchers?
– Metadata may be both public and controlled access
– Objects must be secure
• Think of this as a “DNS for data.”
• The test: One Commons serving the cancer community can transfer 1 PB of BAM files to another Commons and no bioinformaticians need to change their code
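One way to picture the “DNS for data” idea is an indirection layer: analysis code refers only to a stable digital ID, and a resolver maps that ID to wherever the object currently lives. The sketch below is a toy in-memory version; the class, the ID format, and the `example.org` URLs are hypothetical, not the Digital ID service described in the talk.

```python
from typing import Dict, List


class DigitalIdResolver:
    """Toy 'DNS for data': maps a stable digital ID to current locations."""

    def __init__(self) -> None:
        self._locations: Dict[str, List[str]] = {}

    def register(self, digital_id: str, url: str) -> None:
        """Record that the object with this ID is available at url."""
        urls = self._locations.setdefault(digital_id, [])
        if url not in urls:
            urls.append(url)

    def relocate(self, digital_id: str, old_url: str, new_url: str) -> None:
        """Replace one location with another, e.g. after a bulk transfer."""
        urls = self._locations[digital_id]
        urls[urls.index(old_url)] = new_url

    def resolve(self, digital_id: str) -> List[str]:
        """Return a copy of the current locations for this ID."""
        if digital_id not in self._locations:
            raise KeyError(f"unknown digital ID: {digital_id}")
        return list(self._locations[digital_id])


def locate_bam(resolver: DigitalIdResolver, digital_id: str) -> str:
    """Stand-in for analysis code: it only ever sees the digital ID."""
    return resolver.resolve(digital_id)[0]


if __name__ == "__main__":
    resolver = DigitalIdResolver()
    bam_id = "example-id/bam-0001"  # hypothetical ID format
    resolver.register(bam_id, "https://commons-a.example.org/objects/bam-0001")
    print(locate_bam(resolver, bam_id))

    # The object moves to another commons; locate_bam() is unchanged.
    resolver.relocate(bam_id,
                      "https://commons-a.example.org/objects/bam-0001",
                      "https://commons-b.example.org/objects/bam-0001")
    print(locate_bam(resolver, bam_id))
```

This mirrors the test on the slide: the data moves between commons, the resolver is updated, and the code that consumes the data does not change.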
![Page 35: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/35.jpg)
Requirement 2: Software stacks that scale to cyber Pods
• How can I add a rack of computing/storage/networking equipment to a cyber pod (that has a manifest) so that
– After attaching to power
– After attaching to network
– No other manual configuration is required
– The data services can make use of the additional infrastructure
– The compute services can make use of the additional infrastructure
• In other words, we need an open source software stack that scales to cyber pods.
• Think of data services that scale to cyber pods as “datapods.”
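The manifest-driven behavior asked for above can be sketched as a small model: a rack listed in the pod's manifest becomes usable by data and compute services as soon as it has power and network, with no further configuration step. The `RackSpec` and `CyberPod` classes and the capacity numbers are hypothetical illustrations, not an existing software stack.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass(frozen=True)
class RackSpec:
    """What the manifest says a rack contributes (assumed fields)."""
    cores: int
    storage_tb: int


@dataclass
class CyberPod:
    """Toy cyber pod: the manifest lists racks, 'active' tracks live ones."""
    manifest: Dict[str, RackSpec]
    active: Set[str] = field(default_factory=set)

    def attach(self, rack_id: str, powered: bool, networked: bool) -> bool:
        """Activate a rack once it has both power and network.

        Returns True if the rack is now in service. No other manual
        configuration is modeled, which is the point of the requirement.
        """
        if rack_id not in self.manifest:
            raise KeyError(f"rack {rack_id} is not in the pod manifest")
        if powered and networked:
            self.active.add(rack_id)
            return True
        return False

    def compute_cores(self) -> int:
        """Cores visible to the compute services."""
        return sum(self.manifest[r].cores for r in self.active)

    def storage_tb(self) -> int:
        """Capacity visible to the data services."""
        return sum(self.manifest[r].storage_tb for r in self.active)


if __name__ == "__main__":
    pod = CyberPod(manifest={
        "rack-01": RackSpec(cores=512, storage_tb=1000),
        "rack-02": RackSpec(cores=512, storage_tb=1000),
    })
    pod.attach("rack-01", powered=True, networked=True)
    pod.attach("rack-02", powered=True, networked=False)  # not yet cabled
    print(pod.compute_cores(), pod.storage_tb())
```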
![Page 36: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/36.jpg)
Core Services for a Biomedical Cloud
• On demand compute, either virtual machines or containers
• Access to data from commons or other cloud
Core Services for a Biomedical Data Commons
• Digital ID Service
• Metadata Service
• Object-based Storage (e.g. S3 compliant)
• Lightweight workflow that scales to a pod
• Pay as you go compute environments
![Page 37: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/37.jpg)
Common Services
• Authentication that uses InCommon or similar federation
• Authorization from third party (DACO, dbGaP)
• Access controls
• Infrastructure monitoring
• Infrastructure automation framework
• Security and compliance that scales
• Accounting and billing
![Page 38: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/38.jpg)
Requirement 3: Data Peering
• How can a critical mass of data commons support data peering so that a researcher at one of the commons can transparently access data managed by one of the other commons?
– We need to access data independent of where it is stored
– “Tier 1 data commons” need to pass research data and other community data at no cost
– We need to be able to transport large data efficiently “end to end” between commons
![Page 39: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/39.jpg)
Cloud 1
Data Commons 1
Data Commons 2
Data Peering
• Tier 1 Data Commons exchange data for the research community at no charge.
![Page 40: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/40.jpg)
Requirement 4: Data Portability
• We need a simple button that can export our data from one data commons and import it into another one that peers with it.
• We also need this to work for controlled access biomedical data.
• Think of this as an “Indigo Button” which safely and compliantly moves biomedical data between commons, similar to the HHS “Blue Button.”
![Page 41: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/41.jpg)
Requirement 5: Support Pay for Compute
• The final requirement is to support “pay for compute” over the data in the commons. Payments can be through:
– Allocations
– “Chits”
– Credit cards
– Data commons “condos”
– Joint grants
– etc.
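Whatever the funding source, “pay for compute” implies some form of metering against a balance. A minimal sketch follows, assuming usage is accounted in core-hours; the `ComputeAccount` class, its fields, and the numbers are hypothetical and do not describe an actual OCC billing system.

```python
from dataclasses import dataclass


@dataclass
class ComputeAccount:
    """Toy ledger for 'pay for compute' over data in a commons."""
    owner: str
    funding_source: str   # e.g. "allocation", "chit", "credit card"
    core_hours: float     # remaining balance, in core-hours (assumed unit)

    def charge(self, cores: int, hours: float) -> float:
        """Debit a job's usage and return the remaining balance."""
        if cores <= 0 or hours <= 0:
            raise ValueError("cores and hours must be positive")
        cost = cores * hours
        if cost > self.core_hours:
            raise ValueError("insufficient balance for this job")
        self.core_hours -= cost
        return self.core_hours


if __name__ == "__main__":
    account = ComputeAccount(owner="lab-a", funding_source="allocation",
                             core_hours=1000.0)
    remaining = account.charge(cores=16, hours=10)
    print(f"{account.owner} has {remaining:.0f} core-hours left")
```

Keeping the funding source as a plain label reflects the slide's point that the mechanism (allocation, chit, card, condo, grant) can vary while the metering stays the same.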
![Page 42: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/42.jpg)
6. OCC Global Distributed Data Commons
![Page 43: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/43.jpg)
The Open Cloud Consortium is prototyping interoperating and peering data commons throughout the world (Chicago, Toronto, Cambridge and Asia) using 10 and 100 Gbps research networks.
![Page 44: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/44.jpg)
Collect data and distribute files via ftp and apply data mining
Make data available via open APIs and apply data science
2000
2010-2015
2020-2025
Interoperate data commons, support data peering and apply ???
![Page 45: Clouds’and’Commons’for’the’DataIntensive’ Science’Community’delaat/pire/2015/... · Cloud1 Cloud3 DataCommons’’1’ Commons provide’datato’ other’commons’](https://reader034.vdocument.in/reader034/viewer/2022052015/602dd48a08cafb09a04d800e/html5/thumbnails/45.jpg)
Questions?
For more information: rgrossman.com @bobgrossman