Cyberinfrastructure for the 21st Century (CIF21): Data
MRI and STCI
EarthCube
CASC, Sept 9, 2011
Rob Pennington
Office of Cyberinfrastructure (OCI)
National Science Foundation
[email protected]
Framing the Challenge: Science and Society Transformed by Data
Modern science
• Data- and compute-intensive
• Integrative, multiscale
Multi-disciplinary Collaborations for Complexity
• Individuals, groups, teams, communities
Sea of Data
• Age of Observation
• Distributed, central repositories, sensor-driven, diverse, etc.
Advisory Committee for Cyberinfrastructure Task Force
Reports
Grand Challenges
Campus Bridging
Data and Viz
Cyberlearning
HPC (High Performance Computing)
Software
More than 25 workshops and Birds of a Feather sessions and more than 1300 people involved
Final recommendations presented to the NSF Advisory Committee on Cyberinfrastructure (ACCI) Dec 2010
Final reports on-line at: http://www.nsf.gov/od/oci/taskforces/
Data Task Force Recommendations
Infrastructure: Recognize data infrastructure and services (including visualization) as essential long term research assets fundamental to today’s science
Economic sustainability: Develop realistic cost models to underpin institutional/national business plans for research repositories/data services
Culture Change: Emphasize expectations for data sharing; support the establishment of new citation models in which data and software tool providers and developers are recognized and credited with their contributions
Data Management Guidelines: Identify and share best practices for the critical areas of data management
Ethics and IP: Train researchers in privacy-preserving data access
Evolution of Cyberinfrastructure for the 21st Century (CIF21) and Data
ACCI Data Task Force
National Science Board (NSB)
DataNet Program
Community Input
NSF CIF21 Data Programs
On-going input
Science & Engineering Research + Cyberinfrastructure
Discovery
Collaboration
Education
Maintainability, sustainability, and extensibility
Cyberinfrastructure Ecosystem (CIF21)
Organizations: Universities, schools; Government labs, agencies; Research and Medical Centers; Libraries, Museums; Virtual Organizations; Communities
Expertise: Research and Scholarship; Education; Learning and Workforce Development; Interoperability and operations; Cyberscience
Networking: Campus, national, international networks; Research and experimental networks; End-to-end throughput; Cybersecurity
Computational Resources: Supercomputers; Clouds, Grids, Clusters; Visualization; Compute services; Data Centers
Data: Databases, Data repositories; Collections and Libraries; Data Access; storage, navigation management, mining tools, curation, privacy
Scientific Instruments: Large Facilities, MREFCs, telescopes; Colliders, shake tables; Sensor Arrays (ocean, environment, weather, buildings, climate, etc.)
Software: Applications, middleware; Software development and support; Cybersecurity: access, authorization, authentication
CIF21: Four Major Thrust Areas
Data-Enabled Science
New Computational Resources
Community Research Networks
Access and Connections to CI Resources
Education: integral and embedded
Scientific Data Challenges
Figure: data volume in bytes per day (GigaBytes, TeraBytes, PetaBytes, ExaBytes) for 2012 and 2020.
Labeled sources: Genomics; LHC; Climate, Environment; TeraGrid, Blue Waters; LSST; Square Kilometer Array.
Other labels: Volume; Useful Lifetime; Distribution; Data Access; Many smaller datasets…
DataNet
Support data intensive and multi-disciplinary science
Provide reliable digital access, integration, management and preservation capabilities for science and engineering data over a decades-long timeline
Develop innovative data analysis and mining tools to support data manipulation, modeling, and discovery
Engage at the frontiers of technological innovation and transformative science to drive the leading edge forward
CIF21 Data Goals
DataNet Role in CIF21
DataNet is a strategic part of Foundation-wide investments in data in CIF21
• Focus on center-scale awards
DataNet efforts effectively balance:
• Production infrastructure to provide operational services
• Research to create next generation infrastructure
DataNet awards are partnerships
• Responsive to user communities to define their meaningful and useful scope
• Form a coordinated network to provide national, interdisciplinary data models and infrastructure
DataNet: A Multi-tiered and Multi-Disciplinary Landscape
Genomics Communities
Modeling and Simulation Communities
Population, Climate, Environment Communities
Data Curation
Data Storage
Data-enabled Science
DataNet supported
Data Storage
National storage infrastructure for scientific data
• Accommodate scale and heterogeneity of scientific data through robust, open, and broadly accepted standards
• Sustainable cost model that can be implemented with governmental, academic, nonprofit, and commercial stakeholders
Make strategic investments that:
• Leverage existing resources in TeraGrid, commercial clouds, federal data centers
• Meet growing capacity needs at optimum cost
• Provide coordinating and integrative functions for integrity, access control, availability, persistence
• Catalyze a national data infrastructure, in a role similar to the one NSFNet played for the Internet
Data Curation
Sustainable, community-based networks for management of critical scientific data resources in a life-cycle context.
Overcome challenges of culture change, policy development and implementation, sustainable operations, quality and usability control.
Strategic awards that address heterogeneity in formats, complexity, semantics of data collections that are valued by science communities of significant breadth.
Operate as a network of data services that promote interoperability, multidisciplinarity, and scalability.
Data Enabled Science
Provide critical tools and services for data mining, integration, analysis, modeling and visualization.
Overcome barriers to scaling, synthesis, and interoperability to promote effective use of large scale, shared data resources.
Strategic investments that concentrate tools, resources and expertise in support of compelling grand challenge science questions.
Cross Cutting Challenges
Balancing research into next generations of infrastructure with operation & maintenance of current capacity.
• Stimulate innovation and manage transitions
Sustainable, long term programs
• Technical design, development of business models, and integration with the research cycle.
Integration
• Vertical – Linking low-level bit storage infrastructure to data collections, and finally to applications
• Horizontal – Achieving connectivity and interoperability between activities that vary in scale, disciplinarity, and funding source.
DataNet Program Management
Life cycle perspective covering the use of the data
• Research, development, implementation, operations, sustainability, close-out
Apply project management methods
• WBS, risk management, change control, schedule, milestones, deliverables
Standardized process:
• Evaluate science merit, conceptual design
• Develop draft PEP, design and reporting metrics.
• Critical review – prototype, finalize baseline (approval/mid-course correction/off-ramp)
• Implementation & operations – subject to change control, oversight based on milestones & metrics
• Final operational review – informs decision for renewal, termination.
DataNet Federation Consortium
Data Driven Science
Implement national data grid
• Federate existing discipline-specific data management systems to enable national research collaborations
Enable collaborative research on shared data collections
• Manage collection life cycle as the user community broadens
Integrate “live” research data into education initiatives
• Enable student research participation through control policies
Project
Shared Collection
Processing Pipeline
Digital Library
Reference Collection
Federation
Collection Life Cycle
Cyber-infrastructure Partners: Univ. of North Carolina, Chapel Hill; Univ. of California, San Diego; Arizona State University; Drexel University; Duke University; University of Arizona; University of South Carolina
Science and Engineering Initiatives: Ocean Observatories Initiative; the iPlant Collaborative; CUAHSI; CIBER-U; Odum Social Science Institute; Temporal Dynamics of Learning Center
National Science Foundation Cooperative Agreement: OCI-0940841
Policy-based data management
MRI 2011
CUNY SI: Instrumentation for Enabling Data Analysis, Sharing, Storage, and Preservation
UC Boulder: Acquisition of a Scalable Petascale Storage Infrastructure for Data-Collections and Data-Intensive Discovery
RPI: Acquisition of a Balanced Environment for Simulation
NCA&T: Acquisition of a Complete High-Performance Modeling and Visualization System for Research in Mathematical Biology and Mathematical Geosciences
OSU: Acquisition of a High Performance Compute Cluster for Multidisciplinary Research
WHAT IS EARTHCUBE?
A Call to Action
Over the next decade, the geosciences community commits to developing a
framework to understand and predict responses of the Earth as a
system—from the space-atmosphere boundary to the Earth’s core,
including the influences of humans and ecosystems
Transitions and Tipping Points in Complex Environmental Systems, NSF AC for Environmental Research and Education, 2009
Earth Science and Applications from Space: National Imperatives for the Next Decade and Beyond, 2007
High-Performance Computing Requirements for the Computational Solid Earth Sciences, 2005
Goal of EarthCube
To transform the conduct of research in geosciences by supporting community-based cyberinfrastructure to integrate data and information for knowledge management across the Geosciences.
What Needs To Be Done?
Integrate data, tools and communities through cyberinfrastructure
Establish a governance mechanism that is inclusive and adopted by the community
Utilize current and emerging technologies to create transparent infrastructure for the geosciences community
Modes of Support
Convergence to a Unifying Architecture
EARTHCUBE ASSUMPTIONS
The geosciences community is ready to take on the EarthCube challenge
Community will start self-organizing prior to EarthCube activities, like the Nov 1-4 Charrette
Current and emerging technology will help achieve the convergence envisioned for EarthCube
A broad range of expertise and resources must be engaged to shape EarthCube
EARTHCUBE TIMELINE
Jun 2011: DCL Released
Jul-Sept 2011: Two WebEx events
Nov 1-4 2011: Charrette
Nov/11-Apr/12: Proposed Framework Approaches Developed through EAGERs
May 2012: Sandpit/Ideas Lab to determine 18 mo. prototype award(s)
On-line Community Information: August to November, 2011
EarthCube Charrette: Early November, 2011
EarthCube Ideas Lab: Tentatively Early May, 2012
Prototype Development: May to December 2013
Fully integrated geosciences infrastructure: 2014-2022
Pre-Charrette Organization (August – September)
Second WebEx on Aug. 22
NSF seeks input from wide range of sources
• Individuals, inst./org., representatives of scientific groups or communities
• Facilities and managers of CI endeavors
• Industry, Federal Labs., Federal Agencies, and International Partners
NSF will establish on-line resources and forums to
• Gather community inputs/requirements
• Facilitate partnerships and collaborations
• Encourage submission of approaches to the EarthCube design
Charrette Process
Stakeholders focus EarthCube Ideas and Activities
Plenary Sessions to
• discuss user requirements
• refine approaches and designs for EarthCube
• develop partnerships and new collaborations
Remote participation and real-time comments system will be available
Summary Session
• Comments from NSF, facilitators, and participants on process
• NSF provides guidance on post-Charrette activities
Questions?