support for the full e-experimentation cycle in the virtual laboratory infrastructure

Polish Infrastructure for Supporting Computational Science

in the European Research Space

Support for the Full e-Experimentation Cycle in the Virtual Laboratory Infrastructure

Piotr Nowakowski (1), Eryk Ciepiela (1), Tomasz Gubała (1), Maciej Malawski (1, 2), Marian Bubak (1, 2)

(1) ACC Cyfronet AGH, ul. Nawojki 11, 30-950 Kraków, Poland
(2) Institute of Computer Science AGH, Mickiewicza 30, 30-059 Kraków, Poland

KUKDM’10

Zakopane, 18-19 March 2010

Outline

Motivation

Problem definition

Scientific challenges

Iterative experimentation support

Experiment pipelines and traces

Sharing experiment data through Data Nets

Motivation: e-Science Experiments, Data and Publications

Reproducible experiments, provenance in e-Science

Need to link publications with primary data (experimental data, algorithms, software, workflows, scripts)

A plethora of scientific software: jobs, workflows, services, components, scripts, experiment plans

Huge amount of scientific data consumed and produced by e-Science: Earth and life sciences, HEP, etc.

Large number of publications makes research difficult: in Computer Science, DBLP contains more than 2^20 = 1,048,576 publications and PubMed stores ~17 million articles to date; see also the ACM Digital Library, ISI Web of Knowledge, Scopus, Citeseer, arXiv, Google Scholar

Emergence of the Web 2.0-based Scientific Social Community (SSC) model

Open Science & Science 2.0

New means of scientific communication: wikis, blogs, collaborative Web 2.0 technologies

New methods of conducting science: e-science, in-silico experiments, exploratory applications

Democratization of science

Increasing role of openness

Problem Definition

To construct a theoretical model facilitating open, collaborative e-experimentation, from experiment inception to publication of results, including primary scientific data

To develop a framework implementing the above model

To exploit the emerging solution in the context of existing HPC infrastructures and scientific collaboration

Scientific Challenges

Theoretical: A common method for referencing primary data (experimental data, algorithms, software, workflows, scripts) as part of publications should be developed and integrated with modern e-Science infrastructures

Technological: An integrated architecture for storing, annotating, publishing, referencing and reusing primary data sources. This architecture should span existing virtual laboratory and grid computing systems

Description of the Solution

Phase 1: Iterative experiment preparation

Phase 2: Experiment execution involving semantic storage of results and ensuring repeatability

Experimentation Pipeline

The process of developing an experiment begins with drafting its specification

This is followed by iteratively constructing an experiment plan

Each prototype is tested by a specific research community, using tools provided by the PL-Grid virtual laboratory

Upon completion of tests the experiment can be executed in a production mode

Obtained results can be published along with the experiment plan (i.e. a set of operations which enable reenactment and validation of a given experiment)
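
The pipeline stages listed above can be read as a small state machine in which testing may loop back to prototyping. The following is only a minimal sketch under that reading; the names `Stage`, `TRANSITIONS`, `Experiment` and `run_demo` are hypothetical and do not come from the PL-Grid virtual laboratory.

```python
from enum import Enum


class Stage(Enum):
    # Stage names are illustrative assumptions based on the pipeline description.
    SPECIFICATION = "specification"
    PROTOTYPE = "prototype"
    TESTING = "testing"
    PRODUCTION = "production"
    PUBLISHED = "published"


# Allowed transitions; TESTING may return to PROTOTYPE (iterative refinement).
TRANSITIONS = {
    Stage.SPECIFICATION: {Stage.PROTOTYPE},
    Stage.PROTOTYPE: {Stage.TESTING},
    Stage.TESTING: {Stage.PROTOTYPE, Stage.PRODUCTION},
    Stage.PRODUCTION: {Stage.PUBLISHED},
    Stage.PUBLISHED: set(),
}


class Experiment:
    """Hypothetical experiment record that tracks its pipeline stage."""

    def __init__(self, name):
        self.name = name
        self.stage = Stage.SPECIFICATION
        self.history = [Stage.SPECIFICATION]

    def advance(self, target):
        # Reject any move that the pipeline does not allow.
        if target not in TRANSITIONS[self.stage]:
            raise ValueError(
                f"cannot move from {self.stage.value} to {target.value}"
            )
        self.stage = target
        self.history.append(target)


def run_demo():
    # Two prototype/test iterations, then production run and publication.
    exp = Experiment("demo")
    for stage in (
        Stage.PROTOTYPE,
        Stage.TESTING,
        Stage.PROTOTYPE,
        Stage.TESTING,
        Stage.PRODUCTION,
        Stage.PUBLISHED,
    ):
        exp.advance(stage)
    return exp
```

Keeping the transition table separate from the `Experiment` class makes the permitted iteration loop explicit and easy to change.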

Experiment Traces

An experiment trace consists of the following:

any input data provided by the experiment enactor;

all steps performed in order to transform this data into publishable scientific results (chronologically arranged);

the documentation of the experiment plan, prepared by a domain scientist (in the form of annotations and comments).

The resulting trace will be easy to manage and read, much like weblog entries

Our VL system will enable enrichment of individual data elements with provenance information, linking them to appropriate stages of the experiment
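
One way to picture an experiment trace with provenance links is a chronological list of steps, each naming the data it consumed and produced. The sketch below assumes such a layout; `Step`, `ExperimentTrace`, `record` and `provenance` are hypothetical names chosen for illustration and are not the virtual laboratory's actual data model.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Step:
    """One recorded operation; all field names are illustrative assumptions."""
    operation: str
    inputs: list
    outputs: list
    comment: str = ""  # annotation written by the domain scientist
    timestamp: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class ExperimentTrace:
    inputs: dict                                        # data supplied by the enactor
    steps: list = field(default_factory=list)           # kept in chronological order
    documentation: list = field(default_factory=list)   # free-form plan notes

    def record(self, operation, inputs, outputs, comment=""):
        step = Step(operation, list(inputs), list(outputs), comment)
        self.steps.append(step)
        return step

    def provenance(self, item):
        """Return the steps, oldest first, that contributed to `item`."""
        chain = []
        wanted = {item}
        # Walk backwards so each matching step can add its inputs to the search.
        for step in reversed(self.steps):
            if wanted & set(step.outputs):
                chain.append(step)
                wanted |= set(step.inputs)
        return list(reversed(chain))
```

A possible use, with made-up data names:

```python
trace = ExperimentTrace(inputs={"raw.csv": "uploaded by enactor"})
trace.record("clean", ["raw.csv"], ["clean.csv"], comment="drop empty rows")
trace.record("fit", ["clean.csv"], ["model"])
print([s.operation for s in trace.provenance("model")])
```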

Sharing Primary Data: Data Nets

Data Net: unifies modern data storage mechanisms (relational databases, Grid-based file systems, Wiki pages etc.)

A Data Net is a group of data entities linked by named relationships. Such relationships impose a structure upon the dataset and facilitate querying for entities
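
The idea of entities linked by named relationships can be illustrated with a small in-memory graph. This is a sketch under assumptions only: the `DataNet` class, its methods and the example identifiers are hypothetical and say nothing about how Data Nets are actually implemented or stored.

```python
from collections import defaultdict


class DataNet:
    """Hypothetical in-memory model: entities plus named, directed links."""

    def __init__(self):
        self.entities = {}               # entity id -> attribute dict
        self.links = defaultdict(list)   # (source id, relation) -> target ids

    def add_entity(self, entity_id, **attributes):
        self.entities[entity_id] = attributes

    def link(self, source_id, relation, target_id):
        # Both ends must already be registered in this Data Net.
        if source_id not in self.entities or target_id not in self.entities:
            raise KeyError("both entities must exist before linking")
        self.links[(source_id, relation)].append(target_id)

    def related(self, source_id, relation):
        """Query: all entities reachable from `source_id` via `relation`."""
        return list(self.links.get((source_id, relation), []))


# Entities may live in different back ends; the net only records how they relate.
net = DataNet()
net.add_entity("exp1", kind="experiment")
net.add_entity("results.db", kind="relational database")
net.add_entity("notes", kind="wiki page")
net.link("exp1", "produced", "results.db")
net.link("exp1", "documented_by", "notes")
```

Here `net.related("exp1", "produced")` follows the named relationship from the experiment to its result store, which is the kind of query the relationships are meant to make easy.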


https://gs2.cyfronet.pl
