e-science and grid the vl-e approach

e-Science and GridThe VL-e approach

L.O. (Bob) Hertzberger Computer Architecture and Parallel Systems Group Department of Computer Science Universiteit van [email protected]

ContentDevelopments in GridDevelopments in e-ScienceObjectivesVirtual Lab for e-ScienceResearch philosophyConclusions

Background ICTpush developmentsProcessing power doubles every 18 monthMemory size doubles every 12 monthNetwork speed doubles every 9 monthSomething has to be done to harness this developmentVirtualization of ICT resources InternetWebGrid

Internal versus external bandwidthMbit/sComputer bussesnetworks

Chart1

19900.064

19954

1997155

1998622

20012500

200220000

1980

.0.0015

Chart2

19900.064

19954

1997155

1998622

20012500

200220000

1980

.0.0015

Chart3

0.0015128

0.064320

4640

1551056

6221056

25004224

200004224

networking

internal

Sheet1

TimeNetworkbus

19800.001512816

19900.06432040

1995464080

19971551056132

19986221056132

200125004224528

2002200004224528

Sheet2

Sheet3

Web& Grid & Web/Grid Services

Web services is a paradigm/way of using/accessing informationWeb resources are mostly human-centered (understood by humans, computers can read but cant understand)Grid is about accessing & sharing computing resources by virtualization Data & Information repositoriesExperimental facilitiesOGSA: Service oriented Grid standard based on Web services

Web & Grid & Semantic Web/Grid

Semantic Web resources should also be understandable by computersThis way, many complex tasks formulated and assigned by humans can be automated and executed by agents working on the Semantic WebBut, both Grid and Web Services only focus on single services.Semantic Web/Grid should be able to describe a single service as well as the relationship between services e.g. aggregate service using a number of services so making knowledge explicit

Levels of Grid abstractionComputational Grid Data GridInformation Web/Grid Knowledge Web/Grid

Background informationexperimental sciencesThere is a tendency to look ever deeper in:Matter e.g. PhysicsUniverse e.g. AstronomyLife e.g. Life sciencesTherefore experiments become increasingly more complexInstrumental consequences are increase in detector: Resolution & sensitivityAutomation & robotization Results for instance in life science in:So called high throughput methods Omics experimentation genome ===> genomics

New technologies in Life Sciences researchUniversity of AmsterdamcellMethodology/Technology

Paradigm shift in Life sciencePast experiments where hypothesis drivenEvaluate hypothesisComplement existing knowledge

Present experiments are data drivenDiscover knowledge from large amounts of dataApply statistical techniques

Background informationexperimental sciencesExperiments become increasingly more complexDriven by detector developmentsResolution increasesAutomation & robotization increasesResults in an increase in amount and complexity of data

The Application data crisisScientific experiments start to generate lots of datamedical imaging (fMRI): ~ 1 GByte per measurement (day)Bio-informatics queries:500 GByte per databaseSatellite world imagery: ~ 5 TByte/yearCurrent particle physics: 1 PByte per yearLHC physics (2007): 10-30 PByte per year

Data is often very distributed

Background informationexperimental sciencesExperiments become increasingly more complexDriven by detector developmentsResolution increasesAutomation & robotization increasesResults in an increase in amount and complexity of dataSomething has to be done to harness this developmentVirtualization of experimental resources: e-Science

The what of e-Sciencee-Science is the application domain Science of Grid & Web More than only coping with data explosionA multi-disciplinary activity combining human expertise & knowledge between:A particular domain scientistICT scientiste-Science demands a different approach to experimentation because computer is integrated part of experimentConsequence is a radical change in design for experimentatione-Science should apply and integrate Web/Grid methods where and whenever possible

Grid and Web ServicesConvergenceGridDefinition of Web Service Resource Framework(WSRF) makes explicit distinction between service and stateful entities acting upon service i.e. the resources Means that Grid and Web communities can move forward on a common base!!! Ref: FosterWeb

e-Science Objectives

It should enhance the scientific process by:Stimulating collaboration by sharing data & informationResult is re-use of data & information

The data sharing potential for CognitionCollaborative scientific researchInformation sharingMetadata modelingAllows for experiment validationIndependent confirmation of resultsStatistical methodologiesAccess to large collections of data and metadataTrainingTrain the next generation using peer reviewed publications and the associated data

Electron tomography data pipeline Acquisition Alignment Reconstruction Segmentation Interpretation

Cell Centred Database

NCIMR (San Diego)Maryann Martone and Mark Ellisman


It should enhance the scientific process by:Stimulating collaboration by sharing data & informationImprove re-use of data & informationCombing data and information from different modalitiesSensor data & information fusion

The anatomy of a painting: Scrutinizing Rembrandt in multiple dimensions

To metadatadatabase

VL chemical imaging data experiment

Experiment 1

Experiment 2

VL Exp.1

VL Exp.2

VL Result 1

VL Result 2

VL Exp.3 + 4

Metadata correlation analysis (CCA)

VL Result 3

VL Result 4

300mmx300mm256x256 pixels3-D Dataset

400mmx400mm64x64 pixels3-D Dataset

Metadata generation

EXP 1

EXP 1

VL EXP 1

VL EXP 2

Score images

Reconstructed MS

Score images

Reconstructed FTIR spectra

R=0.858

R=0.618

Pos

Pos

Neg

Neg

VL EXP 3

VL EXP 4

Improved FTIR spectra, S/N >100% increase


It should enhance the scientific process by:Stimulating collaboration by sharing data & informationImprove re-use of data & informationCombing data and information from different modalitiesSensor data & information fusionRealize the combination of real life & (model based) simulation experiments

Simulated Vascular Reconstructionin a Virtual Operating Theatre

An example of the combination of real life & (model based) simulation experiments patient specific vascular geometry (from CT)Segmentationblood flow simulation (Latice Bolzmann)Pre-operative planning (interaction)Suitable for parallelization through functional decompositionSimulated Fem-FembypassPatients vascular geometry (CTA)


It should enhance the scientific process by:Stimulating collaboration by sharing data & informationImprove re-use of data & informationCombing data and information from different modalitiesSensor data & information fusionRealize the combination of real life & (model based) simulation experimentsModeling of dynamic systems

Dynamicbird behaviourMODELSBird distributionsEnsemblesCalibration andData assimilationPredictions and on-linewarningsBird behaviour in relation to weather and landscape


It should result in :Computer aided support for rapid prototyping of ideasStimulate the creativity process

It should realize that by creating & applying:New ICT methodologies and a computing infrastructure stimulating thisFrom this ICT point of view it should support the following application steps:DesignDevelopment & realizationExecution Analysis & interpretationWe try to realize e-Science and their applications via the Virtual Lab for e-Science (VL-e) project

Virtual Lab for e-Science research Philosophy

Multidisciplinary research & the development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research

Managementof comm. & computingVL-e Application Oriented ServicesFood Informatics Dutch Telescience Medical Diagnosis & ImagingVL-e projectBio-Informatics

Data Insive Science/LOFARBio-Diversity Data intensive sciences

Two sides of Bioinformaticsas an e-Science

The scientific responsibility to develop the underlying computational concepts and models to convert complex biological data into useful biological and chemical knowledgeTechnological responsibility to manage and integrate huge amounts of heterogeneous data sources from high throughput experimentation

Role of bioinformaticscellmethodologybioinformatics

Virtual Lab for e-Science research PhilosophyMultidisciplinary research and development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research Problem solving partly generic and partly specificRe-use of components via generic solutions whenever possible

Managementof comm. & computing Managementof comm. & computing Managementof comm. & computingPotential GenericpartPotential GenericpartPotential GenericpartApplicationSpecificPartApplicationSpecificPartApplicationSpecificPartVirtual Laboratory Application Oriented Services

Generic e-Science aspects

Virtual Reality Visualization & user interfacesModeling & SimulationInteractive Problem SolvingData & information managementData modelingdynamic work flow managementContent (knowledge) managementSemantic aspectsMeta data modelingOntologiesWrapper technologyDesign for Experimentation

Virtual Lab for e-Science research PhilosophyMultidisciplinary research and development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research Problem solving partly generic and partly specificRe-use of components via generic solutions whenever possible Rationalization of experimental processReproducible & comparable

Issues for a reproducible scientific experimentinterpretationexperimentMuch of this is lost when an experiment is completed.

Scientific experiments & e-ScienceFor complex experiments: contain complex processes require interdisciplinary expertise need large scale resourceGrid & high level support

Components in a VL-e experimentExperiment TopologyGraphical representation of self-contained dataprocessing modules attached to each other in a workflowProcess-Flow Templates(PFT) Derived from ontologies Graphical representation of data elements and processing steps in an experimental procedure Information to support context-sensitive assistance (semantics)Study Descriptions of experimental steps represented as an instance of a PFT with references to experiment topologies

Step1: designing an experimentStep2: performing the experimentStep3: analyzing the experiment resultssuccessTaverna, Kepler, and Triana: model processes for computing tasks.In VL-e: model both computing tasks and human activity based processes, and model them from the perspective of an entire lifecycle. It tries to support Collaboration in different stages Information sharing Reuse of experimentPFT in VLAMGPFT instances in VLAMGProcess Flow Templatee-Science environment Scientific Workflow Management SystemsProcess Flow Template

Definition of experiment protocolsWorkflow definitionsRecreate complex experiments into process flowsWorkflow executionMaintain control over the experimentData processing executionTopologies of data processing modulesInterpretationVisualization of processed results to help intuition Ontology definitions can help in obtaining a well-structured definition of experiment, data and metadata.

Virtual Lab for e-Science research Philosophy

Multidisciplinary research and development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research Problem solving partly generic and partly specificRe-use of components via generic solutions whenever possible Rationalization of experimental processReproducible & comparableTwo research experimentation environmentsProof of concept for application experimentationRapid prototyping for computer & computational science experimentation

The VL-e infrastructureGrid MiddlewareSurfnetApplication specificserviceApplication Potential Generic service &Virtual Lab. servicesGrid &NetworkServicesVirtual LaboratoryVL-e Proof of Concept EnvironmentTelescience Medical ApplicationBio InformaticsApplicationsVL-e Experimental EnvironmentVirtual Lab.rapid prototyping(interactive simulation)Additional Grid Services(OGSA services)Network Service (lambda networking)VL-e Certification EnvironmentTest & Cert.CompatibilityTest & Cert.Grid MiddlewareTest & Cert.VL-software

Infrastructure for ApplicationsApplications are a driving force of the PoC

Experience shows applications value stability

Foster two-way interaction to make this happen

Taverna, Kepler, Triana and VLAM.Step1: designing an experimentStep2: performing the experimentStep3: analyzing the experiment resultssuccessPFTe-Science environment Scientific Workflow Management SystemsSWMSResearch activity to be developed in the Rapid Prototyping environmentStable developmentsto be used in VL-e in the Proof of Concept environment Research activity to be developed in the Rapid Prototyping environmentPFT

VL-e PoC environmentLatest certified stable software environment of core grid and VL-e servicesCore infrastructure built around clusters and storage at SARA and NIKHEF (production quality)Controlled extension to other platforms and distributionsOn the user end: install needed servers: user interface systems, storage elements for data disclosure, grid-secured DB accessFocus on stability and scalability

Hosted services for VL-eKey services and resources are offered centrally for all applications in VL-eMass data and number crunching on the large resources at SARAStorage for data replication & distributionPersistent strategic storage on tapeResource brokers, resource discovery, user group management

Why such a complex scheme?software is part of the infrastructurestability of core software needed to develop the new scientific applicationsenable distributed systems management (who runs what version when?)the grid is one big error amplifiercomputers make mistakes like humans, only much, much faster

What did we learn

It is not enough to just transport current applications to Grid or e-Science infrastructuresTo fully exploit the potential of e-Science infrastructures one has to learn what is possible Therefore the full lifecycle of an experiment has to be taken into accountWorkflow management is a first stepWe add semantic information via Process Flow TemplatesApplication innovation such as for instance bio-banking should be the aimGrids should be transparent for the end-user

Conclusions

e-Science is a lot more more than trying to cope with data explosion aloneImplementation of e-Science systems requires further rationalization and standardization of experimentation processe-Science success demands the realization of an environment allowing application driven experimentation & rapid dissemination of feed back of these new methodsWe try to do that via development of Proof of Concept based on Grid

Electron tomography data pipeline developmentTOM (Matlab) acquisition and data storage with the SRBSRB StorageWith the CCDB database for retrieval of experimental data setsGrid-computing 3D reconstruction refinements Currently for EMAN and later also for TOM (Matlab)SARA - Amsterdam1 Gbit/s

Here the role of bioinformatics has to be introducedData presentation essentialBy virtualizing experimental resources they can become part of a computing network necessary for computer experimentationMention the essential paradime shiftGenomics has to be realized via bioinformaticsCognition via sharing of data & databasingMention the problem of making data comparable and experiments reproducibleCombination in vitro/vivo with in silicoPet/fMRITranscriptomics/proteomicsHuntington example fMRICombination in vitro/vivo with in silicoPet/fMRITranscriptomics/proteomicsHuntington example fMRIVoorbeeld van nieuwe meer exactere modellen en de noodzaak tot interactier met je computing experimentFunctionele decompositie

De verschillende componenten lopen elk op hun eigen processor en communiceren de resultaten met elkaar.

Combination in vitro/vivo with in silicoPet/fMRITranscriptomics/proteomicsHuntington example fMRIMention the tension between the necessity of rationalization and the demand of flexibilizationOur approach keep application and supporting research togetherExample bioinformaticsBioinformatics & biogridInformation management intensive scienceMentioned first reqirement to come to system wide approach towards biology including modeling of cell processesMention the tension between the necessity of rationalization and the demand of flexibilization

How easy is all of this?

For the end-users, VL has a friendly user interface, which in theory will allow the end-users to perform all the tasks we have discussed so far without writing a single program or script. The PFT is completely graphical, present the needed information to the user on demand (usually by a simple click of the mouse button). It communicates with all the other software components of the VL toolkit and convert there output to a human readable form (filling forms, sending queries, composing a data process flow).

In Reality the end-user doesnt use the PFT itself it create an instance and tuned it. As I said one of the tasks to be specified is the data flow process: where data is coming from? Which process needs to be applied? On which resource the processes should be executed? And where the outcome is saved? Mostly this is represented in the PFT as simple box by just clicking in the box, the use is provided with another graphical interface, which allows him to compose the process flow of the d

e-science and grid the vl-e approach

Documents

single services

life sciencepast experiments

number of services

service oriented grid

data drivendiscover

sharing computing resources

escienceobjectivesvirtual

aggregate service