e-science and grid the vl-e approach

52
e-Science and Grid The VL-e approach L.O. (Bob) Hertzberger Computer Architecture and Parallel Systems Group Department of Computer Science Universiteit van Amsterdam [email protected]

Upload: aladdin-horton

Post on 02-Jan-2016

37 views

Category:

Documents


1 download

DESCRIPTION

e-Science and Grid The VL-e approach. L.O. (Bob) Hertzberger Computer Architecture and Parallel Systems Group Department of Computer Science Universiteit van Amsterdam [email protected]. Content. Developments in Grid Developments in e-Science Objectives Virtual Lab for e-Science - PowerPoint PPT Presentation

TRANSCRIPT

  • e-Science and GridThe VL-e approach

    L.O. (Bob) Hertzberger Computer Architecture and Parallel Systems Group Department of Computer Science Universiteit van [email protected]

  • ContentDevelopments in GridDevelopments in e-ScienceObjectivesVirtual Lab for e-ScienceResearch philosophyConclusions

  • Background ICTpush developmentsProcessing power doubles every 18 monthMemory size doubles every 12 monthNetwork speed doubles every 9 monthSomething has to be done to harness this developmentVirtualization of ICT resources InternetWebGrid

  • Internal versus external bandwidthMbit/sComputer bussesnetworks

    Chart1

    19900.064

    19954

    1997155

    1998622

    20012500

    200220000

    1980

    .0.0015

    Chart2

    19900.064

    19954

    1997155

    1998622

    20012500

    200220000

    1980

    .0.0015

    Chart3

    0.0015128

    0.064320

    4640

    1551056

    6221056

    25004224

    200004224

    networking

    internal

    Sheet1

    TimeNetworkbus

    19800.001512816

    19900.06432040

    1995464080

    19971551056132

    19986221056132

    200125004224528

    2002200004224528

    Sheet2

    Sheet3

  • Web& Grid & Web/Grid Services

    Web services is a paradigm/way of using/accessing informationWeb resources are mostly human-centered (understood by humans, computers can read but cant understand)Grid is about accessing & sharing computing resources by virtualization Data & Information repositoriesExperimental facilitiesOGSA: Service oriented Grid standard based on Web services

  • Web & Grid & Semantic Web/Grid

    Semantic Web resources should also be understandable by computersThis way, many complex tasks formulated and assigned by humans can be automated and executed by agents working on the Semantic WebBut, both Grid and Web Services only focus on single services.Semantic Web/Grid should be able to describe a single service as well as the relationship between services e.g. aggregate service using a number of services so making knowledge explicit

  • Levels of Grid abstractionComputational Grid Data GridInformation Web/Grid Knowledge Web/Grid

  • Background informationexperimental sciencesThere is a tendency to look ever deeper in:Matter e.g. PhysicsUniverse e.g. AstronomyLife e.g. Life sciencesTherefore experiments become increasingly more complexInstrumental consequences are increase in detector: Resolution & sensitivityAutomation & robotization Results for instance in life science in:So called high throughput methods Omics experimentation genome ===> genomics

  • New technologies in Life Sciences researchUniversity of AmsterdamcellMethodology/Technology

  • Paradigm shift in Life sciencePast experiments where hypothesis drivenEvaluate hypothesisComplement existing knowledge

    Present experiments are data drivenDiscover knowledge from large amounts of dataApply statistical techniques

  • Background informationexperimental sciencesExperiments become increasingly more complexDriven by detector developmentsResolution increasesAutomation & robotization increasesResults in an increase in amount and complexity of data

  • The Application data crisisScientific experiments start to generate lots of datamedical imaging (fMRI): ~ 1 GByte per measurement (day)Bio-informatics queries:500 GByte per databaseSatellite world imagery: ~ 5 TByte/yearCurrent particle physics: 1 PByte per yearLHC physics (2007): 10-30 PByte per year

    Data is often very distributed

  • Background informationexperimental sciencesExperiments become increasingly more complexDriven by detector developmentsResolution increasesAutomation & robotization increasesResults in an increase in amount and complexity of dataSomething has to be done to harness this developmentVirtualization of experimental resources: e-Science

  • The what of e-Sciencee-Science is the application domain Science of Grid & Web More than only coping with data explosionA multi-disciplinary activity combining human expertise & knowledge between:A particular domain scientistICT scientiste-Science demands a different approach to experimentation because computer is integrated part of experimentConsequence is a radical change in design for experimentatione-Science should apply and integrate Web/Grid methods where and whenever possible

  • Grid and Web ServicesConvergenceGridDefinition of Web Service Resource Framework(WSRF) makes explicit distinction between service and stateful entities acting upon service i.e. the resources Means that Grid and Web communities can move forward on a common base!!! Ref: FosterWeb

  • e-Science Objectives

    It should enhance the scientific process by:Stimulating collaboration by sharing data & informationResult is re-use of data & information

  • The data sharing potential for CognitionCollaborative scientific researchInformation sharingMetadata modelingAllows for experiment validationIndependent confirmation of resultsStatistical methodologiesAccess to large collections of data and metadataTrainingTrain the next generation using peer reviewed publications and the associated data

  • Electron tomography data pipeline Acquisition Alignment Reconstruction Segmentation Interpretation

  • Cell Centred Database

    NCIMR (San Diego)Maryann Martone and Mark Ellisman

  • e-Science Objectives

    It should enhance the scientific process by:Stimulating collaboration by sharing data & informationImprove re-use of data & informationCombing data and information from different modalitiesSensor data & information fusion

  • The anatomy of a painting: Scrutinizing Rembrandt in multiple dimensions

  • To metadatadatabase

    VL chemical imaging data experiment

    Experiment 1

    Experiment 2

    VL Exp.1

    VL Exp.2

    VL Result 1

    VL Result 2

    VL Exp.3 + 4

    Metadata correlation analysis (CCA)

    VL Result 3

    VL Result 4

  • 300mmx300mm256x256 pixels3-D Dataset

    400mmx400mm64x64 pixels3-D Dataset

    Metadata generation

    EXP 1

    EXP 1

    VL EXP 1

    VL EXP 2

  • Score images

    Reconstructed MS

    Score images

    Reconstructed FTIR spectra

    R=0.858

    R=0.618

    Pos

    Pos

    Neg

    Neg

    VL EXP 3

    VL EXP 4

    Improved FTIR spectra, S/N >100% increase

  • e-Science Objectives

    It should enhance the scientific process by:Stimulating collaboration by sharing data & informationImprove re-use of data & informationCombing data and information from different modalitiesSensor data & information fusionRealize the combination of real life & (model based) simulation experiments

  • Simulated Vascular Reconstructionin a Virtual Operating Theatre

    An example of the combination of real life & (model based) simulation experiments patient specific vascular geometry (from CT)Segmentationblood flow simulation (Latice Bolzmann)Pre-operative planning (interaction)Suitable for parallelization through functional decompositionSimulated Fem-FembypassPatients vascular geometry (CTA)

  • e-Science Objectives

    It should enhance the scientific process by:Stimulating collaboration by sharing data & informationImprove re-use of data & informationCombing data and information from different modalitiesSensor data & information fusionRealize the combination of real life & (model based) simulation experimentsModeling of dynamic systems

  • Dynamicbird behaviourMODELSBird distributionsEnsemblesCalibration andData assimilationPredictions and on-linewarningsBird behaviour in relation to weather and landscape

  • e-Science Objectives

    It should result in :Computer aided support for rapid prototyping of ideasStimulate the creativity process

    It should realize that by creating & applying:New ICT methodologies and a computing infrastructure stimulating thisFrom this ICT point of view it should support the following application steps:DesignDevelopment & realizationExecution Analysis & interpretationWe try to realize e-Science and their applications via the Virtual Lab for e-Science (VL-e) project

  • Virtual Lab for e-Science research Philosophy

    Multidisciplinary research & the development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research

  • Managementof comm. & computingVL-e Application Oriented ServicesFood Informatics Dutch Telescience Medical Diagnosis & ImagingVL-e projectBio-Informatics

    Data Insive Science/LOFARBio-Diversity Data intensive sciences

  • Two sides of Bioinformaticsas an e-Science

    The scientific responsibility to develop the underlying computational concepts and models to convert complex biological data into useful biological and chemical knowledgeTechnological responsibility to manage and integrate huge amounts of heterogeneous data sources from high throughput experimentation

  • Role of bioinformaticscellmethodologybioinformatics

  • Virtual Lab for e-Science research PhilosophyMultidisciplinary research and development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research Problem solving partly generic and partly specificRe-use of components via generic solutions whenever possible

  • Managementof comm. & computing Managementof comm. & computing Managementof comm. & computingPotential GenericpartPotential GenericpartPotential GenericpartApplicationSpecificPartApplicationSpecificPartApplicationSpecificPartVirtual Laboratory Application Oriented Services

  • Generic e-Science aspects

    Virtual Reality Visualization & user interfacesModeling & SimulationInteractive Problem SolvingData & information managementData modelingdynamic work flow managementContent (knowledge) managementSemantic aspectsMeta data modelingOntologiesWrapper technologyDesign for Experimentation

  • Virtual Lab for e-Science research PhilosophyMultidisciplinary research and development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research Problem solving partly generic and partly specificRe-use of components via generic solutions whenever possible Rationalization of experimental processReproducible & comparable

  • Issues for a reproducible scientific experimentinterpretationexperimentMuch of this is lost when an experiment is completed.

  • Scientific experiments & e-ScienceFor complex experiments: contain complex processes require interdisciplinary expertise need large scale resourceGrid & high level support

  • Components in a VL-e experimentExperiment TopologyGraphical representation of self-contained dataprocessing modules attached to each other in a workflowProcess-Flow Templates(PFT) Derived from ontologies Graphical representation of data elements and processing steps in an experimental procedure Information to support context-sensitive assistance (semantics)Study Descriptions of experimental steps represented as an instance of a PFT with references to experiment topologies

  • Step1: designing an experimentStep2: performing the experimentStep3: analyzing the experiment resultssuccessTaverna, Kepler, and Triana: model processes for computing tasks.In VL-e: model both computing tasks and human activity based processes, and model them from the perspective of an entire lifecycle. It tries to support Collaboration in different stages Information sharing Reuse of experimentPFT in VLAMGPFT instances in VLAMGProcess Flow Templatee-Science environment Scientific Workflow Management SystemsProcess Flow Template

  • Definition of experiment protocolsWorkflow definitionsRecreate complex experiments into process flowsWorkflow executionMaintain control over the experimentData processing executionTopologies of data processing modulesInterpretationVisualization of processed results to help intuition Ontology definitions can help in obtaining a well-structured definition of experiment, data and metadata.

  • Virtual Lab for e-Science research Philosophy

    Multidisciplinary research and development of related ICT infrastructureGeneric application supportApplication cases are drivers for computer & computational science and engineering research Problem solving partly generic and partly specificRe-use of components via generic solutions whenever possible Rationalization of experimental processReproducible & comparableTwo research experimentation environmentsProof of concept for application experimentationRapid prototyping for computer & computational science experimentation

  • The VL-e infrastructureGrid MiddlewareSurfnetApplication specificserviceApplication Potential Generic service &Virtual Lab. servicesGrid &NetworkServicesVirtual LaboratoryVL-e Proof of Concept EnvironmentTelescience Medical ApplicationBio InformaticsApplicationsVL-e Experimental EnvironmentVirtual Lab.rapid prototyping(interactive simulation)Additional Grid Services(OGSA services)Network Service (lambda networking)VL-e Certification EnvironmentTest & Cert.CompatibilityTest & Cert.Grid MiddlewareTest & Cert.VL-software

  • Infrastructure for ApplicationsApplications are a driving force of the PoC

    Experience shows applications value stability

    Foster two-way interaction to make this happen

  • Taverna, Kepler, Triana and VLAM.Step1: designing an experimentStep2: performing the experimentStep3: analyzing the experiment resultssuccessPFTe-Science environment Scientific Workflow Management SystemsSWMSResearch activity to be developed in the Rapid Prototyping environmentStable developmentsto be used in VL-e in the Proof of Concept environment Research activity to be developed in the Rapid Prototyping environmentPFT

  • VL-e PoC environmentLatest certified stable software environment of core grid and VL-e servicesCore infrastructure built around clusters and storage at SARA and NIKHEF (production quality)Controlled extension to other platforms and distributionsOn the user end: install needed servers: user interface systems, storage elements for data disclosure, grid-secured DB accessFocus on stability and scalability

  • Hosted services for VL-eKey services and resources are offered centrally for all applications in VL-eMass data and number crunching on the large resources at SARAStorage for data replication & distributionPersistent strategic storage on tapeResource brokers, resource discovery, user group management

  • Why such a complex scheme?software is part of the infrastructurestability of core software needed to develop the new scientific applicationsenable distributed systems management (who runs what version when?)the grid is one big error amplifiercomputers make mistakes like humans, only much, much faster

  • What did we learn

    It is not enough to just transport current applications to Grid or e-Science infrastructuresTo fully exploit the potential of e-Science infrastructures one has to learn what is possible Therefore the full lifecycle of an experiment has to be taken into accountWorkflow management is a first stepWe add semantic information via Process Flow TemplatesApplication innovation such as for instance bio-banking should be the aimGrids should be transparent for the end-user

  • Conclusions

    e-Science is a lot more more than trying to cope with data explosion aloneImplementation of e-Science systems requires further rationalization and standardization of experimentation processe-Science success demands the realization of an environment allowing application driven experimentation & rapid dissemination of feed back of these new methodsWe try to do that via development of Proof of Concept based on Grid

  • Electron tomography data pipeline developmentTOM (Matlab) acquisition and data storage with the SRBSRB StorageWith the CCDB database for retrieval of experimental data setsGrid-computing 3D reconstruction refinements Currently for EMAN and later also for TOM (Matlab)SARA - Amsterdam1 Gbit/s

    Here the role of bioinformatics has to be introducedData presentation essentialBy virtualizing experimental resources they can become part of a computing network necessary for computer experimentationMention the essential paradime shiftGenomics has to be realized via bioinformaticsCognition via sharing of data & databasingMention the problem of making data comparable and experiments reproducibleCombination in vitro/vivo with in silicoPet/fMRITranscriptomics/proteomicsHuntington example fMRICombination in vitro/vivo with in silicoPet/fMRITranscriptomics/proteomicsHuntington example fMRIVoorbeeld van nieuwe meer exactere modellen en de noodzaak tot interactier met je computing experimentFunctionele decompositie

    De verschillende componenten lopen elk op hun eigen processor en communiceren de resultaten met elkaar.

    Combination in vitro/vivo with in silicoPet/fMRITranscriptomics/proteomicsHuntington example fMRIMention the tension between the necessity of rationalization and the demand of flexibilizationOur approach keep application and supporting research togetherExample bioinformaticsBioinformatics & biogridInformation management intensive scienceMentioned first reqirement to come to system wide approach towards biology including modeling of cell processesMention the tension between the necessity of rationalization and the demand of flexibilization

    How easy is all of this?

    For the end-users, VL has a friendly user interface, which in theory will allow the end-users to perform all the tasks we have discussed so far without writing a single program or script. The PFT is completely graphical, present the needed information to the user on demand (usually by a simple click of the mouse button). It communicates with all the other software components of the VL toolkit and convert there output to a human readable form (filling forms, sending queries, composing a data process flow).

    In Reality the end-user doesnt use the PFT itself it create an instance and tuned it. As I said one of the tasks to be specified is the data flow process: where data is coming from? Which process needs to be applied? On which resource the processes should be executed? And where the outcome is saved? Mostly this is represented in the PFT as simple box by just clicking in the box, the use is provided with another graphical interface, which allows him to compose the process flow of the d