
Querying Large Physics Data Sets Over an Information Grid

How can we ensure that we have adequate information available over Grids to resolve our physics queries ?

CHEP’2001 Beijing, China. 3rd-6th September 2001.

CRISTAL Project Team 2

Contents

• Data, Information & Knowledge Grids

• Distributed queries and workflow management

• Sub-query tracking and analysis processing

• Multi-layered systems and information abstraction

• Mapping description-driven workflow services onto Grids

• Conclusions, future activities.


Layered Grid Technologies

• Data Grid: warehousing, distributed databases, streaming, near-line storage, large objects, access mechanisms, data staging…

• Information Grid: metadata, middleware, intelligent retrieval, information modelling, warehousing, workflow…

• Knowledge Grid: data mining, visualisation, simulation, problem-solving methods/environments…

Data control and data abstraction link the layers; software component reuse and design reuse apply across all of them.


Computing Research Challenges

Research challenges across the Data, Information and Knowledge Grid layers include:

• Metadata, ontologies, controlled vocabularies…

• Semi-structured data, information modelling and knowledge representation…

• Data transformation & tracking…

• Multi-context and evolving data…

• Automated integration…

• Intelligent agents, resource discovery…

• Knowledge discovery, prediction, machine learning…

The overall aim: support creative discovery of knowledge and information via fast, ubiquitous, universal and homogeneous access to heterogeneous assets.


Metadata Requirements

• Used to describe information for:
– data integration
– informing the analysis process, e.g. navigation/location
– allowing system flexibility and evolution
– cataloguing collections of data
– providing mechanisms for access/security control

• More than just summary data: metadata is active and changes with system usage.

• Need to trace how data (and metadata) evolve.


HEP Query Analysis

• Very large amounts of data to be collected (petabytes)

• Physicists must have access to all the detector data
– raw data, events, runs, calibration data, simulation data

• System must be flexible, configurable and scalable

• Physicists need to access multiple sources of (and multiple versions of) data and algorithms

• Physicists must be able to carry out their own analysis on their own workstations

• Traceability of data and queries is therefore crucial.


Multi-system Solution

[Diagram: regional centres at FNAL, DESY, IN2P3 and CERN, plus the Experiment/CERN site, connected by data routes. Each holds shared data & meta-data and analysis-specific data, with synchronisation required between sites and analysis workstations attached to the centres.]

Drawbacks of this approach:

• Data synchronisation

• Complex versioning

• Data replication

• Large network bandwidth required

• Data redundancy


A Single Logical System

• Data management and physics algorithms are separately located and their workflows managed

• Algorithms are provided by physicists and executed in the regional centres where the data reside

• Multiple versions of the same algorithm can coexist

• Data location must be transparent to the physicist

• A local query implies distributed processing

• Query results are returned immediately when the algorithm has previously been executed


Single Logical System Solution

[Diagram: the same regional centres (FNAL, DESY, IN2P3, CERN and Experiment/CERN), each holding shared data & meta-data and analysis-specific data. Local queries (LocalQuery1, LocalQuery2) submitted from analysis workstations are propagated across the centres and their results (QueryResult1, QueryResult2) returned.]

• Knowledge is stored alongside data

• Active (meta-)objects manage the various versions of data and algorithms

• Smaller network bandwidth required


Query/analysis processing

• 0: Physicist develops and registers algorithm

• 1: Physicist submits query locally

• 2: Query Handler decomposes the query and locates the data

• 3: If the algorithm has been previously executed, results are returned immediately

• 4: Algorithms are executed where their data resides

• 5: Results returned to Query Handler for presentation to and further analysis by the physicist
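The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `QueryHandler` class, the algorithm registry and the site catalogue are all hypothetical names, not CRISTAL's API.

```python
class QueryHandler:
    """Sketch of the query/analysis steps above; all names are illustrative."""

    def __init__(self, algorithms, sites):
        self.algorithms = algorithms  # step 0: registered algorithms, name -> callable
        self.sites = sites            # data location catalogue: site -> local data
        self.cache = {}               # results of previously executed sub-queries

    def submit(self, algorithm, query):
        # Step 2: decompose the query into one sub-query per site holding data.
        results = []
        for site, data in self.sites.items():
            key = (algorithm, site, query)
            if key not in self.cache:
                # Step 4: execute the algorithm where its data resides.
                self.cache[key] = self.algorithms[algorithm](data, query)
            # Step 3: previously executed results are returned without re-running.
            results.append(self.cache[key])
        # Step 5: sub-results are gathered for presentation to the physicist.
        return results
```

Submitting the same query twice illustrates step 3: the second submission is served entirely from the cache, with no algorithm execution at the sites.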


How does this map onto Grids?

[Diagram: a layered Grid architecture, with a highlighted "data and query tracking" region spanning the toolkit and middleware layers.]

• Applications: chemistry, biology, cosmology, high energy physics, environment…

• Application toolkits: distributed computing, data-intensive applications, collaborative applications, remote visualization, problem solving, remote instrumentation

• Middleware services: resource-independent and application-independent services, e.g. authentication, authorization, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection

• Grid fabric: resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory services, OS bypass


EU Data Grid Project

[Diagram: the EU DataGrid project architecture, with the data and query tracking region highlighted.]


Grid Anatomy

[Diagram: the layered Grid protocol architecture set alongside the Internet protocol stack.]

Grid layers: Application, Collective, Resource, Connectivity, Fabric.

Internet protocol layers: Application, Transport, Internet, Link.


Fabric & Connectivity

Fabric layer:

• Provides the resources on which shared access happens

• Implements local, resource-specific operations

• Resources should implement enquiry and resource-management mechanisms

Connectivity layer:

• Authentication protocols are required for Grid-specific network transactions, providing:
– single sign-on
– delegation
– integration with various local security solutions
– user-based trust relationships


Resource & Collective Services

Resource layer:

• Provides protocols for the initiation, monitoring and control of operations on shared resources
– information protocols, management protocols

Collective services:

• Contain APIs that are global in nature and capture interactions across collections of resources

• Implementations are based on Resource-layer protocols

• Implement a wide variety of sharing behaviours
– directory services, co-allocation, scheduling, brokering services, etc.

Metadata & traceability are essential features of collective services.


The Role of Scientific Workflow Management

• Manage and control Grid resource complexity
– by describing tasks, steps and activities

• Examples:
– Application management
• application resource-requirement descriptions
• application task descriptors
– Data traceability
• describe algorithm/application execution steps
• describe and manage data sets and versions

• “Handling data on a grid should be part of a workflow” (GGF Grid Computing Environments working group)
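The descriptions above (tasks, steps, data sets, versions) can be sketched as plain data structures whose execution leaves a trace. The `Step` and `Workflow` classes and their fields are illustrative assumptions, not the CRISTAL workflow engine's model.

```python
from dataclasses import dataclass, field

@dataclass
class Step:
    """One workflow step: a task description naming the algorithm,
    the data set it runs on and the version used (fields illustrative)."""
    name: str
    algorithm: str
    dataset: str
    version: str

@dataclass
class Workflow:
    """Executing the workflow records what ran with which versions:
    the data-traceability role described above."""
    steps: list
    trace: list = field(default_factory=list)

    def run(self, execute):
        results = []
        for step in self.steps:
            results.append(execute(step))  # run the described task
            self.trace.append((step.name, step.algorithm, step.version))
        return results
```

After a run, the `trace` list answers "which algorithm version produced this result?" without consulting the executor.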


Current Grid Workflow Activities

• Global Grid Forum, Grid Computing Environments Working Group:
– Mississippi Computational Portal (web/XML based)
– GALE: end-to-end automation of the analyst’s workflow, also web/XML based

• Workflow Management for the Cosmology Collaboratory, Lawrence Berkeley National Laboratory (Stewart Loken, CHEP 2001 paper 10-036, see proceedings)

• Other projects?


MetaData & the OMG UML Model

• Meta-metamodel: the infrastructure for a metamodelling architecture; defines the language for specifying metamodels.
Examples: MetaClass, MetaAttribute, MetaOperation

• Metamodel: an instance of a meta-metamodel; defines the language for specifying a model.
Examples: Class, Attribute, Operation, Component

• Model: an instance of a metamodel; defines a language to describe an information domain.
Examples: StockShare, askPrice, sellLimitOrder, StockQuoteServer

• User objects (user data): an instance of a model; defines a specific information domain.
Examples: <Acme_Software_Share_98789>, 654.56, sell_limit_order, <Stock_Quote_Svr_32123>

Each layer is an instance of the layer above it: the “is-an-instance-of” data abstraction.
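Python’s own object model gives a compact, if loose, analogue of this layered “instance-of” chain: a metaclass plays the metamodel role, a class built from it is the model, and an object of that class is the user data. The names below follow the stock-market example in the table but are otherwise invented for illustration.

```python
# Metamodel layer: a metaclass defines what "a model element" is.
class MetaClass(type):
    pass

# Model layer: an instance of the metamodel, describing a domain concept.
class StockShare(metaclass=MetaClass):
    def __init__(self, ask_price):
        self.ask_price = ask_price

# User-object layer: an instance of the model -- the actual data.
share = StockShare(654.56)

# Each layer is an instance of the one above it:
assert isinstance(share, StockShare)      # user object -> model
assert isinstance(StockShare, MetaClass)  # model -> metamodel
assert isinstance(MetaClass, type)        # metamodel -> meta-metamodel (type)
```

The final assertion shows why the stack bottoms out: in Python, `type` is its own meta-metamodel, just as the OMG meta-metamodel is self-describing.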


A Description-Driven Model

[Diagram: two orthogonal abstractions. Vertical abstraction: an instance is an instance of a model, which is an instance of the meta-model layer. Horizontal abstraction: base-level data are described by meta-level meta-data; likewise the model is described by model meta-data, itself an instance of meta-data at the meta-model layer.]


Conclusions on MetaData

Meta-objects:

• Provide flexibility & reusability of definitions

• Handle complexity in large scale systems

• Allow co-existence of multiple versions of data

• Minimise effects of system evolution

• Provide ‘hooks’ for interoperability

• Can be queried for data navigation

--> the basis of Grid workflow collective services.


Future Activities

• Members of the GGF’s Grid Computing Environment working group (workflow interest group)

• OMG members (invited to the OMG Grids Workshop, Boston)

• E-science UK-sponsored CERN Fellowship to work alongside CMS physicists on workflow management

• E-science funding for UK computer scientists to work with the EU DataGrid project

• E-science Generic Middleware open call

• Continue to develop the CRISTAL workflow engine.


Reference papers

• F. Estrella, Z. Kovacs, J-M Le Goff & R. McClatchey, “Model and Information Abstraction for Description-Driven Systems”. Accepted paper at this conference (ID 8-053).

• F. Estrella, “Objects, Patterns and Descriptions in Data Management”. PhD thesis, University of the West of England, Bristol, England, December 2000.

• J. Draskic et al., “Using a Meta-Model as the Basis for Enterprise-Wide Data Navigation”. Proc. of the 3rd IEEE Meta-Data Conference, Bethesda, Maryland, USA, April 1999. Available at http://computer.org/conferen/proceed/meta/1999/

• J-M Le Goff et al., “Design Patterns for Description-Driven Systems”. CHEP 2001; Computer Physics Communications, in print; also CMS NOTE 1999_045.

• N. Baker et al., “Component-Based Approach to Scientific Workflow Management”. ACAT’2000 conference, FermiLab, October 2000; also CMS NOTE 2001_024.