Querying Large Physics Data Sets Over an Information Grid
How can we ensure that we have adequate information available over Grids to resolve our physics queries?
CHEP’2001 Beijing, China. 3rd-6th September 2001.
CRISTAL Project Team 2
Contents
• Data, Information & Knowledge Grids
• Distributed queries and workflow management
• Sub-query tracking and analysis processing
• Multi-layered systems and information abstraction
• Mapping description-driven workflow services onto Grids
• Conclusions, future activities.
CRISTAL Project Team 3
Layered Grid Technologies
[Figure: three layered grids (Data Grid, Information Grid, Knowledge Grid), with data control increasing towards the Data Grid and data abstraction increasing towards the Knowledge Grid]
• Data Grid: warehousing, distributed databases, streaming, near-line storage, large objects, access mechanisms, data staging…
• Information Grid: metadata, middleware, intelligent retrieval, information modelling, warehousing, workflow…
• Knowledge Grid: data mining, visualisation, simulation, problem-solving methods/environments…
• Across all layers: software component reuse, design reuse
CRISTAL Project Team 4
Computing Research Challenges
[Figure: the same layered grids (Data Grid, Information Grid, Knowledge Grid, with the data control and data abstraction axes), annotated with research challenges]
• Metadata, ontologies, controlled vocabularies…
• Semi-structured data, information modelling and knowledge representation…
• Data transformation & tracking…
• Multi-context and evolving data…
• Automated integration…
• Intelligent agents, resource discovery…
• Knowledge discovery, prediction, machine learning…
Goal: support creative discovery of knowledge/information via fast, ubiquitous, universal and homogeneous access to heterogeneous assets.
CRISTAL Project Team 5
Metadata Requirements
• Used to describe information for:
– data integration
– informing the analysis process, e.g. navigation/location
– allowing system flexibility and evolution
– cataloguing collections of data
– providing mechanisms for access/security control
• More than just summary data: metadata is active and changes with system usage.
• Need to trace how data (and metadata) evolve.
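The "active" metadata idea above can be sketched as a record that versions itself on every change, so earlier states stay traceable. This is a minimal in-memory illustration only; the `MetaRecord` name and its fields are hypothetical, not CRISTAL code.

```python
class MetaRecord:
    """Metadata that evolves with system usage and keeps its own history."""

    def __init__(self, name, attrs):
        self.name = name
        self.version = 1
        self.attrs = dict(attrs)
        self.history = [(1, dict(attrs))]  # trace of (version, attributes)

    def update(self, **changes):
        """Record a new version instead of overwriting in place."""
        self.attrs.update(changes)
        self.version += 1
        self.history.append((self.version, dict(self.attrs)))


rec = MetaRecord("calibration-set", {"detector": "ECAL", "run": 4711})
rec.update(run=4712)                       # metadata changes with usage...
assert rec.version == 2
assert rec.history[0][1]["run"] == 4711    # ...but earlier versions stay traceable
```

Keeping the history inside the record itself is what makes the metadata "active" rather than a static summary.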
CRISTAL Project Team 6
HEP Query Analysis
• Very large amount of data to be collected (petabytes)
• Physicists must have access to all the detector data
– raw data, events, runs, calibration data, simulation data
• The system must be flexible, configurable and scalable
• Physicists need to access multiple sources of (multiple versions of) data and algorithms
• Physicists must be able to carry out their own analyses on their own workstations
• Traceability of data and queries is therefore crucial.
CRISTAL Project Team 7
Multi-system Solution
[Diagram: analysis workstations linked to regional centres at FNAL, DESY, IN2P3 and CERN and to the Experiment/CERN, showing data locations and routes, shared data & meta-data, analysis-specific data, and where synchronisation is required]
Drawbacks of this approach:
• Data synchronisation
• Complex versioning
• Data replication
• Large network bandwidth required
• Data redundancy
CRISTAL Project Team 8
A Single Logical System
• Data management and physics algorithms are separately located and their workflows are managed
• Algorithms are provided by physicists and executed in the regional centres where the data reside
• Multiple versions of the same algorithm can coexist
• Data location must be transparent to the physicist
• A local query implies distributed processing
• The query result is returned immediately when the algorithm has previously been executed
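The coexistence of multiple algorithm versions can be sketched as a registry keyed by (name, version). This is an illustrative sketch, not CRISTAL code; `AlgorithmRegistry` and its methods are assumed names.

```python
class AlgorithmRegistry:
    """Toy registry in which several versions of one algorithm coexist."""

    def __init__(self):
        self._algos = {}  # (name, version) -> callable

    def register(self, name, version, func):
        self._algos[(name, version)] = func

    def get(self, name, version=None):
        """Fetch a pinned version, or default to the latest one registered."""
        if version is not None:
            return self._algos[(name, version)]
        latest = max(v for (n, v) in self._algos if n == name)
        return self._algos[(name, latest)]


reg = AlgorithmRegistry()
reg.register("energy-sum", 1, lambda hits: sum(hits))
reg.register("energy-sum", 2, lambda hits: sum(h for h in hits if h > 0))

assert reg.get("energy-sum", 1)([1, -2, 3]) == 2   # old version still runs
assert reg.get("energy-sum")([1, -2, 3]) == 4      # latest version by default
```

Pinning an exact version in a query is what makes a past analysis reproducible even after the algorithm has evolved.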
CRISTAL Project Team 9
Single Logical System Solution
[Diagram: the same regional centres (FNAL, DESY, IN2P3, CERN and Experiment/CERN) and analysis workstations; local queries (LocalQuery1, LocalQuery2) fan out from the workstations to the centres, and results (QueryResult 1, QueryResult 2) are returned]
• Knowledge is stored alongside data
• Active (meta-)objects manage various versions of data and algorithms
• Smaller network bandwidth required
CRISTAL Project Team 10
Query/analysis processing
• 0: Physicist develops and registers an algorithm
• 1: Physicist submits a query locally
• 2: The Query Handler decomposes the query and locates the data
• 3: If the algorithm has previously been executed, results are returned immediately
• 4: Algorithms are executed where the data reside
• 5: Results are returned to the Query Handler for presentation to, and further analysis by, the physicist
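The numbered steps can be sketched as follows: decompose the query per data site, return a cached result if the algorithm already ran there, otherwise execute it where the data reside. A hedged sketch under simple assumptions; `QueryHandler` and `data_sites` are illustrative names, not the actual CRISTAL interfaces.

```python
class QueryHandler:
    """Toy handler for steps 1-5 of the query/analysis processing slide."""

    def __init__(self, data_sites):
        self.data_sites = data_sites  # site name -> local data partition
        self.cache = {}               # (algo name, site) -> prior result

    def submit(self, algo_name, algo):
        results = {}
        for site, data in self.data_sites.items():  # step 2: locate the data
            key = (algo_name, site)
            if key not in self.cache:               # step 3: reuse prior runs
                self.cache[key] = algo(data)        # step 4: run where data is
            results[site] = self.cache[key]
        return results                              # step 5: present results


sites = {"CERN": [1, 2], "FNAL": [3, 4, 5]}
qh = QueryHandler(sites)
out = qh.submit("count-events", len)
assert out == {"CERN": 2, "FNAL": 3}
assert qh.submit("count-events", len) == out  # repeat query answered from cache
```

Only per-site results travel back to the workstation, which is why the single-logical-system approach needs less network bandwidth than shipping the data.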
CRISTAL Project Team 11
How does this map onto Grids?
[Figure: layered Grid architecture, with the data and query tracking region highlighted across the layers]
• Applications: chemistry, biology, cosmology, high energy physics, environment
• Application Toolkits: distributed computing, data-intensive applications, collaborative applications, remote visualization applications, problem solving applications, remote instrumentation applications
• Middleware Services: resource-independent and application-independent services, e.g. authentication, authorization, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection
• Grid Fabric: resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass
CRISTAL Project Team 13
Grids Anatomy
[Figure: the Grid protocol layers (Application, Collective, Resource, Connectivity, Fabric) shown side by side with the Internet protocol stack (Application, Transport, Internet, Link)]
CRISTAL Project Team 14
Fabric & Connectivity
Fabric Layer:
• Provides the resources on which shared access will happen
• Implements local, resource-specific operations
• Resources should implement enquiry and resource-management mechanisms
Connectivity Layer:
• Authentication protocols required for Grid-specific network transactions, providing:
– single sign-on
– delegation
– integration with various local security solutions
– user-based trust relationships
CRISTAL Project Team 15
Resource & Collective Services
Resource Layer:
• Provides protocols for the initiation, monitoring and control of operations on shared resources
– information protocols, management protocols
Collective Services:
• Contain APIs which are global in nature and capture interactions across collections of resources
• Implementations are based on Resource Layer protocols
• Implement a wide variety of sharing behaviours
– directory services, co-allocation, scheduling, brokering services etc.
Metadata & traceability are essential features of Collective Services.
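The role of metadata and traceability at the collective layer can be illustrated with a toy directory service that records which resources hold which data sets and keeps an audit trace of every lookup. A sketch only; `DirectoryService` is an assumed name, not a real Grid API.

```python
class DirectoryService:
    """Toy collective-layer directory: resource location plus a lookup trace."""

    def __init__(self):
        self.index = {}  # data-set name -> list of resource names
        self.trace = []  # audit log of lookups, for traceability

    def advertise(self, resource, dataset):
        """A resource announces that it holds a copy of a data set."""
        self.index.setdefault(dataset, []).append(resource)

    def locate(self, dataset):
        """Global lookup across all advertised resources, recorded in the trace."""
        self.trace.append(("locate", dataset))
        return self.index.get(dataset, [])


ds = DirectoryService()
ds.advertise("CERN-RC", "run-4711-raw")
ds.advertise("FNAL-RC", "run-4711-raw")
assert ds.locate("run-4711-raw") == ["CERN-RC", "FNAL-RC"]
assert ds.trace == [("locate", "run-4711-raw")]
```

The index is the metadata; the trace is what lets later analyses reconstruct which replicas a query actually saw.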
CRISTAL Project Team 16
The Role of Scientific WFM
• Manage and control Grid resource complexity
– by describing tasks, steps and activities
• Examples:
– Application Management
• application resource-requirement descriptions
• application task descriptors
– Data Traceability
• describe algorithm/application execution steps
• describe and manage data sets and versions
• "Handling data on a grid should be part of a workflow" (GGF Grid Computing Environments working group)
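The description-driven idea above, in which tasks, steps and data-set versions are themselves data, can be sketched with plain task descriptors. A minimal illustration; the `Task` and `Workflow` shapes are assumptions, not the CRISTAL workflow model.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    """One described step of a scientific workflow."""
    name: str
    algorithm: str          # which registered algorithm this step runs
    algorithm_version: int  # pinned version, for data traceability
    inputs: list
    outputs: list


@dataclass
class Workflow:
    name: str
    tasks: list = field(default_factory=list)

    def trace(self):
        """Execution steps recovered from the descriptions, not from logs."""
        return [(t.name, t.algorithm, t.algorithm_version) for t in self.tasks]


wf = Workflow("reco", [
    Task("calibrate", "ecal-calib", 2, ["raw"], ["calibrated"]),
    Task("reconstruct", "track-fit", 1, ["calibrated"], ["events"]),
])
assert wf.trace() == [("calibrate", "ecal-calib", 2),
                      ("reconstruct", "track-fit", 1)]
```

Because each step names its algorithm version and data sets explicitly, the workflow description doubles as the traceability record.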
CRISTAL Project Team 17
Current Grid Workflow Activities
• Global Grid Forum – Grid Computing Environments Working Group
– Mississippi Computational Portal
• web/XML based
– GALE
• end-to-end automation of the analyst's workflow, also web/XML based
• Workflow Management for Cosmology Collaboratory
– Lawrence Berkeley National Laboratory (Stewart Loken, CHEP 2001 paper 10-036, see proceedings)
• Other projects?
CRISTAL Project Team 18
MetaData & OMG UML Model

Layer: Meta-metamodel
Description: The infrastructure for a metamodelling architecture. Defines the language for specifying metamodels.
Example: MetaClass, MetaAttribute, MetaOperation

Layer: Metamodel
Description: An instance of the meta-metamodel. Defines the language for specifying a model.
Example: Class, Attribute, Operation, Component

Layer: Model
Description: An instance of a metamodel. Defines a language to describe an information domain.
Example: StockShare, askPrice, sellLimitOrder, StockQuoteServer

Layer: User objects (user data)
Description: An instance of a model. Defines a specific information domain.
Example: <Acme_Software_Share_98789>, 654.56, sell_limit_order, <Stock_Quote_Svr_32123>

Each layer is an instance of the layer above (the "is-an-instance-of" data abstraction).
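The four-layer "is-an-instance-of" chain can be illustrated with Python's own object model, using the table's StockShare example. This is an analogy, not the OMG MOF itself, and Python slightly collapses the top: `type` plays the meta-metamodel role and is an instance of itself.

```python
class MetaClass(type):
    """Metamodel layer: the language in which model classes are defined."""


class StockShare(metaclass=MetaClass):
    """Model layer: describes one information domain (stock trading)."""

    def __init__(self, ask_price):
        self.ask_price = ask_price


acme = StockShare(654.56)                 # user-object layer (user data)

assert isinstance(acme, StockShare)       # user object is-an-instance-of model
assert isinstance(StockShare, MetaClass)  # model is-an-instance-of metamodel
assert isinstance(MetaClass, type)        # metamodel is-an-instance-of meta-metamodel
assert isinstance(type, type)             # meta-metamodel describes itself
```

Each `isinstance` check crosses exactly one layer boundary of the table above, which is what makes metamodel-aware systems queryable about their own structure.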
CRISTAL Project Team 19
A Description-Driven Model
[Diagram: two orthogonal abstractions. Vertical abstraction ("is an instance of") runs through the Instance, Model and Meta-Model layers. Horizontal abstraction ("is described by") separates the Base Level from the Meta-Level: Data is described by Meta-Data, and Meta-Data is in turn described by Model Meta-Data.]
CRISTAL Project Team 20
Conclusions on MetaData
Meta-objects:
• Provide flexibility & reusability of definitions
• Handle complexity in large scale systems
• Allow co-existence of multiple versions of data
• Minimise effects of system evolution
• Provide ‘hooks’ for interoperability
• Can be queried for data navigation
--> the basis of Grid workflow collective services.
CRISTAL Project Team 21
Future Activities
• Members of the GGF's Grid Computing Environment working group – workflow interest group
• OMG members (invited to the OMG Grids Workshop, Boston)
• E-science UK-sponsored CERN Fellowship to work alongside CMS physicists on workflow management
• E-science funding for UK computer scientists to work with the EU DataGrid project
• E-science Generic Middleware open call
• Continue to develop the CRISTAL workflow engine.
CRISTAL Project Team 22
Reference papers
• F. Estrella, Z. Kovacs, J-M Le Goff & R. McClatchey, "Model and Information Abstraction for Description-Driven Systems". Accepted paper at this conference (ID 8-053).
• F. Estrella, "Objects, Patterns and Descriptions in Data Management", PhD Thesis, University of the West of England, Bristol, England, December 2000.
• J. Draskic et al., "Using a Meta-Model as the Basis for Enterprise-Wide Data Navigation", Proc. of the 3rd IEEE Meta-Data Conference, Bethesda, Maryland, USA, April 1999. Available at http://computer.org/conferen/proceed/meta/1999/
• J-M Le Goff et al., "Design Patterns for Description-Driven Systems", CHEP 2001, Computer Physics Communications, in print, and CMS NOTE 1999_045.
• N. Baker et al., "Component-Based Approach to Scientific Workflow Management", ACAT'2000 conference, FermiLab, October 2000, and CMS NOTE 2001_024.