
Page 1: David Colling GridPP Edinburgh 6th November 2001 SAM... an overview (Many thanks to Vicky White, Lee Lueking and Rod Walker)

David Colling GridPP Edinburgh 6th November 2001

SAM ... an overview

(Many thanks to Vicky White, Lee Lueking and Rod Walker)

http://d0db.fnal.gov/sam

Page 2:

SAM stands for “Sequential Access to Data via Metadata”, where “sequential” refers to the events stored within files.

The current SAM development team includes:

Lauri Loebel-Carpenter, Lee Lueking*, Carmenita Moore, Igor Terekhov, Julie Trumbo, Sinisa Veseli, Matthew Vranicar, Stephen P. White, Victoria White*. (*project leaders)

Recently, some work has also been done in the UK by Rod Walker.

Page 3:

History of SAM

The project started in 1997.

It was built for the DØ “virtual organisation” (~500 physicists, 72 institutions, 18 countries).

SAM’s objectives are:

• to provide a worldwide system of shareable computing and storage resources, and so a solution to the common problem of extracting physics results from about a petabyte of data (c. 2003);

• to provide a large degree of transparency to the user, who makes requests for datasets, submits jobs and stores files (together with extensive metadata about the processing steps etc.).

Page 4:

Currently, SAM’s storage and delivery of data are far more advanced than its job submission.

SAM is an operational prototype of many of the concepts being developed for Grid computing.

Page 5:

Overview of SAM

[Diagram. Components shown: Database Server(s) (Central Database), Name Server, Global Resource Manager(s), Log Server, Station 1 Servers, Station 2 Servers, Station 3 Servers, Station n Servers, Mass Storage System(s). Legend: Shared Globally, Local, Shared Locally. Arrows indicate control and data flow.]

Page 6:

The Name Server allows all components to find each other by name.

The Database Server has numerous methods which process transactions and retrieve information from the central database.

The Resource Manager(s) control the efficient use of resources such as tape stores.

The Log Server gathers information from the entire system for monitoring and debugging.

All communication is via CORBA.

Page 7:

The SAM station

A SAM station is deployed on local processing platforms

A station's CPU and disk resources are not shared outside that station.

Stations can communicate directly with each other, and data held in one station's cache can be replicated at other stations on demand.

Groups of stations at one physical site can share a locally available mass storage system (e.g. at Fermilab).

Page 8:

The SAM station

The station’s responsibilities include:

Storing and retrieving data files from mass storage and other stations.

Managing data stored on cache disk.

Launching Project Managers, which oversee the processing of data requests by consumers in well-defined projects.

All these functions are provided by the servers within a station. (See next slide.)

Page 9:

The SAM Station

[Diagram. Components shown: Station & Cache Manager, File Storage Server, File Stager(s), Project Managers, Producers/Consumers, eworkers, File Storage Clients, Cache Disk, Temp Disk, and two boxes labelled “MSS or Other Station”. Legend: Data flow, Control.]

Page 10:

The SAM Station

The Station Manager oversees the removal of files cached on disk, and instructs the File Stager to add new files.

All processing projects are started through the Station Server, which starts Project Managers.

Files are added to the system through the File Storage Server (FSS), which uses the Stagers to initiate transfers to the available MSS or another station.
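The interplay between the Station Manager and the File Stager can be pictured with a small sketch. It assumes a least-recently-used eviction policy and a fixed cache size, neither of which is stated in the slides; the names `StationCache` and `request` are hypothetical and not part of SAM.

```python
from collections import OrderedDict


class StationCache:
    """Toy model of a station's disk cache (hypothetical, not SAM code)."""

    def __init__(self, capacity_kb, stage_file):
        self.capacity_kb = capacity_kb      # assumed fixed cache size
        self.stage_file = stage_file        # callback standing in for the File Stager
        self.files = OrderedDict()          # file name -> size in KB, oldest first

    def used_kb(self):
        return sum(self.files.values())

    def request(self, name, size_kb):
        """Return True on a cache hit; otherwise evict, stage, and return False."""
        if name in self.files:
            self.files.move_to_end(name)    # mark as most recently used
            return True
        if size_kb > self.capacity_kb:
            raise ValueError("file larger than the whole cache")
        # Remove least recently used files until the new file fits (assumed policy).
        while self.used_kb() + size_kb > self.capacity_kb:
            self.files.popitem(last=False)
        self.stage_file(name)               # ask the stager to bring the file in
        self.files[name] = size_kb
        return False


staged = []
cache = StationCache(capacity_kb=100, stage_file=staged.append)
cache.request("a.raw", 60)
cache.request("b.raw", 30)
cache.request("a.raw", 60)   # hit: a.raw becomes most recently used
cache.request("c.raw", 40)   # does not fit, so b.raw is evicted first
```

After these calls the cache holds `a.raw` and `c.raw`, and the stager callback has been invoked once per cache miss.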

Page 11:

The SAM Station

A Station Job Manager provides services to execute a user application, script, or series of jobs, potentially as parallel processes, either interactively or through a local batch system.

LSF and FBS are currently supported; Condor and PBS adapters are under construction and being tested.

The station Cache Manager and Job Manager are implemented as a single “Station Master” server.

Job submission, and synchronization between job execution and data delivery, are currently part of SAM. Jobs are put on hold in batch system queues until data files are available to the job. At present, jobs submitted at one station may only be run using the batch system(s) available at that station.
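The idea of holding a job until its data files are available can be sketched as follows. This is a minimal illustration under the assumption that a job is released only once all of its input files have been delivered; the class `HeldJobs` and its methods are hypothetical and do not reflect SAM's actual interfaces.

```python
class HeldJobs:
    """Toy model of jobs waiting in a batch queue for their input files."""

    def __init__(self):
        self.waiting = {}       # job id -> set of files still missing
        self.released = []      # job ids released to run, in release order

    def submit(self, job_id, needed_files, cached_files=()):
        missing = set(needed_files) - set(cached_files)
        if missing:
            self.waiting[job_id] = missing      # hold the job
        else:
            self.released.append(job_id)        # all data already local

    def file_delivered(self, name):
        """Record a delivered file and release every job that is now complete."""
        for job_id in list(self.waiting):
            self.waiting[job_id].discard(name)
            if not self.waiting[job_id]:
                del self.waiting[job_id]
                self.released.append(job_id)


queue = HeldJobs()
queue.submit("job1", ["f1", "f2"])
queue.submit("job2", ["f2"], cached_files=["f2"])   # runs immediately
queue.file_delivered("f1")                           # job1 still needs f2
queue.file_delivered("f2")                           # job1 is released
```

Here `job2` is released at submission because its only input is already cached, while `job1` stays on hold until both of its files have arrived.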

Page 12:

The User Interface

UIs are provided to add data, access data, set configuration parameters and monitor the system.

These take the form of Unix command-line tools, Web GUIs and a Python API. There is also a C++ interface for accessing data through a standard DØ framework package.

Page 13:

Defining a dataset

Page 14:

Examining a predefined dataset

Page 15:

Querying Cached Files

Page 16:

The SAM station

Real Data files from FNAL

MC files from NIKHEF

Page 17:

The SAM station

The SAM submit command

sam submit --defname=run129194_reco --cpu-per-event=2m --group=dzero
  --batch-system-flags="--universe=vanilla --output=condor.out
    --log=condor.log --error=condor.error
    --initialdir=/home/walker/TestSam/blife/BLifetime_x-run13264x_reco_p1004
    --arguments='-rcp framework.rcp -input_file SAMInput: -output_file
    outputfile -out BLifetime_x.out -log BLifetime_x.log -time -mem'"
  --framework-exe=./BLifetime_x

This starts a project and submits the job to the Condor batch system.
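When a command of this shape is generated from a script, it can help to assemble it as an argument list so that the batch-system flags stay together as one quoted argument. The sketch below only reuses option names that appear in the command on this slide; the helper `build_sam_submit` is hypothetical, and nothing here has been checked against a real SAM installation.

```python
import shlex


def build_sam_submit(defname, cpu_per_event, group, framework_exe, batch_flags):
    """Assemble an argument list mirroring the options shown on the slide."""
    return [
        "sam", "submit",
        f"--defname={defname}",
        f"--cpu-per-event={cpu_per_event}",
        f"--group={group}",
        # The batch-system flags travel as a single argument.
        f"--batch-system-flags={' '.join(batch_flags)}",
        f"--framework-exe={framework_exe}",
    ]


argv = build_sam_submit(
    defname="run129194_reco",
    cpu_per_event="2m",
    group="dzero",
    framework_exe="./BLifetime_x",
    batch_flags=["--universe=vanilla", "--output=condor.out",
                 "--log=condor.log", "--error=condor.error"],
)
print(shlex.join(argv))   # shell-quoted form, for inspection only
```

The list form could be handed to `subprocess.run(argv)` on a machine where the `sam` client is installed; the example deliberately stops at printing.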

Page 18:

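The DØ SAM World

[Map. Site labels, with numbers as shown on the map: Fermilab, MSU, Columbia, UTA (64), Lyon/IN2P3 (100), Prague (32), Imperial College, Lancaster (200), NIKHEF (50). Network labels: SuperJanet, SURFnet, ESnet, Abilene. A legend symbol marks MC production centers.]

Also a UCL-CDF-test station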

Page 19:

SAM Works now!

#Transfers initiated between 9:30 and 12:30 (Thursday 25 Oct 2001)
+---------------------+--------------------------+--------+---------------+
| from station        | to station               | #files | tot_size (KB) |
+---------------------+--------------------------+--------+---------------+
| ccin2p3-analysis    | central-analysis         |     51 |       4694053 |
| central-analysis    | clued0                   |     43 |       4970595 |
| central-analysis    | enstore                  |    138 |      35715952 |
| central-analysis    | imperial-test            |     19 |       6499500 |
| datalogger-d0olb    | enstore                  |     54 |      21833665 |
| datalogger-d0olc    | enstore                  |     34 |       5638370 |
| enstore             | central-analysis         |     20 |       2836508 |
| enstore             | clued0                   |     20 |       5290084 |
| enstore             | linux-analysis-cluster-1 |     27 |       8207554 |
| hoeve               | central-analysis         |     67 |      25890902 |
| lancs               | central-analysis         |     21 |       5588544 |
| prague-test-station | central-analysis         |      2 |       1530437 |
| uta-hep             | central-analysis         |      5 |       1165404 |
+---------------------+--------------------------+--------+---------------+
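To get a feel for the volume behind this table, the rows can be totalled with a few lines of Python. The data are copied from the table; the mean rate is only indicative, since it assumes every transfer initiated in the three-hour window also completed within it, which the slide does not state.

```python
from collections import defaultdict

# (from station, to station, number of files, total size in KB), copied from the table
TRANSFERS = [
    ("ccin2p3-analysis", "central-analysis", 51, 4694053),
    ("central-analysis", "clued0", 43, 4970595),
    ("central-analysis", "enstore", 138, 35715952),
    ("central-analysis", "imperial-test", 19, 6499500),
    ("datalogger-d0olb", "enstore", 54, 21833665),
    ("datalogger-d0olc", "enstore", 34, 5638370),
    ("enstore", "central-analysis", 20, 2836508),
    ("enstore", "clued0", 20, 5290084),
    ("enstore", "linux-analysis-cluster-1", 27, 8207554),
    ("hoeve", "central-analysis", 67, 25890902),
    ("lancs", "central-analysis", 21, 5588544),
    ("prague-test-station", "central-analysis", 2, 1530437),
    ("uta-hep", "central-analysis", 5, 1165404),
]

total_files = sum(n for _, _, n, _ in TRANSFERS)
total_kb = sum(kb for _, _, _, kb in TRANSFERS)

# Number of files arriving at each destination station.
files_received = defaultdict(int)
for _, dest, n, _ in TRANSFERS:
    files_received[dest] += n

window_seconds = 3 * 3600                     # 9:30 to 12:30
mean_rate_kb_s = total_kb / window_seconds    # indicative only, see assumption above

print(total_files, total_kb, round(mean_rate_kb_s))
```

In total, 501 files and 129,861,568 KB were moved, of which 166 files went to `central-analysis`.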

Page 20:

The Fabric

Compute systems and storage systems are located in the US (Fermilab, UTA, Columbia, MSU), France (Lyon-IN2P3), the UK (Lancaster and Imperial College), the Netherlands (NIKHEF) and the Czech Republic (Prague).

Many other sites are expected to provide additional compute and storage resources when the experiment moves from commissioning to physics data taking.

Storage systems consist of disk storage elements at all locations and robotically controlled tape libraries at Fermilab, Lyon, NIKHEF and (almost) Lancaster.

All storage elements support the basic functions of storing or retrieving a file. Some support parallel transfer protocols, currently via bbftp.

The underlying storage management systems for tape storage elements are different at Fermilab, Lyon and NIKHEF. Fermilab's tape storage management system, Enstore, provides the ability to assign priorities and file placement instructions to file requests, and provides reports about placement of data on tape, queue wait time, transfer time and other information that can be used for resource management.

Page 21:

Interim Conclusions

SAM is a sophisticated tool for data transfer, and a less sophisticated tool for job submission.

SAM works now, and has real users!

SAM is an operational prototype of many of the concepts being developed for Grid computing.

Page 22:

Interim Conclusions

However, significant parts of SAM will have to be enhanced (or replaced) before it can truly claim to be a data grid. This work will happen as part of the Particle Physics Data Grid (PPDG) project.

The following slides are extracts from Vicky White’s talk “SAM and PPDG” (CHEP 2001). Current status is shown in black; planned enhancements are shown in bold red.

Page 23:

[Diagram: SAM components arranged in Grid architecture layers. A name in “quotes” is the SAM-given software component name. The legend marks components that will be replaced, enhanced or added using PPDG and Grid tools.]

Client Applications: Web; Python codes, Java codes; command line; D0 Framework C++ codes

Collective Services: Request Formulator and Planner; Request Manager; Cache Manager; Job Manager; Storage Manager; Job Services; Data Mover; SAM Resource Management; Batch Systems (LSF, FBS, PBS, Condor); Significant Event Logger; Naming Service; Database Manager; Catalog Manager

SAM component names shown: “Dataset Editor”, “File Storage Server”, “Project Master”, “Station Master”, “Station Master”, “Stager”, “Optimiser”

Connectivity and Resource: CORBA; UDP; file transfer protocols (ftp, bbftp, rcp, GridFTP); mass storage system protocols (e.g. encp, hpss); catalog protocols

Authentication and Security: GSI; SAM-specific user, group, node, station registration; bbftp ‘cookie’

Fabric: Tape Storage Elements; Disk Storage Elements; Compute Elements; LANs and WANs; Resource and Services Catalog; Replica Catalog; Meta-data Catalog; Code Repository

Page 24:

Enhancing SAM

The Job Manager is limited and can only submit to local resources.

The specification of user jobs, including their characteristics and input datasets, is a major component of the PPDG work.

The intention is to provide Grid job services components that replace the SAM job services components. This will support job submission (including composite and parallel jobs) to suitable SAM Station(s) and eventually any available Grid computing resource.

Page 25:

Enhancing SAM

Unix user names, physics groups, nodes, domains and stations are registered. Valid combinations of these must be provided to obtain services. Station servers at one station provide service on behalf of their local users and are ‘trusted’ by other Station servers or Database Servers. Globus core Security Infrastructure services are a planned PPDG enhancement of the system.

Service registration and discovery is implemented using a CORBA naming service, with a namespace per station name.

APIs to services in SAM are all defined using CORBA Interface Definition Language and have multiple language bindings (C++, Python, Java) and, in many cases, a shell interface.

Use of GridFTP and other standard protocols to access storage elements is a planned PPDG modification to the system.

Integration with grid monitoring tools and approaches is a PPDG area of research.

Registration of resources and services using a standardized Grid registration or enquiry protocol is a PPDG enhancement to the system.
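The rule that a valid combination of registered names must be presented to obtain services can be illustrated with a small lookup. This is a sketch only: the `Identity` and `Registry` types are hypothetical, the domain field is left out for brevity, and the node name `node01` is invented for the example.

```python
from typing import NamedTuple


class Identity(NamedTuple):
    user: str
    group: str
    node: str
    station: str


class Registry:
    """Toy registry of valid (user, group, node, station) combinations."""

    def __init__(self):
        self._valid = set()

    def register(self, identity):
        self._valid.add(identity)

    def may_use_services(self, identity):
        # Knowing each name individually is not enough:
        # the combination as a whole must have been registered.
        return identity in self._valid


registry = Registry()
registry.register(Identity("walker", "dzero", "node01", "imperial-test"))

ok = registry.may_use_services(Identity("walker", "dzero", "node01", "imperial-test"))
bad = registry.may_use_services(Identity("walker", "dzero", "node01", "lancs"))
```

The first request succeeds; the second is refused because that user, group and node were never registered together with the `lancs` station.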

Page 26:

Enhancing SAM

Database Managers provide access to the Replica Catalog, Metadata Catalog, SAM Resource and Configuration Catalog, and Transformation Catalog.

All catalogs are currently tables in a central Oracle database, a detail that is hidden from their clients. Replication of some catalogs in two or more locations worldwide is a planned enhancement to the system.

Database Managers will need to be enhanced to adapt SAM-specific APIs and catalog protocols onto Grid catalog APIs using PPDG-supported Grid protocols, so that information may be published and retrieved in the wider Physics Data Grid that spans several virtual organizations.

A central Logging Server receives significant events. This will be refined to receive only summary-level information, with more detailed monitoring information held at each site.

Work in the context of PPDG will examine how to use a Grid Monitoring Architecture and tools.

Page 27:

Enhancing SAM

Resource manager services are provided by an “Optimization” service. File transfer actions are prioritized and authorized prior to being executed.

The current primitive functionality of re-ordering and grouping file requests, primarily to optimize access to tapes, will need to be greatly extended, redesigned and re-implemented to deal better with co-location of data with computing elements, and with fair-share and policy-driven use of all computing, storage and network resources.

This is a major component of the SAM/PPDG work, to be carried out in collaboration with the Condor team.
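Re-ordering and grouping file requests to optimize tape access can be sketched as follows. The example assumes that each request carries a tape label and a numeric priority (lower meaning more urgent) and that all files on one tape are read together; these details, and the function name `plan_tape_reads`, are illustrative assumptions rather than a description of the SAM optimizer.

```python
from itertools import groupby


def plan_tape_reads(requests):
    """Group requests by tape, serving the tape with the most urgent request first.

    Each request is a (file_name, tape_label, priority) tuple; a lower priority
    number is more urgent. The tuple layout and ordering rule are assumptions.
    """
    by_tape = sorted(requests, key=lambda r: r[1])
    groups = [list(g) for _, g in groupby(by_tape, key=lambda r: r[1])]
    # Most urgent tape first; within a tape, most urgent file first.
    groups.sort(key=lambda g: min(r[2] for r in g))
    return [(g[0][1], [r[0] for r in sorted(g, key=lambda r: r[2])]) for g in groups]


pending = [
    ("f1", "TAPE_B", 5),
    ("f2", "TAPE_A", 3),
    ("f3", "TAPE_B", 1),
    ("f4", "TAPE_A", 4),
]
plan = plan_tape_reads(pending)
```

With these inputs, `TAPE_B` is mounted once and serves `f3` then `f1`, followed by `TAPE_A` serving `f2` then `f4`, so four requests need only two mounts.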

Page 28:

Enhancing SAM

Other enhancements are also needed for scalability. For example, SAM relies on a single Oracle database, which is a single point of failure and needs replication or caching.

Page 29:

Conclusions

SAM already does a lot, and the planned enhancements will give it far greater functionality.