
Page 1

Yet Another Grid Project: The Open Science Grid at SLAC

Matteo Melani, Booker Bense and Wei Yang (SLAC)

HEPiX Conference, 10/13/05, SLAC, Menlo Park, CA, USA

Page 2

July 22nd, 2005

“The Open Science Grid Consortium today officially inaugurated the Open Science Grid, a national grid computing infrastructure for large scale science. The OSG is built and operated by teams from U.S. universities and national laboratories, and is open to small and large research groups nationwide from many different scientific disciplines.”

- Science Grid This Week -

Page 3

Outline

OSG in a nutshell

OSG at SLAC: “PROD_SLAC” site

Authentication and Authorization in OSG

LSF-OSG integration

Running applications: US CMS and US ATLAS

Final thought

Page 4

Outline

OSG in a nutshell

OSG at SLAC: “PROD_SLAC” site

Authentication and Authorization in OSG

LSF-OSG integration

Running applications: US CMS and US ATLAS

Final thought

Page 5

Once upon a time there was…

• 30 sites
• ~3,600 CPUs

Goal: to build a shared Grid infrastructure to support opportunistic use of resources for stakeholders.

Stakeholders are the NSF- and DOE-sponsored Grid projects (PPDG, GriPhyN, iVDGL) and the US LHC software program.

A team of computer and domain scientists deployed (simple) services in a common infrastructure, with common interfaces, across existing computing facilities.

Operating stably for over a year in support of computationally intensive applications.

Added communities without perturbation.

CMS DC04

ATLAS DC2

Page 6

Page 7

Page 8

Vision (1)

The Open Science Grid: a production-quality national grid infrastructure for large-scale science.

Robust and scalable

Fully managed

Interoperates with other Grids

Page 9

Vision (2)

Page 10

What is the Open Science Grid? (Ian Foster)

Open

A new sort of multidisciplinary cyberinfrastructure community

An experiment in governance, incentives, architecture

Part of a larger whole, with TeraGrid, EGEE, LCG, etc.

Science

Driven by demanding scientific goals and projects who need results today (or yesterday)

Also a computer science experimental platform

Grid

Standardized protocols and interfaces

Software implementing infrastructure, services, applications

Physical infrastructure—computing, storage, networks

People who know & understand these things!

Page 11

OSG Consortium

Members of the OSG Consortium are those organizations that have made agreements to contribute to the Consortium.

DOE Labs: SLAC, BNL, FNAL

Universities: CCR, University at Buffalo

Grid Projects: iVDGL, PPDG, Grid3, GriPhyN

Experiments: LIGO, US CMS, US ATLAS, CDF Computing, D0 Computing, STAR, SDSS

Middleware Projects: Condor, Globus, SRM Collaboration, VDT

Partners are those organizations with whom we are interfacing to work on interoperation of grid infrastructures and services.

LCG, EGEE, TeraGrid

Page 12

Character of Open Science Grid (1)

Pragmatic approach: experiments/users drive requirements

“Keep it simple and make more reliable”

Guaranteed and opportunistic use of resources provided through Facility-VO contracts.

Validated, supported core services based on VDT and NMI Middleware. (Currently GT3 based, moving soon to GT4)

Adiabatic evolution to increase scale and complexity.

Services and applications contributed from external projects. Low threshold to contributions and new services.

Page 13

Character of Open Science Grid (2)

Heterogeneous infrastructure: all Linux, but different versions of the software stack at different sites.

Site autonomy: distributed ownership of resources with diverse local policies, priorities, and capabilities.

“no” Grid software on compute nodes.

• But users want direct access for diagnosis and monitoring:

• Quote from physicist on CDF: “Experiments need to keep under control the progress of their application to take proper actions, helping the Grid to work by having it expose much of its status to the users”

Page 14

Architecture

Page 15

Services

Computing Service: GRAM from GT3.2.1+patches (see the probe sketch after this list)

Storage Service: SRM interface (v1.1) as the common interface to storage, DRM and dCache; most sites use NFS + GridFTP; we are looking into an SRM-xrootd solution

File Transfer Service: GridFTP

VO Management Service: INFN VOMS

AA: GUMS v1.0.1, PRIMA v0.3, gPlazma

Monitoring Service: MonALISA v1.2.34, MDS

Information Service: jClarens v0.5.3-2, GridCat

Accounting Service: partially provided by MonALISA
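
To make the computing service above concrete, here is a minimal sketch of probing a GRAM gatekeeper from a submit host that holds a valid grid proxy. The gatekeeper contact string is a placeholder, not the actual PROD_SLAC endpoint; the wrapper simply shells out to the standard Globus clients (globusrun, globus-job-run).

```python
#!/usr/bin/env python
"""Hedged sketch: probe a GRAM compute service from a submit host.

Assumes a valid grid proxy (grid-proxy-init already run) and the Globus
clients on PATH. The gatekeeper contact string below is a placeholder,
not the real PROD_SLAC endpoint.
"""
import subprocess

GATEKEEPER = "osg-gate.example.edu/jobmanager-lsf"  # placeholder contact string

def run(cmd):
    # Echo the command, then capture its output for display.
    print("$ " + " ".join(cmd))
    return subprocess.run(cmd, capture_output=True, text=True)

# Authentication-only test: exercises the certificate and the GUMS mapping
# without submitting a job.
auth = run(["globusrun", "-a", "-r", GATEKEEPER])
print(auth.stdout or auth.stderr)

# Trivial job routed through the LSF job manager behind the gatekeeper.
hello = run(["globus-job-run", GATEKEEPER, "/bin/hostname"])
print(hello.stdout or hello.stderr)
```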

Page 16

Open Science Grid Release 0.2 (site diagram, courtesy of Ruth Pordes). Recoverable labels:

Submit host (Condor-G, Globus RSL) and user portal outside the site boundary (WAN -> LAN)

Identity and roles: X.509 certs; authentication mapping: GUMS; PRIMA, gPlazma

Compute Element: GT2 GRAM, Grid monitor; batch queue job priority

Worker nodes: $WN_TMP, plus common space across WNs: $DATA (local SE), $APP, $TMP

Storage Element: SRM v1.1, GridFTP

Virtual Organization Management

Catalogs & displays: GridCat, ACDC, MonALISA

Monitoring & information: GridCat, ACDC, MonALISA, SiteVerify

Page 17

OSG 0.4 (site diagram, courtesy of Ruth Pordes). Same layout as Release 0.2 (Compute Element with GT2 GRAM and Grid monitor, Storage Element with SRM v1.1 and GridFTP, GUMS mapping, PRIMA, gPlazma, X.509 identity and roles, Condor-G submit host, user portal, GridCat/ACDC/MonALISA/SiteVerify catalogs and monitoring), with additions:

Edge Service Framework (Xen), lifetime-managed VO services

Bandwidth management at some sites

GT4 GRAM

Full local SE

Job monitoring and exit-code reporting

Accounting

Service discovery: GIP + BDII network

Page 18

Software distribution

Software is contributed by individual OSG members into collections we call "packages".

OSG provides collections of software for common services, built on top of the VDT, to facilitate participation.

There is very little OSG-specific software and we strive to use standards-based interfaces where possible.

OSG software packages are currently distributed as Pacman caches (an install sketch follows this slide).

Latest release (May 24th) is based on VDT 1.3.6
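
As an illustration of the Pacman distribution model above, here is a minimal sketch of pulling an OSG package out of a cache. The general `pacman -get cache:package` form is how Pacman is invoked, but the cache URL, package name and install directory below are placeholders, not the official OSG values.

```python
#!/usr/bin/env python
"""Hedged sketch of installing an OSG software collection with Pacman.

Pacman fetches a named package out of a remote "cache" and unpacks it into
the directory it is run from. The cache URL, package name and install
directory are placeholders; the OSG release notes give the real values.
"""
import os
import subprocess

INSTALL_DIR = "/opt/osg-ce"                       # assumption: target install area
CACHE = "http://software.example.org/osg-cache"   # placeholder cache URL
PACKAGE = "OSG-CE"                                # placeholder package name

os.makedirs(INSTALL_DIR, exist_ok=True)
# Run Pacman from the install area so the package lands there.
subprocess.run(["pacman", "-get", "%s:%s" % (CACHE, PACKAGE)],
               cwd=INSTALL_DIR, check=True)
```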

Page 19

OSG’s deployed Grids

OSG Consortium operates two grids:

OSG is the production grid: Stable; for sustained production

14 VOs

38 sites, ~5,000 CPUs, 10 VOs.

Support provided

http://osg-cat.grid.iu.edu/

OSG-ITB is the test and development grid: for testing new services, technologies, versions…

29 sites, ~2400 CPUs,

http://osg-itb.ivdgl.org/gridcat/

Page 20

Operations and support

VOs are responsible for 1st level support

Distributed Operations and Support model from the outset.

Difficult to explain, but scalable and putting most support “locally”.

Key core component is a central ticketing system with automated routing and import/export capabilities to other ticketing systems and text-based information.

Grid Operations Center (iGOC)

Incident Response Framework, coordinated with EGEE.

Page 21

Outline

OSG in a nutshell

OSG at SLAC: “PROD_SLAC” site

Authentication and Authorization in OSG

LSF-OSG integration

Running applications: US CMS and US ATLAS

Final thought

Page 22

PROD_SLAC

100 job slots available in TRUE resource sharing

0.5 TB of disk space

[email protected]

LSF 5.1 batch system

VO role-based authentication and authorization

VOs: BaBar, US ATLAS, US CMS, LIGO, iVDGL

Page 23

PROD_SLAC

4 Sun V20z dual-processor machines

Storage is provided with NFS: 3 directories, $APP, $DATA and $TMP

We do not run Ganglia or GRIS

Page 24

Outline

OSG in a nutshell

OSG at SLAC: “PROD_SLAC” site

Authentication and Authorization in OSG

LSF-OSG integration

Running applications: US CMS and US ATLAS

Conclusions

Page 25

AA using GUMS

Page 26

UNIX account issue

The Problem:

SLAC Unix accounts did not fit the OSG model:

Normal SLAC accounts have too many default privileges

Gatekeeper-AFS interaction is problematic

The Solution:

Created a new class of Unix accounts just for the Grids

Created a new process for this new type of account

The new account type has minimum privileges:

no email, no login access,

home dir on Grid-dedicated NFS, no write access beyond the Grid NFS server

Page 27

DN-UID mapping

Each (DN, voGroup) pair is mapped to a unique UNIX account

No group mapping

Account name schema: osg + VOname + VOgroup + NNNNN (sketched after this slide)

Example:

A DN in USCMS VO (voGroup /uscms/) => osguscms00001

iVDGL VO, group mis (voGroup /ivdgl/mis) => osgivdglmis00001

If revoked, the account name/UID will never be reused (unlike for UNIX accounts)

Keep track of Grid UNIX accounts like ordinary UNIX user accounts (in RES)

1,000,000 < UID < 10,000,000

All Grid UNIX accounts belong to one single UNIX group

Home directories on Grid-dedicated NFS; shells are /bin/false
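
To make the naming schema above concrete, here is a minimal sketch of the account-name construction. It only illustrates the rule stated on the slide; it is not the GUMS or SLAC provisioning code, and the five-digit serial width is inferred from the two examples.

```python
def grid_account_name(vo_group, serial):
    """Build an account name following the osg + VOname + VOgroup + NNNNN schema.

    vo_group is the VOMS group string, e.g. "/uscms/" or "/ivdgl/mis";
    serial is the per-(DN, voGroup) sequence number. Illustrative sketch of
    the naming rule described on the slide, not the site's real tool.
    """
    parts = [p for p in vo_group.split("/") if p]    # "/ivdgl/mis" -> ["ivdgl", "mis"]
    return "osg" + "".join(parts) + "%05d" % serial  # 5-digit serial, per the examples

# The two examples from the slide:
assert grid_account_name("/uscms/", 1) == "osguscms00001"
assert grid_account_name("/ivdgl/mis", 1) == "osgivdglmis00001"
```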

Page 28

Outline

OSG in a nutshell

OSG at SLAC: “PROD_SLAC” site

Authentication and Authorization in OSG

OSG-LSF integration

Running applications: US CMS and US ATLAS

Final thought

Page 29

GRAM Issue

The Problem:

The gatekeeper polls job status over-aggressively; it overloads the LSF scheduler

Race conditions: the LSF job manager is unable to distinguish between an error condition and a loaded system (we usually have more than 2K jobs running)

May be reduced in the next version of LSF

The Solution:

Re-write part of the LSF job manager: lsf.pm

Looking into writing a custom bjobs with local caching (see the sketch below)
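
A minimal sketch of the local-caching idea behind a custom bjobs follows. The cache file location and refresh interval are arbitrary assumptions, and a real wrapper would also filter the cached output per caller; this is an illustration, not the SLAC tool.

```python
#!/usr/bin/env python
"""Minimal sketch of a caching front end for bjobs (not the actual SLAC tool).

Idea: pollers (the GRAM job manager, monitoring tools) call this wrapper
instead of the real bjobs; the wrapper re-runs `bjobs -u all` at most once
per TTL and serves everyone else from a cached copy, so the LSF scheduler
sees one query instead of hundreds. Cache path and TTL are assumptions.
"""
import os
import subprocess
import sys
import time

CACHE = "/var/tmp/bjobs-cache.txt"   # assumption: node-local cache file
TTL = 60                             # assumption: refresh at most once a minute

def refresh_cache():
    # One real query against the scheduler; everything else reads the file.
    out = subprocess.run(["bjobs", "-u", "all", "-w"],
                         capture_output=True, text=True, check=True).stdout
    tmp = CACHE + ".tmp"
    with open(tmp, "w") as f:
        f.write(out)
    os.rename(tmp, CACHE)            # atomic swap: readers never see a partial file

def main():
    stale = (not os.path.exists(CACHE)
             or time.time() - os.path.getmtime(CACHE) > TTL)
    if stale:
        refresh_cache()
    with open(CACHE) as f:
        # A real wrapper would filter this for the requesting user/job;
        # here we just replay the cached bjobs output.
        sys.stdout.write(f.read())

if __name__ == "__main__":
    main()
```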

Page 30

The straw that broke the camel’s back

SLAC has more than 4,000 job slots being scheduled by a single machine

We operate in full production mode: operational disruption has to be avoided at all costs

Too many monitoring tools (ACDC, MonALISA, users’ monitoring tools…) can easily overload the LSF scheduler by running bjobs -u all

The implementation of monitoring is a concern!

Page 31

Outline

OSG in a nutshell

OSG at SLAC: “PROD_SLAC” site

Authentication and Authorization in OSG

LSF-OSG integration

Running applications: US CMS and US ATLAS

Final thought

Page 32

US CMS Application

Intentionally left blank!

We could run 10-100 jobs right away

Page 33

US ATLAS Application

ATLAS reconstruction and analysis jobs require access to remote database servers at CERN, BNL, and elsewhere

SLAC batch nodes don't have internet access

The solution is to use a clone of the database within the SLAC network or to create a tunnel (a tunnelling sketch follows)
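
As an illustration of the tunnelling option, here is a hedged sketch of forwarding a port on a gateway host that does have outside connectivity to a remote database server. All hostnames and ports are placeholders; this is not the ATLAS or SLAC production configuration.

```python
#!/usr/bin/env python
"""Hedged sketch of the "create a tunnel" option (not the production setup).

Run on a gateway host with outside connectivity. It forwards a local port to
the remote database server; batch nodes without internet access then point
their database client at gateway:15000 instead of the real server. All
hostnames and ports below are placeholders.
"""
import subprocess

LOCAL_PORT = 15000                          # assumption: free port on the gateway
DB_HOST, DB_PORT = "db.example.org", 3306   # placeholder database server
BOUNCE_HOST = "login.example.org"           # placeholder externally reachable host

# -N: no remote command; -g: allow other hosts (the batch nodes) to use the
# forwarded port; -L: forward LOCAL_PORT on the gateway to DB_HOST:DB_PORT.
subprocess.run([
    "ssh", "-N", "-g",
    "-L", "%d:%s:%d" % (LOCAL_PORT, DB_HOST, DB_PORT),
    BOUNCE_HOST,
], check=True)
```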

Page 34

Outline

OSG in a nutshell

OSG at SLAC: “PROD_SLAC” site

Authentication and Authorization in OSG

LSF-OSG integration

Running applications: US CMS and US ATLAS

Final thought

Page 35

Final thought

“PARVA SED APTA MIHI SED…” (“Small, but suited to me, but…”)

- Ludovico Ariosto

Page 36

QUESTIONS?

Page 37

Spare

Page 38

Governance

Page 39

Ticketing Routing Example

1. A user in VO1 notices a problem at RP3 and notifies their SC (SC-C).
2. SC-C opens a ticket and assigns it to SC-F.
3. SC-F gets automatic notice.
4. SC-F contacts RP3.
5. The admin at RP3 fixes the problem and replies to SC-F.
6. SC-F notes the resolution in the ticket and marks it resolved.
7. SC-C gets automatic notice of the update to the ticket.
8. SC-C notifies the user of the resolution.
9-10. The user can complain if dissatisfied and SC-C can re-open the ticket.

(Diagram: physical view of the ticket flow across the OSG infrastructure and the SCs' private infrastructure.)

Page 40