Page 1: FermiGrid

Steven Timm
Fermilab Computing Division
Fermilab Grid Support Center

Condor Week, 4/25/2006
http://fermigrid.fnal.gov/

Page 2: People

FermiGrid Operations Team:
- Keith Chadwick (CD/CCF/FTP) – Project Leader
- Steve Timm (CD/CSS/FCS) – Linux OS Support
- Dan Yocum (CD/CCF/FTP) – Application Support

Thanks to:
- Condor Team: M. Livny, J. Frey, A. Roy, and many others.
- Globus developers: C. Bacon, S. Martin.
- GridX1: R. Walker, D. Vanderster, et al.
- Fermilab grid developers: G. Garzoglio, T. Levshina.
- Representatives of the following OSG Virtual Organizations: CDF, DZERO, USCMS, DES, SDSS, FERMILAB, I2U2, NANOHUB, GADU.

FermiGrid Web Site & Additional Documentation:
• http://fermigrid.fnal.gov/

Page 3: FCC – Feynman Computing Center

Page 4: Fermilab Grid Computing Center

Page 5: Computing at Fermilab

Reconstruction and analysis of data for High Energy Physics Experiments

> 4 Petabytes on tape

Fast I/O to read file, many hours of computing, fast I/O to write

Each job independent of other jobs.

Simulation for future experiments (CMS at CERN)

In two years this needs to scale to >50K jobs/day

Each big experiment has independent cluster or clusters

Diverse file systems, batch systems, management methods.

More than 3000 dual-processor Linux systems in all

Page 6: FermiGrid Project

FermiGrid is a meta-facility established by the Fermilab Computing Division.

Four elements:

1. Common Site Grid Services: Virtual Organization hosting (VOMS, VOMRS), site-wide Globus GRAM gateway, Site AuthoriZation (SAZ), MyProxy, GUMS.
2. Bi-lateral interoperability between the various experimental stakeholders.
3. Interfaces to the Open Science Grid.
4. Grid interfaces to mass storage systems.

Page 7: FermiGrid – Common Grid Services

[Diagram: the user's job enters through the Gatekeeper, Job manager, and Job scheduler, alongside the common services: the GUMS identity mapping service (UID mapping), the SAZ site authorization service (site control), the VOMS server, and the MyProxy server.]

Page 8: Hardware

Dell 2850 servers with dual 3.6 GHz Xeons, 4 GB of memory, 1000TX, hardware RAID, Scientific Linux 3.0.4, VDT 1.3.9.

FermiGrid1: site-wide Globus gateway
FermiGrid2: site-wide VOMS & VOMRS server
FermiGrid3: site-wide GUMS server
FermiGrid4: MyProxy server and Site AuthoriZation (SAZ) server

Page 9: Site Wide Gateway – Why

[Diagram: the member clusters (CMS WC1, CMS WC2, CDF CAF, CDF, D0 CAB, D0, SDSS TAM, GP Farm, LQCD, desktops, and others to come) all sit behind a single site-wide gateway backed by the MyProxy, VOMS, SAZ, and GUMS servers.]

Page 10: Site Wide Gateway Technique

This technique is closely adapted from one first used at GridX1 in Canada to forward jobs from the LCG into their clusters.

We begin by creating a new Job Manager script in:

$VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/condorg.pm

This script takes incoming jobs and resubmits them to Condor-G on fermigrid1

Condor matchmaking is used so that the jobs will be forwarded to the member cluster with the most open slots.

Each member cluster runs a cron job every five minutes to generate a ClassAD for their cluster. This is sent to fermigrid1 using condor_advertise.
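As a rough illustration of this advertise step (not the production script), a cron job on a member cluster could look something like the sketch below; the ClassAd attribute names FreeSlots and WaitingJobs, the cluster name, and the use of condor_status/condor_q to count available slots and idle jobs are all assumptions made for the example:

    #!/bin/sh
    # Illustrative sketch only: publish a ClassAd describing this cluster to
    # the collector on the site gateway. Attribute names are assumptions.

    AD=/tmp/cluster_ad.$$

    cat > "$AD" <<EOF
    MyType = "Machine"
    Name = "FNAL_GPFARM"
    FreeSlots = $(condor_status -avail 2>/dev/null | grep -c LINUX)
    WaitingJobs = $(condor_q -global 2>/dev/null | grep -c ' I ')
    EOF

    # Send the ad to the site-wide gateway (fermigrid1).
    condor_advertise -pool fermigrid1.fnal.gov UPDATE_STARTD_AD "$AD"
    rm -f "$AD"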

Credentials to successfully forward the job are obtained in the following manner:

1. User obtains a VOMS-qualified proxy in the normal fashion with voms-proxy-init.
2. User sets X509_USER_CERT and X509_USER_KEY to point to the proxy instead of the usercert.pem and userkey.pem files.
3. User uses myproxy-init to store the credentials on the Fermilab MyProxy server, myproxy.fnal.gov.
4. jobmanager-condorg, which runs as the uid the job will run under on FermiGrid, executes a myproxy-get-delegation to get a proxy with full rights to resubmit the job.

Documentation of the steps a user must follow is in the FermiGrid User Guide: http://fermigrid.fnal.gov/user-guide.html
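For reference, the user-side half of this sequence (steps 1–3) looks roughly as follows; the VO name and the default proxy path /tmp/x509up_u<uid> are placeholders rather than FermiGrid-specific values:

    # Illustrative sketch of the user-side credential setup (steps 1-3 above).
    # The VO name and the proxy path are placeholders.

    # 1. Obtain a VOMS-qualified proxy.
    voms-proxy-init -voms fermilab

    # 2. Point the grid tools at the proxy instead of usercert.pem/userkey.pem.
    export X509_USER_CERT=/tmp/x509up_u$(id -u)
    export X509_USER_KEY=/tmp/x509up_u$(id -u)

    # 3. Store the credentials on the Fermilab MyProxy server.
    myproxy-init -s myproxy.fnal.gov

Step 4 then runs on the gateway as the mapped uid, so no further user action is needed for the myproxy-get-delegation.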

Page 11: Site Wide Gateway Animation

[Diagram: the same member clusters and central services as on the previous slide, stepped through the following sequence.]

Step 1 – user issues voms-proxy-init and receives VOMS-signed credentials.

Step 2 – user stores their VOMS-signed credentials on the MyProxy server.

Step 3 – user submits their grid job via globus-job-run, globus-job-submit, or Condor-G.

Step 4 – Gateway retrieves the previously stored proxy.

Step 5 – Gateway requests a GUMS mapping based on VO & role.

Step 6 – Gateway checks against the Site Authorization Service.

Step 7 – grid job is forwarded to the target cluster.

Throughout, the member clusters send ClassAds via condor_advertise to the site-wide gateway.
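For Step 3, the two globus-job commands named above would be invoked against the gateway's condorg jobmanager roughly as follows; the fully qualified host name and the test executable are assumptions:

    # Illustrative only: submit through the site-wide gateway's condorg jobmanager.

    # Interactive test job:
    globus-job-run fermigrid1.fnal.gov/jobmanager-condorg /bin/hostname

    # Batch submission; prints a job contact string for later status queries:
    globus-job-submit fermigrid1.fnal.gov/jobmanager-condorg /bin/hostname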

Page 12: Guest vs. Owner VO Access

[Diagram: access matrix for OSG “guest” VO users, “owner” VO users, and Fermilab “guest” VO users versus the FermiGrid gateway & central services and the individual resource head nodes, with each path marked Required, Allowed, or Not Allowed.]

Page 13: OSG Interfaces for Fermilab

Four Fermilab clusters are directly accessible to OSG right now:
- General Purpose Grid Cluster (FNAL_GPFARM)
- US CMS Tier 1 Cluster (USCMS_FNAL_WC1_CE)
- LQCD cluster (FNAL_LQCD)
- SDSS cluster (SDSS_TAM)

Two more clusters (CDF) are accessible only through the FermiGrid site gateway. Future Fermilab clusters will also be accessible only through the site gateway.

A shell script is used to make a Condor ClassAd and send it with condor_advertise.

The match is done based on the number of free CPUs and the number of jobs waiting.
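A rough sketch of how that policy could be expressed in the requirements and rank of the resubmitted job, reusing the illustrative FreeSlots and WaitingJobs attributes from the advertise sketch on Page 10 (the attribute names are assumptions, not the production ones):

    # Illustrative requirements/rank for the job that jobmanager-condorg resubmits.
    # FreeSlots and WaitingJobs are assumed attribute names in the cluster ads.
    Requirements = (TARGET.FreeSlots > 0)
    Rank         = (TARGET.FreeSlots - TARGET.WaitingJobs)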

Page 14: OSG Requirements

OSG Job flow:

User pre-stages applications and data via GridFTP/srmcp to shared areas on the cluster (these can be NFS or an SRM-based storage element).

User submits a set of jobs to cluster

Jobs take applications and data from cluster-wide shared directories.

Results are written to local storage on cluster, then transferred across WAN
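A condensed sketch of that flow from the user's side; the host names, paths, and the choice of globus-url-copy for the transfers are placeholders for whatever a given VO actually uses:

    # Illustrative OSG-style job flow; hosts and paths are placeholders.

    # 1. Pre-stage the application and data to the cluster's shared area.
    globus-url-copy file:///home/user/myapp.tar.gz \
        gsiftp://fngp-osg.fnal.gov/grid/app/myvo/myapp.tar.gz

    # 2. Submit the set of jobs (e.g. a Condor-G submit file; not shown).
    condor_submit myjobs.sub

    # 3. Pull the results back across the WAN when the jobs finish.
    globus-url-copy gsiftp://fngp-osg.fnal.gov/grid/data/myvo/results.tar.gz \
        file:///home/user/results.tar.gz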

Most OSG jobs expect common shared disk areas for applications, data, and user home directories; our clusters do not currently share such areas.

Most OSG jobs don't use MyProxy in their submission sequence.

OSG uses monitoring to detect free resources; ours are not currently reported correctly.

We need to make the gateway transparent to OSG so that it looks like any other OSG resource. Right now it reports only 4 CPUs.

We want to add the possibility of VO affinity to the gateway's ClassAd advertising.

Page 15: CEMon and matchmaking

FermiGrid will be the first large-scale deployment of the OSG Resource Selection Service. CEMon (a gLite package) is used to send ClassAds to a central info gatherer.

[Diagram: CEMon/GIP instances on the cluster gatekeepers (fngp-osg, cmsosgce, fcdfosg1) report to a central info gatherer and matchmaker (collector/negotiator), which is consulted by an interactive Condor client and by jobmanager-condorg on fermigrid1.]

See P. Mhashilkar's talk later in this conference.
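Since the info gatherer and matchmaker are an ordinary Condor collector and negotiator, the advertised resource ads can be inspected with condor_status; the host name below is a placeholder for the actual info-gatherer node:

    # Illustrative only: list the resource ClassAds held by the info gatherer.
    INFO_GATHERER=info-gatherer.fnal.gov   # placeholder host name
    condor_status -pool "$INFO_GATHERER" -long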

Page 16: Shared data areas and storage elements

At the moment OSG requires shared Application and Data areas

Also needed: a shared home directory area for all users (FermiGrid has 226).

It is planned to use a BlueArc NAS appliance to serve these to all the member clusters of FermiGrid. 24 TB of disk is in the process of being ordered; the NAS head is already in hand.

Also being commissioned: a shared volatile Storage Element for FermiGrid, supporting SRM/dCache access for all grid users.
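Access to such an SRM/dCache storage element from a grid client would look roughly like this srmcp sketch; the SE host name, port, and pnfs path are placeholders, since the actual endpoint is not given here:

    # Illustrative only: copy a result file into the shared volatile SE via SRM.
    SE_HOST=fermigrid-se.fnal.gov   # placeholder host name
    srmcp file:////home/user/results.tar.gz \
        "srm://${SE_HOST}:8443/pnfs/fnal.gov/data/myvo/results.tar.gz"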

Page 17: Getting rid of MyProxy

Configure each individual cluster gatekeeper to accept a restricted (limited) Globus proxy from just one host, the site gateway.

On the CDF clusters, for example, the gatekeeper is already restricted via tcp-wrappers to refuse any connections from off-site; it could be restricted further to take connections only from the GlideCAF head node and fermigrid1.

Then change the gatekeeper configuration to call it with the “accept_limited” option. We would then be able to forward jobs without MyProxy, and could call this jobmanager-condor rather than jobmanager-condorg. This has been tested on our test cluster and will move to production soon.
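The tcp-wrappers part of that restriction is just a hosts.allow/hosts.deny pair; a minimal sketch is shown below, assuming the gatekeeper is wrapped under a daemon name like globus-gatekeeper (the daemon name and the GlideCAF head node name are assumptions):

    # /etc/hosts.allow  (illustrative; daemon and host names are assumptions)
    globus-gatekeeper: fermigrid1.fnal.gov, glidecaf-head.fnal.gov

    # /etc/hosts.deny
    globus-gatekeeper: ALL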

Page 18: Reporting all resources

MonALISA: we just need a unified Ganglia view of all of FermiGrid and MonALISA will show the right number of CPUs, etc. We also need to make MonALISA query all the Condor pools in FermiGrid.

GridCat/ACDC: we have to change the Condor subroutines in MIS-CI to get the right total number of CPUs from the cluster ClassAds. Fairly straightforward.

GIP: we need to change the lcg-info-dynamic-condor script to report the right number of job slots per VO. We already had to do this once; not difficult.

Page 19: Globus Gatekeeper Calls

Page 20: VOMS access

Page 21: GUMS user mappings