Experience with Globus Online at Fermilab

Page 1: Experience with Globus Online at Fermilab

Experience with Globus Online at Fermilab
Computing Sector, Fermi National Accelerator Laboratory
GlobusWorld 2012, 4/12/12

Page 2: Experience with Globus Online at Fermilab

Overview
Integration of Workload Management and Data Movement Systems with GO
1. Center for Enabling Distributed Petascale Science (CEDPS): GO integration with glideinWMS
2. Data Handling prototype for the Dark Energy Survey (DES)
Performance tests of GO over 100 Gbps networks
3. GO on the Advanced Network Initiative (ANI) testbed
Data Movement on OSG for end users
4. Network for Earthquake Engineering Simulation (NEES)

Page 3: Experience with Globus Online at Fermilab

Fermilab’s interest in GO
Data Movement service for end users
○ Supporting user communities on the Grid
○ Evaluating GO services in the workflows of our stakeholders
Data Movement service integration
○ Evaluate GO as a component of middleware systems, e.g. Glidein Workload Management
○ Evaluate performance of GO for exa-scale networks (100 GE)

Page 4: Experience with Globus Online at Fermilab

1. CEDPS
CEDPS: a five-year project (2006-2011) funded by the Department of Energy (DOE)
Goals
○ Produce technical innovations for rapid and dependable data placement within a distributed high-performance environment, and for the construction of scalable science services for data and computing from many clients
○ Address performance and functionality troubleshooting of these and other related distributed activities
Collaborative Research
○ Mathematics & Computer Science Division, Argonne National Laboratory
○ Computing Division, Fermi National Accelerator Laboratory
○ Lawrence Berkeley National Laboratory
○ Information Sciences Institute, University of Southern California
○ Dept of Computer Science, University of Wisconsin Madison
Collaborative work done by Fermi National Lab, Argonne National Lab, and the University of Wisconsin
○ Supporting the integration of data movement mechanisms with the scientific Glidein workload management system
○ Integration of asynchronous data stage-out mechanisms in overlay workload management systems

Page 5: Experience with Globus Online at Fermilab

glideinWMS
Pilot-based WMS that creates on demand a dynamically-sized overlay Condor batch system on Grid resources, to address the complex needs of VOs in running application workflows
User Communities
○ CMS
○ Communities at Fermilab
  ○ CDF
  ○ DZero
  ○ Intensity Frontier Experiments (Minos, Minerva, Nova, …)
○ OSG Factory at UCSD & Indiana Univ
  ○ Serves OSG VO Frontends, including ICECube, Engage, LSST, …
○ CoralWMS: frontend for the TeraGrid community
○ Atlas: evaluating glideinWMS interfaced with the Panda framework for their analysis framework
○ User community growing rapidly

Page 6: Experience with Globus Online at Fermilab

glideinWMS Scale of Operations
[Monitoring plots: CMS Production Factory (up) & Frontend at CERN; OSG Factory & CMS Analysis at UCSD]
○ CMS Factory@CERN serving ~400K jobs
○ OSG Factory@UCSD serving ~200K jobs
○ CMS Frontend@CERN serving pool with ~50K jobs
○ CMS Analysis Frontend@UCSD serving pool with ~25K jobs

Page 7: Experience with Globus Online at Fermilab

Integrating glideinWMS with GO
Goals:
○ Middleware handles data movement, rather than the application
○ Middleware optimizes the use of computing resources (CPUs do not block on data movement)
○ Users provide data movement directives in the Job Description File (e.g. storage services for IO)
How it works:
○ glideinWMS procures resources on the Grid and runs jobs using Condor
○ Data movement is delegated to the underlying Condor system
○ globusconnect is instantiated and the GO plug-in is invoked using the directives in the JDF (a sketch of this plugin flow follows the diagram below)
○ Condor optimizes resources

[Architecture diagram: VO infrastructure (Condor Scheduler, glideinWMS Glidein Factory, WMS Pool, VO Frontend, Condor Central Manager) dispatching jobs to glideins (Condor Startd) on Grid-site worker nodes, with data movement through globusonline.org]
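The plugin flow above can be made concrete with a short sketch. This is not the production plugin: the GO CLI command names (endpoint-add, endpoint-activate, scp, endpoint-remove) and globusconnect -setup are taken from the test analysis on the next slide, while the arguments, the setup-key handshake, and the per-job endpoint naming are assumptions.

```python
#!/usr/bin/env python
# Illustrative sketch of a Condor file-transfer plugin that ships a job's
# output sandbox through Globus Online from a Grid worker node.
import subprocess
import sys
import uuid

def go(*args):
    # GO exposed its command line over ssh to cli.globusonline.org.
    return subprocess.check_output(["ssh", "cli.globusonline.org"] + list(args))

def main(src_path, dest_url):
    ep = "wn-%s" % uuid.uuid4().hex[:8]          # one endpoint per job/worker node
    key = go("endpoint-add", "--gc", ep).strip() # assumed: prints a setup key
    subprocess.check_call(["globusconnect", "-setup", key])  # pair local GC
    go("endpoint-activate", ep)                  # delegate the job's credential
    try:
        go("scp", "%s:%s" % (ep, src_path), dest_url)  # the actual transfer
    finally:
        go("endpoint-remove", ep)                # tear the endpoint down

if __name__ == "__main__":
    main(sys.argv[1], sys.argv[2])  # Condor passes source and destination
```

Reusing one endpoint across a job's files, instead of this per-job setup and teardown, is exactly the efficiency improvement proposed on the next slide.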

Page 8: Experience with Globus Online at Fermilab

Validation Test Results
Tests
○ Modified Intensity Frontier experiment (Minerva) jobs to transfer the output sandbox to a GO endpoint using the transfer plugin
○ Jobs: 2636, with 500 running at a time
○ Total files transferred: 16359
○ Up to 500 dynamically created GO endpoints at a given time
Lessons Learned
○ Integration tests successful, with a 95% transfer success rate, stressing the scalability of GO in an unintended way
○ GO team working on the scalability issues identified
○ Efficiency and scalability can be increased by modifying the plugin to reuse GO endpoints and by transferring multiple files at the same time

Plugin Exit Status (analysis based on the exit code of the plugin in the logs):

  Exit Code 0 (Success)    14094   (86%)
  Exit Code 1 (Failure)       46    (0%)
  Abnormal Termination      2234   (14%)
  Total                    16374

○ Plugin normal terminations: 16374
○ Duplicate terminations in the same log file (ignored): 15
○ Total plugin log files analyzed: 16359

Plugin Abnormal Termination (analysis based on the last action tried by the plugin; the status of the action was never reported back to the plugin):

  scp (transfers succeeded)   1500   (67%)
  scp (transfers failed)       368   (17%)
  endpoint-activate             49    (2%)
  endpoint-add                  50    (2%)
  endpoint-remove               53    (2%)
  globusconnect -setup         135    (6%)
  Other                         79    (4%)
  Total                       2234
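For illustration, tallies like the ones above could come from a small scan over the plugin logs; the log format and the phrases matched here are hypothetical, and only the classification logic (exit code when one was recorded, otherwise the last GO action attempted) follows the analysis described on this slide.

```python
# Hypothetical log scan reproducing the two tables: count recorded exit
# codes, and classify abnormal terminations by the last action attempted.
import collections
import glob
import re

counts = collections.Counter()
for path in glob.glob("plugin-logs/*.log"):
    text = open(path).read()
    m = re.search(r"plugin exit code: (\d+)", text)  # assumed log line
    if m:
        counts["exit code %s" % m.group(1)] += 1
    else:
        # No exit code recorded: abnormal termination. Attribute it to the
        # last GO action the plugin logged before dying.
        actions = re.findall(r"^running: (\S+)", text, re.M)  # assumed format
        counts[actions[-1] if actions else "other"] += 1

for status, n in counts.most_common():
    print("%-22s %6d" % (status, n))
```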

Page 9: Experience with Globus Online at Fermilab

2. Prototype integration of GO with the DES Data Access Framework
Motivation
○ Support Dark Energy Survey preparation for data taking (see Don Petravick’s talk on Wed)
○ The DES Data Access Framework (DAF) uses a network of GridFTP servers to reliably move data across sites
In Mar 2011, we investigated the integration of DAF with GO to address 2 issues:
1. DAF data transfer parameters were not optimal for both small and large files.
2. Reliability was implemented inefficiently, by sequentially verifying the real file size against the DB catalogue.

Page 10: Experience with Globus Online at Fermilab

Results and improvements
Tested DAF moving 31,000 files (184 GB) with GO vs. UberFTP
Results
○ Time for Transfer + Verification is the same (~100 min)
○ Transfer time is 27% faster with GO than with UberFTP
○ Verification time is 50% slower with GO than sequentially with UberFTP
Proposed Improvements (a sketch of the verification step follows):
○ Allow specification of src/dest transfer reliability semantics (e.g. same size, same CRC, etc.); implemented for size
○ Allow a finer-grain failure model (e.g. specify the number of transfer retries instead of a time deadline)
○ Provide an interface for efficient (pipelined) ls of src/dest files
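The verification step in question compares the file size recorded in the DB catalogue with the size of the file that actually landed at the destination. A minimal sketch of running those checks in parallel rather than one by one, with a hypothetical catalogue shape:

```python
# Parallel size verification: catalogue is a list of
# (destination_path, size_from_DB_catalogue) pairs.
import os
from multiprocessing.dummy import Pool  # thread pool; the checks are IO-bound

def verify(entry):
    path, expected_size = entry
    try:
        return path, os.stat(path).st_size == expected_size
    except OSError:
        return path, False  # missing file counts as a failed transfer

def files_to_retry(catalogue, workers=32):
    # Overlap the stat() calls instead of issuing them sequentially.
    return [path for path, ok in Pool(workers).map(verify, catalogue) if not ok]
```

A pipelined ls on the GridFTP servers, as proposed above, would serve the same purpose without one round trip per file.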

Page 11: Experience with Globus Online at Fermilab

3. GO on the ANI Testbed
• Motivation: testing Grid middleware readiness to interface 100 Gbit/s links on the Advanced Network Initiative (ANI) Testbed
• Characteristics:
  • GridFTP data transfers (small, medium, large, all sizes)
  • 300 GB of data split into 42432 files (8 KB – 8 GB)
  • Network: aggregate 3 x 10 Gbit/s to the bnl-1 test machine
• Local tests (reference) initiated on bnl-1
• FNAL and GO tests: initiated on the “FNAL initiator”; GridFTP control forwarded through the “VPN gateway”
Work by Dave Dykstra with contributions by Raman Verma & Gabriele Garzoglio

Page 12: Experience with Globus Online at Fermilab

Test results
○ GO (yellow) does almost as well as the practical maximum (red) for medium-size files. Working with GO to improve transfer parameters for big and small files.
○ Small files have very high overhead over wide-area control channels
○ GO auto-tuning works better for medium files than for large files
○ Counterintuitively, increasing concurrency and pipelining on small files reduced the transfer throughput (see the tuning sketch below)
Work by Dave Dykstra with contributions by Raman Verma & Gabriele Garzoglio
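For reference, these are the knobs being tuned, shown through globus-url-copy, the standard GridFTP client used for the reference runs; the endpoint URLs and the sweep are illustrative, not the actual test harness.

```python
# -p sets parallel TCP streams per file, -cc sets how many files are in
# flight concurrently, -pp pipelines commands on the control channel.
import subprocess
import time

def timed_copy(src, dst, parallel=4, concurrent=8, pipeline=True):
    cmd = ["globus-url-copy", "-fast", "-r",
           "-p", str(parallel),      # parallel data streams per file
           "-cc", str(concurrent)]   # concurrent file transfers
    if pipeline:
        cmd.append("-pp")            # pipeline the control channel
    t0 = time.time()
    subprocess.check_call(cmd + [src, dst])
    return time.time() - t0

# Sweep concurrency on the small-file set. On the ANI testbed, raising
# concurrency/pipelining for small files *reduced* throughput, since
# per-file control-channel overhead dominates over the wide area.
for cc in (1, 4, 16, 64):
    dt = timed_copy("gsiftp://bnl-1.example/data/small/",
                    "gsiftp://fnal-initiator.example/scratch/",
                    concurrent=cc)
    print("cc=%-3d %.1f s" % (cc, dt))
```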

Page 13: Experience with Globus Online at Fermilab

4. Data Movement on OSG for NEES
Motivation
○ Supporting the NEES group at UCSD to run computations on the Open Science Grid (OSG)
Goal
○ Perform parametric studies that involve large-scale nonlinear models of structure or soil-structure systems, with a large number of parameters and OpenSees runs
Application example
○ Nonlinear time-history (NLTH) analyses of an advanced nonlinear finite element (FE) model of a building
○ Probabilistic seismic demand hazard analysis making use of the “cloud method”: 90 bi-directional historical earthquake records
○ Sensitivity of probabilistic seismic demand to FE model parameters

A. R. Barbosa, J. P. Conte, J. I. Restrepo, UCSD
30 days on OSG vs. 12 years on a desktop

Page 14: Experience with Globus Online at Fermilab

Success and challenges
○ Jobs submitted from RENCI (NC) to ~20 OSG sites. Output collected at RENCI.
○ NEES scientist moved 12 TB from the RENCI server to the user’s desktop at UCSD using GO
○ Operations: every day, set up the data transfer update for the day: fire and forget (a sketch of such a daily transfer follows below)
  …almost… there is still no substitute for a good network administrator
○ Initially we had 5 Mbps, eventually 200 Mbps (over a 600 Mbps link). Improvements:
  ○ Upgrade the ethernet card on the user desktop
  ○ Migrate from Windows to Linux
  ○ Work with the user to use GO
  ○ Find a good net admin to find and fix broken fiber at RENCI, when nothing else worked
○ Better use of GO on OSG: integrate GO with the Storage Resource Manager (SRM)
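A minimal sketch of that daily fire-and-forget operation, assuming the GO ssh CLI of the period and hypothetical endpoint names; the exact flags are an assumption:

```python
# Submit one recursive GO transfer per day and let GO handle retries and
# verification; the operator only submits the task (e.g. from a cron job).
import subprocess

SRC = "renci#server/data/nees/"   # hypothetical GO endpoint#path
DST = "ucsd#desktop/~/nees/"

def daily_transfer():
    subprocess.check_call(
        ["ssh", "cli.globusonline.org",
         "transfer", "--", SRC, DST, "-r"])

if __name__ == "__main__":
    daily_transfer()
```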

Page 15: Experience with Globus Online at Fermilab

Conclusions
Fermilab has worked with the GO team to improve the system for several use cases:
○ Integration with glidein Workload Management: stress the “many-globusconnect” dimension
○ Integration with Data Handling for DES: new requirements on reliability semantics
○ Evaluation of performance over 100 Gbps networks: verify transfer-parameter auto-tuning at extreme scale
○ Integration of GO with NEES for regular operations on OSG: usability for GO’s intended usage

Page 16: Experience with Globus Online at Fermilab

Acknowledgments
○ The Globus Online team, for their support in all of these activities
○ Integration of glideinWMS and globusonline.org was done as part of the CEDPS project
○ The glideinWMS infrastructure is developed at Fermilab in collaboration with the Condor team from Wisconsin and High Energy Physics experiments
○ Most of the glideinWMS development work is funded by the USCMS (part of CMS) experiment
○ glideinWMS is currently used in production by CMS, CDF, DZero, MINOS, and ICECube, with several other VOs evaluating it for their use cases
○ The Open Science Grid (OSG)
○ Fermilab is operated by Fermi Research Alliance, LLC under Contract No. DE-AC02-07CH11359 with the United States Department of Energy

Page 17: Experience with Globus Online at Fermilab

References
1. CEDPS Report: GO Stress Test Analysis
   https://cd-docdb.fnal.gov:440/cgi-bin/RetrieveFile?docid=4474;filename=GlobusOnline%20PluginAnalysisReport.pdf;version=1
2. DES DAF Integration with GO
   https://www.opensciencegrid.org/bin/view/Engagement/DESIntegrationWithGlobusonline
3. GridFTP & GO on the ANI Testbed
   https://docs.google.com/document/d/1tFBg7QVVFu8AkUt5ico01vXcFsgyIGZH5pqbbGeI7t8/edit?hl=en_US&pli=1
4. OSG User Support of NEES
   https://www.opensciencegrid.org/bin/view/Engagement/EngageOpenSeesProductionDemo