PERFORMANCE AND ANALYSIS WORKFLOW ISSUES
US ATLAS Distributed Facility Workshop 13-14 November 2012 , Santa Cruz
IMPORTANCE OF ANALYSIS JOBS
The number of analysis jobs is increasing.
Production jobs are mostly CPU-limited, well controlled, hopefully optimized, and can be monitored through other, already existing systems.
About analysis jobs we know very little; they could be inefficient and wreak havoc on storage elements and networks. They have twice the failure rate of production jobs.
13/11/2012 Ilija Vukotic [email protected]
[Bar chart: job counts (scale 0-2,500,000) at Tier0, Tier1, Tier2, and Tier3, broken down into managed failed, managed finished, user failed, and user finished.]
ANALYSIS QUEUES PERFORMANCE
Idea: find out what the performance of ATLAS analysis jobs on the grid actually is. There is no framework that everybody uses and that could be instrumented. Understand the numbers: each site has its hard limits in terms of storage, CPUs, network, and software. Improve: ATLAS software; ATLAS files and the way we use them; sites' configurations.
Requirements: a monitoring framework; tests as simple, realistic, accessible, and versatile as possible; running on most of the resources we have; fast turnaround; test codes that are the "recommended way to do it"; a web interface for the most important indicators.
TEST FRAMEWORK
HammerCloud drives the tests: SVN holds the configuration and test scripts, results go to an Oracle DB at CERN, and a web site visualizes them.
Continuous tests: job performance (generic ROOT IO scripts, realistic analysis jobs), site performance, site optimization.
One-off tests: new releases (Athena, ROOT), new features, fixes.
All T2D sites (currently 42 sites). Large number of monitored parameters. Central database. Wide range of visualization tools.
TEST FRAMEWORK
Pilot numbers are obtained from the PanDA DB.
5-50 jobs per day per site. Each job runs at least 24 tests: 5 read modes + 1 full analysis job, over 4 different files.
Takes data on machine status. Cross-referenced to the PanDA DB. Currently 2 million results in the DB.
WEB site: http://ivukotic.web.cern.ch/ivukotic/HC/index.asp
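The test matrix described above (5 read modes plus 1 full analysis job, over 4 files, giving at least 24 tests per job) can be sketched as follows; the mode and file labels are illustrative assumptions, not the actual HammerCloud configuration.

```python
# Sketch of the per-job test matrix: 6 test types x 4 files = 24 tests.
# Labels are hypothetical placeholders, not the real test names.
from itertools import product

TEST_TYPES = ["read-mode-1", "read-mode-2", "read-mode-3",
              "read-mode-4", "read-mode-5", "full-analysis"]
FILES = ["file-A", "file-B", "file-C", "file-D"]  # 4 different input files

def build_test_matrix():
    """Return one (test type, file) pair per combination run by each job."""
    return list(product(TEST_TYPES, FILES))

print(len(build_test_matrix()))  # 24
```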
SUMMARY RESULTS
Setup times
[Bar chart: setup time in seconds (0-160) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing the 1st week of July with the 1st week of November.]
SUMMARY RESULTS
Stage-in
Stage-in time per site and period, in seconds; the access mode used in each period is shown in the header, and GAIN is the improvement between the first and the best later period:

Site            6 Jun-    27 Jun-   10 Jul-   16 Aug-       24 Aug-         GAIN
                27 Jun    9 Jul     16 Aug    24 Aug        24 Sep
                (free)    (copy2    (direct)  (direct       (copy2
                          scratch)            with fix)     scratch)
AGLT2              5        257        5          5            52              0
BNL              333        339        7          9           109            224
CERN               4        294        4          5            67             -1
HU               326        337        8          8            84            242
MWT2               7        297        6          7            40              0
NET2             343        331        5          6           104            239
OU_OCHEP_SWT2    261        262      266         57            60            201
SLAC               9        305       24         14            56             -5
SWT2_CPB           8        374       23          7           278              1

Space for improvement: 60 s = 41 MB/s.
The Fix (see appendix).
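The "60 s = 41 MB/s" note above can be checked with simple arithmetic: 60 seconds of wasted stage-in time costs the same wall time as transferring a file at 41 MB/s. The implied file size is an inference from those two numbers, not a figure stated on the slides.

```python
# Back-of-the-envelope check: how much data could move in the wasted time?
def implied_file_size_mb(seconds, rate_mb_per_s):
    """Volume transferable in `seconds` at `rate_mb_per_s`."""
    return seconds * rate_mb_per_s

print(implied_file_size_mb(60, 41))  # 2460 MB, i.e. roughly a 2.4 GB input file
```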
SUMMARY RESULTS
Execution time
[Bar chart: execution time [s] (0-3000) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing 27th Jun-9th July (copy2scratch) with 16th Oct-5th Nov (direct). One site shows 0 for direct access: GPFS not mounted, so it can't run in direct mode.]
SUMMARY RESULTS
Stage-out
[Bar chart: stage-out time in seconds (0-180) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing "before" with 29th Jun-31st July.]
SUMMARY RESULTS
Total time = setup + stage-in + execution + stage-out [s], as measured by the pilot.
[Bar chart: total time (0-3000 s) per site, comparing copy2scratch with direct access.]
SUMMARY – GOING DEEPER
CPU efficiency: measured over the event loop only, defined as CPU time / wall time. Keep in mind that a very slow machine can have very high CPU efficiency. All you want to do is make it as high as possible.
FACTS:
1. Unless doing bootstrapping or some unusual calculation, the user's code is negligible compared to unzipping.
2. ROOT can unzip at 40 MB/s.
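The CPU-efficiency metric defined above can be illustrated with a minimal sketch: wrap a piece of work, and divide the CPU time it consumed by the wall time it took. This is an illustration of the definition, not the instrumentation used by the test framework.

```python
# Minimal sketch of the metric: CPU efficiency = CPU time / wall time.
# A CPU-bound loop scores close to 1.0; an I/O-stalled job scores much lower.
import os
import time

def cpu_efficiency(work):
    """Run `work()` and return CPU time / wall time for that call."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    work()
    cpu_end = os.times()
    wall = time.perf_counter() - wall_start
    cpu = (cpu_end.user - cpu_start.user) + (cpu_end.system - cpu_start.system)
    return cpu / wall if wall > 0 else 0.0

# A pure-CPU workload; the result should be close to 1.0 on an idle machine.
eff = cpu_efficiency(lambda: sum(i * i for i in range(10**6)))
print(round(eff, 2))
```

Note the caveat from the slide: a very slow machine spends proportionally less time waiting on I/O, so it can report a high efficiency while still being slow in absolute terms.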
SUMMARY – GOING DEEPER
CPU efficiency
[Bar chart: CPU efficiency in percent (0-100) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing 24th Aug-24 Sep (copy2scratch) with 16th Oct-5th Nov (direct); direct-access sites are marked.]
GOING DEEPER – THE CASE OF SWITCH STACKING
Test files are local to both the UC and IU sites. The lower band is IU.
Only part of the machines are affected (the best ones).
We check CPU efficiency vs. load, network in/out, memory, and swap.
GOING DEEPER – THE CASE OF SWITCH STACKING
The machines can do much better, as seen in copy2scratch mode. A drained node is as bad as a busy one. Manual checks show connections to servers well below 1 Gbps.
Stack performance depends on its (software) configuration and on what is connected where. Optimal switch stacking is not exactly trivial. I suspect a lot of sites have the same issues; NET2 and BNL show very similar patterns. This will be investigated to the bottom.
Finally: two big issues were discovered; that alone was worth the effort. A bunch of smaller problems with queues and misconfigurations were found and solved.
FUTURE
Fixing remaining issues. Investigating virtual queues. Per-site web interface. Automatic procedure to follow performance. Automatic mailing. Investigating non-US sites.
WORKFLOW ISSUES
For most users this is the workflow:
Skimming/slimming data: usually prun and no complex code; often filter_and_merge.py.
Merging data: only part of the people do it; it is unclear how to do it on the grid; moving small files around is very inefficient.
Getting data locally: DaTRI requests to the USA are processed slowly; most people use dq2-get.
Storing it locally: not much space in Tier-3s; accessing data from localgroupdisk.
Analyzing data: mostly local queues, rarely PROOF; people are willing to wait a few hours and manually merge results.
SLIM SKIM SERVICE
Idea: establish a service to which users submit the parameters of their skim & slim job; it opportunistically uses CPUs, with FAX as the data source, and delivers an optimized dataset.
Practically: a WebUI to submit requests and follow job progress, with an Oracle DB for the backend. Currently UC3 will be used for processing; output data will be dq2-put into MWT2.
Work has started. Performance and turnaround time are what will make or break this service.
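A request to the service described above might be captured as a record like the following; every field name here is a hypothetical illustration (the slides only specify a WebUI for submission and progress, an Oracle DB backend, UC3 for processing, and dq2-put to MWT2).

```python
# Hypothetical sketch of a Slim/Skim Service request record.
# All field names are assumptions for illustration, not the real schema.
from dataclasses import dataclass

@dataclass
class SkimRequest:
    user: str
    input_dataset: str         # dataset read opportunistically through FAX
    branches_to_keep: list     # "slim": drop all other branches
    selection: str             # "skim": event-level cut expression
    status: str = "SUBMITTED"  # progress is followed through the WebUI

req = SkimRequest(
    user="grid_user",
    input_dataset="some.input.dataset",
    branches_to_keep=["el_pt", "el_eta"],
    selection="el_pt > 20000",
)
print(req.status)  # SUBMITTED
```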
APPENDIX
The Fix: timer_command.py is part of the pilot3 code and is used very often in all of the transforms. It serves to start any command as a subprocess and kill it if it has not finished before a given timeout. Not exactly trivial. For some commands it was waiting 60 seconds even when the command had already finished. It was also trying to close all possible file descriptors before executing the child process, which could take from 0.5 s to a few tens of seconds depending on the site's settings. Fixed in the latest pilot version.
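The behavior described above (run a command, kill it on timeout, return as soon as it finishes rather than after a fixed wait) can be sketched with the standard library; this is an illustrative sketch, not the actual pilot3 timer_command.py code.

```python
# Sketch of a run-with-timeout helper: start a command as a subprocess,
# return as soon as it exits, and kill it if it exceeds the timeout.
# Not the real pilot code; just the technique the fix restores.
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return (returncode, stdout). On timeout, kill and return (None, "")."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    try:
        # communicate() returns the moment the command exits — no fixed 60 s wait.
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out.decode()
    except subprocess.TimeoutExpired:
        proc.kill()   # enforce the timeout instead of waiting forever
        proc.wait()   # reap the killed child
        return None, ""

rc, out = run_with_timeout(["echo", "done"], timeout_s=5)
print(rc, out.strip())  # 0 done
```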
Total effect estimate: a quarter of computing time is spent on analysis jobs; the average analysis job takes less than 30 minutes; the fix speeds up a job by 3 minutes on average, i.e. 10%. Applied to 40 Tier-2s, the fix is equivalent to adding one full Tier-2 of capacity.
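The arithmetic behind that estimate: a 10% per-job speedup applied to the quarter of computing time spent on analysis frees 2.5% of total capacity, which across 40 Tier-2s amounts to one whole site.

```python
# Capacity freed by the fix, expressed in units of whole Tier-2 sites.
def equivalent_tier2s(n_sites, analysis_fraction, job_speedup):
    """n_sites * (fraction of time on analysis) * (fractional job speedup)."""
    return n_sites * analysis_fraction * job_speedup

speedup = 3.0 / 30.0  # 3 minutes saved on an average ~30-minute job = 10%
print(equivalent_tier2s(40, 0.25, speedup))  # 1.0 — one full Tier-2 of capacity
```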