PERFORMANCE AND ANALYSIS WORKFLOW ISSUES
US ATLAS Distributed Facility Workshop 13-14 November 2012 , Santa Cruz
IMPORTANCE OF ANALYSIS JOBS
The number of analysis jobs is increasing.
Production jobs are mostly CPU-limited, well controlled, hopefully optimized, and can be monitored through other, already existing systems.
About analysis jobs we know very little; they could be inefficient and wreak havoc on storage elements and networks. They have twice the failure rate of production jobs.
13/11/2012 Ilija Vukotic [email protected]
[Bar chart: job counts (scale 0-2,500,000) at Tier0, Tier1, Tier2, and Tier3, broken down into managed failed, managed finished, user failed, and user finished.]
ANALYSIS QUEUES PERFORMANCE
Idea: find out what the performance of ATLAS analysis jobs on the grid actually is. There is no framework that everybody uses and that could be instrumented. Understand the numbers: each site has its hard limits in terms of storage, CPUs, network, and software. Improve: ATLAS software; ATLAS files and the way we use them; sites' configurations.
Requirements: a monitoring framework; tests as simple, realistic, accessible, and versatile as possible; running on most of the resources we have; fast turnaround; test codes that are the "recommended way to do it"; a web interface for the most important indicators.
TEST FRAMEWORK
HammerCloud drives the tests: SVN holds the configuration and test scripts, results go to an Oracle DB at CERN, and a web site visualizes them.
Continuous tests: job performance (generic ROOT IO scripts, realistic analysis jobs), site performance, site optimization.
One-off tests: new releases (Athena, ROOT), new features, fixes.
All T2D sites (currently 42 sites). Large number of monitored parameters. Central database. Wide range of visualization tools.
TEST FRAMEWORK
Pilot numbers are obtained from the PanDA DB.
5-50 jobs per day per site. Each job runs at least 24 tests: 5 read modes + 1 full analysis job, over 4 different files.
Takes data on machine status. Cross-referenced to the PanDA DB. Currently 2 million results in the DB.
WEB site: http://ivukotic.web.cern.ch/ivukotic/HC/index.asp
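The test matrix described above (5 read modes plus 1 full analysis job, over 4 files, giving at least 24 tests per job) can be sketched as follows; the mode and file labels are illustrative assumptions, not the actual HammerCloud configuration.

```python
# Sketch of the per-job test matrix: 6 test types x 4 files = 24 tests.
# Labels are hypothetical placeholders, not the real test names.
from itertools import product

TEST_TYPES = ["read-mode-1", "read-mode-2", "read-mode-3",
              "read-mode-4", "read-mode-5", "full-analysis"]
FILES = ["file-A", "file-B", "file-C", "file-D"]  # 4 different input files

def build_test_matrix():
    """Return one (test type, file) pair per combination run by each job."""
    return list(product(TEST_TYPES, FILES))

print(len(build_test_matrix()))  # 24
```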
SUMMARY RESULTS
Setup times
[Bar chart: setup time in seconds (0-160) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing the 1st week of July with the 1st week of November.]
SUMMARY RESULTS
Stage-in
Stage-in time per site and period, in seconds; the access mode used in each period is shown in the header, and GAIN is the improvement between the first and the best later period:

Site            6 Jun-    27 Jun-   10 Jul-   16 Aug-       24 Aug-         GAIN
                27 Jun    9 Jul     16 Aug    24 Aug        24 Sep
                (free)    (copy2    (direct)  (direct       (copy2
                          scratch)            with fix)     scratch)
AGLT2              5        257        5          5            52              0
BNL              333        339        7          9           109            224
CERN               4        294        4          5            67             -1
HU               326        337        8          8            84            242
MWT2               7        297        6          7            40              0
NET2             343        331        5          6           104            239
OU_OCHEP_SWT2    261        262      266         57            60            201
SLAC               9        305       24         14            56             -5
SWT2_CPB           8        374       23          7           278              1

Space for improvement: 60 s = 41 MB/s.
The Fix (see appendix).
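The "60 s = 41 MB/s" note above can be checked with simple arithmetic: 60 seconds of wasted stage-in time costs the same wall time as transferring a file at 41 MB/s. The implied file size is an inference from those two numbers, not a figure stated on the slides.

```python
# Back-of-the-envelope check: how much data could move in the wasted time?
def implied_file_size_mb(seconds, rate_mb_per_s):
    """Volume transferable in `seconds` at `rate_mb_per_s`."""
    return seconds * rate_mb_per_s

print(implied_file_size_mb(60, 41))  # 2460 MB, i.e. roughly a 2.4 GB input file
```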
SUMMARY RESULTS
Execution time
[Bar chart: execution time [s] (0-3000) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing 27th Jun-9th July (copy2scratch) with 16th Oct-5th Nov (direct). One site shows 0 for direct access: GPFS not mounted, so it can't run in direct mode.]
SUMMARY RESULTS
Stage-out
[Bar chart: stage-out time in seconds (0-180) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing "before" with 29th Jun-31st July.]
SUMMARY RESULTS
Total time = setup + stage-in + execution + stage-out [s], as measured by the pilot.
[Bar chart: total time (0-3000 s) per site, comparing copy2scratch with direct access.]
SUMMARY – GOING DEEPER
CPU efficiency: measured over the event loop only, defined as CPU time / wall time. Keep in mind that a very slow machine can have very high CPU efficiency. All you want to do is make it as high as possible.
FACTS:
1. Unless doing bootstrapping or some unusual calculation, the user's code is negligible compared to unzipping.
2. ROOT can unzip at 40 MB/s.
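The CPU-efficiency metric defined above can be illustrated with a minimal sketch: wrap a piece of work, and divide the CPU time it consumed by the wall time it took. This is an illustration of the definition, not the instrumentation used by the test framework.

```python
# Minimal sketch of the metric: CPU efficiency = CPU time / wall time.
# A CPU-bound loop scores close to 1.0; an I/O-stalled job scores much lower.
import os
import time

def cpu_efficiency(work):
    """Run `work()` and return CPU time / wall time for that call."""
    wall_start = time.perf_counter()
    cpu_start = os.times()
    work()
    cpu_end = os.times()
    wall = time.perf_counter() - wall_start
    cpu = (cpu_end.user - cpu_start.user) + (cpu_end.system - cpu_start.system)
    return cpu / wall if wall > 0 else 0.0

# A pure-CPU workload; the result should be close to 1.0 on an idle machine.
eff = cpu_efficiency(lambda: sum(i * i for i in range(10**6)))
print(round(eff, 2))
```

Note the caveat from the slide: a very slow machine spends proportionally less time waiting on I/O, so it can report a high efficiency while still being slow in absolute terms.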
SUMMARY – GOING DEEPER
CPU efficiency
[Bar chart: CPU efficiency in percent (0-100) per site — AGLT2, BNL, CERN, HU, MWT2, NET2, OU_OCHEP_SWT2, SLAC, SWT2_CPB — comparing 24th Aug-24 Sep (copy2scratch) with 16th Oct-5th Nov (direct); direct-access sites are marked.]
GOING DEEPER – THE CASE OF SWITCH STACKING
Test files are local to both the UC and IU sites. The lower band is IU.
Only part of the machines are affected (the best ones).
We check CPU efficiency vs. load, network in/out, memory, and swap.
GOING DEEPER – THE CASE OF SWITCH STACKING
The machines can do much better, as seen in copy2scratch mode. A drained node is as bad as a busy one. Manual checks show connections to servers well below 1 Gbps.
Stack performance depends on its (software) configuration and on what is connected where. Optimal switch stacking is not exactly trivial. I suspect a lot of sites have the same issues; NET2 and BNL show very similar patterns. This will be investigated to the bottom.
Finally: two big issues were discovered; that alone was worth the effort. A bunch of smaller problems with queues and misconfigurations were found and solved.
FUTURE
Fixing remaining issues. Investigating virtual queues. Per-site web interface. Automatic procedure to follow performance. Automatic mailing. Investigating non-US sites.
WORKFLOW ISSUES
For most users this is the workflow:
Skimming/slimming data: usually prun and no complex code; often filter_and_merge.py.
Merging data: only part of the people do it; it is unclear how to do it on the grid; moving small files around is very inefficient.
Getting data locally: DaTRI requests to the USA are processed slowly; most people use dq2-get.
Storing it locally: not much space in Tier-3s; accessing data from localgroupdisk.
Analyzing data: mostly local queues, rarely PROOF; people are willing to wait a few hours and manually merge results.
SLIM SKIM SERVICE
Idea: establish a service to which users submit the parameters of their skim & slim job; it opportunistically uses CPUs, with FAX as the data source, and delivers an optimized dataset.
Practically: a WebUI to submit requests and follow job progress, with an Oracle DB for the backend. Currently UC3 will be used for processing; output data will be dq2-put into MWT2.
Work has started. Performance and turnaround time are what will make or break this service.
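A request to the service described above might be captured as a record like the following; every field name here is a hypothetical illustration (the slides only specify a WebUI for submission and progress, an Oracle DB backend, UC3 for processing, and dq2-put to MWT2).

```python
# Hypothetical sketch of a Slim/Skim Service request record.
# All field names are assumptions for illustration, not the real schema.
from dataclasses import dataclass

@dataclass
class SkimRequest:
    user: str
    input_dataset: str         # dataset read opportunistically through FAX
    branches_to_keep: list     # "slim": drop all other branches
    selection: str             # "skim": event-level cut expression
    status: str = "SUBMITTED"  # progress is followed through the WebUI

req = SkimRequest(
    user="grid_user",
    input_dataset="some.input.dataset",
    branches_to_keep=["el_pt", "el_eta"],
    selection="el_pt > 20000",
)
print(req.status)  # SUBMITTED
```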
APPENDIX
The Fix: timer_command.py is part of the pilot3 code and is used very often in all of the transforms. It serves to start any command as a subprocess and kill it if it has not finished before a given timeout. Not exactly trivial. For some commands it was waiting 60 seconds even when the command had already finished. It was also trying to close all possible file descriptors before executing the child process, which could take from 0.5 s to a few tens of seconds depending on the site's settings. Fixed in the latest pilot version.
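The behavior described above (run a command, kill it on timeout, return as soon as it finishes rather than after a fixed wait) can be sketched with the standard library; this is an illustrative sketch, not the actual pilot3 timer_command.py code.

```python
# Sketch of a run-with-timeout helper: start a command as a subprocess,
# return as soon as it exits, and kill it if it exceeds the timeout.
# Not the real pilot code; just the technique the fix restores.
import subprocess

def run_with_timeout(cmd, timeout_s):
    """Run `cmd`; return (returncode, stdout). On timeout, kill and return (None, "")."""
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT)
    try:
        # communicate() returns the moment the command exits — no fixed 60 s wait.
        out, _ = proc.communicate(timeout=timeout_s)
        return proc.returncode, out.decode()
    except subprocess.TimeoutExpired:
        proc.kill()   # enforce the timeout instead of waiting forever
        proc.wait()   # reap the killed child
        return None, ""

rc, out = run_with_timeout(["echo", "done"], timeout_s=5)
print(rc, out.strip())  # 0 done
```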
Total effect estimate: a quarter of computing time is spent on analysis jobs; the average analysis job takes less than 30 minutes; the fix speeds up a job by 3 minutes on average, i.e. 10%. Applied to 40 Tier-2s, the fix is equivalent to adding one full Tier-2 of capacity.
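The arithmetic behind that estimate: a 10% per-job speedup applied to the quarter of computing time spent on analysis frees 2.5% of total capacity, which across 40 Tier-2s amounts to one whole site.

```python
# Capacity freed by the fix, expressed in units of whole Tier-2 sites.
def equivalent_tier2s(n_sites, analysis_fraction, job_speedup):
    """n_sites * (fraction of time on analysis) * (fractional job speedup)."""
    return n_sites * analysis_fraction * job_speedup

speedup = 3.0 / 30.0  # 3 minutes saved on an average ~30-minute job = 10%
print(equivalent_tier2s(40, 0.25, speedup))  # 1.0 — one full Tier-2 of capacity
```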