A Year of HTCondor at the RAL Tier-1. Ian Collier, Andrew Lahiff, STFC Rutherford Appleton Laboratory. HEPiX Spring 2014 Workshop.


Page 1

A Year of HTCondor at the RAL Tier-1

Ian Collier, Andrew Lahiff
STFC Rutherford Appleton Laboratory

HEPiX Spring 2014 Workshop

Page 2

Outline

• Overview of HTCondor at RAL
• Computing elements
• Multi-core jobs
• Monitoring

Page 3

Introduction

• RAL is a Tier-1 for all 4 LHC experiments
  – In terms of Tier-1 computing requirements, RAL provides:
    • 2% ALICE, 13% ATLAS, 8% CMS, 32% LHCb
  – Also supports ~12 non-LHC experiments, including non-HEP
• Computing resources
  – 784 worker nodes, over 14K cores
  – Generally have 40-60K jobs submitted per day
• Torque / Maui had been used for many years
  – Many issues
  – Severity & number of problems increased as the size of the farm increased
  – In 2012 decided it was time to start investigating a move to a new batch system

Page 4

Choosing a new batch system

• Considered, tested & eventually rejected the following:
  – LSF, Univa Grid Engine*
    • Requirement: avoid commercial products unless absolutely necessary
  – Open source Grid Engines
    • Competing products; not clear which has a long-term future
    • Communities appear less active than HTCondor & SLURM
    • Existing Tier-1s running Grid Engine use the commercial version
  – Torque 4 / Maui
    • Maui problematic
    • Torque 4 seems less scalable than alternatives (but better than Torque 2)
  – SLURM
    • Carried out extensive testing & comparison with HTCondor
    • Found that for our use case:
      – Very fragile, easy to break
      – Unable to get it working reliably above 6000 running jobs

* Only tested open source Grid Engine, not Univa Grid Engine

Page 5

Choosing a new batch system

• HTCondor chosen as replacement for Torque/Maui
  – Has the features we require
  – Seems very stable
  – Easily able to run 16,000 simultaneous jobs
    • Didn't do any tuning – it "just worked"
    • Have since tested > 30,000 running jobs
  – Is more customizable than all other batch systems

Page 6

Migration to HTCondor

• Strategy
  – Start with a small test pool
  – Gain experience & slowly move resources from Torque / Maui
• Migration
  – 2012 Aug: Started evaluating alternatives to Torque / Maui (LSF, Grid Engine, Torque 4, HTCondor, SLURM)
  – 2013 Jun: Began testing HTCondor with ATLAS & CMS; ~1000 cores from old WNs beyond MoU commitments
  – 2013 Aug: Choice of HTCondor approved by management
  – 2013 Sep: HTCondor declared a production service; moved 50% of pledged CPU resources to HTCondor
  – 2013 Nov: Migrated remaining resources to HTCondor

Page 7

Experience so far

• Experience
  – Very stable operation
    • Generally just ignore the batch system & everything works fine
    • Staff don't need to spend all their time fire-fighting problems
  – No changes needed as the HTCondor pool increased in size from ~1000 to ~14000 cores
  – Job start rate much higher than Torque / Maui, even when throttled
    • Farm utilization much better
  – Very good support

Page 8

Problems

• A few issues found, but fixed quickly by developers
  – Found job submission hung when one of an HA pair of central managers was down
    • Fixed & released in 8.0.2
  – Found a problem affecting HTCondor-G job submission to ARC with HTCondor as the LRMS
    • Fixed & released in 8.0.5
  – Experienced jobs dying 2 hours after a network break between CEs and WNs
    • Fixed & released in 8.1.4

Page 9

Computing elements

• All job submission to RAL is via the Grid
  – No local users
• Currently have 5 CEs
  – 2 CREAM CEs
  – 3 ARC CEs
• CREAM doesn't currently support HTCondor
  – We developed the missing functionality ourselves
  – Will feed this back so that it can be included in an official release
• ARC is better
  – But didn't originally handle partitionable slots, passing CPU/memory requirements to HTCondor, …
  – We wrote lots of patches, all included in the recent 4.1.0 release
    • Will make it easier for more European sites to move to HTCondor

Page 10

Computing elements

• ARC CE experience
  – Have run almost 9 million jobs so far across our 3 ARC CEs
  – Generally ignore them and they "just work"
  – VOs
    • ATLAS & CMS fine from the beginning
    • LHCb added the ability to submit to ARC CEs to DIRAC
      – Seem to be ready to move entirely to ARC
    • ALICE not yet able to submit to ARC
      – They have said they will work on this
    • Non-LHC VOs
      – Some use DIRAC, which can now submit to ARC
      – Others use EMI WMS, which can submit to ARC
• CREAM CE status
  – Plan to phase out CREAM CEs this year

Page 11

HTCondor & ARC in the UK

• Since the RAL Tier-1 migrated, other sites in the UK have started moving to HTCondor and/or ARC
  – RAL T2: HTCondor + ARC (in production)
  – Bristol: HTCondor + ARC (in production)
  – Oxford: HTCondor + ARC (small pool in production, migration in progress)
  – Durham: SLURM + ARC
  – Glasgow: Testing HTCondor + ARC
  – Liverpool: Testing HTCondor
• 7 more sites considering moving to HTCondor or SLURM
• Configuration management: community effort
  – The Tier-2s using HTCondor and ARC have been sharing Puppet modules

Page 12

Multi-core jobs

• Current situation
  – ATLAS have been running multi-core jobs at RAL since November
  – CMS started submitting multi-core jobs in early May
  – Interest so far only in multi-core jobs, not whole-node jobs
    • Only 8-core jobs
• Our aims
  – Fully dynamic
    • No manual partitioning of resources
  – Number of running multi-core jobs determined by fairshares

Page 13

Getting multi-core jobs to work

• Job submission
  – Haven't set up dedicated multi-core queues
  – VO has to request how many cores they want in their JDL, e.g. (count=8)
• Worker nodes configured to use partitionable slots
  – Resources of each WN (CPU, memory, …) divided up as necessary amongst jobs
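A minimal partitionable-slot setup of the kind described above can be sketched in HTCondor configuration; the exact resource proportions are a site choice, not values taken from the talk:

```
# Advertise each WN as one partitionable slot covering all of its
# resources; dynamic slots are carved off it per job request.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%,mem=100%,disk=100%
SLOT_TYPE_1_PARTITIONABLE = True
```

With this in place an 8-core job simply claims 8 CPUs from the partitionable slot; no static 8-core slots need to be pre-defined.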

• Set up multi-core groups & associated fairshares
  – HTCondor configured to assign multi-core jobs to the appropriate groups
• Adjusted the order in which the negotiator considers groups
  – Consider multi-core groups before single-core groups
    • 8 free cores are "expensive" to obtain, so try not to lose them to single-core jobs too quickly
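The group setup can be sketched in HTCondor configuration. The group names and quota values below are hypothetical, not RAL's actual shares; only the GROUP_SORT_EXPR idea (multi-core groups considered first) comes from the talk:

```
# Hypothetical accounting groups and dynamic quotas.
GROUP_NAMES = group_atlas, group_atlas_multicore, group_cms, group_cms_multicore
GROUP_QUOTA_DYNAMIC_group_atlas_multicore = 0.10
GROUP_QUOTA_DYNAMIC_group_cms_multicore = 0.10
GROUP_ACCEPT_SURPLUS = True

# Negotiator considers groups in ascending order of this expression:
# multi-core groups first, other groups next, ungrouped jobs last.
GROUP_SORT_EXPR = ifThenElse(AccountingGroup =?= "<none>", 3.4e+38, \
                    ifThenElse(regexp("multicore", AccountingGroup), 1, 2))
```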

Page 14

Getting multi-core jobs to work

• If lots of single-core jobs are idle & running, how does a multi-core job start?
  – By default it probably won't
• condor_defrag daemon
  – Finds WNs to drain, triggers draining & cancels draining as required
  – Configuration changes from default:
    • Drain 8 cores only, not whole WNs
    • Pick WNs to drain based on how many cores they have that can be freed up
      – E.g. getting 8 free CPUs by draining a full 32-core WN is generally faster than draining a full 8-core WN
  – Demand for multi-core jobs not known by condor_defrag
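The behaviour described above maps onto condor_defrag knobs roughly as follows. The values are illustrative, and the DEFRAG_RANK expression is one way to prefer WNs that can free 8 cores fastest, not necessarily the exact expression used at RAL:

```
# A WN counts as "drained enough" once 8 cores are free,
# so draining stops well short of the whole machine:
DEFRAG_WHOLE_MACHINE_EXPR = Cpus >= 8

# Prefer WNs where the fewest busy cores stand between us and 8 free CPUs
# (a busy 32-core WN usually frees 8 cores faster than a busy 8-core WN):
DEFRAG_RANK = ifThenElse(Cpus >= 8, -10, (TotalCpus - Cpus) / (8.0 - Cpus))

# Limits on how much draining happens at once:
DEFRAG_MAX_CONCURRENT_DRAINING = 60
DEFRAG_DRAINING_MACHINES_PER_HOUR = 60.0
```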

• Set up a simple cron to adjust the number of concurrently draining WNs based on demand
  – If many idle multi-core jobs but few running, drain aggressively
  – Otherwise very little draining
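The talk doesn't show the cron itself, so here is a hypothetical sketch of its decision logic in Python; the thresholds and the idea of pushing the result out via condor_config_val are assumptions:

```python
def concurrent_draining_limit(idle_multicore: int, running_multicore: int,
                              aggressive: int = 60, quiet: int = 4) -> int:
    """Pick a DEFRAG_MAX_CONCURRENT_DRAINING value from multi-core demand.

    Many idle multi-core jobs but few running -> drain aggressively;
    otherwise drain very little, to limit CPU wastage.
    (Thresholds are illustrative, not RAL's actual values.)
    """
    if idle_multicore > 10 and running_multicore < idle_multicore:
        return aggressive
    return quiet

# The real cron would obtain the two counts from condor_q (e.g. counting
# idle and running jobs with RequestCpus == 8) and push the chosen limit
# into the defrag daemon's configuration, e.g. with condor_config_val -rset
# followed by a reconfig.
```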

Page 15

Results

• Effect of changing the way WNs to drain are selected
  – No change in the number of concurrently draining machines
  – Rate of increase in the number of running multi-core jobs much higher

(Plot: running multi-core jobs)

Page 16

Results

• Recent ATLAS activity
  – Gaps in submission by ATLAS result in loss of multi-core slots
  – Significantly reduced CPU wastage due to the cron
    • Aggressive draining: 3% waste
    • Less-aggressive draining: < 1% waste

(Plots: running & idle multi-core jobs; number of WNs running multi-core jobs & draining WNs)

Page 17

Worker node health check

• Startd cron
  – Checks for problems on worker nodes
    • Disk full or read-only
    • CVMFS
    • Swap
    • …
  – Prevents jobs from starting in the event of problems
    • If there is a problem with ATLAS CVMFS, then only ATLAS jobs are prevented from starting
  – Information about problems made available in machine ClassAds

• Can easily identify WNs with problems, e.g.

# condor_status -constraint 'NODE_STATUS =!= "All_OK"' -autoformat Machine NODE_STATUS
lcg0980.gridpp.rl.ac.uk  Problem: CVMFS for alice.cern.ch
lcg0981.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch  Problem: CVMFS for lhcb.cern.ch
lcg1069.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
lcg1070.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
lcg1197.gridpp.rl.ac.uk  Problem: CVMFS for cms.cern.ch
lcg1675.gridpp.rl.ac.uk  Problem: Swap in use, less than 25% free
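A startd cron of this kind is wired up through configuration like the following. The script path, job name and period are illustrative, and the health-check script itself (which prints the NODE_STATUS attribute) is not shown in the talk:

```
# Run a WN health-check script periodically and merge its output
# (e.g. NODE_STATUS = "Problem: CVMFS for cms.cern.ch") into the machine ad.
STARTD_CRON_JOBLIST = $(STARTD_CRON_JOBLIST) WN_HEALTH
STARTD_CRON_WN_HEALTH_EXECUTABLE = /usr/local/bin/healthcheck_wn
STARTD_CRON_WN_HEALTH_PERIOD = 300s

# Refuse new jobs unless the node reports healthy. (The per-VO behaviour
# described above, e.g. blocking only ATLAS jobs on an ATLAS CVMFS problem,
# needs a more elaborate START expression than this.)
START = $(START) && (NODE_STATUS =?= "All_OK")
```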

Page 18

Worker node health check

• Can also put this data into ganglia
  – RAL tests new CVMFS releases
    • Therefore it's important for us to detect increases in CVMFS problems
  – Generally have only small numbers of WNs with issues
  – Example: a user's "problematic" jobs affected CVMFS on many WNs

Page 19

Jobs monitoring

• CASTOR team at RAL have been testing Elasticsearch
  – Why not try using it with HTCondor?
• Elasticsearch ELK stack
  – Logstash: parses log files
  – Elasticsearch: search & analyze data in real time
  – Kibana: data visualization
• Hardware setup
  – Test cluster of 13 servers (old disk servers & worker nodes)
    • But 3 servers could handle 16 GB of CASTOR logs per day
• Adding HTCondor
  – Wrote a config file for Logstash to enable history files to be parsed
  – Added Logstash to machines running schedds

Pipeline: HTCondor history files → Logstash → Elasticsearch → Kibana
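A Logstash pipeline for schedd history files might look roughly like this. The path, the kv-based parsing and the single-node Elasticsearch output are assumptions for illustration, not the actual RAL config; real history files also need multiline handling so that each job's ClassAd becomes one event:

```
input {
  # HTCondor appends each completed job's ClassAd to the history file.
  file { path => "/var/lib/condor/spool/history" }
}
filter {
  # History records are "Attribute = value" lines; split them into fields.
  kv { field_split => "\n" value_split => "=" trim_value => " \"" }
}
output {
  elasticsearch { hosts => ["elasticsearch.example:9200"] }
}
```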

Page 20

Jobs monitoring

• Can see full job ClassAds


Page 21

Jobs monitoring

• Custom plots
  – E.g. completed jobs by schedd

Page 22

Jobs monitoring

• Custom dashboards

Page 23

Jobs monitoring

• Benefits
  – Easy to set up
    • Took less than a day to set up the initial cluster
  – Seems to be able to handle the load from HTCondor
    • For us (so far): < 1 GB, < 100K documents per day
  – Arbitrary queries
    • Seem faster than using native HTCondor commands (condor_history)
  – Horizontal scaling
    • Need more capacity? Just add more nodes

Page 24

Summary

• Due to scalability problems with Torque/Maui, migrated to HTCondor last year
• We are happy with the choice we made based on our requirements
  – Confident that the functionality & scalability of HTCondor will meet our needs for the foreseeable future
• Multi-core jobs working well
  – Looking forward to ATLAS and CMS running multi-core jobs at the same time

Page 25

Future plans

• HTCondor
  – Phase in cgroups onto WNs
• Integration with private cloud
  – When the production cloud is ready, want to be able to expand the batch system into the cloud
  – Using condor_rooster for provisioning resources
    • HEPiX Fall 2013: http://indico.cern.ch/event/214784/session/9/contribution/205
• Monitoring
  – Move Elasticsearch into production
  – Try sending all HTCondor & ARC CE log files to Elasticsearch
    • E.g. could easily find information about a particular job from any log file

Page 26

Future plans

• Ceph
  – Have set up a 1.8 PB test Ceph storage system
  – Accessible from some WNs using CephFS
  – Setting up an ARC CE with a shared filesystem (Ceph)
• ATLAS testing with arcControlTower
  – Pulls jobs from PanDA, pushes jobs to ARC CEs
  – Unlike the normal pilot concept, jobs can have more precise resource requirements specified
  – Input files pre-staged & cached by ARC on Ceph

Page 27

Thank you!
