Facility Status and Resource Requirements
Michael Ernst, BNL
US ATLAS Facilities Workshop at OSG All-Hands Meeting
Harvard Medical School
March 7, 2011
Outline
· Status of computing in ATLAS
· Overview and developments since last Facility Meeting
· Tier 1/2/3 facilities overview
· Job completion metrics for production & analysis
· Challenges for higher luminosity running
· Resource Projections
· Summary
U.S. ATLAS Physics Support & Computing
2.1, 2.9 Management (Wenaus/Willocq)
2.2 Software (Luehring)
  2.2.1 Coordination (Luehring)
  2.2.2 Core Services (Calafiura)
  2.2.3 Data Management (Malon)
  2.2.4 Distributed Software (Wenaus)
  2.2.5 Application Software (Neubauer)
  2.2.6 Infrastructure Support (Undrus)
  2.2.7 Analysis Support (retired; redundant)
  2.2.8 Multicore Processing (Calafiura)
2.3 Facilities and Distributed Computing (Ernst)
  2.3.1 Tier 1 Facilities (Ernst)
  2.3.2 Tier 2 Facilities (Gardner)
  2.3.3 Wide Area Network (McKee)
  2.3.4 Grid Tools and Services (Gardner)
  2.3.5 Grid Production (De)
  2.3.6 Facility Integration (Gardner)
  2.3.7 Tier 3 Coordination (Yoshida/Benjamin)
2.4 Analysis Support (Cochran/Yoshida)
  2.4.1 Physics/Performance Forums (Black)
  2.4.2 Analysis Tools (Cranmer)
  2.4.3 Analysis Support Centers (Ma)
  2.4.4 Documentation (Luehring)
ATLAS Computing Status
· A ‘tremendous’ first year of data taking: in machine & detector performance, data volumes, processing loads, analysis activity, and physics output
  - Computing at levels far beyond STEP09, which was considered the nominal required performance
  - All Tier 1s delivered, and Tier 2s were prominent and crucial
· Computing delivered well as an enabler for physics analysis
  - Processing completed and validated on schedule for conferences
  - Low latency from data taking to physics output (e.g. ICHEP)
· The U.S. contributed reliably and made innovations toward an improved Computing Model
  - Tier 1 and Tier 2s most successful in ATLAS (i.e. analysis performance)
  - PanDA distributed production/analysis performed and scaled well
  - Provided critical new tools for coping with data volumes
Accomplishments – Facilities and Distributed Computing
· Most successful Tier-1 in ATLAS
  - Availability, data and CPU delivered, production performance, analysis performance
  - When ATLAS needs it done now, they send it to the US Tier-1
· U.S. Tier-2s also the best in the ATLAS Tier-2 complex
  - All U.S. Tier 2s and the Tier 1 are in the top 20 (of ~75) sites that do 75% of ATLAS analysis work
· U.S.-led dedicated ATLAS-wide Tier-3 support in 2010
  - Tier-3s as a distinct but integral component of ATLAS computing, focused on end-user analysis
  - In the U.S., tightly integrated with the Facility (Integration) Program; Doug is ATLAS Tier-3 Technical Coordinator
· Thanks to OSG, a crucial part of the success
  - We are closely involved in planning OSG’s next 5 years
· Only possible due to the unique collaborative spirit and effort of many people in the Facilities, with distributed leadership coordinated under the Facilities Integration Program
Tier 3s
· DOE and NSF funded institutions received their ARRA funds last Fall, and the majority are operational or close to it
· Tier 3 coordinators Doug and Rik recently contacted all 44 US institutions for a Tier 3 status update and to offer help
  - 26 functional sites; 7 in the process of setting up the site; 1 just received hardware, 2 more waiting on it, 1 planning, 7 no news
· Documentation, tools and procedures developed by the Tier 3 team: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasTier3g
  - Closely coupled to the Analysis Support program
· Asked Alden to help with Tier 3 priorities; currently working on supporting PanDA for analysis at Tier 3 sites
· NSF approval of the DYNES networking project is beneficial for Tier 3s; 13 US ATLAS participants
· The U.S. is a leader in the integrated ATLAS Tier 3 effort, with technical leadership in key Tier 3-directed efforts:
  - CVMFS as a ‘global’ filesystem for software distribution and (Conditions) data access
  - Federated storage across sites based on xrootd for transparent distributed data access
· Tomorrow is fully devoted to Tier-3 planning and operations
Production Job Completion
Single-attempt success rate typically 96% (US)
Analysis vs. Production (Johannes @ ADC Retreat)
Distributed Analysis
· Several open Data Management issues associated w/ Analysis
Output Merging
Wei: USERDISK is used for analysis job outputs, including data, ntuples and .log.tgz. The last two are typically small; the .log.tgz files normally range from 10 KB to 1 MB. I guess as users refine their analysis programs by repeatedly running them, they generate a large number of temporary outputs they will never look at. At SLAC, we have 4.64 million files in our LFC catalog, of which 3.00 million belong to USERDISK (670 TB vs. 88 TB).
At the ADC Retreat the point was raised (see the back-of-envelope sketch below).
This is only one example of numerous Data Management issues, e.g. the organization and management of USERDISK/GROUPDISK/LOCALGROUPDISK.
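As a rough illustration of why output merging matters, a back-of-envelope sketch of the SLAC figures quoted above; it assumes the parenthetical means roughly 88 TB in USERDISK out of about 670 TB total in the catalog, which is an interpretation of the numbers:

```python
# Back-of-envelope average file sizes at SLAC (figures from the slide above;
# the 670 TB total vs. 88 TB USERDISK split is an assumption about the quote).
TB = 1e12

total_files, total_bytes = 4.64e6, 670 * TB
user_files, user_bytes = 3.00e6, 88 * TB

avg_userdisk = user_bytes / user_files                                 # ~29 MB per file
avg_other = (total_bytes - user_bytes) / (total_files - user_files)    # ~355 MB per file

print(f"USERDISK holds {user_files/total_files:.0%} of the files "
      f"but only {user_bytes/total_bytes:.0%} of the volume")
print(f"average USERDISK file: {avg_userdisk/1e6:.0f} MB, "
      f"average other file: {avg_other/1e6:.0f} MB")
```

Under that reading, USERDISK is dominated by many small files, which is exactly the pattern that merging of job outputs is meant to address.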
Optimization
Tier-1 CPU Usage in 2010
Required: 226 kHS06, Pledged: 250 kHS06
Tier-2 CPU Usage in 2010
Required: 278 kHS06, Pledged: 281 kHS06
LHC – Preliminary Luminosity Projections
[Luminosity projection plot showing “reasonable” and “ultimate” scenarios – S. Myers]
The 2011 Run is challenging, as there are uncertainties that will have a significant impact on Resource Requirements:
- integrated luminosity (could vary by 3x)
- trigger rate (200 -> 400 Hz)
There will be a Run in 2012
- ATLAS Management is in the process of changing the Computing Model from massive data pre-placement to Dynamic Caching, which was developed in the US (more in Torre’s talk)
We need to be flexible and nimble in view of limited resources, e.g. by exploring mechanisms to temporarily increase them
Challenges for Higher Luminosity Running
· LHC/ATLAS in 2011: 200 days running, 30%-40% duty cycle, 300-400 Hz trigger rate, higher energy (pileup)
· Confident that the Facilities in the U.S. and the Workflow Management system will cope with the increased load
· Principal challenge is fitting the computing into available resources
· In view of the pledges in place for 2011 and budget cuts, significant changes are needed
  - From full ESDs on disk to a rolling buffer (10% of ESDs available on disk at Tier 1s); rely on AODs and filtered ESDs
  - Expand use of caching – extend PanDA’s dynamic caching to managing Tier 2 storage as well as Tier 1 (see the sketch below)
  - Remove hierarchy from the computing model: break Cloud boundaries and maximize Tier-2 resource usage (CPU and …); Tier 2s as a storage resource
· Based on 2010 experience we have good reason to believe that this provides a promising path forward without compromising physics
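To make the caching idea above concrete, here is a simplified, hypothetical sketch of usage-driven replication in the spirit of PanDA’s dynamic caching; the threshold, names, and data structures are illustrative only, not the actual PanDA/PD2P code:

```python
# Illustrative usage-driven caching policy (hypothetical; not the real PanDA code).
# Idea: replicate a dataset to an additional Tier-1/Tier-2 cache once enough analysis
# jobs are waiting for it, rather than pre-placing data everywhere.
from collections import defaultdict

REPLICATION_THRESHOLD = 50   # waiting jobs that trigger an extra replica (illustrative)

waiting_jobs = defaultdict(int)   # dataset -> number of queued analysis jobs
replicas = defaultdict(set)       # dataset -> sites holding a cached copy

def on_job_queued(dataset, candidate_sites):
    """Called when an analysis job requesting `dataset` is queued."""
    waiting_jobs[dataset] += 1
    if waiting_jobs[dataset] >= REPLICATION_THRESHOLD:
        # pick a site that does not yet have the data (real logic would also weigh
        # free space, recent site performance, and network cost)
        for site in candidate_sites:
            if site not in replicas[dataset]:
                replicas[dataset].add(site)
                print(f"replicating {dataset} to cache at {site}")
                break
        waiting_jobs[dataset] = 0   # reset the counter after triggering a replica
```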
New Data Replication Policy (J. Shank’s presentation to CB on Mar 4)
Computing Capacity Requirements
· Revised ATLAS computing requirements, taking account of planning revisions (ESD reductions etc.), were first presented to the collaboration in late February
  - Based on 400 Hz trigger rate, 200 running days, 30%-40% LHC efficiency
  - 2.1B events, 2.7 PB raw, 8.8 PB total derived data (see the back-of-envelope check below)
  - 1 RAW copy on disk, limited derived copies
  - Rely on dynamic usage-driven replication to caches at Tier-1s and Tier-2s to meet global needs
  - Data placement practice in 2010 would yield 27 PB total volume
  - MC is in addition; 2010 experience at the Tier-1: Real:MC is 1:1 – 2:1; predicted for the Tier-2s in 2011: Real/MC 1.4
· Lots of changes expected
  - Success relies heavily on the reliability and performance of the sites contributing resources to the worldwide ATLAS Computing Facility
  - We were largely insulated in the US due to a ~complete data inventory
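A quick back-of-envelope check of the event and RAW-volume figures quoted above (a sketch; the ~1.3 MB/event RAW size is inferred from the quoted totals, not stated independently):

```python
# Back-of-envelope check of the 2011 requirement inputs quoted above.
# The per-event RAW size is an assumption, chosen to be consistent with 2.7 PB / 2.1B events.
SECONDS_PER_DAY = 86400

trigger_rate_hz = 400
running_days = 200
lhc_efficiency = 0.30            # 30%-40% quoted; use the low end

events = trigger_rate_hz * running_days * SECONDS_PER_DAY * lhc_efficiency
print(f"events: {events/1e9:.1f}B")          # ~2.1B events

raw_event_size_mb = 1.3          # assumption, see note above
raw_pb = events * raw_event_size_mb * 1e6 / 1e15
print(f"RAW volume: {raw_pb:.1f} PB")        # ~2.7 PB
```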
Tier-1 CPU
Tier-1 Disk
Tier-2 CPU
Huge increase in User Activities vs. Simulation: 2010 – 1.2 : 1; 2012 – 4.4 : 1
All Tier-2 sites must be fully prepared to run analysis at >75% of their total capacity. This has a major impact on the performance of the storage system and the network (note: simulation stays ~constant; see the arithmetic sketch below).
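A small sketch of where the >75% figure comes from, assuming (as the slide notes) that the simulation load stays roughly constant:

```python
# Analysis share of Tier-2 capacity implied by the ratios quoted above
# (simulation load taken as constant, per the slide).
sim = 1.0

for year, ratio in [(2010, 1.2), (2012, 4.4)]:
    analysis = ratio * sim
    share = analysis / (analysis + sim)
    print(f"{year}: analysis is {share:.0%} of the total load")
# 2010: ~55%, 2012: ~81% -> Tier-2s must be ready to run analysis at >75% of capacity
```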
Tier-2 Disk
US Capacity Projection (1/4)
CPU Capacities in the U.S. (low guidance, T2 $3M flat-flat, dedicated resources for U.S., UIUC not included), in kHS06:

                   2008   2009   2010   2011   2012   2013   2014   2015   2016
CPU T1 (pledge)      19     23     41     52     72     81     81     81     81
CPU T1 (inst)        19     23     41     72     86     96     96     96     96
CPU T2 (pledge)    17.2   22.1   55.0   65.0   94.0  105.0  105.0  134.0  150.0
CPU T2 (inst)      17.2   22.1  105.0  105.0  131.0  175.0  180.0  201.0  218.0
US Capacity Projection (2/4)
Disk Capacities in the U.S. (low guidance, T2 $3M flat-flat, dedicated resources for U.S., UIUC not included), in PB:

                    2008   2009   2010   2011   2012   2013   2014   2015   2016
Disk T1 (pledge)     1.5    1.8    5.1    5.6    6.4    7.2    7.2    7.2    7.2
Disk T1 (inst)       1.5    1.8    6.1    8.1    8.1    8.6    8.6    8.6    8.6
Disk T2 (pledge)     1.1    1.6    5.5    8.4   12.3   13.4   13.4   17.3   21.2
Disk T2 (inst)       1.1    1.6    7.4   11.7   13.2   15.6   18.0   21.9   23.0
US Capacity Projection (3/4)
CPU Capacities in the U.S. (nominal guidance, T2 $3.3M/yr flat-flat, dedicated resources for U.S., UIUC included), in kHS06:

                   2008   2009   2010   2011   2012   2013   2014   2015   2016
CPU T1 (pledge)      19     23     41     52     72     81     81     81     81
CPU T1 (inst)        19     23     41     72     86     96     96     96     96
CPU T2 (pledge)    17.2   22.1   55.0   65.0   94.0  105.0  105.0  134.0  150.0
CPU T2 (inst)      17.2   22.1  105.0  137.0  137.0  168.0  224.0  271.0  271.0
US Capacity Projection (4/4)
Disk Capacities in the U.S. (nominal guidance, T2 $3.3M/yr flat-flat, dedicated resources for U.S., UIUC included), in PB:

                    2008   2009   2010   2011   2012   2013   2014   2015   2016
Disk T1 (pledge)     1.5    1.8    5.1    5.6    6.4    7.2    7.2    7.2    7.2
Disk T1 (inst)       1.5    1.8    6.1    8.1    8.1    8.6    8.6    8.6    8.6
Disk T2 (pledge)     1.1    1.6    5.5    8.4   12.3   13.4   13.4   17.3   21.2
Disk T2 (inst)       1.1    1.6    7.4   11.3   16.2   18.6   20.5   24.2   25.4
Summary of Resource Estimates
Capacity Requirements – Budget Implications
· These capacity levels represent a substantial reduction from the previously estimated and budgeted U.S. capacity
  - ATLAS was able to be more aggressive than anticipated, and PanDA-based dynamic caching has proven very effective
· Savings are limited to the Tier 1 in 2011 and 2012; Tier 2s see a substantial ramp in disk space in 2012 to accommodate analysis
· Detailed planning and budgets remain to be worked out, but in broad outline, with U.S. Tier 1 capacity reduced to just the new ATLAS requirements, computing funding needs roughly match the target based on the low-guidance funding level
  - We have extrapolated the new ATLAS estimates to 2015 and find this holds true in the out years
  - But it is not without costs…
Budget Requirements – Program Implications
· Tier 2s under the low scenario… Where do they stand under the different guidance levels?
  - Costs including ancillary equipment needs and replacement
  - At what guidance level are we able to add IL/NCSA?
· Why is achieving the nominal (at least) scenario important?
  - At the Tier-1, restoration of some US-dedicated capacity
  - Risk reduction and more realistic planning/budgeting for mandatory replacements and upgrades
  - Exciting opportunity to introduce a major NSF Supercomputing Center as a third MWT2 site to the Tier 2 program, affiliated with a strong ATLAS group at Illinois
  - Would significantly strengthen the U.S. ATLAS Facilities expertise base, and strengthen couplings to OSG and XD (next generation TeraGrid)
Activities essential to scaling & sustaining ATLAS computing
(From Torre’s list)
· Computing R&D activities and (collaborative, e.g. ATLAS/wLCG, OSG) plans:
  - Virtualization and cloud computing incl. CVMFS (active)
  - Multi/many-core computing (active, supplemental DOE support)
  - Campus grids and inter-campus bridging (active): bring to the campus what has worked so well over the wide area in OSG
  - ‘Intelligent’ cache-based distributed storage (active): efficient use of disk through greater reliance on the network, federated xrootd
  - Hierarchical storage incorporating SSDs (active)
  - Highly scalable ‘noSQL’ databases (active): tools from the ‘cloud giants’ (Cassandra, HBase/Hadoop, SimpleDB…)
  - ‘Flatter’ point-to-point networking model (active): validation, diagnostics, monitoring of (especially) T2-T2 networking
  - GPU computing (active in ATLAS, not in the US)
  - Managing complexity in distributed computing (active): monitoring, diagnostics, error management, automation
Cloud-enabled Elasticity – Home resource expands elastically
· Cloud providers “join” home/dedicated resources
  - Ability to add resources on demand
· Virtual machines deployed on demand
  - Thereby establish a proven environment applications can run in
  - Dynamically create a PanDA Production site
  - Use cloud resources as long as needed, turn them off when done (see the sketch below)
· Scalable infrastructure
  - Addresses primarily the computational side; Simulated Event Production is an ideal candidate – computationally intensive with small I/O requirements
· Current cloud service offerings lack adequate technical capabilities and attractive pricing for data-intensive processing tasks
  - May require “hybrids”, a combination of Grids and Clouds, for the next few years
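A minimal, hypothetical sketch of the elasticity loop described above; the CloudProvider/queue interfaces, image name, and thresholds are illustrative assumptions, not actual PanDA or cloud-provider APIs:

```python
# Hypothetical elastic-expansion loop (illustrative only; not real PanDA or cloud APIs).
# When the production queue backs up, boot extra worker VMs from a pre-built image;
# when it drains, terminate them so the temporary "cloud site" disappears again.
import time

SCALE_OUT_THRESHOLD = 1000   # queued simulation jobs that justify cloud workers (illustrative)
MAX_CLOUD_WORKERS = 200

def elastic_loop(panda_queue, cloud, image="atlas-worker-image"):
    workers = []
    while True:
        queued = panda_queue.num_queued_jobs()          # assumed helper on a queue object
        if queued > SCALE_OUT_THRESHOLD and len(workers) < MAX_CLOUD_WORKERS:
            # boot a VM from an image carrying the validated ATLAS environment (e.g. via CVMFS)
            workers.append(cloud.start_instance(image))
        elif queued == 0 and workers:
            # work is done: release the temporary cloud resources
            for vm in workers:
                cloud.terminate_instance(vm)
            workers.clear()
        time.sleep(60)                                  # re-evaluate once a minute
```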
Proposed Cloud Roadmap
J. Hover
From Local to Global Data Access
Using existing solutions to make data globally accessible – a pragmatic approach
• Use Xrootd (SLAC, CERN) to build a federated storage system based on autonomous storage systems at sites
• Supports file copies & direct/sparse access across all sites and across heterogeneous storage systems
• Work in collaboration w/ US CMS and OSG
C. Waldman
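For illustration, a minimal sketch of what transparent federated access looks like from a user job; the redirector hostname and file path are hypothetical, and the sketch assumes PyROOT is available (ROOT’s TFile::Open understands root:// URLs):

```python
# Minimal sketch of reading an ATLAS file through a federated xrootd redirector.
# The redirector host and file path below are hypothetical examples.
import ROOT

# The global redirector looks up which federated site actually holds the file
# and redirects the client there; the job does not need to know the site.
url = "root://global-redirector.example.org//atlas/user/some_dataset/AOD.pool.root"

f = ROOT.TFile.Open(url)          # sparse/direct reads go over the WAN as needed
if f and not f.IsZombie():
    print("opened", f.GetName(), "size", f.GetSize(), "bytes")
    f.Close()
```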
… and improve the Performance by adding a Caching Layer
Reading a 1 GB ATLAS AOD file over HTTP
* Though I/O is asynchronous, the performance is limited by the wide-area network latency (a rough estimate of the effect follows below)
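A rough, illustrative estimate of why WAN latency dominates here; all parameters are assumptions chosen only to show the shape of the effect, not measured values:

```python
# Illustrative latency-bound throughput estimate (all parameters are assumptions).
# With a fixed amount of data outstanding ("in flight") per round trip, the
# achievable rate is roughly outstanding_bytes / round_trip_time.
file_size = 1 * 1024**3            # 1 GB AOD file, as in the slide

def read_time(rtt_s, outstanding_bytes):
    throughput = outstanding_bytes / rtt_s
    return file_size / throughput

wan_rtt, lan_rtt = 0.100, 0.001    # 100 ms across the WAN vs ~1 ms to a nearby cache
outstanding = 4 * 1024**2          # assume ~4 MB of requests kept in flight

print(f"over the WAN:  ~{read_time(wan_rtt, outstanding):.0f} s")   # ~26 s
print(f"from a cache:  ~{read_time(lan_rtt, outstanding):.1f} s")   # ~0.3 s
```

The point is qualitative: a caching layer close to the client removes the WAN round trips from the read path.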
Network Evolution – “Flattening” today’s Hierarchical Model
[Diagram: Hierarchical (MONARC) Model (restricted) vs. ultimate Pull Model based on unrestricted access across sites; compromise: proposed implementation based on regional exchange points (LHCONE), organized per region, e.g. US]
Summary
· Computing was a big success as an enabler for physics, on its own metrics but also on the ultimate metric of timely physics output
· The Facilities, the Tier-1 and the Tier-2 centers, have performed well in initial LHC data taking and analysis
  - Production and Analysis Operations Coordination provides seamless integration with ATLAS world-wide computing operations
  - A very effective Integration Program is in place to ensure readiness in view of the steep ramp-up of analysis operations with real data
  - Excellent contribution of Tier-2 sites to high-volume production (event simulation, reprocessing) and analysis
· The U.S. ATLAS Computing Facilities need sufficient funding to be on track to meet the ATLAS performance and capacity requirements
  - Tier-2 funding uncertainties beyond 2011; a proposal for a Cooperative Agreement with NSF for 2012 – 2016 was submitted in Dec 2010
  - Tier-1 equipment target reduced to minimum; extended equipment lifetime increases risk
· 2011 will also mark the rise of the Tier 3s in the U.S.
· U.S. ATLAS is actively pursuing continuation of OSG
· Overall, the Facilities in the U.S. have performed very well during the 2010 run, and I have no doubt that this will hold for 2011 and beyond!