Facility Status and Resource Requirements
Michael Ernst, BNL
US ATLAS Facilities Workshop at OSG All-Hands Meeting
Harvard Medical School
March 7, 2011
Outline
· Status of computing in ATLAS
· Overview and developments since last Facility Meeting
· Tier 1/2/3 facilities overview
· Job completion metrics for production & analysis
· Challenges for higher luminosity running
· Resource Projections
· Summary
U.S. ATLAS Physics Support & Computing
2.1, 2.9 Management (Wenaus/Willocq)
2.2 Software (Luehring)
  2.2.1 Coordination (Luehring)
  2.2.2 Core Services (Calafiura)
  2.2.3 Data Management (Malon)
  2.2.4 Distributed Software (Wenaus)
  2.2.5 Application Software (Neubauer)
  2.2.6 Infrastructure Support (Undrus)
  2.2.7 Analysis Support (retired; redundant)
  2.2.8 Multicore Processing (Calafiura)
2.3 Facilities and Distributed Computing (Ernst)
  2.3.1 Tier 1 Facilities (Ernst)
  2.3.2 Tier 2 Facilities (Gardner)
  2.3.3 Wide Area Network (McKee)
  2.3.4 Grid Tools and Services (Gardner)
  2.3.5 Grid Production (De)
  2.3.6 Facility Integration (Gardner)
  2.3.7 Tier 3 Coordination (Yoshida/Benjamin)
2.4 Analysis Support (Cochran/Yoshida)
  2.4.1 Physics/Performance Forums (Black)
  2.4.2 Analysis Tools (Cranmer)
  2.4.3 Analysis Support Centers (Ma)
  2.4.4 Documentation (Luehring)
ATLAS Computing Status
· A ‘tremendous’ first year of data taking: in machine & detector performance, data volumes, processing loads, analysis activity, and physics output
  - Computing at levels far beyond STEP09, which was considered the nominal required performance
  - All Tier 1s delivered, and Tier 2s were prominent and crucial
· Computing delivered well as an enabler for physics analysis
  - Processing completed and validated on schedule for conferences
  - Low latency from data taking to physics output (e.g. ICHEP)
· The U.S. contributed reliably and made innovations toward an improved Computing Model
  - Tier 1 and Tier 2s most successful in ATLAS (i.e. analysis performance)
  - PanDA distributed production/analysis performed and scaled well
  - Provided critical new tools for coping with data volumes
Accomplishments – Facilities and Distributed Computing
· Most successful Tier-1 in ATLAS
  - Availability, data and CPU delivered, production performance, analysis performance
  - When ATLAS needs it done now, they send it to the US Tier-1
· U.S. Tier-2s also the best in the ATLAS Tier-2 complex
  - All U.S. Tier 2s and the Tier 1 are in the top 20 (of ~75) sites that do 75% of ATLAS analysis work
· U.S.-led dedicated ATLAS-wide Tier-3 support in 2010
  - Tier-3s as a distinct but integral component of ATLAS computing, focused on end-user analysis
  - In the U.S., tightly integrated with the Facility (Integration) Program; Doug is ATLAS Tier-3 Technical Coordinator
· Thanks to OSG, a crucial part of the success
  - We are closely involved in planning OSG’s next 5 years
· Only possible due to the unique collaborative spirit and effort of many people in the Facilities, with distributed leadership coordinated under the Facilities Integration Program
Tier 3s
· DOE and NSF funded institutions received their ARRA funds last Fall, and the majority are operational or close to it
· Tier 3 coordinators Doug and Rik recently contacted all 44 US institutions for a Tier 3 status update and to offer help
  - 26 functional sites; 7 in the process of setting up the site; 1 just received hardware, 2 more waiting on it, 1 planning, 7 no news
· Documentation, tools and procedures developed by the Tier 3 team: https://twiki.cern.ch/twiki/bin/view/Atlas/AtlasTier3g
  - Closely coupled to the Analysis Support program
· Asked Alden to help with Tier 3 priorities; currently working on supporting PanDA for analysis at Tier 3 sites
· NSF approval of the DYNES networking project is beneficial for Tier 3s; 13 US ATLAS participants
· The U.S. is a leader in the integrated ATLAS Tier 3 effort, with technical leadership in key Tier 3-directed efforts:
  - CVMFS as a ‘global’ filesystem for software distribution and (Conditions) data access
  - Federated storage across sites based on xrootd for transparent distributed data access
· Tomorrow is fully devoted to Tier-3 planning and operations
Production Job Completion
Single-attempt success rate typically 96% (US)
Analysis vs. Production (Johannes @ ADC Retreat)
Distributed Analysis
· Several open Data Management issues associated w/ Analysis
Output Merging
Wei: USERDISK is used for analysis job outputs, including data, ntuples and .log.tgz. The last two are typically small; the .log.tgz files normally range from 10 KB to 1 MB. I guess as users refine their analysis programs by repeatedly running them, they generate a large number of temporary outputs they will never look at. At SLAC, we have 4.64 million files in our LFC catalog, of which 3.00 million belong to USERDISK (670 TB vs. 88 TB).
At the ADC Retreat the point was raised (see the back-of-envelope sketch below).
This is only one example of numerous Data Management issues, e.g. the organization and management of USERDISK/GROUPDISK/LOCALGROUPDISK.
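As a rough illustration of why output merging matters, a back-of-envelope sketch of the SLAC figures quoted above; it assumes the parenthetical means roughly 88 TB in USERDISK out of about 670 TB total in the catalog, which is an interpretation of the numbers:

```python
# Back-of-envelope average file sizes at SLAC (figures from the slide above;
# the 670 TB total vs. 88 TB USERDISK split is an assumption about the quote).
TB = 1e12

total_files, total_bytes = 4.64e6, 670 * TB
user_files, user_bytes = 3.00e6, 88 * TB

avg_userdisk = user_bytes / user_files                                 # ~29 MB per file
avg_other = (total_bytes - user_bytes) / (total_files - user_files)    # ~355 MB per file

print(f"USERDISK holds {user_files/total_files:.0%} of the files "
      f"but only {user_bytes/total_bytes:.0%} of the volume")
print(f"average USERDISK file: {avg_userdisk/1e6:.0f} MB, "
      f"average other file: {avg_other/1e6:.0f} MB")
```

Under that reading, USERDISK is dominated by many small files, which is exactly the pattern that merging of job outputs is meant to address.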
Optimization
Tier-1 CPU Usage in 2010
Required: 226 kHS06, Pledged: 250 kHS06
Tier-2 CPU Usage in 2010
Required: 278 kHS06, Pledged: 281 kHS06
LHC – Preliminary Luminosity Projections
[Luminosity projection plot showing “reasonable” and “ultimate” scenarios – S. Myers]
The 2011 Run is challenging, as there are uncertainties that will have a significant impact on Resource Requirements:
- integrated luminosity (could vary by 3x)
- trigger rate (200 -> 400 Hz)
There will be a Run in 2012
- ATLAS Management is in the process of changing the Computing Model from massive data pre-placement to Dynamic Caching, which was developed in the US (more in Torre’s talk)
We need to be flexible and nimble in view of limited resources, e.g. by exploring mechanisms to temporarily increase them
Challenges for Higher Luminosity Running
· LHC/ATLAS in 2011: 200 days running, 30%-40% duty cycle, 300-400 Hz trigger rate, higher energy (pileup)
· Confident that the Facilities in the U.S. and the Workflow Management system will cope with the increased load
· Principal challenge is fitting the computing into available resources
· In view of the pledges in place for 2011 and budget cuts, significant changes are needed
  - From full ESDs on disk to a rolling buffer (10% of ESDs available on disk at Tier 1s); rely on AODs and filtered ESDs
  - Expand use of caching – extend PanDA’s dynamic caching to managing Tier 2 storage as well as Tier 1 (see the sketch below)
  - Remove hierarchy from the computing model: break Cloud boundaries and maximize Tier-2 resource usage (CPU and …); Tier 2s as a storage resource
· Based on 2010 experience we have good reason to believe that this provides a promising path forward without compromising physics
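To make the caching idea above concrete, here is a simplified, hypothetical sketch of usage-driven replication in the spirit of PanDA’s dynamic caching; the threshold, names, and data structures are illustrative only, not the actual PanDA/PD2P code:

```python
# Illustrative usage-driven caching policy (hypothetical; not the real PanDA code).
# Idea: replicate a dataset to an additional Tier-1/Tier-2 cache once enough analysis
# jobs are waiting for it, rather than pre-placing data everywhere.
from collections import defaultdict

REPLICATION_THRESHOLD = 50   # waiting jobs that trigger an extra replica (illustrative)

waiting_jobs = defaultdict(int)   # dataset -> number of queued analysis jobs
replicas = defaultdict(set)       # dataset -> sites holding a cached copy

def on_job_queued(dataset, candidate_sites):
    """Called when an analysis job requesting `dataset` is queued."""
    waiting_jobs[dataset] += 1
    if waiting_jobs[dataset] >= REPLICATION_THRESHOLD:
        # pick a site that does not yet have the data (real logic would also weigh
        # free space, recent site performance, and network cost)
        for site in candidate_sites:
            if site not in replicas[dataset]:
                replicas[dataset].add(site)
                print(f"replicating {dataset} to cache at {site}")
                break
        waiting_jobs[dataset] = 0   # reset the counter after triggering a replica
```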
New Data Replication Policy (J. Shank’s presentation to CB on Mar 4)
Computing Capacity Requirements
· Revised ATLAS computing requirements, taking account of planning revisions (ESD reductions etc.), were first presented to the collaboration in late February
  - Based on 400 Hz trigger rate, 200 running days, 30%-40% LHC efficiency
  - 2.1B events, 2.7 PB raw, 8.8 PB total derived data (see the back-of-envelope check below)
  - 1 RAW copy on disk, limited derived copies
  - Rely on dynamic usage-driven replication to caches at Tier-1s and Tier-2s to meet global needs
  - Data placement practice in 2010 would yield 27 PB total volume
  - MC is in addition; 2010 experience at the Tier-1: Real:MC is 1:1 – 2:1; predicted for the Tier-2s in 2011: Real/MC 1.4
· Lots of changes expected
  - Success relies heavily on the reliability and performance of the sites contributing resources to the worldwide ATLAS Computing Facility
  - We were largely insulated in the US due to a ~complete data inventory
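A quick back-of-envelope check of the event and RAW-volume figures quoted above (a sketch; the ~1.3 MB/event RAW size is inferred from the quoted totals, not stated independently):

```python
# Back-of-envelope check of the 2011 requirement inputs quoted above.
# The per-event RAW size is an assumption, chosen to be consistent with 2.7 PB / 2.1B events.
SECONDS_PER_DAY = 86400

trigger_rate_hz = 400
running_days = 200
lhc_efficiency = 0.30            # 30%-40% quoted; use the low end

events = trigger_rate_hz * running_days * SECONDS_PER_DAY * lhc_efficiency
print(f"events: {events/1e9:.1f}B")          # ~2.1B events

raw_event_size_mb = 1.3          # assumption, see note above
raw_pb = events * raw_event_size_mb * 1e6 / 1e15
print(f"RAW volume: {raw_pb:.1f} PB")        # ~2.7 PB
```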
Tier-1 CPU
Tier-1 Disk
Tier-2 CPU
Huge increase in User Activities vs. Simulation: 2010 – 1.2 : 1; 2012 – 4.4 : 1
All Tier-2 sites must be fully prepared to run analysis at >75% of their total capacity. This has a major impact on the performance of the storage system and the network (note: simulation stays ~constant; see the arithmetic sketch below).
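A small sketch of where the >75% figure comes from, assuming (as the slide notes) that the simulation load stays roughly constant:

```python
# Analysis share of Tier-2 capacity implied by the ratios quoted above
# (simulation load taken as constant, per the slide).
sim = 1.0

for year, ratio in [(2010, 1.2), (2012, 4.4)]:
    analysis = ratio * sim
    share = analysis / (analysis + sim)
    print(f"{year}: analysis is {share:.0%} of the total load")
# 2010: ~55%, 2012: ~81% -> Tier-2s must be ready to run analysis at >75% of capacity
```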
Tier-2 Disk
US Capacity Projection (1/4)
CPU Capacities in the U.S. (low guidance, T2 $3M flat-flat, dedicated resources for U.S., UIUC not included), in kHS06:

                   2008   2009   2010   2011   2012   2013   2014   2015   2016
CPU T1 (pledge)      19     23     41     52     72     81     81     81     81
CPU T1 (inst)        19     23     41     72     86     96     96     96     96
CPU T2 (pledge)    17.2   22.1   55.0   65.0   94.0  105.0  105.0  134.0  150.0
CPU T2 (inst)      17.2   22.1  105.0  105.0  131.0  175.0  180.0  201.0  218.0
US Capacity Projection (2/4)
Disk Capacities in the U.S. (low guidance, T2 $3M flat-flat, dedicated resources for U.S., UIUC not included), in PB:

                    2008   2009   2010   2011   2012   2013   2014   2015   2016
Disk T1 (pledge)     1.5    1.8    5.1    5.6    6.4    7.2    7.2    7.2    7.2
Disk T1 (inst)       1.5    1.8    6.1    8.1    8.1    8.6    8.6    8.6    8.6
Disk T2 (pledge)     1.1    1.6    5.5    8.4   12.3   13.4   13.4   17.3   21.2
Disk T2 (inst)       1.1    1.6    7.4   11.7   13.2   15.6   18.0   21.9   23.0
US Capacity Projection (3/4)
CPU Capacities in the U.S. (nominal guidance, T2 $3.3M/yr flat-flat, dedicated resources for U.S., UIUC included), in kHS06:

                   2008   2009   2010   2011   2012   2013   2014   2015   2016
CPU T1 (pledge)      19     23     41     52     72     81     81     81     81
CPU T1 (inst)        19     23     41     72     86     96     96     96     96
CPU T2 (pledge)    17.2   22.1   55.0   65.0   94.0  105.0  105.0  134.0  150.0
CPU T2 (inst)      17.2   22.1  105.0  137.0  137.0  168.0  224.0  271.0  271.0
US Capacity Projection (4/4)
Disk Capacities in the U.S. (nominal guidance, T2 $3.3M/yr flat-flat, dedicated resources for U.S., UIUC included), in PB:

                    2008   2009   2010   2011   2012   2013   2014   2015   2016
Disk T1 (pledge)     1.5    1.8    5.1    5.6    6.4    7.2    7.2    7.2    7.2
Disk T1 (inst)       1.5    1.8    6.1    8.1    8.1    8.6    8.6    8.6    8.6
Disk T2 (pledge)     1.1    1.6    5.5    8.4   12.3   13.4   13.4   17.3   21.2
Disk T2 (inst)       1.1    1.6    7.4   11.3   16.2   18.6   20.5   24.2   25.4
Summary of Resource Estimates
Capacity Requirements – Budget Implications
· These capacity levels represent a substantial reduction from the previously estimated and budgeted U.S. capacity
  - ATLAS was able to be more aggressive than anticipated, and PanDA-based dynamic caching has proven very effective
· Savings are limited to the Tier 1 in 2011 and 2012; Tier 2s see a substantial ramp in disk space in 2012 to accommodate analysis
· Detailed planning and budgets remain to be worked out, but in broad outline, with U.S. Tier 1 capacity reduced to just the new ATLAS requirements, computing funding needs roughly match the target based on the low-guidance funding level
  - We have extrapolated the new ATLAS estimates to 2015 and find this holds true in the out years
  - But it is not without costs…
Budget Requirements – Program Implications
· Tier 2s under the low scenario… Where do they stand under the different guidance levels?
  - Costs including ancillary equipment needs and replacement
  - At what guidance level are we able to add IL/NCSA?
· Why is achieving the nominal (at least) scenario important?
  - At the Tier-1, restoration of some US-dedicated capacity
  - Risk reduction and more realistic planning/budgeting for mandatory replacements and upgrades
  - Exciting opportunity to introduce a major NSF Supercomputing Center as a third MWT2 site to the Tier 2 program, affiliated with a strong ATLAS group at Illinois
  - Would significantly strengthen the U.S. ATLAS Facilities expertise base, and strengthen couplings to OSG and XD (next generation TeraGrid)
Activities essential to scaling & sustaining ATLAS computing
(From Torre’s list)
· Computing R&D activities and (collaborative, e.g. ATLAS/wLCG, OSG) plans:
  - Virtualization and cloud computing incl. CVMFS (active)
  - Multi/many-core computing (active, supplemental DOE support)
  - Campus grids and inter-campus bridging (active): bring to the campus what has worked so well over the wide area in OSG
  - ‘Intelligent’ cache-based distributed storage (active): efficient use of disk through greater reliance on the network, federated xrootd
  - Hierarchical storage incorporating SSDs (active)
  - Highly scalable ‘noSQL’ databases (active): tools from the ‘cloud giants’ (Cassandra, HBase/Hadoop, SimpleDB…)
  - ‘Flatter’ point-to-point networking model (active): validation, diagnostics, monitoring of (especially) T2-T2 networking
  - GPU computing (active in ATLAS, not in the US)
  - Managing complexity in distributed computing (active): monitoring, diagnostics, error management, automation
Cloud-enabled Elasticity – Home resource expands elastically
· Cloud providers “join” home/dedicated resources
  - Ability to add resources on demand
· Virtual machines deployed on demand
  - Thereby establish a proven environment applications can run in
  - Dynamically create a PanDA Production site
  - Use cloud resources as long as needed, turn them off when done (see the sketch below)
· Scalable infrastructure
  - Addresses primarily the computational side; Simulated Event Production is an ideal candidate – computationally intensive with small I/O requirements
· Current cloud service offerings lack adequate technical capabilities and attractive pricing for data-intensive processing tasks
  - May require “hybrids”, a combination of Grids and Clouds, for the next few years
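A minimal, hypothetical sketch of the elasticity loop described above; the CloudProvider/queue interfaces, image name, and thresholds are illustrative assumptions, not actual PanDA or cloud-provider APIs:

```python
# Hypothetical elastic-expansion loop (illustrative only; not real PanDA or cloud APIs).
# When the production queue backs up, boot extra worker VMs from a pre-built image;
# when it drains, terminate them so the temporary "cloud site" disappears again.
import time

SCALE_OUT_THRESHOLD = 1000   # queued simulation jobs that justify cloud workers (illustrative)
MAX_CLOUD_WORKERS = 200

def elastic_loop(panda_queue, cloud, image="atlas-worker-image"):
    workers = []
    while True:
        queued = panda_queue.num_queued_jobs()          # assumed helper on a queue object
        if queued > SCALE_OUT_THRESHOLD and len(workers) < MAX_CLOUD_WORKERS:
            # boot a VM from an image carrying the validated ATLAS environment (e.g. via CVMFS)
            workers.append(cloud.start_instance(image))
        elif queued == 0 and workers:
            # work is done: release the temporary cloud resources
            for vm in workers:
                cloud.terminate_instance(vm)
            workers.clear()
        time.sleep(60)                                  # re-evaluate once a minute
```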
Proposed Cloud Roadmap
J. Hover
From Local to Global Data Access
Using existing solutions to make data globally accessible – a pragmatic approach
• Use Xrootd (SLAC, CERN) to build a federated storage system based on autonomous storage systems at sites
• Supports file copies & direct/sparse access across all sites and across heterogeneous storage systems
• Work in collaboration w/ US CMS and OSG
C. Waldman
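For illustration, a minimal sketch of what transparent federated access looks like from a user job; the redirector hostname and file path are hypothetical, and the sketch assumes PyROOT is available (ROOT’s TFile::Open understands root:// URLs):

```python
# Minimal sketch of reading an ATLAS file through a federated xrootd redirector.
# The redirector host and file path below are hypothetical examples.
import ROOT

# The global redirector looks up which federated site actually holds the file
# and redirects the client there; the job does not need to know the site.
url = "root://global-redirector.example.org//atlas/user/some_dataset/AOD.pool.root"

f = ROOT.TFile.Open(url)          # sparse/direct reads go over the WAN as needed
if f and not f.IsZombie():
    print("opened", f.GetName(), "size", f.GetSize(), "bytes")
    f.Close()
```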
… and improve the Performance by adding a Caching Layer
Reading a 1 GB ATLAS AOD file over HTTP
* Though I/O is asynchronous, the performance is limited by the wide-area network latency (a rough estimate of the effect follows below)
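A rough, illustrative estimate of why WAN latency dominates here; all parameters are assumptions chosen only to show the shape of the effect, not measured values:

```python
# Illustrative latency-bound throughput estimate (all parameters are assumptions).
# With a fixed amount of data outstanding ("in flight") per round trip, the
# achievable rate is roughly outstanding_bytes / round_trip_time.
file_size = 1 * 1024**3            # 1 GB AOD file, as in the slide

def read_time(rtt_s, outstanding_bytes):
    throughput = outstanding_bytes / rtt_s
    return file_size / throughput

wan_rtt, lan_rtt = 0.100, 0.001    # 100 ms across the WAN vs ~1 ms to a nearby cache
outstanding = 4 * 1024**2          # assume ~4 MB of requests kept in flight

print(f"over the WAN:  ~{read_time(wan_rtt, outstanding):.0f} s")   # ~26 s
print(f"from a cache:  ~{read_time(lan_rtt, outstanding):.1f} s")   # ~0.3 s
```

The point is qualitative: a caching layer close to the client removes the WAN round trips from the read path.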
Network Evolution – “Flattening” today’s Hierarchical Model
[Diagram: Hierarchical (MONARC) Model (restricted) vs. ultimate Pull Model based on unrestricted access across sites; compromise: proposed implementation based on regional exchange points (LHCONE), organized per region, e.g. US]
Summary
· Computing was a big success as an enabler for physics, on its own metrics but also on the ultimate metric of timely physics output
· The Facilities, the Tier-1 and the Tier-2 centers, have performed well in initial LHC data taking and analysis
  - Production and Analysis Operations Coordination provides seamless integration with ATLAS world-wide computing operations
  - A very effective Integration Program is in place to ensure readiness in view of the steep ramp-up of analysis operations with real data
  - Excellent contribution of Tier-2 sites to high-volume production (event simulation, reprocessing) and analysis
· The U.S. ATLAS Computing Facilities need sufficient funding to be on track to meet the ATLAS performance and capacity requirements
  - Tier-2 funding uncertainties beyond 2011; a proposal for a Cooperative Agreement with NSF for 2012 – 2016 was submitted in Dec 2010
  - Tier-1 equipment target reduced to minimum; extended equipment lifetime increases risk
· 2011 will also mark the rise of the Tier 3s in the U.S.
· U.S. ATLAS is actively pursuing continuation of OSG
· Overall, the Facilities in the U.S. have performed very well during the 2010 run, and I have no doubt that this will hold for 2011 and beyond!