Project Status Report
Ian Bird
Computing Resource Review Board, 20th April 2010
CERN-RRB-2010-033
Ian.Bird@cern.ch 2
Project status report
• Overall status – experience with data
• Planning and milestones
• Status of planning for new Tier 0
• Brief summary of EGEE → EGI transition
• Resource planning for 2010, 2011, 2012
Sergio Bertolucci, CERN 3
... And now at 7 TeV
Today WLCG is:
• Running increasingly high workloads:
  – Jobs in excess of 650k/day; anticipate millions/day soon
  – CPU equivalent of ~100k cores
• Workloads are:
  – Real data processing
  – Simulations
  – Analysis – more and more (new) users (e.g. CMS: number of users doing analysis)
• Data transfers at unprecedented rates (next slide)
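As a rough back-of-the-envelope cross-check (an illustrative calculation, not a figure from the report), the quoted ~650k jobs/day on the equivalent of ~100k cores imply an average job length of a few core-hours:

```python
# Illustrative cross-check of the quoted WLCG workload figures.
# The derived average job length is an inference, not a number from the slides.
jobs_per_day = 650_000       # "jobs in excess of 650k / day"
core_equivalent = 100_000    # "CPU equiv. ~100k cores"

core_hours_per_day = core_equivalent * 24
avg_job_length_h = core_hours_per_day / jobs_per_day
print(f"average job length: {avg_job_length_h:.1f} core-hours")  # ~3.7
```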
Data transfers
• Final readiness test (STEP'09): 2009 saw STEP09 plus preparation for data taking, at nearly 1 petabyte/week
• Preparation for LHC startup, then LHC physics data
• Castor traffic last week: >4 GB/s input, >13 GB/s served
• Real data – from 30/3
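To put the quoted rates and the weekly volume on the same footing, a small conversion helps (an illustrative sketch; it assumes a rate sustained over a full 7-day week, whereas the Castor figures are peaks):

```python
# Relate sustained transfer rates (GB/s) to weekly volumes (PB/week).
SECONDS_PER_WEEK = 7 * 24 * 3600  # 604800 s

def gb_per_s_to_pb_per_week(rate_gb_s: float) -> float:
    """Volume in PB moved in one week at a fully sustained rate (1 PB = 1e6 GB)."""
    return rate_gb_s * SECONDS_PER_WEEK / 1e6

# "Nearly 1 petabyte/week" corresponds to a sustained rate of roughly 1.7 GB/s;
# the >4 GB/s Castor input rate, if sustained, would move ~2.4 PB in a week.
print(f"{gb_per_s_to_pb_per_week(1.65):.2f} PB/week")
print(f"{gb_per_s_to_pb_per_week(4.0):.2f} PB/week")
```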
WLCG uses EGEE & OSG
[Chart: daily CPU usage, showing levels of 85k CPU-days/day and 30k CPU-days/day]
Readiness of the computing
• Has meant very rapid data distribution and analysis
  – Data is processed and available at Tier 2s within hours!
[Plots: CMS, ATLAS, LHCb]
More and more users
• CMS: >200 users, ~500 jobs on average over 3 months
• ATLAS: number of distinct users accessing various data types – many hundreds of users accessed grid data
And physics output ...
Fibre cut during STEP'09: redundancy meant no interruption
Reliabilities
• This is not the full picture:
  – Experiment-specific measures give a complementary view
  – They need to be used together with some understanding of the underlying issues
Site availability seen by experiments
• Site readiness as seen by the experiments (left: the week before data taking; right: the first week of data)
WLCG timeline 2010–2012
[Timeline chart: pp running and heavy-ion (HI) periods for 2010 and 2011, with shutdowns (SU); 2010 and 2011 capacity commissioning milestones; EGEE-III ends, handover to EGI & NGIs; HEP-SSC and EMI (SA3) project periods]
• Now a full report each month
• glexec + SCAS services available; deployment discussion / policy ongoing
• Not all sites yet publishing; information validation in progress
Future milestones
• Actually very few formal milestones now
  – Moved from set-up to regular operations
• Not all problems are solved – and more will certainly arise
  – These can be made subject to specific milestones
• However, in general we must move from tracking milestones to tracking metrics for:
  – Performance
  – Reliability
  – Scalability
• Today we have some – but we need to propose a set of useful metrics that we track
  – Accounting, reliability/availability and throughputs are published on-line
  – Operational metrics are reviewed weekly
  – A lot of information is in different places (SLS, dashboards, etc.)
STATUS OF PLANS FOR TIER 0
CERN IT Department, CH-1211 Genève 23, Switzerland – www.cern.ch/it
Frédéric Hemmer
Revised Tier 0 strategy
• The power situation has evolved:
  – Aggressive replacement of old equipment
  – Technology evolution
  – Refined estimates of needs in the next few years
  – 400 kW of additional power made available (2.5 → 2.9 MW)
  – But the situation for backed-up (diesel) power is more critical – close to the limit, with a lack of redundancy
• Revised strategy:
  – Hosting agreement for 100 kW of backed-up power in the Geneva area
  – Consolidate the existing computer centre's critical power situation
  – Investigate container solutions for incremental capacity additions
  – Investigate (far) remote hosting possibilities
Tier-0 Power needs estimates
NB: the real limit is closer to 2.7 MW than the 2.9 MW assumed so far
March 2010 situation
• Additional 400 kW in building 513
  – The power capacity has been made available
• Critical power consolidation in building 513
  – Various solutions are being studied
    • Requiring additional UPS & cooling capacity
    • Should provide ~600 kW of backed-up power, hopefully in addition to the 2.9 MW
  – Will not be available before mid-2011
• External hosting of 100 kW in Geneva
  – Hosting company identified & contract being signed
    • Target implementation: summer 2010
  – Will allow for initial experience of remote operations
• Containers
  – Initial technology assessment done & market survey launched
  – Location: Prévessin, close to building 931
    • Will require civil engineering to host the electrical power distribution
    • Cannot be available before end 2011
• (Far) remote hosting proposals
  – No concrete financial proposals yet from Norway
    • Although the technical pre-proposal is fairly clear
  – Likelihood that a similar offer will come from Finland
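The backed-up-power measures with quoted sizes can be tallied as a simple budget (an illustrative sketch using only the figures on this slide; the container and far-remote options are left out since they are not yet sized):

```python
# Illustrative tally of the planned backed-up (critical) power additions, in kW.
# Only the two items with quoted capacities are included.
additions_kw = {
    "B513 critical power consolidation (mid-2011)": 600,
    "External hosting in Geneva (summer 2010)": 100,
}
total_kw = sum(additions_kw.values())
print(f"planned extra backed-up power: {total_kw} kW")  # 700 kW
```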
Summary
• Current estimates predict that the Computer Centre will now run out of power around 2013
  – Within the current requirements of the experiments
  – Within the limits of technology evolution
• IT has started to prepare several stop-gap solutions to be able to cope with changing conditions, as well as alternative options
  – But the costs are significant
• Decisions for the medium term should be taken in 2010, in the light of experience with data taking and once the alternative options can be evaluated
EGI: Status of project submissions
There were 3 different (sub-)calls:
1) EGI itself (project named EGI-InSPIRE); includes an activity (SA3) specifically focussed on support for existing large communities
   • This project was invited to a hearing; likely to receive the requested funding
2) Middleware (project named EMI); includes support for all gLite software required by WLCG (FTS, LFC, dCache, etc.)
   • This project was invited to a hearing; asked to make a 900 k€ cut
3) Virtual Research Communities (ex-SSC); there were several EGEE-derived proposals, including one (ROSCOE) that contained a VRC for HEP
   • These will NOT be funded
• Project funding is expected to start only in June (may be backdated to May)
EGEE–EGI: Risk for WLCG?
• This situation does not represent a major risk for WLCG
  – The EGEE → EGI transition is well planned by EGEE, and is well advanced
  – Countries representing the majority of the resources have NGIs, and the Tier 1s are well placed
  – Important operational tools (GGUS, monitoring, etc.) are assured even if project funding does not appear
    • WLCG operational procedures are well tested and are mostly independent of the existence of EGEE or EGI
  – The SA3 activity contains Dashboards, Ganga, and specific tasks for each experiment (~2 FTE each); the VRC had integration/analysis support
  – EMI contains essential middleware support and "harmonisation" of gLite/ARC/UNICORE (long-term development was not included)
• No funding for a HEP VRC means that work with other application communities at CERN will be significantly reduced
• Should now consider the strategy for the longer term of the middleware
Status of non-European states
• Concern was expressed at the last RRB over the status in EGI of some non-EC states
• The situation has evolved:
  – EGI.eu: introduced an Associate member status
  – EGI-InSPIRE project: full partners
RESOURCE PLANNING
Baseline assumptions used by all experiments for requirements analysis
Present understanding of schedule for both 2010 and 2011
• 2010 + 2011
  – Running from mid-February to end November – Pb–Pb in November
  – In principle stop after 1 fb⁻¹; plan to run for 2 years (0.2 fb⁻¹ in 2010, the remainder in 2011)
• 2012: shutdown of the accelerator (but not of computing)
Assumptions and guidance: 2010, 2011, 2012

Assumptions:
• The agreed RRB year is April–March (i.e. resources for a given year are available by April)
  – In 2010 this was exceptionally delayed until June 1st (based on the schedules understood at that time)
  – Of course, some Tier 1s have already installed some fraction of their 2010 pledges
• It was also agreed that in 2011 we revert to the April installation deadline
• The 2010 pledges and installation schedules cannot be changed:
  – the nominal 2009 resources must satisfy the needs until the end of May;
  – the 2010 resources should cover the time from June to March 2011;
  – and the 2011 resources from April 2011 onwards.
Live time:
• 30 days/month = 720 hours
• Folding in efficiencies: 720 × 0.7 × 0.4 ≈ 200 effective hours/month
  1) Availability of the machine for physics = 0.7
     • The rest is technical stops + recovery from technical stops + dedicated machine development (MD)
  2) Efficiency for physics = time with colliding beams / time the machine is available = 0.4
     • The rest is turnaround time + faults + access
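The live-time estimate above is straightforward to reproduce (same numbers as on the slide):

```python
# Effective physics hours per month, as estimated on the slide.
hours_per_month = 30 * 24      # 720 h in a 30-day month
machine_availability = 0.7     # fraction of time the machine is available for physics
physics_efficiency = 0.4       # colliding-beam fraction of the available time

effective_hours = hours_per_month * machine_availability * physics_efficiency
print(f"{effective_hours:.0f} effective hours/month")  # ~200 (201.6 exactly)
```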
Summary of requirements

Totals      2010     2010 pledge   2011     2012
CERN CPU    233.4    233.4         263.3    219.7
CERN disk   14.79    14.8          19.7     22.8
CERN tape   31.7     31.7          48.8     49.7
T1 CPU      394.1    412           543.5    584
T1 disk     49.39    44.5          66.3     68.9
T1 tape     56.2     51.4          111.07   131.72
T2 CPU      562.6    511.1         730.2    787
T2 disk     46.62    39.6          75.42    78.42

The old 2010 requests and the 2010 pledges are as presented at the Autumn 2009 RRB.
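One way to read the table is to compare the 2010 pledges against the 2010 requests (an illustrative calculation derived here from the numbers above; the percentages are not quoted in the report, and the units are those of the table):

```python
# Compare 2010 pledges with 2010 requests, per row of the table above.
rows = {                      # (2010 request, 2010 pledge)
    "CERN CPU": (233.4, 233.4),
    "T1 CPU":   (394.1, 412.0),
    "T1 disk":  (49.39, 44.5),
    "T1 tape":  (56.2, 51.4),
    "T2 CPU":   (562.6, 511.1),
    "T2 disk":  (46.62, 39.6),
}
for name, (request, pledge) in rows.items():
    delta_pct = 100.0 * (pledge - request) / request
    print(f"{name:9s} pledge vs request: {delta_pct:+6.1f}%")
```

For example, the Tier 1 CPU pledge exceeds the request by about 4.5%, while the Tier 2 CPU pledge falls about 9% short.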
Concerns
• Budget cut in France: −40%
  – Notified after the last RRB
  – The proposed impact for 2010 is somewhat less, with planning and management
  – Risk for 2011?
• Concerns over some Tier 1s
  – Recent experience is good; we hope this is sustainable in the long term
• Level of effort available in EMI for middleware support
  – Including the release process etc.
  – May be at the limit
• Data access for analysis
  – Early discussions on how to address this – a 2-year timescale
Summary
• First experience with data has been positive from the WLCG point of view
  – Thanks to the huge efforts invested in recent years in testing
  – All Tier 0, Tier 1 and Tier 2 staff must take the credit for this
• Resource planning for the coming years is a concern
• Still to see what effect many more non-expert users will have
• The transition from EGEE to EGI is happening now
  – It is (hopefully!) not a major risk for WLCG
• Must start to address the long-term sustainability of the system we have