GridPP Deployment Status, GridPP14, Jeremy Coles, [email protected], 6th September 2005
Overview
1 The main changes over the last two months
2 Trends in basic EGEE metrics
3 Utilisation and efficiency
4 Deployment priorities
5 Brief look at service challenges
6 Summary
Old vs new SFT
• http://goc.grid.sinica.edu.tw/gocwiki/Site_Functional_Tests
• See Piotr Nyczyk’s mail to LCG-ROLLOUT, 21st July
Old vs new SFT
1. Change in critical tests
2. Change in impact of test order
3. Tests are run more regularly
4. THINGS NOW LOOK MUCH MORE STABLE!
The new SFTs are used to populate regional weekly views
… and monthly views. The variations need to be understood (avg. 24hrs)
Sites with large farms upgrading?
Tier-1 scheduler lost
GridPP is still the largest contributor of resources
UK job slots have increased by >20% in last few months
[Figure: UK total job slots. Published job slots (0 to 3500) against date.]
Next to CERN additions this is one of the major recent increases
[Figure: EGEE total job slots and UK total job slots (0 to 20000) against date, 06/24/04 to 08/18/05.]
Contribution to EGEE CPU resources therefore remains good at ~20%
[Figure: UK % total CPU. Percentage contribution (0% to 35%) against date, 06/24/04 to 08/27/05.]
This has translated into GridPP taking an average of about 20% of the work recently
[Figure: % EGEE jobs running in UK. Percentage of jobs in UK (0% to 70%) against date, 06/02/2004 to 08/24/05.]
Which reflects the fact that our sites remain at least as stable as the EGEE average
[Figure: gstat metric (0 to 45) for EGEE and UKI against date, 24/01/2005 to 22/08/2005.]
A reminder of the “gstat metric” basis
Status  Description                                            Example
0       na or no status available
10      ok or normal status                                    No problems
20      info or useful information                             Storage over 90% full
30      note or important information                          GridIce tests are failing
40      warn or subject may fail soon                          Blank values or wrong format in configuration
50      error or subject has failed and problem is localised   A query failed (e.g. no cpu information found)
60      crit or subject has failed and problem is fatal
-       maint or subject is under maintenance                  Scheduled downtime at site
-       off or subject has monitoring off                      Site is undertaking work that would trigger alerts
Gstat metric = ((#ok sites)*10+(#info sites)*20+(#note sites)*30+(#warn sites)*40+(#error sites)*50+(#crit sites)*60) / (#sites – (#maint+#off))
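The formula above can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name, the input layout (a mapping of site name to status string) and the example sites are assumptions, not part of gstat itself.

```python
# Weights per status, taken from the table above. "na" carries no weight
# but the site still counts in the denominator.
WEIGHTS = {"na": 0, "ok": 10, "info": 20, "note": 30,
           "warn": 40, "error": 50, "crit": 60}

# Sites in these states are removed from the denominator.
EXCLUDED = {"maint", "off"}


def gstat_metric(site_status):
    """Return the gstat metric for a mapping of site name -> status.

    Lower is better: a region where every counted site is "ok" scores 10.
    """
    counted = [s for s in site_status.values() if s not in EXCLUDED]
    if not counted:
        raise ValueError("no sites left after excluding maint/off")
    # An unknown status raises KeyError rather than being silently ignored.
    weighted_sum = sum(WEIGHTS[s] for s in counted)
    return weighted_sum / len(counted)


# Hypothetical region: two healthy sites, one warning, one in maintenance.
example = {"site-a": "ok", "site-b": "ok", "site-c": "warn", "site-d": "maint"}
print(gstat_metric(example))  # (10 + 10 + 40) / (4 - 1) = 20.0
```

With this reading, scheduled downtime does not penalise a region, whereas a site with no status at all pulls the average down towards zero.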
Occupancy averages at 55% for August (26% for period from June 04)
[Figure: % EGEE slots used and % UK slots used. % job slots used (0% to 90%) against date, 06/02/2004 to 08/24/05.]
Several sites have been running full for July/August. The plot below is for the Tier-1 in August
[Figure: Tier-1 plot for August, with a line labelled "Maximum Capacity".]
August was the busiest month for the Tier-1, as evidenced by the total KSI2K delivered (KSI2K*CPUMonths)
[Figure: CPU Use (KSI2K*CPUMonths), 0 to 900, by month (Jan to Dec). Series: Alice, Atlas, Babar, CDF, CMS, DZERO, LHCB, UKQCD, Minos, H1, Zeus, SNO, Other.]
There has been a Tier-1 investigation into job efficiency over the year (CPU time / elapsed time)
• Low efficiencies impact utilisation (in terms of CPU time provided)
• Produced by global performance problems on LCG SEs, coupled with problems in logging and book-keeping services
• Approximately 400 KSI2K*CPUmonths per month Feb-June – about 50% of total capacity
• Farm occupancy (job slots used) has increased
Efficiency is >1 if a job runs more than one CPU-intensive process
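A minimal sketch of the efficiency definition used here (CPU time divided by elapsed time). The function names and the sample job records are hypothetical, and aggregating by total CPU time over total elapsed time is an assumption about how a farm-wide figure might be formed, not a description of the Tier-1 analysis.

```python
def job_efficiency(cpu_seconds, elapsed_seconds):
    """CPU time / elapsed (wall-clock) time for a single job."""
    if elapsed_seconds <= 0:
        raise ValueError("elapsed time must be positive")
    return cpu_seconds / elapsed_seconds


def farm_efficiency(jobs):
    """Aggregate efficiency over (cpu_seconds, elapsed_seconds) pairs.

    Assumption: total CPU time over total elapsed time, so long jobs
    count for more than short ones.
    """
    jobs = list(jobs)
    total_cpu = sum(cpu for cpu, _ in jobs)
    total_elapsed = sum(elapsed for _, elapsed in jobs)
    return job_efficiency(total_cpu, total_elapsed)


# A job that spent half its slot waiting on storage.
print(job_efficiency(1800, 3600))   # 0.5
# A job running two CPU-intensive processes in one slot.
print(job_efficiency(7200, 3600))   # 2.0
print(farm_efficiency([(1800, 3600), (3600, 3600)]))  # 0.75
```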
Specific weighted job efficiencies for ATLAS in July
• Straight line structures show jobs which ran for a period of time before blocking on an external resource and eventually being killed by an elapsed time limit
• Clusters at low efficiency probably show performance problems on external storage elements
• Many problems seen here are NOW FIXED
We have seen a good general response to 2.6.0 deployment
[Figure: Sites at release (0 to 40) against date, 24/01/2005 to 22/08/2005. Series: LCG-2_4_0, LCG-2_3_1, LCG-2_3_0, LCG-2_5_0, LCG-2_6_0, Sites.]
SRMs and data migration
• SRMs and data migration – dCache/DPM
– We have most experience with dCache-SRM but are gaining knowledge of DPM
– The mailing list remains active – join and review the archives BEFORE attempting an installation so that we can support you better
– There is now a GridPP wiki, which brings us on to …
Links to all areas mentioned can be found on the deployment links page: http://www.gridpp.ac.uk/deployment/links.html
Our support model needs to be developed
[Diagram: support model. Elements shown: UKI ROC ticket tracking system (Footprints); Site A; GGUS; Regional service 1; Tier-1 helpdesk (Remedy); Grid-Ireland helpdesk (Remedy); GOSC (Footprints); CIC-on-duty; Users; Experiments/VOs; Savannah (bug tracking); Site administrators; LCG-ROLLOUT; TB-SUPPORT.]
Other areas (examples)
Technical
• Implications of LCG Baseline Services Group findings
• Procurement and deployment of more resources while maintaining a steady service
General
• PPARC signs the LCG MoU shortly – this commits all sites to a certain basic level of service (Tier-2s 72hrs response)
• The operations workshop at Culham (near RAL) later this month: http://egee.in2p3.fr/events/UKI/
• A training course for GridPP sysadmins to help prepare sites for SC4 and the increasing service demands (PPARC signs an LCG MoU soon!)
• A UK support workshop for users and sysadmins?
Service Challenge 3 enters a new phase
Phase 1 (throughput tests) – July 2005
– dCache-SRM working at all sites
– Tier-1 managed rates (on UKLIGHT) up to 650 Mb/s to CERN. This is similar to SC2 rates.
– Edinburgh – 10TB data transferred. Sustained rates of 220-250Mb/s
– Imperial – Rates reached 400-480 Mb/s
– Lancaster – 958GB (978 files) over 8 days (~27Mb/s sustained)
Phase 2 (service phase) from 1st September 2005
– The experiments will use the SC3 infrastructure for testing their models and production
– Experiment (basic functionality) test jobs are being developed (to run as part of the SFTs) to check sites
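For sites planning their own throughput tests, the relation between data volume, rate and duration can be sketched as below. The helper names are hypothetical, and decimal units are assumed (1 GB = 10^9 bytes, 1 Mb/s = 10^6 bits per second); the slides do not state which convention was used.

```python
SECONDS_PER_DAY = 86400


def average_rate_mbps(bytes_transferred, seconds):
    """Average rate in Mb/s over the whole wall-clock period."""
    return bytes_transferred * 8 / seconds / 1e6


def transfer_days(bytes_to_move, rate_mbps):
    """Days needed to move a volume at a constant rate in Mb/s."""
    return bytes_to_move * 8 / (rate_mbps * 1e6) / SECONDS_PER_DAY


# 10 TB at the upper Edinburgh rate of 250 Mb/s.
print(round(transfer_days(10e12, 250), 2))

# 958 GB spread evenly over 8 full days.
print(round(average_rate_mbps(958e9, 8 * SECONDS_PER_DAY), 2))
```

Averaged over the full 8 days, the Lancaster volume gives a lower figure than the quoted ~27Mb/s sustained rate, so the quoted rate may have been measured over active transfer time only; the slides do not say.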
Service Challenge 4 will affect all sites – start preparing!
• SC4 consists of a Setup Phase starting on 1st April 2006, during which a number of throughput tests will be performed, followed by a Service Phase from 1st May 2006 until 30th September 2006
• All service components for SC4 need to be delivered ready for production by the 31st January 2006
• Final testing and integration of components and services must be completed by 31st March 2006
… more details in the panel discussion later today.
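As a planning aid, the SC4 dates above can be held in a small table and checked against a reference date. This is a sketch only: the data layout and helper name are assumptions, and the reference date used is simply the date of this talk.

```python
from datetime import date

# SC4 milestones as listed above, in chronological order.
SC4_MILESTONES = [
    (date(2006, 1, 31), "All service components delivered ready for production"),
    (date(2006, 3, 31), "Final testing and integration completed"),
    (date(2006, 4, 1), "Setup Phase starts (throughput tests)"),
    (date(2006, 5, 1), "Service Phase starts"),
    (date(2006, 9, 30), "Service Phase ends"),
]

TALK_DATE = date(2005, 9, 6)


def days_until(milestone, today):
    """Days remaining from `today` to `milestone` (negative if passed)."""
    return (milestone - today).days


for when, what in SC4_MILESTONES:
    print(f"{when.isoformat()}  {days_until(when, TALK_DATE):4d} days  {what}")
```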
Summary
1 We have seen changes in SFTs
2 GridPP remains a major contributor to LCG/EGEE resources
3 Use of resources is increasing – there were concerns about efficiency
4 Sites did well with the upgrade during a vacation period
5 Two major deployment tasks – support & SRM implementations
6 Service Challenge 3 enters the “Service Phase”. SC4 planning starts