ATLAS Report 14 April 2010 RWL Jones


Page 1

ATLAS Report

14 April 2010, RWL Jones

Page 2

The Good News

At the IoP meeting before Easter, Dave Charlton said the best thing about the Grid was that there was nothing to say about it.

It is good that he thinks so! But it is also a little optimistic. Stick with the good news for now…

People doing real work on the Grid, and in large numbers

Page 3

Data throughput

[Plot: total ATLAS data throughput via Grid (Tier-0, Tier-1s, Tier-2s), in MB/s per day, with markers for beam splashes, first collisions, cosmics and the end of data-taking]

ATLAS was pumping data out at a high rate, despite the low luminosity

We are limited by trigger rate and event size. We also increased the number of first-data replicas.
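The throughput ceiling set by trigger rate and event size can be sketched with a back-of-envelope calculation. The rate, event size and replica count below are illustrative assumptions, not ATLAS figures:

```python
def raw_throughput_mb_s(trigger_rate_hz, event_size_mb, n_replicas=1):
    """Data rate in MB/s: events per second, times event size,
    times the number of first-data replicas shipped over the Grid."""
    return trigger_rate_hz * event_size_mb * n_replicas

# Hypothetical numbers: 200 Hz trigger, 1.6 MB events, 2 replicas.
rate = raw_throughput_mb_s(200, 1.6, n_replicas=2)
print(rate)                 # 640.0 MB/s
print(rate * 86400 / 1e6)   # ~55 TB moved per day
```

Raising either the trigger rate or the replica count scales the required Grid bandwidth linearly, which is why both are the knobs mentioned above.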

Page 4

900 GeV reprocessing

We are reprocessing swiftly and often. We need a prompt and continuous response. This dataset is tiny compared with what is to come!

Page 5

Reprocessed data to UK

Only one outstanding data-transfer issue for the UK by the end of the Christmas period.

Page 6

Tier 2 Nominal Shares 2010

Tier 2 disk share: Real 36%, MC 46%, Group 9%, Scratch 9%

Tier 2 CPU share: Simulation 48%, Group 13%, User 39%
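The nominal shares are easy to sanity-check and to turn into an actual allocation in a few lines of Python. The 100 TB total below is a made-up example, not a GridPP pledge:

```python
# Nominal 2010 Tier-2 shares from the slide above.
disk_shares = {"Real": 0.36, "MC": 0.46, "Group": 0.09, "Scratch": 0.09}
cpu_shares = {"Simulation": 0.48, "Group": 0.13, "User": 0.39}

# Both sets of shares should sum to 100%.
assert abs(sum(disk_shares.values()) - 1.0) < 1e-9
assert abs(sum(cpu_shares.values()) - 1.0) < 1e-9

# Turn the fractions into space for a hypothetical 100 TB site.
total_tb = 100
allocation = {activity: frac * total_tb for activity, frac in disk_shares.items()}
print(round(allocation["MC"], 1))   # 46.0 TB for Monte Carlo
```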

Page 7

Event data flow to the UK

Data distribution has been very fast despite the event size. Most datasets are available in the Tier 2s within a couple of hours.

Page 8

High Energy Data Begins

[Plots: UK T2 throughput (MB/s), UK T2 transfer volume (GB), UK T1 throughput (MB/s)]

Page 9

Analysis

This is going generally well. Previously recognised good sites are generally doing well. The workload is uneven, depending on the data distribution.

This is not yet in ‘equilibrium’ as the data is sparse. Remember, datasets for specific streams go to specific sites, so usage reflects those hosting minimum-bias samples. However, the fact that UK sites were favoured (e.g. Glasgow) also reflects good performance. The 7 TeV running will move things closer to equilibrium.

Page 10

Data Placement

There are issues in the current ATLAS distribution: there should be a full set in each cloud. This has not always happened, because of a bug. We need to be more responsive to site performance and capacity. At the moment, the UK has been patching-in extra copies ‘manually’.

ATLAS has followed UK ideas and has introduced ‘primary’ and ‘secondary’ copies of datasets. Secondary copies live only while space permits. This improves access: the UK typically has 1.6 copies.
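The primary/secondary scheme amounts to a simple eviction policy: primary copies are pinned, secondary copies are dropped when a site needs the space. The class below is a toy sketch of that idea; the dataset names, sizes and the oldest-first eviction order are all assumptions for illustration, not the real ATLAS data-management implementation:

```python
from collections import deque

class SiteDisk:
    """Toy model of a site's disk: primary copies are pinned,
    secondary copies live only while space permits."""
    def __init__(self, capacity_tb):
        self.capacity = capacity_tb
        self.used = 0.0
        self.primaries = {}          # dataset -> size in TB, never evicted
        self.secondaries = deque()   # (dataset, size), evicted oldest-first

    def add(self, dataset, size_tb, primary=False):
        # Free space by dropping secondary copies -- "at little notice".
        while self.used + size_tb > self.capacity and self.secondaries:
            _, old_size = self.secondaries.popleft()
            self.used -= old_size
        if self.used + size_tb > self.capacity:
            return False             # does not fit even with no secondaries
        if primary:
            self.primaries[dataset] = size_tb
        else:
            self.secondaries.append((dataset, size_tb))
        self.used += size_tb
        return True

disk = SiteDisk(capacity_tb=10)
disk.add("minbias.primary", 6, primary=True)
disk.add("egamma.copy2", 3)                  # secondary, fits for now
disk.add("muons.primary", 4, primary=True)   # forces the secondary out
print(len(disk.secondaries), sorted(disk.primaries))
# 0 ['minbias.primary', 'muons.primary']
```

Averaged over all datasets, the surviving secondaries are what lifts the UK's replica count to the ~1.6 copies quoted above.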

Page 11


The UK and Data Placement

The movement and placement of data must be managed. Overload of the data management system slows the system down for everyone. Unnecessary multiple copies waste disk space and will prevent a full set being available; some multiple copies will be a good idea, to balance loads.

We have a group for deciding data placement: the UK Physics Co-ordinator, the UK deputy spokesman, Tony Doyle (UK data ops) and Roger Jones (UK ops), aided by Stewart, Love & Brochu. The UK Physics Co-ordinator consults the institute physics reps.

The initial data plan follows the matching of trigger type to site from previous exercises. We will make second copies until we run short of space; then the second copies will be removed *at little notice*.

Page 12

But dataset x is not in the UK

In general, this should not be the case unless it is RAW. Access it elsewhere (unless it is RAW or a less popular ESD): the job goes to the data, not the data to you.

We can copy small amounts to the UK on request, e.g. my Higgs candidate in RAW or ESD. But we must manage it. Specify:

- what the need for the data is (activity, which physics and performance group)
- why it is not already covered by a physics or performance group area
- how big it will be *at a maximum*
- how the data will be used (what sort of job is to be run, database access etc.)

We are still surprised to see requests for datasets that are freely available on the Grid in the UK to be copied to ‘their’ Tier 2. Local requests should be to Tier 3 (non-GridPP) space.


Page 13

Site responsibilities

Sites are either supporting important datasets or supporting physics groups, and reliability is vital: we need to be in the high 90s at least. That means paying a lot of attention to ATLAS monitoring and not just to SAM tests. The switch to a SAM Nagios-based system is potentially useful, but many bugs remain to be ironed out. Sites just have to be pro-actively looking at the ATLAS dashboards (blame this on the infrastructure people again).

We are reviewing middleware, but the sites must play their part. Local monitoring is important: it should not be users who spot site problems first! Sites must also look at ATLAS monitoring, not just SAM tests – they are not enough.

ATLAS is working to help this…
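The point that SAM tests alone are not enough can be made concrete: a site should be judged by its weakest monitoring view. A minimal sketch, where the pass rates and the 95% threshold are hypothetical numbers:

```python
def reliability(results):
    """Fraction of passed tests over a monitoring window."""
    return sum(results) / len(results) if results else 0.0

# Hypothetical daily results: SAM alone looks healthy, but the
# ATLAS-specific analysis tests tell a different story.
sam_tests = [True] * 98 + [False] * 2      # 98% pass rate
atlas_tests = [True] * 90 + [False] * 10   # 90% pass rate

# Judge the site by its weakest view, not by SAM alone.
site_healthy = min(reliability(sam_tests), reliability(atlas_tests)) >= 0.95
print(site_healthy)   # False: SAM passes, but ATLAS monitoring flags the site
```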

Page 14

Monitoring & Validation

ATLAS is working to improve the monitoring and to learn more from the user jobs. We focus on “active” probing of the sites, but “passive” yet automatic observation of the user jobs would lead to a better understanding of what is happening at the sites.

The current ADC metrics for analysis are the HammerCloud tests using the GangaRobot. These tests are heavy but fairly reliable, and they reflect the computing model and the needs of the data-taking era. Reminder: about 55% of CPU and about 100% of disk are for ATLAS-wide analysis, and about 0% of either is for local use!

Page 15

GangaRobot Today

~8 tests per site per day, with a mix of: a few different releases, different access modes, MC and real data, and conditions DB access. All are defined and prepared by Johannes Elmsheuser.

Results appear on the GR web pages and in SAM. They are non-critical, so sites usually ignore them.

Auto-blacklisting operates on EGEE/WMS. A twice-daily email report is sent to DAST containing: sites with failures, and sites with no job submitted (brokerage error, e.g. no release).
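The blacklisting-and-report logic described above can be sketched as a small classifier. The site names, the majority-failure threshold and the report structure are assumptions for illustration, not the real GangaRobot code:

```python
def classify_sites(daily_results, fail_threshold=0.5):
    """daily_results maps site -> list of test outcomes for the day,
    or None if no job could be submitted (e.g. brokerage error)."""
    report = {"blacklist": [], "no_jobs": [], "ok": []}
    for site, outcomes in daily_results.items():
        if outcomes is None:
            report["no_jobs"].append(site)      # e.g. missing release
        elif outcomes.count(False) / len(outcomes) > fail_threshold:
            report["blacklist"].append(site)    # most of the ~8 tests failed
        else:
            report["ok"].append(site)
    return report

day = {
    "UKI-SCOTGRID-GLASGOW": [True] * 8,            # all tests pass
    "UKI-SOUTHGRID-OX": [False] * 6 + [True] * 2,  # mostly failing
    "UKI-NORTHGRID-LANCS": None,                   # no job submitted
}
report = classify_sites(day)
print(report["blacklist"], report["no_jobs"])
```

Both lists would then go into the twice-daily email to DAST.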

Page 16

ATLAS Validation – GR2

A new tool, GR2, is under development to validate sites, with a lighter load on the sites – GR2 is HC in ‘gentle mode’. It introduces the concept of test templates (release, analysis, dataset pattern, [sites]), defined by ADC.

It still has bugs: installations need to be clearly defined and installed, and test samples need to be in place.

This will almost certainly be the framework for future metrics; the metrics themselves require more experience to define.

Page 17

Installations

Our sites have been apparently ‘off’ because of missing releases, and ATLAS central is also slow at responding to problems with non-CERN set-ups. A major clean-up is underway, and an auto-retry installer is under development.

Page 18

PANDA & WMS

There are now two distinct groups of users: those who use the PANDA back-end, and those who use the WMS. There is less monitoring of the WMS, and less control. Some (e.g. Graeme) favour a tight focus on the PANDA approach. I am not sure this is possible; however, ATLAS clearly has more feedback and more control if this route is taken. Do not be surprised!

Page 19

Middleware

Sites cannot be made 100% reliable with the current middleware

Many options are being considered. In particular, data management may reduce from 3 layers to 2, which would effectively remove the LFC. Radical options are also being considered, BUT ATLAS is involved in recognising the limitations of today’s system and making it work.

Page 20

Conclusion

We are now finally dealing with real data, and we are still learning. We must all work hard to make things work. Many thanks for everyone’s effort so far, but the work continues for 20 years!

The UK has been heavily used and involved in the first physics studies. This is partly because of data location, but also because we are a reliable cloud.

We can all celebrate this at the dinner tonight, but please keep an eye on your sites on your smartphones!