CMS and the Grid: 2004 – 7
Or: Peta-meta-computing in the proto-Grid era: a sociotechnological retrospective
Or: Whose paradigm is it anyway?
“Experiments for the Grid” / “The Grid for Experiments”
Dave.Newbold@cern.ch, GridPP19, 29/8/07


Page 1:

CMS and the Grid: 2004 – 7

Or

Peta-meta-computing in the proto-Grid era: a sociotechnological retrospective

Or

Whose paradigm is it anyway?

“Experiments for the Grid”

“The Grid for Experiments”

Page 2:

History

2004: The end of innocence
DC04 data challenge - a learning experience (also start playing with a new idea: “PhEDEx”)

2005: The year of (re-)design
Computing TDR
A new ground-up software & computing framework

2006: Making it work
CSA06 - the acid test
CASTOR in the UK
• “…we all thought you were crazy” - I. Fisk

2007: Making it work without losing our sanity
Learning about storage + data transfers
CSA07: the real fun is yet to come

Page 3:

DC04

Our unofficial mascot: “Everything Sucked”

Page 4:

DC04: 25% (of startup) data challenge
Set the traditional formula for subsequent challenges
Exercised T0, T1, T2 centres for a full month
50M events, ~15 centres, ~30 people

Tools & technologies
Ad hoc scripts for workflow mgmt + CASTOR at CERN
Variety of storage at T1 (SRB + ADS at RAL), plain disk at T2
First-generation EDG workload + data mgmt tools (incl. RLS)

Results
Technically: largely a disaster
• Managed to break essentially every component in the system
Organisationally: a major step forward
• Learned several key lessons that informed the computing model
• Established new projects to solve the technical problems

GridPP2 contributed substantially to the analysis / solution of problems

Page 5:

CSA06 - in Numbers

7 Tier-1 centres

35 Tier-2 sites

100M fully simulated events

1.4PB of data

400MB/s rate from RAL CASTOR

1200MB/s peak dataflow

70 people

180 meetings

500 dodgy disks (but only one tape eaten!)

£300k of electricity

40l of Champagne at end-of-CSA party

“Shambolic” CSA Forced to Abandon Targets - Headline, The Independent, 5th Jan

Page 6:

Lesson the 1st: Data

It’s all about the data, stupid!

Remember the ‘DataGrid’? We’ve failed to build a uniform approach to wide-area data management (WADM)

We deal with data processing centres, not CPU centres
The (remaining) hard problems are in efficient data access
I.e. storage, IO, data transfer at Tier-2, not just Tier-1

Need to nail the ‘local IO problem’ very soon
Still have a lot to do on reliable data transfer as well

BTW, the network still isn’t the bottleneck (but keep trying…)

Page 7:

Lesson the 2nd: Locality

Keep local stuff local

Aim of the Grid: avoid central points of congestion
Present coherent local interfaces, but reduce global state
Actually: aim of all coherent system-building strategies

Examples from current CMS system:
Use local catalogues whenever possible; update asynchronously
Don’t use off-site ‘Grid services’ for local workflows (e.g. reco)

This also applies to contacts with sites
‘Users’ / ‘experiments’ must work with sites directly
‘Up the chain of command and down again’ does not work
NB: also applies to ‘strategic’ discussions and resource planning
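The “keep local stuff local” pattern above can be sketched in a few lines. This is a minimal illustration, not a real CMS component; the class and method names are hypothetical. Reads always hit a local replica, while writes are queued and pushed to the global catalogue asynchronously, so local workflows never block on an off-site service:

```python
import queue

class LocalCatalogue:
    """Sketch of a site-local catalogue with asynchronous global updates.
    `global_store` stands in for the remote, off-site catalogue."""

    def __init__(self, global_store):
        self.local = dict(global_store)   # local replica for fast reads
        self.pending = queue.Queue()      # updates awaiting async flush
        self.global_store = global_store

    def lookup(self, name):
        # Never touches the wide-area network
        return self.local.get(name)

    def register(self, name, location):
        self.local[name] = location         # visible locally at once
        self.pending.put((name, location))  # global update deferred

    def flush(self):
        # Would run periodically in a background task; here it is
        # called explicitly to keep the sketch deterministic.
        while not self.pending.empty():
            name, location = self.pending.get()
            self.global_store[name] = location
```

The design choice this illustrates: the global view is allowed to lag (it only catches up at the next flush), which is exactly the “reduce global state, update asynchronously” trade-off described above.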

Page 8:

Lesson the 3rd: Reliability

Reliability trumps performance & scalability

Unreliable systems are extremely inefficient
N_tries goes as log(1−p)^−1, bookkeeping at least as N_tries^2

Unreliable systems are not trusted by users

If one can’t make a small system work, larger systems will be progressively worse

We are getting there, but not fast enough
Reliability can be achieved iff robustness is built in

Without reliability, what is the point of the Grid?
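One reading of the scaling claim above: if each attempt succeeds with probability p, the number of independent attempts needed to reach a target overall success probability R is n = log(1−R)/log(1−p), and any bookkeeping that cross-checks attempts against each other grows at least quadratically in n. A quick numeric sketch (the 0.999 target is an illustrative assumption, not from the slides):

```python
import math

def tries_needed(p_success, target=0.999):
    """Attempts needed so the overall success probability reaches
    `target`, assuming independent attempts each succeeding with
    probability p_success."""
    # 1 - (1 - p)^n >= target  =>  n >= log(1 - target) / log(1 - p)
    return math.ceil(math.log(1 - target) / math.log(1 - p_success))

for p in (0.95, 0.75, 0.5):
    n = tries_needed(p)
    print(f"p={p}: {n} tries, ~{n * n} bookkeeping entries")
```

The point of the slide survives the arithmetic: pushing per-attempt reliability from 50% to 95% cuts the retry count by more than 3x, and the quadratic bookkeeping term by roughly 10x.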

Page 9:

Lesson the 4th: Exceptions

Sticking plasters won’t cure a broken leg

We use the ‘network stack model’ of fault tolerance
Higher layer functions compensate for unreliability of lower layers

Alas, does not work for intrinsically unreliable systems
Example: wireless network in CERN building 40
802.11b fine with 1% error rate; collapses with 10% (CMS week!)

Fault-finding is impossible without fault reporting
And intelligible logging, recorded and accessible at all levels

‘Exception handling’ is clearly hard
A key property of a mature system
Remember: exceptions should be exceptional
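Part of the ‘network stack model’ failure mode above can be made concrete with a toy calculation (this ignores the congestion dynamics that cause real collapse, and the retry budget of 3 is an illustrative assumption): a higher layer that retries k times masks a low lower-layer loss rate almost completely, but the residual failure rate scales as loss^k, so a 10x worse link is 1000x worse end-to-end for the same budget.

```python
def residual_failure(loss, retries=3):
    """Probability a message is still lost after `retries` independent
    lower-layer attempts (the higher layer's compensation budget)."""
    return loss ** retries

for loss in (0.01, 0.10):
    print(f"link loss {loss:.0%} -> residual failure {residual_failure(loss):.0e}")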

Page 10:

Lesson the 5th: People

“Generic Grid sites” do not really exist

A site is precisely as “good” as the people running it
Objectively: throughput (transfers) tracks national holidays!
We are still in a highly labour-intensive mode; the labour is at sites

What does CMS need from site operators, today?
Close contact with ‘the project’ and ‘the users’; sharing of experience
Proactive deployment & testing of new services, software
Active participation in resource planning and data operations

Will ‘generic sites’ ever exist?
Not until central support and problem-tracking are much improved

Page 11:

Lesson the 6th: Focus

No more neat ideas - for now

In 2007/8, that is!
Focus on (dull, tedious, hard) integration, testing, documentation
The excitement will come from the physics!

But many ‘big unsolved problems’ for later:
How can we store data more efficiently? Can we?
How can we compute more efficiently? Can we?
How should we use multi-threading & virtualisation?
How do we use really high-speed networks?
Will anyone ever make videoconferencing work properly?

Someone should start targeting these problems…

Page 12:

Hype?

Where are we? (Or at least: what sign is the gradient?)

Page 13:

Whither the Grid?

Is CMS using the Grid?

PKI-based, uniform(ish) web services interfaces? Yes
• But also a lot of remote DB access for many purposes

Resource discovery / Info service? Not really
• >90% of CMS jobs are whitelisted at RB level (many even at user level)

Replica management? Partially, through our own mechanisms
• No real attempt at optimisation of data access - yet

Support, authentication, ROC services? Partially
• Augmented with CMS-specific and national support mechanisms

Has it all been worth it (so far)? Yes!
If it didn’t exist, we’d have had to invent it anyway

Will we become more Grid-like?
Undoubtedly (though not sure ‘utility computing’ will ever be a goer)
For now, efficiency appears to require simplicity - no surprise

The real value of ‘The Grid’ is yet to come

Page 14:

(Near) Future
The hard work starts here!

I say this every six months
So far it’s always been true

CSA07
Really the last big test of our organisation, readiness for data
Already reviewing some aspects of the model after discussion…

2008: The crunch year
Focus should be on basic reliable services at centres
Need to reinforce communications between expts and sites

GridPP3
Clearly a major role to play in CMS computing - at many levels
Roll on LHC startup!