TRANSCRIPT
Steve Lloyd Culham - July 2006 Slide 1
The Data Deluge and the Grid
Steve Lloyd
Queen Mary, University of London
Culham, July 2006
Steve Lloyd Culham - July 2006 Slide 2
Outline
• What is Data?
• Where it comes from – e-Science
• The CERN LHC and Experiments
• What is the Grid?
• GridPP
• Challenges ahead
Steve Lloyd Culham - July 2006 Slide 3
What is Data?
Anything that can be expressed as numbers
Raw Information → Numbers → Binary Digits
Pictures – store the amount of Red, Green and Blue
Sound – store the loudness at each time
Electrical Signals – store the voltage or current
Text – every character has a numerical code
Lots of Pictures + Sound = DVD Video
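The mappings above can be sketched in a few lines of Python (the character codes are standard Unicode/ASCII; the pixel and sound values are made-up illustrations):

```python
# Every kind of raw information becomes numbers, then binary digits.

# Text: every character has a numerical code
codes = [ord(c) for c in "Grid"]           # [71, 114, 105, 100]

# Pictures: store the amount of Red, Green and Blue per pixel (0-255 each)
pixel = (255, 200, 0)                      # an orange-ish pixel

# Sound: store the loudness at each time step (samples)
samples = [0, 12, 25, 12, 0, -12, -25]     # a crude wave

# All of it ends up as binary digits
print([format(n, "08b") for n in codes])
```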
Steve Lloyd Culham - July 2006 Slide 4
Digital Data
Numbers are stored as binary digits:
1 Bit = 0 or 1 – can store yes/no or on/off
1 Byte = 8 bits – can store numbers from 0 to 255 (enough for a character a-z, A-Z, *£$@<... )
25 = 0×128 + 0×64 + 0×32 + 1×16 + 1×8 + 0×4 + 0×2 + 1×1 = 00011001
1 kiloByte = ~1,000 Bytes (a typical Word document ~30 kB)
1 MegaByte = ~1,000,000 Bytes (a floppy disk ~1.4 MB; a CD ~700 MB)
1 GigaByte = ~1,000,000,000 Bytes (a typical PC hard drive 120-400 GB)
1 TeraByte = ~1,000,000,000,000 Bytes (world annual book production)
1 PetaByte = ~1,000,000,000,000,000 Bytes (~1.4 million CDs)
1 ExaByte = ~1,000,000,000,000,000,000 Bytes (world annual information production)
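The arithmetic on this slide can be checked directly; a quick Python sketch:

```python
# 1 byte = 8 bits can hold 2**8 = 256 values, i.e. 0 to 255
assert 2 ** 8 == 256

# 25 in binary, using the place values 128, 64, 32, 16, 8, 4, 2, 1
bits = [0, 0, 0, 1, 1, 0, 0, 1]
value = sum(b * p for b, p in zip(bits, [128, 64, 32, 16, 8, 4, 2, 1]))
print(value, format(value, "08b"))   # 25 00011001

# ~1.4 million CDs at 700 MB each is about a PetaByte
print(1_400_000 * 700e6 / 1e15)      # just under 1 PB
```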
Steve Lloyd Culham - July 2006 Slide 5
Data Analysis
What is done with data?
• Nothing
• Read it / Listen to it / Watch it
• Analyse it – run a computer program, a "Job", e.g.:
  Read A    (23)
  Read B    (5)
  C = A + B
  Print C
• Calculate how proteins fold
• Calculate what the weather is going to do
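The toy "Job" above is just a three-line program; in Python:

```python
# A "Job" in the slide's sense: read inputs, compute, print the result
def job(a: int, b: int) -> int:
    c = a + b        # C = A + B
    return c

print(job(23, 5))    # prints 28
```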
Steve Lloyd Culham - July 2006 Slide 6
e-Science
In the UK this sort of activity has become known as "e-Science"
"e-Science will change the dynamic of the way Science is undertaken"
"Science increasingly done through distributed global collaborations enabled by the internet using very large data collections, terascale computing resources and high performance visualisation"
Dr John Taylor - Director General of Research Councils:
"e-Science is about global collaboration in key areas of science, and the next generation of infrastructure that will enable it"
Steve Lloyd Culham - July 2006 Slide 7
Astronomy
Crab Nebula: Optical, Radio, Infra-red, X-ray
Jet in M87: HST optical, Gemini mid-IR, VLA radio, Chandra X-ray
Virtual Observatories
Steve Lloyd Culham - July 2006 Slide 8
Earth Observation
1 TB/day
Ozone map
Ottawa
The Institute of Physics (Google Earth)
Steve Lloyd Culham - July 2006 Slide 9
Bioinformatics
Steve Lloyd Culham - July 2006 Slide 10
Healthcare
Dynamic Brain Atlas
Breast Screening
Scanning
Remote Consultancy
Steve Lloyd Culham - July 2006 Slide 11
Collaborative Engineering
Real-time collection
Multi-source Data Analysis
Unitary Plan Wind Tunnel
Archival storage
Steve Lloyd Culham - July 2006 Slide 12
Digital Curation
Digitization of almost anything, to create Digital Libraries and Museums
Steve Lloyd Culham - July 2006 Slide 13
The CERN LHC
4 Large Experiments
The world’s most powerful particle accelerator - 2007
Steve Lloyd Culham - July 2006 Slide 14
The Experiments
ALICE – heavy ion collisions, to create quark-gluon plasmas; 50,000 particles in each collision
LHCb – to study the differences between matter and antimatter; detects over 100 million b and b-bar mesons each year
ATLAS – general purpose; origin of mass; Supersymmetry; 2,000 scientists from 34 countries
CMS – general purpose; 1,800 scientists from over 150 institutes
Steve Lloyd Culham - July 2006 Slide 15
ATLAS Detector
7,000 tonnes; 42m long, 22m wide, 22m high (about the height of a 5 storey building)
2,000 physicists, 150 institutes, 34 countries
Steve Lloyd Culham - July 2006 Slide 16
Steve Lloyd Culham - July 2006 Slide 17
The Higgs
• Primary objective of the LHC
• What is the origin of Mass?
• Is it the Higgs Particle?
Massless particle – travels at the speed of light
Low mass particle – travels slower
High mass particle – travels slower still
Steve Lloyd Culham - July 2006 Slide 18
Starting from this event…
We are looking for this "signature"
Selectivity: 1 in 10^13 – like looking for 1 person in a thousand world populations, or for a needle in 20 million haystacks!
The LHC Data Challenge
• ~100,000,000 electronic channels
• 800,000,000 proton-proton interactions per second
• 0.0002 Higgs per second
• 10 PBytes of data a year (10 Million GBytes = 14 Million CDs)
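The numbers on this slide hang together; a quick Python check (700 MB per CD, as on the earlier slide):

```python
# Selectivity: Higgs rate over total interaction rate
interactions_per_s = 800_000_000
higgs_per_s = 0.0002
selectivity = higgs_per_s / interactions_per_s
print(f"1 in {1/selectivity:.0e}")    # 1 in 4e+12, i.e. of order 10^13

# 10 PB/year expressed in 700 MB CDs
cds = 10e15 / 700e6
print(f"{cds/1e6:.0f} million CDs")   # ~14 million CDs
```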
Steve Lloyd Culham - July 2006 Slide 19
LHC Computing Requirements
CPU power (reconstruction, simulation, user analysis etc.) – 100,000 of today's PCs
'Tape' storage – 20 PetaBytes (= 20 M GBytes)
Disk storage – 2.5 PetaBytes (= 2.5 M GBytes)
Distributed computing solution – "The Grid"
Steve Lloyd Culham - July 2006 Slide 20
Web: Information Sharing
• Invented at CERN by Tim Berners-Lee
• Agreed protocols: HTTP, HTML, URLs
• Anyone can access information and post their own
• Quickly crossed over into public use
[Chart: number of Internet hosts (millions) vs year]
Steve Lloyd Culham - July 2006 Slide 21
Distributed Resource Sharing
@Home Projects
• Use home PCs to run numerous calculations with dozens of variables
• Distributed computing projects, not a Grid
• Some @home projects: SETI@home, BBC Climate Change Experiment, FightAIDS@home
Distributed File Sharing – Peer-to-Peer Networks
• No centralised database of files
• Legal problems with sharing copyrighted material
• Security problems
Steve Lloyd Culham - July 2006 Slide 22
SETI@home
• A distributed computing project - not really a Grid project
• You pull the data from them, rather than them submitting a job to you
Arecibo telescope in Puerto Rico
Users - 5,240,038
Results received – 1,632,106,991
Years of CPU Time – 2,121,057
Extraterrestrials found – 0
Steve Lloyd Culham - July 2006 Slide 23
The Grid
Ian Foster / Carl Kesselman:
"A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities."
'Grid' means different things to different people
All agree it's a funding opportunity!
Steve Lloyd Culham - July 2006 Slide 24
Electricity Grid
Analogy with the Electricity Power Grid
'Standard Interface'
Power Stations
Distribution Infrastructure
Steve Lloyd Culham - July 2006 Slide 25
Computing Grid
Computing and Data Centres
Fibre Optics of the Internet
Steve Lloyd Culham - July 2006 Slide 26
Middleware
[Diagram: on a single PC, programs (Word/Excel, email/web, games, your program) run on an Operating System, which drives the CPU and disks. On the Grid, your program instead runs on MIDDLEWARE – Resource Broker, Information Service, Replica Catalogue, Bookkeeping Service – which drives CPU clusters, disk servers and the User Interface machine.]
Middleware is the Operating System of a distributed computing system
Steve Lloyd Culham - July 2006 Slide 27
The Grid VisionFrom this:
To this:
Steve Lloyd Culham - July 2006 Slide 28
Who are GridPP?
19 UK Universities, CERN and CCLRC (RAL & Daresbury), funded by PPARC:
GridPP1 2001-2004 "From Web to Grid"
GridPP2 2004-2007 "From Prototype to Production"
GridPP3 2007-2011 (proposed) "From Production to Exploitation"
Developed a working, highly functional Grid
Steve Lloyd Culham - July 2006 Slide 29
International Context
LHC Computing Grid (LCG) – Grid deployment project for the LHC
EU Enabling Grids for e-Science (EGEE) 2004-2008 – Grid deployment project for all disciplines
GridPP is part of EGEE and LCG (currently the largest Grid in the world)
UK National Grid Service – the UK's core production computational and data Grid
Open Science Grid (USA) – science applications from HEP to biochemistry
NorduGrid (Scandinavia) – Grid research and development collaboration
Steve Lloyd Culham - July 2006 Slide 30
GridPP Middleware Development
Workload Management
Storage Interfaces
Network Monitoring
SecurityInformation Services
Grid Data Management
Steve Lloyd Culham - July 2006 Slide 31
What you need to use the Grid
1. Get a digital certificate (UK Certificate Authority) – authentication: who you are
2. Join a Virtual Organisation (VO) – authorisation: what you are allowed to do
3. Get access to a local User Interface Machine (UI) and copy your files and certificate there
4. Write some Job Description Language (JDL) and scripts to wrap your programs
############# HelloWorld.jdl #################
Executable    = "/bin/echo";
Arguments     = "Hello welcome to the Grid";
StdOutput     = "hello.out";
StdError      = "hello.err";
OutputSandbox = {"hello.out", "hello.err"};
##############################################
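A job file like this can also be generated from a script, which is handy when submitting many similar jobs. A minimal Python sketch (the helper name is hypothetical; the file name and greeting follow the example above):

```python
# Write a HelloWorld-style JDL file from a template
def write_jdl(path: str, message: str) -> None:
    jdl = (
        'Executable    = "/bin/echo";\n'
        f'Arguments     = "{message}";\n'
        'StdOutput     = "hello.out";\n'
        'StdError      = "hello.err";\n'
        'OutputSandbox = {"hello.out", "hello.err"};\n'
    )
    with open(path, "w") as f:
        f.write(jdl)

write_jdl("HelloWorld.jdl", "Hello welcome to the Grid")
```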
Steve Lloyd Culham - July 2006 Slide 32
How it works
[Diagram: from the UI Machine, the Job Description, Proxy Certificate and Input Sandbox (the script you want to run, plus other files: job options, source...) go to the Resource Broker, which dispatches the Job to a Compute Element on the Grid. The job reads its Input Data from a Storage Element and writes its Output Data back; the Output Sandbox (output files: plots, logs...) is returned to the user.]
Steve Lloyd Culham - July 2006 Slide 33
Tier Structure
Tier 0: CERN computer centre (online system, offline farm)
Tier 1: national centres – RAL (UK), France, Italy, Germany, USA
Tier 2: regional groups – ScotGrid, NorthGrid, SouthGrid, London
Institutes: e.g. Glasgow, Edinburgh, Durham
Workstations
A useful model for Particle Physics but not necessary for others
Steve Lloyd Culham - July 2006 Slide 34
UK Tier-1 Centre at RAL
• High quality data services
• National and international role
• UK focus for international Grid development
• 1,000 dual-CPU machines
• 200 TB disk
• 220 TB tape (capacity 1 PB)
• Grid Operations Centre
Steve Lloyd Culham - July 2006 Slide 35
UK Tier-2 Centres
ScotGrid: Durham, Edinburgh, Glasgow
NorthGrid: Daresbury, Lancaster, Liverpool, Manchester, Sheffield
SouthGrid: Birmingham, Bristol, Cambridge, Oxford, RAL PPD
London: Brunel, Imperial, QMUL, RHUL, UCL
Mostly funded by HEFCE
Steve Lloyd Culham - July 2006 Slide 36
The Grid at Queen Mary
The Queen Mary e-Science High Throughput Cluster:
174 PCs (348 CPUs), 40 TByte disk storage
Part of the London Tier-2 Centre
Currently adding another 285 PCs
Steve Lloyd Culham - July 2006 Slide 37
The LCG Grid Status
Worldwide: 182 sites, 23,438 CPUs, 9.2 PB disk, 2,200 years of CPU time
UK: 21 sites, 4,482 CPUs, 180 TB disk, 593 years of CPU time
Steve Lloyd Culham - July 2006 Slide 38
"GridPP has been developed to help answer questions about the conditions in the Universe just after the Big Bang," said Professor Keith Mason, head of the Particle Physics and Astronomy Research Council (PPARC)."But the same resources and techniques can be exploited by other sciences with a more direct benefit to society."
Steve Lloyd Culham - July 2006 Slide 39
Future Challenges
[Diagram: a CD stack with 1 year of LHC data (~20 km) vs the (ex-)Concorde (15 km); we are here (4 km)]
• Scaling to full size: ~20,000 → 100,000 CPUs
• Stability, robustness etc.
• Security (a hackers' paradise!)
• Sharing resources (in an RAE environment!)
• International collaboration
• Increased industrial take-up
• Spread beyond science
• Continued funding beyond the start of the LHC!
Steve Lloyd Culham - July 2006 Slide 40
Further Info …
http://www.gridpp.ac.uk
RSS News feed