Steve Lloyd The Data Deluge and the Grid Slide 1
The Data Deluge and the Grid
• The Data Deluge
• The Large Hadron Collider
• The LHC Data Challenge
• The Grid
• Grid Applications
• GridPP
• Conclusion
Steve Lloyd
Queen Mary University of London
s.l.lloyd@qmul.ac.uk
Steve Lloyd The Data Deluge and the Grid Slide 2
The Data Deluge
Expect massive increases in the amount of data being collected in several diverse fields over the next few years:
– Astronomy - massive sky surveys
– Biology - genome databases etc.
– Earth observation
– Digitisation of paper, film and tape records to create digital libraries, museums . . .
– Particle Physics - the Large Hadron Collider
– . . .
1 PByte ~ 1,000 TBytes ~ 1M GBytes ~ 1.4M CDs [Petabyte, Terabyte, Gigabyte]
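The unit conversions above are easy to check. A quick sketch in Python, assuming the ~1.4M CDs figure is based on a nominal ~700 MB per CD (that capacity is my assumption, not stated on the slide):

```python
# Back-of-the-envelope check of the storage units quoted above.
PB = 1000**5          # bytes in a petabyte (decimal units)
TB = 1000**4
GB = 1000**3
CD = 700 * 1000**2    # assumed ~700 MB per CD

print(PB // TB)        # terabytes per petabyte -> 1000
print(PB // GB)        # gigabytes per petabyte -> 1000000
print(round(PB / CD))  # CDs per petabyte -> ~1.4 million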
Steve Lloyd The Data Deluge and the Grid Slide 3
Digital Sky Project
Federating new astronomical surveys:
– ~ 40,000 square degrees
– ~ 1/2 trillion pixels (1 arcsecond)
– ~ 1 TB x multiple wavelengths
– > 1 billion sources
Integrated catalogue and image database:
– Digital Palomar Observatory Sky Survey
– 2 Micron All Sky Survey (2MASS)
– NRAO VLA Sky Survey
– VLA FIRST Radio Survey
Later:– ROSAT– IRAS– Westerbork 327 MHz Survey
Steve Lloyd The Data Deluge and the Grid Slide 4
Sloan Digital Sky Survey
• ~ 1 million spectra
• Positions and images of 100 million objects
• 5 wavelength bands
• ~ 40 TB
Survey 10,000 square degrees of Northern Sky over 5 years
Steve Lloyd The Data Deluge and the Grid Slide 5
VISTA
Visible and Infrared Survey Telescope for Astronomy
Steve Lloyd The Data Deluge and the Grid Slide 6
Virtual Observatories
[Images: the Crab Nebula seen in optical, radio, infra-red and X-ray; the jet in M87 seen by HST (optical), Gemini (mid-IR), the VLA (radio) and Chandra (X-ray)]
Steve Lloyd The Data Deluge and the Grid Slide 7
NASA’s Earth Observing System
1 TB/day
[Image: Galapagos oil spill]
Steve Lloyd The Data Deluge and the Grid Slide 8
ESA EO Facilities
[Diagram: ESA Earth observation ground stations - ESRIN, Matera (I), Neustrelitz (D), Kiruna/Esrange (S), Maspalomas (E), Tromso (N) - receiving data from SeaWiFS, SPOT, IRS-P3, Landsat 7, Terra/MODIS and AVHRR, feeding standard production chains and historical archives that deliver products to users]
GOME analysis detected ozone thinning over Europe 31 Jan 2002
Steve Lloyd The Data Deluge and the Grid Slide 9
Species 2000
To enumerate all ~1.7 million known species of plants, animals, fungi and microbes on Earth
A federation of initially 18 taxonomic databases - eventually ~ 200 databases
Steve Lloyd The Data Deluge and the Grid Slide 10
Genomics
Steve Lloyd The Data Deluge and the Grid Slide 11
The LHC
The Large Hadron Collider (LHC) will be a 14 TeV centre-of-mass proton-proton collider operating in the existing 26.7 km LEP tunnel at CERN. Due to start operation in 2006 or later.
– 1,232 superconducting main dipoles of 8.3 Tesla
– 788 quadrupoles
– 2,835 bunches of 10^11 protons per bunch, spaced by 25 ns
Steve Lloyd The Data Deluge and the Grid Slide 12
Particle Physics Questions
• Need to discover (confirm) the Higgs particle
– Study its properties
– Prove that Higgs couplings depend on masses
• Other unanswered questions:
– Does Supersymmetry exist?
– How are quarks and leptons related?
– Why are there 3 sets of quarks and leptons?
– What about gravity?
– Anything unexpected?
Steve Lloyd The Data Deluge and the Grid Slide 13
The LHC
Steve Lloyd The Data Deluge and the Grid Slide 14
The LEP/LHC Tunnel
Steve Lloyd The Data Deluge and the Grid Slide 15
LHC Experiments
The LHC will house 4 experiments:
– ATLAS and CMS are large 'General Purpose' detectors designed to detect everything and anything
– LHCb is a specialised experiment designed to study CP violation in the b quark system
– ALICE is a dedicated heavy ion physics detector
Steve Lloyd The Data Deluge and the Grid Slide 16
Schematic View of the LHC
Steve Lloyd The Data Deluge and the Grid Slide 17
The ATLAS Experiment
ATLAS consists of:
– An inner tracker to measure the momentum of each charged particle
– A calorimeter to measure the energies carried by the particles
– A muon spectrometer to identify and measure muons
– A huge magnet system for bending charged particles for momentum measurement
A total of > 10^8 electronic channels
Steve Lloyd The Data Deluge and the Grid Slide 18
The ATLAS Detector
Steve Lloyd The Data Deluge and the Grid Slide 19
Simulated ATLAS Higgs Event
Steve Lloyd The Data Deluge and the Grid Slide 20
LHC Event Rates
• The LHC proton bunches collide every 25 ns and each bunch crossing yields ~20 proton-proton interactions superimposed in the detector, i.e.
– 40 MHz x 20 = 8x10^8 pp interactions/sec
• The (110 GeV) Higgs cross section is 24.2 pb.
• A good channel is H → γγ, with a branching ratio of 0.19% and a detector acceptance of ~50%
– At full (10^34 cm^-2 s^-1) LHC luminosity this gives
10^34 x 24.2x10^-12 x 10^-24 x 0.0019 x 0.5 ≈ 2x10^-4 Higgs per second
A 2x10^-4 needle in an 8x10^8 haystack
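The needle-in-a-haystack estimate is a one-line rate calculation; a minimal sketch using the numbers quoted on the slide:

```python
# Higgs production rate at full LHC luminosity, from the slide's figures.
luminosity = 1e34            # cm^-2 s^-1, full LHC luminosity
sigma_pb   = 24.2            # (110 GeV) Higgs cross section in picobarns
pb_to_cm2  = 1e-12 * 1e-24   # 1 pb = 1e-12 barn, 1 barn = 1e-24 cm^2
branching  = 0.0019          # branching ratio, 0.19%
acceptance = 0.5             # detector acceptance, ~50%

higgs_per_sec = luminosity * sigma_pb * pb_to_cm2 * branching * acceptance
interactions_per_sec = 40e6 * 20   # 40 MHz crossings x ~20 pp interactions

print(higgs_per_sec)         # ~2.3e-4 Higgs per second
print(interactions_per_sec)  # 8e8 interactions per second
```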
Steve Lloyd The Data Deluge and the Grid Slide 21
'Online' Data Reduction
Selecting interesting events based on progressively more detector information:
Collision rate: 40 MHz (~40 TB/sec)
Level 1 - special hardware trigger: 10^4 - 10^5 Hz (10-100 GB/sec)
Level 2 - embedded processor trigger: 10^2 - 10^3 Hz (1-10 GB/sec)
Level 3 - processor farm: 10 - 100 Hz (100-200 MB/sec)
Raw data storage → offline data reconstruction
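The factor-of-a-million reduction through the trigger chain can be sketched directly from the rates above, assuming a nominal ~1 MB event size (an assumption consistent with the 40 MHz → 40 TB/s figure, not stated explicitly):

```python
# Event-rate and data-rate reduction through the online trigger chain.
MB = 1e6  # assumed event size in bytes

stages = [
    ("Collision rate",     40e6),  # 40 MHz
    ("Level 1 (hardware)", 1e5),   # upper end of 10^4 - 10^5 Hz
    ("Level 2 (embedded)", 1e3),   # upper end of 10^2 - 10^3 Hz
    ("Level 3 (farm)",     100),   # 10 - 100 Hz to storage
]

for name, rate in stages:
    # Data rate follows from event rate x event size.
    print(f"{name}: {rate:.0e} Hz ~ {rate * MB / 1e9:.1f} GB/s")
```

Each level rejects roughly 99% or more of what the previous one passed, which is why only 100-200 MB/s ever reaches tape.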
Steve Lloyd The Data Deluge and the Grid Slide 22
Offline Analysis
Raw data from detector: 1-2 MB/event @ 100-400 Hz
→ Data reconstruction (digits to energy/momentum etc.)
Event Summary Data: 0.5 MB/event
→ Analysis event selection
Analysis Object Data: 10 kB/event
→ Physics analysis
Total data per year from one experiment: 1 to 8 PBytes (10^15 Bytes)
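The 1-8 PB/year range follows from the raw-data figures above, assuming ~10^7 seconds of data-taking per year (a standard rough figure for an accelerator year, not stated on the slide):

```python
# Yearly raw-data volume from the quoted event rates and sizes.
seconds_per_year = 1e7                 # assumed live time per year

low  = 100 * 1e6 * seconds_per_year    # 100 Hz x 1 MB/event
high = 400 * 2e6 * seconds_per_year    # 400 Hz x 2 MB/event

print(low / 1e15, "to", high / 1e15, "PB/year")  # 1.0 to 8.0 PB/year
```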
Steve Lloyd The Data Deluge and the Grid Slide 23
Computing Resources Required
CPU power (reconstruction, simulation, user analysis etc.):
– 2 million SpecInt95
– (a 1 GHz PC is rated at ~40 SpecInt95)
– i.e. 50,000 of today's PCs
'Tape' storage: 20,000 TB
Disk storage: 2,500 TB
Analysis carried out throughout the world by hundreds of physicists
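The 50,000-PC figure follows directly from the two numbers quoted above:

```python
# Sizing the farm: total required CPU power divided by per-PC rating.
required_si95 = 2_000_000   # total CPU power needed, in SpecInt95
pc_si95       = 40          # a ~1 GHz PC rated at ~40 SpecInt95

print(required_si95 // pc_si95)   # -> 50000 PCs
```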
Steve Lloyd The Data Deluge and the Grid Slide 24
Worldwide Collaboration
CMS: 1800 physicists, 150 institutes, 32 countries
Steve Lloyd The Data Deluge and the Grid Slide 25
Solutions
• Centralised solution:
– Put all resources at CERN
– Funding agencies certainly won't place all their investment at CERN
– Sociological problems
• Distributed solution:
– Exploit established computing expertise & infrastructure in national labs and universities
– Reduce dependence on links to CERN
– Tap additional funding sources (spin-off)
Is the Grid the solution?
Steve Lloyd The Data Deluge and the Grid Slide 26
What is The Grid?
Analogy with the electricity power grid:
– Unlimited ubiquitous distributed computing
– Transparent access to multi-petabyte distributed databases
– Easy to plug in
– Complexity of infrastructure hidden
Steve Lloyd The Data Deluge and the Grid Slide 27
The Grid
Ian Foster and Carl Kesselman, editors, “The Grid: Blueprint for a New Computing Infrastructure,” Morgan Kaufmann, 1999, http://www.mkp.com/grids
Five emerging models:
• Distributed Computing - synchronous processing
• High-Throughput Computing - asynchronous processing
• On-Demand Computing - dynamic resources
• Data-Intensive Computing - databases
• Collaborative Computing - scientists
Steve Lloyd The Data Deluge and the Grid Slide 28
The Grid
Ian Foster / Carl Kesselman: "A computational Grid is a hardware and software infrastructure that provides dependable, consistent, pervasive and inexpensive access to high-end computational capabilities."
Steve Lloyd The Data Deluge and the Grid Slide 29
The Grid
• Dependable - Need to rely on remote equipment as much as the machine on your desk
• Consistent - Machines need to communicate, so they need consistent environments and interfaces
• Pervasive - The more resources that participate in the same system the more useful they all are
• Inexpensive - Important for pervasiveness - i.e. built using commodity PCs and disks
Steve Lloyd The Data Deluge and the Grid Slide 30
The Grid
• You simply submit your job to the 'Grid' - you shouldn't have to know where the data you want is or where the job will run. The Grid software (middleware) will take care of:
– running the job where the data is, or
– moving the data to where there is CPU power available
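That scheduling decision can be sketched as a toy function. This is a hypothetical illustration of the idea, not actual Grid middleware; the site names and the simple "most free CPUs" policy are my assumptions:

```python
# Toy sketch of the middleware's placement decision: run the job where
# the data already is, otherwise move the data to where there is CPU.
def place_job(dataset, sites):
    """Return (site, action) for a job that needs `dataset`.

    sites: dict of name -> {"datasets": set, "free_cpus": int}
    """
    # Prefer a site that already holds the data and has spare CPU.
    for name, s in sites.items():
        if dataset in s["datasets"] and s["free_cpus"] > 0:
            return name, "run where the data is"
    # Otherwise pick the site with the most free CPUs and copy the data.
    name = max(sites, key=lambda n: sites[n]["free_cpus"])
    return name, "move the data to the CPU"

# Hypothetical sites: CERN holds the data but is fully loaded.
sites = {
    "CERN": {"datasets": {"raw-2007"}, "free_cpus": 0},
    "RAL":  {"datasets": set(),        "free_cpus": 120},
}
print(place_job("raw-2007", sites))  # ('RAL', 'move the data to the CPU')
```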
Steve Lloyd The Data Deluge and the Grid Slide 31
The Grid for the Scientist
[Cartoon: the scientist thinks "E = mc^2" while the Grid middleware hides the "@#%&*!"]
“Putting the bottleneck back in the Scientist’s mind”
Steve Lloyd The Data Deluge and the Grid Slide 32
Grid Tiers
• For the LHC we envisage a 'hierarchical' structure based on several 'Tiers', since the data mostly originates at one place:
– Tier-0 - CERN - the source of the data
– Tier-1 - ~10 major regional centres (inc. UK)
– Tier-2 - smaller, more specialised regional centres (4 in UK?)
– Tier-3 - university groups
– Tier-4 - my laptop? Mobile phone?
• Doesn't need to be hierarchical - e.g. for biologists probably not desirable
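The tier structure above is a tree rooted at the data source; a minimal sketch as a parent-pointer map (the labels other than CERN are hypothetical placeholders):

```python
# The hierarchical tier model as a parent-pointer tree.
parent = {
    "Tier-1 (UK)":         "Tier-0 (CERN)",      # ~10 major regional centres
    "Tier-2 (regional)":   "Tier-1 (UK)",        # smaller, specialised centres
    "Tier-3 (university)": "Tier-2 (regional)",
    "Tier-4 (laptop)":     "Tier-3 (university)",
}

def path_to_source(tier):
    """Follow the hierarchy up to the source of the data (Tier-0)."""
    chain = [tier]
    while chain[-1] in parent:
        chain.append(parent[chain[-1]])
    return chain

print(path_to_source("Tier-4 (laptop)"))
# ['Tier-4 (laptop)', 'Tier-3 (university)', 'Tier-2 (regional)',
#  'Tier-1 (UK)', 'Tier-0 (CERN)']
```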
Steve Lloyd The Data Deluge and the Grid Slide 33
Grid Services
Applications: Chemistry, Biology, Cosmology, Particle Physics, Environment

Application Toolkits: Distributed Computing Toolkit, Data-Intensive Applications Toolkit, Collaborative Applications Toolkit, Remote Visualization Applications Toolkit, Problem Solving Applications Toolkit, Remote Instrumentation Applications Toolkit

Grid Services (Middleware): resource-independent and application-independent services, e.g. authentication, authorization, resource location, resource allocation, events, accounting, remote data access, information, policy, fault detection

Grid Fabric (Resources): resource-specific implementations of basic services, e.g. transport protocols, name servers, differentiated services, CPU schedulers, public key infrastructure, site accounting, directory service, OS bypass
Steve Lloyd The Data Deluge and the Grid Slide 34
Problems
• Scalability
– Will it scale to thousands of processors, thousands of disks, petabytes of data, terabits/sec of I/O?
• Wide-area distribution
– How to distribute, replicate, cache, synchronise and catalogue the data?
– How to balance local ownership of resources with the requirements of the whole?
• Adaptability/flexibility
– Need to adapt to rapidly changing hardware and costs, new analysis methods etc.
Steve Lloyd The Data Deluge and the Grid Slide 35
SETI@home
• A distributed computing project - not really a Grid project
• You pull the data from them, rather than them submitting the job to you:
– total of 3,864,230 users
– 564,194,228 results received
– 1,063,104 years of CPU time
– 1.8x10^21 floating point operations
– 77 different CPU types
– ~100 different OS
[Image: the Arecibo telescope in Puerto Rico]
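The pull model described above can be sketched in a few lines: each client fetches a work unit, crunches it locally, and sends back a result. This is a hypothetical illustration of the pattern, not SETI@home's actual protocol, and the names are invented:

```python
# Sketch of the SETI@home-style pull model: clients fetch work units
# from a server queue, process them locally, and return results.
import queue

work_units = queue.Queue()
for i in range(3):
    work_units.put(f"sky-chunk-{i}")   # hypothetical work-unit names

def analyse(unit):
    # Stand-in for the real signal-processing step.
    return f"result({unit})"

results = []
while not work_units.empty():
    unit = work_units.get()            # client pulls data from the server
    results.append(analyse(unit))      # ...and crunches it locally

print(results)
# ['result(sky-chunk-0)', 'result(sky-chunk-1)', 'result(sky-chunk-2)']
```

A Grid inverts this: jobs are pushed to where data and CPU live, instead of data being pulled to volunteer machines.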
Steve Lloyd The Data Deluge and the Grid Slide 36
SETI@home
Steve Lloyd The Data Deluge and the Grid Slide 37
Entropia
Uses idle cycles on home PCs for profit and non-profit projects:
Mersenne Prime Search:
• 146,622 machines
• 784,360,165 CPU hours
FightAIDS@Home:
• 13,944 machines
• 1,652,126 CPU hours
Steve Lloyd The Data Deluge and the Grid Slide 38
NASA Information Power Grid
• Knit together widely distributed computing, data, instrumentation and human resources to address complex large-scale computing and data analysis problems
Steve Lloyd The Data Deluge and the Grid Slide 39
Collaborative Engineering
[Diagram: real-time collection from the Unitary Plan Wind Tunnel, feeding multi-source data analysis and archival storage]
Steve Lloyd The Data Deluge and the Grid Slide 40
Other Grid Applications
• Distributed supercomputing
– Simultaneous execution across multiple supercomputers
• Smart instruments
– Enhance the power of scientific instruments by providing access to data archives, online processing capabilities and visualisation, e.g. coupling Argonne's Advanced Photon Source to a supercomputer
Steve Lloyd The Data Deluge and the Grid Slide 41
GridPP
http://www.gridpp.ac.uk
Steve Lloyd The Data Deluge and the Grid Slide 42
GridPP Overview
• Provide architecture and middleware
• Use the Grid with simulation data
• Use the Grid with real data
• Future LHC experiments
• Running US experiments
• Build prototype Tier-1 and Tier-2s in the UK and implement middleware in experiments
Steve Lloyd The Data Deluge and the Grid Slide 43
The Prototype UK Tier-1
• Jan 2002 - central facilities used by all experiments:
– 250 CPUs (450 MHz - 1 GHz)
– 10 TB disk
– 35 TB tape in use (theoretical tape capacity 330 TB)
• March 2002 - extra resources for LHC and BaBar:
– 312 CPUs
– 40 TB disk
– extra 36 TB of tape and three new drives
Steve Lloyd The Data Deluge and the Grid Slide 44
Conclusions
• Enormous data challenges in the next few years.
• The Grid is the likely solution.
• The Web gives ubiquitous access to distributed information.
• The Grid will give ubiquitous access to computing resources and hence knowledge.
• Many Grid projects and testbeds are starting to take off.
• GridPP is building a UK Grid for particle physicists to prepare for future LHC data.