CERN-IT-DB
Exabyte-Scale Data Management Using an Object-Relational
Database: The LHC Project at CERN
Jamie Shiers, CERN, Switzerland (http://cern.ch/db/)
EB Scale DBs
Overview
Brief introduction to CERN & LHC
Why we have massive data volumes
The role of Object-Relational DBs
A Possible Solution…
CERN - The European Organisation for Nuclear Research
The European Laboratory for Particle Physics
Fundamental research in particle physics
Designs, builds & operates large accelerators
Financed by 20 European countries (member states) + others (US, Canada, Russia, India, ….)
~€650M budget - operation + new accelerators
2000 staff + 6000 users (researchers) from all over the world
LHC (starts ~2005) experiment: 2000 physicists, 150 universities, apparatus costing ~€300M; computing ~€250M to set up, ~€60M/year to run
10+ year lifetime
The LHC machine
Two counter-circulating proton beams
Collision energy: 7 + 7 TeV
27 km of magnets with a field of 8.4 Tesla
Superfluid helium cooled to 1.9 K
The world’s largest superconducting structure
Online system: multi-level trigger
Filters out background; reduces data volume from 40 TB/s to 100 MB/s
Level 1 - special hardware: 40 MHz (40 TB/s)
Level 2 - embedded processors: 75 kHz (75 GB/s)
Level 3 - PCs: 5 kHz (5 GB/s)
Output: 100 Hz (100 MB/s) to data recording & offline analysis
(1000 TB/s according to recent estimates)
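The reduction factors implied by the trigger chain above can be checked with a few lines of arithmetic (rates taken from the slide; this is only a consistency check, not part of the trigger software):

```python
# Back-of-the-envelope check of the trigger-chain reduction
# factors quoted on the slide (rates in bytes/second).
levels = {
    "detector (40 MHz)": 40e12,   # 40 TB/s
    "level 1 (75 kHz)":  75e9,    # 75 GB/s
    "level 2/3 (5 kHz)": 5e9,     # 5 GB/s
    "recorded (100 Hz)": 100e6,   # 100 MB/s
}
rates = list(levels.values())
for (name, rate), prev in zip(list(levels.items())[1:], rates):
    print(f"{name}: keeps 1/{prev / rate:,.0f} of the previous stage")
overall = rates[0] / rates[-1]
print(f"overall reduction: {overall:,.0f}x")  # 400,000x
```

Each stage discards orders of magnitude more data than it keeps; the overall factor from detector to tape is 400,000.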
Higgs Search
H → ZZ
• Start with protons (quarks + gluons)
• Accelerate & collide
• Observe in massive detectors
LHC Data Challenges
4 large experiments, 10-15 year lifetime
Data rates: ~500 MB/s – 1.5 GB/s
Data volumes: ~5 PB / experiment / year
Several hundred PB total!
Data reduced from “raw data” to “analysis data” in a small number of well-defined steps
Analysed by thousands of users world-wide
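The per-experiment and total volumes quoted above follow from the data rate once you assume an accelerator live time; the ~10^7 live seconds per year used below is the usual HEP rule of thumb, not a number from the slide:

```python
# Rough consistency check of the slide's volumes from its rates.
LIVE_SECONDS_PER_YEAR = 1e7      # assumed duty cycle (~1/3 of a calendar year)
rate_bytes_per_s = 500e6         # ~500 MB/s per experiment (low end of range)
volume_pb = rate_bytes_per_s * LIVE_SECONDS_PER_YEAR / 1e15
print(f"~{volume_pb:.0f} PB / experiment / year")   # ~5 PB, as on the slide
total_pb = volume_pb * 4 * 15    # 4 experiments over a 15-year lifetime
print(f"total: ~{total_pb:.0f} PB")                 # several hundred PB
```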
[Chart: Estimated DISK capacity at CERN, 1998-2010, 0-7000 TB; LHC vs other experiments]
[Chart: Estimated mass storage at CERN, 1998-2010, 0-140 PB; LHC vs other experiments]
[Chart: Estimated CPU capacity at CERN, 1998-2010, 0-6000 kSI95; LHC vs other experiments, with a Moore's-law line]
Planned capacity evolution at CERN: mass storage, disk, CPU
[Diagram: Data Handling and Computation for Physics Analysis ([email protected], CERN): detector → event filter (selection & reconstruction) → raw data → event summary data → event reprocessing / event simulation → processed data → batch & interactive physics analysis → analysis objects (extracted by physics topic)]
LHC Data Models
LHC data models are complex!
Typically hundreds (500-1000) of structure types (classes)
Many relations between them
Different access patterns
LHC experiments rely on OO technology
OO applications deal with networks of objects
Pointers (or references) are used to describe relations
[Diagram: Event → TrackList → Track(s) → HitList → Hit(s); Event also references Tracker and Calorimeter]
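The "network of objects" pattern in the diagram can be sketched with a few illustrative classes; the class names follow the diagram, while the attributes are assumptions added purely for illustration:

```python
from dataclasses import dataclass, field

# Minimal sketch of the event object network from the diagram.
# Objects refer to each other by plain references -- exactly the
# relation structure an object-relational DB must map to tables and keys.

@dataclass
class Hit:
    x: float  # illustrative coordinates, not the real schema
    y: float
    z: float

@dataclass
class Track:
    hits: list[Hit] = field(default_factory=list)      # Track -> HitList -> Hit

@dataclass
class Event:
    tracks: list[Track] = field(default_factory=list)  # Event -> TrackList -> Track

event = Event(tracks=[Track(hits=[Hit(0.0, 0.1, 0.2)])])
print(len(event.tracks), len(event.tracks[0].hits))
```

A real experiment has hundreds of such classes, which is why navigating the object graph efficiently drives the database design.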
CMS: 1800 physicists, 150 institutes, 32 countries
World-wide collaboration: distributed computing & storage capacity
The LHC Computing Centre
[Diagram: multi-tier computing model: CERN Tier 0/1 ↔ Tier 1 regional centres (France, Germany, Italy, UK, USA, ….) ↔ Tier 2 (Lab a, Uni a, Lab b, Uni b, ….) ↔ Tier 3 physics departments ↔ desktops; organised by physics group and regional group]
Why use DBs?
OK, you have lots of data, but what have databases, let alone object-relational DBs, got to do with it?
Why Not: file = object + GREP?
It works if you have thousands of objects (and you know them all)
But hard to search millions/billions/trillions with GREP
Hard to put all attributes in the file name; minimal metadata
Hard to do chunking right
Hard to pivot on space/time/version/attributes
(Jim Gray, [email protected], http://research.microsoft.com/~Gray)
The Reality: it’s build vs. buy
If you use a file system, you will eventually build a database system: metadata, query, parallel ops, security, reorganization, recovery, distribution, replication, ….
OK: so I’ll put lots of objects in a file
Do-It-Yourself Database
Good news:
Your implementation will be 10x faster (at least!) and easier to understand and use
Bad news:
It will cost 10x more to build and maintain
Someday you will get bored maintaining/evolving it
It will lack some killer features:
• Parallel search
• Self-describing via metadata
• SQL, XML, …
• Replication
• Online update & reorganization
• Chunking is problematic (what granularity? how to aggregate?)
Top 10 reasons to put everything in a DB
1. Someone else writes the million lines of code
2. Captures data and metadata
3. Standard interfaces give tools and quick learning
4. Allows schema evolution without breaking old apps
5. Index and pivot on multiple attributes: space, time, version, attribute, ….
6. Parallel terabyte searches in seconds or minutes
7. Moves processing & search close to the disk arm (moves fewer bytes: questions go out, data comes back)
8. Chunking is easier (can aggregate chunks at the server)
9. Automatic geo-replication
10. Online update and reorganization
11. Security
12. If you pick the right vendor, ten years from now there will be software that can read the data
How to build multi-PB DBs
Total LHC data volume: ~300 PB
VLDBs today: ~3 TB
Just 5 orders of magnitude to solve… (one per year)
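The "5 orders of magnitude" figure is simply the ratio of the two numbers above, checked here (1 PB = 1000 TB):

```python
import math

# Gap between today's ~3 TB VLDBs and the ~300 PB LHC total,
# expressed in orders of magnitude.
total_tb = 300 * 1000   # ~300 PB
vldb_tb = 3             # ~3 TB today
orders = math.log10(total_tb / vldb_tb)
print(f"{orders:.0f} orders of magnitude")  # 5
```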
Divide & Conquer
Split data from different experiments
Split different data types
Different schemas, users, access patterns, …
Focus on mainstream technologies & low-risk solutions
VLDB target: 100 TB databases
1. How do we build 100 TB databases?
2. How do we use 100 TB databases to solve a 100 PB problem?
Why 100TB DBs?
Possible today
Vendors must provide support
Expected to be mainstream within a few years
Decision Support (2000)

Company               | DB Size* (TB) | DBMS Partner | Server Partner | Storage Partner
SBC                   | 10.50         | NCR          | NCR            | LSI
First Union Nat. Bank | 4.50          | Informix     | IBM            | EMC
Dialog                | 4.25          | Proprietary  | Amdahl         | EMC
Telecom Italia (DWPT) | 3.71          | IBM          | IBM            | Hitachi
FedEx Services        | 3.70          | NCR          | NCR            | EMC
Office Depot          | 3.08          | NCR          | NCR            | EMC
AT&T                  | 2.83          | NCR          | NCR            | LSI
SK C&C                | 2.54          | Oracle       | HP             | EMC
NetZero               | 2.47          | Oracle       | Sun            | EMC
Telecom Italia (DA)   | 2.32          | Informix     | Siemens        | TerraSystems

*Database size = sum of user data + summaries and aggregates + indexes
[Chart: Size of the largest RDBMS in commercial use for DSS (Source: Database Scalability Program 2000): 3 TB in 1996, 50 TB in 2000, 100 TB projected by respondents for 2005]
BT Visit – July 2001
Oracle VLDB site: enormous proof-of-concept test in 1999
80 TB disk, 40 TB mirrored, 37 TB usable
Performed using Oracle 8i, EMC storage
“Single instance” – i.e. not a cluster
Same techniques as being used at CERN
Demonstrated > 2 years ago!
No concerns about building 100 TB today!
Physics DB Deployment
Currently run 1-3 TB / server
Dual-processor Intel/Linux
Scaling to ~10 TB per server within a few years sounds plausible
10-node cluster: 100 TB (~100 disks in 2005!)
Can we achieve close to linear scalability?
Fortunately, our data is write-once, read-many
Should be a good match for shared-disk clusters
100 TB DBs & LHC Data
Analysis data: 100 TB is OK for ~10 years → one DB cluster
Intermediate data: 100 TB ≈ 1 year's data → ~40 DB clusters
RAW data: 100 TB = 1 month's data → 400 DB clusters to handle all RAW data
• 10 / year, 10 years, 4 experiments
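The 400-cluster figure for RAW data follows directly from the bullet's own factors (the slide rounds a year of 100 TB months to ~10 clusters):

```python
# Reconstructing the slide's RAW-data cluster count:
# each 100 TB cluster holds ~1 month of one experiment's RAW data.
clusters_per_experiment_year = 10   # the slide's rounding of months per year
years = 10
experiments = 4
raw_clusters = clusters_per_experiment_year * years * experiments
print(raw_clusters)  # 400, as on the slide
```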
RAW Data
Processed sequentially ~once / year
Need only current + historic window online
Solution: partitioning + offline tablespaces
100 TB = 10 days' data: ample for (re-)processing
Partition the tables
“Old” data: make the tablespace transportable, copy to tape, drop from the catalog
Reload (eventually to a different server) on request
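The partition-to-tape lifecycle above can be sketched as a small state machine. This is only an illustrative Python sketch with made-up names; the real system would use Oracle's transportable tablespaces, not these classes:

```python
# Sketch of the slide's RAW-data lifecycle: keep the current window
# online, move aged-out partitions to tape, reload on request.
# All names here are illustrative assumptions, not Oracle APIs.

class RawDataCatalog:
    def __init__(self, online_window: int):
        self.online_window = online_window   # partitions kept online
        self.online: list[str] = []          # e.g. tablespace names
        self.on_tape: list[str] = []

    def add_partition(self, name: str) -> None:
        """New data arrives; age out the oldest partition if needed."""
        self.online.append(name)
        while len(self.online) > self.online_window:
            oldest = self.online.pop(0)
            self.on_tape.append(oldest)      # "copy to tape, drop from catalog"

    def reload(self, name: str) -> None:
        """Reload archived data, e.g. for the yearly reprocessing pass."""
        self.on_tape.remove(name)
        self.online.append(name)             # possibly on a different server

cat = RawDataCatalog(online_window=2)
for month in ["2007-01", "2007-02", "2007-03"]:
    cat.add_partition(month)
print(cat.online, cat.on_tape)  # ['2007-02', '2007-03'] ['2007-01']
cat.reload("2007-01")
```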
Intermediate Data
~100-500 TB / experiment / year
Yottabyte DBs predicted by 2020! (1 YB = 1,000,000,000,000 TB)
Can DBMS capabilities grow fast enough to permit just 1 server / experiment? (+500 TB / year)
An open question…
DB Deployment
[Diagram: DAQ cluster (current data - no history) exports tablespaces to RAW cluster (to/from MSS, feeds reconstruction); ESD cluster (1/year? 1?); AOD/TAG cluster (1 total?, to/from Regional Centres, serves analysis)]