Pushing the Limits of Database Clusters
Jamie Shiers / CERN
Werner Schueler / Intel
*Other trademarks and brands are the property of their respective owners
Agenda
Trend to Intel Clusters …
Introduction to CERN & Data Volumes
Current 9iRAC / IA32 status
Performance / Scalability / Reliability
Future tests & timeline
Plans for Oracle tests on IA64
Oracle9i RAC Performance
Oracle9i RAC on Itanium
“Scale Up” by Scaling Out
InfoWorld, January 31, 2002
“It will be several years before the big machine dies, but inevitably the big machine will die.” (Larry Ellison)
[Chart: Top 10, clustering for performance. Source: tpc.org]
Proprietary Solutions Lagging
[Chart: Worldwide Operating Environment Installed Base, Server/Host Environments 2000-2005. x-axis: Years (2000 to 2005); y-axis: Units (M), 0 to 16. Series: Windows Servers, Linux Servers, Proprietary UNIX Servers. Source: IDC 8/01]
CERN
Large Hadron Collider
The Large Hadron Collider (LHC)
Inside The 27km Tunnel…
ATLAS Detector System for LHC
Detector is the size of a 6-floor building!
LHC: A Multi-Petabyte Problem!
[Chart: Long Term Tape Storage Estimates. x-axis: Year (1995 to 2006); y-axis: TeraBytes, 0 to 14'000. Series: LEP Experiments, COMPASS, LHC Experiments]
[Diagram: trigger and data-reduction chain]
100 MHz (1000 TB/sec) into level 1 - special hardware
75 kHz (75 GB/sec) into level 2 - embedded processors
5 kHz (5 GB/sec) into level 3 - PCs
100 Hz (100 MB/sec) into DB
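The reduction achieved at each trigger level follows from the rates on the diagram. A minimal sketch in Python, assuming each quoted rate is the input to the level listed after it and that the units are decimal (1 TB = 10**12 bytes); the `stages` list and the function name are illustrative, not from the talk:

```python
# Data rates along the trigger chain, in bytes per second, from the slide.
# Decimal units are an assumption: 1 TB = 10**12 bytes.
stages = [
    ("input to level 1 (special hardware)", 1000 * 10**12),  # 1000 TB/sec
    ("input to level 2 (embedded processors)", 75 * 10**9),  # 75 GB/sec
    ("input to level 3 (PCs)", 5 * 10**9),                   # 5 GB/sec
    ("into the DB", 100 * 10**6),                            # 100 MB/sec
]

def reduction_factors(stages):
    """Return the rate reduction between each pair of consecutive stages."""
    return [prev[1] / cur[1] for prev, cur in zip(stages, stages[1:])]

factors = reduction_factors(stages)
overall = stages[0][1] / stages[-1][1]

for (name, _), factor in zip(stages[1:], factors):
    print(f"{name}: rate reduced by a factor of {factor:,.0f}")
print(f"overall reduction: {overall:,.0f}x")
```

On these figures the chain as a whole cuts the rate by seven orders of magnitude before anything reaches the database, with most of that reduction happening in level 1.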
LHC Data Volumes
Data Category                  Annual       Total
RAW                            1-3PB        10-30PB
Event Summary Data - ESD       100-500TB    1-5PB
Analysis Object Data - AOD     10TB         100TB
TAG                            1TB          10TB
Total per experiment           ~4PB         ~40PB
Grand totals (10 years)        ~40PB        ~160PB
LHC Summary
Multi-national research lab near Geneva
Building new accelerator: Large Hadron Collider
Will generate fantastic amounts of data: 1PB/second!
How can 9iRAC help?
LHC Computing Policy
Commodity solutions wherever possible
Extensive use of Grid technologies
Intel / Linux for processing nodes
– Farms of many K nodes: 200K in today’s terms
– IA32 today, moving to IA64 prior to LHC startup
9iRAC claims to extend commodity solutions to the database market
– Does it live up to the promise?
DB needs: ~100PB total; few GB/s / PB; many thousand concurrent processes; distributed access (world-wide)
History and experience
Oracle Parallel Server since V7
– “Marketing clusters” (source: Larry Ellison, OOW SFO 2001)
OPS in production at CERN since 1996
– Mainly for high-availability
Tests of 9iRAC started Autumn 2001
– Servers: 9 dual Pentium® III Xeon Processor based servers, 512MB
– Storage: single node as above
– Suse 7.2, Oracle 9.0.1
Currently working with 9iR2
– Servers: 10 nodes as above
– Storage: now 3TB via 2 Intel-based disk-servers
CERN Computer Centre Today…
[Image; surviving caption words: “has”, “inside”]
Benefits of 9iRAC
Scalability
– Supports VLDBs using commodity h/w
– Intel/Linux server nodes (target ~100TB / cluster)
Manageability
– Small number of RAC manageable
– Tens / hundreds single instances a nightmare
Better Resource Utilization
– Shared disk architecture avoids hot-spots and idle / overworked nodes
– Shared cache improves performance for frequently accessed read-only data
9iRAC benefits
¥ € $ Cost
– N x dual processors typically much much cheaper than single large multi-processor
¥ € $ Cost
– Fewer DBAs
¥ € $ Cost
– No need to oversize system for peak loads
Tests on Linux
Initial goals:
– Test that it works with commodity H/W + Linux
– Understand the configuration issues
– Check how it scales
  – Number of nodes
  – Network interconnect
  – CPU used for the cache coherency
– Identify bottlenecks
Commodity?
– Server + interconnect ok
– Storage outstanding question !!
Conventional Oracle Cluster
[Diagram: Disks; Database servers; Clients (interactive, batch)]
e.g. Fibre channel based solution
Commodity Storage?
Critical issue for CERN
– Massive amount of data
– Extremely tight budget constraints
Long term (LHC: 2007)
– network attached disks based on iSCSI?
Short/Medium term: cost effective disk servers
– (€7.5K for 1.5TB mirrored at > 60MB/s)
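To put the disk-server figures in context, the sketch below works out the cost per TB and what a 100TB cluster (the per-cluster target mentioned earlier in the talk) would need at those prices. It is only an illustration under stated assumptions: whole servers are bought, capacity and throughput add up linearly across servers, and the constant and function names are made up for the example.

```python
import math

# Figures from the slide: one disk server costs about EUR 7,500 and
# provides 1.5 TB of mirrored storage at more than 60 MB/s.
SERVER_COST_EUR = 7500
SERVER_CAPACITY_TB = 1.5
SERVER_RATE_MB_S = 60  # lower bound quoted on the slide

def disk_servers_needed(target_tb):
    """Number of whole disk servers needed to hold target_tb of mirrored data."""
    return math.ceil(target_tb / SERVER_CAPACITY_TB)

def estimate(target_tb):
    """Return (servers, total cost in EUR, aggregate MB/s lower bound).

    Assumes throughput scales linearly with the number of servers,
    which ignores network and interconnect limits.
    """
    n = disk_servers_needed(target_tb)
    return n, n * SERVER_COST_EUR, n * SERVER_RATE_MB_S

cost_per_tb = SERVER_COST_EUR / SERVER_CAPACITY_TB
servers, cost, rate = estimate(100)
print(f"cost per TB: EUR {cost_per_tb:,.0f}")
print(f"100 TB: {servers} servers, EUR {cost:,}, at least {rate:,} MB/s aggregate")
```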
Commodity Oracle Cluster?
[Diagram: Disks; Database servers; Clients (interactive, batch)]
3 interconnects, e.g. GbitE, possibly different protocols
• General purpose network
• Intra-cluster communications
• I/O network
Test & Deployment Goals
Short-term (summer 2002):
– Continue tests on multi-node 9iRAC up to ~3-5TB
– Based on realistic data model & access patterns
– Understand in-house, then test in Valbonne
Medium-term (Q1 2003):
– Production 9iRAC with up to 25TB of data
– Modest I/O rate; primarily read-only data
Long-term (LHC production phase):
– Multiple multi-hundred TB RACs
– Distributed in World-wide Grid
9iRAC Direction
Strong & visible commitment from Oracle
– Repeated message at OracleWorld
– New features in 9iR2
  – e.g. cluster file system for Windows and Linux
Scalability depends to a certain extent on application
– Our read-mostly data should be an excellent fit!
Multi-TB tests with “professional” storage
– HP / COMPAQ centre in Valbonne, France
Target: 100TB per 9iRAC
Why 100TB?
Possible today
– BT Enormous Proof of Concept: 37TB in 1999
– CERN ODBMS deployment: 3TB per node
Mainstream long before LHC
– Winter 2000 VLDB survey: 100TB circa 2005
How does this match LHC need for 100PB?
Analysis data: 100TB ok for ~10 years
– One 10 node 9iRAC per experiment
Intermediate: 100TB ~1 year’s data
– ~40 10 node 9iRACs
RAW data: 100TB = 1 month’s data
– 400 10 node 9iRACs to handle all RAW data
– 10 RACs / year, 10 years, 4 experiments
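The cluster counts on this slide can be reproduced with a few lines of arithmetic. The sketch below assumes 100TB per 9iRAC, 10 years of running and 4 experiments, as stated on the slide, and takes the lower-bound annual volumes per experiment (10TB analysis, 100TB intermediate, 1PB RAW) from the data volumes table; the function name is hypothetical.

```python
# Assumptions taken from the slide: each 9iRAC cluster holds 100 TB,
# LHC runs for 10 years, and there are 4 experiments.
RAC_CAPACITY_TB = 100
YEARS = 10
EXPERIMENTS = 4

def racs_needed(tb_per_year_per_experiment):
    """Clusters needed to hold every year's data for all experiments."""
    total_tb = tb_per_year_per_experiment * YEARS * EXPERIMENTS
    # Ceiling division so a partly filled cluster still counts as one.
    return -(-total_tb // RAC_CAPACITY_TB)

print("analysis data (10 TB/yr):", racs_needed(10))
print("intermediate data (100 TB/yr):", racs_needed(100))
print("RAW data (1 PB/yr):", racs_needed(1000))
```

With these inputs the RAW case works out to 10 clusters per year per experiment, which is where the slide's "10 RACs / year, 10 years, 4 experiments" comes from.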
LHC Data Volumes Revisited
Data Category                  Annual       Total
RAW                            1-3PB        10-30PB
Event Summary Data - ESD       100-500TB    1-5PB
Analysis Object Data - AOD     10TB         100TB
TAG                            1TB          10TB
Total per experiment           ~4PB         ~40PB
Grand totals (15 years)        ~16PB        ~250PB
RAW & ESD: >> 100TB
RAW:
– Access pattern: sequential
– Access frequency: ~once per year
– Use time partitioning + (offline tablespaces?)
– 100TB = 10 day time window
– Current data (1 RAC), historic data (2nd RAC)
ESD:
– Expect RAC scalability to continue to increase
– VLDB prediction for 2020: 1,000,000,000 TB (YB)
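How long a time window 100TB covers depends on the RAW rate. A rough sketch, assuming data arrives at a steady rate spread over a 365-day year and that 1PB = 1000TB (both simplifications of mine, not statements from the talk):

```python
def window_days(capacity_tb, annual_volume_tb, days_per_year=365):
    """Days of data that fit in capacity_tb at a steady annual rate."""
    return capacity_tb * days_per_year / annual_volume_tb

# RAW volume per experiment is quoted as 1-3 PB per year (1 PB = 1000 TB assumed).
for annual_pb in (1, 3):
    days = window_days(100, annual_pb * 1000)
    print(f"{annual_pb} PB/yr: 100 TB holds about {days:.1f} days")
```

Across the quoted 1-3PB range this gives windows of roughly five weeks down to under two weeks, the same order as the "1 month" and "10 day" figures used in the talk. Treat it as an order-of-magnitude check only.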
[Diagram: data categories against Data and Users, with access pattern labels “random” and “seq.” and tiers Tier0, Tier1]
RAW: 1PB/yr (1PB/s prior to reduction!)
ESD: 100TB/yr
AOD: 10TB/yr
TAG: 1TB/yr
Oracle Tests on IA64
64 bit computing essential for LHC
– Addressability: VLMs, 64 bit filesystems, VLDBs
– Accuracy: need 64 bit precision to track sub-atomic particles over tens of metres
Migration IA32 to IA64 prior to LHC startup
A solid history of Enterprise class processor development
Intel’s technology innovations drive price/performance and scalability
[Chart: Performance vs. Time]
Processors shown: i486™ processor; Pentium® processor; Pentium® Pro processor; Pentium® II/III Xeon™ processors; Intel Xeon processor; Intel® Xeon™ processor MP
Callouts: RISC techniques for 2X i386™ performance; Executes 2 instructions in parallel; Multi-processor support; Higher processing & data bandwidth for enterprise apps
Performance Via Technology Innovations
Balanced system performance through higher bandwidth and throughput
– Intel® NetBurst™ microarchitecture
– Integrated multi-level cache architecture
Faster performance on business apps
– Hyper-Threading Technology
– up to 40% more efficient use of processor resources
Processor Innovations for Increased Server Performance and Headroom
Matching Enterprise Requirements
[Diagram with three columns: Enterprise Segments, System Requirements, Itanium® Processor family Features]
Labels shown: High Availability; Back End; Reliability; Availability; Mid-Tier; High-end General Purpose; Scalability; EPIC Architecture; High Performance; Front-end General Purpose; Bandwidth; Throughput; Performance
Features and flexibility to span the enterprise
Best Performance… OLTP model
Example: Calling circle OLTP model
– Taken from a real world insurance example
– 4 node x 4-way Pentium® III Xeon™ 700 MHz processor-based systems
128k TPM
Over 90% scalability
[Chart: Oracle9i RAC Scalability on Intel Architecture. x-axis: #Nodes (4 way Pentium 3 Xeon Systems); y-axis: TPM, 0 to 120,000]
#Nodes    TPM
1         33,000
2         66,000
3         98,000
4         128,000
Intel-based Solution Outperforms 32-way Sun Solution by More than 2x
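The "over 90% scalability" claim can be checked against the chart values. The sketch below uses one common definition, measured throughput divided by ideal linear scaling from the single-node result; whether the presenters used exactly this definition is an assumption, and the names are illustrative.

```python
# Throughput in transactions per minute (TPM) by node count, from the chart.
tpm_by_nodes = {1: 33_000, 2: 66_000, 3: 98_000, 4: 128_000}

def scaling_efficiency(tpm_by_nodes):
    """Measured TPM divided by ideal linear scaling from the 1-node result."""
    base = tpm_by_nodes[1]
    return {n: tpm / (n * base) for n, tpm in tpm_by_nodes.items()}

efficiency = scaling_efficiency(tpm_by_nodes)
for n, eff in efficiency.items():
    print(f"{n} node(s): {eff:.1%} of linear")
```

By this measure the four-node result stays well above the 90% mark quoted on the slide.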
Best Performance… TPC/C
8 nodes * 4 way Database Servers Pentium III Xeon 900MHz
16 load generating Application Servers Pentium III 1GHz
Best Performance … TPC/C
Best Performance… Price/Performance
9iRAC on RedHat on e.g. Dell 69% faster and 85% less expensive than Oracle on RISC solutions
Itanium® Processor Family
[Roadmap chart: Performance over 2001, 2002, 2003]
Processors shown: Itanium® Processor; Itanium® 2 Processor; Madison* / Deerfield*; Montecito*
• Introduce architecture
• Deliver competitive performance
• Focused target segments

• Build-out architecture/ platform
• Establish world-class performance
• Significantly increase deployment

• Extend performance leadership
• Broaden target applications
Common hardware
Software scales across generations
* Indicate Intel processor codenames. All products, dates and figures are preliminary, for planning purposes only, and subject to change without notice.
Itanium® 2 Processor
On track for mid’02 releases from multiple OEMs and ISV
Substantial performance leadership vs. RISC
Delivering on performance promise
[Chart: performance using Itanium® 2 optimizations, relative to Itanium® processor 800MHz 4MB L3 = 1.00. Source: Intel Corporation]
Benchmarks: SPECint2000; SPECfp2000; Stream; OLTP; ERP; Linpack 10K; CAE
Groups: CPU/Bandwidth; Enterprise; Technical Computing
Values shown: ~2.0; ~2.0; ~1.7; ~1.7; ~2.1; ~1.9
Deployment Strategy
Positioned To Scale Right
Scale Out with fail-over clusters on 1 to 2-way servers
Scale Up on 4 and 8-way servers, then Scale Out on fail-over clusters
Scale Up on 8-way and above servers
Examples
Inktomi*
Apache* Web Server
Microsoft Exchange* Server
Oracle* 9iRAC
SAS Enterprise Miner*
Oracle 9i*
Intel Relevance
MP
Versatile Server Solutions For Scaling Right
Inflection point coming
Itanium2™ will have a 75%** price / performance lead over USIII at introduction in Q3’02
– Itanium2™ will outperform USIII by 40%
– Itanium2™ will cost 20% less than USIII
Oracle and Intel working to make 9i on Itanium a success
– Joint performance goal of 100k TPM-C on a single 4-way Itanium2™ server
– 13 Intel engineers onsite and an additional 24 at Intel working to optimize 9i on Itanium2™
– Intel supplying Oracle large numbers of Itanium2™ development systems
* McKinley is next generation Itanium™ processor
** Estimated Q3’02 figures
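The 75% price/performance figure is consistent with the two sub-claims if price/performance is read as performance divided by cost. A small sketch under that assumption (the function and parameter names are illustrative):

```python
def price_performance_lead(perf_gain, cost_reduction):
    """Relative price/performance advantage from a performance gain and a
    cost reduction, both given as fractions (0.40 means 40%)."""
    relative_perf = 1 + perf_gain       # 40% faster -> 1.4x the performance
    relative_cost = 1 - cost_reduction  # 20% cheaper -> 0.8x the cost
    return relative_perf / relative_cost - 1

lead = price_performance_lead(perf_gain=0.40, cost_reduction=0.20)
print(f"price/performance lead: {lead:.0%}")
```

The point of dividing rather than adding is that the two advantages compound: 1.4 / 0.8 = 1.75, which is larger than the 60% obtained by simply summing the percentages.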
Summary
Existing Oracle technologies can be used to build 100TB databases
Familiar data warehousing techniques can be used to handle much larger volumes of historic data
Best Price and Performance through clusters vs. RISC
9iRAC makes this possible on commodity server platforms
Standard High Volume servers offer great performance today and promise a safe investment for the future
Thank You