Coarse-Grained Classification: Big Data
• PBs of data, 10^2–10^5 nodes
• Operational: high qps, a few rows/op (e.g., BigTable, Dynamo, PNUTS)
• Analytic: low qps, billions of rows/op (e.g., MapReduce, Hadoop, Dryad)
Virtualization (Scalability)
• Multi-Tenancy: map N logical systems onto 1 physical system
• Machine Virtualization: map 1 logical system onto N physical systems
Fine-Grained Classification
• Virtualization: goals and trends
• Machine Virtualization
• Storage Virtualization: impact on database systems
• Database Systems in Virtual Environments
• Conclusion

"Any problem in computer science can be solved with another layer of indirection." (David Wheeler)
A typical scenario in data centers
• Customer A runs an auction site and pays $100. Requirements: response time < 2 s, throughput > 100 rq/sec.
• Customer B runs a trading site and pays $50. Requirements: response time < 5 s, throughput > 50 rq/sec.
• Both applications run in a shared data center.
Hosting applications
• E-mail server, web server, and database server, each running on its own Linux machine in the data center.
• Common idiom: one-to-one mapping of applications to nodes.
Solution: Virtualization
• Consolidate the e-mail server, web server, and database server (each with its own Linux instance) onto one virtualized physical machine.
• Improved utilization through consolidation.
Problem: poor utilization
[Figure: number of CPUs in use over time (sampled every 5 minutes); the gap between provisioned and used CPUs is wasted resources.]
Ad-hoc resource allocation schemes waste resources.
What is Virtualization?
• Separating the abstract view of computing resources from the implementation of these resources
• A layer of indirection between the abstract view and the implementation of resources
  – Hides implementation details
  – Controls the mapping from abstract view to implementation
  – Adds flexibility and agility to the computing infrastructure
  – Can be used to solve many problems related to provisioning, manageability, security, ...
  – Lowers TCO: fewer computing resources
• "Virtualization will be a $20 billion market by 2010." (IDC, Jan 2007)
Why virtualization?
• Virtualization adds flexibility and agility to the computing infrastructure
• Can be used to solve many problems related to provisioning, manageability, security, ...
  – Pool and share computing resources
  – Simplify administration and management
  – Improve fault tolerance
• For organizations: lower total cost of ownership for the computing infrastructure
  – Fewer computing resources
  – More resilient and simpler to manage
What does Virtualization enable? I
• Server consolidation
  – The vast majority of servers are grossly underutilized
  – Typical setup today: one machine per application (DBMS, web server, mail server, ...)
  – Provisioned for peak load, usually under-utilized
  – Instead, multiple applications can run on virtual machines that share the same physical machine
  – Reduces both CapEx (Capital EXpenditure) and OpEx (Operational EXpenditure)
  – Saves hardware costs, administration costs, power, etc.
What does Virtualization enable? II
• Server consolidation
  [Figure: resource requirements of two workloads P1 and P2 over time, peaking at different times.]
• Consolidate into a single machine with capacity P12
  – P12 < P1 + P2
  – Easier to manage
  – Less total capacity and lower operating cost than the original two machines
  – Better utilization than the original two machines
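The claim P12 < P1 + P2 follows from the fact that the peak of a sum of demand traces is at most (and usually less than) the sum of the individual peaks. A minimal sketch, using hypothetical demand traces, to illustrate:

```python
# Sketch: why a consolidated machine can be smaller than the sum of the
# two original machines. The demand traces are hypothetical illustrations.

def required_capacity(trace):
    """Capacity needed to serve a demand trace: its peak value."""
    return max(trace)

# Two workloads whose peaks occur at different times.
p1 = [1, 4, 1, 1]   # peaks early
p2 = [1, 1, 4, 1]   # peaks late

separate = required_capacity(p1) + required_capacity(p2)           # P1 + P2
consolidated = required_capacity([a + b for a, b in zip(p1, p2)])  # P12

assert consolidated < separate  # P12 < P1 + P2
```

The saving is largest when the workloads are anti-correlated; if both peaks coincided, P12 would equal P1 + P2 and consolidation would only save management overhead.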
What does Virtualization enable? III
• Migration of VMs (both storage and CPU/memory)
  – Enables live load balancing
  – Facilitates maintenance
• Example: Machine 1 must be brought down for maintenance. Its VM (with the application) is live-migrated over shared storage to Machine 2 (a spare); Machine 1 can then be brought down without service interruption.
What does Virtualization enable? IV
• High availability
  – Allows a small number of generic servers to back up all servers
• Example: Machine 2 (standing by) detects the failure of Machine 1 via heartbeat, then restarts the VM on Machine 2 from the VM image on shared storage.
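The heartbeat-based failover described above can be sketched as follows. This is a simplified model, not any particular product's protocol; the class and method names are hypothetical, and real systems add fencing to avoid split-brain.

```python
import time

# Sketch of heartbeat-based failure detection: the standby restarts the VM
# from the shared-storage image when the primary's heartbeat goes silent
# for longer than a timeout. All names here are hypothetical.

class StandbyMonitor:
    def __init__(self, timeout=3.0):
        self.timeout = timeout             # seconds of silence before failover
        self.last_beat = time.monotonic()
        self.restarted = False

    def heartbeat(self):
        """Called whenever a heartbeat arrives from the primary."""
        self.last_beat = time.monotonic()

    def check(self, now=None):
        """Declare failure and restart the VM if the primary is silent."""
        now = time.monotonic() if now is None else now
        if not self.restarted and now - self.last_beat > self.timeout:
            self.restarted = True
            self.restart_from_shared_storage()
        return self.restarted

    def restart_from_shared_storage(self):
        # In a real system: boot the VM image kept on shared storage.
        print("restarting VM on standby from shared image")

m = StandbyMonitor()
m.heartbeat()
assert m.check(now=m.last_beat + 1.0) is False   # still alive
assert m.check(now=m.last_beat + 5.0) is True    # timeout -> failover
```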
Terminology
• Emulation
  – Simulates complete hardware, allowing an unmodified OS for a different CPU to be run
  – Must simulate each instruction, so slow
  – Example: Bochs
• Operating system virtualization
  – Isolated virtual servers within a server
  – Guest and host OS use the same kernel
  – Examples: Solaris Containers, Virtuozzo
• We discuss native virtualization, or full virtualization
Machine Virtualization I
• A virtual machine "abstracts" the computing resources of a physical machine into virtual resources
• Introduces a level of indirection between virtual resources and physical resources
• End users only see the virtual resources
  – They can install their operating systems and run their applications on the virtual machines
• A Virtual Machine Monitor (or hypervisor) is a software layer that implements the mapping from virtual resources to physical resources
Machine Virtualization II
• Strong isolation between virtual machines
• Flexible mapping between virtual resources and physical resources
• Can have more virtual resources than the corresponding physical resources
• Can reallocate physical resources among VMs
• Pause, resume, checkpoint, and migrate virtual machines
Virtualization: History
• Technique with a long history (since the 1960s)
  – Prominent since the IBM System/370 mainframe series
• Virtualization today
  – Larger scale
  – Commodity hardware and operating systems
• Example technologies
  – IBM z/VM and IBM PowerVM on Power Systems
  – Sun xVM and Zones
  – HP VM
  – On x86 processors: Xen and XenServer, Microsoft Hyper-V, KVM, VMware ESX
Trap-and-Emulate Virtualization I
• Run the VMM in privileged mode, the OS in user mode
• Privileged operations by the OS will trap
• A trap handler in the VMM emulates these operations as if they were run on the virtual machine
• Non-privileged operations can proceed as before, with no intervention from the VMM
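The trap-and-emulate dispatch above can be sketched as a toy model. The instruction names, the privileged set, and the VMM state are all hypothetical illustrations; real hardware traps via the CPU's protection mechanism, not a software check.

```python
# Toy model of trap-and-emulate: the guest OS runs in user mode, so
# privileged instructions trap into the VMM, which emulates them against
# per-VM virtual state; non-privileged instructions run directly.

PRIVILEGED = {"set_timer", "disable_interrupts"}   # hypothetical set

class VMM:
    def __init__(self):
        self.vm_state = {"timer": None, "interrupts": True}
        self.traps = 0

    def execute(self, instr, arg=None):
        if instr in PRIVILEGED:
            return self.trap(instr, arg)   # hardware would trap here
        return ("direct", instr)           # runs at native speed, no VMM cost

    def trap(self, instr, arg):
        """Emulate the privileged instruction on the *virtual* machine."""
        self.traps += 1
        if instr == "set_timer":
            self.vm_state["timer"] = arg
        elif instr == "disable_interrupts":
            self.vm_state["interrupts"] = False
        return ("emulated", instr)

vmm = VMM()
assert vmm.execute("add") == ("direct", "add")
assert vmm.execute("set_timer", 100) == ("emulated", "set_timer")
assert vmm.vm_state["timer"] == 100 and vmm.traps == 1
```

This also makes the cost model visible: only the privileged minority of instructions pays the trap overhead, which is why trap-and-emulate can be efficient for compute-heavy guests.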
More features
• Binary translation of guest code
  – Translate guest kernel code
  – Replace privileged instructions with safe "equivalent" instruction sequences
  – No need for traps
  – Permits any unmodified x86 OS to run in a VM (can virtualize any instruction set)
• Virtualization hardware assist
  – Recent CPUs have features to reduce some of the overhead at the monitor level
  – Does not remove all virtualization overheads: scheduling, memory management, and I/O are still virtualized with a software layer
  – The binary translation monitor is faster than hardware assist for many workloads
Paravirtualization
• CPU and memory paravirtualization
  – Extends the guest to allow direct interaction with the underlying hypervisor
  – Reduces the monitor cost, including memory and system call operations
  – Gains from paravirtualization are workload specific
• Device paravirtualization
  – Places a high-performance, virtualization-aware device driver into the guest
  – Paravirtualized drivers are more CPU efficient (less CPU overhead for virtualization)
  – Paravirtualized drivers can also take advantage of HW features, like partial offload
Virtual Machines: Usage and Repositories
• Software development and testing
  – Multiple environments for development and testing
• Software deployment
  – Preconfigured virtual appliances
  – Repositories of virtual appliances on the web
Why not use virtualization?
• Performance penalty
  – Indirection through the VM adds overhead
  – Can be reduced through better virtualization support
• Hiding details of physical resources
  – Some applications (e.g., DBMSs!) make decisions based on assumptions about the physical resources
Some (more) numbers
• PH-HDFS vs. VM-HDFS with different data scales and a 7-node cluster
• PH-HDFS vs. VM-HDFS with different cluster scales, with the same data distribution (512 MB per DataNode)
• PH-HDFS vs. VM-HDFS with different throughput (requests/s) and a 7-node cluster
Still more numbers
• WordCount example: Sort and WordCount execution with different VM load per node and different data sizes
Storage Virtualization
• Storage virtualization is a layer of indirection that allows the definition of virtual storage devices
• Basic operations
  – Create and destroy devices
  – Grow and shrink virtual devices
  – Change properties (RAID level, mirroring): size, performance, reliability
Why Storage Virtualization?
• Minimize/avoid downtime
  – Transparent redundancy
• Improve performance
  – Distribute and balance the storage load
  – Dynamic storage provisioning
  – Control placement
• Reduce the cost of storage administration
[Figure: applications access virtual devices, which are mapped onto a physical storage pool.]
Additional Capabilities
• File system level
  – Versioning, snapshots, point-in-time copies
  – Local and remote mirroring
  – Migration of virtual devices: supports provisioning and hierarchical storage management
  – Dynamic storage resource allocation: accommodate workload fluctuations
  – Auto-administration: policy-based management
  – Storage QoS and performance isolation
  – CPU and memory in the storage system: caching, offloading DBMS functions
• Database level
  – Storage snapshots and versioning: backup and recovery, concurrent applications
  – Storage replication and mirroring: dynamic DBMS provisioning
  – e.g., resizable tablespaces
What Does Storage Virtualization Enable?
• A consistent interface for all storage applications
• Non-disruptive LUN creation, expansion, and deletion
• Non-disruptive data consolidation and data migration
• Non-disruptive RAID level changes
• Tiering of storage using heterogeneous storage
• Replication across heterogeneous storage
• Backup/restore, archival, thin provisioning, etc.
Storage Virtualization and DBMS Tuning I
• Characteristics of the physical storage are hidden from the DBMS (and the DBA)
• How then to
  – lay out DB objects?
  – cost query plans?
  – do physical design?
  – set DBMS parameters (page/extent sizes, prefetching, costs)?
• Example DDL that depends on such characteristics:

CREATE TABLESPACE ts
  MANAGED BY DATABASE
  USING (DEVICE '/dev/rlv1' 10000,
         DEVICE '/dev/rlv2' 10000)
  OVERHEAD 6.0
  TRANSFERRATE 0.05
  PAGESIZE 8192
Observation: the I/O problem
• Capacity versus performance
  – Consider 200 GB of disk space: a single disk vs. a storage subsystem
Observation 3: ... a big mess
• Static relationship between servers and storage systems: the SAN today has problems and limitations
• Inefficient use of storage resources; migration of data is usually disruptive
• Proprietary, non-interoperable copy services (Flashcopy? Remote copy?); no homogeneous storage management across vendor drivers (ESS, HPQ, LCRB) and management applications
• Monolithic, expensive storage subsystems: pay for functionality when needing capacity
[Figure: some arrays run out of space while free capacity sits idle elsewhere; data migration between them is disruptive.]
End-to-End Physical Design I
• Goal
  – Solve both the DB design and the storage design problem while preserving virtualization
• Two approaches
  1. Export virtual device characteristics to the DBA
     • Weakens virtualization
     • DBA/DBMS must model I/O costs
     • DBA involved in layout
  2. Describe the DBMS storage workload to the storage administrator
     • Storage virtualization is preserved
     • Storage administrator is responsible for layout
     • Need to estimate the DBMS storage workload
     • Can accommodate multiple storage clients
End-to-End Physical Design II
1. Ignore the problem interdependency: DB design first, then storage system design
2. Exploit existing design tools
3. Predict the storage workload at design time
Shared-Disk Systems
• Architecture
  – Many computing nodes, one shared disk system
  – Multiple database instances (one on each node) access the same database over shared storage and act as a single "clustered" server
• Example grid cluster: 4 nodes with 4 CPUs and 16 GB RAM each, Fibre Channel SAN switches, 16 FC-AL controllers, 8 x 14 disks, 1600 MB/sec throughput
Shared-Disk Systems (2)
• Advantages
  – 100% node utilization possible
  – All data remains accessible even if one or more nodes fail
  – Provides a facility for incremental growth
• Disadvantages
  – Inter-node synchronization (cache coherence) is required and involves complex lock management
  – OS overhead for running the shared-disk software
  – Increased cost
• Database vendors and features
  – Oracle Real Application Clusters (Oracle RAC)
  – DB2 on OS/390 clusters
Multi-Tier Caching
• Exploit memory in the storage system
• Both the DBMS and the storage system rely on caching
• Without coordination, the caches can interact badly
Example: Two-Tier Caching
1. Page p misses the tier-one cache and is requested from tier two
2. p misses the tier-two cache and is fetched from the device
3. p is placed into the tier-two cache
4. p is placed into the tier-one cache
• p is now in both caches. This is the cache inclusion problem.
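The four steps above can be sketched with two LRU caches. This is a minimal illustration (the cache sizes and page names are hypothetical): after one read, the fetched page occupies space in both tiers, so tier-two capacity is wasted on pages tier one already holds.

```python
from collections import OrderedDict

# Sketch of the cache-inclusion problem with two LRU tiers.

class LRUCache:
    def __init__(self, capacity):
        self.capacity, self.pages = capacity, OrderedDict()

    def get(self, p):
        if p in self.pages:
            self.pages.move_to_end(p)   # mark as most recently used
            return True
        return False

    def put(self, p):
        self.pages[p] = True
        self.pages.move_to_end(p)
        if len(self.pages) > self.capacity:
            self.pages.popitem(last=False)   # evict the LRU page

tier1, tier2 = LRUCache(4), LRUCache(8)   # hypothetical sizes

def read(p):
    if tier1.get(p):
        return "tier1 hit"
    if not tier2.get(p):                  # steps 1-2: miss both tiers
        tier2.put(p)                      # step 3: place into tier two
    tier1.put(p)                          # step 4: place into tier one
    return "miss"

read("p")
# The same page now occupies space in both tiers: cache inclusion.
assert "p" in tier1.pages and "p" in tier2.pages
```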
Locality of 2nd-Tier Workloads
[Figure: client vs. server access traces, contrasting good temporal locality with weak temporal locality.]
Write Hints
• Why does tier one WRITE?
  1. Replacement: clean a block so that it can be replaced
  2. Recovery: clean a block to limit potential data loss or recovery effort
• A write hint identifies the reason for a WRITE
  – Helps to distinguish the good caching opportunities from the bad
  – Replacement vs. recovery
  – Synchronous vs. asynchronous
• Tier two interprets write hints to guide caching decisions:

          | ASYNCH               | SYNCH
  REPLACE | eviction likely soon | eviction imminent
  RECOV   | uncertain            | uncertain
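The hint table above can be turned directly into a tier-two policy. A minimal sketch; the priority labels are hypothetical, the point is only that (reason, mode) maps to a caching decision:

```python
# Sketch: tier two uses the (reason, mode) write hint to decide whether a
# written page is a good caching candidate. A synchronous replacement write
# means eviction from tier one is imminent, so caching the page in tier two
# is most attractive; recovery writes say little about future tier-one state.

def caching_priority(reason, mode):
    """Map a write hint to a tier-two caching priority (hypothetical scale)."""
    table = {
        ("REPLACE", "SYNC"):  "high",     # eviction imminent in tier one
        ("REPLACE", "ASYNC"): "medium",   # eviction likely soon
        ("RECOV",   "SYNC"):  "low",      # uncertain: page may stay hot
        ("RECOV",   "ASYNC"): "low",      # uncertain
    }
    return table[(reason, mode)]

assert caching_priority("REPLACE", "SYNC") == "high"
assert caching_priority("RECOV", "ASYNC") == "low"
```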
Example 1: Archiving & Backup
• Handle hot and warm data differently
• The storage system knows about a "database application"
  – Assumes abstract entities independent of any specific system
  – Tables, table spaces, log files, and control files
• Backup/archiving covers more than just the database!
Disk Mirroring and Online Backup
• Disk mirroring
  – Properties
    • Copes with media errors
    • Writes sequentially or in parallel
  – Restore
    • Mirrors have to be resynchronized (which may take a _very_ long time)
  – Alternative (Oracle)
    • Multiplexing of redo log files and control files
    • Allows multiple destinations for archived log files
• Online backup
  – Uses log files during the recovery process to recover the database to a fully consistent state
  – Requires retaining the RDBMS log files
  – The system has to figure out where to start recovery
Example: Tivoli Storage Manager
• Back up the raw devices directly; the database must be offline during an image backup for consistency
• Back up the data to files, then use Storage Manager to back up those created files
• Directly back up the database and log files; no intermediate files are needed, but the database must be offline to ensure a consistent backup copy
• Integrated support for backing up data to Storage Manager: direct backup via the Storage Manager API; the underlying physical structure of the database (raw devices or files) is handled by the application, and the type of backup (online, offline, incremental, table space) is determined and controlled by the application
Example 2: Flash Storage
• Flash disks
  – Extremely fast random read speed
  – Slow random write speed
• Magnetic disks
  – More efficient than flash disks for random write patterns
  – Slower random read speed than flash disks

                | Random Read | Random Write
  Magnetic disk | 12x (rm)    | 23x (wm)
  Flash disk    | 1x (rf)     | 118x (wf)
Properties of NAND Flash Disks
• I/O interface
  – At the OS level, SSDs behave identically to magnetic drives
  – The unit of an I/O operation is a sector of 512 bytes (equal to magnetic disks)
• No mechanical latency
  – Purely electronic devices
  – The time needed to access a data item is independent of its position
• I/O asymmetry
  – Read operations are much faster than write operations
  – When the value of a NAND cell is changed, it takes some time before it reaches a stable state
• Erase-before-write limitation
  – A sector cannot be overwritten in place (erase before update)
  – Moreover, only entire blocks containing many sectors can be erased
Hybrid Storage Layer
• Design a database system that uses both
  – a flash disk and
  – a magnetic disk
  as storage media, and improve its I/O performance
• Flash disk and magnetic disk side by side
  – At the same level
  – A database page can reside either on the flash or on the magnetic disk, but not on both
• Two problems to solve
  – Adapting to changing I/O workloads to optimally place pages across the two disks: page placement
  – Making the optimal choice of the page to replace in the buffer pool: page replacement
Page Placement
• Upon eviction, decide page placement
  – On a per-page basis
  – Adapt placement to changing workloads
• Treat the decision for a page as a 2-state task system
  – State f: page currently on the flash disk
  – State m: page currently on the magnetic disk
  – Cost of serving a request:
    • rf (rm) when reading from flash (magnetic)
    • wf (wm) when writing to flash (magnetic)
  – Cost of a transition (page migration):
    • wf when migrating to flash
    • wm when migrating to magnetic
Conservative Placement Algorithm
• For each page, maintain
  – pr: physical reads since the last migration
  – pw: physical writes since the last migration
• Upon eviction, compute
  – c: the cost of serving all pr, pw from the current state
  – c': the cost of serving all pr, pw from the other state
  – If c - c' > wf + wm, then migrate
• Why "conservative"?
  – We need to be certain that a page is on the wrong medium before triggering a transition
  – The decision is based on the actual cost paid and makes no assumptions
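The decision rule above can be sketched with the relative costs from the flash/magnetic table (rf = 1, rm = 12, wf = 118, wm = 23). A minimal sketch of the eviction-time decision; the specific pr/pw values in the examples are hypothetical:

```python
# Sketch of the conservative placement decision: upon eviction, compare the
# cost of the observed reads/writes (pr, pw) in the current state against the
# other state, and migrate only if the saving exceeds the migration penalty
# wf + wm.

RF, RM, WF, WM = 1, 12, 118, 23   # relative I/O costs from the slide

def cost(pr, pw, on_flash):
    """Cost of serving pr reads and pw writes on the given medium."""
    return pr * (RF if on_flash else RM) + pw * (WF if on_flash else WM)

def should_migrate(pr, pw, on_flash):
    c = cost(pr, pw, on_flash)            # cost paid in the current state
    c_other = cost(pr, pw, not on_flash)  # cost in the other state
    return c - c_other > WF + WM          # conservative: saving > penalty

# Read-heavy page on magnetic disk: flash would be far cheaper -> migrate.
assert should_migrate(pr=100, pw=0, on_flash=False) is True
# Write-heavy page on magnetic disk: stays put.
assert should_migrate(pr=0, pw=100, on_flash=False) is False
```

The wf + wm threshold is exactly the round-trip migration cost, so a page only moves when its recent history would have paid for moving it there and back.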
High Availability
• High availability
  – High or permanent availability
  – Protection against data loss (recovery)
  – Functional substitution in error cases by the standby system (failover)
  – Reduction of downtime
  – Transparency for applications is desirable
Database Failover Cluster
• Failover cluster
  – A database cluster with redundant nodes whose sole purpose is to provide service when database components fail
  – Reduces the downtime of critical servers
  – Automatic failure detection using heartbeats
  – Primary database (current database) and standby database (failover database)
• Example database vendors and features
  – DB2 High Availability Disaster Recovery (HADR)
  – Oracle Data Guard
High Availability Disaster Recovery
• High Availability Disaster Recovery (HADR)
  – Database replication function for high availability
  – Creation of a standby database via Backup & Restore or Online Split Mirroring
  – Exchange of monitoring messages
  – Use of log file transfer
• HADR ships the logs of transactions on the primary database to the standby database, which continuously replays the transactions; the standby database is initialized using a database restore.
High Availability Disaster Recovery (2)
• Replicated data and operations
  – DDL & DML operations
  – Operations on table spaces and buffer pools
  – Reorganization
  – Metadata for stored procedures and UDFs (but not their implementations)
• Non-replicated data and operations
  – Database and instance configuration parameters and their modifications
  – LOBs > 1 GB
HADR States of the Standby Database
• Local catch-up
  – Upon first start
  – Application of local log files
• Remote catch-up pending
  – Waiting for a connection to the primary database
• Remote catch-up
  – Application of archived and active log files of the primary database
• Peer status
  – Both databases are identically updated
  – Transfer and application of the log file pages to the standby database
HADR Synchronization Modes
• Database configuration parameter: HADR_SYNCMODE
• Writing of logs is successful if the logs are written to log files on the primary database AND, depending on the mode:
  – Synchronous (SYNC): the primary database has received acknowledgement from the standby database that the logs were successfully written to log files on the standby database
  – Near synchronous (NEARSYNC) [default]: the primary database has received acknowledgement from the standby database that the logs were successfully written to main memory on the standby database
  – Asynchronous (ASYNC): the log records have been delivered to the standby database (no acknowledgement expected)
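The three success conditions above can be sketched as a single predicate. This is a simplified model of the decision only (the boolean flags stand in for standby state the real protocol tracks through acknowledgement messages):

```python
# Sketch: when the primary considers a log write successful under the three
# HADR synchronization modes. The flags are simplified stand-ins for the
# standby-side state reported via acknowledgements.

def log_write_ok(mode, on_primary_disk, on_standby_disk,
                 in_standby_memory, delivered_to_standby):
    if not on_primary_disk:
        return False                   # required in every mode
    if mode == "SYNC":
        return on_standby_disk         # standby persisted the logs
    if mode == "NEARSYNC":             # the default mode
        return in_standby_memory       # standby holds the logs in memory
    if mode == "ASYNC":
        return delivered_to_standby    # delivery only, no acknowledgement
    raise ValueError(mode)

# SYNC is strictest: in-memory receipt on the standby is not enough.
assert log_write_ok("SYNC", True, False, True, True) is False
assert log_write_ok("NEARSYNC", True, False, True, True) is True
assert log_write_ok("ASYNC", True, False, False, True) is True
```

The modes trade durability for commit latency: SYNC survives a standby crash, NEARSYNC survives a primary crash, and ASYNC can lose the most recent log records if the primary fails.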
TAKEOVER Command
• Switching the database roles (takeover)
  – Only possible on the standby database
  – Requirements
    • Database in peer status
    • Connection to the primary database available
  – TAKEOVER HADR ON DB <dbname>
  – The old primary becomes the new standby and the old standby becomes the new primary; HADR resynchronizes the new primary and the new standby, and continuous transaction replay resumes.
• HADR functional substitution (failover)
  – In case of unavailability of the primary database; no role switching!
  – Only possible on the standby database
  – Requirements
    • Database status: peer status or remote catch-up pending
    • The primary database must be fully deactivated; otherwise two primary databases could exist, and inconsistencies become possible!
  – TAKEOVER HADR ON DB <dbname> BY FORCE
  – The standby database takes over as the new primary database.
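The two TAKEOVER variants and their preconditions can be sketched as one function. A simplified model of the decision logic only; the state strings and flags are stand-ins for what the DBMS tracks internally:

```python
# Sketch of the two TAKEOVER variants on the standby: a graceful role switch
# requires peer status and a reachable primary; BY FORCE requires that the
# primary is fully deactivated, otherwise two primaries could coexist.

def takeover(state, primary_reachable, primary_deactivated, force=False):
    """Return the standby's new role, or raise if preconditions fail."""
    if not force:
        if state != "peer" or not primary_reachable:
            raise RuntimeError("graceful takeover needs peer status "
                               "and a connection to the primary")
        return "primary"          # roles are switched
    if state not in ("peer", "remote catch-up pending"):
        raise RuntimeError("forced takeover not allowed in this state")
    if not primary_deactivated:
        raise RuntimeError("risk of two primaries -> inconsistencies")
    return "primary"              # failover: no role switch

assert takeover("peer", True, False) == "primary"
assert takeover("remote catch-up pending", False, True, force=True) == "primary"
```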
Oracle Data Guard
• Standby database types
  – Physical standby database
    • Application of native redo logs to the standby database
    • Fast and efficient synchronization, simpler configuration
    • No read access for reporting, no backup of the standby database
    • Since 11g: the Active Data Guard option allows read access to the physical standby database
  – Logical standby database
    • Transformation of the redo log to plain SQL and execution of the SQL statements on the standby database
    • Read access for reporting
    • Online backup on the standby database
    • More complex configuration
Source: https://mydoag.doag.org/formes/servlet/DocNavi?action=getFile&did=88890&key=
Databases: Why Use VMs for Databases?
• Virtualization at the hypervisor level provides the best abstraction
  – Each DBA has their own hardened, isolated, managed sandbox
• Strong isolation
  – Security
  – Performance/resources
  – Configuration
  – Fault isolation
• Scalable performance
  – Low-overhead virtual database performance
  – Efficiently stack databases per host
DBMS on Top of Virtual Machines
• ... yet another application?
• No! Virtualization poses several interesting research questions for database systems
  – Understanding the performance of database systems on virtual machines
  – Configuring and tuning virtual machines running database systems
  – Taking advantage of virtualization capabilities in the database system
Performance Overhead
[Figure: relative performance of native Linux (L), XenoLinux (X), VMware Workstation 3.2 (V), and User-Mode Linux (U).]
Detailed Study
• Run TPC-H / PostgreSQL on
  – a bare machine
  – a Xen virtual machine on identical hardware
• The overhead of virtualization is not unacceptably high
• It can be made lower with better virtualization support
• Some new types of overhead are introduced and must be carefully managed
Virtualization and DBMS
TUNING VIRTUAL MACHINES RUNNING DATABASE MANAGEMENT SYSTEMS
Resource Provisioning
• What level of resources to give to each DBMS?
  – Configuring VM parameters
• How to tune the DBMS for a given level of resources?
  – Configuring the DBMS parameters
• Need a model of how resource allocation affects database performance
• Need optimization or control algorithms to decide on the optimal resource allocation
  – statically
  – and dynamically
Dynamic Workloads
[Figure: number of CPUs used over time (sampled every 5 minutes) under a bursty load; provisioning for the peak wastes resources, while provisioning for the average yields bad response times.]
Provisioning for dynamic workloads is hard!
Resource Provisioning: Design Advisor
• Example
  – Workload 1: TPC-H (10 GByte)
  – Workload 2: TPC-H (10 GByte), only Q18 (CPU intensive), 132 copies
• Virtualization design advisor
  – 20% of CPU capacity to Workload 1
  – 80% to Workload 2
Virtualization Design Advisor I
[Diagram: the Configuration Advisor takes workload descriptions, performance goals, and a performance model as input and produces a (static) resource allocation; on-line monitoring feeds performance data back into dynamic resource adjustment.]
Virtualization Design Advisor II
• Problem statement
  – Wi: set of workloads processed in the same amount of time
  – Ri: resource shares
  – Di: database instance
  – Minimize the total estimated cost of running each workload Wi on its instance Di under resource share Ri
  – Constrained by a degradation limit
  – All resources are assigned to the Wi
Cost Estimation: Calibration I
• Use the query optimizer as the cost model
  – Calibrate it to reflect virtual machine resource levels
  – Answer the following question: "If the parameter settings were to be set in a particular way, what would be the cost of the optimal plan for the given workload?"
• Parameters
  – Prescriptive parameters: configuration of the DBMS (shared_buffers and work_mem)
  – Descriptive parameters: execution environment (cpu_tuple_cost, random_page_cost, and effective_cache_size)
Cost Estimation: Calibration II
• Basic methodology
  1. Define a calibration query Q and a calibration database D such that CostDB(Q; Pi; D) is independent of all descriptive parameters in optimizer configuration Pi except for pik.
  2. Choose a resource allocation Ri, instantiate D, run Q under that resource allocation, and measure its execution time TQ.
  3. Solve Renormalize(CostDB(Q; Pi; D)) = TQ for pik, and associate the resulting pik value with the resource allocation Ri. The Renormalize() function applies the renormalization factor determined for the DBMS (CostDB times the number of seconds required for a sequential I/O operation).
  4. Repeat the two preceding steps for a variety of different resource allocations Ri, associating each with a value of pik.
  5. Perform regression analysis on the set of (Ri; pik) value pairs to determine a calibration function Calik from resource allocations Ri to pik values.
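Steps 2 to 5 of the methodology can be sketched end to end. All numbers below are synthetic stand-ins for real measurements, and the simple linear cost model and the 1/r fit are illustrative assumptions, not the actual optimizer formulas:

```python
# Sketch of calibration steps 2-5: run a calibration query at several
# resource allocations, solve for the optimizer parameter p_ik that makes
# the renormalized cost match the measured time, then fit a regression from
# allocations to parameter values. All measurements here are synthetic.

def solve_parameter(measured_time, units_of_work, seconds_per_seq_io=0.01):
    # Step 3: Renormalize(CostDB) = units_of_work * p * seconds_per_seq_io
    # equals the measured time T_Q; solve for p.
    return measured_time / (units_of_work * seconds_per_seq_io)

# Step 4 (synthetic): query time shrinks as the CPU share grows.
allocations = [0.25, 0.5, 1.0]    # CPU share given to the VM
measured = [8.0, 4.0, 2.0]        # measured T_Q in seconds (hypothetical)
pairs = [(r, solve_parameter(t, units_of_work=100))
         for r, t in zip(allocations, measured)]

# Step 5: least-squares fit of p = a / r, the trend the synthetic data follows.
a = (sum(p * (1 / r) for r, p in pairs)
     / sum((1 / r) ** 2 for r, _ in pairs))
cal = lambda r: a / r             # calibration function Cal_ik

assert abs(cal(0.5) - 4.0) < 1e-6
```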
Calibration Study
• Problem
  – N CPU settings, M memory settings
  – Do we really need to execute and calibrate N x M virtual machine settings?
• Observation
  – The CPU, I/O, and memory optimizer parameters are independent of each other (cpu_tuple_cost, cpu_index_tuple_cost, cpu_operator_cost)
Configuration Enumeration
• Greedy algorithm starting with a 1/N share assignment
  – The optimization functions are "fairly smooth and concave"
  – [Plots: two workloads competing for CPU vs. two workloads not competing for CPU]
• Overhead
  – Calibration (Postgres): 9 minutes
  – Enumeration: 1 minute
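The greedy enumeration can be sketched as follows. The per-workload cost functions below are hypothetical stand-ins for calibrated optimizer estimates, and the fixed step size is an assumption; the structure (start from 1/N, move shares while total estimated cost improves) is the point:

```python
# Sketch of greedy configuration enumeration: start from equal 1/N CPU
# shares and repeatedly move a small share increment between workloads
# whenever that reduces the total estimated cost.

def greedy_allocate(costs, step=0.05, max_iters=1000):
    n = len(costs)
    shares = [1.0 / n] * n                 # 1/N starting assignment

    def total(s):
        return sum(c(x) for c, x in zip(costs, s))

    for _ in range(max_iters):
        best, best_move = total(shares), None
        for i in range(n):                 # try moving `step` from i to j
            for j in range(n):
                if i == j or shares[i] < step:
                    continue
                trial = shares[:]
                trial[i] -= step
                trial[j] += step
                if total(trial) < best:
                    best, best_move = total(trial), trial
        if best_move is None:              # local optimum reached
            break
        shares = best_move
    return shares

# Hypothetical cost models: workload 2 is CPU-hungry, workload 1 is lighter
# (cf. the 20%/80% split in the design advisor example).
costs = [lambda x: 1.0 / (0.2 + x), lambda x: 8.0 / (0.2 + x)]
shares = greedy_allocate(costs)
assert shares[1] > shares[0]               # CPU-intensive workload gets more
assert abs(sum(shares) - 1.0) < 1e-9       # all resources are assigned
```

Because the cost functions are smooth and concave, such a greedy hill-climb stops at a good allocation rather than cycling.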
Some Experiments I
• Workload definition based on TPC-H
– Q18 is one of the most CPU-intensive queries
– Q21 is one of the least CPU-intensive queries
– Workload units
• C: 25× Q18
• I: 1× Q21
– Experiment: sensitivity to workload resource needs
• W1 = 5C + 5I
• W2 = kC + (10−k)I (increasing k → more CPU-intensive)
[Figures: results for DB2 and Postgres]
Some Experiments II
• Workload settings
– W3 = 1C
– W4 = kC
• Workload settings
– W5 = 1C
– W6 = kI
Multi-node Database Architectures
• Full-blown classic database architectures
– no shared memory (shared disk / shared nothing)
– Query optimization criteria
• high data parallelism
• minimal data communication overhead (joins, aggregations!)
• provide serializable schedules (commit protocol!)
• Application's perspective
– SQL prompt
Example: Greenplum
• single Postgres nodes + master servers
• local storage (Polymorphic Data Storage)
Oracle RAC
• Architecture
– Two or more computing nodes with instances
– Database nodes connected by a high-speed private interconnect for synchronization and communication, e.g., workload balancing
– Connection to a shared database on disk (SAN)
– Cache coherence with Oracle Cache Fusion
[Figure: Node 1 and Node 2, each with a DB instance and local storage, reached over a public network and linked by a high-speed interconnect (IC); both connect through a storage network (e.g., SAN) to the shared disk holding the database]
Oracle RAC (2)
• Cache Fusion
– Cache coherence protocol using the high-speed private interconnect
– "fuses" the in-memory data physically cached on each computer into one single, global cache
– Avoids synchronization over shared-disk system I/O
[Figure: the local buffer caches of instances 1–3 form one global cache over the interconnect (IC), sitting above the storage network (e.g., SAN) and the shared disk (database)]
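The idea behind fusing the caches can be illustrated with a toy model (all class and method names here are invented for the sketch, not Oracle's interfaces): before touching the shared disk, an instance first checks its own buffer and then its peers' caches over the interconnect.

```python
class Instance:
    """Toy RAC instance: a local buffer cache plus a view of its peers."""
    def __init__(self, name, cluster):
        self.name = name
        self.buffer = {}            # local buffer cache: block_id -> data
        cluster.append(self)
        self.cluster = cluster

    def read(self, block_id, disk):
        if block_id in self.buffer:               # 1. local cache hit
            return self.buffer[block_id], "local"
        for peer in self.cluster:                 # 2. peer's cache, shipped
            if peer is not self and block_id in peer.buffer:
                self.buffer[block_id] = peer.buffer[block_id]
                return self.buffer[block_id], "interconnect"
        data = disk[block_id]                     # 3. last resort: shared disk
        self.buffer[block_id] = data
        return data, "disk"

disk = {"blk7": "row data"}
cluster = []
i1 = Instance("inst1", cluster)
i2 = Instance("inst2", cluster)
_, src1 = i1.read("blk7", disk)   # first read goes to the shared disk
_, src2 = i2.read("blk7", disk)   # second instance is served by its peer
```

The second instance avoids the disk entirely, which is exactly the synchronization-over-I/O that Cache Fusion sidesteps (the real protocol additionally tracks block ownership and write access, which this sketch omits).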
more techniques...
• either
– ... classical stuff!!!
• or
– MapReduce-like concepts (already discussed within this lecture)
Virtualization at Multiple Levels I
• Take application/workload characteristics into account
– Scenario 1
• different apps run on the same HW platform
– Scenario 2
• multiple instances of the same app (e.g., a DBMS) run on the same HW platform, but for different users
– Organize virtual machines based on application characteristics
• Lesson 1
– Group jobs with contrary resource requirements (CPU, I/O, …)
• Lesson 2
– Group jobs with contrary SLA requirements
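Lesson 1 can be sketched as a simple pairing rule; the job names, demand numbers, and the pairing heuristic are invented for the example:

```python
def pair_contrary(jobs):
    """jobs: list of (name, cpu_demand, io_demand) with demands in [0, 1].
    Sort by cpu - io and pair the most I/O-bound job with the most
    CPU-bound one, so each pair's combined demand stays balanced."""
    ordered = sorted(jobs, key=lambda j: j[1] - j[2])
    pairs = []
    while len(ordered) >= 2:
        io_bound = ordered.pop(0)     # smallest cpu - io
        cpu_bound = ordered.pop()     # largest cpu - io
        pairs.append((cpu_bound[0], io_bound[0]))
    return pairs, [j[0] for j in ordered]   # pairs plus a possible leftover

jobs = [("oltp", 0.9, 0.2), ("etl", 0.2, 0.9),
        ("report", 0.7, 0.4), ("backup", 0.3, 0.8)]
pairs, leftover = pair_contrary(jobs)
```

Here the CPU-bound OLTP job lands next to the I/O-bound ETL job, so neither resource on the shared host is doubly contended.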
Virtualization at Multiple Levels II
• Pros
– Utilization is raised while SLAs are ensured
– Classic advisor questions:
• Forecast of peaks and low-usage times?
• Online vs. offline advisor
• Placement and migration of jobs and virtual machines
• Shut down parts of the cloud when not needed (sustainability)
• Cons
– Highly dynamic environment – jobs can arrive at any time
– Quite a bit of information is needed about the executed job (which is why we concentrate on the DMA cloud)
Setup
[Figure: the layers of the setup – a DB layer (databases), an instance layer (instances), a DB server layer (DB servers), a VM layer (VMs), and a HW layer]
[Figure: the same layer stack with the VM barrier marked – the DB, instance, and DB server layers form the user's point of view, the VM and HW layers the provider's point of view; a DB-Cloud Design Advisor spans all layers]
Complex Optimization I
• Design Advisor Input
– Workload specifications and requirements
• I/O, CPU
• Degree of parallelism
• Availability, reliability, durability, scalability, …
– Service level agreements
• Different user classes (standard, premium, …; billing based on actual costs, resource usage, service quality)
• Throughput, response time, …
– Hardware specifications
– Constraints
• Migration constraints
• Configuration constraints
Complex Optimization II
• Design Advisor Goals
– Maximize utilization
– Minimize resource requirements
– Ensure service quality
– Minimize reorganization/migration overhead
• Design Advisor Actions
– Allocation
• VM – HW mapping (contrary VMs share the same hardware, …)
• DBS – VM mapping (parallelism, partitioning)
• DB/Instance – DBS mapping (workload balancing)
• Possible workload consolidation
– Configuration
• Physical DB design (indexes, buffer pools, …)
• VM configuration
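One of the allocation actions above, the VM-to-HW mapping, can be sketched as a simple bin-packing heuristic. The VM names, demands, and host capacity are invented for the example, and a real advisor would also fold in SLAs and migration constraints:

```python
def map_vms_to_hosts(vms, host_capacity):
    """vms: dict name -> (cpu, mem); host_capacity: (cpu, mem).
    First-fit-decreasing: pack the largest VMs first, opening a new host
    only when no existing host has room, which keeps the host count low."""
    cap_cpu, cap_mem = host_capacity
    hosts = []   # each entry: [vm_names, used_cpu, used_mem]
    # cpu + mem is a crude combined size metric for the decreasing order.
    for name, (cpu, mem) in sorted(vms.items(),
                                   key=lambda kv: -(kv[1][0] + kv[1][1])):
        for h in hosts:
            if h[1] + cpu <= cap_cpu and h[2] + mem <= cap_mem:
                h[0].append(name)
                h[1] += cpu
                h[2] += mem
                break
        else:
            hosts.append([[name], cpu, mem])
    return [h[0] for h in hosts]

# Illustrative demands (CPU cores, GB of memory) and host capacity.
vms = {"oltp": (8, 16), "dwh": (12, 48), "web": (2, 4), "batch": (6, 24)}
placement = map_vms_to_hosts(vms, host_capacity=(16, 64))
```

With these numbers four VMs fit on two hosts; the same skeleton extends to the DBS-to-VM and DB-to-DBS mappings by swapping the resource dimensions.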
Existing Tools
[Figure: tools mapped onto the layer stack – a DB Advisor for the physical DB layout (indexes, MQTs, MDC, redistribution of tables) at the DB and instance layers, a DB workload manager at the DB server layer, and VM tools (monitoring, resource configuration, (manual) migration) at the VM layer]
Layer Interactions I
• Static environment assumptions
– The advisor expects a static hardware environment
– The VM expects static (peak) resource requirements
– Interactions between layers can improve performance/utilization
[Figure: interaction between the DB advisor and the VM configuration]
Layer Interactions II
• Experiment
– DB2 on Linux
– TPC-H workload on a 1 GB database
– Ranges for resource grants
• Main memory (BP) – 50 MB to 1 GB
• Additional storage (indexes) – 5% to 30% of the DB size
– Varying advisor output (17–26 indexes)
• Different possible improvement
• Different expected performance after improvement
[Figures: possible improvement and expected performance after improvement, plotted over buffer pool size (200 MB to 1 GB) and index storage (5% to 25% of the DB size), with improvements ranging from <1% to 90%]
Scenario – Consolidate Workload
[Figure: the layer stack with two databases to be consolidated – one CPU-bound with a strict SLA, one I/O-bound with a loose SLA]
Workload Consolidation Experiment I
• Experiment
– DB2 on Linux VMs
– 10 GB TPC-H database
– 2 different workloads, executed simultaneously
• Long-running, I/O-intensive query
• Short-running, index-accessing query
– Query throughput is measured
– VM memory configuration varies
[Figure: queries Q1 and Q2 running on separate DBS/VM stacks vs. consolidated onto one]
Workload Consolidation Experiment II
[Figures: short-query throughput (≈320,000–380,000 queries per minute) and long-query throughput (≈34–38 queries per minute) on the 10 GB TPC-H database, comparing 1 VM vs. 2 VMs under memory configurations of 512 MB or 1024 MB, 768 MB or 1536 MB, and 1024 MB or 2048 MB]
Conclusion I
• How shall virtualization be handled on …
– Machine level (VM to HW)
– DBMS level (database to instance to database server)
• … using …
– Allocation between layers
– Configuration inside layers
• … when …
– Characteristics of the workloads are known
– Virtual machines are transparent
• … demanding that …
– SLAs are respected
– Each node's utilization is maximized
– The number of nodes is minimized
Conclusion II
• Virtualization
– Powerful mechanism for improving computing infrastructure
– Adopted by a wide range of organizations
• Database systems are increasingly being run in virtualized environments
– Significantly changes the operating environment
– At the same time, can be very useful
– Many opportunities for database researchers