Scaling MySQL in the Cloud

Moshe Shadmon, ScaleDB

Shared Disk vs. Shared Nothing

[Figure: a shared-nothing layout (masters and slaves) shown beside a shared-disk layout.]

Shared Disk Advantages

• Start small, grow incrementally
• Scalable AND highly available
• Add capacity on demand with zero downtime
• Simplicity
  – No need to partition data
  – No need for master-slave

The Virtualized Cloud Database

[Figure: Server 1 and Server 2 each host several VMs, and every VM runs an OSS DBMS with ScaleDB. Other labels recovered from the diagram: MySQL Server, Storage Engine, Local Disk, Shared Nothing, Shared Storage, Shared Disk.]

ScaleDB As the Storage Engine

[Figure: the ScaleDB Storage Engine sits at the storage engine level, beneath the MySQL Server at the database management level.]

ScaleDB's Internal Architecture

[Figure: component diagram. Labels recovered from the diagram: ScaleDB Node, ScaleDB API, Transaction Manager, Index Manager, Data Manager, Local Lock Manager, Log Manager, Recovery Manager, Storage Manager, Buffer Manager, Local Sync Coordinator, Threads Manager; ScaleDB Cluster Manager, Global Recovery Manager, Global Sync Manager, Global Lock Manager; ScaleDB Storage System, Cache & Storage Devices.]

Deploying ScaleDB

[Figure: three layers. Application Layer: the applications. Database Layer (physical or VM nodes): Node 1, Node 2, ... Node N, each running a DBMS with ScaleDB, together with the ScaleDB Cluster Manager. Storage Layer: shared storage.]

The Storage Engine

• Pluggable storage engine
  – Transactional storage engine
  – Supports the MySQL Storage Engine API
  – Reads and writes go over the network to shared storage
  – Maintains a local cache
  – Local Lock Manager – manages locking at the node level
  – Connector to the Cluster Manager – synchronizes operations at the cluster level


The Cluster Manager

• Distributed Lock Manager – manages cluster-level locks
  – Locks can be held over any type of resource:
    • DBMS, table, partition, file, block, row, etc.
  – Supports multiple lock modes:
    • Read, read/write, exclusive, etc.
  – Synchronizes state using messaging
• Local Lock Manager – manages locks at the node level
  – Maintains locks at the node level
  – Synchronizes state using shared memory
• Identifies node failures and manages recovery
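The lock modes and resource types listed above can be pictured as a lock table with a compatibility rule. The following is only a minimal sketch: the slides do not define the semantics of each mode, so the rule used here (only read is compatible with read) and the names `Mode`, `LockTable`, and `compatible` are assumptions made for illustration.

```python
from collections import defaultdict
from enum import Enum


class Mode(Enum):
    READ = "read"
    READ_WRITE = "read/write"
    EXCLUSIVE = "exclusive"


def compatible(held: Mode, requested: Mode) -> bool:
    # Assumed rule: two readers can share a resource; every other pairing conflicts.
    return held is Mode.READ and requested is Mode.READ


class LockTable:
    """Tracks which node holds which mode on each resource (hypothetical)."""

    def __init__(self):
        # (resource_type, resource_id) -> {node: mode}
        self.holders = defaultdict(dict)

    def acquire(self, node, resource, mode):
        others = {n: m for n, m in self.holders[resource].items() if n != node}
        if all(compatible(m, mode) for m in others.values()):
            self.holders[resource][node] = mode
            return True
        return False  # conflict: the caller would have to wait or retry

    def release(self, node, resource):
        self.holders[resource].pop(node, None)


table = LockTable()
row = ("row", "calls:42")  # a resource can be a DBMS, table, partition, file, block or row
results = [
    table.acquire("node1", row, Mode.READ),
    table.acquire("node2", row, Mode.READ),
    table.acquire("node3", row, Mode.EXCLUSIVE),
]
table.release("node1", row)
table.release("node2", row)
results.append(table.acquire("node3", row, Mode.EXCLUSIVE))
```

Keying the table by a (type, id) pair lets the same structure hold locks at any granularity, from a whole DBMS down to a single row.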


The Cluster Manager

• Distributed Lock Manager
  – Synchronizes conflicting processes between nodes in the cluster
    • Example: two nodes need to update the same resource at the same time.
  – The challenge:
    • Requests go over the network, which can be expensive:
      – Internal operations may take nanoseconds, while network operations take milliseconds
  – The solution:
    • Requests are sent only when conflicts occur
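One possible reading of "requests are sent only when conflicts occur" is that a node keeps a grant it has already received and contacts the cluster-level manager again only after another node has taken that grant away. The sketch below illustrates that reading; the classes, the `messages` counter, and the revoke step are hypothetical and are not taken from ScaleDB.

```python
class GlobalLockManager:
    """Hypothetical cluster-level manager: remembers which node owns each resource."""

    def __init__(self):
        self.owner = {}
        self.messages = 0  # counts simulated network requests

    def request(self, node, resource):
        self.messages += 1
        previous = self.owner.get(resource)
        if previous is not None and previous is not node:
            previous.revoke(resource)  # conflict: take the grant back
        self.owner[resource] = node


class LocalLockManager:
    """Hypothetical node-level manager that keeps grants it already holds."""

    def __init__(self, name, global_mgr):
        self.name = name
        self.global_mgr = global_mgr
        self.granted = set()

    def lock(self, resource):
        if resource in self.granted:
            return "local"  # no network traffic needed
        self.global_mgr.request(self, resource)
        self.granted.add(resource)
        return "network"

    def revoke(self, resource):
        self.granted.discard(resource)


glm = GlobalLockManager()
a = LocalLockManager("A", glm)
b = LocalLockManager("B", glm)

trace = [a.lock("block-7") for _ in range(3)]  # only the first call uses the network
trace.append(b.lock("block-7"))                # conflict: B takes the grant from A
trace.append(a.lock("block-7"))                # A has to ask again
```

In this simplified model the revoke is a direct method call; in a real cluster it would itself be a network message, so the counter understates the cost of a conflict.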


The Storage

• Independent storage nodes
  – Accessible via the network
  – Each node has a cache layer and a persistent layer
  – Database nodes can force a write to disk when a transaction requires it
  – Data can be distributed over multiple storage nodes
  – Each storage node can be mirrored
  – Each storage node may have a hot backup node


The Storage Node

[Figure: a storage node, made up of an interface to storage, an LRU-based cache, and disks.]

– Manages the data in cache and flushes it to disk when required.
– Supports the storage engine calls for Read, Write, etc.
– Supports calls pushed down from the storage engine, such as Count Rows, Search, etc.
– Each node is a Linux machine. No need for Network File System (NFS).
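The cache behaviour described above (LRU-based, flushed to disk when required, with database nodes able to force a write) can be sketched as a small write-back cache. This is an illustrative model only: the class `StorageNode`, its `force` flag, and the dictionary standing in for the disks are assumptions, not ScaleDB's actual interface.

```python
from collections import OrderedDict


class StorageNode:
    """Hypothetical storage node: an LRU write-back cache in front of a 'disk' dict."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.cache = OrderedDict()  # block_id -> (data, dirty), least recently used first
        self.disk = {}

    def read(self, block_id):
        if block_id in self.cache:
            self.cache.move_to_end(block_id)  # mark as most recently used
            return self.cache[block_id][0]
        data = self.disk[block_id]
        self._put(block_id, data, dirty=False)
        return data

    def write(self, block_id, data, force=False):
        self._put(block_id, data, dirty=True)
        if force:  # the database node requires the write to reach disk now
            self.flush(block_id)

    def flush(self, block_id):
        data, dirty = self.cache[block_id]
        if dirty:
            self.disk[block_id] = data
            self.cache[block_id] = (data, False)

    def _put(self, block_id, data, dirty):
        self.cache[block_id] = (data, dirty)
        self.cache.move_to_end(block_id)
        while len(self.cache) > self.capacity:
            victim, (vdata, vdirty) = self.cache.popitem(last=False)
            if vdirty:  # dirty blocks are written out before they leave the cache
                self.disk[victim] = vdata


node = StorageNode(capacity=2)
node.write("b1", "alpha")              # cached, not yet on disk
node.write("b2", "beta", force=True)   # cached and flushed immediately
node.write("b3", "gamma")              # evicts b1, which is written to disk on the way out
value = node.read("b1")                # read back from disk, evicting the clean block b2
```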

Scaling the Storage Tier

[Figure: database nodes 1 to N, each a DBMS with ScaleDB and a local cache, together with the ScaleDB Cluster Manager, connect over TCP/UDP to a storage layer of several nodes that each have a cache. Labels recovered from the diagram: Local Cache, Cache, TCP/UDP, Global Cache, Storage Layer.]

Global Cache

• Guarantees cache coherency
• Manages caching of shared data
• Minimizes access time to data that is not in the local cache and would otherwise be read from disk
• Implements fast direct memory access over high-speed interconnects for all data blocks and types
• Uses an efficient and scalable messaging protocol

HA of the Storage Tier

[Figure: database nodes 1 to N, each a DBMS with ScaleDB, together with the ScaleDB Cluster Manager, sit above a storage layer made up of shared storage, mirrored storage, and a hot backup.]

[Figure: the same database tier above partitioned storage. Partition 1, Partition 2, ... Partition Q each consist of partitioned storage, a partitioned mirror, and a partitioned hot backup.]


[Figure: Node N in the database layer (physical or VM nodes) runs MySQL with ScaleDB and a local cache, alongside the ScaleDB Cluster Manager. Below it, main and mirror storage nodes each pair a cache with storage.]

• Read
  – From the local cache
  – From main or mirror
    • Get from cache
    • Get from storage
• Write
  – To the local cache
  – At end of transaction
    • Multicast to main and mirror
    • Optional acknowledgement:
      – After receive
      – After write
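The read and write paths above can be traced in a small model. This is a sketch under stated assumptions: `StorageCopy` and `DatabaseNode` are hypothetical names, a loop over the two copies stands in for the multicast, and the two `ack` values mirror the "after receive" and "after write" options without modelling the later background flush.

```python
class StorageCopy:
    """Hypothetical main or mirror storage node: a cache in front of storage."""

    def __init__(self):
        self.cache = {}
        self.storage = {}

    def get(self, key):
        if key in self.cache:        # get from cache
            return self.cache[key]
        value = self.storage[key]    # get from storage
        self.cache[key] = value
        return value

    def receive(self, key, value, ack):
        self.cache[key] = value
        if ack == "after_write":
            self.storage[key] = value  # persist before acknowledging
        return True                    # the acknowledgement


class DatabaseNode:
    def __init__(self, main, mirror):
        self.local_cache = {}
        self.pending = {}  # writes of the open transaction
        self.main, self.mirror = main, mirror

    def read(self, key):
        if key in self.local_cache:   # from the local cache
            return self.local_cache[key]
        value = self.main.get(key)    # from main (the mirror would serve equally)
        self.local_cache[key] = value
        return value

    def write(self, key, value):
        self.local_cache[key] = value  # writes stay local until commit
        self.pending[key] = value

    def commit(self, ack="after_receive"):
        acks = []
        for key, value in self.pending.items():
            for copy in (self.main, self.mirror):  # stands in for a multicast
                acks.append(copy.receive(key, value, ack))
        self.pending.clear()
        return all(acks)


main, mirror = StorageCopy(), StorageCopy()
node = DatabaseNode(main, mirror)
node.write("k", "v1")
visible_before_commit = "k" in main.cache
committed = node.commit(ack="after_write")

other = DatabaseNode(main, mirror)
value = other.read("k")  # misses the local cache, served from main's cache
```

Acknowledging after receive trades durability for latency, since the sender continues once both copies hold the data in cache; acknowledging after write waits until the data is persisted.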


Traditional Query Processing

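[Figure: the question "What were yesterday's sales?" reaches the DBMS server, which asks the storage array to "get the sales table". The storage array retrieves the entire sales table, and the DBMS server processes the table data.]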

ScaleDB Query Processing

[Figure: the same question, "What were yesterday's sales?", reaches the DBMS server, which asks the storage nodes to "get October 15 sales".]

• Advantages
  – Parallel processing:
    • I/O calls are executed simultaneously on multiple storage nodes.
    • Logic is pushed to the storage layer:
      "SELECT customer_name FROM calls WHERE amount > 200"
      • Traditional approach – return all rows to the database
      • ScaleDB storage – return selected rows to the database
  – Leverages cache on multiple storage nodes
  – Storage layer can be expanded without downtime
  – Data is mirrored
  – Support for hot backup
  – Low cost
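The difference between the two approaches for the query above can be shown with a toy model. The rows, the three-node layout, and the function names are invented for illustration; a thread pool stands in for storage nodes working in parallel.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical rows of the "calls" table, spread over three storage nodes.
STORAGE_NODES = [
    [{"customer_name": "Ann", "amount": 250}, {"customer_name": "Bob", "amount": 90}],
    [{"customer_name": "Cy", "amount": 300}],
    [{"customer_name": "Dee", "amount": 120}, {"customer_name": "Eve", "amount": 201}],
]


def over_200(row):
    return row["amount"] > 200


def scan_all(node_rows):
    """Traditional approach: the storage returns every row."""
    return list(node_rows)


def scan_filtered(node_rows):
    """Pushed-down approach: the storage node filters and projects locally."""
    return [row["customer_name"] for row in node_rows if over_200(row)]


with ThreadPoolExecutor(max_workers=len(STORAGE_NODES)) as pool:
    # Traditional: ship all rows, then filter on the DBMS server.
    shipped = [row for part in pool.map(scan_all, STORAGE_NODES) for row in part]
    traditional = [row["customer_name"] for row in shipped if over_200(row)]

    # Pushed down: each node scans in parallel and ships only the matches.
    pushed = [name for part in pool.map(scan_filtered, STORAGE_NODES) for name in part]
```

Both paths produce the same answer, but the traditional path moves five rows across the network where the pushed-down path moves three names.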

High Availability

• Failure of a node
  – Detected by the Cluster Manager
    • A surviving node is requested to undo uncommitted transactions
• Failure of the Cluster Manager
  – Detected by the Standby Cluster Manager
    • Requests all nodes to undo uncommitted transactions
• Failure of a Storage Node
  – Continue with a mirrored storage, or
  – Use the Storage Node Log to recover


Performance / Tuning

• Contention occurs when two or more nodes want the same resource at the same time
• Types of contention:
  – Read/Read contention – never a problem because of the shared disk system
  – Read/Write contention – the reader is asked to release the block and the grant is given to the writer
  – Write/Read or Write/Write contention:
    • The writer sends the block to the global cache layer
    • A buffer invalidate message is sent to the other nodes
    • The requestor receives the grant
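The three steps of the write/write case can be walked through in a toy model. Everything here is an assumption made for illustration: the `Cluster` class merges the global cache and the grant table into one object, blocks are plain integers, and messages are direct method calls.

```python
class Cluster:
    """Hypothetical coordinator combining the global cache and the grant table."""

    def __init__(self):
        self.global_cache = {}  # block_id -> latest data
        self.writer = {}        # block_id -> node holding the write grant
        self.nodes = []

    def request_write(self, requestor, block_id):
        holder = self.writer.get(block_id)
        if holder is not None and holder is not requestor:
            # 1. The current writer sends its block to the global cache layer.
            self.global_cache[block_id] = holder.buffers[block_id]
        # 2. A buffer invalidate message goes to the other nodes.
        for node in self.nodes:
            if node is not requestor:
                node.buffers.pop(block_id, None)
        # 3. The requestor receives the grant and starts from the latest copy.
        self.writer[block_id] = requestor
        if block_id in self.global_cache:
            requestor.buffers[block_id] = self.global_cache[block_id]


class Node:
    def __init__(self, cluster):
        self.buffers = {}
        self.cluster = cluster
        cluster.nodes.append(self)

    def update(self, block_id, change):
        if self.cluster.writer.get(block_id) is not self:
            self.cluster.request_write(self, block_id)
        self.buffers[block_id] = self.buffers.get(block_id, 0) + change


cluster = Cluster()
a, b = Node(cluster), Node(cluster)
a.update("blk", 5)  # A gets the grant and writes 5
b.update("blk", 3)  # A's copy goes to the global cache, A is invalidated, B continues from 5
```

Because the second writer starts from the copy in the global cache rather than from disk, the first writer's change is not lost and no disk read is needed.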


Performance / Tuning

• Fast network between the nodes
  – 2 logical networks:
    • Between the database nodes and the Cluster Manager
    • Between the database nodes and the storage
  – Optimize socket receive buffers (256 KB – 1 MB)
• Partition requests to maintain locality of data
  – Send requests that update/query the same data to the same node
    • By database
    • By table
    • By table with PK
  – Logic can change dynamically to adapt to changes
    • Changes in data distribution
    • Changes in user behavior
    • Additional DBMS nodes
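Routing requests for locality can be as simple as hashing a key built at the chosen granularity. The function below is a hypothetical sketch of such a router placed in front of the DBMS nodes; the name `route`, the `level` parameter, and the node names are assumptions, and the slides do not say how ScaleDB deployments actually route requests.

```python
import zlib


def route(nodes, database, table=None, pk=None, level="table_pk"):
    """Pick a DBMS node so that requests for the same data land on the same node."""
    if level == "database":
        key = database
    elif level == "table":
        key = f"{database}.{table}"
    else:  # "table_pk": by table with primary key
        key = f"{database}.{table}:{pk}"
    # crc32 is stable across processes, unlike Python's built-in hash() for strings.
    return nodes[zlib.crc32(key.encode("utf-8")) % len(nodes)]


nodes = ["node1", "node2", "node3"]

# The same row is always routed to the same node.
first = route(nodes, "shop", "calls", pk=42)
second = route(nodes, "shop", "calls", pk=42)

# At table level, every request for the table shares one node, whatever the key.
by_table_a = route(nodes, "shop", "calls", pk=1, level="table")
by_table_b = route(nodes, "shop", "calls", pk=2, level="table")

# Adding a DBMS node changes the mapping; the router only needs the new list.
after_growth = route(nodes + ["node4"], "shop", "calls", pk=42)
```

A plain modulo remaps many keys when the node list changes; a consistent-hashing scheme would move fewer, at the cost of more bookkeeping.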

ScaleDB: Elastic/Enterprise Database

Function                                             SimpleDB       RDS  ScaleDB
Transactions                                         No             Yes  Yes
Joins                                                No             Yes  Yes
Data Consistency                                     No (Eventual)  Yes  Yes
SQL Support                                          No             Yes  Yes
ACID Compliant                                       No             Yes  Yes
Supports MySQL applications without modification     No             Yes  Yes
Dynamic Elasticity (w/o interruption)                Yes            No   Yes
High-Availability                                    Yes            No   Yes
Eliminates Partitioning                              Yes            No   Yes
Eliminates possible 5-minute data loss upon failure  Yes            No   Yes

Value Proposition

• Runs on low-cost cloud infrastructures (e.g. Amazon)

• High-availability, no single point of failure

• Dramatically easier set-up & maintenance
  – No partitioning/repartitioning
  – No slave and replication headaches
  – Simplified tuning

• Scales up/down without interrupting your application

• Lower TCO
