Samuel Madden, MIT CSAIL
Director, Intel ISTC in Big Data

Schism: Graph Partitioning for OLTP Databases in a Relational Cloud
Implications for the design of GraphLab

GraphLab Workshop 2012
The Problem with Databases

• Tend to proliferate inside organizations
  – Many applications use DBs
• Tend to be given dedicated hardware
  – Often not heavily utilized
• Don't virtualize well
• Difficult to scale

This is expensive & wasteful:
– Servers, administrators, software licenses, network ports, racks, etc.
RelationalCloud Vision

• Goal: a database service that exposes a self-serve usage model
  – Rapid provisioning: users don't worry about DBMS & storage configurations

Example:
• User specifies type and size of DB and SLA ("100 txns/sec, replicated in US and Europe")
• User is given a JDBC/ODBC URL
• System figures out how & where to run the user's DB & queries
Before: Database Silos and Sprawl

[Figure: Applications #1–#4, each with its own dedicated database and its own costs ($$)]

• Must deal with many one-off database configurations
• And provision each for its peak load
After: A Single Scalable Service

[Figure: Apps #1–#4 all served by one shared database service]

• Reduces server hardware by aggressive workload-aware multiplexing
• Automatically partitions databases across multiple HW resources
• Reduces operational costs by automating service management tasks
What about virtualization?

• Could run each DB in a separate VM
• Existing database services (Amazon RDS) do this
  – Focus is on simplified management, not performance
• Doesn't provide scalability across multiple nodes
• Very inefficient

[Chart: max throughput with 20:1 consolidation, us vs. VMware ESXi, under two workloads: one DB 10x loaded, and all DBs equally loaded]
Key Ideas in this Talk: Schism

• How to automatically partition transactional (OLTP) databases in a database service
• Some implications for GraphLab
System Overview

[Figure: system architecture; Schism is the partitioning component]

Not going to talk about:
- Database migration
- Security
- Placement of data
This is your OLTP database
(Curino et al., VLDB 2010)

This is your OLTP database on Schism
Schism

A new graph-based approach to automatically partition OLTP workloads across many machines.

Input: a trace of transactions and the DB
Output: a partitioning plan

Results: as good as or better than the best manual partitioning

Static partitioning – not automatic repartitioning.
Challenge: Partitioning

Goal: linear performance improvement when adding machines
Requirement: independence and balance

Simple approaches (sketched below):
• Total replication
• Hash partitioning
• Range partitioning
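A minimal sketch of the latter two baselines, assuming integer keys; the partition count and split points are illustrative, not from the talk:

```python
N_PARTITIONS = 4

def hash_partition(key: int) -> int:
    """Spreads keys uniformly, but tuples used together land on different nodes."""
    return hash(key) % N_PARTITIONS

# Hypothetical split points: partition i holds keys below RANGE_BOUNDS[i].
RANGE_BOUNDS = [1000, 2000, 3000]

def range_partition(key: int) -> int:
    """Keeps nearby keys together, but is prone to skew if one range is hot."""
    for i, bound in enumerate(RANGE_BOUNDS):
        if key < bound:
            return i
    return len(RANGE_BOUNDS)

for k in (42, 1042, 2042, 9999):
    print(k, hash_partition(k), range_partition(k))
```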
Partitioning Challenges

Transactions access multiple records?
→ Distributed transactions, replicated data

Workload skew?
→ Unbalanced load on individual servers

Many-to-many relations?
→ Unclear how to partition effectively
Many-to-Many: Users/Groups

[Figure: example of a many-to-many users/groups relationship, built up over several steps]
Distributed Txn Disadvantages

Require more communication:
at least 1 extra message, maybe more

Hold locks for a longer time:
increases the chance of contention

Reduced availability:
failure if any participant is down
Example

Each transaction writes two different tuples.

Single partition: both tuples on 1 machine
Distributed: the 2 tuples on 2 machines

The same issue would arise in distributed GraphLab.
Schism Overview

1. Build a graph from a workload trace
   – Nodes: tuples accessed by the trace
   – Edges: connect tuples accessed within a transaction
2. Partition to minimize distributed txns
   – Idea: min-cut minimizes distributed txns
3. "Explain" the partitioning in terms of the DB
Building a Graph

[Figure, built up step by step: each tuple accessed by the trace becomes a node; each transaction adds edges among the tuples it accesses]
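A minimal sketch of step 1 (my illustration, not the authors' code), assuming a trace where each transaction is simply the set of tuple IDs it accessed:

```python
from itertools import combinations
import networkx as nx

# Hypothetical trace: each entry is the set of tuples one transaction touched.
trace = [
    {"users:1", "users:2"},
    {"users:1", "users:2"},
    {"users:4", "users:5"},
    {"users:2", "users:4"},
]

def build_graph(trace):
    g = nx.Graph()
    for txn in trace:
        # Connect every pair of tuples co-accessed by this transaction;
        # the edge weight counts how many transactions would become
        # distributed if the edge were cut.
        for u, v in combinations(sorted(txn), 2):
            if g.has_edge(u, v):
                g[u][v]["weight"] += 1
            else:
                g.add_edge(u, v, weight=1)
    return g

print(build_graph(trace).edges(data=True))
```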
Replicated Tuples

[Figure: how replicated tuples are represented in the graph; details under "Replicated Data" in the backup slides]
Partitioning

Use the METIS graph partitioner:
min-cut partitioning with a balance constraint

Node weight:
• # of accesses → balance workload
• data size → balance data size

Output: assignment of nodes to partitions
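The talk uses METIS; as a self-contained stand-in, here is a sketch using networkx's Kernighan–Lin heuristic, which does a balanced two-way min-cut-style split. A k-way METIS call (e.g., via the pymetis bindings) with explicit vertex weights would be closer to the slide:

```python
import networkx as nx
from networkx.algorithms.community import kernighan_lin_bisection

g = nx.Graph()
# Hypothetical co-access graph: edge weight = # txns touching both tuples.
g.add_weighted_edges_from([
    ("users:1", "users:2", 2),
    ("users:4", "users:5", 1),
    ("users:2", "users:4", 1),
])

# Two-way split balancing node counts while minimizing cut edge weight.
part_a, part_b = kernighan_lin_bisection(g, weight="weight")
cut = sum(d["weight"] for u, v, d in g.edges(data=True)
          if (u in part_a) != (v in part_a))
print("partition A:", sorted(part_a))
print("partition B:", sorted(part_b))
print("cut weight (~ distributed txns):", cut)
```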
Graph Size Reduction Heuristics

Coalescing: tuples always accessed together → a single node (lossless; sketched below)

Blanket statement filtering: remove statements that access many tuples

Sampling: use a subset of tuples or transactions
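A minimal sketch of lossless coalescing (my illustration): tuples that appear in exactly the same set of transactions can be merged into one graph node without changing any cut.

```python
from collections import defaultdict

# Hypothetical trace, as before: sets of tuple IDs per transaction.
trace = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "b", "d"},
]

def coalesce(trace):
    # Key each tuple by the frozen set of transactions that access it.
    txn_ids = defaultdict(set)
    for i, txn in enumerate(trace):
        for t in txn:
            txn_ids[t].add(i)
    groups = defaultdict(list)
    for t, ids in txn_ids.items():
        groups[frozenset(ids)].append(t)
    return list(groups.values())

print(coalesce(trace))  # 'a' and 'b' always co-occur -> one merged node
```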
Explanation Phase

Goal: compact rules to represent the partitioning

Classification problem: tuple attributes → partition mappings

Users table                              Partition
id  name   position     salary
4   Carlo  Post Doc.    $20,000          1
2   Evan   PhD Student  $12,000          2
5   Sam    Professor    $30,000          1
1   Yang   PhD Student  $10,000          2
Decision Trees

A machine learning tool for classification

Candidate attributes: attributes used in WHERE clauses
Output: predicates that approximate the partitioning

Users table                              Partition
id  name   position     salary
4   Carlo  Post Doc.    $20,000          1
2   Evan   PhD Student  $12,000          2
5   Sam    Professor    $30,000          1
1   Yang   PhD Student  $10,000          2

IF (Salary > $12,000) → P1
ELSE → P2
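A minimal sketch (my illustration) of learning that rule with an off-the-shelf decision tree; the training data is the slide's four-row example:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Attribute -> partition label, from the Users example above.
salaries = [[20000], [12000], [30000], [10000]]
partitions = [1, 2, 1, 2]

tree = DecisionTreeClassifier(max_depth=2).fit(salaries, partitions)
print(export_text(tree, feature_names=["salary"]))
# Learns a threshold between $12,000 and $20,000 separating P2 from P1,
# matching the slide's rule IF (Salary > $12,000) -> P1 ELSE P2.
```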
Evaluation: Partitioning Strategies

Schism: plan produced by our tool
Manual: best plan found by experts
Replication: replicate all tables
Hashing: hash partition all tables
Benchmark Results: Simple

[Chart: % distributed transactions (y-axis 0–100%) for YahooBench-A and YahooBench-E, comparing Schism, Manual, Replication, and Hashing]
Benchmark Results: TPC

[Chart: % distributed transactions (y-axis 0–100%) on the TPC benchmarks, comparing Schism, Manual, Replication, and Hashing]
Benchmark Results: Complex

[Chart: % distributed transactions (y-axis 0–100%) on the complex benchmarks, comparing Schism, Manual, Replication, and Hashing]
Implications for GraphLab (1)
• Shared architectural components for placement, migration, security, etc.
• Would be great to look at building a database-like store as a backing engine for GraphLab
Implications for GraphLab (2)

• Data-driven partitioning
  – Can co-locate data that is accessed together
• Edge weights can encode the frequency of reads/writes from adjacent nodes
  – Adaptively choose between replication and distribution depending on read/write frequency (see the sketch below)
  – Requires a workload trace and periodic repartitioning
  – If accesses are random, will not be a win
  – Requires heuristics to deal with massive graphs, e.g., ideas from GraphBuilder
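A minimal sketch of that adaptive choice under an assumed cost model (the `read_bias` knob and thresholds are hypothetical, not from the talk): replication makes reads local but turns every write into a distributed operation.

```python
def placement(reads: int, writes: int, read_bias: float = 2.0) -> str:
    """Replicate read-mostly data; keep write-heavy data on one partition.
    `read_bias` is a hypothetical knob for how much a local read saves
    relative to the cost of a distributed write."""
    if reads > read_bias * writes:
        return "replicate"
    return "single-home"

print(placement(reads=100, writes=3))   # replicate
print(placement(reads=10, writes=40))   # single-home
```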
Implications for GraphLab (3)

• Transactions and 2PC for serializability
  – Acquire locks as data is accessed, rather than acquiring read/write locks on all neighbors in advance
  – Introduces the possibility of deadlock
  – Likely a win if adjacent updates are infrequent, or if not all neighbors are accessed on each iteration
  – Could also be implemented using optimistic concurrency control schemes (see the sketch below)
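A minimal sketch (my illustration, not a GraphLab API) of lock-as-you-go with a timeout as a crude deadlock breaker: abort and retry instead of waiting forever.

```python
import threading

# Hypothetical per-vertex locks.
locks = {v: threading.Lock() for v in ["v1", "v2", "v3"]}

def update_vertices(vertices, apply_fn, timeout=0.1):
    held = []
    try:
        for v in vertices:                       # lock only what we touch
            if not locks[v].acquire(timeout=timeout):
                return False                     # possible deadlock: abort
            held.append(v)
        for v in vertices:
            apply_fn(v)                          # all touched vertices locked
        return True
    finally:
        for v in held:
            locks[v].release()

ok = update_vertices(["v1", "v3"], lambda v: print("updated", v))
print("committed" if ok else "aborted; caller should retry")
```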
Schism

Automatically partitions OLTP databases as well as or better than experts.

Graph partitioning combined with decision trees finds good partitioning plans for many applications.

Suggests some interesting directions for distributed GraphLab; would be fun to explore!
Graph Partitioning Time
Collecting a Trace

Need a trace of statements and transaction IDs (e.g., the MySQL general_log)

Extract read/write sets by rewriting statements into SELECTs

Can be applied offline: some data lost
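A minimal sketch (my illustration) of the rewriting step for one simple statement shape; running the rewritten SELECT yields the IDs the original UPDATE would write:

```python
import re

def rewrite_to_select(stmt: str) -> str:
    """Handles only the simple UPDATE ... SET ... WHERE ... shape."""
    m = re.match(r"UPDATE\s+(\w+)\s+SET\s+.+?\s+WHERE\s+(.+)", stmt,
                 re.I | re.S)
    if not m:
        raise ValueError("unsupported statement shape")
    table, where = m.groups()
    return f"SELECT id FROM {table} WHERE {where}"

print(rewrite_to_select("UPDATE users SET salary = 0 WHERE name = 'Evan'"))
# -> SELECT id FROM users WHERE name = 'Evan'
```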
Effect of Latency
Replicated Data

Read: access the local copy
Write: write all copies (a distributed txn)

In the graph:
• Add n + 1 nodes for each tuple, where n = # of transactions accessing the tuple
• Connected as a star with edge weight = # of writes
• Cutting a replication edge: cost = # of writes
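A minimal sketch (my illustration) of that star construction; wiring each replica node to the rest of its transaction's tuples is left as a stub:

```python
import networkx as nx

def add_replicable_tuple(g, tuple_id, accessing_txns, n_writes):
    """n + 1 nodes: one replica per accessing txn plus a hub, joined in a
    star whose edges carry the tuple's write count, so cutting a star edge
    costs the writes that would have to reach the separated copy."""
    hub = f"{tuple_id}/hub"
    for txn in accessing_txns:
        replica = f"{tuple_id}/r{txn}"
        g.add_edge(hub, replica, weight=n_writes)
        # Hypothetical: also link `replica` to the other tuples of txn here.
    return hub

g = nx.Graph()
add_replicable_tuple(g, "users:2", accessing_txns=[0, 1, 3], n_writes=2)
print(g.edges(data=True))
```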
Partitioning Advantages

Performance:
• Scale across multiple machines
• More performance per dollar
• Scale incrementally

Management:
• Partial failure
• Rolling upgrades
• Partial migrations