Key-Key-Value Stores for Efficiently Processing Graph Data in the Cloud
Alexander G. Connor
Panos K. Chrysanthis
Alexandros Labrinidis
Advanced Data Management Technologies Laboratory
Department of Computer Science
University of Pittsburgh
Data in Social Networks
• A social network manages user profiles, updates and connections
• How to manage this data in a scalable way?
• Key-value stores offer performance under high load
• Some observations about social networks:
  • A profile view usually includes data from a user's friends (spatial locality)
  • A friend's profile is often visited next (temporal locality)
  • Requests might ask for updates from several users
  • Web pages might include pieces of several user profiles
  • A single request requires connecting to many machines
Leveraging Locality
• Can we take advantage of the connections?
• What if we stored connected users' profiles and data in the same place?
  • Locality can be leveraged
  • The number of connections is reduced
  • User data can be pre-fetched
• We can think of this as a graph partitioning problem (formalized below)
  • Partitions = machines
  • Vertices = user profiles, including updates
  • Edges = connections
  • Objective: minimize the number of edges that cross partitions
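Stated formally (notation ours, not from the slides): with an assignment p of vertices to machines, the objective is the standard minimum edge-cut:

```latex
\min_{p\,:\,V \to \{1,\dots,k\}} \Big| \{\, (u,v) \in E : p(u) \neq p(v) \,\} \Big|
```

That is, minimize the number of edges whose endpoints land on different machines, with the machines staying reasonably balanced.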
Example – graph partitioning

A poor partitioning:
• Many edges cross partitions
• Accessing a vertex's neighbors requires accessing many partitions
• In a social network, requesting updates from followed users requires connecting to many machines

A good partitioning:
• Far fewer edges cross partitions
• Accessing a vertex's neighbors requires accessing few partitions
• In a social network, fewer connections are made and related user data can be pre-fetched
Key-Key-Value Stores
• Our proposed approach: extend the key-value model
• Data can be stored as key-values
  • User profiles
• Data can also be stored as key-key-values
  • User connections, e.g. "Alice follows Bob"
• Use key-key-values to compute locality
  • On-line graph partitioning algorithm
  • Assign keys to grid locations based on connections
  • Each grid cell represents a data host
  • Keys that are related are kept together
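A minimal sketch of the two record kinds in Java (the prototype's language, per the final slide); the type and field names are our own, not from the paper:

```java
// Plain key-value: a user profile (or update) stored under its key.
record KeyValue(String key, byte[] value) {}

// Key-key-value: a labeled connection between two keys, e.g.
// new KeyKeyValue("alice", "bob", "follows") for "Alice follows Bob".
record KeyKeyValue(String fromKey, String toKey, String label) {}
```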
Outline
• Introduction
  • Data in Social Networks
  • Leveraging Locality
  • Key-Key-Value Stores
• System Model
  • Client API
  • Adding a Key-Key-Value
  • Load management
• On-line partitioning algorithm
  • Simulation Parameters
  • Results
• Conclusion
System Model
[Diagram: layered architecture — Application Sessions, Address Table, Virtual hosts, Physical hosts]
Physical Layer: physical machines
• Can be added or removed dynamically as demands change
Logical Layer: virtual machines
• Organized as a square grid
• Run the KKV store software
• Manage replication
• Can be moved between physical machines as needed
Address Table: mapping store
• A transactional, distributed hash table
• Maps keys to virtual machines
Application Layer: client API
• Maintains client sessions and cached data
Client API and Sessions
• Clients use a simple API that includes the get, put and sync commands
• Data is pulled from the logical layer in blocks
  • Groups of related keys
• The client API keeps data in an in-memory cache
• Data is pushed out asynchronously to virtual nodes in blocks
• Push/pull can be done synchronously if requested by the client
  • Offers stronger consistency at the cost of performance
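A sketch of what a client session exposing these commands might look like in Java; only the command names come from the slide, the signatures are our assumption:

```java
// Illustrative client-session interface for the get/put/sync API.
interface KKVSession {
    // Pull the block of related keys containing 'key' into the
    // in-memory cache (if not already cached) and return the value.
    byte[] get(String key);

    // Stage a key-value in the cache; pushed out asynchronously
    // to the owning virtual node in blocks.
    void put(String key, byte[] value);

    // Stage a key-key-value (a connection between two keys),
    // e.g. put("alice", "bob", "follows").
    void put(String fromKey, String toKey, String label);

    // Force a synchronous push/pull: stronger consistency
    // at the cost of performance.
    void sync();
}
```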
[Diagram: put(alice, bob, follows) — the address table maps alice to virtual host (1,1) and bob to (8,8); kkv(alice, bob, follows) is written to both hosts, alongside kv(alice, ...) and kv(bob, ...)]
Adding a Key-Key-Value
Two users: Alice and Bob
1. Use the Address Table to determine the virtual machine (node) that hosts Alice's data
2. Write the data to that node
3. Use the Address Table to determine the node that hosts Bob's data
4. Write the same data to that node
5. The on-line partitioning algorithm later moves Alice's data to Bob's node because they are connected (sketched below)
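In Java pseudocode, this write path might look like the following, reusing the KeyKeyValue record sketched earlier; AddressTable and VirtualNode are hypothetical interfaces, not the authors' API:

```java
// Hypothetical write path for put(alice, bob, follows).
void putKKV(AddressTable table, String from, String to, String label) {
    KeyKeyValue kkv = new KeyKeyValue(from, to, label);

    // Steps 1 and 3: look up the virtual node (grid cell)
    // hosting each endpoint's data.
    VirtualNode fromNode = table.lookup(from);  // e.g. alice -> (1,1)
    VirtualNode toNode   = table.lookup(to);    // e.g. bob   -> (8,8)

    // Steps 2 and 4: write the same connection to both nodes,
    // so each side can enumerate its neighbors locally.
    fromNode.write(kkv);
    toNode.write(kkv);

    // Step 5 happens later: the on-line partitioner may migrate
    // Alice's data to Bob's node because they are now connected.
}
```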
Splitting a Node
• If one node becomes overloaded, it can initiate a split
• To maintain the grid structure, nodes in the same row and column must also split
• Once the split is complete, new physical machines can be turned on
• Virtual nodes can be transferred to these new machines
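One plausible reading of the split protocol as Java pseudocode; the Grid methods and the exact row/column bookkeeping are our guesses at how the square grid could be preserved:

```java
// Hypothetical split trigger: when cell (r, c) is overloaded, every
// node in row r and column c also splits, growing the grid from
// N x N to (N+1) x (N+1) and keeping it square.
void splitIfOverloaded(Grid grid, int r, int c, double loadThreshold) {
    if (grid.cell(r, c).load() < loadThreshold) return;

    grid.splitRow(r);     // each node in row r hands off half its keys
    grid.splitColumn(c);  // likewise for each node in column c

    // Once the split completes, new physical machines can be turned
    // on and the newly created virtual nodes migrated onto them.
}
```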
On-line Partitioning Algorithm
• Runs periodically in parallel on each virtual node
• Also runs after a split or merge

For each key stored on a node:
  Determine the number of connections (key-key-values) with keys on other nodes
    (can also be the sum of edge weights)
  Find the node that has the most connections
  If that node is different from the current node,
  and the number of connections to it is greater than the number of connections to the current node,
  and this margin is greater than some threshold:
    Move the key to the other node
    Update the address table

• Designed to work in a distributed, dynamic setting
• NOT a replacement for off-line algorithms in static settings
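A Java sketch of one pass of this loop, run independently by each virtual node; VirtualNode, AddressTable and the helper otherEnd are illustrative names (in the experiments below, the threshold is simply 0):

```java
import java.util.HashMap;
import java.util.Map;

// One periodic rebalancing pass over this node's keys.
void rebalance(VirtualNode current, AddressTable table, int threshold) {
    for (String key : current.localKeys()) {
        // Count connections (key-key-values) from this key to the
        // node hosting each neighbor; could also sum edge weights.
        Map<VirtualNode, Integer> connections = new HashMap<>();
        for (KeyKeyValue kkv : current.connectionsOf(key)) {
            VirtualNode host = table.lookup(kkv.otherEnd(key));
            connections.merge(host, 1, Integer::sum);
        }

        // Find the node holding the most neighbors of this key.
        VirtualNode best = current;
        for (Map.Entry<VirtualNode, Integer> e : connections.entrySet())
            if (e.getValue() > connections.getOrDefault(best, 0))
                best = e.getKey();

        // Move only if the gain over staying put beats the threshold.
        int gain = connections.getOrDefault(best, 0)
                 - connections.getOrDefault(current, 0);
        if (best != current && gain > threshold) {
            current.moveKey(key, best);  // migrate the key's data
            table.update(key, best);     // keep the address table consistent
        }
    }
}
```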
Experimental Parameters

Parameter                        Value
No. Vertices (V)                 100-400
Branching Factor (b)             10%-100% of V
Distribution of b                Zipf (alpha = 1.5)
Partitioning Algorithms          On-line, Kernighan-Lin
On-line Workload                 Random, pre-generated history of edge inserts
On-line algorithm run frequency  Every V/10 inserts
On-line threshold                Improvement > 0
Trials                           3 per graph size
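For concreteness, here is one way the random, pre-generated history of edge inserts could be produced; the exact coupling of the Zipf distribution (alpha = 1.5) to the branching factor is our reconstruction, not the authors' harness:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical workload generator: each vertex's branching factor b
// is drawn from a Zipf(alpha) distribution over [V/10, V], and the
// resulting edge inserts are shuffled into a random history.
List<int[]> generateHistory(int v, double alpha, long seed) {
    Random rnd = new Random(seed);

    // Zipf CDF over the possible branching factors V/10 .. V.
    int min = v / 10;
    double[] cdf = new double[v - min + 1];
    double sum = 0;
    for (int i = 0; i < cdf.length; i++) {
        sum += 1.0 / Math.pow(i + 1, alpha);
        cdf[i] = sum;
    }

    List<int[]> history = new ArrayList<>();
    for (int u = 0; u < v; u++) {
        // Draw this vertex's branching factor from the CDF.
        double r = rnd.nextDouble() * sum;
        int b = min;
        while (b - min < cdf.length - 1 && cdf[b - min] < r) b++;

        for (int j = 0; j < b; j++) {
            int w = rnd.nextInt(v);               // random target vertex
            if (w != u) history.add(new int[] {u, w});
        }
    }
    Collections.shuffle(history, rnd);  // randomize insert order
    return history;
}
```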
The on-line algorithm partitions as well as Kernighan-Lin does
Partitioning Quality Results
[Chart: % edges within a partition (0%-35%) vs. vertices in graph (50-450); series: On-line, KL]
The on-line algorithm partitions 2x faster than Kernighan-Lin!
Partitioning Performance Results
[Chart: vertices moved (0-1600) vs. vertices in graph (50-450); series: On-line, KL]
Conclusions
• Contributions:
  • A novel model for scalable graph data stores that extends the key-value model: the key-key-value store
  • A high-level system design
  • A novel on-line partitioning algorithm
• Preliminary experimental results
  • Our proposed algorithm shows promise in the distributed, dynamic setting
What's Ahead?
• Prototype system implementation
  • Java, PostgreSQL
• Performance Analysis against MongoDB, Cassandra
• Sensitivity Analysis
• Cloud Deployment