social hash - usenix · summary assignment problems are common in distributed systems design...
TRANSCRIPT
Social Hash:an Assignment Framework for Optimizing Distributed Systems Operations on Social Networks
Alon Shalita, Brian Karrer, Igor Kabiljo, Arun Sharma, Alessandro Presta, Aaron Adcock, Herald Kllapi, and Michael Stumm
March 2016
Assignment ProblemFront-end clusters
Cache
Point-of-Presence
(PoP)Cache
Cache
Alon’s HTTP requests
Igor’s HTTP requests
Assignment ProblemFront-end clusters
Cache
Cache
Cache
Point-of-Presence
(PoP)
Alon’s HTTP requests
Igor’s HTTP requests
Assignment Problem OptimizationFront-end clusters
Cache
Cache
Cache
Point-of-Presence
(PoP)
Solution RequirementsFront-end clusters
Cache
Cache
Cache
Balanced
Point-of-Presence
(PoP)
Solution RequirementsFront-end clusters
Cache
Cache
Cache
Adaptive
Point-of-Presence
(PoP)
Alon’s HTTP requests
Solution RequirementsFront-end clusters
Cache
Cache
Cache
Stable
Point-of-Presence
(PoP)
Alon’s HTTP requests
Solution RequirementsFront-end clusters
Cache
Cache
Cache
Fast decision
Point-of-Presence
(PoP)
Alon’s HTTP requests
Social Hash framework
Social Hash framework
!
Components!(e.g.,!compute!clusters!or!storage!subsystems)!
Groups!
Objects!(e.g.,!data!records!or!HTTP!requests)!
static!!
assignm
ent!
dynamic!!
assignm
ent!
Static assignment
▪ Goal: assign similar objects sent to the same group
▪ Data access pattern -> represent as graph -> graph partitioning
▪ Large-scale optimization: slow, time-consuming
Dynamic assignment
▪ Goal: adapt to maintain load balance by altering group -> component
▪ hardware changes
▪ dynamic workload
▪ addition and removal of objects
▪ Two-level framework separates optimization from adaptation
▪ Slow optimization -> static
▪ Fast adaptation -> dynamic
▪ Group-to-component ratio controls tradeoff
Social Hash framework!!!!!!!!!!!!!!!!!!!!!! !
Social!Hash!Tbl! Assignment!Tbl!
group!g"
Lookup!
Request!
Missing!key!!assignment!
key"
failed!
group!
c"
Graph!Partitioning!
graph!specifications! Dynamic!!
Assignment!
monitoring!info!operator!console!
key"
HTTP Request Routing
Social Hash for Facebook’s web routing
▪ Objects: HTTP request identified by user, Components: front-end clusters
▪ PoP: Dynamic assignment by hash ringFront-end clusters
Cache
Cache
Point-of-Presence
(PoP)
Edge locality for Facebook’s web routing
●
●●
●●
●●
●●
●●
●
●
●
●
●
●
0.00
0.25
0.50
0.75
1.00
10 100 1,000 10,000 100,000Number of groups
Edge
loca
lity
▪ Production routing: 21k groups for 10’s of front-end clusters
▪ Over half of friendships are within groups
▪ Updated on a weekly basis (~1% movement)
Live traffic experiment: TAO miss rate
−30
−20
−10
0
10
Sun
Mon
Tue
Wed Thu Fri
Sat
Sun
Mon
Tue
Wed Thu Fri
Sat
Day
Perc
enta
ge c
hang
e in
TAO
miss
rate
(%)▪ Orange: traffic shifts
▪ Red: duration of test
▪ Green: updated Social Hash table
Storage Sharding
Assignment Problem 2: Storage sharding
Arun’s query
Objects: data recordsComponents: storage machines
Assignment Problem 2: Storage sharding
Arun’s query
Objects: data recordsComponents: storage machines
Static assignment
▪ Minimize fanout through bipartite graph partitioning
▪ Graph contains recent queries and data records
▪ edge => query accesses data record
▪ Dotted: edge locality optimization
▪ Solid: fanout optimization●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
●
10
20
30
40
50
3 10 30 100 300 1,000 3,000 10,000Number of groups
Aver
age
fano
ut
Storage sharding deployment
▪ Graph database with thousands of storage servers
▪ Group-to-component ratio of 8
▪ Static assignment every few months
▪ Results:
▪ Average latencies decreased by over 50%
▪ CPU utilization decreased by over 50%
Summary
▪ Assignment problems are common in distributed systems design
▪ Proposed Social Hash framework for solving assignment problems
▪ Two-level design optimizes performance with graph partitioning
▪ Two Facebook integrations in production for over a year
▪ HTTP Request Routing: > 25% reduction in TAO miss rate
▪ Storage Sharding: Latency and CPU utilization reduced by over 50%