Efficient Memory Disaggregation with Infiniswap
Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, Kang G. Shin
Agenda
• Motivation and related work
• Design and system overview
• Implementation and evaluation
• Future work and conclusion
3/30/17 1
Performance degradation
[Figure: normalized performance of VoltDB (TPC-C), Memcached (Facebook/FBSYS), PowerGraph (TunkRank), and GraphX (PageRank) with 100%, 75%, and 50% of working sets in memory. Data labels, 75% series: 0.18, 0.47, 0.94, 0.97. Data labels, 50% series: 0.04, 0.06, 0.12, 0.04.]
Memory overestimation
Memory underutilization
• Google Cluster Analysis [1]
[Figure: portion of memory over time (days), allocated vs. used. Data labels: 0.8, 0.5, ≈30%.]
How to utilize ABU memory?
Can we utilize this memory?
[1] Reiss, Charles, et al. "Heterogeneity and dynamicity of clouds at scale: Google trace analysis." SoCC'12.
Disaggregate free memory
[Figure: Machines 1 through N, each showing used memory, free memory, and remote memory, joined by a Memory Disaggregation Layer.]
What are the challenges?
• Minimize deployment overhead
  • No hardware design
  • No application modification
• Tolerate failures
  • e.g., network disconnection, machine crash
• Manage remote memory at scale
Recent work on memory disaggregation
Criteria: No HW design | No app modification | Fault-tolerance | Scalability
Systems compared:
• Memory Blade [ISCA'09]
• HPBD [CLUSTER'05] / NBDX [1]
• RDMA key-value service (e.g., HERD [SIGCOMM'14], FaRM [NSDI'14])
• Intel Rack Scale Architecture (RSA) [2]
• Infiniswap
[1] https://github.com/accelio/NBDX
[2] http://www.intel.com/content/www/us/en/architecture-and-technology/rack-scale-design-overview.html
System Overview
[Figure: Machine 1 runs Application 1 and Application 2 in user space; in kernel space, the Virtual Memory Manager (VMM) sits above the Infiniswap Block Device, which connects to the local disk and to the RNIC over paths labeled Sync and Async. Machine 2 runs an application and the Infiniswap Daemon in user space and has its own RNIC.]
Infiniswap Block Device
• Swap space
• Request router
Local disk
• [ASYNC] Back up swapped-out data
• Tolerate remote memory failure
Infiniswap Daemon
• Local memory region
• Remote memory service
RDMA
• One-sided operations
• Bypass remote CPU
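The overview can be read as a write path and a read path. The following is a minimal Python model of that reading, not the kernel implementation. It assumes that the Sync label refers to the remote-memory write, since the slide marks the disk backup as ASYNC. The class and method names (`BlockDeviceModel`, `swap_out`, `swap_in`, `fail_remote`, `flush_backups`) are hypothetical, and plain dictionaries stand in for remote memory and the local disk.

```python
from collections import deque


class BlockDeviceModel:
    """Toy model: synchronous write to remote memory, asynchronous disk backup."""

    def __init__(self):
        self.remote = {}          # stands in for remote memory reached over RDMA
        self.disk = {}            # stands in for the local backup disk
        self.pending = deque()    # backups queued for the asynchronous disk write
        self.remote_alive = True

    def swap_out(self, page_id, data):
        # Assumed synchronous path: the page is in remote memory before returning.
        if self.remote_alive:
            self.remote[page_id] = data
        # Asynchronous path: the disk backup is only queued here.
        self.pending.append((page_id, data))

    def flush_backups(self):
        # In this model the caller drains the queue explicitly.
        while self.pending:
            page_id, data = self.pending.popleft()
            self.disk[page_id] = data

    def fail_remote(self):
        # Simulate a machine crash or network disconnection.
        self.remote_alive = False
        self.remote.clear()

    def swap_in(self, page_id):
        # Prefer remote memory; fall back to the local disk copy on failure.
        if self.remote_alive and page_id in self.remote:
            return self.remote[page_id]
        self.flush_backups()
        return self.disk[page_id]


dev = BlockDeviceModel()
dev.swap_out(7, b"page-7")
dev.fail_remote()
print(dev.swap_in(7))
```

The point of the sketch is only the ordering: the application never waits on the disk, yet the disk copy is what makes a remote failure survivable.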
How to meet the design objectives?
Objectives | Ideas
No hardware design | Remote paging
No application modification | Remote paging
Fault-tolerance | Local backup disk
Scalability | Decentralized remote memory management
One-to-many
[Figure: the Infiniswap Block Device on Machine 1 connects through its RNIC to Infiniswap Daemons on Machine 2 and Machine 3, and keeps a local disk. Paths are labeled Sync and Async.]
Many-to-many
[Figure: Infiniswap Block Devices on Machine 1 and Machine 4, each with a local disk, connect through RNICs to Infiniswap Daemons on Machine 2 and Machine 3. Paths are labeled Sync and Async.]
How to scale remote memory?
• How to find remote memory in the cluster?
• Which remote mapping should be evicted?
Management unit: memory page?
[Figure: one Infiniswap Block Device mapping individual pages to three Infiniswap Daemons.]
Mapping table: Local Page p100 -> Remote Page <s1, p1>
• 1 GB = 256K entries
• 1 GB = 256K RTTs
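The 256K figure follows from dividing 1 GB by the page size. The short calculation below assumes 4 KB pages, which the slide does not state, and compares it with a coarser unit; the 1 GB slab size used for the comparison is likewise an assumption made only for illustration.

```python
GIB = 1024 ** 3
PAGE_SIZE = 4 * 1024   # assumed 4 KB page
SLAB_SIZE = GIB        # assumed slab size, for comparison only


def mapping_entries(region_bytes, unit_bytes):
    """Number of mapping entries needed to cover a region at a given unit size."""
    return region_bytes // unit_bytes


per_page = mapping_entries(GIB, PAGE_SIZE)
per_slab = mapping_entries(GIB, SLAB_SIZE)
print(per_page, per_page // 1024)   # entries, and the same count in units of K
print(per_slab)
```

If each mapping also costs one round trip to set up, the entry count is the RTT count, which is why the management unit matters.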
Management unit: memory slab!
[Figure: the Infiniswap Block Device mapping slabs to three Infiniswap Daemons.]
Which remote machine should be selected?
Goal: balance memory utilization
• Central controller
• Decentralized approach
Power of two choices [1]
[Figure: the Infiniswap Block Device choosing among three Infiniswap Daemons.]
[1] Mitzenmacher, Michael. "The power of two choices in randomized load balancing." Ph.D. thesis, U.C. Berkeley, 1996.
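A small simulation can illustrate why two random probes help balance load without a central controller. This is a generic sketch of the power-of-two-choices technique, not Infiniswap's placement code; the function name `place_slabs`, the unit load per slab, and the cluster and slab counts are illustrative assumptions.

```python
import random


def place_slabs(num_machines, num_slabs, choices, rng):
    """Place slabs one by one: probe `choices` random machines, pick the least loaded."""
    load = [0] * num_machines
    for _ in range(num_slabs):
        candidates = rng.sample(range(num_machines), choices)
        target = min(candidates, key=lambda m: load[m])
        load[target] += 1
    return load


rng = random.Random(0)
one_choice = place_slabs(32, 320, 1, rng)
two_choices = place_slabs(32, 320, 2, rng)
# Spread between the most and least loaded machine under each policy.
print(max(one_choice) - min(one_choice))
print(max(two_choices) - min(two_choices))
```

With `choices=1` this is plain random placement; with `choices=2` each decision needs only two probes, so no machine has to know the state of the whole cluster.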
Slab eviction
[Figure: an Infiniswap Daemon with slabs 1 to 4, its memory split between remote memory and used memory. Legend: mapped slab, unmapped slab.]
Which slab should be evicted?
Daemon: Does not know the swap activities
Daemon: Too expensive to query all the slabs
[Figure: an Infiniswap Daemon with slabs 1 to 4.]
Power of multiple choices [1]
Select E least-active slabs from E+E' random slabs
[Figure: an Infiniswap Daemon with slabs 1 to 4; after eviction it holds slabs 1, 2, and 4.]
[1] Park, Gahyun. "A generalization of multiple choice balls-into-bins." PODC'11.
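The selection rule above can be sketched in a few lines. This is an illustrative Python version under stated assumptions: the `activity` dictionary stands in for whatever the daemon learns by querying only the sampled slabs, and the function name `pick_slabs_to_evict` and the activity values are hypothetical.

```python
import random


def pick_slabs_to_evict(activity, e, e_extra, rng):
    """Sample e + e_extra random slabs and return the e least-active among them.

    activity: dict mapping slab id -> observed activity (higher means busier).
    """
    sample_size = min(len(activity), e + e_extra)
    sampled = rng.sample(sorted(activity), sample_size)
    sampled.sort(key=lambda slab: activity[slab])
    return sampled[:e]


activity = {1: 50, 2: 40, 3: 5, 4: 60}   # made-up activity values
rng = random.Random(0)
print(pick_slabs_to_evict(activity, e=1, e_extra=3, rng=rng))
```

Only E+E' slabs are examined per eviction, so the cost stays bounded even when a daemon hosts many slabs; a larger E' trades more queries for a better chance of hitting the truly idle slabs.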
Implementation
• Connection Management
  • One RDMA connection per active block device-daemon pair
• Control Plane
  • SEND, RECV
• Data Plane
  • One-sided RDMA READ, WRITE
[Figure: the Infiniswap Block Device (kernel space) and the Infiniswap Daemon (user space) connected over RDMA.]
What are we expecting from Infiniswap?
• Application performance
• Cluster memory utilization
• Network usage
• Eviction overhead
• Fault-tolerance overhead
• Performance as a block device
Evaluation
• 32-node cluster
• Per node: 2 x 8 cores (32 vcores), 64 GB DRAM, 56 Gbps InfiniBand NIC
• InfiniBand network
Application performance
• 50% working sets in memory
• Application performance is improved by 2-16x
[Figure: normalized performance of VoltDB (TPC-C), Memcached (Facebook/FBSYS), PowerGraph (TunkRank), and GraphX (PageRank) under three configurations: 100% working sets in memory, Disk + 50% working sets in memory, and Infiniswap + 50% working sets in memory. Data labels, Disk + 50%: 0.04, 0.06, 0.12, 0.04. Data labels, Infiniswap + 50%: 0.66, 0.77, 0.61, 0.08.]
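The 2-16x range can be checked against the bar labels. The calculation below assumes the labels map to the applications in the order they are listed on the axis (VoltDB, Memcached, PowerGraph, GraphX), which the transcript does not state explicitly.

```python
apps = ["VoltDB", "Memcached", "PowerGraph", "GraphX"]
disk = [0.04, 0.06, 0.12, 0.04]         # Disk + 50% working sets in memory
infiniswap = [0.66, 0.77, 0.61, 0.08]   # Infiniswap + 50% working sets in memory

# Speedup of Infiniswap over disk swapping, per application.
speedups = {app: i / d for app, d, i in zip(apps, disk, infiniswap)}

for app, s in speedups.items():
    print(f"{app}: {s:.1f}x")
```

Under that assumed mapping the ratios run from about 2x (GraphX) to about 16.5x (VoltDB), consistent with the range quoted on the slide.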
Cluster memory utilization
• 90 containers (applications), mixing all applications and memory constraints.
• Cluster memory utilization is improved from 40.8% to 60% (1.47x)
[Figure: memory utilization (%) by rank of machines, with Infiniswap vs. without Infiniswap.]
Limitations and future work
• Trade-off in fault-tolerance
  • Local disk is the bottleneck
  • Multiple remote replicas
  • Fault-tolerance vs. space-efficiency
• Performance isolation among applications
  • W/o limitation on each application's usage
  • W/o mapping between remote memory and applications
Conclusion
• Infiniswap: remote paging over RDMA
  • Application performance
  • Cluster memory utilization
• Efficient, practical memory disaggregation
  • No hardware design
  • No application modification
  • Fault-tolerance
  • Scalability
Source code is coming soon! https://github.com/Infiniswap/infiniswap.git