practical memory disaggregation
TRANSCRIPT
1
Practical Memory DisaggregationA Case Study in Network-Informed Data Systems Design
Mosharaf ChowdhuryNovember 2020
Five Years Ago…
3
The volume of data businesses want to make sense of is increasing
2015 2016 2017 2018
1. Data Volume Will Keep Increasing
4
2015 2016 2017 2018
Data Systems
5
Big Data AnalyticsAI/ML Tools
Massive dataHigh parallelismGPU clustersDistributed…
> 100 ms
Over the World
2. Deployed in Diverse Networks
< 10 µs
Within a Rack Within a Datacenter~ 1 ms
6
Network-Informed Data Systems Design
7
I. Network-adaptive Big Data and AI/ML systems
II. Tailoring data systems to extreme networksI. Computation over the InternetII. Leveraging high-speed networks
2016 20202017 20192018
HU
G
CO
DA
Herm
es
EC-C
ache
Carbyne
QO
OP
Tiresias
Salu
s
AlloX
NO
CS
Sol
Pand
o
Infin
iswap
DSL
R
Leap
NetLock
CellScope
PracticalMemoryDisaggregation
8
Memory is King!
9
Perform Great!
36.18
6.619
1.5420
5
10
15
20
25
30
35
40
100% 75% 50%
TPS
(T
hous
ands
)
In-Memory Working Set
TPC-C on VoltDB
10
Perform Great Until Memory Runs Out
36.18
6.619
1.5420
5
10
15
20
25
30
35
40
100% 75% 50%
TPS
(T
hous
ands
)
In-Memory Working Set
TPC-C on VoltDB
11
Perform Great Until Memory Runs Out
36.18
6.619
1.5420
5
10
15
20
25
30
35
40
100% 75% 50%
TPS
(T
hous
ands
)
In-Memory Working Set
95.8
44.9
23.8
0
20
40
60
80
100
120
100% 75% 50%O
ps (
Tho
usan
ds)
In-Memory Working Set
TPC-C on VoltDB FB Workload on Memcached
12
Perform Great Until Memory Runs Out
36.18
6.619
1.5420
5
10
15
20
25
30
35
40
100% 75% 50%
TPS
(T
hous
ands
)
In-Memory Working Set
57 67.5
453.4
1
10
100
1000
100% 75% 50%
Com
plet
ion
Tim
e (s
)
In-Memory Working Set
TPC-C on VoltDB FB Workload on Memcached PageRank on PowerGraph95
.8
44.9
23.8
0
20
40
60
80
100
120
100% 75% 50%O
ps (
Tho
usan
ds)
In-Memory Working Set
13
50% Less Memory Causes Slowdown of …
36.18
6.619
1.5420
5
10
15
20
25
30
35
40
100% 75% 50%
TPS
(T
hous
ands
)
In-Memory Working Set
57 67.5
453.4
1
10
100
1000
100% 75% 50%
Com
plet
ion
Tim
e (s
)
In-Memory Working Set
TPC-C on VoltDB FB Workload on Memcached PageRank on PowerGraph95
.8
44.9
23.8
0
20
40
60
80
100
120
100% 75% 50%O
ps (
Tho
usan
ds)
In-Memory Working Set
14
Between a Rock and a Hard Place
OverallocationLeads to underutilization
30-50% in Google, Alibaba, and Facebook
UnderallocationLeads to severe performance loss
VS.
15
Machine 1 Machine 2 Machine 3 Machine N
Used Memory Free Memory
…
Disaggregated Memory
Memory Disaggregation
Remote Memory 16
Network is Getting Faster!
TCP/IP Hundreds of µsec
DPDK Tens of µsec
RDMA Single-digit µsec
Hundreds of nsecDRAM
time to access a 4KB memory page17
What is PracticalMemoryDisaggregation?
1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …
18
1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …
19
What is PracticalMemoryDisaggregation?
Infiniswap
w/ Juncheng Gu and many othersNSDI’17
Efficient Memory Disaggregation
20
Applicability
Application ChangesPe
rfor
man
ce
Best
Poor
Major NoMinor
Key-Value
File
Disk
Paging
Very Good
File
Disk
Key-Value
Paging
How can we enable any application to leverage disaggregated memory without sacrificing performance?
Remote Memory Paging
22
Exposes memory across server boundaries• Scalable• Efficient• Fault-tolerant
No changes to• applications,• operating systems, or• hardware
Core Idea
23
1. Infiniswap Block DeviceFinds free remote memory, maps pages, and provides fault tolerance without any central coordination
2. Infiniswap DaemonProactively evicts remote pages to ensure transparent, best-effortservice
Exposes free remote memory as swap devices in a decentralized manner without affecting remote processes
Infiniswap in One Slide
Container 1 Container NInfiniswapDaemon
…
User Space
Kernel Space
Local Disk
Async Sync
RNIC
2 3 2
Machine-1
User Space
Kernel Space
InfiniswapDaemonContainer A
Machine-2
X Mapped to memory of Machine-X
Virtual Memory Manager (VMM)
Infiniswap Block Device
Individual pagePage fault
User Space
Kernel Space
InfiniswapDaemonContainer A
Machine-3
Container 1 Container NInfiniswapDaemon
…
User Space
Kernel Space
Local Disk
Async Sync
RNIC
2 3
Machine-N
Virtual Memory Manager (VMM)
Infiniswap Block Device
24
Scalability via Decentralization
How to find free remote memory in a large cluster?• Problem: Centralized solution can be slow and expensive• Solution: Power of two choices
How to evict mapped memory?• Problem: LRU/LFU is hard because one-sided RDMA bypasses CPU• Solution: Power of many choices
25
Higher Efficiency & Better Load Balancing
Higher Utilization
0
20
40
60
80
100
1 3 5 7 9 11 13 15 17 19 21 23 25 27 29Mem
ory
Util
izat
ion
(%)
Rank of 32 Machines
Infiniswap w/o Infiniswap
26
35.89
27.74
19.33
0
5
10
15
20
25
30
35
40
100% 75% 50%
TPS
(T
hous
ands
)
In-Memory Working Set
56.1 63.9 64.2
1
10
100
1000
10000
100% 75% 50%
Com
plet
ion
Tim
e (s
)
In-Memory Working Set
99.1
100.
4
91.3
0
20
40
60
80
100
120
140
160
100% 75% 50%O
ps (
Tho
usan
ds)
In-Memory Working Set
TPC-C on VoltDB FB Workload on Memcached
Even on 50% Memory, Slowdown is
PageRank on PowerGraph
<2X ≈1X ≈1X 27
1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …
28
What is PracticalMemoryDisaggregation?
w/ Hasan Al MarufATC’20 Best Paper
29
LeapEffectively Prefetching Remote Memory
User Space
Kernel Space
Device Mapping Layer
Block Device Driver
Generic Block Layer
I/O Scheduler Request QueueRequest queue processing:
Insertion, Merging, Sorting, Staging and Dispatch
bio
Remote Memory
Dispatch Queue
Memory ManagementUnit (MMU)
Process 1 Process 2 Process N…
Page Fault
RDMA: 4.3 us
0.27 us
10.04 us
21.88 us
2.1 us
CacheMiss
CacheHit
MMU Page Cache
Life of a Page
30
Where Does the Time Go?Page Request
In Page Cache?
Allocate Cache for Page
Read Request?
No
Yes
Update Page Table & End I/OPrepare for I/O
YesNo
Queue and Batch Requests
Execute I/O
0.12 µs
2.1 µs
10.04 µs
21.88 µs
RDMA: 4.3 µs
0.15 µs
Fast Path
Slow Path
31
We Need to
32
1. Increase cache hit• Faster path serves more page faults
2. Reduce the latency of the slow path• Remove block-layer operations unnecessary for RDMA
User Space
Kernel Space
Device Mapping Layer
Block Device Driver
Generic Block Layer
I/O Scheduler Request QueueRequest queue processing:
Insertion, Merging, Sorting, Staging and Dispatch
bio
Remote Memory
Dispatch Queue
Memory ManagementUnit (MMU)
Process 1 Process 2 Process N…
Page Fault
RDMA: 4.3 us
0.27 us
10.04 us
21.88 us
2.1 us
CacheMiss
CacheHit
MMU Page Cache
Life of a Page
33
User Space
Kernel Space
Remote Memory
Memory ManagementUnit (MMU)
Process 1 Process 2 Process N…
Page Fault
RDMA: 4.3 us
0.27 us
2.1 us
CacheMiss
CacheHit
MMU Page Cache
Life of a Pagew/ Leap
Process Specific Page Access Tracker
Leap
Trend Detection
Prefetch CandidateGeneration
Prefetcher
Eager Cache Eviction
34
0.34 us
Prefetching in Linux
Reads ahead pages sequentially
Based only on the last page access
Does not distinguish between processesCannot detect thread-level access irregularities
too aggressive on seq: cache pollution
too conservative off seq: brings nothing
35
Trend Detection in LeapStart with a smaller window of Access History
Majority found?
Doubles the window size
No Yes
Run Boyer-Moore on the window
Return Majority ∆maj
Max. window
size?
YesNo trend found
No
Resilient to short term irregularity
Identifies the majority element in access history
Regular trends can be found within recent accesses
36
Lowers Remote Page Access Latency by…Sequential Access Stride Access
0
0.2
0.4
0.6
0.8
1
0.01 1 100 10000
CD
F
Latency (us)
Infiniswap
Infiniswap+Leap
0
0.2
0.4
0.6
0.8
1
0.01 1 100 10000
CD
F
Latency (us)
4X 104X37
Performs Great Even After Memory Runs Out
TPC-C on VoltDB
<2X
37.00
27.74
19.33
1.50
5
10
15
20
25
30
35
40
100% 75% 50% 25%
TPS
(T
hous
ands
)
In-Memory Working Set
Infiniswap
37 36.3 35.6
15.6
0
5
10
15
20
25
30
35
40
100% 75% 50% 25%
TPS
(T
hous
ands
)
In-Memory Working Set
TPC-C on VoltDB
Infiniswap + Leap
≈1X 38
Performs Great Even After Memory Runs Out
TPC-C on VoltDB
37.00
27.74
19.33
1.50
5
10
15
20
25
30
35
40
100% 75% 50% 25%
TPS
(T
hous
ands
)
In-Memory Working Set
Infiniswap
37 36.3 35.6
15.6
0
5
10
15
20
25
30
35
40
100% 75% 50% 25%
TPS
(T
hous
ands
)
In-Memory Working Set
TPC-C on VoltDB
Infiniswap + Leap
2.4X 39
Applicability & Performance
Application Changes
Perf
orm
ance
Best
Poor
Major NoMinor
Very Good
File
Disk
Key-Value
Infiniswap
40
Infiniswap + Leap
1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …
41
What is PracticalMemoryDisaggregation?
Memtrade
Justitia
Hydra
1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …
42
What is PracticalMemoryDisaggregation?
w/ ZhuolongYu, Yiwen Zhang and othersSIGCOMM’20
43
NetLockLock Management with Programmable Switches
Transactions
44
Transaction processing needs• High throughput;• Low latency; and• Policy support
Existing approaches• Centralized: low throughput• Decentralized: limited policy support
Network-Assisted Lock Management
45
Transaction processing needs• High throughput;• Low latency; and• Policy support
Challenges• Limited memory to store the locks• Limited functionalities to process the
locks and realize the policies
Programmable Switch
NetLock processes lock requests with a combination of switch and servers• The switch only stores and
processes the requests on hot locks• Servers do the rest
Implemented on a 6.5Tbps Barefoot Tofino switch
46
Clients
L2/L3 Routing
LockTable
Lock TableServer
DatabaseServers
ToR Switch Lock TableServer
NetLock
NetLock Architecture
Switch Memory Disaggregation
47
Determine how much switch memory is needed for a target throughput• Formulated as a fractional knapsack problem• Depends of expected contention
Handling overflow• Move locks back and forth between switch and servers
48
Single µs Latency
20X lower latency for TPC-C over DSLR
Billions of Locks/Sec
49
18X higher throughput for TPC-C over DSLR
1. Applicability2. Scalability3. Efficiency4. Performance5. Isolation6. Resilience7. Security8. Generality9. …
50
What is PracticalMemoryDisaggregation?
51
Memory Disaggregation
Wide-Area Computing
AI/MLSystems
Big DataSystems
Network-Informed Data Systems Design
52
Juncheng Gu Jie You Yiwen ZhangFan Lai Jiachen Liu Hasan Al Maruf Peifeng Yu
PhD Students
Undergraduate& Master’s
Collaborators
Chris ChenYinwei DaiShuoren FuSongyuan Guan
Jack KosaianQinye LiYang LiuYuze Lou
Alexander NebenWenting TanYue TanKaiwei Tu
Yuchen WangYujia XieYilei XuJiaxing Yang
Yiwei ZhangJiangchen ZhuJingyuan ZhuXiangfeng Zhu
Aditya AkellaGanesh AnanthanarayananWei BaiVladimir BravermanShuchi ChawlaKai ChenLi ChenAsaf CidonYanhui GengAli GhodsiAyush Goel
Robert GrandlChuanxiong GuoMatan HamilisAnthony HuangAnand P. IyerMyeongjae JeonXin JinSamir KhullerTan N. LeYoungmoon LeeLi Erran Li
Hongqiang LiuZhenhua LiuHarsha V. MadhyasthaKshiteej MahajanBarzan MozafariLinh NguyenAurojit PandaManish PurohitJunjie QianKannan RamchandranK. V. Rashmi
Kang G. ShinScott ShenkerBrent StephensIon StoicaXiao SunMuhammed UluyolShivaram VenkataramanCarl WaldspurgerHongyi WangJingfeng WuSheng Yang
BairenYiDong Young YoonZhuolongYuHong ZhangJunxue ZhangYuhong ZhongYibo Zhu