![Page 1: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/1.jpg)
Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk
Myoungsoo Jung (UT Dallas), Mahmut Kandemir (PSU)
University of Texas at Dallas, Computer Architecture and Memory systems Lab
![Page 2: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/2.jpg)
Takeaway
• Observations:
– Employing more and more flash chips is not a promising solution
– Unbalanced flash chip utilization and low parallelism
• Challenges:
– The degree of parallelism and utilization depends highly on incoming I/O request patterns
• Our approach:
– Sprinkles I/O requests based on the internal resource layout rather than the order imposed by a storage queue
– Commits more memory requests to a specific internal flash resource
![Page 3: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/3.jpg)
Revisiting NAND Flash Performance
Memory Cell Performance (excluding data movement)
– READ: 20 us ~ 115 us
– WRITE: 200 us ~ 5 ms
ONFI 4.0: 800 MB/sec
WRITE: 1.6 ~ 20 MB/sec
READ: 70 ~ 200 MB/sec
Flash Interface (ONFI 3.0)
– SDR: 50 MB/sec
– NV-DDR: 200 MB/sec
– NV-DDR2: 533 MB/sec
![Page 4: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/4.jpg)
Revisiting NAND Flash Performance
ONFI 4.0: 800 MB/sec
PCI Express (single lane)
– 2.x: 500 MB/sec
– 3.0: 985 MB/sec
– 4.0: 1969 MB/sec
PCIe 4.0 (16 lanes): 31.51 GB/sec
![Page 5: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/5.jpg)
Revisiting NAND Flash Performance
200 MB/s (flash READ), 800 MB/s (ONFI 4.0), 31 GB/s (PCIe 4.0, 16 lanes)
Performance Disparity (even under an ideal situation)
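The gap between these three figures can be made concrete with a little arithmetic. The sketch below is only an illustration built on the numbers quoted on the slides; the function names are hypothetical, and it assumes bandwidth adds up perfectly across chips and channels, which is the ideal case.

```python
import math

# Figures quoted on the slides, in MB/s.
FLASH_READ_MBPS = 200      # upper end of the per-chip READ range
ONFI4_MBPS = 800           # one ONFI 4.0 channel
PCIE4_LANE_MBPS = 1969     # PCIe 4.0, single lane
PCIE4_LANES = 16

def host_link_mbps(lane_mbps=PCIE4_LANE_MBPS, lanes=PCIE4_LANES):
    """Aggregate host-link bandwidth in MB/s."""
    return lane_mbps * lanes

def units_to_saturate(link_mbps, unit_mbps):
    """Smallest number of units (chips or channels) whose summed bandwidth
    reaches the link bandwidth, assuming perfect scaling (an idealization)."""
    return math.ceil(link_mbps / unit_mbps)

link = host_link_mbps()
channels = units_to_saturate(link, ONFI4_MBPS)
chips = units_to_saturate(link, FLASH_READ_MBPS)
print(link, channels, chips)
```

Under these assumptions, dozens of channels and well over a hundred chips would have to be kept busy at the same time to match the host link, which is why utilization matters so much in a many-chip design.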
![Page 6: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/6.jpg)
How can we reduce the performance disparity?
![Page 7: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/7.jpg)
Internal Parallelism
A Single Host-level I/O Request
[Figure: channels (CH A, CH B) and ways (WAY 0, WAY 1), shown in four panels]
![Page 8: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/8.jpg)
Unfortunately, the performance of many-chip SSDs does not improve significantly as the amount of internal resources increases
![Page 9: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/9.jpg)
Many-chip SSD Performance
Performance stagnates
![Page 10: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/10.jpg)
Utilization and Idleness
Utilization sharply goes down
Idleness keeps growing
![Page 11: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/11.jpg)
I/O Service Routine in a Many-chip SSD
[Figure: I/O service routine, with the labels below]
– Components: NVMHC, Core (Flash Translation Layer), Flash Controllers
– Steps: Arrivals, Queuing, Parsing, Data Movement Initiation, Memory Request Building, Address Translation, Memory Request Commitment, Transaction Decision, Transaction Handling, Execution Sequence
– Data: I/O Request, Device-level Queue, Flash Memory Request (Virtual Address), Flash Memory Request (Physical Address)
– Parallelism: Striping & Pipelining, Interleaving & Sharing, Out-of-order Scheduling, System- and Flash-level Parallelism
Memory Requests: data size is the same as atomic flash I/O unit size
A flash transaction should be decided before entering the execution stage
Challenge: I/O access patterns and sizes are all determined by host-side kernel modules
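To see why the host-side access pattern fixes what the device can parallelize, here is a minimal sketch of memory request building. It assumes a 4 KB atomic flash I/O unit, nine chips (C0 to C8, as in the later figures), and a simple round-robin page-to-chip striping; all three are illustrative assumptions, and `MemoryRequest` and `build_memory_requests` are hypothetical names.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096   # assumed atomic flash I/O unit, in bytes
NUM_CHIPS = 9      # C0..C8, as drawn in the slide figures

@dataclass(frozen=True)
class MemoryRequest:
    io_id: int   # host I/O request this piece came from
    page: int    # logical page number
    chip: int    # target chip under the assumed round-robin striping

def build_memory_requests(io_id, offset, length):
    """Split one host I/O request into page-sized memory requests."""
    first = offset // PAGE_SIZE
    last = (offset + length - 1) // PAGE_SIZE
    return [MemoryRequest(io_id, p, p % NUM_CHIPS) for p in range(first, last + 1)]

reqs = build_memory_requests(io_id=1, offset=8192, length=16384)
print([(r.page, r.chip) for r in reqs])
```

The number of memory requests and the chips they land on follow directly from the offset and length chosen by the host, so a small or poorly aligned request touches only a few chips no matter how many the SSD contains.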
![Page 12: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/12.jpg)
Challenge Examples
• Virtual Address Scheduler
• Physical Address Scheduler
![Page 13: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/13.jpg)
Virtual Address Scheduler (VAS)
[Figure: timing diagram for I/O requests 1 to 5 on Channel A. Chips C0 to C8 are addressed by Chip ID, Plane ID, and Physical Offset; CHIP 3 (C3) is highlighted. Timeline labels: BUS, CELL, RB = true, RB = false, and LATENCY for Req. 1 to Req. 5.]
– Idle
– Stall due to the I/O Request 3 collision at C5
– Tail Collision
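The stall on this slide can be imitated with a toy model of FIFO commitment. This is a sketch under strong assumptions, not the scheduler evaluated in the talk: every memory request occupies its chip for one time unit, all requests are already queued at time 0, bus time is ignored, and `fifo_commit` is a hypothetical name.

```python
def fifo_commit(chips, service=1):
    """Commit memory requests strictly in queue order.

    `chips` lists the target chip of each queued memory request.
    Returns (done, idle): completion time of each request, and how long
    its target chip sat free while the request waited behind the head."""
    chip_free = {}          # chip -> time at which it becomes free
    issue = 0               # issue time of the previous request
    done, idle = [], []
    for chip in chips:
        free_at = chip_free.get(chip, 0)
        issue = max(issue, free_at)       # head-of-line blocking
        idle.append(issue - free_at)      # chip was free but unused
        chip_free[chip] = issue + service
        done.append(issue + service)
    return done, idle

done, idle = fifo_commit([3, 3, 5, 0, 1])
print(done, idle)
```

In this example the second request collides with the first on chip 3, and the three requests queued behind it wait one time unit even though chips 5, 0, and 1 were free the whole time.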
![Page 14: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/14.jpg)
Physical Address Scheduler (PAS)
[Figure: timing diagram for I/O requests 1 to 5 on Channel A. Chips C0 to C8 are addressed by Chip ID, Plane ID, and Physical Offset; CHIP 3 (C3) is highlighted. Timeline labels: BUS, CELL, RB = true, RB = false, and LATENCY for Req. 1 to Req. 5.]
– Pipelining
– Collision
– Tail Collision
– CYCLE SAVED
![Page 15: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/15.jpg)
Observations
• # of chips < # of memory requests
– The total number of chips is smaller than the total number of memory requests coming from different I/O requests
• There exist many requests heading to the same chip, but to different internal resources
– Multiple memory requests can be built into a high-FLP transaction if we can change the commit order
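The second observation can be checked mechanically on a pending queue. The sketch below counts, per chip, how many distinct (die, plane) targets are waiting. It assumes a transaction can carry at most one memory request per (die, plane) pair; the tuple layout and the name `targets_per_chip` are hypothetical.

```python
from collections import defaultdict

def targets_per_chip(requests):
    """Map chip -> set of distinct (die, plane) targets among pending requests.

    Each request is a tuple (io_id, chip, die, plane)."""
    targets = defaultdict(set)
    for io_id, chip, die, plane in requests:
        targets[chip].add((die, plane))
    return dict(targets)

pending = [
    (1, 3, 0, 0),
    (2, 3, 0, 1),
    (3, 5, 0, 0),
    (4, 3, 1, 0),
    (5, 3, 0, 0),   # same die and plane as request 1, so it needs a later transaction
]
flp = {chip: len(t) for chip, t in targets_per_chip(pending).items()}
print(flp)
```

Here three requests from three different I/O requests could share one transaction on chip 3, but only if the scheduler is allowed to commit them together rather than in arrival order.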
![Page 16: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/16.jpg)
Insights
• Stalled memory requests can be served immediately
– If the scheduler can compose requests beyond the boundary of I/O requests and commit them regardless of their order
• The scheduler has more flexibility in building a flash transaction with high FLP
– If it can commit requests that target different internal flash resources
![Page 17: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/17.jpg)
Sprinkler
• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
• Improving transactional locality
– Supply many memory requests to underlying flash controllers
![Page 18: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/18.jpg)
RIOS: Resource-driven I/O Scheduling
[Figure: requests numbered 1 to 11 and chips C0 to C8]
• Relaxing the parallelism dependency
– Schedule and build memory requests based on the internal resource layout
![Page 19: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/19.jpg)
RIOS: Resource-driven I/O Scheduling
[Figure: requests numbered 1 to 11 and chips C0 to C8]
• RIOS
– Out-of-Order Scheduling
– Fine Granule Out-of-Order Execution
– Maximizing Utilization
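One way to picture resource-driven scheduling is to walk the chips instead of the queue. The sketch below is a simplified illustration, not the RIOS algorithm itself: it visits chips in ascending ID order (an assumed stand-in for the layout order), commits at most one pending memory request per chip per round, and ignores which I/O request each one belongs to. The name `rios_rounds` is hypothetical.

```python
from collections import defaultdict, deque

def rios_rounds(requests):
    """Group pending memory requests into scheduling rounds, chip by chip.

    `requests` is a list of (io_id, chip) in arrival order. Each round
    holds (chip, io_id) pairs, at most one per chip."""
    per_chip = defaultdict(deque)
    for io_id, chip in requests:
        per_chip[chip].append(io_id)      # arrival order is kept within a chip
    rounds = []
    while any(per_chip.values()):
        rounds.append([(chip, q.popleft())
                       for chip, q in sorted(per_chip.items()) if q])
    return rounds

queue = [(1, 3), (1, 4), (2, 3), (3, 5), (4, 3), (5, 0)]
rounds = rios_rounds(queue)
print(rounds)
```

In the first round, I/O request 5 is served on chip 0 alongside requests 1 and 3, even though it arrived last; only the requests that truly contend for chip 3 are deferred to later rounds.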
![Page 20: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/20.jpg)
FARO: FLP-Aware Request Over-commitment
• High Flash-Level Parallelism (FLP)
– Bring as many requests as possible to flash controllers, allowing them to coalesce many memory requests into a single flash transaction
• Consideration
– Careless over-commitment of memory requests can introduce more resource contention
![Page 21: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/21.jpg)
FARO: FLP-Aware Request Over-commitment
• Overlap Depth
– The number of memory requests heading to different planes and dies, but the same chip
• Connectivity
– Maximum number of memory requests that belong to the same I/O request
[Figure: RIOS and FARO on chip C3, with two examples]
– Overlap depth: 4, Connectivity: 2
– Overlap depth: 4, Connectivity: 1
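The two metrics can be computed directly from their definitions. In the sketch below a group is the list of memory requests pending for one chip, each written as a hypothetical tuple (io_id, die, plane). The final ranking by overlap depth first and connectivity second is an assumption made for illustration, not a rule stated on the slide.

```python
from collections import Counter

def overlap_depth(group):
    """Number of distinct (die, plane) targets in the group."""
    return len({(die, plane) for _, die, plane in group})

def connectivity(group):
    """Largest number of requests in the group that share one I/O request."""
    return max(Counter(io_id for io_id, _, _ in group).values(), default=0)

# Two groups with the same numbers as the slide's two examples.
group_a = [(1, 0, 0), (1, 0, 1), (2, 1, 0), (3, 1, 1)]
group_b = [(1, 0, 0), (2, 0, 1), (3, 1, 0), (4, 1, 1)]

metrics_a = (overlap_depth(group_a), connectivity(group_a))
metrics_b = (overlap_depth(group_b), connectivity(group_b))

# Assumed preference: higher overlap depth, then higher connectivity.
preferred = max([group_a, group_b],
                key=lambda g: (overlap_depth(g), connectivity(g)))
print(metrics_a, metrics_b)
```

Both groups fill four (die, plane) slots on the chip, so they differ only in how many of their requests come from the same I/O request.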
![Page 22: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/22.jpg)
Sprinkler
[Figure: timing diagram for I/O requests 1 to 5 on Channel A. Chips C0 to C8 are addressed by Chip ID and Plane ID; C3 is highlighted. Timeline labels: BUS, CELL, RB = true, RB = false, and LATENCY for Req. 1 to Req. 5.]
– Pipelining
– CYCLE SAVED
![Page 23: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/23.jpg)
Evaluations
• Simulation
– NFS (NANDFlashSim) http://nfs.camelab.org
– 64 ~ 1024 flash chips, dual die, four plane (our SSD simulator simultaneously executes 1024 NFS instances)
– Intrinsic latency variation (write: fast page 200 us ~ slow page 2.2 ms; read: 20 us)
• Workloads
– Mail file server (cfs), hardware monitor (hm), MSN file storage server (msnfs), project directory service (proj)
– High transactional locality workloads: cfs2, msnfs2~3
• Schedulers
– VAS: Virtual Address Scheduler, using FIFO
– PAS: Physical Address Scheduler, using extra queues
– SPK1: Sprinkler, using only FARO
– SPK2: Sprinkler, using only RIOS
– SPK3: Sprinkler, using both FARO and RIOS
![Page 24: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/24.jpg)
Throughput
[Bandwidth] [IOPS]
– 300 MB/s improvement
– 4x improvement
– Compared to VAS: 42 MB/s ~ 300 MB/s improvement
– Compared to PAS: 1.8 times better performance
![Page 25: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/25.jpg)
I/O and Queuing Latency
[Avg. Latency] [Queue Stall Time]
– SPK1 is worse than PAS
– SPK2 is worse than SPK1
– SPK1 by itself cannot secure enough memory requests and still has the parallelism dependency
– Large req. size
SPK3 (Sprinkler) reduces the device-level latency and queue pending time by at least 59% and 86%, respectively.
![Page 26: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/26.jpg)
Idleness Evaluation
[Inter-chip Idleness]
– SPK1 shows worse inter-idleness reduction than PAS
[Intra-chip Idleness]
– SPK1 shows better intra-idleness reduction than PAS
When considering both intra- and inter-chip idleness, SPK3 outperforms all schedulers tested (around 46%)
![Page 27: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/27.jpg)
Conclusion and Related Work
• Conclusion:
– Sprinkler relaxes the parallelism dependency by sprinkling memory requests based on the underlying internal resources
– Sprinkler offers at least 56.6% shorter latency and 1.8 ~ 2.2 times better bandwidth than a modern SSD controller
• Related work:
– Balancing timing constraints, fairness, and different dimensions of physical parallelism by DRAM-based memory controller [HPCA’10, MICRO’10 Y.Kim, MICRO’07, PACT’07]
– Physical Address Scheduling [ISCA’12, TC’11]
![Page 28: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/28.jpg)
![Page 29: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/29.jpg)
Parallelism Breakdown
[VAS]
[SPK1 FARO-only]
[SPK2 RIOS-only]
[SPK3 Sprinkler]
![Page 30: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/30.jpg)
# of Transactions
[64-chips] [1024-chips]
![Page 31: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/31.jpg)
Time Series Analysis
![Page 32: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/32.jpg)
GC
![Page 33: Sprinkler: Maximizing Resource Utilization in Many-Chip Solid State Disk Myoungsoo Jung (UT Dallas) Mahmut Kandemir (PSU) University of Texas at Dallas](https://reader037.vdocument.in/reader037/viewer/2022102900/551aa08d5503466b3a8b56ac/html5/thumbnails/33.jpg)
Sensitivity Test
[64-chips]
[1024-chips]
[256-chips]