05 - nvm ssds - pages.cs.wisc.edu
TRANSCRIPT
NVM-based SSDs
CS 839 - Persistence
Learning outcomes
• Understand the software overheads in different layers of storage access when devices run at the speed of memory, not flash or disk
• Understand where software can be optimized for reducing latency
Questions from reviews
• Could FusionIO or RAID be done on top of PCM? Would you want to use PCM devices in a RAID-like array?
• How to evaluate power efficiency with an FPGA?
  • Use power models from the Cacti tool
• Why compare against SSD & disk?
• How specific to PCM?
  • No flash translation layer!
• Parallelism for 4 KB?
  • Can read from multiple PCM DIMMs in parallel and hit the latency target
• Unfair evaluation! Weak workloads – had no file system
• Confusions:
  • PCIe protocol & DMA
Background story
• Faster persistent memory was raising interest (2009), initially investigated as a DRAM replacement
• A natural use case is faster SSDs
• Both Intel and the Non-Volatile Systems Lab (NVSL) at UCSD built prototype devices based on DRAM to identify software overheads well before PCM/3D XPoint became available
OS/HW I/O path
• Each layer adds latency
• SATA/SCSI: the HBA layer adds 25 usec
  • Its goal is to aggregate multiple slow devices – not needed here
  • 6 PIOs needed for I/O submission
• Outcome: the NVMe interface (see the sketch below)
  • Driver talks to the device over PCIe directly
  • Single PIO for request submission
  • Completions pushed to memory – no PIO needed to read them
  • Interrupts steered to the core that submitted the request
  • Multiple request queues for multi-core scaling
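As a rough illustration of the submission/completion points above, here is a minimal sketch of an NVMe-style queue pair. The struct layouts, names, and flow are illustrative, not the actual NVMe specification.

```c
#include <stdint.h>

/* Commands go into a per-core submission queue in host memory, a single
 * PIO rings the doorbell, and completions are discovered by reading
 * entries the device DMAs into host memory, so no PIO is needed there. */
struct sq_entry { uint64_t cmd; uint64_t buf; uint64_t lba; uint32_t len; uint32_t tag; };
struct cq_entry { uint16_t status; uint16_t tag; };

struct queue_pair {
    struct sq_entry   *sq;        /* host memory, read by the device    */
    struct cq_entry   *cq;        /* host memory, written by the device */
    volatile uint32_t *doorbell;  /* memory-mapped device register      */
    uint32_t           sq_tail, cq_head, depth;
};

void submit(struct queue_pair *qp, struct sq_entry e)
{
    qp->sq[qp->sq_tail] = e;                     /* plain memory write      */
    qp->sq_tail = (qp->sq_tail + 1) % qp->depth;
    *qp->doorbell = qp->sq_tail;                 /* the one PIO per request */
}

/* Check whether the completion for 'tag' has arrived: just a memory
 * read (real code would also need memory barriers / volatile here). */
int completed(struct queue_pair *qp, uint16_t tag)
{
    return qp->cq[qp->cq_head].tag == tag;
}
```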
Benefit of NVMe interface
Software Dominates!
Where does time go with NVMe?
I/O scheduler
• What is the goal of an I/O scheduler?
• Why is it valuable for Disks & Flash SSDs?
• Is it valuable to NVM SSDs?
I/O scheduler
• What is the goal of an I/O scheduler?
  • Optimize the order of requests to maximize performance
  • Implement prioritization/fairness rules
• Why is it valuable for Disks & Flash SSDs?
  • Strong benefit from sequential I/O, so reordering helps
  • Useful when there are lots of requests queued
• How does it work?
  • A separate scheduler thread takes enqueued requests from the rest of the kernel and submits the I/O
  • Adds 2 usec of overhead
• Is it valuable to NVM SSDs?
  • Not as much – the NO-OP scheduler does best because it does the least (see the sketch below)
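A minimal sketch of why the NO-OP policy is the cheapest: it simply forwards requests in arrival order, with no sorting or anticipation. The request structure and the issue_to_device() helper are assumptions for illustration.

```c
#include <stdint.h>
#include <stddef.h>

struct io_req { uint64_t sector; size_t len; };

extern void issue_to_device(uint64_t sector, size_t len);  /* assumed helper */

/* NO-OP-style dispatch: hand requests to the device in arrival order.
 * With no seek cost to optimize away, an elevator-style scheduler that
 * sorted by sector here would only add latency. */
void noop_dispatch(const struct io_req *reqs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        issue_to_device(reqs[i].sector, reqs[i].len);
}
```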
Issuing/completing requests
• Early versions required multiple PIO writes to submit a request (like SATA)
  • Result: need to acquire a lock to prevent races on multicore
  • Does not scale to lots of cores
• Solution: make request submission a single atomic operation – pack everything into 64 bits (see the sketch below)
  • 8-bit tag to match responses to requests
  • 8-bit command
  • 16-bit length
  • 32-bit storage address
  • Remove the memory address of the buffer – attach it to the channel instead!
• Allow multi-threaded interrupt handling
  • Old approach: read status fields, then clear the interrupt – requires a lock to atomically read status & clear the interrupt
  • New approach: the interrupt is automatically cleared when status is read; the next update is guaranteed to raise a new interrupt & update status
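A small sketch of the 64-bit packing described above, assuming the field layout from the slide (8-bit tag, 8-bit command, 16-bit length, 32-bit address) and a hypothetical memory-mapped command register.

```c
#include <stdint.h>

/* Pack one request into 64 bits:
 * 8-bit tag | 8-bit command | 16-bit length | 32-bit storage address. */
static inline uint64_t pack_request(uint8_t tag, uint8_t cmd,
                                    uint16_t len, uint32_t addr)
{
    return ((uint64_t)tag << 56) |
           ((uint64_t)cmd << 48) |
           ((uint64_t)len << 32) |
            (uint64_t)addr;
}

/* 'cmd_reg' stands for the device's memory-mapped command register
 * (hypothetical name): one 64-bit store submits the whole request,
 * so no lock is needed across cores. */
static inline void submit_request(volatile uint64_t *cmd_reg, uint64_t req)
{
    *cmd_reg = req;
}
```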
Avoiding interrupts
• Interrupts allow doing other useful work during the I/O
• What happens if the I/O is fast?
Avoiding interrupts
• Interrupts allow doing other useful work during the I/O
• What happens if the I/O is fast?
For a 4 KiB transfer, Ta = 4.9 usec, Td = 4.1 usec, Tb = 4.1 usec, and Tu = 2.7 usec
Polling is much faster than interrupts
• HW is faster: no interrupt generation, just a PIO to check status (see the sketch below)
• The execution time usable during async I/O (Tb above) is shorter than the extra time added to context switch & handle interrupts
• Result: a net loss of performance from interrupts
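A minimal polling sketch under the same assumptions; the status register and the DONE bit are illustrative.

```c
#include <stdint.h>

#define DONE_BIT 0x1u   /* illustrative status bit */

/* Spin on the device's status register ("PIO to check status") until
 * the I/O finishes.  For multi-microsecond transfers this costs less
 * CPU time than sleeping, context switching, and taking an interrupt. */
static inline void spin_until_done(volatile uint32_t *status_reg)
{
    while ((*status_reg & DONE_BIT) == 0)
        ;   /* busy-wait: no sleep, no interrupt */
}
```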
When are interrupts beneficial? Why?
Removing copies
• Standard I/O copies data from user mode to kernel buffers
• Can we get rid of this copy?
Removing copies
• Standard I/O copies data from user mode to kernel buffers
• Can we get rid of this copy?
  • Memory locations used for I/O must be pinned → user code either does the copy itself, or makes an expensive syscall to pin
  • Must pass memory locations to the SSD → adds I/O operations (a standard Linux analogue is sketched below)
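This is not the Moneta mechanism, but the standard Linux analogue for skipping the user/kernel data copy is O_DIRECT: the device DMAs straight into a suitably aligned user buffer, at the cost of alignment rules and page pinning inside the kernel.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

/* Read 4 KiB from 'path' without the extra user/kernel data copy. */
int read_direct(const char *path)
{
    void *buf;
    if (posix_memalign(&buf, 4096, 4096) != 0)   /* alignment is required */
        return -1;

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0) { free(buf); return -1; }

    ssize_t n = pread(fd, buf, 4096, 0);         /* DMA lands in buf */

    close(fd);
    free(buf);
    return n < 0 ? -1 : 0;
}
```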
Overall results
Internal SSD scheduling
• Issue: a mix of short and large requests – 4 KB vs. 2 MB
• If run in order (FIFO), short requests wait for long ones to complete, hurting latency
• Solution: borrow from CPU schedulers – round-robin over the queue
  • Serve 4 KB of each request, then put it back in the queue (see the sketch below)
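A sketch of the round-robin slicing idea; the request structure, queue, and transfer() helper are assumptions.

```c
#include <stdint.h>
#include <stddef.h>

#define SLICE 4096   /* serve 4 KB of each request per turn (from the slide) */

struct req {
    uint64_t addr;        /* next storage address to transfer */
    size_t   remaining;   /* bytes left in this request       */
    struct req *next;
};

extern void transfer(uint64_t addr, size_t len);   /* assumed DMA helper */

/* One round-robin turn: take the head of the queue, transfer at most
 * one slice, and re-queue the request if it still has data left.
 * Short requests finish in one turn instead of waiting behind 2 MB ones. */
void rr_dispatch(struct req **head, struct req **tail)
{
    struct req *r = *head;
    if (!r)
        return;
    *head = r->next;
    if (!*head)
        *tail = NULL;

    size_t n = r->remaining < SLICE ? r->remaining : SLICE;
    transfer(r->addr, n);
    r->addr      += n;
    r->remaining -= n;

    if (r->remaining) {              /* not done: back to the tail */
        r->next = NULL;
        if (*tail) (*tail)->next = r; else *head = r;
        *tail = r;
    }
}
```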
How does Moneta compare against a fast Flash SSD?
• Notes:
• Not that much faster for some workloads.
• Why?
• Generally
Real Optane SSD
• Much lower tail latency
• Much more stable latency at higher IOPS
What is missing from Moneta?
What is missing from Moneta?
• Still have to enter the kernel and move data from user to kernel space
• Still have a slow file system – adds 50-60% of latency on top of the hardware
• Only have DRAM simulation, not PCM
Onyx – first PCM SSD
• Bought PCM chips, built their own DIMMs, built an SSD
• Performance not much better than an SSD
Start-gap wear leveling
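The slide carries only the title; for reference, below is a minimal sketch in the spirit of the Start-Gap scheme (Qureshi et al., MICRO 2009): N logical lines map onto N+1 physical lines, one of which (the gap) is unused, and the gap is rotated one line at a time every K writes so that hot logical lines slowly migrate across all physical lines. Constants and bookkeeping details are illustrative, not taken from Onyx.

```c
#include <stdint.h>

#define N 1024   /* logical lines; the device provides N+1 physical lines */

extern void copy_line(uint32_t dst, uint32_t src);   /* assumed PCM line copy */

static uint32_t start = 0;   /* rotation offset                       */
static uint32_t gap   = N;   /* physical index of the unused gap line */

/* Logical-to-physical mapping: rotate by 'start', then skip the gap. */
uint32_t map_line(uint32_t logical)
{
    uint32_t pa = (logical + start) % N;
    if (pa >= gap)
        pa += 1;
    return pa;
}

/* Called once every K writes: move the gap down by one line by copying
 * its neighbour into it; a full rotation advances 'start'. */
void move_gap(void)
{
    if (gap == 0) {
        copy_line(0, N);             /* gap wraps from line 0 back to line N */
        gap = N;
        start = (start + 1) % N;
    } else {
        copy_line(gap, gap - 1);
        gap -= 1;
    }
}
```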
User-mode access to Moneta SSD
• What is needed to let applications access Moneta directly?
Goal for Moneta-Direct
User-mode access to Moneta SSD
• What is needed to let applications access Moneta directly?
• User-space driver
• Virtualization / many channels (many processes)
• Protection
User-space driver
• What is it for?
• How do we access it?
User-space driver
• What is it for?
  • Knows the protocol for talking to the SSD
  • Issues requests, waits for responses
• Function:
  • Call the kernel to do open/close
  • Load information about the file into user space – where its blocks are located
  • Implement read/write/sync in user space
• How do we access it?
  • Idea 1: add a new API
  • Idea 2: relink programs against a new implementation of the I/O syscalls
  • Idea 3: use LD_PRELOAD to dynamically link programs against the new implementation (see the sketch below)
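A sketch of Idea 3: compile this into a shared library and run the unmodified program with LD_PRELOAD pointing at it. The usrio_* helpers are assumptions standing in for the user-space driver.

```c
#define _GNU_SOURCE
#include <dlfcn.h>
#include <unistd.h>

ssize_t usrio_read(int fd, void *buf, size_t n);   /* assumed: user-space driver */
int     usrio_owns_fd(int fd);                     /* assumed: is fd on the fast SSD? */

/* Interposed read(): files on the fast device are served entirely in
 * user space; everything else falls through to the real libc read(). */
ssize_t read(int fd, void *buf, size_t count)
{
    static ssize_t (*real_read)(int, void *, size_t);
    if (!real_read)
        real_read = (ssize_t (*)(int, void *, size_t))dlsym(RTLD_NEXT, "read");

    if (usrio_owns_fd(fd))
        return usrio_read(fd, buf, count);
    return real_read(fd, buf, count);
}
```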
Virtualization – many channels
• Each process needs a separate connection to the SSD
  • Submits its own requests, should not see requests of other processes
  • The interface to the SSD includes privileged state – control over the whole device
• Channels: a safe user-mode connection to the device
  • Supports 1000+ registers
  • Implemented as a memory-mapped region with a command register, …
• How to limit what addresses can be used for DMA?
  • If unrestricted, a process could access any physical memory
  • Solution: the kernel provides a DMA buffer for each channel; the channel can only DMA to/from that buffer (see the sketch below)
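A sketch of how a per-process channel might be set up; the device node, offsets, and sizes are hypothetical.

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Each process gets its own slice of the device: a memory-mapped set
 * of channel registers it may program directly, plus a kernel-provided
 * DMA buffer that is the only memory this channel may DMA to or from. */
struct channel {
    volatile uint64_t *regs;   /* this channel's command/status registers */
    void              *dma;    /* this channel's DMA buffer               */
};

int open_channel(struct channel *ch)
{
    int fd = open("/dev/nvm_ssd", O_RDWR);   /* hypothetical device node */
    if (fd < 0)
        return -1;

    ch->regs = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    ch->dma  = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 4096);
    close(fd);   /* mappings remain valid after close */

    return (ch->regs == MAP_FAILED || ch->dma == MAP_FAILED) ? -1 : 0;
}
```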
Enforcing protection
• What do we need to enforce for user-level access to submit I/O requests?
Enforcing protection
• What do we need to enforce for user-level access to submit I/O requests?
• Only access files you have opened → only access blocks of files you have opened
• Solution: give the SSD information about which blocks are accessible (see the sketch below)
  • Get extents (file offset, range of blocks) from the file system
  • Provide them to the SSD for a channel
  • The SSD caches some set of extents
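A sketch of the SSD-side check against a per-channel extent table; the structure and semantics are an illustration of the idea, not the actual hardware interface.

```c
#include <stdint.h>
#include <stddef.h>

/* Per-channel extent table cached on the SSD (illustrative layout). */
struct extent {
    uint64_t file_off;     /* start offset within the file, in blocks */
    uint64_t phys_block;   /* first physical block                    */
    uint64_t len;          /* length in blocks                        */
};

enum check { ALLOW, MISS };

/* A request is allowed only if it falls entirely inside a cached
 * extent.  A MISS makes the user-space driver call the kernel to
 * install the extent (or learn that it has no right to the blocks)
 * and then retry. */
enum check check_request(const struct extent *tab, size_t n,
                         uint64_t off, uint64_t len, uint64_t *phys)
{
    for (size_t i = 0; i < n; i++) {
        const struct extent *e = &tab[i];
        if (off >= e->file_off && off + len <= e->file_off + e->len) {
            *phys = e->phys_block + (off - e->file_off);
            return ALLOW;
        }
    }
    return MISS;
}
```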
What happens during a request?
• Allocate a tag; get the 16 KB region of the DMA buffer attached to that tag
• Look up the offset in the library's extent map; if not present (soft miss), call the kernel to fetch it
• Submit the request to the SSD
• Poll for completion
• The SSD looks up the block in its extent map to see if it is accessible
  • If not accessible, fail
  • If not present, fail and make the user-space driver call the kernel to provide the info to the SSD (see the sketch below)
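A sketch stitching the steps above into one user-space read path; every helper is an assumption standing in for the real user-space library.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed helpers: alloc_tag() also reserves the DMA-buffer region tied
 * to the tag, lookup_extent() consults the user-level extent map, and
 * kernel_fetch_extent() is the slow path taken on a soft miss. */
int     alloc_tag(void);
void    free_tag(int tag);
int     lookup_extent(uint64_t off, size_t len, uint64_t *phys);
int     kernel_fetch_extent(uint64_t off, uint64_t *phys);
void    submit_cmd(int tag, int cmd, uint64_t phys, size_t len);
void    poll_tag(int tag);
void    copy_from_dma_slot(int tag, void *dst, size_t len);
#define CMD_READ 1

int channel_read(uint64_t file_off, void *dst, size_t len)
{
    int tag = alloc_tag();                       /* tag + its DMA buffer region */

    uint64_t phys;
    if (!lookup_extent(file_off, len, &phys))    /* soft miss in the library map */
        if (!kernel_fetch_extent(file_off, &phys)) {
            free_tag(tag);
            return -1;                           /* kernel refuses: no access */
        }

    submit_cmd(tag, CMD_READ, phys, len);        /* one request to the channel */
    poll_tag(tag);                               /* spin until this tag completes */

    copy_from_dma_slot(tag, dst, len);           /* data arrived in the DMA buffer */
    free_tag(tag);
    return 0;
}
```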
Moneta-Direct results
Summary
• With a fast SSD, HW and SW overheads dominate performance
• HW optimizations:
  • Simpler, single-PIO interface
  • Atomic interfaces for scalability
  • Scheduling
  • Move OS functionality into the device – permission checks
• SW interfaces:
  • Remove unnecessary layers
  • Use polling for short requests
  • Move I/O submission to user space