Providing Atomic Sector Updates in Software for Persistent Memory
Vishal Verma
Vault 2015
Introduction
The Block Translation Table
Read and Write Flows
Synchronization
Performance/Efficiency
BTT vs. DAX
NVDIMMs and Persistent Memory
● NVDIMMs are byte-addressable
● We won't talk of “Total System Persistence”
● But we will talk about using persistent memory DIMMs for storage
● Drivers to present this as a block device - “pmem”
[Figure: the memory hierarchy: CPU caches, DRAM, Persistent Memory, Traditional Storage, trading speed for capacity as you move down]
Problem Statement
• Byte addressability is great
– But not for writing a sector atomically
[Figure: a userspace write() enters the 'pmem' driver (/dev/pmem0), which memcpy()s the data into sectors on the NVDIMM]
Problem Statement
• On a power failure, there are three possibilities
1. No blocks are torn (common on modern drives)
2. A block was torn, but reads back with an ECC error
3. A block was torn, but reads back without an ECC error (very rare on modern drives)
• With pmem, we use memcpy()
– ECC is correct between two stores
– Torn sectors will almost never trigger ECC on the NVDIMM
– Case 3 becomes most common!
– Only file systems with data checksums will survive this case
Naive solution
• Full Data Journaling
• Write every block to the journal first
• 2x latency
• 2x media wear
Slightly better solution
• Maintain an 'on-disk' indirection table and an in-memory free block list
• The map/indirection table has LBA -> actual block offset mappings
• New writes grab a block from free list
• On completing the write, atomically swap the free list entry and map entry

Map (on the NVDIMM):
LBA | Actual
 0  | 42
 1  | 5050
 2  | 314
 3  | 3

Free List (in memory): 0, 2, 12

[Figure: NVDIMM data blocks: block 42 holds LBA 0, block 3 holds LBA 3, block 314 holds LBA 2, block 0 is free]
Slightly better solution
• A write() to LBA 3 grabs block 0 from the free list and writes the new data there
• The old mapping (LBA 3 -> block 3) remains live until the write completes

Map: 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 3 (unchanged so far)
Free List: 0, 2, 12
Slightly better solution
• On completion the swap has happened: map[3] = 0, and block 3 joins the free list

Map: 0 -> 42, 1 -> 5050, 2 -> 314, 3 -> 0
Free List: 3, 2, 12
Slightly better solution
• Easy enough to implement
• Should be performant
• Caveat:
– The only way to recreate the free list is to read the entire map
– Consider a 512GB volume, bs=512 => reading 1073741824 map entries
– Map entries have to be 64-bit, so we end up reading 8GB at startup
– Could save the free list to media on clean shutdown
– But...clunky at best
The Block Translation Table
● nfree: The number of free blocks in reserve
● Info block: Info about arena - offsets, lbasizes etc.
● External LBA: LBA as visible to upper layers
● ABA: Arena Block Address - Block offset within an arena
● Premap/Postmap ABA: The block offset into the data area as seen prior to/post indirection from the map
● Flog: Portmanteau of free list + log
– Has nfree entries
– Each entry has two 'slots' that 'flip-flop'
– Each slot holds: the block being written, the old mapping, the new mapping, and a sequence number
[Figure: the backing store is divided into arenas of up to 512G each; each arena holds an Arena Info Block (4K), the data blocks (with nfree blocks reserved), the BTT Map, the BTT Flog (8K), and an Info Block Copy (4K)]
What's in a lane?
• The idea of “lanes” is purely logical
• num_lanes = min(num_cpus, nfree)
• lane = cpu % num_lanes
• If num_cpus > num_lanes, we need locking on lanes
– But if not, we can simply preempt_disable() and need not take a lock
[Figure: CPU 0, 1 and 2 each call get_lane() and get lanes 0, 1 and 2]

Free List (per lane: blk, seq, slot):
lane 0: blk 2,  seq 0b10, slot 0
lane 1: blk 6,  seq 0b10, slot 1
lane 2: blk 14, seq 0b01, slot 0

Flog (two slots per entry; XX = stale slot):
lane 0: {LBA 5,  old 32, new 2,  seq 0b10} | {XX}
lane 1: {XX} | {LBA 8,  old 38, new 6,  seq 0b10}
lane 2: {LBA 42, old 42, new 14, seq 0b01} | {XX}

Map: 5 -> 2, 8 -> 6, 42 -> 14
Read and Write Flows
BTT – Reading a block
• Convert external LBA to Arena number + pre-map ABA
• Get a lane (and take lane_lock if needed)
• Read map to get the mapping
• If ZERO flag is set, return zeroes
• If ERROR flag is set, return an error
• Read data from the block that the map points to
• Release lane (and lane_lock)
[Figure: CPU 0 on lane 0: a read() of LBA 5 finds map entry 5 -> 10, reads the data from block 10, then releases lane 0]
BTT – Writing a block
• Convert external LBA to Arena number + pre-map ABA
• Get a lane (and take lane_lock if needed)
• Use lane to index into free list, write data to this free block
• Read map to get the existing mapping
• Write flog entry: [premap_aba / old postmap_aba / new postmap_aba / seq]
• Write new post-map ABA into map.
• Write old post-map entry into the free list
• Calculate next sequence number and write into the free list entry
• Release lane (and lane_lock)
[Figure: CPU 0 on lane 0 handles a write() to LBA 5. Free List[0] = {blk 2, seq 0b10, slot 0}; the data is written to block 2; the old map has 5 -> 10; flog[0][0] = {5, 10, 2, 0b10} is written; then map[5] = 2; then free[0] = {10, 0b11, slot 1}; finally lane 0 is released]
BTT – Analysis of a write
[Figure: the write flow from the previous slide, with markers at each of the opportunities for interruption/power failure]
BTT – Analysis of a write
[Write flow figure: the failure strikes before any on-disk change]
• On reboot:
– No on-disk change had happened, everything comes back up as normal
BTT – Analysis of a write
[Write flow figure: the failure strikes after the data write, before the flog update]
• On reboot:
– Map hasn't been updated
– Reads will continue to get the 5 → 10 mapping
– Flog will still show '2' as free and ready to be written to
BTT – Analysis of a write
[Write flow figure: the failure strikes after the flog write, before the map update]
• On reboot:
– Read flog[0][0] = {5, 10, 2, 0b10}
– Flog claims map[5] should have been '2', but map[5] is still '10' (== flog.old)
– Since flog and map disagree, recovery routine detects an incomplete transaction
– Flog is assumed to be “true” since it is always written before the map
– Recovery routine completes the transaction by updating map[5] = 2; free[0] = 10
BTT – Analysis of a write
[Write flow figure: the flog write itself is torn by the failure]
• Special case, the flog write is torn:
• On reboot:
– Read flog[0][0] = {5, 10, X, 0b11}; flog[0][1] = {X, X, X, 0b01}
– Since seq is written last, the half-written flog entry does not show up as “new”
– Free list is reconstructed using the newest non-torn flog entry flog[0][1] in this case
– map[5] remains '10', and '2' remains free.
• Bit sequence for flog.seq: 01 -> 10 -> 11 -> 01 (wrapping; the later value in this cycle marks the newer slot)
BTT – Analysis of a write
[Write flow figure: the failure strikes after both the flog and the map were updated]
• On reboot:
– Since both flog and map were updated, free list reconstruction will happen as usual
Synchronization
Let's Race! Write vs. Write
CPU 1 CPU 2
write LBA 0 write LBA 0
get-free[1] = 5 get-free[2] = 6
write data - postmap ABA 5 write data - postmap ABA 6
... ...
read old_map[0] = 10 read old_map[0] = 10
write log 0/10/5/xx write log 0/10/6/xx
write map = 5 write map = 6
write free[1] = 10 write free[2] = 10
• Both writers read old_map[0] = 10 and both return 10 to their free lists: block 10 is now "free" twice, and one of the two freshly written data blocks is leaked
• Everything from reading the old map through updating the free list is a critical section
Let's Race! Write vs. Write
● Solution: An array of map_locks indexed by a hash of the premap ABA
CPU 1 CPU 2
write LBA 0; get-free[1] = 5; write_data to 5 write LBA 0; get-free[2] = 6; write_data to 6
lock map_lock[0 % nfree]
read old_map[0] = 10
write log 0/10/5/xx; write map = 5; free[1] = 10
unlock map_lock[0 % nfree] lock map_lock[0 % nfree]
read old_map[0] = 5
write log 0/5/6/xx; write map = 6; free[2] = 5
unlock map_lock[0 % nfree]
Let's Race! Read vs. Write
CPU 1 (Reader) CPU 2 (Writer)
read LBA 0 write LBA 0
... get-free[2] = 6
read map[0] = 5 write data to postmap block 6
start reading postmap block 5 write meta: map[0] = 6, free[2] = 5
... another write LBA 12
... get-free[2] = 5
... write data to postmap block 5
finish reading postmap block 5
BUG! – writing a block that is being read from
● This doesn't corrupt on-disk layout, but the read appears torn
Let's Race! Read vs. Write
CPU 1 (Reader) CPU 2 (Writer)
read LBA 0 write LBA 0
read map[0] = 5 get-free[2] = 6; write data
write rtt[1] = 5 write meta: map[0] = 6, free[2] = 5
start reading postmap block 5 another write LBA 12
... get-free[2] = 5
... scan RTT – '5' is present - wait!
finish reading postmap block 5 ...
clear rtt[1] ...
write data to postmap block 5
● Solution: A Read Tracking Table indexed by lane, tracking in-progress reads
Performance/Efficiency
That's Great...but is it Fast?
● Overall, the BTT introduces a ~10% performance overhead
● We think there is still room for improvement
                      512B block size    4K block size
Write amplification   ~4.6% [536B]       ~0.5% [4120B]
Capacity overhead     ~0.8%              ~0.1%
BTT vs. DAX
● DAX stands for Direct Access
● Patchset by Matthew Wilcox, merged into 4.0-rc1
● Allows mapping a pmem range directly into userspace via mmap
● DAX is fundamentally incompatible with the idea of BTT
● If the application is aware of persistent, byte-addressable memory, and can use it to an advantage, DAX is the best path for it
● If the application relies on atomic sector update semantics, it must use the BTT
– It may not know that it relies on this...
● XFS relies on journal updates being sector atomic
– For xfs-dax, we'd need to use logdev=/dev/[btt-partition]
Resources
● http://pmem.io - General persistent memory resources. Focuses on the NVML, a library to make persistent memory programming easier
● The 'pmem' driver on github: https://github.com/01org/prd
● linux-nvdimm mailing list: https://lists.01.org/mailman/listinfo/linux-nvdimm
● linux-nvdimm patchwork: https://patchwork.kernel.org/project/linux-nvdimm/list/
● #pmem on OFTC