PMFS - pages.cs.wisc.edu
TRANSCRIPT
PMFS
CS 839 - Persistence
Learning objectives
• Appreciate the difference between redo & undo logging
• Understand how hardware features can optimize PM software
• Understand XXX
Project
• Proposals should be presented in class on 10/14 – 2 weeks
• Some ideas are up on the web page
• Work in groups of 2-3; I can help find groups if you need
• You can overlap with your own research & courses, but needs to be a distinct effort (can’t turn in same regular-sized project for two courses)
Notes from reviews
• When making a claim, please explain:
• “Mapping PM to kernel's virtual address space is not secure.”
• Please say why: what bad thing could happen, and why this is different than current systems. If they already have the same flaw, it is generally an accepted risk
• How should we think about papers on PM before PM was commercially available? What do we expect for evaluation?
Background story
• Intel is developing 3D XPoint
• Others have proposed various crazy file system designs
• Their engineers/researchers have deep knowledge of real architectural issues and features
• Seek to show various low-level options
What are the biggest ideas in PMFS?
Big ideas in PMFS
• Fine-grained redo logging
• Leveraging unique Intel processor features for PMFS
• Hardware transactions
• Memory ordering rules
• Write-protect disable
• Intel’s programming model for PM (with hardware)
• Use of large pages
What are the concerns PMFS addresses?
• How to control ordering
• How to provide atomicity
• How to handle memory mapping
• How to handle stray writes
• …
Real Intel support for persistent memory
• Problem: how to implement flush/fence operations?
• Options:
• MTRR to enable write-through caching
• NTSTORE – what happens if the data is in the cache?
• SFENCE – only waits for data to hit the controller, not memory
• Solution:
• CLFLUSHOPT – asynchronous flush
• PM_WBARRIER – guarantees durability of data previously flushed
• All flushes from any core? Only flushes from this core?
Hardware support for atomic & ordered writes
• 8 byte: regular store instruction
• 16 byte: cmpxchg16b – compare with RDX:RAX and swap with data from RCX:RBX
• Useful for setting size & modification time atomically (the BPFS problem)
• 64-byte transactions – use Intel hardware transactional memory
• XBEGIN
• Write to a cache line multiple times
• XEND
• CLFLUSH
• Writes to the same cacheline happen in order
• STORE A, 14
• STORE A+8, 27
• Guarantees 27 never reaches pmem if 14 doesn't also reach pmem
Mapping PM into kernel address space
• Why do this?
[Diagram: volatile data and PM side by side in the kernel's address space]
Protection from stray writes
• How real is the problem?
• How is this handled in normal file systems?
Handling stray writes
• Hardware protection: block writes in hardware
• Software protection: block writes in software
• Type safety
• Software fault isolation
• Software protection: hide PM
• Map it far away from anything else
• Don't tell normal code the real address
• Only reveal the real address to internal, super-correct code
[Diagram: PM mapped at a distant address, far from normal data]
Write protection in hardware
• What hardware features are possible?
• Kernel on kernel
• Kernel on user
• User on kernel
• User on user
Idea for protection
• Have accessor functions for reading/writing PM
• Normal stores shouldn't have access
• Accessors enable access, do the write, disable access
• Example: TX_BEGIN allows access / TX_END removes access
• Example: D_RW(ptr) allows access, *ptr does not
Write protection in hardware
• What hardware features are possible?
• Kernel on kernel
• CR0: disable write protection, for compatibility with the 80386 (1985)
• Kernel on user
• SMAP: blocks kernel access to user data – prevents tricking the kernel into revealing user data (2012)
• User on kernel
• Normal page permissions
• User on user
• MPK: memory protection keys allow turning access on/off from usermode
[Diagram: with CR0.WP=1, normal data is R/W but PM is read-only; with CR0.WP=0, PM becomes R/W]
Intel Memory Protection Keys (MPK)
• Available in Skylake server CPUs
• Tag memory pages with a PKEY stored in the page table entry (PTE)
• A per-core permission register (PKRU) holds read/write-disable bits for each of the 16 keys
• Userspace instruction to update the PKRU
• Fast switch: roughly 50 cycles (measured range 11 – 260 cycles/switch)
• By itself, MPK does not protect against malicious attacks.
[Diagram: address-space pages tagged with PKEYs in their PTEs; the PKRU register in the CPU core gates per-key read/write access]
Using Memory protection keys
Safe_write(object *ptr, object &some_data) {
    MPK_WRPKRU(0b0...01100);   // update PKRU to permit the PM write
    *ptr = some_data;          // write to PM
    MPK_WRPKRU(0b0...00000);   // restore PKRU, revoking access
}
Efficient layout
• What layout changes are possible with NVM?
Efficient layout
• What layout changes are possible with NVM?
• Use memory-optimized data structures: B-tree
• (was also good for disk …, but can choose different-size blocks)
• Allocate blocks at MMU sizes: 4 KB, 2 MB, 1 GB
• Allows use of huge pages in the TLB
• Policy: when to use large pages?
What consistency mechanism is best? Why?
• Logging/journaling?
• Good for small updates – not much data to write twice
• Low write amplification
• Shadow paging/CoW?
• Good for large writes – avoid writing twice
Undo/redo logging
Initial state: x = 2, y = 2
TX_BEGIN
x = 3
y = x
TX_COMMIT

Undo logging:
write(log, x = 2);   // save old value of x
CLFLUSH(log); MFENCE; WBARRIER
x = 3;
write(log, y = 2);   // save old value of y
CLFLUSH(log); MFENCE; WBARRIER
y = x;
CLFLUSH(x); CLFLUSH(y); MFENCE; WBARRIER

Redo logging:
tmp_x = 3;
tmp_y = tmp_x;
write(log, x = 3);   // record new values
write(log, y = 3);
CLFLUSH(log); MFENCE; WBARRIER
x = tmp_x;
y = tmp_y;
CLFLUSH(x); CLFLUSH(y); MFENCE; WBARRIER

Is the final WBARRIER needed?
What makes one better?
Undo logging
+ directly read/write to new data
- More flushes/fences: before every write
Redo logging
+ only need to write log before end of transaction
+ only one fence between log and data writes
- Need to store new values somewhere else, track where they are
PMFS log record
• Records are 64 bytes
• Stored in a circular buffer
• COMMIT record indicates TX committed
Efficient log management
• Log is circular buffer
• On failure: how know what entries are valid?
[Diagram: circular log containing entries X=1, Y=2, S=3, COMMIT, F=4, G=7]
Efficient log management
• Add generation ID to each entry
• After failure: only entries with latest generation_ID are valid
• Rely on ordered writes to a cacheline
• Write the gen_ID last to commit a log entry
[Diagram: log entries tagged with generation IDs — X=1, Y=2, S=3, COMMIT at gen 1; F=4, G=7 at gen 2]
Testing
What can go wrong with code?
• Missing flushes
• Missing fences
• Missing WBARRIERS
How test?
• Collect a trace of store/fence/wbarrier operations
• Replay all possible orderings of stores
• Reordering stores
• Crashing before a wbarrier
• Check data-structure consistency everywhere
• Is a doubly-linked list still doubly linked?
Memory mapping
• Map PM pages right into user address spaces
• For mmap: register a page-fault handler
• Attach the PM page on fault
• For read/write: on access, copy directly from PM page to user buffer (bypass page cache)
Evaluation
• PM emulator platform
• Intel special sauce – modified microcode in the processor
• Add latency periodically to model added latency of PM
• Throttle bandwidth of access to model lower bandwidth
• How good is this compared to gem5/software emulators?
Evaluation
• File-based access: file I/O & utilities [charts]
Evaluation
3. Memory-mapped I/O [charts]
Evaluation
3. Memory-mapped I/O: Neo4j graph database; logging overhead [charts]
Evaluation
4. Write protection [charts]
Outcome
• PMFS showed what was possible, but …
• It didn’t scale well to large numbers of cores
• It had bugs – writing an FS is hard
• But applying the most important concepts to ext4 gave good/better performance:
• Using large pages
• Bypassing the page cache
• Fine-grained logging (maybe – not sure)
Questions from reviews
• Application control over durability?
• Why map all PM into kernel address space?
• Should we buffer some data in DRAM for performance?
• What happened to PMFS?
• Could we use PMFS as a cache in front of HDD?
• Why can we skip the final wbarrier?
• How good is huge page support?