userfaultfd and post-copy migration
Userfaultfd and Post-Copy Migration
Mike Rapoport
Outline
● Migration background
● Userfaultfd
● Post-copy migration
Migration: why?
● Spectacular
● Stateful applications with no downtime
  ○ Hardware upgrades
  ○ Software upgrades requiring a reboot
● Load balancing
Migration: how?
● Very simple
  ○ Save state on source
  ○ Copy state to destination
  ○ Restore state on destination
● Memory is the heaviest part
  ○ Pre-copy vs post-copy
Migration flows
Pre-copy
● Track memory, copy inactive part
● Freeze on source
● Copy state and remaining memory
● Unfreeze on destination
Post-copy
● Freeze on source
● Copy state except memory
● Enable “remote swap”
● Unfreeze on destination
● Bring memory on demand
Pre-copy
[Timeline diagram: prepare → memory copy 1 → … → memory copy n → freeze → state copy → unfreeze. States along the time axis: running on source, stopped, running on dest.]
Post-copy
[Timeline diagram: prepare → freeze → state copy → unfreeze → rest of the memory, with remote page faults after the unfreeze. States along the time axis: running on source, stopped, running on dest.]
Pre-copy vs post-copy
https://youtu.be/lo2JJ2KWrlA
Pre-Copy
+ Less vulnerable to node failures
+ High performance in “UP” state
- Longer downtime
- Might diverge
Post-Copy
- More vulnerable to node failures
- Slowdown after migration
+ Shorter downtime
+ Predictable downtime
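The “might diverge” and “longer downtime” entries can be illustrated with a toy model. The sketch below is not from the talk: it assumes a constant link bandwidth and a constant dirty rate (both hypothetical parameters, in pages per second), and `precopy_remaining` is a name invented here. It computes how many pages are still left to copy when the workload is finally frozen.

```c
#include <assert.h>

/* Toy model of pre-copy (an assumption of this sketch, not the talk's):
 * `total` pages of memory, a link that moves `bw` pages per second, and a
 * workload that dirties `dirty` pages per second. Each round resends the
 * pages dirtied while the previous round was on the wire.
 * Returns the number of pages left for the frozen phase once the remaining
 * set is at most `threshold`, or -1 if that has not happened after
 * `max_rounds` rounds (the migration diverges). */
long precopy_remaining(long total, long bw, long dirty,
                       long threshold, int max_rounds)
{
    assert(total > 0 && bw > 0 && dirty >= 0 && threshold >= 0);

    long to_send = total;
    for (int round = 0; round < max_rounds; round++) {
        if (to_send <= threshold)
            return to_send;
        /* The round lasts to_send / bw seconds; pages dirtied meanwhile: */
        long dirtied = dirty * to_send / bw;
        if (dirtied > total)
            dirtied = total;        /* cannot dirty more than exists */
        to_send = dirtied;
    }
    return to_send <= threshold ? to_send : -1;
}
```

With 1000 pages, a bandwidth of 100 and a dirty rate of 50, the remaining set halves every round and the model stops with 7 pages left; with a dirty rate equal to or above the bandwidth it never shrinks and the function reports -1. Post-copy has no such loop: it freezes once and copies only the non-memory state, which is why its downtime is listed as predictable.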
Userfaultfd highlights
● Delegation of page faults to userspace
● File descriptor with ioctls for control
● Poll and read to get page fault notifications
● mcopy_atomic to “map” the page
  ○ Can handle zero pages
Userfaultfd setup
● Initialize user fault page descriptor
  ○ uffd = syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
● API handshake
  ○ ioctl(uffd, UFFDIO_API, &uffdio_api);
● Register range
  ○ uffdio_register.range.start = (unsigned long) start;
  ○ uffdio_register.range.len = nr_pages * page_size;
  ○ uffdio_register.mode = UFFDIO_REGISTER_MODE_MISSING;
  ○ ioctl(uffd, UFFDIO_REGISTER, &uffdio_register);
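The three setup steps can be put together roughly as follows. This is a minimal sketch rather than code from the talk: `uffd_setup` is a name invented here, error handling is reduced to returning -1 with `errno` preserved, and it assumes the range is page-aligned anonymous memory. Note the `api.api = UFFD_API` assignment, which the handshake needs and the slide leaves implicit.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Open a userfaultfd, perform the API handshake and register
 * [start, start + len) for missing-page faults.
 * Returns the descriptor, or -1 with errno set (for instance when the
 * kernel lacks userfaultfd or the caller is not allowed to use it). */
int uffd_setup(void *start, size_t len)
{
    struct uffdio_api api;
    struct uffdio_register reg;
    int saved;

    int uffd = (int)syscall(__NR_userfaultfd, O_CLOEXEC | O_NONBLOCK);
    if (uffd < 0)
        return -1;

    memset(&api, 0, sizeof(api));
    api.api = UFFD_API;          /* features left at 0: baseline API only */
    if (ioctl(uffd, UFFDIO_API, &api) < 0)
        goto fail;

    memset(&reg, 0, sizeof(reg));
    reg.range.start = (unsigned long)start;
    reg.range.len = len;
    reg.mode = UFFDIO_REGISTER_MODE_MISSING;
    if (ioctl(uffd, UFFDIO_REGISTER, &reg) < 0)
        goto fail;

    /* On success the kernel reports which ioctls work on this range. */
    assert(reg.ioctls & ((__u64)1 << _UFFDIO_COPY));
    return uffd;

fail:
    saved = errno;               /* close() may overwrite errno */
    close(uffd);
    errno = saved;
    return -1;
}
```

A caller would typically `mmap()` an anonymous region first and pass its address and length. Because the descriptor is opened with `O_NONBLOCK`, a later `read()` returns `EAGAIN` when no event is queued, which is why the handler waits with `poll()`.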
Page fault handling
● Wait for event
  ○ pollfd[0].fd = uffd;
  ○ pollfd[0].events = POLLIN;
  ○ poll(pollfd, 1, -1);
● Read the event
  ○ read(uffd, &msg, sizeof(msg));
  ○ if (msg.event != UFFD_EVENT_PAGEFAULT)
  ○     oops...
  ○ faulting_address = msg.arg.pagefault.address;
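The wait-and-read step can be wrapped in a small helper along these lines. It is a sketch under assumptions: `uffd_wait_fault` is a hypothetical name, and any event other than a page fault is simply reported as an error, standing in for the “oops” on the slide.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <poll.h>
#include <stddef.h>
#include <unistd.h>

/* Block until the descriptor delivers one event.
 * Returns 0 and stores the faulting address in *addr for a page fault,
 * or -1 on I/O errors, short reads, or any other event type. */
int uffd_wait_fault(int uffd, unsigned long *addr)
{
    struct pollfd pfd = { .fd = uffd, .events = POLLIN };
    struct uffd_msg msg;

    assert(addr != NULL);
    for (;;) {
        if (poll(&pfd, 1, -1) < 0) {
            if (errno == EINTR)
                continue;               /* interrupted by a signal: retry */
            return -1;
        }
        if (pfd.revents & (POLLERR | POLLNVAL))
            return -1;
        if (!(pfd.revents & POLLIN))
            continue;

        ssize_t n = read(uffd, &msg, sizeof(msg));
        if (n < 0) {
            if (errno == EAGAIN || errno == EINTR)
                continue;               /* nothing queued after all */
            return -1;
        }
        if ((size_t)n != sizeof(msg))
            return -1;                  /* events arrive as whole messages */
        break;
    }

    if (msg.event != UFFD_EVENT_PAGEFAULT)
        return -1;
    *addr = (unsigned long)msg.arg.pagefault.address;
    return 0;
}
```

The helper only consumes the notification; the faulting thread stays blocked until the handler resolves the fault.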
Page fault handling
● “Map” normal page
  ○ uffdio_copy.dst = faulting_address;
  ○ uffdio_copy.src = source_page_address;
  ○ uffdio_copy.len = page_size;
  ○ uffdio_copy.mode = 0;
  ○ uffdio_copy.copy = 0;
  ○ ioctl(uffd, UFFDIO_COPY, &uffdio_copy);
● “Map” zero page
  ○ uffdio_zeropage.range.start = faulting_address;
  ○ uffdio_zeropage.range.len = page_size;
  ○ uffdio_zeropage.mode = 0;
  ○ ioctl(uffd, UFFDIO_ZEROPAGE, &uffdio_zeropage);
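Both ways of resolving a fault fit in one helper. The sketch below makes a few assumptions of its own: `uffd_resolve` is a hypothetical name, a NULL source means “map the zero page”, the destination is rounded down to a page boundary as a defensive measure, and `EEXIST` (the page is already populated) is treated as success.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Resolve a missing-page fault at fault_addr.
 * src_page != NULL: copy page_size bytes from src_page (UFFDIO_COPY).
 * src_page == NULL: map the zero page (UFFDIO_ZEROPAGE).
 * Returns 0 on success, -1 with errno set otherwise. */
int uffd_resolve(int uffd, unsigned long fault_addr,
                 const void *src_page, size_t page_size)
{
    /* page_size must be a power of two for the mask below to be valid */
    assert(page_size != 0 && (page_size & (page_size - 1)) == 0);
    unsigned long dst = fault_addr & ~((unsigned long)page_size - 1);
    int ret;

    if (src_page != NULL) {
        struct uffdio_copy copy = {
            .dst = dst,
            .src = (unsigned long)src_page,
            .len = page_size,
            .mode = 0,              /* 0: wake the faulting thread */
        };
        ret = ioctl(uffd, UFFDIO_COPY, &copy);
    } else {
        struct uffdio_zeropage zero = {
            .range = { .start = dst, .len = page_size },
            .mode = 0,
        };
        ret = ioctl(uffd, UFFDIO_ZEROPAGE, &zero);
    }

    /* EEXIST: something else already filled the page; nothing left to do */
    if (ret < 0 && errno != EEXIST)
        return -1;
    return 0;
}
```

Treating `EEXIST` as success is a design choice for this sketch: when faults and background fetches race, the loser finds the page already in place.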
Under the hood
● syscall(__NR_userfaultfd)
  ○ Allocate userfault context
  ○ Create a file hooked to an anonymous inode
  ○ Wait for API handshake
● ioctl(UFFDIO_API)
  ○ Verify that userspace and kernel talk the same language
● ioctl(UFFDIO_REGISTER)
  ○ Find VMA covering desired range
  ○ Make sure the VMA can “user fault”
  ○ Add userfault context to the VMA
Under the hood
● Page fault
  ○ Faulting address covered by VMA with userfault context
  ○ Add “page fault” message to file poll queue
  ○ Wake up process polling the uffd
  ○ Return VM_FAULT_UFFD_RETRY to mm core
● UFFDIO_COPY/UFFDIO_ZEROPAGE
  ○ Allocate a page
  ○ Create a page table entry for faulting address
  ○ Copy the page content from user, or
  ○ Map to zero page
VM post-copy migration
● Guest memory is a part of QEMU address space
● Combine pre- and post-copy
● Straightforward flow
  ○ Start a thread for user fault handling
  ○ Register guest memory areas with userfaultfd
  ○ Guest page fault causes UFFD_EVENT_PAGEFAULT
    ■ Request the page from source
    ■ Copy/zero guest memory upon response
  ○ Fetch non-faulting pages in the background
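The interplay between faulting pages and the background fetch can be sketched as a small bookkeeping model. None of this is QEMU code: `struct postcopy_state`, `postcopy_next`, `postcopy_place` and the 8-page guest are hypothetical, and a single outstanding fault is assumed for brevity. The point is only that a page a thread is blocked on is requested before the background sweep continues.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NPAGES 8                /* hypothetical guest size, in pages */

struct postcopy_state {
    bool received[NPAGES];      /* pages already placed in guest memory */
    long faulted;               /* page a thread is blocked on, or -1 */
    size_t cursor;              /* next candidate for background fetch */
};

/* Choose the next page to request from the source: a faulting page has
 * priority, otherwise continue the background sweep.
 * Returns -1 once every page has arrived. */
long postcopy_next(struct postcopy_state *s)
{
    assert(s->faulted < NPAGES);
    if (s->faulted >= 0 && !s->received[s->faulted])
        return s->faulted;
    while (s->cursor < NPAGES && s->received[s->cursor])
        s->cursor++;
    return s->cursor < NPAGES ? (long)s->cursor : -1;
}

/* Record that a page arrived. A real handler would issue UFFDIO_COPY or
 * UFFDIO_ZEROPAGE at this point. */
void postcopy_place(struct postcopy_state *s, long page)
{
    assert(page >= 0 && page < NPAGES);
    s->received[page] = true;
    if (s->faulted == page)
        s->faulted = -1;
}
```

Starting from `{ .faulted = -1 }`, the sweep begins at page 0; if a fault on page 5 is recorded in `faulted`, the next request is for page 5, after which the sweep resumes at the lowest page not yet received.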
CRIU + post-copy migration
● Different address spaces
  ○ Restore controller
  ○ Restored processes
● Basic flow similar to VMs
  ○ Start a daemon for user fault handling
  ○ Register restored process areas with userfaultfd
    ■ Might be quite a few uffds
  ○ Handle page faults
  ○ Fetch non-faulting memory in the background
● BUT
Non-cooperative userfaultfd
● Page fault cannot block restorer
  ○ Use UFFDIO_WAKE ioctl
● Processes change mappings on the fly
  ○ fork()
  ○ madvise(..., MADV_DONTNEED)
  ○ mremap()
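The slide names UFFDIO_WAKE without showing it, and how CRIU uses it is not spelled out here. As a reminder of what the ioctl itself does, the sketch below fills a range without waking anyone (`UFFDIO_COPY_MODE_DONTWAKE`) and then wakes the threads blocked on faults in that range as a separate step. `uffd_copy_then_wake` is a hypothetical helper, not a CRIU function.

```c
#include <assert.h>
#include <errno.h>
#include <linux/userfaultfd.h>
#include <stddef.h>
#include <sys/ioctl.h>

/* Copy len bytes from src to dst without waking faulting threads, then
 * wake everything blocked on faults inside [dst, dst + len).
 * Returns 0 on success, -1 with errno set otherwise. */
int uffd_copy_then_wake(int uffd, unsigned long dst,
                        const void *src, size_t len)
{
    assert(src != NULL && len > 0);

    struct uffdio_copy copy = {
        .dst = dst,
        .src = (unsigned long)src,
        .len = len,
        .mode = UFFDIO_COPY_MODE_DONTWAKE,   /* defer the wakeup */
    };
    /* EEXIST means the range is already populated; still wake waiters */
    if (ioctl(uffd, UFFDIO_COPY, &copy) < 0 && errno != EEXIST)
        return -1;

    struct uffdio_range range = { .start = dst, .len = len };
    if (ioctl(uffd, UFFDIO_WAKE, &range) < 0)
        return -1;
    return 0;
}
```

Separating the copy from the wakeup lets a handler populate several pages first and release the waiting threads when it chooses to.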
Future
● Kernel WIP
  ○ Write protected pages
  ○ fork, madvise, mremap events
  ○ hugetlbfs
  ○ tmpfs
● CRIU
  ○ Make it work? ;-)
References
● https://www.kernel.org/doc/Documentation/vm/userfaultfd.txt
● http://wiki.qemu.org/Features/PostCopyLiveMigration
● https://criu.org/Userfaultfd