red hat, inc. - kernel-based virtual machine€¦ · copyright © 2016 red hat inc. 21 struct...
TRANSCRIPT
![Page 1: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/1.jpg)
Copyright © 2016 Red Hat Inc.
Userland Page Faults and BeyondUserland Page Faults and BeyondWhy How and What’s NextWhy How and What’s Next
Red Hat, Inc.
Andrea Arcangeli <aarcange at redhat.com>
LinuxCon North America, Toronto
24 Aug 2016
![Page 2: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/2.jpg)
Copyright © 2016 Red Hat Inc.
2
Topics● Normally page faults are a kernel internal thing..
– Why offload page faults to userland?● Initial use case that required it
● Upstream/production status● How the userfaultfd API works● Other use cases● Development status● Demo
![Page 3: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/3.jpg)
Copyright © 2016 Red Hat Inc.
3
Why: Memory Externalization● Memory externalization is about running a program with
part (or all) of its memory residing on a remote node
● Memory is transferred from the memory node to the compute node on access
● Memory can be transferred from the compute node to the memory node if it's not frequently used during memory pressure
Node 0Compute nodeLocal Memory
Node 0Memory nodeRemote Memory
userfault
Node 0Compute nodeLocal Memory
Node 0Memory nodeRemote Memory
memorypressure
![Page 4: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/4.jpg)
Copyright © 2016 Red Hat Inc.
4
Postcopy live migration● Postcopy live migration is a form of memory
externalization
● When the QEMU compute node (destination) faults on a missing page that resides in the memory node (source) the kernel has no way to fetch the page
– Solution: let QEMU in userland handle the pagefault
Partially funded by the Orbit European Union project
Node 0Compute nodeLocal Memory
Node 0Memory nodeRemote Memory
userfault
QEMUsource
QEMUdestination
Postcopy live migration
![Page 5: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/5.jpg)
Copyright © 2016 Red Hat Inc.
5
SourceNode
DestinationNode
uffd postcopy live migration
QEMUdestination
userfaultfd threadBlocked poll()/read()
Kernel/guest modedestination
vcpuN RUNNINGMemory is missing
QEMUSource
Memory
Userfaultfd
Use
rlan
dN
etw
ork
prot
ocol
![Page 6: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/6.jpg)
Copyright © 2016 Red Hat Inc.
6
SourceNode
DestinationNode
userfaultfd event notification
QEMUdestination
userfaultfd threadPOLLIN/readwake
Kernel/guest modedestination
Blocks in-kernelvcpuN thread
QEMUSource
Memory
Userfaultfd
Address of the faultU
serl
and
Net
wor
k pr
otoc
ol
![Page 7: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/7.jpg)
Copyright © 2016 Red Hat Inc.
7
SourceNode
DestinationNode
QEMU network page request
QEMUdestination
userfaultfd threadsends page request
Kernel modedestination
Blocked in-kernelvcpuN thread
QEMUSource
Memorygets page request
Use
rlan
dN
etw
ork
prot
ocol
Page
req
uest
Userfaultfd
![Page 8: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/8.jpg)
Copyright © 2016 Red Hat Inc.
8
SourceNode
DestinationNode
QEMU receives page
QEMUdestination
userfaultfd threadPage received
Kernel modedestination
Blocked in-kernelvcpuN thread
QEMUSource
Memorysends page
Use
rlan
dN
etw
ork
prot
ocol
Page
rec
eive
d
Userfaultfd
![Page 9: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/9.jpg)
Copyright © 2016 Red Hat Inc.
9
SourceNode
DestinationNode
UFFDIO_ZEROPAGE/COPY
QEMUdestination
userfaultfd threadUFFDIO_COPY/...
Kernel modedestination
vcpuN wakenvcpuN RUNNING
QEMUSource
Memory
Use
rlan
dN
etw
ork
prot
ocol
UserfaultfdUFFDIO_COPY
UFFDIO_ZEROPAGE
● vcpuN thread blocked the whole time in kernel mode● vcpuN never returned to userland● vcpuN never received signals and never had to invoke
other syscalls to notify and wake up the userfaultfd thread
![Page 10: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/10.jpg)
Copyright © 2016 Red Hat Inc.
10
Missing pages notification● QEMU destination running in the compute node must be
notified the first time a page fault happens if a page is still missing
Destination guest virtual memory (kernel vma)
Not mapped virtual addresses (pages) must notify userland on access
![Page 11: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/11.jpg)
Copyright © 2016 Red Hat Inc.
11
UFFDIO_COPY - atomic memcpy()
1
2 3 4 5 6
tmp_addr
Guest physical address space
2 3 4 5 6
tmp_addr
Guest physical address space
CopyOf 1
1
![Page 12: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/12.jpg)
Copyright © 2016 Red Hat Inc.
12
userfaultfd latency
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 270
200
400
600
800
1000
1200
1400
1600
userfault latency during postcopy live migration - 10Gbitqemu 2.5+ - RHEL7.2+ - stressapptest running in guest
<= latency in milliseconds
num
ber
of u
serf
aults
Userfaults triggered on pages that were already in network-flight are instantaneous. Background transfer seeks at the last userfault address.
![Page 13: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/13.jpg)
Copyright © 2016 Red Hat Inc.
13
KVM precopy live migration
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
Before precopy
After precopy
During precopy
Precopy never completes until the database benchmark completes
10Gbit NIC120GiB guestDatabase TPM
~120secTime to
Transfer RAMover network
precopy
![Page 14: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/14.jpg)
Copyright © 2016 Red Hat Inc.
14
KVM postcopy live migration
0
100000
200000
300000
400000
500000
600000
700000
800000
900000
Before postcopy
During postcopy
After postcopy
# virsh migrate .. --postcopy --timeout <sec> --timeout-postcopy# virsh migrate .. --postcopy --postcopy-after-precopy
precopy runsFrom 5m
To 7m
postcopy runsFrom 7m
To about ~9mdeterministic
precopy
postcopy
khugepagedcollapses THPs
![Page 15: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/15.jpg)
Copyright © 2016 Red Hat Inc.
15
All available upstream● Userfaultfd() syscall in Linux Kernel >= v4.3● Postcopy live migration in:
– QEMU >= v2.5.0● Author: David Gilbert @ Red Hat Inc.
– Postcopy in Libvirt >= 1.3.4– OpenStack Nova >= Newton
● … and coming soon in production starting with:– RHEL 7.3
![Page 16: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/16.jpg)
Copyright © 2016 Red Hat Inc.
16
What’s Next?● The current upstream kernel support is limited to:
– Missing faults (i.e. missing pages)– Anonymous memory (i.e. malloc)
● What about:– other types of memory
● tmpfs● hugetlbfs● perhaps real filesystem pagecache too
– Write Protect faults– Removing the memory atomically… after adding it with UFFDIO_COPY– sending the opened userfaultfd to a different “manager” process
● so that it can manage the memory behinds its back in a non-cooperative way
![Page 17: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/17.jpg)
Copyright © 2016 Red Hat Inc.
17
userfaultfd API● The ioctl(uffd,...) API is versioned● Can be extended in a backwards compatible way● The current version of the API can already provide
all features mentioned in the previous slide– Thanks to the Linux Kernel community feedback
● Not specific to postcopy:– new way to manage the memory to do things
that weren’t possible before
![Page 18: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/18.jpg)
Copyright © 2016 Red Hat Inc.
18
UFFDIO_APIufd = syscall(__NR_userfaultfd, O_CLOEXEC)
struct uffdio_api api_struct;
api_struct.api = UFFD_API;
/* = 0 → userland asks for default pagefault support
WP support if kernel returns UFFD_FEATURE_PAGEFAULT_FLAG_WP */
api_struct.features = 0;
/* non cooperative mode */
/* api_struct.features = UFFD_FEATURE_EVENT_FORK | UFFD_FEATURE_EVENT_REMAP |
UFFD_FEATURE_EVENT_MADVDONTNEED; */
if (ioctl(ufd, UFFDIO_API, &api_struct)) {
/* err */
}
ioctl_mask = (__u64)1 << _UFFDIO_REGISTER |
(__u64)1 << _UFFDIO_UNREGISTER;
if ((api_struct.ioctls & ioctl_mask) != ioctl_mask) {
/* err */
}
![Page 19: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/19.jpg)
Copyright © 2016 Red Hat Inc.
19
UFFDIO_REGISTER struct uffdio_register reg_struct;
reg_struct.range.start = (uintptr_t)host_addr;
reg_struct.range.len = length;
reg_struct.mode = UFFDIO_REGISTER_MODE_MISSING;
if (ioctl(mis->userfault_fd, UFFDIO_REGISTER, ®_struct)) {
/* err */
}
![Page 20: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/20.jpg)
Copyright © 2016 Red Hat Inc.
20
Wait for event● poll() / epoll() (/ select)
struct pollfd pollfd[1];
pollfd[0].fd = uffd;
pollfd[0].events = POLLIN;
ret = poll(pollfd, 1, -1);
● Read()struct uffd_msg msg;
ret = read(uffd, &msg, sizeof(msg));
● Check uffd_msg eventif (msg.event != UFFD_EVENT_PAGEFAULT) { /* err */ }
offset = (char *)(unsigned long)msg.arg.pagefault.address - area_dst;
![Page 21: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/21.jpg)
Copyright © 2016 Red Hat Inc.
21
struct uffd_msgstruct uffd_msg {
__u8 event;__u8 reserved1;__u16 reserved2;__u32 reserved3;union {
struct {__u64 flags;__u64 address;
} pagefault;struct {
__u32 ufd;} fork;struct {
__u64 from;__u64 to;__u64 len;
} remap;struct {
__u64 start;__u64 end;
} madv_dn;struct {
/* unused reserved fields */__u64 reserved1;__u64 reserved2;__u64 reserved3;
} reserved;} arg;
} __packed;
sizeof(struct uffd_msg) 32bit/64bit ABI enforcementZeros here, can extend with UFFD_FEATURE flags
UFFD_EVENT_* tells which part of the union is valid
Default cooperative support tracking pagefaultsUFFD_EVENT_PAGEFAULT
Non cooperative support tracking MM syscallsUFFD_EVENT_FORKUFFD_EVENT_REMAPUFFD_EVENT_MADVDONTNEED
![Page 22: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/22.jpg)
Copyright © 2016 Red Hat Inc.
22
Handle fault● UFFDIO_COPY
struct uffdio_copy uffdio_copy;
uffdio_copy.dst = (unsigned long) area_dst + offset;
uffdio_copy.src = (unsigned long) area_src + offset;
uffdio_copy.len = page_size;
uffdio_copy.mode = 0;
uffdio_copy.copy = 0;
if (ioctl(uffd, UFFDIO_COPY, &uffdio_copy)) {
/* err */
}
● UFFDIO_ZEROPAGEstruct uffdio_zeropage zero_struct;
zero_struct.range.start = (uint64_t)(uintptr_t)host;
zero_struct.range.len = getpagesize();
zero_struct.mode = 0;
if (ioctl(mis->userfault_fd, UFFDIO_ZEROPAGE, &zero_struct)) {
/* err */
}
![Page 23: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/23.jpg)
Copyright © 2016 Red Hat Inc.
23
Possible use cases● Replace mprotect/mremap+SIGSEGV● Efficient snapshotting● JIT/to-native compilers removal of write bits● Non cooperative usage● CRIU lazy restore from disk● Add robustness to shared memory● Add host enforcement to QEMU balloon driver backend● Distributed shared memory● Obsoletes soft_dirty● Obsoletes “volatile pages” SIGBUS notification
![Page 24: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/24.jpg)
Copyright © 2016 Red Hat Inc.
24
userfaultfd vs SIGSEGV● Note: the translation is not strictly 1:1 and
many more combinations are possible...
UFFDIO_REGISTER_MODE_MISSINGUFFDIO_COPY
UFFDIO_ZEROPAGEpoll() / epoll() / blocking read()
mprotect(PROT_NONE)mprotect(PROT_READ|PROT_WRITE)
SIGSEGV signal
UFFDIO_REGISTER_MODE_WPUFFDIO_WRITEPROTECT
UFFDIO_WRITEPROTECT_MODE_WPpoll() / epoll() / blocking read()
mprotect(PROT_READ)mprotect(PROT_READ|PROT_WRITE)
SIGSEGV signal
UFFDIO_REMAPUFFDIO_*
poll() / epoll() / blocking read()
mremap()SIGSEGV signal
![Page 25: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/25.jpg)
Copyright © 2016 Red Hat Inc.
25
userfaultfd vs SIGSEGV● The poll() / epoll() / read() notification is much faster than
a SIGSEGV signal– No signal masking complexities and inefficiencies– The blocked task blocks in-kernel and never returns to
userland● UFFDIO_ ioctls are faster and more SMP scalable than
their syscall equivalents like mprotect/mremap– All locks are lightweight– “vmas” are never modified in fast paths– No risk of running out “vmas”
https://lab.nexedi.cn/kirr/wendelin.core/blob/master/bigfile/virtmem.c
![Page 26: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/26.jpg)
Copyright © 2016 Red Hat Inc.
26
Efficient Snapshotting● Huawei working on QEMU/KVM postcopy live snapshotting● Redis can use it too● No need of fork()
– Use threads instead● COW faults can be throttled
– COW faults after fork() cannot be throttled● THP always enabled will run optimally
– It is userland that decides the granularity of the fault so if it wants to COW at 4KiB granularity it can
– THP becomes more “transparent” than it already was● Requires WP support
https://lists.nongnu.org/archive/html/qemu-devel/2016-01/msg00664.html
http://redis.io/topics/latency
![Page 27: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/27.jpg)
Copyright © 2016 Red Hat Inc.
27
Optimize away JIT/to-native compilers write bits
● JIT/to-native compilers always set a bit to track which “Cards”/pages were modified
● WP support can allow the garbage collector to track which pages were modified to avoid collecting write bits
● Similar to dirty logging mode in QEMU/KVM● IIRC this was one the targeted optimizations of the OSv
unikernel– userfaultfd might achieve the equivalent result but
without linking the JVM run time in the kernel● Requires WP support
https://medium.com/@MartinCracauer/generational-garbage-collection-write-barriers-write-protection-and-userfaultfd-2-8b0e796b8f7f
http://jcdav.is/2015/11/09/More-JVM-Signal-Tricks/
![Page 28: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/28.jpg)
Copyright © 2016 Red Hat Inc.
28
Non cooperative usage● Virtuozzo and IBM working on it● QEMU is aware of the uffd, the guest apps are not● The app sends a “uffd” to a “manager” and then it keeps
running unaware the “manager” is managing the memory behinds its back
● Container postcopy live migration● fork()/mremap() and other VM syscalls must send
userfaultfd events– Not only page faults will send events to userland
● Nesting of uffds is going to be “interesting”http://www.slideshare.net/kerneltlv/userfaultfd-and-postcopy-migration
![Page 29: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/29.jpg)
Copyright © 2016 Red Hat Inc.
29
CRIU lazy restore from disk● Start the restored task immediately
– Despite the load of its memory from disk isn’t complete
● Conceptually similar to postcopy live migration– The program memory is read from disk
instead of being transferred through the network
https://criu.org/Userfaultfd
![Page 30: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/30.jpg)
Copyright © 2016 Red Hat Inc.
30
Add robustness to shmem● Oracle contributed the UFFDIO_MISSING support to hugetlbfs● IBM contributed the UFFDIO_MISSING support to tmpfs● If the database hits a bug and accidentally writes a byte in a
shmem file hole, a new page is silently allocated and the corruption goes unnoticed
● An accidental read of zeros from a shmem file hole would also go unnoticed
● userfaultfd at zero cost notifies the database that something unexpectedly tried to read or write to a hugetlbfs or tmpfs file hole– “no vma split” and UFFDIO_COPY are the main feature here– For this use case if an userfaultfd event ever happens, it is sign a
bug is causing memory corruption
![Page 31: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/31.jpg)
Copyright © 2016 Red Hat Inc.
31
Host enforcement for memory ballooning
● The QEMU/KVM memory balloon driver is not host enforced
● After the host backend driver calls MADV_DONTNEED, the guest can still deflate the memory balloon
● userfaultfd UFFDIO_MISSING support, at zero cost, can allow QEMU/KVM to notice the guest is malicious and terminate it
![Page 32: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/32.jpg)
Copyright © 2016 Red Hat Inc.
32
Distributed shared memory● Distributed Shared Memory project at Berkeley● Other research interest● Needs UFFDIO_REMAP to extract memory
– mremap() equivalent● Needs WP support to allow read-shared, write-
exclusive model● Can interact with HMM (Heterogeneous Memory
Management) and NVIDIA’s unified memory– GPU userfaults?
![Page 33: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/33.jpg)
Copyright © 2016 Red Hat Inc.
33
Obsoletes soft_dirty● Soft dirty tells userland, which pages were modified during a certain
runtime● It has to scan all pagetables of all memory that could possibly have
been modified to find out– O(N) complexity where N is the number of pages– It still requires an initial wrprotect fault
● WP support is likely to simply obsolete soft_dirty because it will still scale well with an unlimited amount of memory– Similar to Intel VMM PML hw feature (Page Modification Logging)– Main drawback is that userfaultfd blocks the WP fault– It should be possible to add an event async queue model to turn
on and off at run time with an UFFDIO_ASYNC ioctl, to prevent the wrprotect fault to block
![Page 34: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/34.jpg)
Copyright © 2016 Red Hat Inc.
34
Obsoletes volatile pages SIGBUS● “Volatile pages” may be reclaimed and freed by the Linux
Virtual Memory at any time– They could contain uncompressed graphic bitmaps
that can be re-created at low cost● Worth to keep in memory only as long as there is
no Virtual Memory pressure– A prominent “volatile pages” use case is Android
● userfaultfd can tell userland if the volatile page is missing
● No need of their own missing “volatile page” SIGBUS framework to notify userland
![Page 35: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/35.jpg)
Copyright © 2016 Red Hat Inc.
35
Development status● tmpfs, hugetlbfs support works (selftests provided for both)● Write Protect support works
– Needs accuracy with swapping, blocker but in progress..● WP|MISSING support works but it may need some extension
– UFFDIO_COPY will always add the page read-write– Zeropages created by UFFDIO_ZEROCOPY remains readonly forever
● non cooperative support works– Minor issues:
● Cannot recreate shared anon pages● Cannot handle uffd nesting
● UFFDIO_REMAP is implemented but not in aa.git and shall be revisited– Only distributed shared memory needs it
![Page 36: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/36.jpg)
Copyright © 2016 Red Hat Inc.
36
Git userfault branch● https://git.kernel.org/cgit/linux/kernel/git/andrea/aa.git/log/?h=userfault
![Page 37: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/37.jpg)
Copyright © 2016 Red Hat Inc.
37
userfaultfd in action● Time for a live migration demo!
![Page 38: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/38.jpg)
Copyright © 2016 Red Hat Inc.
Live migration total time
Total time0
50
100
150
200
250
300
350
400
450
500
autoconvergepostcopyse
cond
s
![Page 39: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/39.jpg)
Copyright © 2016 Red Hat Inc.
Max UDP latency0
5
10
15
20
25
precopy timeoutautoconvergepostcopyse
cond
s
Live migration max perceived downtime latency
![Page 40: Red Hat, Inc. - Kernel-based Virtual Machine€¦ · Copyright © 2016 Red Hat Inc. 21 struct uffd_msg struct uffd_msg { __u8 event; __u8 reserved1; __u16 reserved2; __u32 reserved3;](https://reader034.vdocument.in/reader034/viewer/2022042202/5ea334cc5b5a4e33ad6aa325/html5/thumbnails/40.jpg)
Copyright © 2016 Red Hat Inc.
40
Q/A