everything's a file descriptor · everything’s a file descriptor josh triplett...

Post on 24-Jul-2020

6 Views

Category:

Documents

0 Downloads

Preview:

Click to see full reader

TRANSCRIPT

Everything’s a File Descriptor

Josh Triplettjosh@joshtriplett.org

Linux Plumbers Conference 2015

“Everything’s a file”

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

I /home/josh/doc/presentations/lpc-2015/fd/fd.pdf

I /etc/hostname

I /dev/null

I /dev/zero

I /dev/ttyS0

I /dev/dri/card0

I /dev/cpu/0/cpuid

I /tmp/.X11-unix/X0

I /proc/1/environ

I /proc/cmdline

I /sys/class/block/sda/queue/rotational

I /sys/firmware/acpi/tables/DSDT

Everything has a filename?

Everything has a filename?

////////////////////////////////Everything has a filename?

I Pipes

I Sockets

I epoll

I memfd

I KVM virtual machines and CPUs

I . . .

Everything’s a file descriptor

I What is a file descriptor, really?

I What can you do with a file descriptor?

I What interesting file descriptors exist?

I How do you build a new type of file descriptors?

I What interesting file descriptors don’t exist?

I What is a file descriptor, really?

I What can you do with a file descriptor?

I What interesting file descriptors exist?

I How do you build a new type of file descriptors?

I What interesting file descriptors don’t exist yet?

What is a file descriptor, really?

I struct fd, struct fdtable

I struct file

struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY);

xdup = dup(x);

y = open("testfile", O_RDONLY);

read(x, &c, 1);

putchar(c);

read(xdup, &c, 1);

putchar(c);

read(y, &c, 1);

putchar(c);

struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY);

xdup = dup(x);

y = open("testfile", O_RDONLY);

read(x, &c, 1);

putchar(c);

read(xdup, &c, 1);

putchar(c);

read(y, &c, 1);

putchar(c);

struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY);

xdup = dup(x);

y = open("testfile", O_RDONLY);

read(x, &c, 1);

putchar(c); /* Prints ’0’ */

read(xdup, &c, 1);

putchar(c);

read(y, &c, 1);

putchar(c);

struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY);

xdup = dup(x);

y = open("testfile", O_RDONLY);

read(x, &c, 1);

putchar(c); /* Prints ’0’ */

read(xdup, &c, 1);

putchar(c); /* Prints ’1’ */

read(y, &c, 1);

putchar(c);

struct fd versus struct file

testfile contains “0123456789”

x = open("testfile", O_RDONLY);

xdup = dup(x);

y = open("testfile", O_RDONLY);

read(x, &c, 1);

putchar(c); /* Prints ’0’ */

read(xdup, &c, 1);

putchar(c); /* Prints ’1’ */

read(y, &c, 1);

putchar(c); /* Prints ’0’ */

struct fd

x

xdup

y

struct file

f pos:f count:

f pos: 0f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 0f count: 1

f pos: 0f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 0f count: 2

f pos: 0f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 0f count: 2

f pos: 0f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 1f count: 2

f pos: 0f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 2f count: 2

f pos: 0f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 2f count: 2

f pos: 1f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 2f count: 2

f pos: 1f count: 1

testfile

0

1

2

3

...

userspace int

kernel object driver-specific

struct fd

x

xdup

y

struct file

f pos: 2f count: 2

f pos: 1f count: 1

testfile

0

1

2

3

...

userspace int kernel object

driver-specific

struct fd

x

xdup

y

struct file

f pos: 2f count: 2

f pos: 1f count: 1

testfile

0

1

2

3

...

userspace int kernel object driver-specific

File descriptor:Userspace reference to

kernel object

What can you do with afile descriptor?

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

I read, write

I seek

I preadv, pwritev

I stat

I Blocking or non-blocking

I poll, select, epoll

I dup, dup2

I Send over a UNIX socket via SCM_RIGHTS

I Inherited over exec

I mmap

I sendfile, splice, tee

I openat

I . . .

I ioctl

Use file descriptors!

What interesting filedescriptors exist?

eventfd

I 64-bit counter used as an event queue

I write: Add value to counterI read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kerneland userspace

eventfd

I 64-bit counter used as an event queue

I write: Add value to counter

I read: Block until non-zero; read value and reset to 0I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kerneland userspace

eventfd

I 64-bit counter used as an event queue

I write: Add value to counterI read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kerneland userspace

eventfd

I 64-bit counter used as an event queue

I write: Add value to counterI read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kerneland userspace

eventfd

I 64-bit counter used as an event queue

I write: Add value to counterI read: Block until non-zero; read value and reset to 0

I “Semaphore mode”: Read 1 and decrement by 1

I poll: Ready for reading if non-zero

I Several drivers use eventfd to signal events between kerneland userspace

timerfd

I Allows handling timers as file descriptors

I Throw them in the poll loop with everything else

I Create with specified timeout

I read: Block until timeout; return number of times expired

I poll: Reading for reading if timeout passed

Signals

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signals

I Receive asynchronous events in a process

I Suspend execution, save registers, move execution to handler

I Restore registers and resume execution when handler done

I Assume a userspace stack to push and pop state

I sigaltstack sets an alternate stack to switch to

I Set up stack to return into call to sigreturn for cleanup

I Can receive signals while in a kernel syscall

I Some syscalls restart afterward

I Syscalls with timeouts adjust them (restart_syscall)

I Other syscalls return EINTR

I Can mask signals to avoid interruption

I Special syscalls that also set signal mask (ppoll, pselect,KVM_SET_SIGNAL_MASK ioctl)

I “async-signal-safe” library functions

Signed-off-by: <(;,;)@r’lyeh>

signalfd

I File descriptor to receive a given set of signals

I Block “normal” signal delivery; receive via signalfd instead

I read: Block until signal, return struct signalfd_siginfo

I poll: Readable when signal received

signalfd

I File descriptor to receive a given set of signals

I Block “normal” signal delivery; receive via signalfd instead

I read: Block until signal, return struct signalfd_siginfo

I poll: Readable when signal received

How do you build a new typeof file descriptor?

Semantics

I read and writeI NothingI Raw dataI Specific data structure

I poll/select/epollI Must match read/write blocking behavior if anyI Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and writeI NothingI Raw dataI Specific data structure

I poll/select/epollI Must match read/write blocking behavior if anyI Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and writeI NothingI Raw dataI Specific data structure

I poll/select/epollI Must match read/write blocking behavior if anyI Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and writeI NothingI Raw dataI Specific data structure

I poll/select/epollI Must match read/write blocking behavior if anyI Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and writeI NothingI Raw dataI Specific data structure

I poll/select/epollI Must match read/write blocking behavior if anyI Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Semantics

I read and writeI NothingI Raw dataI Specific data structure

I poll/select/epollI Must match read/write blocking behavior if anyI Can have pollable fd even if read/write do nothing

I seek and file position

I mmap

I What happens with multiple processes, or dup?

I For everything else: ioctl

Implementation

I anon_inode_getfdI Doesn’t need a backing inode or filesystemI Provide an ops structure and private data pointerI Private data points to your kernel object

I simple_read_from_buffer, simple_write_to_buffer

I no_llseek, fixed_size_llseekI Check file->f_flags & O_NONBLOCK

I Blocking: wait_queue_headI Non-blocking: return -EAGAIN

Implementation

I anon_inode_getfdI Doesn’t need a backing inode or filesystemI Provide an ops structure and private data pointerI Private data points to your kernel object

I simple_read_from_buffer, simple_write_to_buffer

I no_llseek, fixed_size_llseekI Check file->f_flags & O_NONBLOCK

I Blocking: wait_queue_headI Non-blocking: return -EAGAIN

Implementation

I anon_inode_getfdI Doesn’t need a backing inode or filesystemI Provide an ops structure and private data pointerI Private data points to your kernel object

I simple_read_from_buffer, simple_write_to_buffer

I no_llseek, fixed_size_llseek

I Check file->f_flags & O_NONBLOCKI Blocking: wait_queue_headI Non-blocking: return -EAGAIN

Implementation

I anon_inode_getfdI Doesn’t need a backing inode or filesystemI Provide an ops structure and private data pointerI Private data points to your kernel object

I simple_read_from_buffer, simple_write_to_buffer

I no_llseek, fixed_size_llseekI Check file->f_flags & O_NONBLOCK

I Blocking: wait_queue_headI Non-blocking: return -EAGAIN

What interesting filedescriptors don’t exist yet?

Child processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

I fork/clone

I Parent process gets the child PID

I Parent uses dedicated syscalls (waitpid) to wait for child exit

I When child exits, parent gets SIGCHLD signal

I Parent makes waitpid call to get exit status

Problems:

I Waiting not integrated with poll loops

Signals

I Process-global; libraries can’t manage only their own processes

Alternatives

I Set SIGCHLD handler, write to pipe or eventfdI Still process-global; gets all child exit notificationsI Requires coordinating global signal handling between libraries

Signals

I signalfd for SIGCHLDI Still process-global; gets all child exit notificationsI Requires coordinating global signal handling between librariesI Must block SIGCHLD; breaks code expecting SIGCHLD

Alternatives

I Set SIGCHLD handler, write to pipe or eventfdI Still process-global; gets all child exit notificationsI Requires coordinating global signal handling between libraries

Signals

I signalfd for SIGCHLDI Still process-global; gets all child exit notificationsI Requires coordinating global signal handling between librariesI Must block SIGCHLD; breaks code expecting SIGCHLD

clonefd

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

clonefd

I New flag for clone

I Return a file descriptor for the child process

I read: block until child exits, return exit information

I poll: becomes readable when child exits

I Maintains a reference to the child’s task_struct

I Relatively simple, except. . .

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architecture

I Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architectures

I Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparenting

I Work in progress

Complications

Need a new clone system call for the fd out parameter

clone syscall parameters vary by architectureI Avoided in the new syscall

clone is out of parameters (6) on some architecturesI Pass parameters via a struct and size

Low-level copy_thread function grabbed tls parameterdirectly from syscall register arguments; couldn’t move it

I Pass parameter normally via C, fix assembly syscall entryI Fixed with copy_thread_tls (merged in 4.2)

ptrace and reparentingI Work in progress

History and status

I Thiago Macieira originally proposed forkfd to simplify Qt

I Josh and Thiago started on clonefd earlier this year

I Some infrastructure merged into 4.2

I Syscall aimed for future kernel after resolving ptrace issues

File descriptor:Userspace reference to

kernel object

What else can we do with areference to task_struct?

Process IDs

Process IDs

I Small integers used to reference processes

I Used pervasively in process syscalls

I Enumerated as directories in /proc

I Unique within root container

I Container PID namespaces map a subset of these

I PIDs do not hold a reference; can be reused

I Race condition if used from non-parent process

Process IDs

I Small integers used to reference processes

I Used pervasively in process syscalls

I Enumerated as directories in /proc

I Unique within root container

I Container PID namespaces map a subset of these

I PIDs do not hold a reference; can be reused

I Race condition if used from non-parent process

Process IDs

I Small integers used to reference processes

I Used pervasively in process syscalls

I Enumerated as directories in /proc

I Unique within root container

I Container PID namespaces map a subset of these

I PIDs do not hold a reference; can be reused

I Race condition if used from non-parent process

clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

clonefd as process identifier

I Unique across the entire system

I Holds a reference to the process

I Race-free

I Can pass via exec, UNIX sockets

I Allows non-parent processes to obtain exit information

Next steps

I Merge clonefd

I For each PID syscall, add an fd variant

I Add ioctls to obtain process information

I Add process enumeration (next, child, root)

Other future file descriptors

Warning: wild speculation andconjecture ahead

Other future file descriptors

Warning: wild speculation andconjecture ahead

User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

User and group IDs

I Suppose users and groups were unique kernel objects?

I Unique across container user namespaces

I “Get unused user/group”

I Set up arbitrary mappings when mounting a filesystem

I Allow a process to hold multiple credentials (like setgroups)

Filesystem mounts

I Suppose mount returned a directory file descriptor

I openat relative to the filesystem

I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts

Filesystem mounts

I Suppose mount returned a directory file descriptor

I openat relative to the filesystem

I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts

Filesystem mounts

I Suppose mount returned a directory file descriptor

I openat relative to the filesystem

I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts

Filesystem mounts

I Suppose mount returned a directory file descriptor

I openat relative to the filesystem

I Separate call to bind into the filesystem namespace

I Bind existing dirfd for bind mounts

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

Summary

I File descriptor: Userspace reference to kernel object

I Reference-counted, race-free, unambiguous ID

I Well-defined semantics

I Extensive operations

I poll and blocking

I Use file descriptors in new APIs

I Don’t invent new identifier namespaces

top related