principles of computer systems lecturer: chris gregg ......stat, and so forth). the system call...

Lecture 04: Filesystem Data Structures and System CallsLecture 04: Filesystem Data Structures and System Calls

Principles of Computer Systems

Spring 2019

Stanford University

Computer Science Department

Lecturer: Chris Gregg

PDF of this presentation1

https://web.stanford.edu/class/cs110/static/lectures/04-Filesystem-DataStructures-and-System-Calls/lecture-04-filesytem-datastructures-and-system-calls.pdf

Linux maintains a data structure for each active process. These data structures are called processcontrol blocks, and they are stored in the process table.Process control blocks store many things (the user who launched it, what time it was launched, CPUstate, etc.). Among the many items it stores is the descriptor table.Each process maintains its own set of descriptors. Descriptors 0, 1, and 2 are understood to be treatedas standard input, standard output, and standard error, but there are no predefined meanings fordescriptors 3 and up. Descriptors 0, 1, and 2 are most often bound to the terminal.The user program sees a descriptor as the identifier needed to interact with a resource (most often afile) via read, write, and close calls. Internally, that descriptor is an index into the descriptortable.The process control block tracks which descriptors are in use and which ones aren't. When allocating a new descriptor, the OS typically chooses the smallest available number.

Lecture 04: Filesystem Data StructuresLecture 04: Filesystem Data Structures

2

Each table entry tracks information specific to the dynamics of that session. mode tracks whether we'rereading, writing, or both. cursor tracks a position within the file payload. refcount tracks thenumber of descriptors across all processes that refer to that session. (We'll discuss the vnode field in amoment.)The illustration here calls out just one file table entry referenced by process 1001, descriptor 3. A call to open(filename, O_RDONLY) from within the second of three processes might result in the above.


If a descriptor table entry is in use,it maintains alink to an open file table entry. An openfile table entry maintains information about anactive session with a file (or something thatbehaves like a file, like terminal, or a networkconnection).

3

Each process maintains its own descriptor table, butthere is only one, system-wide open file table. This allowsfor file resources to be shared between processes, andwe'll soon see just how common shared file resourcesreally are.


At any one time,there are multipleactive processes,each of whichtypically has many—well, at leastthree—opendescriptors.

As drawn above, descriptors 0, 1, and 2 in each of the three PCBs alias the same three sessions. That'swhy each of the referred table entries have refcounts of 3 instead of 1.This shouldn't surprise you. If your bash shell calls make, which itself calls g++, each of them inserts text into the same terminal window.

4


The data structure stores file type (e.g. regular file, directory, symlink, terminal), a refcount, thecollection of function pointers that should be used to read, write, and otherwise interact with theresource, and, if applicable, a copy of the inode that resides on the filesystem on behalf of that file.In this sense, the vnode is an inode cache that brings information about the file (e.g. file size,owner, permissions, etc.) so that it can be accessed more quickly.

Each of the open file entries maintains access to a vnode,which itself is a structure housing static informationabout a file or file-like resource.

5


There is one system-wide vnode table for the same reason there is one system-wide open filetable. Independent file sessions reading from the same file don't need independent copies of thevnode. They can all alias the same one.

6

Lecture 04: Filesystem Data StructuresLecture 04: Filesystem Data StructuresNone of thesekernel-residentdata structuresare visible tousers. Note thefilesystem itself isa completelydifferentcomponent, andthat filesysteminodes of openfiles are loadedinto vnode tableentries. The yellowinode in the vnodeis an in-memoryreplica of theyellow sliver ofmemory in thefilesystem.

7

Lecture 04: System CallsLecture 04: System Calls

System calls are functions that our programs use to interact with the OS and request some core service be executed ontheir behalf.

Examples of system calls we've seen this quarter: open, read, write, close, stat, and lstat. We'll see mayothers in the coming weeks.Functions like printf, malloc, and opendir aren't themselves system calls. They're C library functions thatthemselves rely on system calls to get their jobs done.

Unlike traditional user functions (the ones we write ourselves, libc and libstdc++ library functions), system callsneed to execute in some privileged mode so they can access data structures, system information, and other OSresources intentionally hidden from user code.The implementation of open, for instance, needs to access all of the filesystem data structures for existence andpermissioning. Filesystem implementation details should be be hidden from the user, and permission information shouldbe respected as private.The information loaded and manipulated by open should be inaccessible to the user functions that call open.

Restated, privileged information shouldn't be discoverable.

That means we need a different call and return model for system calls than we have for traditional functions.

8

Lecture 04: System CallsLecture 04: System CallsRecall that each process operates as if it owns all ofmain memory.The diagram on the right presents a 64-bitprocess's general memory playground thatstretches from address 0 up through and including

2 - 1.CS107 and CS107-like intro-to-architecturecourses present the diagram on the right, anddiscuss how various portions of the address spaceare cordoned off to manage traditional function calland return, dynamically allocated memory, accessglobal data, and machine code storage andexecution.

No process actually uses all 2 bytes of its addressspace. In fact, the vast majority of processes use aminiscule fraction of what they otherwise thinkthey own.

64

64

9

Lecture 04: System CallsLecture 04: System CallsThe code segment stores all of the assembly codeinstructions specific to your process. The address ofthe currently executing instruction is stored in the%rip register, and that address is typically drawnfrom the range of addresses managed by the codesegment.The data segment intentionally rests directly on topof the code segment, and it houses all of the explicitlyinitialized global variables that can be modified bythe program.The heap is a software-managed segment used tosupport the implementation of malloc, realloc,free, and their C++ equivalents. It's initially verysmall, but grows as needed for processes requiring agood amount of dynamically allocated memory.The user stack segment provides the memoryneeded to manage user function call and return alongwith the scratch space needed by functionparameters and local variables.

10

Lecture 04: System CallsLecture 04: System CallsThere are other relevant segments that haven't beencalled out in prior classes—at least not in CS107 here(although some are covered in CS107E).The rodata segment also stores global variables,but only those which are immutable—i.e. constants.As the runtime isn't supposed to change anythingread-only, the segment housing constants can beprotected so any attempts to modify it are blocked bythe OS.The bss segment houses the uninitialized globalvariables, which are defaulted to be zero (one of thefew situations where pure C provides default values).The shared library segment links to sharedlibraries like libc and libstdc++ with code forroutines like C's printf, C's malloc, or C++'sgetline. Shared libraries get their own segment soall processes can trampoline through some glue code—that is, the minimum amount of code necessary—tojump into the one copy of the library code that existsfor all processes.

11

Lecture 04: System CallsLecture 04: System CallsThe user stack maintains a collection of stack frames for thetrail of currently executing user functions.64-bit process runtimes rely on %rsp to track the addressboundary between the in-use portion of the user stack andthe portion that's on deck to be used should the currentlyexecuting function invoke a subroutine.The x86 64 runtime relies on callq and retq instructionsfor user function call and return.

The first sixparameters are passedthrough %rdi, %rsi,%rdx, %rcx, %r8, and%r9. The stack frameis used as generalstorage for partialresults that need to bestored somewhereother than a register(e.g. a seventhincoming parameter)

12

Lecture 04: System CallsLecture 04: System CallsThe user function call and return protocol, however, doeslittle to encapsulate and privatize the memory used by afunction.Consider, for instance, the execution of loadFiles as perthe diagram below. Because loadFiles's stack frame isdirectly below that of its caller, it can use pointer arithmeticto advance beyond its frame and examine—or even update—the stack frame above it.After loadFiles returns, main could use pointer arithmeticto descend into the ghost of loadFiles's stack frame andaccess data loadFiles never intended to expose.Functions aresupposed to bemodular, but thefunction call andreturn protocol'ssupport formodularity andprivacy is pretty soft.

13

Lecture 04: System CallsLecture 04: System CallsSystem calls like open and stat need access to OS implementationdetail that should not be exposed or otherwise accessible to the userprogram.That means the activation records for system calls need to be stored ina region of memory that users can't touch, and the system callimplementations need to be executed in a privileged, superuser modeso that it has access to information and resources that traditionalfunctions shouldn't have.The upper half of a process's address space is kernel space, andnone of it is visible to traditional user code.Housed within kernel space is a kernel stack segment, itself used toorganize stack frames for system calls.callq is used for user function call, but callq would dereference afunction pointer we're not permitted to dereference, since it resides inkernel space.We need a different call and return model for system calls—one thatdoesn't rely on callq.

14

Lecture 04: System CallsLecture 04: System CallsThe relevant opcode is placed in %rax. Each system call has its ownopcode (e.g. 0 for read, 1 for write, 2 for open, 3 for close, 4 forstat, and so forth).The system call arguments—there are at most 6—are evaluated andplaced in %rdi, %rsi, %rdx, %r10, %r8, and %r9. Note the fourthparameter is %r10, not %rcx.The system issues a software interrupt (otherwise known as a trap) byexecuting syscall, which prompts an interrupt handler to execute insuperuser mode.The interrupt handler builds a frame in the kernel stack, executes therelevant code, places any return value in %rax, and then executesiretq to return from the interrupt handler, revert from superusermode, and execute the instruction following the syscall.If %rax is negative, errno is set to abs(%rax) and %rax is updated tocontain a 1. If %rax is nonnegative, it's left as is. The value in is %raxthen extracted by the caller as any return value would be.

15

Lecture 04: System CallsLecture 04: System CallsSystem Calls Summary

We use system calls because we don't want user code to haveaccess to sensitive parts of the system, such as raw hardware ornetworking.It isn't possible to restrict access through the normal call/returnstack-framed-based function calling procedure.

A program can modify any data it has access to, and a program'sentire stack is completely accessible to the program.If the OS let a use program muck around with hardware, shareddata structures (e.g., the open file table), or other sensitiveinformation, this would be a security nightmare.

A system call uses an interrupt to transfer control to the OS kernel.The user can only call a set of well-defined system calls, and there islittle room for a security breach.Once the kernel is running a system call, it is in complete control ofthe system, and can access the necessary resources to fulfill thesystem call's needs.After a system call, the kernel returns control to the user program.

16

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingUntil now, we have been studying how programsinteract with hardware, and now we are going tostart investigating how programs interact with theoperating system.

In the CS curriculum so far, your programs haveoperated in a single process, meaning, basically,that one program was running your code, line-for-line. The operating system made it look like yourprogram was the only thing running, and that wasthat.

Now, we are going to move into the realm ofmultiprocessing, where you control more thanone process at a time with your programs. You willtell the OS, “do these things concurrently”, and itwill.

17

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to Multiprocessing

What is a process?

When you start a program, it runs in a single process. It has a process id (an integer) that the OSassigns. A program can get its own process id with the getpid system call:

All programs are associated with one or more process, and a process is scheduled to run its codeby the OS. The OS has to schedule all currently-running processes (i.e., processes that arewaiting to run their next line of code), and it does so on a time-shared basis.In other words: there may be tens, hundreds, or thousands of processes "running" at once, buton a single-processor system, only one can run at a time, and this is coordinated by the OS.On multi-processor machines (like most modern computers), multiple programs can literally runat the same time, one-per processor.

// file: getpidEx.c#include#include#include // getpid int main(int argc, char **argv){ pid_t pid = getpid(); printf("My process id: %d\n",pid); return 0;}

123456789

1011

cgregg@myth57$ ./getpidEx My process id: 7526

18

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingNew system call: fork

A program may decide that it wants to run multipleprocesses itself. We will see many examples of whya program may want to do this as the courseprogresses.

If a program wants to launch a second process, it uses the fork system call.fork() does exactly this:

It creates a new process that starts on the following instruction after the original, parent,process. The parent process also continues on the following instruction, as well.The fork call returns a pid_t (an integer) to both processes. Neither is the actual pid of theprocess that receives it:

The parent process gets a return value that is the pid of the child process.The child process gets a return value of 0, indicating that it is the child.

The child process does, indeed, have its own pid, but it would need to callgetpid itself to retrieve it.

All memory is identical between the parent and child, though it is not shared (it is copied).19

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingNew system call: fork

The reason that the parent and its child get different return values from fork is twofold:

It differentiates them. It is almost a certainty that the parent and child will have differentobjectives after the fork, and it is useful for a process to know whether it is the parent or thechild.It is useful for the parent to know its child's pid. There is no other way for a parent to easilyget its children's process ids (if a child wants to get its parent's pid, it can call getppid)

Here's a simple program that knows how to spawn new processes. It uses the system calls fork,getpid, and getppid. The full program can be viewed .right here

int main(int argc, char *argv[]) { printf("Greetings from process %d! (parent %d)\n", getpid(), getppid()); pid_t pid = fork(); assert(pid >= 0); printf("Bye-bye from process %d! (parent %d)\n", getpid(), getppid()); return 0;}

1234567

20

http://www.stanford.edu/class/cs110/examples/processes/basic-fork.c


Here is the output of two consecutive runs of the program:

myth60$ ./basic-fork Greetings from process 29686! (parent 29351)Bye-bye from process 29686! (parent 29351)Bye-bye from process 29687! (parent 29686) myth60$ ./basic-fork Greetings from process 29688! (parent 29351)Bye-bye from process 29688! (parent 29351)Bye-bye from process 29689! (parent 29688

123456789

There are a couple of things to note about this program:

The original process has a parent, which is the shell -- that is the program that you run inthe terminal.The ordering of the parent and child output is nondeterministic. Sometimes the parentprints first, and sometimes the child prints first.

int main(int argc, char *argv[]) { printf("Greetings from process %d! (parent %d)\n", getpid(), getppid()); pid_t pid = fork(); assert(pid >= 0); printf("Bye-bye from process %d! (parent %d)\n", getpid(), getppid()); return 0;}

1234567

21

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to Multiprocessingfork is called once, but it returns twice.

fork knows how to clone the calling process, synthesize a virtually identical copy of it, andschedule the copy as if it were running all along.

All segments (data, bss, init, stack, heap, text) are faithfully replicated.All open file descriptors are replicated, and these copies are donated to the clone.

As a result, the output of our program is the output of two processes.

We should expect to see a single greeting but two separate bye-byes.Each bye-bye is inserted into the console by two different processes. The OS's processscheduler dictates whether the child or the parent gets to print its bye-bye first.

getpid and getppid return the process id of the caller and the process id of the caller's parent,respectively.

22

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingYou might be asking yourself, How do I debug two processes at once? This is a very good question! gdbhas built-in support for debugging multiple processes, as follows:

set detachonfork off

This tells gdb to capture any fork'd processes, though it pauses them upon the fork.

info inferiors

This lists the processes that gdb has captured.

inferior X

Switch to a different process to debug it.

detach inferior X

Tell gdb to stop watching the process, and continue it

You can see an entire debugging session on the basicfork program .right here

23

https://web.stanford.edu/class/cs110/examples/processes/basic-fork_gdb.txt

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingDifferences between parent calling fork and child generated by it:

The most obvious difference is that each gets a unique process id. That's important. Otherwise,the OS can't tell them apart.Another key difference: fork's return value in the two processes

When fork returns in the parent process, it returns the pid of the new childWhen fork returns in the child process, it returns 0. Again, that isn't to say the child's pid is 0,but rather that fork elects to return a 0 as a way of allowing the child process to easily self-identify as the child process.The return value can be used to dispatch each of the two processes in a different direction(although in this introductory example, we don't do that).

24

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingDifferences between parent calling fork and child generated by it:

The most obvious difference is that each gets a unique process id. That's important. Otherwise, theOS can't tell them apart.Another key difference: fork's return value in the two processes

When fork returns in the parent process, it returns the pid of the new childWhen fork returns in the child process, it returns 0. Again, that isn't to say the child's pid is 0,but rather that fork elects to return a 0 as a way of allowing the child process to easily self-identify as the child process.The return value can be used to dispatch each of the two processes in a different direction(although in this introductory example, we don't do that).

All data segments are replicated. Aside from checking the return value of fork, there is virtuallyno difference in the two processes, and they both continue after fork as if they were the originalprocess.There is no default sharing of data between the two processes, though the parent process canwait (more below) for child processes to complete.You can use shared memory to communicate between processes, but this must be explicitly set upbefore making fork calls (you will not be responsible for shared memory in this course)

25

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingSecond example: A tree of fork calls

While you rarely have reason to use fork this way, it's instructive to trace through a short programwhere spawned processes themselves call fork. The full program can be viewed .right here

static const char const *kTrail = "abcd";int main(int argc, char *argv[]) { size_t trailLength = strlen(kTrail); for (size_t i = 0; i < trailLength; i++) { printf("%c\n", kTrail[i]); pid_t pid = fork(); assert(pid >= 0); } return 0;}

123456789

10

26

http://www.stanford.edu/class/cs110/examples/processes/fork-puzzle.c

Lecture 04: Introduction to MultiprocessingLecture 04: Introduction to MultiprocessingSecond example: A tree of fork calls

Two samples runs on the rightReasonably obvious: A single a is printed by thesoon-to-be-great-grandaddy process.Less obvious: The first child and the parent eachreturn from fork and continue running in mirrorprocesses, each with their own copy of the global"abcd" string, and each advancing to the i++ linewithin a loop that promotes a 0 to 1. It's hopefullyclear now that two b's will be printed.Key questions to answer:

Why aren't the two b's always consecutive?How many c's get printed?How many d's get printed?Why is there a shell prompt in the middle of theoutput of the second run on the right?

myth60$ ./fork-puzzle a b c b d c d c c d d d d d d myth60$

myth60$ ./fork-puzzle a b b c d c d c d d c d myth60$ d d d

27


Third example: Synchronizing between parent and child using waitpid

waitpid can be used to temporarily block one process until a child process exits.

The first argument specifies the wait set, which for the moment is just the id of the childprocess that needs to complete before waitpid can return.The second argument supplies the address of an integer where termination informationcan be placed (or we can pass in NULL if we don't care for the information).The third argument is a collection of bitwise-or'ed flags we'll study later. For the timebeing, we'll just go with 0 as the required parameter value, which means that waitpidshould only return when a process in the supplied wait set exits.The return value is the pid of the child that exited, or -1 if waitpid was called and therewere no child processes in the supplied wait set.

pid_t waitpid(pid_t pid, int *status, int options);

28



Consider the following program, which is more representative of how fork really gets usedin practice (full program, with error checking, is ):right here

int main(int argc, char *argv[]) { printf("Before.\n"); pid_t pid = fork(); printf("After.\n"); if (pid == 0) { printf("I am the child, and the parent will wait up for me.\n"); return 110; // contrived exit status } else { int status; waitpid(pid, &status, 0) if (WIFEXITED(status)) { printf("Child exited with status %d.\n", WEXITSTATUS(status)); } else { printf("Child terminated abnormally.\n"); } return 0; } }

29

http://cs110.stanford.edu/examples/processes/separate.c



The output is the same every single time the above program is executed. The implementation directs the child process one way, the parent another.The parent process correctly waits for the child to complete using waitpid.The parent lifts child exit information out of the waitpid call, and uses the WIFEXITED macro to examine some high-order bits of its argument to confirm the process exitednormally, and it uses the WEXITSTATUS macro to extract the lower eight bits of itsargument to produce the child return value (which we can see is, and should be, 110).The waitpid call also donates child process-oriented resources back to the system.

myth60$ ./separate Before. After. After. I am the child, and the parent will wait up for me. Child exited with status 110. myth60$

30


Synchronizing between parent and child using waitpid

This next example is more of a brain teaser, but it illustrates just how deep a clone theprocess created by fork really is (full program, with more error checking, is ). The code emulates a coin flip to seduce exactly one of the two processes to sleep for asecond, which is more than enough time for the child process to finish.The parent waits for the child to exit before it allows itself to exit. It's akin to the parent notbeing able to fall asleep until he/she knows the child has, and it's emblematic of the types ofsynchronization directives we'll be seeing a lot of this quarter.The final printf gets executed twice. The child is always the first to execute it, because theparent is blocked in its waitpid call until the child executes everything.

right hereint main(int argc, char *argv[]) { printf("I'm unique and just get printed once.\n"); bool parent = fork() != 0; if ((random() % 2 == 0) == parent) sleep(1); // force exactly one of the two to sleep if (parent) waitpid(pid, NULL, 0); // parent shouldn't exit until child has finished printf("I get printed twice (this one is being printed from the %s).\n", parent ? "parent" : "child"); return 0; }

31

http://cs110.stanford.edu/examples/processes/parent-child.c


Spawning and synchronizing with multiple child processes

A parent can call fork multiple times, provided it reaps the child processes (via waitpid)once they exit. If we want to reap processes as they exit without concern for the order theywere spawned, then this does the trick (full program checking ):right here

int main(int argc, char *argv[]) { for (size_t i = 0; i < 8; i++) { if (fork() == 0) exit(110 + i); } while (true) { int status; pid_t pid = waitpid(-1, &status, 0); if (pid == -1) { assert(errno == ECHILD); break; } if (WIFEXITED(status)) { printf("Child %d exited: status %d\n", pid, WEXITSTATUS(status)); } else { printf("Child %d exited abnormally.\n", pid); } } return 0; }

32

http://web.stanford.edu/class/cs110/examples/processes/reap-as-they-exit.c



Note we feed a -1 as the first argument to waitpid. That -1 states we want to hear aboutany child as it exits, and pids are returned in the order their processes finish.Eventually, all children exit and waitpid correctly returns -1 to signal there are no moreprocesses under the parent's jurisdiction.When waitpid returns -1, it sets a global variable called errno to the constant ECHILD tosignal waitpid returned -1 because all child processes have terminated. That's the "error"we want.

myth60$ ./reap-as-they-exit Child 1209 exited: status 110 Child 1210 exited: status 111 Child 1211 exited: status 112 Child 1216 exited: status 117 Child 1212 exited: status 113 Child 1213 exited: status 114 Child 1214 exited: status 115 Child 1215 exited: status 116 myth60$

myth60$ ./reap-as-they-exit Child 1453 exited: status 115 Child 1449 exited: status 111 Child 1448 exited: status 110 Child 1450 exited: status 112 Child 1451 exited: status 113 Child 1452 exited: status 114 Child 1455 exited: status 117 Child 1454 exited: status 116 myth60$

33



We can do the same thing we did in the first program, but monitor and reap the childprocesses in the order they are forked.Check out the abbreviated program below (full program with error checking ):right here

int main(int argc, char *argv[]) { pid_t children[8]; for (size_t i = 0; i < 8; i++) { if ((children[i] = fork()) == 0) exit(110 + i); } for (size_t i = 0; i < 8; i++) { int status; pid_t pid = waitpid(children[i], &status, 0); assert(pid == children[i]); assert(WIFEXITED(status) && (WEXITSTATUS(status) == (110 + i))); printf("Child with pid %d accounted for (return status of %d).\n", children[i], WEXITSTATUS(status)); } return 0; }

34

http://web.stanford.edu/class/cs110/examples/processes/reap-in-fork-order.c



This version spawns and reaps processes in some first-spawned-first-reaped manner.The child processes aren't required to exit in FSFR order.In theory, the first child thread could finish last, and the reap loop could be held up on itsvery first iteration until the first child really is done. But the process zombies—yes, that'swhat they're called—are reaped in the order they were forked.Below is a sample run of the reapinforkorder executable. The pids change betweenruns, but even those are guaranteed to be published in increasing order.

myth60$ ./reap-as-they-exit Child with pid 4689 accounted for (return status of 110). Child with pid 4690 accounted for (return status of 111). Child with pid 4691 accounted for (return status of 112). Child with pid 4692 accounted for (return status of 113). Child with pid 4693 accounted for (return status of 114). Child with pid 4694 accounted for (return status of 115). Child with pid 4695 accounted for (return status of 116). Child with pid 4696 accounted for (return status of 117). myth60$

35

principles of computer systems lecturer: chris gregg ......stat, and so forth). the system call...

Documents