i/o in freebsd - syracuse universityhpdc.syr.edu/~chapin/cis657/io.pdf1 i/o in freebsd cis657...

34
1 I/O in FreeBSD I/O in FreeBSD CIS657 CIS657 Properties of the I/O Subsystem Properties of the I/O Subsystem Anonymizes Anonymizes (unifies the interface to) (unifies the interface to) hardware devices, both to users and to hardware devices, both to users and to other parts of the kernel other parts of the kernel General device-driver code General device-driver code Hardware-specific device drivers Hardware-specific device drivers Three Main Kinds of I/O Three Main Kinds of I/O Filesystem Filesystem Actual files (although this is too restrictive) Actual files (although this is too restrictive) Hierarchical name space, locking, quotas, Hierarchical name space, locking, quotas, attribute management, protection attribute management, protection More on this later (Ch. 8) More on this later (Ch. 8) Socket interface Socket interface Networking requests Networking requests

Upload: phamnhu

Post on 29-Apr-2018

229 views

Category:

Documents


3 download

TRANSCRIPT

1

I/O in FreeBSDI/O in FreeBSD

CIS657CIS657

Properties of the I/O SubsystemProperties of the I/O Subsystem

Anonymizes Anonymizes (unifies the interface to)(unifies the interface to)hardware devices, both to users and tohardware devices, both to users and toother parts of the kernelother parts of the kernel

General device-driver codeGeneral device-driver code Hardware-specific device driversHardware-specific device drivers

Three Main Kinds of I/OThree Main Kinds of I/O

FilesystemFilesystem Actual files (although this is too restrictive)Actual files (although this is too restrictive) Hierarchical name space, locking, quotas,Hierarchical name space, locking, quotas,

attribute management, protectionattribute management, protection More on this later (Ch. 8)More on this later (Ch. 8)

Socket interfaceSocket interface Networking requestsNetworking requests

2

Three Types of I/O (II)Three Types of I/O (II)

Character-device interfaceCharacter-device interface Unstructured access to devicesUnstructured access to devices Two typesTwo types

–– Actual character devices (e.g. terminals)Actual character devices (e.g. terminals)–– ““rawraw”” access to block devices (e.g. disks, tapes) access to block devices (e.g. disks, tapes)

I/O doesnI/O doesn’’t use the t use the filesystem filesystem or page cacheor page cache Requests are still in multiples of block sizeRequests are still in multiples of block size Requests might need to be alignedRequests might need to be aligned

Kernel I/O Structure (pg. 216)Kernel I/O Structure (pg. 216)

Internal I/O StructureInternal I/O Structure

Device drivers are modules of codeDevice drivers are modules of codededicated to handling I/O for adedicated to handling I/O for aparticular piece or kind of hardwareparticular piece or kind of hardware

Drivers define a set of operations viaDrivers define a set of operations viawell-known entry points (e.g., read,well-known entry points (e.g., read,write, etc.)write, etc.)

Character device interface is kept inCharacter device interface is kept incdevswcdevsw structure.structure.

3

Character Device Switch TableCharacter Device Switch Tablestruct cdevsw { int d_maj; u_int d_flags; const char *d_name; d_open_t *d_open; d_fdopen_t *d_fdopen; d_close_t *d_close; d_read_t *d_read; d_write_t *d_write; d_ioctl_t *d_ioctl; d_poll_t *d_poll; d_mmap_t *d_mmap; d_strategy_t *d_strategy; dumper_t *d_dump; d_kqfilter_t *d_kqfilter;};

Device NumbersDevice Numbers in 4.4BSDin 4.4BSD

Historically usedHistorically used to select the deviceto select the devicedriver and actual devicedriver and actual device

Device number was a (major, minor)Device number was a (major, minor)pairpair

Major number selected the device driverMajor number selected the device driver Minor number selects which (of possiblyMinor number selects which (of possibly

many) devicesmany) devices——interpreted by the driverinterpreted by the driver

Device Numbers in FreeBSD 5.2Device Numbers in FreeBSD 5.2

Not used to select device driverNot used to select device driver Present only forPresent only for applications to examineapplications to examine

Direct link in /dev file system to theDirect link in /dev file system to thedevicedevice’’s s cdevswcdevsw entryentry

On configuration:On configuration: New major device number assignedNew major device number assigned Entry made in /devEntry made in /dev

Minor device # still has meaningMinor device # still has meaning

4

Device DriversDevice Drivers

Autoconfiguration Autoconfiguration & Initialization& Initialization Probes device, sets initial software stateProbes device, sets initial software state Called once (system startup or deviceCalled once (system startup or device

connection)connection) Top HalfTop Half

System calls or VM systemSystem calls or VM system may call sleep()may call sleep()

Device Drivers (II)Device Drivers (II)

Bottom Half (Interrupt handlers)Bottom Half (Interrupt handlers) no per-process stateno per-process state does have kernel thread context, so candoes have kernel thread context, so can

block (subject to locking restrictions)block (subject to locking restrictions) Blocking in interrupt handlersBlocking in interrupt handlers probablyprobably

undesirable (performance reasons)undesirable (performance reasons) Error handling (optional)Error handling (optional)

Crash-dump routine, called by systemCrash-dump routine, called by systemwhen an unrecoverable error occurs.when an unrecoverable error occurs.

I/O QueuingI/O Queuing

How data moves between the top and bottomHow data moves between the top and bottomhalves of the kernelhalves of the kernel

Typical top half sequenceTypical top half sequence receives request from user through system callreceives request from user through system call records request in data structurerecords request in data structure places data structure in queueplaces data structure in queue starts device if necessarystarts device if necessary

5

Interrupt HandlersInterrupt Handlers

Device interrupts after completing theDevice interrupts after completing theI/O operationI/O operation

Removes request from device queueRemoves request from device queue Notifies requester that command hasNotifies requester that command has

completed (how?)completed (how?) Starts next request from the queueStarts next request from the queue

Top/Bottom CoordinationTop/Bottom Coordination

What might happen if an interruptWhat might happen if an interruptoccurred while a request was beingoccurred while a request was beingenqueued enqueued in the top half?in the top half?

The queue is just like any other sharedThe queue is just like any other shareddata structuredata structure

To enforce synchronization, useTo enforce synchronization, usemutexesmutexes..

Interrupt MechanismInterrupt Mechanism Device interruptsDevice interrupts Call glue routine throughCall glue routine through

interrupt table (glueinterrupt table (glueroutine allows ISR to beroutine allows ISR to bemachine independent)machine independent)

Glue routine savesGlue routine savesvolatile registersvolatile registers

…… updates statistics updates statistics …… calls ISR passing in calls ISR passing in

device unit numberdevice unit number …… restores volatile restores volatile

registersregisters …… returns from interrupt returns from interrupt

glueroutine

interruptservice routine

interrupt

interruptdispatch

table

6

Character DevicesCharacter Devices

Most peripherals (except networks)Most peripherals (except networks)have character interfaceshave character interfaces

For many devices, this is just a byteFor many devices, this is just a bytestreamstream /dev/lp0 (printer)/dev/lp0 (printer) /dev/tty00 (terminal)/dev/tty00 (terminal) /dev/mem /dev/mem (memory)(memory) /dev/null (sink/source of EOF)/dev/null (sink/source of EOF)

Non-Byte-Stream DevicesNon-Byte-Stream Devices

E.g., high-speed graphic interfacesE.g., high-speed graphic interfaces Maintain own buffers -or-Maintain own buffers -or- Do I/O directly to address space of userDo I/O directly to address space of user

User-level buffer must be wired during I/OUser-level buffer must be wired during I/O

Disk-like Character DevicesDisk-like Character Devices For For ““blockblock”” devices, the character interface is devices, the character interface is

called the called the ““rawraw”” interface interface Direct access to the deviceDirect access to the device I/O directly to user spaceI/O directly to user space User-level buffer must be wired during I/OUser-level buffer must be wired during I/O Not a byte-stream model; still have to talk in termsNot a byte-stream model; still have to talk in terms

understood by underlying deviceunderstood by underlying device–– Size, alignmentSize, alignment

Should only be used by programs that have anShould only be used by programs that have anintimate knowledge of the underlying device andintimate knowledge of the underlying device andformat; otherwise, chaos may ensueformat; otherwise, chaos may ensue

7

Character Device InterfaceCharacter Device Interface

Open, close: justOpen, close: justwhat you thinkwhat you think

Read, write: callRead, write: callphysiophysio() to do the() to do thetransfertransfer

IoctlIoctl: the : the ““swissswissarmy knifearmy knife”” call; call;highly-dependenthighly-dependenton deviceon device

Poll: see whether thePoll: see whether thedevice is ready fordevice is ready forreading/writing (cf.reading/writing (cf.select() select() syscallsyscall))

Stop: flow control onStop: flow control onterminal devicesterminal devices

MmapMmap: map device: map devicememory into addressmemory into addressspacespace

Reset: re-initialize theReset: re-initialize thedevice statedevice state

DiskDisk Driver Entry PointsDriver Entry Points

Strategy: start read/write op and returnStrategy: start read/write op and returnimmediately.immediately. A A bufbuf structure is passed, containingstructure is passed, containing

parametersparameters for the I/O requestfor the I/O request If synchronous, caller sleeps on address ofIf synchronous, caller sleeps on address ofbufbuf structurestructure

Dump: write physical memory to theDump: write physical memory to thedevice (used for controlled systemdevice (used for controlled systemcrashes)crashes)

Disk Scheduling: Disk Scheduling: disksortdisksort()()

Implements a variant of the elevatorImplements a variant of the elevatoralgorithm: which one is it?algorithm: which one is it?

Keeps two sorted lists of requests:Keeps two sorted lists of requests: The current one (the one weThe current one (the one we’’re servicing)re servicing) The other one (for the opposite direction)The other one (for the opposite direction) Switch whenever we hit the endSwitch whenever we hit the end

Modern disk controllers will allow multipleModern disk controllers will allow multipleoutstanding requests and will sort theoutstanding requests and will sort therequests themselvesrequests themselves

8

Pseudocode Pseudocode for for disksortdisksort()()DisksortDisksort(drive queue *(drive queue *dqdq, buffer *, buffer *bpbp))

{{

if (active list is empty) { if (active list is empty) {

place the buffer at the front of the active list; place the buffer at the front of the active list;

return; return;

} }

if (request lies before the first active request) { if (request lies before the first active request) {

locate the beginning of the next-pass list; locate the beginning of the next-pass list;

sort sort bp bp into the next-pass list;into the next-pass list;

} else { } else {

sort sort bp bp into the active list;into the active list;

} }

}}

DisksortDisksort() and() and Modern DisksModern Disks

Most modern disksMost modern disks can handle severalcan handle severalsimultaneous requestssimultaneous requests Sort them internallySort them internally

However, disksHowever, disks typically only buffertypically only bufferabout 15 requestsabout 15 requests During busy times, > 15 requestsDuring busy times, > 15 requests

simultaneouslysimultaneously Need to maintain longer list in kernelNeed to maintain longer list in kernel

Disk LabelsDisk Labels

The geometry of a disk is the number ofThe geometry of a disk is the number ofcylinders/tracks/sectors on a disk, and thecylinders/tracks/sectors on a disk, and thesize of the sectorssize of the sectors

Disks can have different geometries imposedDisks can have different geometries imposedby a low-level formatby a low-level format

Disk labels are written in a well-knownDisk labels are written in a well-knownlocation on the disk, and contain thelocation on the disk, and contain thegeometry, as well as information aboutgeometry, as well as information aboutpartition usage.partition usage.

9

BootstrappingBootstrapping

The first level bootstrap loader reads theThe first level bootstrap loader reads thefirst few disk sectors to findfirst few disk sectors to find The geometry of the diskThe geometry of the disk A second-level bootstrap routineA second-level bootstrap routine

–– Think of Think of lilolilo, , multibootmultiboot, et al., et al.

See, for example,See, for example,http://www.http://www.nomoanomoa..com/bsd/dualbootcom/bsd/dualboot..htmhtm

File DescriptorsFile Descriptors

System calls use file descriptors to referSystem calls use file descriptors to referto open files and devicesto open files and devices File entry contains file type and pointer toFile entry contains file type and pointer to

underlying objectunderlying object–– Socket (IPC)Socket (IPC)–– Vnode Vnode (device, file)(device, file)–– Virtual memoryVirtual memory

filedescriptor

descriptortable

file entry

interprocesscommunication

vnode

virtualmemoryuser process filedesc process

substructure

kernel list

Open File EntriesOpen File Entries

““object-orientedobject-oriented”” data structure data structure Methods:Methods:

ReadRead WriteWrite PollPoll IoctlIoctl StatStat Check for Check for kqueue kqueue eventsevents Close/deallocateClose/deallocate

Really just a type and a table of functionReally just a type and a table of functionpointerspointers

10

Open File Entries IIOpen File Entries II

Each file entry has a pointer to a data structureEach file entry has a pointer to a data structuredescribing the file objectdescribing the file object Reference passed to each function call operatingReference passed to each function call operating

on the fileon the file All object state is stored in the data structureAll object state is stored in the data structure Opaque to the routines that manipulate the fileOpaque to the routines that manipulate the file

entryentry——only interpreted by code for underlyingonly interpreted by code for underlyingoperationsoperations

File offset determines position in file for nextFile offset determines position in file for nextread or writeread or write Not stored in per-object data structure. Why not?Not stored in per-object data structure. Why not?

File Descriptor FlagsFile Descriptor Flags

Protection flags (R, W,Protection flags (R, W,RW)RW) Checked before systemChecked before system

call executedcall executed System calls themselvesSystem calls themselves

dondon’’t have to checkt have to check O_NONBLOCK: returnO_NONBLOCK: return

EAGAIN if the callEAGAIN if the callwould block thewould block the threadthread

O_ASYNC: send SIGIOO_ASYNC: send SIGIOto the process when I/Oto the process when I/Obecomes possiblebecomes possible

O_APPEND: set offsetO_APPEND: set offsetto EOF on each writeto EOF on each write

O_FSYNC: Writes toO_FSYNC: Writes tofile file syncsync’’ed ed with diskwith disk

O_DIRECT: Ask kernelO_DIRECT: Ask kernelto skip copyto skip copy

Close on exec:Close on exec: stopstopsharing with children onsharing with children onexec (saved with exec (saved with descdesc.).)

Locking informationLocking information(shared/exclusive)(shared/exclusive)

File Entry Structure CountsFile Entry Structure Counts

Reference CountsReference Counts Count the number of times processes point to theCount the number of times processes point to the

file structurefile structure–– A single process can have multiple references to (i.e.,A single process can have multiple references to (i.e.,

descriptors for) the file (dup, descriptors for) the file (dup, fcntlfcntl))–– Processes inherit file entries on fork (shared offset)Processes inherit file entries on fork (shared offset)

Message CountsMessage Counts Descriptors can be sent in messages to local Descriptors can be sent in messages to local procsprocs

–– Increment message count when descriptor sent, andIncrement message count when descriptor sent, anddecrement when receiveddecrement when received

11

File Structure Manipulation:File Structure Manipulation:fcntlfcntl()()

Duplicate a descriptor (same as dup())Duplicate a descriptor (same as dup()) Get/set flagsGet/set flags enumerated on earlier slideenumerated on earlier slide Send signal to process when urgent dataSend signal to process when urgent data

arisesarises Set or get the PID or PGID to which to sendSet or get the PID or PGID to which to send

the two signalsthe two signals Test or set status of a lock on a range ofTest or set status of a lock on a range of

bytes in the underlying filebytes in the underlying file

File Descriptor LockingFile Descriptor Locking Early Unix didnEarly Unix didn’’t have file lockingt have file locking

So, it used an atomic operation on the file systemSo, it used an atomic operation on the file systemto simulate it. How?to simulate it. How?

Downsides of this approach:Downsides of this approach: Processes consumed CPU time loopingProcesses consumed CPU time looping Locks left lying around have to be cleaned up byLocks left lying around have to be cleaned up by

handhand Root processes would override the lockRoot processes would override the lock

mechanismmechanism

Real locks were added to 4.2 BSDReal locks were added to 4.2 BSD

File Locking Options:File Locking Options:Lock the Whole FileLock the Whole File

Serializes access to the fileSerializes access to the file Fast (can just record the PID/PGRP ofFast (can just record the PID/PGRP of

the locking process and proceed)the locking process and proceed) Inherited by childrenInherited by children Released when last descriptor for a fileReleased when last descriptor for a file

is closedis closed When does it work well? When poorly?When does it work well? When poorly? 4.2 BSD and 4.3 BSD had this4.2 BSD and 4.3 BSD had this

12

File Locking Options:File Locking Options:Byte-Range LocksByte-Range Locks

Can lock a piece of a file (down to a byte)Can lock a piece of a file (down to a byte) Not powerful enough to do nestedNot powerful enough to do nested

hierarchical locks (databases)hierarchical locks (databases) Complex implementationComplex implementation Mandated by POSIX standard, so putMandated by POSIX standard, so put

into 4.4 BSDinto 4.4 BSD Released Released eacheach time a close is done on a time a close is done on a

descriptor referencing the filedescriptor referencing the file

Problem with Byte-Range LocksProblem with Byte-Range Locks(mandated by POSIX)(mandated by POSIX)

Inadvertant Inadvertant releaserelease Process locks fileProcess locks file Calls library routineCalls library routine

–– OpenOpen–– ReadRead–– CloseClose

Lock was just released!Lock was just released!

Must have file open for reading to exclusivelyMust have file open for reading to exclusivelylocklock

File Locking Options:File Locking Options:Advisory vs. MandatoryAdvisory vs. Mandatory

Mandatory:Mandatory: Locks enforced for all processesLocks enforced for all processes Cannot be bypassed (even by root)Cannot be bypassed (even by root)

Advisory:Advisory: Locks enforced only for processes thatLocks enforced only for processes that

request themrequest them

13

File Locking Options:File Locking Options:Shared vs. ExclusiveShared vs. Exclusive

Shared locks: multiple processes may accessShared locks: multiple processes may accessthe filethe file

Exclusive locks: only one processExclusive locks: only one process Request a lock when another process has anRequest a lock when another process has an

exclusive lock => you blockexclusive lock => you block Request an exclusive lock when anotherRequest an exclusive lock when another

process holds a lock => you blockprocess holds a lock => you block Think of the classic readers-writers problemThink of the classic readers-writers problem

FreeBSD LocksFreeBSD Locks

Both whole-file and byte-rangeBoth whole-file and byte-range Whole-file implemented as byte-range over the entire file,Whole-file implemented as byte-range over the entire file,

applied to descriptorapplied to descriptor Byte-range lock applied to fileByte-range lock applied to file Closing behavior handled as special case checkClosing behavior handled as special case check

Advisory locksAdvisory locks Both shared and exclusive may be requestedBoth shared and exclusive may be requested May request lock when opening a file (avoid raceMay request lock when opening a file (avoid race

condition)condition) Done on Done on per-filesystem per-filesystem basisbasis Process may specify to return with error instead ofProcess may specify to return with error instead of

blocking if lock is busyblocking if lock is busy

Doing I/O forDoing I/O forMultiple DescriptorsMultiple Descriptors

Consider the X ServerConsider the X Server Read from keyboardRead from keyboard Read from mouseRead from mouse Write to frame buffer (video)Write to frame buffer (video) Read from multiple network connectionsRead from multiple network connections Write to multiple network connectionsWrite to multiple network connections

If we try to read from a descriptor thatIf we try to read from a descriptor thatisnisn’’t ready, wet ready, we’’ll blockll block

14

Historical SolutionHistorical Solution

Use multiple processes communicatingUse multiple processes communicatingthrough shared facilitiesthrough shared facilities

Example:Example: Have a Have a ““mastermaster”” process fork off a separate process fork off a separate

process to handle each I/O device or client for theprocess to handle each I/O device or client for theX ServerX Server

For N children, N+1 pipes:For N children, N+1 pipes:–– One for all children to talk to X ServerOne for all children to talk to X Server–– One per child for the X Server to talk backOne per child for the X Server to talk back

High overhead in context switchesHigh overhead in context switches

FreeBSDFreeBSD’’s Solution:s Solution:Three I/O OptionsThree I/O Options

Polling I/O:Polling I/O: Repeatedly check a set of descriptors forRepeatedly check a set of descriptors for

available I/O; see discussion of Selectavailable I/O; see discussion of Select Nonblocking Nonblocking I/O:I/O:

Complete immediately w/partial resultsComplete immediately w/partial results Signal-driven I/O:Signal-driven I/O:

Notify process when I/O becomes possibleNotify process when I/O becomes possible

Possibilities to Avoid thePossibilities to Avoid theBlocking Problem (1-2)Blocking Problem (1-2)

Set all descriptors to Set all descriptors to nonblocking nonblocking modemode This ends up being a different form ofThis ends up being a different form of

polling (busy waiting)polling (busy waiting) Set all descriptors to send signalsSet all descriptors to send signals

Signals are expensive to catch; not goodSignals are expensive to catch; not goodfor a high I/O process like the X Serverfor a high I/O process like the X Server

15

Possibilities to Avoid thePossibilities to Avoid theBlocking Problem (3-4)Blocking Problem (3-4)

Have a query mechanism to find outHave a query mechanism to find outwhich descriptors are readywhich descriptors are ready Block process if none are readyBlock process if none are ready Two system calls: one for check, one for I/OTwo system calls: one for check, one for I/O

Have the process give the kernel a set ofHave the process give the kernel a set ofdescriptors to read from, and be told ondescriptors to read from, and be told onreturn which one provided the datareturn which one provided the data CanCan’’t mix reads and writest mix reads and writes

Which Possibilities are inWhich Possibilities are inFreeBSD?FreeBSD?

Possibility #1: Possibility #1: nonblocking nonblocking I/OI/O Possibility #2: signal-driven I/OPossibility #2: signal-driven I/O Possibility #3: select() or poll()Possibility #3: select() or poll()

Two different interfaces to the sameTwo different interfaces to the sameunderlying function.underlying function.

WeWe’’ll focus on select()ll focus on select() Possibility #4: not in FreeBSDPossibility #4: not in FreeBSD

Select()Select() Process passes in three bitmaps ofProcess passes in three bitmaps of

descriptors itdescriptors it’’s interested in: read, write,s interested in: read, write,and exceptionsand exceptions

Kernel returns as soon as any of thoseKernel returns as soon as any of thoseare ready for I/Oare ready for I/O

Process decides which descriptor to doProcess decides which descriptor to doI/O on (bitmap operations)I/O on (bitmap operations)

Process can set timeout, and kernel willProcess can set timeout, and kernel willreturn with no bits set at that time (if itreturn with no bits set at that time (if ithasnhasn’’t already)t already)

16

Device Polling: Low-levelDevice Polling: Low-levelselect() Code for FreeBSDselect() Code for FreeBSD

struct selinfo struct selinfo {{ struct struct thread *thread *si_threadsi_thread; /*; /* thread to be notified */thread to be notified */ short short si_flagssi_flags; /* SI_COLL ; /* SI_COLL –– collision occurred */ collision occurred */}}

struct tty struct tty **tptp;;

. . .. . .if (events & POLLIN) {if (events & POLLIN) { if ( if (nread nread > 0 || (> 0 || (tp-tp->t_state & TS_CARR_ON) == 0)>t_state & TS_CARR_ON) == 0) revents revents |= POLLIN;|= POLLIN; elseelse selrecordselrecord((curthreadcurthread, &, &tp-tp->>t_rselt_rsel););}}. . .. . .

Routine called with POLLIN, POLLOUT, or POLLRDBAND

Return success if operation is possible(success only means

won’t block)

Record tid with the device

FreeBSDFreeBSD 5.2 5.2 select()select()

Look in sys_generic.c at select() to seeLook in sys_generic.c at select() to seehow FreeBSDhow FreeBSD 5.2 has implemented this5.2 has implemented this

Awakening a ProcessAwakening a Process

Device/socket notices change in statusDevice/socket notices change in status selwakeupselwakeup() is called pointing to the () is called pointing to the selinfoselinfo

structstruct, and collision flag, and collision flag If the thread is sleeping on If the thread is sleeping on selwaitselwait, wake it up, wake it up If thread has its If thread has its selectingselecting flag set, clear it (cf. flag set, clear it (cf.

select() code and retry)select() code and retry) If collision occurred, wake up everyone whoIf collision occurred, wake up everyone who

is sleeping on is sleeping on selwaitselwait

17

Kernel Data Movement:Kernel Data Movement:I/O Vectors (I/O Vectors (ioveciovec))

iovec iovec is a base address and a lengthis a base address and a length

struct iovec struct iovec {{ char * char *iov_baseiov_base; /* Base address. */; /* Base address. */ size_t size_t iov_len iov_len; /* Length. */; /* Length. */};};

Same as used in Same as used in readvreadv() and () and writevwritev()()system calls.system calls.

The The uiouio Structure:Structure:Describing the I/O OperationDescribing the I/O Operation

A pointer to an array of A pointer to an array of ioveciovec Number of elements inNumber of elements in iovec iovec arrayarray File offset for start of operationFile offset for start of operation Flag showing whether source and destinationFlag showing whether source and destination

are both in kernel, or between user and kernelare both in kernel, or between user and kernel Sum of the length of the I/O vectorsSum of the length of the I/O vectors Flag whether data is going from Flag whether data is going from uio uio ⇒⇒ kernel kernel

(UIO_WRITE) or kernel (UIO_WRITE) or kernel ⇒⇒ uio uio (UIO_READ)(UIO_READ) Pointer to thread whose area is described byPointer to thread whose area is described by

the the uio uio structure (NULL for kernel area)structure (NULL for kernel area)

uiouio StructureStructure

uio_iovuio_iovcntuio_offsetuio_resid

uio_segflguio_rw

uio_procp

iov_base

iov_len

iov_baseiov_len

iov_baseiov_len

data

data

datastruct uio

struct iov

18

Use of Use of ioveciovec and and uiouio

All I/O in the kernel is described by All I/O in the kernel is described by uiouioand and iovec iovec structures.structures.

System calls that donSystem calls that don’’t take an t take an iovecioveccreate a create a uio uio structure to hold data, andstructure to hold data, andpass that down in the kernelpass that down in the kernel

Lower levels of kernel use Lower levels of kernel use uiomoveuiomove()() to tocopy data between copy data between uio uio structures andstructures andkernel bufferskernel buffers

The Virtual The Virtual Filesystem Filesystem InterfaceInterface

In older systems, file entries pointed directly toIn older systems, file entries pointed directly toa a filesystem filesystem index node (index node (inodeinode).). An index node describes the layout of a file on diskAn index node describes the layout of a file on disk

However, this tied file entries to the particularHowever, this tied file entries to the particularfilesystem filesystem (the Berkeley Fast File System, or(the Berkeley Fast File System, orFFS).FFS).

Add a layer to sit between the file entries andAdd a layer to sit between the file entries andthe the inodesinodes: the virtual node (: the virtual node (vnodevnode))

VnodesVnodes

An object-oriented interfaceAn object-oriented interface Vnodes Vnodes for local for local filesystems filesystems map ontomap onto

inodesinodes Vnodes Vnodes for remote for remote filesystem filesystem map tomap to

network protocol control blocksnetwork protocol control blocksdescribing how to find and access the filedescribing how to find and access the file

19

Vnode Vnode EntriesEntries Flags for locking and generic attributes (e.g.Flags for locking and generic attributes (e.g.

filesystem filesystem root)root) Reference countsReference counts Pointer to mount structure for Pointer to mount structure for filesystemfilesystem

containing the containing the vnodevnode File read-ahead informationFile read-ahead information A A vm_object vm_object referencereference Reference to state of special devices,Reference to state of special devices,

sockets, sockets, FIFOsFIFOs

Vnode Vnode Entries (II)Entries (II)

Mutex Mutex to protect flags/countersto protect flags/counters Lock manager lock to protect portions ofLock manager lock to protect portions of

vnode vnode changed by I/O operationschanged by I/O operations Fields used by name cacheFields used by name cache Pointer to Pointer to vnode vnode operationsoperations Pointer to private information (Pointer to private information (inodeinode, , nsfnodensfnode)) Type of underlying objectType of underlying object Clean and dirty buffer poolsClean and dirty buffer pools Count of write operations in progressCount of write operations in progress

Vnode Vnode OperationsOperations

At boot time, each At boot time, each filesystem filesystem registersregistersthe set of the set of vnode vnode operations it supports.operations it supports.

Kernel builds a table listing the union ofKernel builds a table listing the union ofall operations on any all operations on any filesystemfilesystem, and an, and anoperations vector for each operations vector for each filesystemfilesystem

Supported operations point to routines;Supported operations point to routines;unsupported point to a default routine.unsupported point to a default routine.

20

Filesystem Filesystem OperationsOperations

File systems provide two types ofFile systems provide two types ofoperations:operations: Name space operationsName space operations On-disk file storageOn-disk file storage

4.3 BSD had these tied together;4.3 BSD had these tied together;4.4BSD and FreeBSD 5.2 separate4.4BSD and FreeBSD 5.2 separatethemthem

Pathname TranslationPathname Translation

The pathname is copied in from theThe pathname is copied in from theuser process (or extracted from networkuser process (or extracted from networkbuffer for remote buffer for remote filesystem filesystem request)request)

Starting point of the pathname isStarting point of the pathname isdetermined (root for absolute pathdetermined (root for absolute pathnames, current directory for relativenames, current directory for relativenames)names)——this is the lookup directory inthis is the lookup directory inthe next stepthe next step

Pathname Translation IIPathname Translation II

Vnode Vnode layer calls layer calls filesystem-specificfilesystem-specificlookup() operation, passing in thelookup() operation, passing in theremaining pathname and the lookupremaining pathname and the lookupdirectorydirectory

Filesystem Filesystem code searches the lookupcode searches the lookupdirectory for the next component of thedirectory for the next component of thepath, and returns either the path, and returns either the vnode vnode forforthat component or an errorthat component or an error

21

Pathname Translation III:Pathname Translation III:Possible OutcomesPossible Outcomes

Error returned: propagate itError returned: propagate it Pathname not exhausted, but current Pathname not exhausted, but current vnodevnode

is not a directory: is not a directory: ““not a directorynot a directory”” error error Pathname exhausted, then the current Pathname exhausted, then the current vnodevnode

is the result; return itis the result; return it If If vnode vnode is the mount point for another fileis the mount point for another file

system, make the lookup directory thesystem, make the lookup directory themounted mounted filesystemfilesystem; else make the lookup; else make the lookupdirectory the returned directory the returned vnodevnode In either case, iteratively call lookup() with the newIn either case, iteratively call lookup() with the new

lookup directorylookup directory

Common Common VnodeVnode(Exported) (Exported) Filesystem Filesystem ServicesServices

Generic mount optionsGeneric mount options NoexecNoexec: don: don’’t execute any files found on this filet execute any files found on this file

system (useful in heterogeneous networkedsystem (useful in heterogeneous networkedenvironments)environments)

NosuidNosuid: don: don’’t honor t honor setuid setuid or or setgid setgid flags forflags forexecutables on this executables on this filesystem filesystem (security)(security)

NodevNodev: don: don’’t allow any special devices on thist allow any special devices on thisfilesystem filesystem to be opened (again, heterogeneousto be opened (again, heterogeneoussystems)systems)

NoatimeNoatime: don: don’’t update file access times on readst update file access times on reads(performance improvement)(performance improvement)

Sync: request I/OSync: request I/O to the FS be done synchronouslyto the FS be done synchronously

Common Common Vnode Vnode Services II:Services II:Informational servicesInformational services

statfsstatfs(): returns information about the(): returns information about thefilesystem filesystem resources (number of usedresources (number of usedand free disk blocks and and free disk blocks and inodesinodes,,filesystem filesystem mount point, and source ofmount point, and source ofthe mounted the mounted filesystemfilesystem))

getfsstatgetfsstat(): returns information about all(): returns information about allmounted mounted filesystemsfilesystems

22

Filesystem-IndependentFilesystem-IndependentServices: Inactive()Services: Inactive()

Called by the Called by the vnode vnode layer when lastlayer when lastreference to a file is closed, usagereference to a file is closed, usagecount goes to 0count goes to 0

Filesystem Filesystem can then write dirty blocks,can then write dirty blocks,etc.etc.

Vnode Vnode is placed on system-wide free listis placed on system-wide free list Floating free list allows Floating free list allows vnodes vnodes toto

migrate to the file system where theymigrate to the file system where they’’rereneededneeded

reclaim() disassociates the reclaim() disassociates the vnode vnode withwiththe underlying file systemthe underlying file system

vgonevgone() uses reclaim and then() uses reclaim and thenassociates the associates the vnode vnode with the deadwith the deadfilesystemfilesystem All calls except close() generate an errorAll calls except close() generate an error Used to disassociate a terminal when theUsed to disassociate a terminal when the

session leader exits, or when session leader exits, or when unmountingunmountinga a filesystem filesystem where processes have openwhere processes have openfilesfiles

Filesystem-IndependentFilesystem-IndependentServices: reclaim() & Services: reclaim() & vgonevgone()()

Name CacheName Cache

Associate a name with its Associate a name with its vnode vnode to saveto savelookups in the file systemlookups in the file system Add name and Add name and vnode vnode to cacheto cache Look up Look up vnode vnode by nameby name Delete name from the cacheDelete name from the cache Invalidate all names that reference a Invalidate all names that reference a vnodevnode

–– Vnode Vnode has list of names in the name cachehas list of names in the name cache–– Directory Directory vnode vnode also has list of cachealso has list of cache entries containedentries contained

within itwithin it

23

Negative CachingNegative Caching

Names of files that arenNames of files that aren’’t there are looked upt there are looked upin the cache, miss, then also not found in thein the cache, miss, then also not found in thefilesystemfilesystem——expensiveexpensive!! Pathname searching in the shellPathname searching in the shell

Alternative: put an entry in the cache with thatAlternative: put an entry in the cache with thatname, and a null name, and a null vnode vnode pointer.pointer.

This entry will be found on next lookup, short-This entry will be found on next lookup, short-circuiting the processcircuiting the process

If name added to directory, lookup in nameIf name added to directory, lookup in namecache and purge negative entry if presentcache and purge negative entry if present

Buffer Management:Buffer Management:the Buffer Cachethe Buffer Cache

In older BSD (In older BSD (pre-mmappre-mmap), there were), there weretwo caches of memory backed on disktwo caches of memory backed on disk Disk blocks forDisk blocks for explicit file accessexplicit file access Pages for VMPages for VM

What if a file isWhat if a file is opened by process Aopened by process A(traditional), while process B has it(traditional), while process B has itmmapedmmaped?? Two caches means a big, hairy messTwo caches means a big, hairy mess A should see changes B makesA should see changes B makes

Merging the CachesMerging the Caches

Buffer cache and VM cache are now aBuffer cache and VM cache are now asinglesingle page cachepage cache

VM: VM: named files (backed by file)named files (backed by file)identified by identified by vnode vnode and logical block #and logical block #

Old buffer-caching interfaceOld buffer-caching interfacereimplemented reimplemented as emulation layer overas emulation layer overVM cacheVM cache

24

Buffer Cache EmulationBuffer Cache Emulation

Same interface as old routinesSame interface as old routines Looks up requested file pages in VMLooks up requested file pages in VM

cachecache If not in memory, VM system reads it inIf not in memory, VM system reads it in Map into kernel VM space (so there isMap into kernel VM space (so there is

reference to it)reference to it)

The Buffer CacheThe Buffer Cache

Remember that the buffer cache speeds upRemember that the buffer cache speeds upaccess to disk blocksaccess to disk blocks 85% of transfers can be skipped85% of transfers can be skipped

Buffer formatBuffer format HeaderHeader ContentContent Stored separately (allows buffers to be on pageStored separately (allows buffers to be on page

boundaries, for efficient transfer and manipulation)boundaries, for efficient transfer and manipulation) Buffer data can be from 512 up to 64k bytesBuffer data can be from 512 up to 64k bytes

Buffer FormatBuffer Format

hash link

free-list link

flags

vnode pointer

file offset

byte count

buffer size

buffer pointer

(64 Kbyte)MAXBSIZE

buffer contentsbuffer header

25

Buffer Management IIIBuffer Management III

Each buffer initially gets 4096 bytes (multipleEach buffer initially gets 4096 bytes (multiplepages)pages) May give up part of its physical memory laterMay give up part of its physical memory later

Buffers identified by logical block # in fileBuffers identified by logical block # in file Used to be done by physical block #, butUsed to be done by physical block #, but

–– DidnDidn’’t always have access to physical devicet always have access to physical device–– Slow lookup through indirect blocks in Slow lookup through indirect blocks in filesystem filesystem to findto find

physical numberphysical number To avoid aliasing, kernel prohibits mounting aTo avoid aliasing, kernel prohibits mounting a

partition while its block device is open (and vicepartition while its block device is open (and viceversa)versa)

Buffer CallsBuffer Calls

bread(): takes a bread(): takes a vnodevnode, logical block, logical blocknumber, and size, and returns pointer tonumber, and size, and returns pointer tolocked bufferlocked buffer Further accesses to this buffer by otherFurther accesses to this buffer by other

processes cause them to sleepprocesses cause them to sleep Four ways for a node to be unlockedFour ways for a node to be unlocked

–– brelsebrelse(), (), bdwritebdwrite(), (), bawritebawrite(), (), bwritebwrite()()

Buffer Calls IIBuffer Calls II

brelsebrelse(): release an unmodified buffer; awaken(): release an unmodified buffer; awakenany threads sleeping on the bufferany threads sleeping on the buffer

bdwritebdwrite(): mark the buffer as dirty and put it back(): mark the buffer as dirty and put it backin the free list, and awaken processes waiting for itin the free list, and awaken processes waiting for it

bawritebawrite(): schedule a write on a full buffer, and(): schedule a write on a full buffer, andreturn (asynchronous; implicit return (asynchronous; implicit brelsebrelse())())

bwritebwrite(): ensure write is complete before returning(): ensure write is complete before returning(implicit (implicit brelsebrelse())())

bqrelsebqrelse(): used by (): used by filesystem filesystem to indicate buffer willto indicate buffer willbe used again soon; donbe used again soon; don’’t dissolvet dissolve

26

Buffer ListsBuffer Lists

BufhashBufhash: quick lookup for buffers with valid: quick lookup for buffers with validcontents (exactly one)contents (exactly one)

Four Four ““freefree”” lists: lists: LOCKED: buffers canLOCKED: buffers can’’t be flushed (used to bet be flushed (used to be

superblocksuperblock, now only used to hold buffers being, now only used to hold buffers beingwritten out in background)written out in background)

DIRTY: an LRU cache of used buffers, waiting toDIRTY: an LRU cache of used buffers, waiting tobe written to diskbe written to disk

CLEAN: unmodified buffers FS will want soonCLEAN: unmodified buffers FS will want soon EMPTY: buffers with no physical memoryEMPTY: buffers with no physical memory

Buffer PoolBuffer PoolLOCKED DIRTY CLEAN EMPTY

bufhash

bfreelist

V1,X0k

V1,X8k

V1,X16k

V1,X24k

V2,X16k

V2,X48k

V8,X40k

V4,X32k

V8,X48k

V7,X0k

V7,X8k

V7,X16k

V8,X56k

-- --

-- --

Buffer ManagementBuffer ManagementImplementationImplementation

bread() called with request for abread() called with request for adatablock datablock of specified size for aof specified size for aspecified specified vnodevnode breadnbreadn() gets block and starts () gets block and starts readaheadreadahead

bread()

allocbuf()getnewbuf()

bremfree() getblk()

VOP_STRATEGY() Do the I/O

Check for buffer on free list

Take buffer off of freelist and mark busy.

Adjust memory in buffer to requested size

Find next available buffer on the free list

27

Buffer ManagementBuffer ManagementImplementation IIImplementation II

getblkgetblk() finds out if the data block is already in() finds out if the data block is already ina buffera buffer If in a buffer, callIf in a buffer, call bremfree bremfree() to take it off of() to take it off of

whatever free list itwhatever free list it’’s on and mark it busys on and mark it busy If not in buffer, call If not in buffer, call getnewbufgetnewbuf(), then (), then allocbufallocbuf()()

getnewbufgetnewbuf() allocates a new buffer() allocates a new buffer allocbufallocbuf() makes sure buffer has physical() makes sure buffer has physical

memorymemory bread() then passes the buffer to the strategybread() then passes the buffer to the strategy

routine for the underlying routine for the underlying filesystemfilesystem..

Buffer AllocationBuffer Allocation

allocbufallocbuf() ensures that the buffer has enough() ensures that the buffer has enoughphysical memoryphysical memory

If there is excess physical memory, take someIf there is excess physical memory, take someof it and give it to a buffer on the EMPTY listof it and give it to a buffer on the EMPTY list(move it to CLEAN)(move it to CLEAN)——keep if none therekeep if none there

If not enough, call If not enough, call getnewbufgetnewbuf() and then steal() and then stealits memory (putting it on the EMPTY list if youits memory (putting it on the EMPTY list if yousteal all of its memory)steal all of its memory)

Keeps eachKeeps each disk block mapped into at mostdisk block mapped into at mostone buffer atone buffer at any timeany time

Overlapping BuffersOverlapping Buffers

The kernel must ensure that disk blocksThe kernel must ensure that disk blocksappear in at most one bufferappear in at most one buffer

If multiple dirty buffers held portions ofIf multiple dirty buffers held portions ofthe same block, the kernel wouldnthe same block, the kernel wouldn’’ttknow which one was more recentknow which one was more recent

Kernel purges buffers when files areKernel purges buffers when files areshortened or removedshortened or removed

28

Vnode Vnode EvolutionEvolution

Vnodes Vnodes were originally only used towere originally only used toprovide an OO-interface to theprovide an OO-interface to theunderlying underlying filesystemfilesystem

Concept then generalized to allowConcept then generalized to allowvnodes vnodes to point to other to point to other vnodes vnodes (see(seehttp://citeseer.nj.nec.com/rosenthal90evolving.htmlhttp://citeseer.nj.nec.com/rosenthal90evolving.htmland and http:http://citeseer//citeseer..njnj..necnec.com/heidemann94file.html).com/heidemann94file.html)

Stackable Stackable FilesystemFilesystem

With With vnodes vnodes being able to point to otherbeing able to point to othervnodesvnodes, we can treat , we can treat filesystems filesystems asaslayers, and stack themlayers, and stack them

So, we can add features (such asSo, we can add features (such astransparent encryption) to an underlyingtransparent encryption) to an underlyingfilesystem filesystem without changing thatwithout changing thatfilesystem filesystem at allat all

Mounting LayersMounting Layers

Mount introduces an alias for one Mount introduces an alias for one filesystemfilesystemin anotherin another

Traditional mount: map device into aTraditional mount: map device into afilesystemfilesystem, overlaying the old directory with a, overlaying the old directory with afile systemfile system

With layers, mount pushes a new layer on aWith layers, mount pushes a new layer on avnode vnode stackstack Can overlay old Can overlay old vnode vnode (use same name in file(use same name in file

space)space) Can appear separately (use different name)Can appear separately (use different name)

29

Layered Layered FilesystemsFilesystems

File access is dependent upon whichFile access is dependent upon whichname is used for the filename is used for the file New name: use new layerNew name: use new layer Old name: use old layerOld name: use old layer This assumes new mounts are at differentThis assumes new mounts are at different

points from old onespoints from old ones

Vnode Vnode Options on File AccessOptions on File Access

Perform requested operation and returnPerform requested operation and returnresultresult

Pass operation unchanged to next-lowerPass operation unchanged to next-lowervnodevnode; might (or might not) modify result; might (or might not) modify result

Modify operands and pass on to next-Modify operands and pass on to next-lower lower vnodevnode; might (or might not) modify; might (or might not) modifyresultresult

Operation passed to the bottom of theOperation passed to the bottom of thestack without action generates an errorstack without action generates an error

Old vs. New Old vs. New VnodesVnodes

Old Old vnodesvnodes Vnode Vnode operations implemented as indirectoperations implemented as indirect

function callsfunction calls

New New vnodesvnodes Intermediate layers pass operation to lower layersIntermediate layers pass operation to lower layers New operations added into the system at boot timeNew operations added into the system at boot time Pass operations that may not have been definedPass operations that may not have been defined

when the when the filesystem filesystem was writtenwas written Pass parameters of unknown type and numberPass parameters of unknown type and number

30

SolutionSolution

Kernel places Kernel places vnode vnode operation nameoperation nameand arguments into a and arguments into a structstruct

That That struct struct passed as single parameterpassed as single parameterto to vnode vnode operation (uniform interface)operation (uniform interface)

If the operation is supported, the If the operation is supported, the vnodevnodeknows how to decode the argumentknows how to decode the argument

If itIf it’’s not supported, then its not supported, then it’’s passed tos passed tothe next lower layer w/same argumentthe next lower layer w/same argument

Structure (for access() call)Structure (for access() call)

struct vop_access_args struct vop_access_args {{ struct vnodeop_desc struct vnodeop_desc **a_desca_desc;; struct vnode struct vnode **a_vpa_vp;; int a_mode int a_mode;; struct u_cred struct u_cred **a_creda_cred;; struct struct thread *thread *a_tda_td;;};}; The first element of every argumentThe first element of every argument

structure is a structure is a vnodeop_desc vnodeop_desc structure.structure.

The Null File System:The Null File System: nullfs nullfs

This would be better called the This would be better called the ““identityidentityfile systemfile system””

Just passes all requests and responsesJust passes all requests and responseswithout changing them at allwithout changing them at all

Can make a Can make a filesystem filesystem appear withinappear withinitselfitself

Useful as starting point forUseful as starting point forimplementing other file systemsimplementing other file systems

31

User Mapping File System:User Mapping File System:umapfsumapfs

umapfs umapfs mounts the same file system in two places (/localmounts the same file system in two places (/localinto /export) on the file serverinto /export) on the file server

/export is exported to external domains that use different/export is exported to external domains that use differentUID/GID.UID/GID.

Mapping is automatically done by the Mapping is automatically done by the umapfs umapfs layerlayer Request handed to /localRequest handed to /local

outside administrative exports local administrative exports

UFSFFSuid/gid mapping

NFS server

operation not supported

NFS server

/export/export/local/local

Benefits of Benefits of umapfsumapfs

No cost to local clients (they donNo cost to local clients (they don’’t got gothrough the through the umapfsumapfs))

No changes required to NFS or localNo changes required to NFS or localfilesystem filesystem codecode

We can expand this idea to give eachWe can expand this idea to give eachoutside domain its own mapping,outside domain its own mapping,accessed through its own mount pointaccessed through its own mount point

Union Mount Union Mount FilesystemFilesystem: : unionfsunionfs Traditional mount hides any files in theTraditional mount hides any files in the

directory before the mountdirectory before the mount Union Union filesystem filesystem does a transparent overlaydoes a transparent overlay

(you can still see the old)(you can still see the old) Both can be accessed simultaneouslyBoth can be accessed simultaneously

//usrusr//srcsrc

src src src

shell shell shell

Makefile shsh.h sh.c

//usrusr//srcsrc//tmptmp//srcsrc

shsh.h sh.cMakefile

32

More More unionfsunionfs Start with an empty top-level directoryStart with an empty top-level directory As the underlying As the underlying fs fs is traversed, entries are made inis traversed, entries are made in

the top levelthe top level Lower levels are read-onlyLower levels are read-only Any names appearing in top Any names appearing in top fs fs obscure lower obscure lower fsfs

//usrusr//srcsrc

src src src

shell shell shell

Makefile shsh.h sh.c

//usrusr//srcsrc//tmptmp//srcsrc

shsh.h sh.cMakefile

Still More Still More unionfsunionfs

Two conflicting requirements:Two conflicting requirements: Lower layers are read-onlyLower layers are read-only User should think this is a regular file systemUser should think this is a regular file system

We already copy files up on writes, but how doWe already copy files up on writes, but how dowe handle deletions?we handle deletions? Deletions of unmodified files (lower layer)Deletions of unmodified files (lower layer) Deletions of previously modified files (top layer)Deletions of previously modified files (top layer)

Use Use ““whiteoutwhiteout”” entries in the top level: entries in the top level: inode inode 11

On hitting a whiteout entry, kernel returns error.On hitting a whiteout entry, kernel returns error.

Tricky Issues With Whiteout:Tricky Issues With Whiteout:DirectoriesDirectories

What happens if I create a file with theWhat happens if I create a file with thesame name as a whiteout file?same name as a whiteout file?

What about a directory? Is there anyWhat about a directory? Is there anydifference?difference?

Compare what happens in the Compare what happens in the ““normalnormal””situation where the same directorysituation where the same directoryappears in multiple layers of theappears in multiple layers of thefilesystem filesystem stackstack

33

Uses of the Uses of the unionfsunionfs

Heterogeneous architectures can build from aHeterogeneous architectures can build from acommon source basecommon source base NFS mount source treeNFS mount source tree Union mount over the topUnion mount over the top Builds stay localBuilds stay local

Can compile sources Can compile sources ““onon”” read-only media read-only media(e.g., CD-ROM)(e.g., CD-ROM)

Private source directories from common codePrivate source directories from common codebasesbases

Other Common Other Common FilesystemsFilesystems::PortalPortal

Mount a process into the Mount a process into the filesystemfilesystem Pass remainder of path name to process asPass remainder of path name to process as

inputinput Process returns file descriptorProcess returns file descriptor

Could be socket to processCould be socket to process Could be to fileCould be to file

Can be used, e.g., to provide Can be used, e.g., to provide filesystemfilesystemaccess to network services (opening fileaccess to network services (opening filecreates TCP connection to server)creates TCP connection to server)

Other Common Other Common FilesystemsFilesystems::procfsprocfs

procfs procfs provides information on runningprovides information on runningprocesses, through the file system (processes, through the file system (viz psviz ps))

An An ls ls of /proc shows a directory for eachof /proc shows a directory for eachprocess, which contains:process, which contains: ctlctl: file allowing control of the process: file allowing control of the process file: the executable (compiled code)file: the executable (compiled code) memmem: virtual memory for the process: virtual memory for the process regsregs: registers for the process: registers for the process status: text file with information (e.g. state, status: text file with information (e.g. state, ppidppid, , ……))

34

Small Group DiscussionSmall Group Discussion

Encryption should be invisible to the users, but all files shouldEncryption should be invisible to the users, but all files shouldbe encrypted.be encrypted.

Files for multiple users are stored on the same underlying fileFiles for multiple users are stored on the same underlying filesystemsystem

How will you ensure that only the qualified user can see theHow will you ensure that only the qualified user can see theunencrypted files?unencrypted files?

How/when will a user authenticate itself?How/when will a user authenticate itself? How can users share files?How can users share files? What happens if a user program does a seek on a file in the topWhat happens if a user program does a seek on a file in the top

layer (are there issues with file offsets)?layer (are there issues with file offsets)?

Consider the design of, and explain how you wouldbuild, a vnode layer to automatically encrypt files for a

user. Briefly address the following issues: