FUTEX (Fast Userspace muTEX)

Seminar Report


    CHAPTER NO. 1

    INTRODUCTION

In recent years, that is, in the past five years, Linux has seen significant growth as a server operating system and has been successfully deployed in the enterprise for Web, file and print serving. With the advent of kernel version 2.4, Linux has seen a tremendous boost in scalability and robustness, which makes it feasible to deploy even more demanding enterprise applications such as high-end databases, business intelligence software, application servers, and so on. As a result, whole enterprise business suites and middleware such as SAP, WebSphere and Oracle are now available on Linux. For these enterprise applications to run efficiently on Linux, or on any other operating system, the OS must provide the proper abstractions and services. These enterprise applications and application suites are increasingly built as multi-process / multi-threaded applications, and are often a collection of multiple independent subsystems.

Despite functional variations between these applications, they often need to communicate with each other and sometimes need to share a common state. Examples of this are database systems, which typically maintain shared I/O buffers in user space. Access to such shared state must be properly synchronized. Allowing multiple processes to access the same resources in a time-sliced manner, or potentially concurrently in the case of multiprocessor systems, can cause many problems, because data consistency must be maintained, true temporal dependencies must be respected, and each thread must properly release the resource when it has completed its action. Synchronization can be established through locks. There are mainly two types of locks: exclusive locks, which allow only a single user to access the protected entity, and shared locks, which implement multiple-reader / single-writer semantics. Synchronization implies a shared state, indicating that a particular resource is available or busy, and a means to wait for its availability. The latter can be accomplished either through busy-waiting or through an explicit or implicit call to the scheduler.


    CHAPTER NO. 2

CONCURRENCY IN THE LINUX OPERATING SYSTEM

As different processes interact with each other, they may often need to access and modify shared sections of code, memory locations and data. The section of code belonging to a process or thread which manipulates a variable that is also being manipulated by another process or thread is commonly called a critical section. Proper synchronization mechanisms serialize access to the critical section. Processes operate within their own virtual address space and are protected by the operating system from interference by other processes. By default a user process cannot communicate with another process unless it makes use of secure, kernel-managed mechanisms. There are many times when processes need to share common resources or synchronize their actions. One possibility is to use threads, which by definition can share memory within a process. This option is not always possible (or wise) due to the many disadvantages which can be experienced with threads. Methods of passing messages or data between processes are therefore required.

In traditional UNIX systems the basic mechanisms for synchronization were the System V IPC (inter-process communication) facilities, such as semaphores, message queues and sockets, and the file locking mechanisms flock() and fcntl(). Message queues (msgqueues) consist of a linked list within the kernel's address space; messages are added to the queue sequentially and may be retrieved from the queue in several different ways. Semaphores are counters used to control access to shared resources by multiple processes. They are most often used as a locking mechanism to prevent processes from accessing a particular resource while another process is performing operations on it. Semaphores are implemented as sets, though a set may have a single member. Shared memory is a mapping of an area of memory into the address space of more than one process; this is the fastest form of IPC, as processes do not subsequently need access to kernel services in order to share data. fcntl() locking is implemented directly via the kernel. This works without problems on a local machine, but over NFS it requires locking support in the NFS server. The (old) user-space nfsd does not support locking, while the kernel NFS server (knfsd), which comes with Linux 2.2 and newer, does. fcntl() locking will fail if your NFS server does not support locking, so choosing the optimal locking technique depends not only on the client machine but also on the NFS server machine.
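As a concrete illustration of kernel-arbitrated file locking, a minimal fcntl() advisory-locking sketch looks like the following (error handling omitted; the descriptor fd is assumed to be open with write access):

#include <fcntl.h>
#include <unistd.h>

/* Acquire an exclusive (write) lock on the whole file, blocking until granted. */
static int lock_file(int fd)
{
    struct flock fl = {
        .l_type   = F_WRLCK,    /* exclusive lock */
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,          /* 0 means "to end of file", i.e. the whole file */
    };
    return fcntl(fd, F_SETLKW, &fl);    /* F_SETLKW sleeps until the lock is granted */
}

/* Release the lock again. */
static int unlock_file(int fd)
{
    struct flock fl = {
        .l_type   = F_UNLCK,
        .l_whence = SEEK_SET,
        .l_start  = 0,
        .l_len    = 0,
    };
    return fcntl(fd, F_SETLK, &fl);
}

Note that both the lock and the unlock paths are system calls regardless of contention, which is exactly the per-operation overhead that the user-level locking discussed later tries to avoid.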


The concept of semaphores was proposed by the famous computer scientist Edsger Wybe Dijkstra. A semaphore supports two basic operations, UP and DOWN, and its initial value is taken as 1. The pseudocode given by Dijkstra for these operations is as follows:

UP()
{
    Semaphore++;
}

DOWN()
{
    while (Semaphore == 0)
        ;              /* busy-wait until the semaphore becomes non-zero */
    Semaphore = 0;     /* claim the (binary) semaphore */
}

The actual Linux implementation of semaphores in the kernel codes UP and DOWN as system calls. The UP system call, after incrementing the value of the semaphore, sends a wakeup signal to all the processes which are waiting for the semaphore value to become 1 so that they can acquire it. (Waking up means that the process which does UP on the semaphore also changes the waiters' task_struct state field to TASK_RUNNING.) DOWN has behaviour similar to that proposed by Dijkstra: it simply checks whether the semaphore value is 1, and otherwise sleeps within the while() loop.

Nutt gives the pseudocode as

DOWN(s): [ while (s == 0) { wait }; s = s - 1; ]

The code within square brackets is indivisible, and the part within curly braces can be interrupted. To understand how these functions actually achieve mutual exclusion, consider an example where two processes, A and B, try to access the critical section. As the initial value of the semaphore is one, when the two processes call DOWN() concurrently, only one will succeed in completing the DOWN() operation. Suppose process A starts executing DOWN() first. At this point process B cannot preempt A, because DOWN() is a system call. Process A sees that the value of the semaphore is one, atomically decrements it to zero, and comes out of DOWN() without blocking. Now process B can preempt A, because A has come out of the system call. B starts executing DOWN(), finds the value of the semaphore to be zero at this moment, and hence goes to sleep without coming out of the system call.

When process A has completed, it issues a wakeup call and B starts executing. That is the whole story of Linux semaphores. (These IPC mechanisms are coded as C routines in /usr/src/linux/ipc, written by Krishna Balasubramaniam.)

Thus we see that these mechanisms expose an opaque handle to a kernel object that naturally provides the shared state and atomic operations in the kernel. Services must be requested through system calls (e.g. semop()). The drawback of this approach is that every lock access requires a system call. When locks have low contention rates, the system call can constitute a significant overhead. A process may operate in one of two modes, known as 'user' mode and 'system' (or kernel) mode. A single process may switch between the two modes, i.e. they may be different phases of the same process. Processes defaulting to user mode include most application processes; these are executed within an isolated environment provided by the operating system, such that multiple processes running on the same machine cannot interfere with each other's resources. A user process switches to kernel mode when it makes a system call, generates an exception (fault), or when an interrupt occurs (e.g. the system clock). At this point the kernel is executing on behalf of the process: at any one time during its execution a process runs in its own context, and the kernel runs in the context of the currently running process.
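To make this cost concrete, the following minimal sketch uses a System V semaphore as a lock; both the acquire and the release go through the semop() system call even when there is no contention (error handling omitted):

#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/sem.h>

union semun { int val; };           /* required by semctl(2); glibc does not define it */

/* Create a one-element semaphore set initialised to 1 (unlocked). */
static int sysv_lock_create(void)
{
    int semid = semget(IPC_PRIVATE, 1, IPC_CREAT | 0600);
    union semun arg = { .val = 1 };
    semctl(semid, 0, SETVAL, arg);
    return semid;
}

static void sysv_lock(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = -1, .sem_flg = 0 };
    semop(semid, &op, 1);           /* system call, even in the uncontended case */
}

static void sysv_unlock(int semid)
{
    struct sembuf op = { .sem_num = 0, .sem_op = +1, .sem_flg = 0 };
    semop(semid, &op, 1);           /* system call again */
}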

One solution to this problem is to deploy user-level locking, which avoids some of the overhead associated with purely kernel-based locking mechanisms. It relies on a user-level lock located in a shared memory region and modified through atomic operations to indicate the lock status. Only the contended case requires kernel intervention. The exact behaviour and the obtainable performance are directly affected by how and when the kernel services are invoked, and by the primitives used, such as the shared memory region and the atomic operations.

In this seminar I describe a particular fast user-level locking mechanism called futexes, which was developed in the context of the Linux operating system. It consists of two parts: a user library and a kernel service that has been integrated into the Linux kernel distribution in version 2.5.7.


    CHAPTER NO. 3

    LINUX USERSPACE LEVEL LOCKING

    3.1. HISTORY AND IMPLEMENTATION:

Having stated the requirements, we now proceed to describe the basic general implementation issues. As noted before, a futex is a fast user-level locking mechanism, and hence the name Fast Userspace muTEX. Futexes are very basic and lend themselves well to building higher-level locking abstractions such as POSIX mutexes. Most programmers will in fact not use futexes directly but will instead rely on system libraries built on them, such as the NPTL pthreads implementation. A futex is identified by a piece of memory which can be shared between different processes. Fast user-level locking distinguishes two cases: the uncontended case and the contended case.

The uncontended case should be efficient and should avoid system calls by all means. In the contended case we are willing to perform a system call to block in the kernel. Avoiding system calls in the uncontended case requires a shared state in user space accessible to all participating processes/tasks. This shared state, referred to as the user lock, indicates the status of the lock, i.e. whether the lock is held or not and whether there are waiting tasks or not. This is in contrast to the System V IPC mechanisms, which merely export a handle to the user and perform all operations in the kernel. The user lock is located in a shared memory region that was created via shmat() or mmap().

As a result, the lock can be located at different virtual addresses in different address spaces. In the uncontended case, the application atomically changes the lock status word without entering the kernel. Hence atomic operations such as atomic_inc(), atomic_dec(), cmpxchg(), and test_and_set() are necessary in user space.
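On the user-space side, such atomic operations are typically provided by compiler builtins or small assembler fragments rather than by the kernel. A minimal sketch using the GCC/Clang __atomic builtins (the helper names are illustrative, not part of any futex API):

/* Atomically decrement *v and return the new value. */
static inline long atomic_dec_return(long *v)
{
    return __atomic_sub_fetch(v, 1, __ATOMIC_ACQUIRE);
}

/* Atomically increment *v and return the new value. */
static inline long atomic_inc_return(long *v)
{
    return __atomic_add_fetch(v, 1, __ATOMIC_RELEASE);
}

/* Compare-and-exchange: if *v == old, store newval; returns non-zero on success. */
static inline int atomic_cmpxchg(long *v, long old, long newval)
{
    return __atomic_compare_exchange_n(v, &old, newval, 0,
                                       __ATOMIC_ACQ_REL, __ATOMIC_ACQUIRE);
}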

In the contended case, the application needs to wait for the release of the lock, or needs to wake up a waiting task in the case of an unlock operation. In order to wait in the kernel, a kernel object is required that has wait queues associated with it. The wait queues provide the queueing and scheduling interactions. Of course, the aforementioned IPC mechanisms can be used for this purpose. However, these objects still imply a heavyweight object that requires a priori allocation and often does not precisely provide the required functionality. Another alternative that is commonly deployed is spinlocks, where the task spins on the availability of the user lock until it is granted. To avoid wasting too many CPU cycles, the task yields the processor occasionally.

It is desirable for the user lock to be handle-free. In other words, instead of handling an opaque kernel handle, requiring initialization and cross-process global handles, it is desirable to address locks directly through their virtual address. As a consequence, kernel objects can be allocated dynamically and on demand, rather than a priori. A lock, though addressed by a virtual address, can be identified conceptually through its global lock identity, which we define by the memory object backing the virtual address and the offset within that object. It is given by the tuple [B, O]. Since B represents the memory object backing the kernel object, it can be any of three fundamental types:

(a) anonymous memory,

(b) a shared memory segment, and

(c) memory-mapped files.

While (b) and (c) can be used between multiple processes, (a) can only be used between threads of the same process. Utilizing the virtual address of the lock as the kernel handle also provides an integrated access mechanism that ties the virtual address automatically to its kernel object. Despite the atomic manipulation of the user-level lock word, race conditions can still exist, because the sequence of lock-word manipulation and system calls is not atomic. This has to be resolved properly within the kernel to avoid deadlock and improper functioning.

Next, the user lock object of futexes is described. For the purpose of this discussion a general opaque datatype ulock_t is defined to represent the user-level lock. At a minimum it requires a status word:

typedef struct ulock_t {
    long status;
} ulock_t;

Here, a single long field indicates the status of the user lock object. We assume that a shared memory region has been allocated either through shmat() or through mmap() and that any user locks are allocated within this region. Again, the addresses of the same lock need not be the same across all participating address spaces. The basic semaphore functions UP() and DOWN() can be implemented as follows:

static inline int
usema_down(ulock_t *ulock)
{
    if (!__ulock_down(ulock))
        return 0;
    return sys_ulock_wait(ulock);
}

static inline int
usema_up(ulock_t *ulock)
{
    if (!__ulock_up(ulock))
        return 0;
    return sys_ulock_wakeup(ulock);
}

The __ulock_down() and __ulock_up() helpers provide the atomic decrement and increment operations on the lock status word. A non-positive count (status) indicates that the lock is not available; in addition, a negative count can indicate the number of waiting tasks in the kernel. If contention is detected, i.e. if the atomic operation leaves ulock->status negative, the code above falls back into the kernel: usema_down() blocks in sys_ulock_wait(), and usema_up() wakes a waiter via sys_ulock_wakeup().
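A plausible sketch of these fast-path helpers, assuming the counting scheme just described (1 = free, 0 = locked, negative = locked with waiters); the real library code handles additional corner cases such as resetting the count on wakeup:

/* Returns 0 if the lock was taken uncontended, non-zero if the caller must block. */
static inline int __ulock_down(ulock_t *ulock)
{
    /* 1 -> 0 means acquired; a negative result means contention. */
    return __atomic_sub_fetch(&ulock->status, 1, __ATOMIC_ACQUIRE) < 0;
}

/* Returns 0 if there were no waiters, non-zero if the kernel must wake one up. */
static inline int __ulock_up(ulock_t *ulock)
{
    /* 0 -> 1 means released with no waiters; anything below 1 means waiters exist. */
    return __atomic_add_fetch(&ulock->status, 1, __ATOMIC_RELEASE) < 1;
}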


    3.2 PREVIOUS IMPLEMENTATIONS:

    One early design suggested was the explicit allocation of a kernel object and the

    export of the kernel object address as the handle. The kernel object was comprised of a wait

    queue and a unique security signature. On every wait or wakeup call, the signature would be

    verified to ensure that the handle passed indeed was referring to a valid kernel object. The

    disadvantages of this approach have been mentioned in section 2, namely that a handle needs to

    be stored in ulock_t and that explicit allocation and deallocation of the kernel object are required.

    Furthermore, security is limited to the length of the key and hypothetically could be guessed.

Another prototype implementation, known as ulocks, implements general user semaphores with both fair and convoy-avoidance wakeup policies.

Mutually exclusive locks are regarded as a subset of the user semaphores. The prototype also provides multiple reader/single writer locks (rwlocks). The user lock object ulock_t consists of a lock word and an integer indicating the required number of kernel wait queues.

User semaphores and exclusive locks are implemented with one kernel wait queue, and multiple reader/single writer locks are implemented with two kernel wait queues. This implementation separates the lock word from the kernel wait queues and other kernel objects, i.e. the lock word is never accessed from the kernel on the time-critical wait and wakeup code path. Hence the state of the lock and the number of waiting tasks in the kernel are all recorded in the lock word. For exclusive locks, standard counting, as described in the general ulock_t discussion, is implemented. As with general semaphores, a positive number indicates the number of times the semaphore can be acquired, 0 or less indicates that the lock is busy, and the absolute value of a negative number indicates the number of waiting tasks in the kernel.

The premature wakeup call is handled by implementing the kernel-internal wait queues using kernel semaphores (struct semaphore) which are initialized with a value of 0. A premature wakeup call, i.e. one with no pending waiter yet, simply increases the kernel semaphore's count to 1. Once the pending wait arrives, it simply decrements the count back to 0 and exits the system call without waiting in the kernel. All the wait queues (kernel semaphores) associated with a user lock are encapsulated in a single kernel object. In the rwlocks case, the lock word is split into three fields: write locked (1 bit), writers waiting (15 bits), and readers (16 bits). If write-locked, the readers field indicates the number of tasks waiting to read the lock; if not write-locked, it indicates the number of tasks that have acquired read access to the lock.

Writers block on a first kernel wait queue, while readers block on a second kernel wait queue associated with a ulock. To wake up multiple pending read requests, the number of tasks to be woken up is passed through the system call interface. To implement rwlocks and ca-locks, atomic compare-and-exchange support is required; unfortunately, on older 386 platforms that is not the case.
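The described split of the 32-bit rwlock word can be expressed with masks and shifts along the following lines (an illustrative layout, not the actual ulocks source):

#include <stdint.h>

/* Illustrative layout: bit 31 = write-locked, bits 16..30 = writers waiting,
   bits 0..15 = readers (holding the lock, or waiting if write-locked). */
#define RW_WRITE_LOCKED   0x80000000u
#define RW_WRITERS_SHIFT  16
#define RW_WRITERS_MASK   0x7fff0000u
#define RW_READERS_MASK   0x0000ffffu

static inline int rw_is_write_locked(uint32_t w)      { return (w & RW_WRITE_LOCKED) != 0; }
static inline unsigned rw_writers_waiting(uint32_t w) { return (w & RW_WRITERS_MASK) >> RW_WRITERS_SHIFT; }
static inline unsigned rw_readers(uint32_t w)         { return w & RW_READERS_MASK; }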

The kernel routines must identify the kernel object that is associated with the user lock. Since the lock can be placed at different virtual addresses in different processes, a lookup has to be performed. In the common fast lookup, the virtual address of the user lock and the address space are hashed to a kernel object. If no hash entry exists, the proper global identity [B, O] of the lock must be established. For this we first scan the calling process's vma list for the vma containing the lock word and its offset.

The global identity is then looked up in a second hash table that links global identities with their associated kernel objects. If no kernel object exists for this global identity, one is allocated, initialized and added to the hash tables. The close() function associated with a shared region holding kernel objects is intercepted, so that kernel objects are deleted and the hash tables are cleaned up once all attached processes have detached from the shared region. While this implementation provides for all the requirements, the kernel infrastructure of multiple hash tables and lookups was deemed too heavyweight. In addition, the requirement for compare-and-exchange was also seen to be restrictive.


    CHAPTER NO. 4

    FUTEXES

    4.1 WHAT IS FUTEX:

The Linux kernel provides futexes ('Fast Userspace muTexes') as a building block for fast userspace locking and semaphores. Futexes were designed and worked on by Hubertus Franke (IBM Thomas J. Watson Research Center), Matthew Kirkwood, Ingo Molnar (Red Hat) and Rusty Russell (IBM Linux Technology Center). Futexes are very basic and lend themselves well to building higher-level locking abstractions such as POSIX mutexes. Initial futex support was merged in Linux 2.5.7, but with different semantics from those described below; the current semantics are available from Linux 2.5.40 onwards.

A futex is identified by a piece of memory which can be shared between different processes. In these different processes, it need not have identical addresses. In its bare form, a futex has semaphore semantics: it is a counter that can be incremented and decremented atomically, and processes can wait for the value to become positive. Futex operation is entirely in userspace for the non-contended case; the kernel is only involved to arbitrate the contended case. As any sane design will strive for non-contention, futexes are also optimised for this situation. In its bare form, a futex is an aligned integer which is only touched by atomic assembler instructions.

Processes can share this integer over mmap(), via shared segments, or because they share memory space, in which case the application is commonly called multithreaded. There are three key points of the original futex implementation which was added to the 2.5.7 kernel:

    1. We use a unique identifier for each futex (which can be shared across different address spaces,

    so may have different virtual addresses in each): this identifier is the struct page pointer and the

    offset within that page. We increment the reference count on the page so it cannot be swapped

    out while the process is sleeping.

2. The structure indicating which futex the process is sleeping on is placed in a hash table, and is created upon entry to the futex syscalls on the process's kernel stack.


3. The contraction of 'Fast Userspace muTEX' into 'futex' gave a simple, unique identifier for the section of code and the function names used.

    4.2 DESCRIPTION:

The Linux kernel provides futexes ('Fast Userspace muTexes') as a building block for fast userspace locking and semaphores. Futexes are very basic and lend themselves well to building higher-level locking abstractions such as POSIX mutexes.

This section does not set out to document all design decisions but restricts itself to issues relevant for application and library development. Most programmers will in fact not be using futexes directly but will instead rely on system libraries built on them, such as the NPTL pthreads implementation.

A futex is identified by a piece of memory which can be shared between different processes. In these different processes, it need not have identical addresses. In its bare form, a futex has semaphore semantics: it is a counter that can be incremented and decremented atomically; processes can wait for the value to become positive.

Futex operation is entirely in userspace for the non-contended case. The kernel is only involved to arbitrate the contended case. As any sane design will strive for non-contention, futexes are also optimised for this situation. In its bare form, a futex is an aligned integer which is only touched by atomic assembler instructions. Processes can share this integer over mmap(), via shared segments, or because they share memory space, in which case the application is commonly called multithreaded.

    4.3 SEMANTICS:

Any futex operation starts in userspace, but it may be necessary to communicate with the kernel using the futex(2) system call.

To 'up' a futex, execute the proper assembler instructions that will cause the host CPU to atomically increment the integer. Afterwards, check whether it has in fact changed from 0 to 1, in which case there were no waiters and the operation is done. This is the non-contended case, which is fast and should be common.

In the contended case, the atomic increment changed the counter from -1 (or some other negative number). If this is detected, there are waiters. Userspace should now set the counter to 1 and instruct the kernel to wake up any waiters using the FUTEX_WAKE operation.


Waiting on a futex, to 'down' it, is the reverse operation. Atomically decrement the counter and check whether it changed to 0, in which case the operation is done and the futex was uncontended. In all other circumstances, the process should set the counter to -1 and request that the kernel wait for another process to up the futex. This is done using the FUTEX_WAIT operation.

The futex system call can optionally be passed a timeout specifying how long the kernel should wait for the futex to be upped. In this case, the semantics are more complex and the programmer is referred to futex(2) for more details. The same holds for asynchronous futex waiting.
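A minimal sketch of this up/down protocol on Linux follows. glibc provides no futex() wrapper, so the raw system call is used; the helper names futex_down()/futex_up() are illustrative, and the handling of signals, timeouts and the known subtle races of this literal counter protocol (which production libraries avoid by using compare-and-exchange, see Chapter 7) is deliberately simplified:

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

/* The futex word: 1 = available, 0 = locked (no waiters), -1 = locked with waiters.
   Usage: futex_down(&futex_word); ...critical section...; futex_up(&futex_word); */
static int futex_word = 1;

static long sys_futex(int *uaddr, int op, int val)
{
    return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

static void futex_down(int *f)
{
    /* Fast path: 1 -> 0 means we own the futex without entering the kernel. */
    while (__atomic_sub_fetch(f, 1, __ATOMIC_ACQUIRE) != 0) {
        /* Contended: flag waiters and sleep until some process ups the futex. */
        __atomic_store_n(f, -1, __ATOMIC_SEQ_CST);
        sys_futex(f, FUTEX_WAIT, -1);   /* sleeps only while *f still reads -1 */
    }
}

static void futex_up(int *f)
{
    /* Fast path: 0 -> 1 means there were no waiters. */
    if (__atomic_add_fetch(f, 1, __ATOMIC_RELEASE) == 1)
        return;
    /* There were waiters: make the futex available and wake one of them. */
    __atomic_store_n(f, 1, __ATOMIC_SEQ_CST);
    sys_futex(f, FUTEX_WAKE, 1);
}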

    4.4 READ / WRITE LOCKS:

We considered an implementation of FUTEX_READ_DOWN et al., which would be similar to the simple mutual exclusion locks, but before adding these to the kernel, Paul Mackerras suggested a design for creating read/write locks in userspace by using two futexes and a count: fast userspace read/write locks, or "furwocks". This implementation provides the benchmark that any kernel-based implementation has to beat in order to justify its inclusion as a first-class primitive, which can be done by adding new valid op values.

    4.5 OPERATIONS TO BE PERFORMED:

• FUTEX_WAIT: This operation atomically verifies that the futex address uaddr still contains the value val, and sleeps awaiting FUTEX_WAKE on this futex address. If the timeout argument is non-NULL, its contents describe the maximum duration of the wait, which is infinite otherwise. The arguments uaddr2 and val3 are ignored.

Example: for futex(7), this call is executed if decrementing the count gave a negative value (indicating contention), and it will sleep until another process releases the futex and executes the FUTEX_WAKE operation.

• FUTEX_WAKE: This operation wakes at most val processes waiting on this futex address (i.e. inside FUTEX_WAIT). The arguments timeout, uaddr2 and val3 are ignored.

Example: for futex(7), this is executed if incrementing the count showed that there were waiters, once the futex value has been set to 1 (indicating that it is available).

• FUTEX_FD: To support asynchronous wakeups, this operation associates a file descriptor with a futex. If another process executes a FUTEX_WAKE, the process will receive the signal number that was passed in val. The calling process must close the returned file descriptor after use. The arguments timeout, uaddr2 and val3 are ignored. To prevent race conditions, the caller should test whether the futex has been upped after FUTEX_FD returns.

• FUTEX_REQUEUE (since Linux 2.5.70): This operation was introduced in order to avoid a "thundering herd" effect when FUTEX_WAKE is used and all processes woken up need to acquire another futex. This call wakes up val processes and requeues all other waiters on the futex at address uaddr2. The arguments timeout and val3 are ignored.

• FUTEX_CMP_REQUEUE (since Linux 2.6.7): There was a race in the intended use of FUTEX_REQUEUE, so FUTEX_CMP_REQUEUE was introduced. This is similar to FUTEX_REQUEUE, but it first checks whether the location uaddr still contains the value val3. If not, the error EAGAIN is returned. The argument timeout is ignored.

    4.6 VALUES RETURNED BY THE ABOVE FUNCTIONS:

• FUTEX_WAIT: Returns 0 if the process was woken by a FUTEX_WAKE call. In case of timeout, ETIMEDOUT is returned. If the futex was not equal to the expected value, the operation returns EWOULDBLOCK. Signals (or other spurious wakeups) cause FUTEX_WAIT to return EINTR.

• FUTEX_WAKE: Returns the number of processes woken up.

• FUTEX_FD: Returns the new file descriptor associated with the futex.


• FUTEX_REQUEUE: Returns the number of processes woken up.

• FUTEX_CMP_REQUEUE: Returns the number of processes woken up.

    4.7 ERRORS:

• EACCES: No read access to the futex memory.

• EAGAIN: FUTEX_CMP_REQUEUE found an unexpected futex value. (This probably indicates a race; use the safe FUTEX_WAKE now.)

• EFAULT: Error retrieving timeout information from userspace.

• EINVAL: An operation was not defined, or there was an error in page alignment.

• ENFILE: The system limit on the total number of open files has been reached.

    4.8 VERSIONS:

Initial futex support was merged in Linux 2.5.7, but with different semantics from those described above. Current semantics are available from Linux 2.5.40 onwards, while additional functionality became available around Linux 2.5.70 (FUTEX_REQUEUE) and 2.6.7 (FUTEX_CMP_REQUEUE).


    CHAPTER NO.5

    MUTEXES AND SEMAPHORES

    5.1 WHAT ARE MUTEXES:

Mutual exclusion (often abbreviated to mutex) algorithms are used in concurrent programming to avoid the simultaneous use of a common resource, such as a global variable, by pieces of computer code called critical sections. A critical section is a piece of code in which a process or thread accesses a common resource. The critical section by itself is not a mechanism or algorithm for mutual exclusion: a program, process, or thread can have a critical section in it without any mechanism or algorithm which implements mutual exclusion.

Examples of such resources are fine-grained flags, counters or queues, used to communicate between code that runs concurrently, such as an application and its interrupt handlers. The synchronization of access to those resources is an acute problem because a thread can be stopped or started at any time.

To illustrate: suppose a section of code is altering a piece of data over several program steps when another thread, perhaps triggered by some unpredictable event, starts executing. If this second thread reads from the same piece of data, the data, which is in the process of being overwritten, is in an inconsistent and unpredictable state. If the second thread tries overwriting that data, the ensuing state will probably be unrecoverable. These shared data being accessed by critical sections of code must therefore be protected, so that other processes which read from or write to the chunk of data are excluded from running.

A mutex is also a common name for a program object that negotiates mutual exclusion among threads, also called a lock.

    5.2 DESCRIPTION:

The purpose of mutexes is to let threads share resources safely. If two or more threads attempt to manipulate a data structure with no locking between them, then the system may run for quite some time without apparent problems, but sooner or later the data structure will become inconsistent and the application will start behaving strangely and is quite likely to crash. The same can apply even when manipulating a single variable or some other resource. For example, consider:


static volatile int counter = 0;

void
process_event(void)
{
    counter++;
}

Assume that after a certain period of time counter has a value of 42, and two threads A and B running at the same priority call process_event(). Typically thread A will read the value of counter into a register, increment this register to 43, and write this updated value back to memory. Thread B will do the same, so usually counter will end up with a value of 44. However, if thread A is timesliced after reading the old value 42 but before writing back 43, thread B will still read back the old value and will also write back 43. The net result is that the counter only gets incremented once, not twice, which depending on the application may prove disastrous. Sections of code like the above which involve manipulating shared data are generally known as critical regions. Code should claim a lock before entering a critical region and release the lock when leaving. Mutexes provide an appropriate synchronization primitive for this.

static volatile int counter = 0;
static cyg_mutex_t lock;

void
process_event(void)
{
    cyg_mutex_lock(&lock);
    counter++;
    cyg_mutex_unlock(&lock);
}

A mutex must be initialized before it can be used, by calling cyg_mutex_init. This takes a pointer to a cyg_mutex_t data structure which is typically statically allocated, and may be part of a larger data structure. If a mutex is no longer required and there are no threads waiting on it, then cyg_mutex_destroy can be used.


The main functions for using a mutex are cyg_mutex_lock and cyg_mutex_unlock. In normal operation cyg_mutex_lock will return success after claiming the mutex lock, blocking if another thread currently owns the mutex. However, the lock operation may fail if other code calls cyg_mutex_release or cyg_thread_release, so if these functions may get used then it is important to check the return value. The current owner of a mutex should call cyg_mutex_unlock when the lock is no longer required. This operation must be performed by the owner, not by another thread.

cyg_mutex_trylock is a variant of cyg_mutex_lock that will always return immediately, returning success or failure as appropriate. This function is rarely useful. Typical code locks a mutex just before entering a critical region, so if the lock cannot be claimed then there may be nothing else for the current thread to do. Use of this function may also cause a form of priority inversion if the owner runs at a lower priority, because the priority inheritance code will not be triggered. Instead the current thread continues running, preventing the owner from getting any CPU time, completing the critical region, and releasing the mutex.

cyg_mutex_release can be used to wake up all threads that are currently blocked inside a call to cyg_mutex_lock for a specific mutex. These lock calls will return failure. The current mutex owner is not affected.
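The examples above use the eCos kernel API. On Linux, where the NPTL pthreads library builds its mutexes on top of futexes, the equivalent pattern is the following short sketch:

#include <pthread.h>

static volatile int counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

void process_event(void)
{
    pthread_mutex_lock(&lock);      /* uncontended case: handled entirely in userspace */
    counter++;
    pthread_mutex_unlock(&lock);    /* contended case: a FUTEX_WAKE wakes a sleeping waiter */
}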

5.3 RECURSIVE MUTEXES:

The implementation of mutexes within the eCos kernel does not support recursive locks. If a thread has locked a mutex and then attempts to lock the mutex again, typically as a result of some recursive call in a complicated call graph, then either an assertion failure will be reported or the thread will deadlock. This behaviour is deliberate. When a thread has just locked a mutex associated with some data structure, it can assume that that data structure is in a consistent state. Before unlocking the mutex again it must ensure that the data structure is again in a consistent state.

Recursive mutexes allow a thread to make arbitrary changes to a data structure, then in a recursive call lock the mutex again while the data structure is still inconsistent. The net result is that code can no longer make any assumptions about data structure consistency, which defeats the purpose of using mutexes.

5.4 WHAT ARE SEMAPHORES:

In computer science, a semaphore is a protected variable or abstract data type that provides a simple but useful abstraction for controlling access by multiple processes to a common resource in a parallel programming environment.


A useful way to think of a semaphore is as a record of how many units of a particular resource are available, coupled with operations to safely (i.e. without race conditions) adjust that record as units are required or become free, and, if necessary, to wait until a unit of the resource becomes available.

Semaphores are a useful tool in the prevention of race conditions and deadlocks; however, their use is by no means a guarantee that a program is free from these problems. Semaphores which allow an arbitrary resource count are called counting semaphores, whilst semaphores which are restricted to the values 0 and 1 (or locked/unlocked, unavailable/available) are called binary semaphores.

The semaphore concept was invented by Dutch computer scientist Edsger Dijkstra, and the concept has found widespread use in a variety of operating systems.

5.5 SYSTEMS OPERATED ON SEMAPHORES:

1. Semaphore line, a system of long-distance communication based on towers with moving arms.
2. Flag semaphore systems.
3. Railway semaphore signals for railway traffic control.

1. Semaphore Line:

A semaphore telegraph, optical telegraph, shutter telegraph chain, Chappe telegraph, or Napoleonic semaphore is a system of conveying information by means of visual signals, using towers with pivoting shutters, also known as blades or paddles. Information is encoded by the position of the mechanical elements; it is read when the shutter is in a fixed position. These systems were popular in the late 18th to early 19th century. In modern usage, "semaphore line" and "optical telegraph" may refer to a relay system using flag semaphore.

Semaphore lines were a precursor of the electrical telegraph. They were far faster than post riders for bringing a message over long distances, but far more expensive and less private than the electrical telegraph lines which would replace them. The distance that an optical telegraph can bridge is limited by geography and weather; thus, in practical use, most optical telegraphs used lines of relay stations to bridge longer distances.

2. Semaphore Flags:

Flag semaphore is a system for conveying information at a distance by means of visual signals with hand-held flags, rods, disks, paddles, or occasionally bare or gloved hands. Information is encoded by the position of the flags; it is read when the flag is in a fixed position. Semaphores were adopted and widely used (with hand-held flags replacing the mechanical arms of shutter semaphores) in the maritime world in the early 19th century. Semaphore signals were used, for example, at the Battle of Trafalgar. This was the period in which the modern naval semaphore system was invented. This system uses hand-held flags. It is still used during underway replenishment at sea and is acceptable for emergency communication in daylight or, using lighted wands instead of flags, at night.

3. Railway Semaphore Signals:

One of the earliest forms of fixed railway signal is the semaphore. These signals display their different indications to train drivers by changing the angle of inclination of a pivoted 'arm'. Semaphore signals were patented in the early 1840s by Joseph James Stevens, and soon became the most widely used form of mechanical signal. Designs have altered over the intervening years, and colour light signals have replaced semaphore signals in some countries, but in others they remain in use.

5.6 MUTEXES Vs SEMAPHORES:

• Mutexes: "Mutexes are typically used to serialise access to a section of re-entrant code that cannot be executed concurrently by more than one thread. A mutex object only allows one thread into a controlled section, forcing other threads which attempt to gain access to that section to wait until the first thread has exited from that section."

• Semaphores: "A semaphore restricts the number of simultaneous users of a shared resource up to a maximum number. Threads can request access to the resource (decrementing the semaphore), and can signal that they have finished using the resource (incrementing the semaphore)."
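To make the distinction concrete, the following sketch uses POSIX counting semaphores (sem_init/sem_wait/sem_post, which glibc's NPTL also implements on top of futexes) to limit the number of simultaneous users of a resource; the limit of 4 is an arbitrary illustration:

#include <semaphore.h>

#define MAX_USERS 4                  /* arbitrary limit, for illustration only */

static sem_t resource_slots;

void setup(void)
{
    sem_init(&resource_slots, 0, MAX_USERS);   /* 0 = shared between threads of one process */
}

void use_resource(void)
{
    sem_wait(&resource_slots);       /* decrement: blocks once all slots are taken */
    /* ... use the shared resource ... */
    sem_post(&resource_slots);       /* increment: frees a slot for another thread */
}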


    CHAPTER NO. 6

    POSIX

    6.1 WHAT IS POSIX:

    "Portable Operating System Interface [for Unix] is the name of a family of relatedstandards specified by the IEEE to define the application programming interface (API), along

    with shell and utilities interfaces for software compatible with variants of the Unix operatingsystem, although the standard can apply to any operating system.

    Originally, the name "POSIX" referred to IEEE Std 1003.1-1988, released, as the name suggests,in 1988. The family of POSIX standards is formally designated as IEEE 1003 and the

    international standard name is ISO/IEC 9945.

    The standards, formerly known as IEEE-IX, emerged from a project that began circa 1985.Richard Stallman suggested the name POSIX in response to an IEEE request for a memorable

    name.

    The POSIX specifications for Unix-like operating system environments originallyconsisted of a single document for the core programming interface, but eventually grew to 17

    separate documents.The standardized user command line and scripting interface were based onthe Korn shell. Many user-level programs, services, and utilities including awk, echo, ed were

    also standardized, along with required program-level services including basic I/O (file, terminal,and network) services. POSIX also defines a standard threading library API which is supported

    by most modern operating systems.

    6.2 SCRIPTING INTERFACE:

A command-line interface (CLI) is a mechanism for interacting with a computer operating system or software by typing commands to perform specific tasks. This text-only interface contrasts with the use of a mouse pointer with a graphical user interface (GUI) to click on options, or menus on a text user interface (TUI) to select options. This method of instructing a computer to perform a given task is referred to as "entering" a command: the system waits for the user to conclude the submission of the text command by pressing the "Enter" key (a descendant of the "carriage return" key of a typewriter keyboard). A command-line interpreter then receives, parses, and executes the requested user command. The command-line interpreter may be run in a text terminal or in a terminal emulator window as a remote shell client such as PuTTY. Upon completion, the command usually returns output to the user in the form of text lines on the CLI.


6.3 KORN SHELL:

The Korn shell (ksh) is a Unix shell which was developed by David Korn (AT&T Bell Laboratories) in the early 1980s and announced at the Toronto USENIX conference on July 14, 1983. Other early contributors were AT&T Bell Labs developers Mike Veach, who wrote the emacs code, and Pat Sullivan, who wrote the vi code. ksh is backwards-compatible with the Bourne shell and includes many features of the C shell as well, such as a command history, which was inspired by the requests of Bell Labs users.

6.4 BOURNE SHELL:

The Bourne shell, or sh, was the default Unix shell of Unix Version 7, and replaced the Thompson shell, whose executable file had the same name, sh. It was developed by Stephen Bourne, of AT&T Bell Laboratories, and was released in 1977 in the Version 7 Unix release distributed to colleges and universities. It remains a popular default shell for Unix accounts. The binary program of the Bourne shell or a compatible program is located at /bin/sh on most Unix systems, and it is still the default shell for the root superuser on many current Unix implementations.

Its command interpreter contained all the features that are commonly considered to produce structured programs. Although it is used as an interactive command interpreter, it was always intended as a scripting language. It gained popularity with the publication of The UNIX Programming Environment by Brian W. Kernighan and Rob Pike, the first commercially published book that presented the shell as a programming language in tutorial form.


    CHAPTER NO. 7

    THE 2.5.7 IMPLEMENTATION

    7.1 THE IMPLEMENTATION OF FUTEX:

The initial implementation, which was judged a sufficient basis for kernel inclusion, used a single two-argument system call, sys_futex(struct futex *, int op). The first argument was the address of the futex, and the second was the operation, used to further demultiplex the system call and insulate the implementation somewhat from the problems of system call number allocation. The latter is especially important as the system call is expanded as new operations are required. The sys_futex system call provides a method for a program to wait for a value at a given address to change, and a method to wake up anyone waiting on a particular address. While the addresses for the same memory in separate processes may not be equal, the kernel maps them internally so that the same memory mapped in different locations will correspond for sys_futex calls.

The two valid op numbers for this implementation were FUTEX_UP and FUTEX_DOWN. The algorithm was simple, the file linux/kernel/futex.c containing 140 code lines, and 233 in total.

The algorithm implemented is as given below:

1. The user address is checked for alignment and that it does not overlap a page boundary.

2. The page is pinned: this involves looking up the address in the process's address space to find the appropriate struct page *, and incrementing its reference count so it cannot be swapped out.

3. The struct page * and the offset within the page are added, and that result is hashed, using the recently introduced fast multiplicative hashing routines, to give a hash bucket in the futex hash table.

4. The op argument is then examined. If it is FUTEX_DOWN then:

(a) The process is marked INTERRUPTIBLE, meaning it is ready to sleep.

(b) A struct futex_q is chained to the tail of the hash bucket determined in step 3: the tail is chosen to give FIFO ordering for wakeups. This structure contains a pointer to the process and the struct page * and offset which identify the futex uniquely.

(c) The page is mapped into low memory (if it is a high-memory page), an atomic decrement of the futex address is attempted, and the page is then unmapped again. If this does not decrement the counter to zero, we check for signals (setting the error to EINTR and going to the next step), schedule, and then repeat this step.


(d) Otherwise, we now have the futex, or have received a signal, so we mark this process RUNNING, unlink ourselves from the hash table, wake the next waiter if there is one, and return 0 or -EINTR. We have to wake another process so that it decrements the futex to -1 to indicate that it is waiting (in the case where we have the futex), or to avoid the race where a signal came in just as we were woken up to get the futex (in the case where a signal was received).

5. If the op argument is FUTEX_UP:

(a) Map the page into low memory if it is a high-memory page.

(b) Set the count of the futex to one (available).

(c) Unmap the page if it was mapped from high memory.

(d) Search the hash table for the first struct futex_q associated with this futex, and wake up that process.

6. Otherwise, if the op argument is anything else, set the error to EINVAL.

7. Unpin the page.

While there are several subtleties in this implementation, it gives a second major advantage over System V semaphores: there are no explicit limits on how many futexes you can create, nor can one futex user starve other users of futexes. This is because the futex is merely a memory location like any other until the sys_futex syscall is entered, and each process can only do one sys_futex syscall at a time, so we are limited to pinning one page per process into memory, at worst.

    7.2 PROBLEMS AND MODIFICATIONS IN 2.5.7 IMPLEMENTATION:

• PROBLEMS:

1. Once the first implementation entered the mainstream experimental kernel, it drew the attention of a much wider audience, in particular those concerned with implementing POSIX threads, and attention also returned to the fairness issue.

2. There is no straightforward way to implement the pthread_cond_timedwait primitive: this operation requires a timeout, but using a timer is difficult, as these must not interfere with their use by any other code.


3. The pthread_cond_broadcast primitive requires every sleeping process to be woken up, which does not fit well with the 2.5.7 implementation, where a process only exits the kernel when the futex has been successfully obtained or a signal is received.

4. For N:M threading, such as the Next Generation POSIX Threads project [5], an asynchronous interface for finding out about the futex is required, since a single process (containing multiple threads) might be interested in more than one futex.

5. Starvation occurs in the following situation: a single process which immediately drops and then immediately competes for the lock will regain it before any woken process will.

• MODIFICATIONS:

With these limitations brought to light, we searched for another design which would be flexible enough to cater for these diverse needs. After various implementation attempts and discussions we settled on a variation of the atomic_compare_and_swap primitive, with the atomicity guaranteed by passing the expected value into the kernel for checking. With these modifications, futexes have been implemented in the Linux kernel from version 2.6 onwards.

To do this, two new op values, FUTEX_WAIT and FUTEX_WAKE, replaced the operations above, and the system call was changed to take two additional arguments, int val and struct timespec *reltime.

FUTEX_WAIT: This operation atomically verifies that the futex address still contains the value given, and sleeps awaiting FUTEX_WAKE on this futex address. If the timeout argument is non-NULL, its contents describe the maximum duration of the wait, which is infinite otherwise. It is similar to the previous FUTEX_DOWN, except that the looping and manipulation of the counter is left to userspace.

    This works as follows:

    1. Set the process state to INTERRUPTIBLE, and place struct futex_q into the hash table as

    before.

    2. Map the page into low memory (if in high memory).

    3. Read the futex value.

    4. Unmap the page (if mapped at step 2).

    5. If the value read at step 3 is not equal to the val argument provided to the system call, set the

    return to EWOULDBLOCK.


6. Otherwise, sleep for the time indicated by the reltime argument, or indefinitely if that is NULL.

(a) If we timed out, set the return value to ETIMEDOUT.

(b) Otherwise, if there is a signal pending, set the return value to EINTR.

7. Try to remove our struct futex_q from the hash table: if we were already removed, return 0 (success) unconditionally, as this means we were woken up; otherwise return the error code specified above.

FUTEX_WAKE: This is similar to the previous FUTEX_UP, except that it does not alter the futex value; it simply wakes one (or more) processes. The number of processes to wake is controlled by the int val parameter, and the return value for the system call is the number of processes actually woken and removed from the hash table.

FUTEX_AWAIT: This is proposed as an asynchronous operation to notify the process via a SIGIO-style mechanism when the value changes. The exact method has not yet been settled (see future work in Section 5). This new primitive is only slightly slower than the previous one, in that the time between waking the process and that process attempting to claim the lock has increased (as the lock claim is done in userspace on return from the FUTEX_WAKE syscall), and if the process has to attempt the lock multiple times before success, each attempt will be accompanied by a syscall, rather than the syscall claiming the lock itself. On the other hand, the following can be implemented entirely in the userspace library:

1. All the POSIX-style locks, including pthread_cond_broadcast (which requires the wake-all operation) and pthread_cond_timedwait (which requires the timeout argument). One of the authors (Rusty) has implemented a non-pthreads demonstration library which does exactly this.

2. Read-write locks in a single word, on architectures which support cmpxchg-style primitives.

3. FIFO wakeup, where fairness is guaranteed to anyone waiting.

Finally, it is worthwhile pointing out that the kernel implementation requires exactly the same number of lines as the previous implementation.

FUTEX_FD: This is used to support asynchronous wakeups. This operation associates a file descriptor with a futex. If another process executes a FUTEX_WAKE, the process will receive the signal number that was passed in val. The calling process must close the returned file descriptor after use. To prevent race conditions, the caller should test whether the futex has been upped after FUTEX_FD returns. It returns the new file descriptor associated with the futex.


    CHAPTER NO. 8

    ADVANTAGES, DIS-ADVANTAGES AND REQUIREMENTS

• ADVANTAGES:

1. Futexes make locking in the operating system fast and maintain quick access.

2. User-level locking is the special feature of futexes, which helps avoid interrupting other users working on the same system.

3. A large kernel space, which helps multiple users operate on the system.

• DIS-ADVANTAGES:

1. Blocking while waiting to be notified that the lock has been released.

2. Yielding the processor until the thread is rescheduled, and then trying to acquire the lock again.

3. Spinning, consuming CPU cycles, until the lock is released.

    y REQUIREMENTS:There are various behavioral requirements that need to be considered. Most center

    around the fairness of the locking scheme and the lock release policy. In a fair locking scheme

    the lock is granted in the order it was requested, i.e., it is handed over to the longest waiting task.

    This can have negative impact on throughput due to the increased number of context switches.At the same time it can lead to the so called convoy problem. Since, the locks are granted in theorder of request arrival, they all proceed at the speed of the slowest process, slowing down all

    waiting processes. A common solution to the convoy problem has been to mark the lockavailable upon release, wake all waiting processes and have them recontend for the lock.

    This is referred to as random fairness, although higher priority tasks will usually

    have an advantage over lower priority ones. However, this also leads to the well known

  • 8/4/2019 My Final Seminar Reprt

    27/33

    FUTEX(Fast Userspace muTEX)

    27

    thundering herd problem. Despite this, wake-all can work quite well on uni-processor systems
    if the first task to wake releases the lock before being preempted, allowing the second herd
    member to obtain the lock, and so on. It works far less well on Symmetric Multi-Processing
    (SMP) systems. To avoid this problem, one should wake up only one waiting task upon lock
    release. Marking the lock available as part of releasing it gives the releasing task the
    opportunity to reacquire the lock immediately, if so desired, avoiding unnecessary context
    switches and the convoy problem. Some refer to these schemes as greedy, as the running task
    has the highest probability of reacquiring the lock if the lock is hot. However, this can lead
    to starvation. Hence, the basic mechanisms must enable fair locking, random locking, and
    greedy or convoy-avoidance locking (ca-locking for short).
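
    The policy choice between waking everyone and waking a single task maps directly onto the val
    argument of FUTEX_WAKE. The snippet below is a minimal sketch of the two release paths; the
    function names are illustrative, and the atomic store that actually marks the lock free is
    assumed to have happened just before the wake call.

        /* Wake-one versus wake-all on lock release (illustrative sketch). */
        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <limits.h>
        #include <stdint.h>

        /* Hand the wakeup to a single waiter: avoids the thundering herd,
         * but can form convoys if waiters are served strictly in turn. */
        static void release_wake_one(uint32_t *futex_word)
        {
            syscall(SYS_futex, futex_word, FUTEX_WAKE, 1, NULL, NULL, 0);
        }

        /* Wake every waiter and let them recontend ("random fairness"):
         * breaks convoys but herds all waiters onto the lock at once. */
        static void release_wake_all(uint32_t *futex_word)
        {
            syscall(SYS_futex, futex_word, FUTEX_WAKE, INT_MAX, NULL, NULL, 0);
        }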

    Another requirement is to enable spin locking, i.e., to have an application spin on the
    availability of the lock for some user-specified time (or until the lock is granted) before
    giving up and blocking in the kernel. Spin locks are useful for short critical sections; if a
    long critical section cannot be avoided, blocking synchronization primitives should be used
    instead of spin-lock primitives. Hence an application has the choice to either (a sketch of a
    combined spin-then-block acquire path follows this list):

    A. block, waiting to be notified that the lock has been released;

    B. yield the processor until the thread is rescheduled, and then try to acquire the lock again; or

    C. spin, consuming CPU cycles until the lock is released.
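
    The sketch below combines options C and A: it spins on an atomic compare-and-swap for a
    bounded number of iterations (cheap when the critical section is short) and only then blocks
    in the kernel with FUTEX_WAIT. The two-state encoding (0 = free, 1 = held) and the spin limit
    are simplifying assumptions for illustration; real futex-based mutexes normally track a third
    "locked with waiters" state so that the release path can skip the FUTEX_WAKE syscall when
    nobody is waiting.

        /* Spin-then-block lock (minimal sketch, not a production mutex). */
        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdatomic.h>

        #define SPIN_LIMIT 100                 /* illustrative bound on userspace spinning */

        typedef struct { atomic_uint word; } sb_lock_t;   /* 0 = free, 1 = held */

        static void sb_lock(sb_lock_t *l)
        {
            for (;;) {
                /* Option C: spin briefly, hoping the holder releases soon. */
                for (int i = 0; i < SPIN_LIMIT; i++) {
                    unsigned int expected = 0;
                    if (atomic_compare_exchange_weak(&l->word, &expected, 1))
                        return;                /* fast path: no system call at all */
                }
                /* Option A: sleep, but only while the word still reads 1. */
                syscall(SYS_futex, &l->word, FUTEX_WAIT, 1, NULL, NULL, 0);
            }
        }

        static void sb_unlock(sb_lock_t *l)
        {
            atomic_store(&l->word, 0);         /* mark the lock free */
            syscall(SYS_futex, &l->word, FUTEX_WAKE, 1, NULL, NULL, 0); /* wake one waiter */
        }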

    Thus, with respect to performance, there are basically two overriding goals:

    1. Avoid system calls if possible, as a system call typically consumes several hundred
    instructions.

    2. Avoid unnecessary context switches, which carry the overhead of saving and restoring state
    and of the associated TLB invalidations.

    Each time a process is removed from the processor, sufficient information about its current
    operating state must be stored so that, when it is again scheduled to run, it can resume its
    operation from an identical position. This operational state data is known as its context, and
    the act of removing the process's thread of execution from the processor (and replacing it
    with another) is known as a process switch or context switch.


    A typical process switch or context switch involves saving the outgoing process's CPU
    registers and program counter into its task structure, choosing the next runnable process,
    switching the memory-management context (which may invalidate TLB entries), and restoring the
    saved state of the incoming process.

    Another requirement is that fast user-level locking should be simple enough to provide the
    basic foundation to efficiently enable more complicated synchronization constructs, e.g.
    semaphores, rwlocks, blocking locks, spin versions of these, pthread mutexes, and DB latches.
    It should also allow for a clean separation of the blocking requirements towards the kernel,
    so that blocking only has to be implemented with a small set of different constructs. This
    allows the use of the basic primitives to be extended without kernel modifications. Of
    particular interest is the implementation of mutexes, semaphores and multiple reader / single
    writer locks.

    Finally, a solution needs to be found that enables recovery from dead locks. Deadlock is a
    permanent blocking of a set of processes that either compete for system resources or
    communicate with each other. Deadlock may be addressed by prevention or by deadlock avoidance.
    Mutual exclusion prevents two threads from accessing the same resource simultaneously, and is
    one of the conditions under which deadlock can arise. Deadlock avoidance can include
    initiation denial or allocation denial, both of which serve to eliminate the state required
    for deadlock before it arises. We define unrecoverable locks as those that have been acquired
    by a process which then terminates without releasing the lock. There are no convenient means
    for the kernel or for other processes to determine which locks are currently held by a
    particular process, since lock acquisition is achieved through manipulation of user memory.

    Each process has some form of associated process identifier (PID) through which it may be
    manipulated. The process also carries the user identifier (UID) of the person who initiated it
    and a group identifier (GID). Registering the process's PID after lock acquisition is not
    enough, as the two operations are not atomic: if the process dies before it can register its
    PID, or if it has cleared its PID but dies before being able to release the lock, the lock is
    unrecoverable.
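
    This race is the reason why simply writing the owner's identity next to the lock word does not
    work. The sketch below shows the alternative later adopted by the kernel's robust-futex work:
    the owner's thread ID is used as the lock value itself, so acquiring the lock and recording
    ownership become a single atomic operation. The helper names and the use of the raw gettid
    system call are illustrative assumptions.

        /* Recording ownership atomically: the owner TID *is* the lock value.
         * Minimal sketch of the idea behind robust futexes, not the real API. */
        #include <linux/futex.h>
        #include <sys/syscall.h>
        #include <unistd.h>
        #include <stdatomic.h>

        static atomic_uint lock_word;        /* 0 = free, otherwise the holder's TID */

        static int try_lock_with_owner(void)
        {
            unsigned int expected = 0;
            unsigned int me = (unsigned int)syscall(SYS_gettid);
            /* One atomic step: either the word now carries our TID and we own
             * the lock, or it already carries the TID of the current owner. */
            return atomic_compare_exchange_strong(&lock_word, &expected, me);
        }

        static void unlock_with_owner(void)
        {
            atomic_store(&lock_word, 0);                                  /* release */
            syscall(SYS_futex, &lock_word, FUTEX_WAKE, 1, NULL, NULL, 0); /* wake a waiter */
        }

    With this encoding, a lock whose holder has died still carries the dead task's TID, which is
    exactly the ownership information the PID-registration scheme above cannot record reliably.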


    CHAPTER NO. 9

    FUTURE WORK

    A lot of time is wasted between one process releasing the lock and another process acquiring
    it. In order to reduce that time, an asynchronous wait extension is planned.

    Currently, system calls are avoided only in the uncontended case; it would be better if the
    remaining kernel interactions could be reduced as well. For this purpose, the authors are
    working with the NGPT team to build global POSIX mutexes over futexes. NGPT supports an M:N
    threading model, i.e., M user-level threads are executed over N tasks. Conceptually, the N
    tasks provide virtual processors on which the M user threads execute. When a user-level
    thread, running on one of these N tasks, needs to block on a futex, it should not block the
    task, as the task provides the virtual processing. Instead, only the user thread should be
    descheduled by the thread manager of the NGPT system.

    Nevertheless, a waitobj must be attached to the wait queue in the kernel, indicating that a
    user thread is waiting on a particular futex and that the task group needs a notification with
    respect to continuation on that futex. Once the thread manager receives the notification it
    can reschedule the previously blocked user thread. For this the authors propose an additional
    operation, AFUTEX_WAIT, for the sys_futex system call. Its task is to append a waitobj to the
    futex's kernel wait queue and continue. This waitobj cannot be allocated on the stack and must
    be allocated and de-allocated dynamically.

    Dynamic allocation has the disadvantage that the waitobjs must be freed even during an
    irregular program exit. It also poses a denial-of-service threat, in that a malicious
    application can continuously call sys_futex(AFUTEX_WAIT).

    The general solutions seem to converge on the use of a /dev/futex device to control resource
    consumption. The first solution is to allocate a file descriptor fd from the /dev/futex device
    for each outstanding asynchronous waitobj. Conveniently, these descriptors should be pooled to
    avoid constantly opening and closing the device. The private data of the file would simply be
    the waitobj. Upon completion a SIGIO is sent to the application.

    The advantage of this approach is that the denial-of-service exposure is naturally limited by
    the file descriptor limits imposed on a process. Furthermore, on program death, all waitobjs
    still enqueued can easily be dequeued. The disadvantage is that this approach can
    significantly pollute the fd space. Another solution proposed has been to open only one fd,
    but to allow multiple waitobj allocations on that fd.

    This approach removes the fd-space pollution issue but requires an additional tuning parameter
    for how many outstanding waitobjs should be allowed per fd. It also requires proper resource
    management of the waitobjs in the kernel. At this point no definite decision has been reached
    on which direction to proceed. The question of priorities in futexes has also been raised: the
    current implementation is strictly FIFO order. The use of nice levels is almost certainly too


    restrictive, so some other priority method would be required. Expanding the system call to add
    a priority argument is possible if a clear application advantage were demonstrated.


    CONCLUSION

    In this report we described a fast user-level locking mechanism, called futexes, which was
    integrated into the Linux 2.5 development kernel and is part of the Linux 2.6 kernel. The
    various requirements for such a package, previous solutions, and the current futex package
    were outlined. In terms of performance, futexes can provide significant advantages over
    standard System V IPC semaphores in all case studies.


