
MODULE 4

Multiprocessor Architecture and Programming

A multiprocessor connects a number of processors (two or more) in a manner that allows them to share the simultaneous execution of a single task. In addition, a multiprocessor built from a number of single uniprocessors is expected to be more cost-effective than building one high-performance single processor.

A multiprocessor is expected to reach a higher speed than the fastest uniprocessor. Multiprocessor characteristics are interconnection structures, interprocessor arbitration, interprocessor communication and synchronization, and cache coherence. Multiprocessing sometimes refers to the execution of multiple concurrent software processes in a system, as opposed to a single process at any one instant.


Functional Structures

Multiprocessors are characterized by 2 attributes:

• First, a multiprocessor is a single computer that includes multiple processors.
• Second, processors may communicate and cooperate at different levels in solving a given problem.

The communication may occur by sending messages from one processor to another or by sharing a common memory.

The 2 different architectural models of multiprocessor are:


• Tightly Coupled System
  - Tasks and/or processors communicate in a highly synchronized fashion
  - Communicates through a common shared memory
  - Shared memory system
• Loosely Coupled System
  - Tasks or processors do not communicate in a synchronized fashion
  - Communicates by message passing packets
  - Overhead for data exchange is high
  - Distributed memory system

Loosely Coupled Multiprocessors

• In loosely coupled multiprocessors, each processor has a set of input-output devices and a large local memory from where instructions and data are accessed.
• A Computer Module is a combination of
  - Processor
  - Local Memory
  - I/O Interface
• Processes which execute on different computer modules communicate by exchanging messages through a Message Transfer System (MTS).
• The degree of coupling is very loose; hence it is often referred to as a distributed system.
• A loosely coupled system is usually efficient when the interactions between tasks are minimal.


Each computer module consists of a
• processor
• local memory
• local input-output devices
• interface to other computer modules
The interface may contain a channel and arbiter switch (CAS). If requests from two or more different modules collide in accessing a physical segment of the MTS, the arbiter is responsible for choosing one of the simultaneous requests according to a given service discipline.


• It is also responsible for delaying requests until the servicing of the selected request is completed.
• The channel within the CAS may have a high-speed communication memory which is used for buffering block transfers of messages.
• The message transfer system is a time-shared bus or a shared memory system.

Tightly Coupled Multiprocessors (TCS)

The throughput of the hierarchical loosely coupled multiprocessor may be too low for some applications that require fast response times. If high-speed or real-time processing is required, the TCS may be used.

Two typical models are discussed. The first model consists of

• p processors

• l memory modules

• d input-output channels

The above units are connected through a set of three interconnection networks, namely the

• processor-memory interconnection network (PMIN)

• I/O processor interconnection network (IOPIN)

• Interrupt Signal Interconnection Network (ISIN)

The PMIN is a switch which can connect every processor to every memory module. It has p x l sets of cross points and is a multistage network. A memory module can satisfy only one processor's request in a given memory cycle. Hence, if two or more processors attempt to access the same memory module, a conflict occurs, and it is resolved by the PMIN.

Another method to reduce the conflicts is to associate a reserved storage area with each processor, called the unmapped local memory (ULM); it helps in reducing the traffic in the PMIN. Since each memory reference goes through the PMIN, it encounters delay in the processor-memory switch; this delay can be reduced by using a private cache with each processor.


A multiprocessor organization which uses a private cache with each processor is shown. This multiprocessor organization encounters the cache coherence problem: more than one copy of a data item may exist in the system, and the copies may become inconsistent. In the figure there is a module attached to each processor that directs a memory reference to either the ULM or the private cache of that processor. This module is called the memory map and is similar in operation to the Slocal.

Figure: tightly coupled multiprocessor (a) without private cache; (b) with private cache.

Example of a Tightly Coupled Multiprocessor: The Cyber-170 Architecture

This configuration has 2 subsystems:
• The central processing subsystem
• The peripheral processing subsystem
These subsystems have access to a common central memory (CM) through a central memory controller (CMC), which is essentially a high-speed crossbar switch. In addition to the central memory, there is an optional secondary memory called the extended core memory (ECM), which is a low-speed random-access read-write memory. The ECM and CM form a 2-level memory hierarchy. Here the CMC becomes the switching center, which performs the combined functions of the ISIN, IOPIN, and PMIN.


CM: Central Memory
CMC: Central Memory Controller
CPi: i-th Central Processor
CPS: Central Processing Subsystem
PPS: Peripheral Processing Subsystem

Processor Characteristics for Multiprocessing

A number of desirable architectural features are described below for a processor to be effective in a multiprocessing environment.


Processor recoverability

The process and processor are 2 different entities. If the processor fails there should be another

processor to take up the routine. Reliability of the processor should be present.

Efficient Context Switching

A general purpose register is a large register file that can be used for multi-programmed

processor. For effective utilization it is necessary for the processor to support more than one

addressing domain and hence to provide a domain change or context switching operation.

Large Virtual and Physical address space

A processor intended to be used in the construction of a general purpose medium to large scale

multiprocessor must support a large physical address space. In addition a large virtual space is

also needed.

Effective Synchronization Primitives

The processor design must provide some implementation of invisible actions which serve as the

basis for synchronization primitives.

Interprocessor Communication mechanism

The set of processor used in multiprocessor must have an efficient means of interprocessor

mechanism.

Instruction Set

The instruction set of the processor should have adequate facilities for implementing high-level languages that permit effective concurrency at the procedural level, and for efficiently manipulating data structures.

Interconnection Networks (Multiprocessor Interconnection Networks)


Interconnection networks help in sharing resources: one network connects the processors and the memory modules, and another connects the processors and the I/O systems. There are different physical forms available for interconnection networks.

They are:

• Time shared or common bus,

• Crossbar switch

• Multistage networks for multiprocessors.

The simplest interconnection system for multiprocessors is a communication path connecting all of the functional units. The common path is called a time-shared or common bus. The organization is the least complex and the easiest to configure. Such an interconnection is often a totally passive unit having no active components such as switches. Transfer operations are completely controlled by the bus interfaces of the sending and receiving units. Since the bus is shared, a mechanism must be provided to resolve contention. The conflict resolution methods include static or fixed priorities, FIFO queues, and daisy chaining.

A unit that wishes to initiate a transfer must first determine the availability status of the bus, and then address the destination unit to determine its availability and capability to receive the transfer. A command is also issued to inform the destination unit what operation it is to perform with the data being transferred, after which the data transfer is finally initiated. A receiving unit recognizes its address placed on the bus and responds to the control signals from the sender. An example of a time-shared bus system is the PDP-11.


A multiprocessor with two unidirectional buses uses both buses, yet not much is actually gained; the method increases complexity, and the interconnection subsystem becomes an active device.

Characteristics that affect the performance of the bus:
• Number of active devices on the bus
• Bus arbitration algorithm
• Centralization
• Data bandwidth
• Synchronization of data transmission
• Error detection

Various Bus arbitration Algorithms are:

The static Priority Algorithm

When multiple devices concurrently request the use of the bus, the device with the highest priority is granted the bus. One such method is called daisy chaining: all the devices are assigned static priorities according to their locations along a bus grant control line. The device closest to the bus control unit gets the highest priority.


Requests are made through a common line, BRQ (Bus Request). If the central bus control accepts the request, it passes a BGT (Bus Grant) signal to the concerned device, provided the SACK (bus busy) line is free or idle.

The Fixed Time Slice Algorithm

The available bus bandwidth is divided into fixed time slices, which are then offered to the various devices in a round-robin fashion. If the allotted device does not use its time slice, the time slice is wasted. This is called fixed time slicing or time-division multiplexing.

Dynamic Priority Algorithm

Priorities are allocated to the devices dynamically, so every device gets a chance to use the bus and does not suffer a long turnaround time.

The two algorithms are:
• Least Recently Used (LRU)
• Rotating Daisy Chain (RDC)
The LRU algorithm gives the highest priority to the requesting device that has not used the bus for the longest interval.


In a RDC scheme, no central controller exists, and the bus grant line is connected from the last

device back to the first in a closed loop. Whichever device is granted access to the bus serves as

a bus controller for the following arbitration (an arbitrary device is selected to have initial access

to the bus). Each device’s priority for a given arbitration is determined by that device’s distance

along the bus grant line from the device currently serving as a bus controller; the latter device

has the lowest priority. Hence the priorities change dynamically with each bus cycle.

The First-Come First-Served Algorithm

In FCFS, requests are simply honored in the order received. It favors no particular processor or device on the bus. However, performance may degrade, since a device can end up waiting for the bus for a long period of time simply because its request was made late.

Crossbar Switch and Multiport Memories

If the number of buses in a time-shared bus system is increased, a point is reached at which there is a separate path available for each memory unit. The interconnection network is then called a nonblocking crossbar.


Multiport Memory Module
- Each port serves a CPU
Memory Module Control Logic
- Each memory module has control logic
- Resolves memory module conflicts by fixed priority among the CPUs
Advantages
- Multiple paths -> high transfer rate
Disadvantages
- Requires memory control logic in each module
- Large number of cables and connections


Multistage Networks for Multiprocessors

In order to design multistage networks, the basic principles involved in the construction and

control of simple crossbar has to be understood.

The 2 x 2 switch has the capability of connecting the input A to either the output labeled 0 or the output labeled 1, depending on the value of some control bit CA of the input A. If CA = 0, the input is connected to the upper output; if CA = 1, the connection is made to the lower output. Terminal B of the switch behaves similarly, with a control bit CB. The 2 x 2 module also has the capability to arbitrate between conflicting requests: if both inputs A and B require the same output terminal, only one of them will be connected and the other will be blocked or rejected. A toy model follows.
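A toy C++ model of this switch element (purely illustrative; the function and its conventions are invented here): each input is routed by its control bit, and on a conflict one request is blocked.

#include <iostream>

// Route inputs A and B of a 2 x 2 switch element.
// cA, cB are control bits: 0 selects the upper output, 1 the lower output.
// out[0]/out[1] record which input won each output: 0 = A, 1 = B, -1 = none.
void route(int cA, int cB, int out[2]) {
    out[0] = out[1] = -1;
    out[cA] = 0;              // A is connected to the output it selects
    if (out[cB] == -1)
        out[cB] = 1;          // B is connected only if its output is free
    // otherwise: conflict, and B's request is blocked or rejected
}

int main() {
    int out[2];
    route(0, 0, out);         // both inputs request the upper output
    std::cout << "upper <- " << out[0] << ", lower <- " << out[1] << "\n";
    // prints "upper <- 0, lower <- -1": A is connected, B is blocked
}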


Comparison of three multiprocessor hardware organizations.

Multiprocessor with Time Shared Bus

1. Lowest overall system cost for hardware and least complex

2. Very easy to physically modify the hardware system configuration by adding or

removing functional units

3. Overall system capacity limited by bus transfer rate. Failure of the bus is a catastrophic

failure.

4. Expanding the system by addition of functional units may degrade overall system

performance

5. The system efficiency attainable is the lowest of all three basic interconnection systems.

6. This organization is usually appropriate for smaller systems only.

Multiprocessor with Crossbar Switch

1. This is the most complex interconnection system. There is a potential for the highest total

transfer rate.

2. The functional units are the simplest and cheapest since the control and switching logic is

in the switch

3. Because a basic switching matrix is required to assemble any functional units into a

working configuration, this organization is usually cost effective for multiprocessors

only.

4. System expansions usually improve overall performance.

5. Theoretically, expansion of the system is limited only by the size of the switch matrix,

which can often be modularly expanded within initial design or other engineering

limitations.

6. The reliability of the switch, and therefore of the system, can be improved by segmentation and/or redundancy within the switch.

Multiprocessors with Multiport Memory


1. Requires the most expensive memory units since most of the control and switching

circuitry is included in the memory unit

2. The characteristics of the functional units permit a relatively low cost uniprocessor to be

assembled from them.

3. There is potential for a very high total transfer rate in the overall system.

4. The size and configuration options possible are determined by the number and type of

memory ports available; this design decision is made quite early in the overall design and

is difficult to modify.

5. A large number of cables and connectors are required.

EXPLOITING CONCURRENCY FOR MULTIPROCESSING

A parallel program for a multiprocessor consists of two or more interacting processes. A process is a sequential program that executes concurrently with other processes. In order to understand a parallel program, it is first necessary to identify the processes and the objects that they share. In this section we study two approaches to designing parallel programs. One approach, introduced below, is to have explicit concurrency, by which the programmer specifies the concurrency using certain language constructs. The other approach is to have implicit concurrency, in which case the computer determines what can be executed in parallel.

7.5.1 LANGUAGE FEATURES TO EXPLOIT PARALLELISM

In order to solve problems on an MIMD multiprocessor system, we need an efficient notation for expressing concurrent operations. Processes are concurrent if their executions overlap in time; more precisely, two processes are concurrent if the first operation of one process starts before the last operation of the other process ends. In general, no prior knowledge is available about the speed at which concurrent processes are executed. In this section, we will discuss the concurrency indicated by the programmer. One way to denote concurrency is to use FORK and JOIN statements. FORK spawns a new process, and JOIN waits for a previously created process to terminate. A FORK operation may be specified in three ways: FORK A; FORK A, J; and FORK A, J, N. The execution of the FORK A statement initiates another process at address A and continues the current process.


The execution of the FORK A, J statement causes the same action as FORK A and also increments a counter at address J. FORK A, J, N causes the same action as FORK A and sets the counter at address J to N. In all usages of the FORK statements, the corresponding JOIN statement is expressed as JOIN J. The execution of this statement decrements the counter at J by one. If the result is 0, the process at address J + 1 is initiated; otherwise, the processor executing the JOIN statement is released. Hence, all processes terminate at the JOIN, except the very last one. The application of these instructions for the control of three concurrent processes is shown in Figure 7.50. These instructions do not allow a path to terminate without encountering a junction point. The problem with FORK and JOIN is that, unless judiciously used, they blur the distinction between statements that are executed sequentially and those that may be executed concurrently. FORK and JOIN statements are to parallel programming what the GO TO statement is to sequential programming. Also, because FORK and JOIN can appear in conditional statements and loops, a detailed understanding of program execution is necessary to envision the parallel activities. Nevertheless, when used in a disciplined manner, the statements are a practical way to specify parallelism explicitly. For example, FORK provides a direct mechanism for dynamic process creation, including multiple activations of the same program text.
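A minimal sketch of the FORK A, J, N / JOIN J counter mechanism in C++ (an illustration, not from the notes; the process body and continuation are invented here):

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

std::atomic<int> join_counter{0};          // the counter "at address J"

void continuation() {                       // the code "at address J + 1"
    std::cout << "all processes joined\n";
}

void join() {                               // JOIN J
    if (join_counter.fetch_sub(1) == 1)     // decrement; the last one through...
        continuation();                     // ...continues past the join point
}

void process(int id) {                      // a process started by FORK A
    std::cout << "process " << id << " running\n";
    join();
}

int main() {
    const int N = 3;
    join_counter = N;                       // FORK A, J, N: set counter to N
    std::vector<std::thread> workers;
    for (int i = 0; i < N; ++i)             // each FORK starts a new process
        workers.emplace_back(process, i);
    for (auto& t : workers) t.join();       // keep main alive until all finish
}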

An equivalent extension of the FORK-JOIN concept is the block-structured construct originally proposed by Dijkstra. In this case, each process in a set of processes S1, S2, ..., Sn can be executed concurrently by using the following cobegin-coend (or parbegin-parend) construct:

S0;
cobegin S1; S2; ...; Sn coend;
Sn+1;

The cobegin declares explicitly the parts of a program that may execute concurrently. This makes it possible to distinguish between shared and local variables, which in turn makes clear from the program text the potential sources of interference. Figure 7.51 illustrates the precedence graph of the concurrent program given above. In this case, the block of statements between the cobegin and coend is executed concurrently only after the execution of statement S0, and statement Sn+1 is executed only after all executions of the statements S1, S2, ..., Sn have terminated. Since a concurrent statement has a single entry and a single exit, it fits naturally into structured programming.
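A rough C++ analogue of the construct (a sketch; the statement bodies are placeholders): each Si between cobegin and coend becomes a thread, and coend is the point where all of them are joined.

#include <iostream>
#include <thread>

void S1() { std::cout << "S1\n"; }
void S2() { std::cout << "S2\n"; }
void S3() { std::cout << "S3\n"; }

int main() {
    std::cout << "S0\n";                 // S0 executes before the cobegin
    {                                    // cobegin
        std::thread t1(S1), t2(S2), t3(S3);
        t1.join(); t2.join(); t3.join(); // coend: wait for S1..S3
    }
    std::cout << "S4\n";                 // Sn+1 runs only after the coend
}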


PROCESS SYNCHRONIZATION MECHANISM

Problems with concurrent execution

• Concurrent processes (or threads) often need to share data (maintained either in shared

memory or files) and resources

• If there is no controlled access to shared data, some processes will obtain an inconsistent

view of this data

• The action performed by concurrent processes will then depend on the order in which

their execution is interleaved

An example

• Processes P1 and P2 are running this same procedure and have access to the same variable "a"

• Processes can be interrupted anywhere

• If P1 is first interrupted after user input and P2 executes entirely

• Then the character echoed by P1 will be the one read by P2 !!

#include <iostream>
using namespace std;

static char a;   // shared between P1 and P2

void echo()
{
    cin >> a;    // read a character into the shared variable
    cout << a;   // echo it; another process may have overwritten 'a' in between
}

Race Conditions

• Situations like this, where processes access the same data concurrently and the outcome of execution depends on the particular order in which the accesses take place, are called race conditions
• How must the processes coordinate (or synchronize) in order to guard against race conditions?

The critical section problem

• When a process executes code that manipulates shared data (or a resource), we say that the process is in its critical section (CS) (for that shared data)

• The execution of critical sections must be mutually exclusive: at any time, only one

process is allowed to execute in its critical section (even with multiple CPUs)

• Then each process must request permission to enter its critical section (CS)

The critical section problem

• The section of code implementing this request is called the entry section

• The critical section (CS) might be followed by an exit section

• The remaining code is the remainder section

• The critical section problem is to design a protocol that the processes can use so that their

action will not depend on the order in which their execution is interleaved (possibly on

many processors)

Framework for analysis of solutions

• Each process executes at nonzero speed but no assumption on the relative speed of n

processes

• General structure of a process:

while(true) {

entry section

critical section

exit section

remainder section

}// loop forever

• Many CPUs may be present but memory hardware prevents simultaneous access to the

same memory location

• No assumption about order of interleaved execution

• For solutions: we need to specify entry and exit sections

Requirements for a valid solution to the critical section problem


• Mutual Exclusion

o At any time, at most one process can be in its critical section (CS)

• Progress

o Only processes that are not executing in their RS can participate in the decision of

which process will enter next in the CS

o This selection cannot be postponed indefinitely

• Bounded Waiting

o After a process has made a request to enter its CS, there is a bound on the number

of times that the other processes are allowed to enter their CS

- otherwise the process will suffer from starvation

o Of course, also no deadlock

Types of solutions

• Software solutions

o algorithms whose correctness does not rely on any other assumptions (see

framework)

• Hardware solutions

o rely on some special machine instructions

• Operating System solutions

o provide some functions and data structures to the programmer

Software solutions

• We consider first the case of 2 processes

o Algorithm 1 and 2 are incorrect

o Algorithm 3 is correct (Peterson’s algorithm)

• Then we generalize to n processes

o the bakery algorithm

• Notation

o We start with 2 processes: P0 and P1

o When presenting process Pi, Pj always denotes the other process (i != j)

Algorithm 1

• The shared variable turn is initialized (to 0 or 1) before executing any Pi

• Pi’s critical section is executed iff turn = i

• Pi is busy waiting if Pjis in CS: mutual exclusion is satisfied

• Progress requirement is not satisfied since it requires strict alternation of CSs

• If a process requires its CS more often then the other, it cannot get it.
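Algorithm 1 is shown in the notes only as a figure ("global view"); a minimal sketch in the same style as the listings below, assuming only the shared variable turn:

Process Pi:
while(true)
{
while(turn!=i){/* do nothing: busy wait */};
CS
turn = j; // hand the turn to the other process
RS
} // loop forever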


Algorithm 2


• Keep one bool variable for each process: flag[0] and flag[1]

• Pi signals that it is ready to enter its CS by: flag[i]=true

• Mutual Exclusion is satisfied but not the progress requirement

• If we have the sequence:

o T0: flag[0]=true

o T1: flag[1]=true

• Both processes will wait forever to enter their CS: we have a deadlock

Process Pi:

while(true)

{

flag[i] = true;

while(flag[j]){/* do nothing */};

CS

flag[i] = false;

RS

} // loop forever

Algorithm 3 (Peterson's algorithm)

• Initialization: flag[0]=flag[1]=false

turn = 0 or 1

• Willingness to enter CS specified by flag[i] = true

• If both processes attempt to enter their CS simultaneously, only one turn value will last

• Exit section: specifies that Pi is unwilling to enter CS

Process Pi:

while(true){

flag[i] = true;

// I want in

turn = j;

// but I let the other in

while (flag[j]&&turn==j){/* do nothing */};

CS

flag[i] = false;

// I no longer want in

RS

} // loop forever

Algorithm 3: proof of correctness

• Mutual exclusion is preserved since:

o P0 and P1 are both in CS only if flag[0] = flag[1] = true and only if turn = i for

each Pi (impossible)

• We now prove that the progress and bounded waiting requirements are satisfied:

o Pi can fail to enter its CS only if it is stuck in its while() with condition flag[j] = true and turn = j.


o If Pj is not ready to enter its CS, then flag[j] = false and Pi can then enter its CS
o If Pj has set flag[j] = true and is in its while(), then either turn = i or turn = j
o If turn = i, then Pi enters its CS. If turn = j, then Pj enters its CS, but will then reset flag[j] = false on exit, allowing Pi to enter its CS
o But if Pj then has time to set flag[j] = true again, it must also set turn = i
o Since Pi does not change the value of turn while stuck in its while(), Pi will enter its CS after at most one CS entry by Pj (bounded waiting)
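A runnable C++ rendering of Peterson's algorithm (a sketch under the assumption that std::atomic with its default sequentially consistent ordering supplies the atomic shared variables the pseudocode takes for granted):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<bool> flag[2] = {false, false};
std::atomic<int> turn{0};
int counter = 0;                        // shared data protected by the lock

void worker(int i) {
    int j = 1 - i;
    for (int k = 0; k < 100000; ++k) {
        flag[i] = true;                 // I want in
        turn = j;                       // but I let the other in first
        while (flag[j] && turn == j) {} // entry section: busy wait
        ++counter;                      // critical section
        flag[i] = false;                // exit section
    }
}

int main() {
    std::thread t0(worker, 0), t1(worker, 1);
    t0.join(); t1.join();
    std::cout << counter << "\n";       // expect 200000
}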

What about process failures?

• If all 3 criteria (ME, progress, bounded waiting) are satisfied, then a valid solution will

provide robustness against failure of a process in its remainder section (RS)

o since failure in RS is just like having an infinitely long RS

• However, no valid solution can provide robustness against a process failing in its critical

section (CS)

o A process Pi that fails in its CS does not signal that fact to other processes: for

them Pi is still in its CS

n-process solution: bakery algorithm

• Before entering their CS, each Pi receives a number. Holder of smallest number enter CS

(like in bakeries, ice-cream stores...)

• When Pi and Pj receives same number:

o if i<j then Pi is served first, else Pj is served first

• Pi resets its number to 0 in the exit section

• Notation:

o (a,b) < (c,d) if a < c or if a = c and b < d

o max(a0,...ak) is a number b such that

� b >= ai for i=0,..k

• Shared data:

o choosing: array[0..n-1] of boolean;

� initialized to false

o number: array[0..n-1] of integer;

� initialized to 0

• Correctness relies on the following fact:

o If Pi is in CS and Pk has already chosen its number[k]!= 0, then (number[i],i) <

(number[k],k)

o but the proof is somewhat tricky...

Process Pi:

repeat

choosing[i]:=true;

number[i]:=max(number[0]..number[n-1])+1;


choosing[i]:=false;

for j:=0 to n-1 do {

while (choosing[j]) {};

while (number[j]!=0 and (number[j],j)<(number[i],i)){};

}

CS

number[i]:=0;

RS

forever

Drawbacks of software solutions

• Processes that are requesting to enter in their critical section are busy waiting

(consuming processor time needlessly)

• If Critical Sections are long, it would be more efficient to block processes that are

waiting...

Hardware solutions: interrupt disabling

Process Pi:

repeat

disable interrupts

critical section

enable interrupts

remainder section

forever

• On a uniprocessor: mutual exclusion is preserved but efficiency of execution is degraded:

while in CS, we cannot interleave execution with other processes that are in RS

o On a multiprocessor: mutual exclusion is not preserved

o CS is now atomic but not mutually exclusive

o Generally not an acceptable solution

Hardware solutions: special machine instructions

• Normally, access to a memory location excludes other access to that same location

• Extension: designers have proposed machine instructions that perform 2 actions atomically (indivisibly) on the same memory location (ex: reading and writing)

• The execution of such an instruction is also mutually exclusive (even with multiple

CPUs)

• They can be used to provide mutual exclusion but need to be complemented by other

mechanisms to satisfy the other 2 requirements of the CS problem (and avoid starvation

and deadlock)

The test-and-set instruction

• A C++ description of test-and-set: (this instruction is atomic, that is indivisible)


bool testset(int& i)
{
    // executed atomically by the hardware
    if (i == 0) {
        i = 1;           // claim the lock
        return true;
    }
    else {
        return false;    // the lock was already taken
    }
}

• An algorithm that uses test-and-set for Mutual Exclusion:

• Shared variable b is initialized to 0

• Only the first Pi that sets b enters CS

Process Pi:

repeat

repeat{}

until testset(b);

CS

b:=0;

RS

forever

• Mutual exclusion is preserved: if Pi enters its CS, the other Pj are busy waiting

• Problem: still using busy waiting

• When Pi exit CS, the selection of the Pj who will enter CS is arbitrary: no bounded

waiting. Hence starvation is possible

• Processors (ex: Pentium) often provide an atomic xchg(a,b) instruction that swaps the

content of a and b.

• But xchg(a,b) suffers from the same drawbacks as test-and-set
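A C++ sketch of a spin lock built on an atomic exchange (an illustration; std::atomic<int>::exchange plays the role of the xchg instruction used in the scheme described next):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> b{0};                 // 0 = free, 1 = taken
int shared = 0;

void worker() {
    for (int n = 0; n < 100000; ++n) {
        while (b.exchange(1) == 1) {}  // entry: spin until we swap in 1 and saw 0
        ++shared;                      // critical section
        b.store(0);                    // exit: release the lock
        // remainder section
    }
}

int main() {
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    std::cout << shared << "\n";       // expect 200000
}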

Using xchg for mutual exclusion

• Shared variable b is initialized to 0

• Each Pi has a local variable k

• The only Pi that can enter CS is the one who finds b=0

• This Pi excludes all the other Pj by setting b to 1

Process Pi:

repeat

k:=1

repeat

xchg(k,b)

until k=0;

CS


b:=0;

RS

forever

Semaphores

• A synchronization tool (provided by the OS) that does not require busy waiting

• A semaphore S is an integer variable that, apart from initialization, can only be accessed

through 2 atomic and mutually exclusive operations:

o wait(S)

o signal(S)

• To avoid busy waiting: when a process has to wait, it will be put in a blocked queue of

processes waiting for the same event

• Hence, in fact, a semaphore is a record (structure):

type semaphore = record

count: integer;

queue: list of process

end;

var S: semaphore;

• When a process must wait for a semaphore S, it is blocked and put on the semaphore’s

queue

• The signal operation removes (acc. to a fair policy like FIFO) one process from the queue

and puts it in the list of ready processes

Semaphore’s operations

wait(S):

S.count--;

if (S.count<0) {

block this process

place this process in S.queue

}

signal(S):

S.count++;

if (S.count<=0) {

remove a process P from S.queue

place this process P on ready list

}

• S.count must be initialized to a nonnegative value (depending on the application); a sketch of a possible implementation follows
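A possible C++ sketch of such a semaphore (an interpretation, not the notes' record: std::condition_variable stands in for S.queue, so the count here never goes negative; blocked waiters are held by the condition variable instead):

#include <condition_variable>
#include <mutex>

class Semaphore {
    int count;                           // number of available "permits"
    std::mutex m;                        // makes wait/signal mutually exclusive
    std::condition_variable queue;       // plays the role of S.queue
public:
    explicit Semaphore(int init) : count(init) {}

    void wait() {                        // wait(S)
        std::unique_lock<std::mutex> lk(m);
        queue.wait(lk, [this] { return count > 0; }); // block while no permit
        --count;
    }

    void signal() {                      // signal(S)
        std::lock_guard<std::mutex> lk(m);
        ++count;
        queue.notify_one();              // wake one blocked process, if any
    }
};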


Semaphores: observations

• When S.count>=0: the number of processes that can execute wait(S) without being

blocked = S.count

• When S.count<0: the number of processes waiting on S is = |S.count|

• Atomicity and mutual exclusion: no 2 processes can be in wait(S) and signal(S) (on the same S) at the same time (even with multiple CPUs)

• Hence the blocks of code defining wait(S) and signal(S) are, in fact, critical sections

• The critical sections defined by wait(S) and signal(S) are very short: typically 10

instructions

• Solutions:

o uniprocessor: disable interrupts during these operations (i.e.: for a very short

period). This does not work on a multiprocessor machine.

o multiprocessor: use previous software or hardware schemes. The amount of busy

waiting should be small.

Using semaphores for solving critical section problems

• For n processes

• Initialize S.count to 1

• Then only 1 process is allowed into CS (mutual exclusion)

• To allow k processes into CS, we initialize S.count to k

Process Pi:

repeat

wait(S);

CS

signal(S);

RS

forever

Using semaphores to synchronize processes

• We have 2 processes: P1 and P2

• Statement S1 in P1 needs to be performed before statement S2 in P2

• Then define a semaphore "synch"

• Initialize synch to 0

• Proper synchronization is achieved by having in P1:

S1;

signal(synch);

• And having in P2:

wait(synch);

S2;

The producer/consumer problem


• A producer process produces information that is consumed by a consumer process

o Ex1: a print program produces characters that are consumed by a printer

o Ex2: an assembler produces object modules that are consumed by a loader

• We need a buffer to hold items that are produced and eventually consumed

• A common paradigm for cooperating processes

P/C: unbounded buffer

• We assume first an unbounded buffer consisting of a linear array of elements

• in points to the next item to be produced

• out points to the next item to be consumed

• We need a semaphore S to perform mutual exclusion on the buffer: only 1 process at a

time can access the buffer

• We need another semaphore N to synchronize producer and consumer on the number N

(= in - out) of items in the buffer

o an item can be consumed only after it has been created

• The producer is free to add an item into the buffer at any time: it performs wait(S) before appending and signal(S) afterwards to prevent consumer access

• It also performs signal(N) after each append to increment N

• The consumer must first do wait(N) to see if there is an item to consume and use

wait(S)/signal(S) to access the buffer

Solution of P/C: unbounded buffer
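The solution appears in the notes only as a figure. A sketch of what it contains, using C++20 std::counting_semaphore in place of the semaphores S and N; the unbounded buffer is modeled with std::queue:

#include <iostream>
#include <queue>
#include <semaphore>
#include <thread>

std::queue<int> buffer;                 // unbounded buffer
std::binary_semaphore S{1};             // mutual exclusion on the buffer
std::counting_semaphore<> N{0};         // number of items in the buffer

void producer() {
    for (int v = 0; v < 5; ++v) {
        S.acquire();                    // wait(S)
        buffer.push(v);                 // append
        S.release();                    // signal(S)
        N.release();                    // signal(N): one more item
    }
}

void consumer() {
    for (int k = 0; k < 5; ++k) {
        N.acquire();                    // wait(N): wait for an item first
        S.acquire();                    // wait(S)
        int v = buffer.front(); buffer.pop();   // take
        S.release();                    // signal(S)
        std::cout << v << "\n";         // consume outside the CS
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}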

P/C: unbounded buffer

• Remarks:

o Putting signal(N) inside the CS of the producer (instead of outside) has no effect

since the consumer must always wait for both semaphores before proceeding

o The consumer must perform wait(N) before wait(S), otherwise deadlock occurs if the consumer enters its CS while the buffer is empty

• Using semaphores is a difficult art...

P/C: finite circular buffer of size k

• can consume only when the number N of (consumable) items is at least 1 (now N != in - out)

• can produce only when number E of empty spaces is at least 1

P/C: finite circular buffer of size k

• As before:

o we need a semaphore S to have mutual exclusion on buffer access

o we need a semaphore N to synchronize producer and consumer on the number of

consumable items


• In addition:

o we need a semaphore E to synchronize producer and consumer on the number of

empty spaces

Solution of P/C: finite circular buffer of size k
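Again the notes give the solution only as a figure; a C++20 sketch with the three semaphores S, N, and E described above:

#include <iostream>
#include <semaphore>
#include <thread>

const int k = 4;
int buffer[k];                          // circular buffer
int in = 0, out = 0;
std::binary_semaphore S{1};             // mutual exclusion on buffer access
std::counting_semaphore<k> N{0};        // consumable items
std::counting_semaphore<k> E{k};        // empty spaces

void producer() {
    for (int v = 0; v < 20; ++v) {
        E.acquire();                    // wait(E): need an empty space
        S.acquire();                    // wait(S)
        buffer[in] = v; in = (in + 1) % k;          // append
        S.release();                    // signal(S)
        N.release();                    // signal(N): one more item
    }
}

void consumer() {
    for (int i = 0; i < 20; ++i) {
        N.acquire();                    // wait(N): need an item
        S.acquire();                    // wait(S)
        int v = buffer[out]; out = (out + 1) % k;   // take
        S.release();                    // signal(S)
        E.release();                    // signal(E): one more empty space
        std::cout << v << "\n";
    }
}

int main() {
    std::thread p(producer), c(consumer);
    p.join(); c.join();
}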

The Dining Philosophers Problem

• 5 philosophers who only eat and think

• each need to use 2 forks for eating

• we have only 5 forks

• A classical synchronization problem
• Illustrates the difficulty of allocating resources among processes without deadlock and starvation

• Each philosopher is a process

• One semaphore per fork:

o fork: array[0..4] of semaphores

o Initialization: fork[i].count:=1 for i:=0..4

• A first attempt:

• Deadlock if each philosopher starts by picking up his left fork!

Process Pi:

repeat

think;

wait(fork[i]);

wait(fork[i+1 mod 5]);

eat;

signal(fork[i+1 mod 5]);

signal(fork[i]);

forever

A solution: admit only 4 philosophers at a time to try to eat
• Then 1 philosopher can always eat when the other 3 are holding 1 fork each
• Hence, we can use another semaphore T that limits to 4 the number of philosophers "sitting at the table"
• Initialization: T.count := 4

Process Pi:

repeat

think;

wait(T);

wait(fork[i]);

wait(fork[i+1 mod 5]);

eat;

signal(fork[i+1 mod 5]);


signal(fork[i]);

signal(T);

forever

Binary semaphores

• The semaphores we have studied are called counting (or integer) semaphores

• We have also binary semaphores

o similar to counting semaphores except that "count" is Boolean valued

o counting semaphores can be implemented by binary semaphores...

o generally more difficult to use than counting semaphores (e.g.: they cannot be

initialized to an integer k > 1)

Problems with semaphores

• Semaphores provide a powerful tool for enforcing mutual exclusion and coordinating processes
• But wait(S) and signal(S) are scattered among several processes; hence it is difficult to understand their effects
• Usage must be correct in all the processes
• One bad (or malicious) process can make the entire collection of processes fail

Monitors

• Are high-level language constructs that provide equivalent functionality to that of

semaphores but are easier to control

• Found in many concurrent programming languages

o Concurrent Pascal, Modula-3, uC++, Java...

• Can be implemented by semaphores...

Monitor

• Is a software module containing:

o one or more procedures

o an initialization sequence

o local data variables

• Characteristics:

o local variables accessible only by monitor’s procedures

o a process enters the monitor by invoking one of its procedures

o only one process can be in the monitor at any one time

• The monitor ensures mutual exclusion: no need to program this constraint explicitly

• Hence, shared data are protected by placing them in the monitor

o The monitor locks the shared data on process entry


• Process synchronization is done by the programmer by using condition variables that

represent conditions a process may need to wait for before executing in the monitor

Condition variables

• are local to the monitor (accessible only within the monitor)
• can be accessed and changed only by two functions:
  o cwait(a): blocks execution of the calling process on condition (variable) a
    - the process can resume execution only if another process executes csignal(a)
  o csignal(a): resumes execution of some process blocked on condition (variable) a
    - If several such processes exist: choose any one
    - If no such process exists: do nothing

Monitor

• Awaiting processes are either in the entrance queue or in a condition queue

• A process puts itself into condition queue cn by issuing cwait(cn)

• csignal(cn) brings into the monitor 1 process in condition cn queue

• Hence csignal(cn) blocks the calling process and puts it in the urgent queue (unless

csignal is the last operation of the monitor procedure)

Producer/Consumer problem

• Two types of processes:

o producers

o consumers

• Synchronization is now confined within the monitor

• append(.) and take(.) are procedures within the monitor: they are the only means by which P/C can access the buffer

• If these procedures are correct, synchronization will be correct for all participating

processes

Monitor for the bounded P/C problem

• Monitor needs to hold the buffer:

o buffer: array[0..k-1] of items;

• needs two condition variables:

o notfull: csignal(notfull) indicates that the buffer is not full

o notempty: csignal(notempty) indicates that the buffer is not empty

• needs buffer pointers and counts:

o nextin: points to next item to be appended

o nextout: points to next item to be taken

o count: holds the number of items in the buffer (a sketch of the monitor follows)
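A C++ sketch of this monitor (an interpretation: std::mutex provides the monitor's implicit mutual exclusion and std::condition_variable the condition variables; note that C++ condition variables have signal-and-continue rather than the urgent-queue semantics described above, hence the predicate re-checks):

#include <condition_variable>
#include <mutex>

const int k = 4;

class BoundedBuffer {                   // the monitor
    int buffer[k];                      // local data, accessible only here
    int nextin = 0, nextout = 0, count = 0;
    std::mutex monitor;                 // only one process inside at a time
    std::condition_variable notfull, notempty;
public:
    void append(int x) {
        std::unique_lock<std::mutex> lk(monitor);
        notfull.wait(lk, [this] { return count < k; });   // cwait(notfull)
        buffer[nextin] = x;
        nextin = (nextin + 1) % k;
        ++count;
        notempty.notify_one();          // csignal(notempty)
    }
    int take() {
        std::unique_lock<std::mutex> lk(monitor);
        notempty.wait(lk, [this] { return count > 0; });  // cwait(notempty)
        int x = buffer[nextout];
        nextout = (nextout + 1) % k;
        --count;
        notfull.notify_one();           // csignal(notfull)
        return x;
    }
};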

Message Passing


• Is a general method used for interprocess communication (IPC)

o for processes inside the same computer

o for processes in a distributed system

• Yet another means to provide process synchronization and mutual exclusion
• We have at least two primitives:
  o send(destination, message)
  o receive(source, message)

• In both cases, the process may or may not be blocked

Synchronization in message passing

• For the sender: it is more natural not to be blocked after issuing send(.,.)

o can send several messages to multiple destinations

o but sender usually expect acknowledgment of message receipt (in case receiver

fails)

• For the receiver: it is more natural to be blocked after issuing receive(.,.)

o the receiver usually needs the info before proceeding

o but it could be blocked indefinitely if the sender process fails before issuing send(.,.)

• Hence other possibilities are sometimes offered

• Ex: blocking send, blocking receive:

o both are blocked until the message is received

o occurs when the communication link is unbuffered (no message queue)

o provides tight synchronization (rendezvous)

Addressing in message passing

• direct addressing:

o when a specific process identifier is used for source/destination

o but it might be impossible to specify the source ahead of time (ex: a print server)

• indirect addressing (more convenient):

o messages are sent to a shared mailbox which consists of a queue of messages

o senders place messages in the mailbox, receivers pick them up

Mailboxes and Ports

• A mailbox can be private to one sender/receiver pair

• The same mailbox can be shared among several senders and receivers

o the OS may then allow the use of message types (for selection)

• Port: is a mailbox associated with one receiver and multiple senders

o used for client/server applications: the receiver is the server

Ownership of ports and mailboxes

• A port is usually owned and created by the receiving process

• The port is destroyed when the receiver terminates


• The OS creates a mailbox on behalf of a process (which becomes the owner)

• The mailbox is destroyed at the owner’s request or when the owner terminates

Message format

• Consists of header and body of message

• In Unix: no ID, only message type

• control info:

o what to do if run out of buffer space

o sequence numbers

o priority...

• Queuing discipline: usually FIFO but can also include priorities

Enforcing mutual exclusion with message passing

• create a mailbox mutex shared by n processes

• send() is non blocking

• receive() blocks when mutex is empty

• Initialization: send(mutex, "go");

• The first Pi that executes receive() will enter its CS; the others will be blocked until Pi resends the message (see the sketch below)
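A self-contained C++ sketch of the scheme (the Mailbox type is invented here to stand in for the OS-provided mailbox; send() is non-blocking, receive() blocks while the mailbox is empty):

#include <condition_variable>
#include <iostream>
#include <mutex>
#include <queue>
#include <string>
#include <thread>

class Mailbox {                          // hypothetical mailbox type
    std::queue<std::string> q;
    std::mutex m;
    std::condition_variable cv;
public:
    void send(const std::string& msg) {  // non-blocking send
        { std::lock_guard<std::mutex> lk(m); q.push(msg); }
        cv.notify_one();
    }
    std::string receive() {              // blocks while the mailbox is empty
        std::unique_lock<std::mutex> lk(m);
        cv.wait(lk, [this] { return !q.empty(); });
        std::string msg = q.front(); q.pop();
        return msg;
    }
};

Mailbox mutex_box;                       // the shared mailbox "mutex"
int shared = 0;

void worker() {
    for (int n = 0; n < 1000; ++n) {
        std::string token = mutex_box.receive();  // blocks until token arrives
        ++shared;                        // critical section
        mutex_box.send(token);           // pass the token back
    }
}

int main() {
    mutex_box.send("go");                // initialization: one token in the box
    std::thread t1(worker), t2(worker);
    t1.join(); t2.join();
    std::cout << shared << "\n";         // expect 2000
}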

The bounded-buffer P/C problem with message passing

• We will now make use of messages

• The producer places items (inside messages) in the mailbox mayconsume

• mayconsume acts as our buffer: consumer can consume item when at least one message

is present

• Mailbox mayproduce is filled initially with k null messages (k = buffer size)

• The size of mayproduce shrinks with each production and grows with each consumption

• can support multiple producers/consumers (a sketch of the scheme follows)
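A sketch of the exchange in the notes' send/receive notation (the variable and helper names pmsg, cmsg, produce, and consume are placeholders):

Initialization: send k null messages to mailbox mayproduce

Producer:
repeat
receive(mayproduce, pmsg); // take an empty slot; blocks when none are left
pmsg := produce(); // build the next item
send(mayconsume, pmsg); // make it available to consumers
forever

Consumer:
repeat
receive(mayconsume, cmsg); // blocks until an item is present
consume(cmsg);
send(mayproduce, null); // return an empty slot
forever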
