
Fault-Tolerant Distributed Shared Memory on a

Broadcast-based Interconnection Architecture

A Thesis

Submitted to the Faculty

of

Drexel University

by

Diana Lynn Hecht

in partial fulfillment of the

requirements for the degree

of

Doctor of Philosophy

December 2002


Dedications

I dedicate this dissertation to my husband, Stephen. I would not have been able to achieve this

goal without his faith in me and his sacrifice, patience, and understanding.


Acknowledgements

I would like to give a special thanks to my advisor, Dr. Constantine Katsinis, for his invaluable

guidance, encouragement, advice and friendship over many years. I am very grateful for the

impact he has had in my life. In addition, I would like to thank my parents for the love, support,

and encouragement that helped me to be where I am today. I would like to thank my husband,

Stephen, for his unwavering confidence in me and his encouragement, patience, understanding

and sacrifice throughout this endeavor.


Table of Contents

LIST OF TABLES vi

LIST OF FIGURES vii

ABSTRACT x

1. INTRODUCTION 1

1.1. INTRODUCTION 1

1.2. PREVIOUS WORK 2

1.3. ORGANIZATION OF THE THESIS 6

2. THE SOME-BUS ARCHITECTURE 7

2.1. THE SOME-BUS ARCHITECTURE 7

2.2. DESIGN COMPLEXITY AND SCALABILITY 12

3. DISTRIBUTED SHARED MEMORY ON THE SOME-BUS 15

3.1. BASIC DSM OPERATION ON THE SOME-BUS 15

3.2. ORGANIZATION OF CACHE AND DIRECTORY CONTROLLERS 18

3.3. INPUT QUEUE STRUCTURE AND RESOLVER DESIGN 22

3.4. PERFORMANCE ANALYSIS 25

3.4.1. Trace-Driven Simulation 26

3.4.2. Decision Tree 36

3.4.3. Queuing Network Theoretical Model 49

3.4.4. Comparison Between the Simulators and the Theoretical Model 59

4. FAULT TOLERANCE 67

4.1. FAULT TOLERANCE AND DISTRIBUTED SHARED MEMORY ON THE SOME-BUS 67

4.2. FT0 PROTOCOL 69

4.3. FT1 PROTOCOL 85


4.3.1. Home2 Consistency 87

4.3.2. Using Home2 Memory to Fill Read Requests 96

4.3.3. Performance 103

4.3.4. Implementation Issues 111

4.4. FT2 PROTOCOL 113

4.4.1. Performance 117

4.5. FT3 PROTOCOL 121

4.5.1. Performance 123

5. CONCLUSIONS 129

6. LIST OF REFERENCES 131

7. VITA 134


List of Tables

3.1 Comparison of Multiprocessor Trace Files 29

3.2 Simulation Data, after Relocation 31

3.3 Decision Tree Path Distribution for Read Accesses 40

3.4 Decision Tree Path Distribution for Write Hit Accesses 45

3.5 Decision Tree Path Distribution for Write Miss Accesses 45

3.6 Net Cost for Paths A Through F 46

3.7 Net Cost for Paths H Through L 46

3.8 Net Cost for Paths M Through S 47

3.9 Decision Tree Traffic vs. Channel Utilization 48

3.10 Distribution of Read or Write Misses Through the Decision Tree Paths 48

4.1 Average Number of Cache Blocks Written back at Checkpoint Time 84

4.2 Comparison of the Network Cost of the Tree Paths for FT0 and FT1 102

4.3 Distribution of Tree Path Usage for Cache Misses 105


List of Figures

2.1 Parallel Receiver Array and Output Coupler 7

2.2 Optical Interface 10

2.3 Processor Interface 12

2.4 Extending the SOME-Bus from N to 2N Nodes 14

3.1 Typical System Architecture 19

3.2 Computer System Architecture for DSM and Message Passing 20

3.3 Computer System Architecture for Message Passing 20

3.4 SOME-Bus Channel Controller, Cache and Directory 21

3.5 Single Resolver Implementation 23

3.6 Separate Cache and Directory Message Chains 24

3.7 Dual Resolver Implementation 25

3.8 Distribution of Address References 30

3.9 Distribution of Address References, after Relocation 30

3.10 Processor Utilization and Channel Utilization: Low Level of Locality 34

3.11 Processor Utilization and Channel Utilization: High Level of Locality 35

3.12 Decision Tree Branch for Read Accesses 37

3.13 Decision Tree Path Distribution for Write Accesses to Blocks in Cache 42

3.14 Decision Tree Path Distribution for Write Accesses to Blocks not in Cache 43

3.15 Processor, Cache, Directory and Channel Queues 50

3.16 Message Flow in Queue System 50

3.17 Four-node Queuing Network 53

3.18 Traffic Through a Single Node in the Queuing Network 54

3.19 Channel Utilization: Case N01 61

3.20 Channel Utilization: Case N10 61


3.21 Channel Utilization: Case S01 62

3.22 Channel Utilization: Case S10 62

3.23 Channel Queue Waiting Time: Case N01 63

3.24 Channel Queue Waiting Time: Case N10 63

3.25 Channel Queue Waiting Time: Case S01 64

3.26 Channel Queue Waiting Time: Case S10 64

3.27 Processor Utilization: Case N01 65

3.28 Processor Utilization: Case N10 65

3.29 Processor Utilization: Case S01 66

3.30 Processor Utilization: Case S10 66

4.1 FT0 Memory Organization of SOME-Bus Node 70

4.2 Checkpoint Intervals of 10,000 and 30,000 for the Case N01 Applications 75

4.3 Checkpoint Intervals of 10,000 and 30,000 for the Case N10 Applications 75

4.4 Checkpoint Intervals of 10,000 and 30,000 for the Case S01 Applications 76

4.5 Checkpoint Intervals of 10,000 and 30,000 for the Case S10 Applications 77

4.6 State-transition-rate Diagram 80

4.7 State Probabilities, Pw= .10 and Pbfr = .0003 83

4.8 State Probabilities, Pw= .10 and Pbfl = .9 84

4.9 FT1 Memory Organization of SOME-Bus Node 86

4.10 Local and Remote Requests for Data Block DataB 87

4.11 Ownership Request Message Transfer with FT1 Protocol 90

4.12 Message Time Line 93

4.13 Messages at the Receiver Queues of the Home2 Node 95

4.14 Decision Tree for FT1 Read Request 98

4.15 Decision Tree for FT1 Write Miss 100

4.16 Decision Tree for FT1 Write Hit 101


4.17 Processor Utilization, Case N01 106

4.18 Execution Time, Case N01 106

4.19 Processor Utilization, Case N10 107

4.20 Execution Time, Case N10 108

4.21 Processor Utilization, Case S01 109

4.22 Execution Time, Case S01 109

4.23 Processor Utilization, Case S10 110

4.24 Execution Time, Case S10 111

4.25 Ownership Request scenarios 115

4.26 Execution Time, Case N01 118

4.27 Execution Time, Case S01 118

4.28 Execution Time, Case N10 119

4.29 Execution Time, Case S10 119

4.30 Total Execution Time, Case N01 125

4.31 Total Execution Time, Case N10 126

4.32 Total Execution Time, Case S01 126

4.33 Total Execution Time, Case S10 127

4.34 Total Execution Time, Trace Files 128


Abstract

Fault-Tolerant Distributed Shared Memory on a Broadcast-based Interconnection Architecture

Diana Lynn Hecht
Advisor: Constantine Katsinis, Ph.D.

This thesis focuses on the issue of reliability and fault tolerance in Distributed Shared

Memory Multiprocessors, and on the performance impact of implementing fault tolerance

protocols that allow for Backward Error Recovery through the use of synchronized

checkpointing. High-performance parallel computing systems that implement Distributed Shared Memory (DSM) require interconnection networks capable of providing low latency, high bandwidth, and efficient support for multicast and synchronization operations. Software-based

DSM systems rely on the operating system to manage the replicated memory pages and

consequently their performance suffers due to operating system overhead, false sharing and page

thrashing. In order to obtain high levels of performance, the activities related to maintaining the

consistency of shared data in a DSM should be implemented in hardware so that latencies for data

access can be minimized. The recoverable DSM system examined in this thesis is intended for

the class of broadcast-based interconnection networks in order to provide the low latencies

required for the application workloads characteristic of DSM.

An example of this class of interconnection network is the Simultaneous Optical

Multiprocessor Exchange Bus (SOME-Bus). The unique architecture of the SOME-Bus provides

for strong integration of the transmitter, receiver, and cache controller hardware to produce a

highly integrated system-wide coherence mechanism. This thesis presents four protocols for

fault-tolerant DSM and uses simulation and theoretical analysis to examine the performance of

the protocols on the SOME-Bus multiprocessor. The proposed fault tolerance protocols exploit

the inherent data distribution operations that occur as part of the management of shared data in

DSMs in order to hide the overhead of fault tolerance. The increased availability of shared data


for the support of fault tolerance can be used to enhance the performance of the DSM by

increasing the likelihood that a request for data can be filled locally without requiring

communication with remote nodes.


CHAPTER 1: INTRODUCTION

1.1 INTRODUCTION

High-performance computing is required for many applications, including the modeling

of weather patterns, atomic structure of materials and other physical phenomena as well as image

processing, simulation of integrated circuits and other applications known as “Grand Challenge”

problems. Scalable systems capable of addressing these application classes are formed by

interconnecting large numbers of microprocessor-based processing nodes in order to create

distributed-memory multiprocessor systems. The effectiveness of these types of multiprocessing

systems is determined by the interconnection network architecture, the programming model

supported by the system, and the level of reliability and fault-tolerance provided by the system.

The types of applications used with these multiprocessors affect the level of performance that can be provided, owing to differences in the inherent parallelism of the application (which affects the ability to balance the workload across the processors), differences in communication patterns, and differences in the implementation of synchronization operations.

Shared address programming, message passing, and data parallel processing are examples

of the programming models that can be supported by multiprocessor systems. The programming

model is important because it affects the amount of operating system overhead involved in

communication operations as well as the level of involvement required by the programmer to

specify the processor interaction required by the application. The message passing paradigm

requires a higher level of programmer involvement and knowledge of the details of the

underlying communication subsystem in order to explicitly direct the interprocessor

communication. Distributed Shared Memory (DSM) systems offer the application programmer a model for using shared data that is identical to the one used when writing sequential programs, thereby reducing the complexity involved in developing distributed applications.


The memory consistency model supported by a DSM has a large impact on system performance. Sequential consistency provides the programmer with the most intuitive memory model but results in increased access latency and network bandwidth requirements. Relaxed

consistency models such as Release Consistency [13] and Entry Consistency [3] improve

performance by allowing hardware optimizations such as pipelining and reordering as well as

overlapping of memory accesses. However, the weaker consistency models rely heavily on

synchronization operations when accessing shared data, requiring a higher level of complexity for

the programmer. The success of DSM depends on its ability to free the programmer from any operations whose only purpose is to support the memory model. It is

imperative that interconnection networks be designed with high bisection bandwidth and low

latency in order to provide the best possible performance in DSM systems.

As the number of interconnected nodes in a DSM system increases, the probability of

node failures also increases. For this reason, tolerating node failures becomes essential for

parallel applications with large execution times. Backward Error Recovery enables an

application that encounters an error to restart its execution from an earlier, error-free state. In

order for this to be possible, the state (checkpoint) of a process must be periodically saved so that it can be used to restart the application in the case of a node failure.

1.2 PREVIOUS WORK

A survey of recoverable DSM systems is provided in [28]. In [18], ICARE, a recoverable DSM, is presented in which the basic Write-Invalidate coherence protocol is extended in order to manage both active data and recovery data. The system uses globally consistent checkpointing implemented as a two-phase commit protocol. In [29], the implementation of ICARE on a COMA architecture (hardware-based DSM using a cache line as the unit of coherence) is compared with its implementation on a Shared Virtual Memory architecture (software-based DSM using a memory page as the unit of coherence).


References [36] and [38] present the Boundary-Restricted class of coherence protocols, aimed at providing a mechanism to guarantee that the number of copies of a memory page remains within a specified range. In [12], several instances of the Boundary-Restricted coherence protocols are compared with Write-Invalidate, Write-Invalidate with Downgrading, and Write-Broadcast in terms of the level of availability and the operating costs for update or invalidation operations.

[19] presents a recoverable DSM system based on the competitive update protocol. The

goal of the competitive update protocol is to remove copies of pages in nodes that are not actively

using them in order to reduce the system overhead of keeping the pages updated. The original

competitive update protocol was modified in order to guarantee that at least two copies of each page exist in the system, enabling recovery from a single node failure. The system is implemented on a page-based software DSM that implements Release Consistency.

DSMPI, a parallel library implemented on top of MPI that provides the abstraction of globally accessible shared memory, is presented in [34]. The unit of sharing is the program

variable or data structure and non-blocking coordinated checkpointing is provided. A globally

consistent state is achieved through the use of a checkpoint identifier included with every

message sent. Before processing a message, a receiver compares its own checkpoint identifier with the one received with the message in order to determine whether the message was sent before or after the receiver's current checkpoint round.

In [2] a recoverable shared memory (RSM) is proposed in which the memory in the

system is divided into two equal parts, the current data blocks and the recovery data blocks. The

RSM uses a dependency-tracking matrix in order to create a consistent recovery state, using a

two-phase commit protocol, without requiring a global, fully synchronized mechanism for

creating a new recovery point. Advantages of this approach are that only processors directly

affected by a failure are required to perform a rollback, processor failures are tolerated

transparently, and both sequential and relaxed consistency are supported. Disadvantages of this


approach are that write operations can be delayed because the copy-on-write mechanism for

updating the recovery data requires that the dependency tracking be performed on every write.

A major reason for the moderate performance that occurs with many modern large-scale

applications is the mismatch between interconnection architecture and application structure.

DSM systems typically have a large percentage of multicast traffic due to the invalidation messages in Write-Invalidate cache coherence protocols and the update messages in Write-Update cache coherence protocols. In addition to high bisection bandwidth and low latency, the

interconnection architecture should provide support for stronger levels of memory consistency,

efficient multicast and broadcast operations for cache-coherence related messages, and facilitate

hardware-based DSM through close integration with the cache and memory subsystems. The

performance of a DSM system was shown to be adversely affected by network latency in simulations performed in [21].

There is a large body of research on multicast communication in popular architectures, using path-based broadcasting [16], trees, and (multidestination) wormhole routing [6][23][27]. Much of this effort is focused on developing algorithms to alleviate the fact that intense multicast communication causes wormhole routing to resemble store-and-forward routing.

Barrier synchronization occurs frequently in programs for multiprocessors, especially in

those problems that can be solved by using iterative methods. The ability to support efficient

synchronization operations is important not only for application-level synchronization but also for

activities relating to fault tolerance such as globally synchronized checkpointing. Efficient

implementation of barrier synchronization in scalable multiprocessors and multicomputers has

received a lot of attention. Synchronization based on multidestination wormhole routing and

additional hardware support at the network routers is presented in [31]. Additional hardware

designs in support of efficient synchronization are presented in [33] and [40].


A network architecture that offers an alternative to present switch-based networks relies

on one-to-all broadcast, where each processor can directly communicate with any other

processor; from the point of view of any processor, all other processors appear the same. Such a

network architecture allows the user, or the compiler, to structure the data and operations in the

application code to better reflect the parallelism inherent in the applications with the resulting

benefit that messages experience smaller latencies. It also allows the operating system to perform

extensive thread placement and migration dynamically to successfully manage the level of

parallelism present in large applications. The most useful properties of such a network

architecture are high bandwidth (scaling directly with the number of workstations), low latency,

no arbitration delay, and non-blocking communication.

Broadcast-based networks can be constructed using optoelectronic devices (and multiple-

wavelength data representations), relying on sources, modulators and arrays of detectors, all

being coupled to local electronic processors. This thesis presents a set of protocols for fault-

tolerance that are optimized for a multiprocessor system interconnected via a broadcast-based

network architecture. Although the protocols presented can be applied to the general class of

broadcast-based networks, in this thesis, the protocols are described in terms of their

implementation on the Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) [22].

The memory architecture required to support the different protocols is also presented.

The work presented in this thesis differs from previous work in that the DSM system is

hardware-based and the communication that occurs as part of the underlying DSM as well as the

Fault Tolerance protocols is optimized to take advantage of the efficient broadcast capability of

the network. In addition, the Fault Tolerance protocols have been integrated with the existing

DSM operations in a manner that causes little or no reduction in system performance during

normal operation and eliminates most of the overhead at checkpoint creation. Under certain

conditions, data blocks which are duplicated for fault tolerance purposes can be utilized by the


basic DSM protocol, reducing network traffic, and increasing the processor utilization

significantly.

1.3 ORGANIZATION OF THE THESIS

In Chapter 2, the architecture of the SOME-Bus is presented, including the design of the

Transmitter/Receiver, the optical interface, the processor interface and a discussion of the design

complexity and scalability of the SOME-Bus. Chapter 3 presents the details of Distributed

Shared Memory on the SOME-Bus including the major components in each node (processor,

cache controller, directory controller and channel controller) and the supporting input queue

structure and resolver design. In addition, three methods for evaluating the performance of the system (two simulators and a theoretical model) are described in detail. Chapter 4 presents the

details of four protocols which provide fault tolerance on the SOME-Bus DSM. A summary and

conclusions are presented in Chapter 5.


CHAPTER 2: THE SOME-BUS ARCHITECTURE

This chapter presents the architecture of the SOME-Bus including the design of the

Transmitter/Receiver, the optical interface and the processor interface. A discussion of the design

complexity and scalability of the SOME-Bus is provided as well.

2.1 THE SOME-BUS ARCHITECTURE

The Simultaneous Optical Multiprocessor Exchange Bus (SOME-Bus) [22] incorporates

optoelectronic devices into a very high-performance network architecture. It is a low-latency,

high-bandwidth, fiber-optic network that directly connects each processing node to all other

nodes without contention. Its properties distinguish it from other optical networks examined in

the past [14][30][32][35]. One of its key features is that each of N nodes has a dedicated

broadcast channel that can operate at several GBytes/sec, depending on the configuration. In

general, the SOME-Bus contains K fibers, each carrying M wavelengths organized in M/W

channels, where each channel is composed of W wavelengths. The total number of fibers is K =

NW/M. A simple configuration with 128 nodes (N = 128 channels) and W=1 wavelength per

channel would require K=32 fibers with M=4 wavelengths per fiber. Current or foreseeable

technology allows a configuration of 128 nodes (channels) and K=8 fibers, with W=2

wavelengths per channel and M=32 wavelengths per fiber. As discussed later, each fiber can

carry several Gbits/sec, resulting in channels with GBytes/sec bandwidth. Each of N nodes also

has an input channel interface based on an array of N receivers (each with W detectors) that

simultaneously monitors all N channels.

The physical implementation of SOME-Bus is motivated by recent progress in optical

communication, Dense Wavelength Division Multiplexing (DWDM), and optoelectronics. Slant

Bragg gratings [5][10][11][24] are written directly into the fiber core and are used as narrow-

band, inexpensive output couplers. This coupling of the evanescent field allows the traffic to

continue and eliminates the need for regeneration. Figure 2.1 shows the parallel receiver array


and output coupler. The SOME-Bus uses amorphous silicon (a-Si) photodetectors built as

superstructures on the surface of electronic processing devices. Due to the low conductivity of the

a-Si layer, no subsequent patterning is required, and therefore the yield and cost of the receiver is

determined by the yield and cost of the CMOS device itself. Optical power budget analysis of a

system with 128 nodes, 32 wavelengths per fiber and 10 mW of power inserted into the fiber

shows that the worst case for output power, occurring where light from the first node is coupled

out by the receiver at the last node, is 21 nW, which is more than sufficient for present detectors.

Detectors have been demonstrated that have very low dark current, a carrier lifetime of a few

picoseconds, and a dynamic range of more than 4 orders of magnitude under only 5 nW of optical

power [9].

[Figure 2.1: Parallel Receiver Array and Output Coupler — Node 1 and Node 2 attached to the optical fiber, each with micro-mirror, Bragg gratings, detectors, and a receiver array.]

Since the receiver array does not need to perform any routing, its hardware complexity

(including detector, logic, and packet memory storage) is small. This organization eliminates the

need for global arbitration and provides bandwidth that scales directly with the number of nodes

in the system. No node is ever blocked from transmitting by another transmitter or due to

contention for shared switching logic. The ability to support multiple simultaneous broadcasts is

a unique feature of the SOME-Bus that efficiently supports high-speed, distributed barrier


synchronization mechanisms and cache consistency protocols and allows process group

partitioning within the receiver array.

Once the logic level signal is restored from the optical data, it is directed to the input

channel interface, which consists of two parts: the optical interface and the processor interface.

Figure 2.2 shows the optical interface, which includes physical signaling, address filtering, barrier

processing, length monitoring and type decoding. Each receiver generates a data stream that is

examined to detect the start of the packet and the packet header. The header decode circuitry

examines the header field, which includes information on the message type, destination address

(or addresses) and length, to determine whether or not the message is a synchronization message.

If the message is a synchronization message, it is handled by the barrier circuitry, otherwise the

destination address is compared to the set of valid addresses contained in the address decode

circuitry. In addition to recognizing its own individual address, a processor can recognize

multicast group addresses as well as broadcast addresses. Once a valid address has been

identified, the message is placed in a queue. If the address does not match, the message is

ignored.


[Figure 2.2: Optical Interface — blocks for destuffing, header decode, type decode, address decode, message-length monitoring, and buffer control, separating barrier messages from data messages.]

The optical interface shown in Figure 2.2 provides for the handling of two basic types of

messages, a data message and a barrier message. In order to support these message types, at a

minimum, the message header should contain a type field, destination address field and a length

field. It is not necessary to include a source address in the header since the originator of the

message can be determined directly by the input channel queue on which the message was

received. The type field can be used not only to indicate whether the message is a data or barrier

message but can also indicate a particular module within the node to which the message should be

directed. For example, in a DSM multiprocessor, the type field could indicate that the message

should be delivered to either the cache controller or the directory controller (in the case of

Directory-Based Cache coherence). The destination address field can contain either the address

of an individual node or the address of a pre-determined multicast address. Another possibility

for multicast messages would be to place a bit vector (number of bits = number of nodes) in the


header which identifies the individual nodes to be included in the multicast. The advantage of the latter approach is that it is not necessary to set up multicast groups ahead of time. The

disadvantage of the bit vector approach is that for systems with a large number of nodes, the size

of the message header could increase significantly. The Length field is used to detect underflow

or overflow conditions when the message is being received.
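As an illustration only, the following C sketch shows one possible layout of such a header; the field names, widths, and type encoding are assumptions made here and are not taken from the SOME-Bus design.

    #include <stdint.h>

    #define SOMEBUS_NODES 128                 /* assumed system size */

    enum msg_type {                           /* assumed encoding of the type field */
        MSG_BARRIER,                          /* handled by the barrier circuitry */
        MSG_DATA_CACHE,                       /* data message for the cache controller */
        MSG_DATA_DIRECTORY                    /* data message for the directory controller */
    };

    /* Hypothetical header: one destination bit per node avoids pre-configured
     * multicast groups, at the cost of a larger header for large systems. */
    struct msg_header {
        uint8_t  type;                        /* one of enum msg_type */
        uint8_t  dest[SOMEBUS_NODES / 8];     /* multicast destination bit vector */
        uint16_t length;                      /* payload length, for underflow/overflow checks */
    };

    /* A receiver keeps a message only if its own bit is set in the vector. */
    static int accepts(const struct msg_header *h, int my_node)
    {
        return (h->dest[my_node / 8] >> (my_node % 8)) & 1;
    }

With this layout the header grows by one byte for every eight nodes, which is the size concern noted above for large systems.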

Figure 2.3 [17] shows the processor interface which includes a routing network and a

queuing system. One queue is associated with each input channel, allowing messages from any

number of processors to arrive and be buffered simultaneously, until the local processor is ready

to remove them. A resolver circuit receives a request signal (R_i) from each non-empty queue and produces the index of the next queue to be accessed under either the limited or the exhaustive service disciplines. The local processor can force the next queue selection through the P_in input. A

straightforward implementation of the resolver as a selection tree, using logic gates to select the

next queue and multiplexers to forward the corresponding queue index, requires only several

hundred gates organized in log2(N) levels. The time required to select the next queue (polling

walk time) is consequently very small and can be overlapped with the queue access time.

Arbitration may be required only locally in a receiver array when multiple input queues contain

messages.
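The following C fragment is a software analogue of that selection tree, included only to make the log2(N) structure concrete; it implements a fixed priority toward low queue indices rather than the limited or exhaustive polling disciplines described in the text.

    /* Recursively merge pairs of subtrees: each level forwards the index of a
     * requesting queue from either half, so N queues need log2(N) levels,
     * mirroring the gate depth of the hardware tree.  Returns -1 if no queue
     * has a pending request. */
    static int resolve(const int request[], int lo, int n)
    {
        if (n == 1)
            return request[lo] ? lo : -1;
        int left  = resolve(request, lo, n / 2);
        int right = resolve(request, lo + n / 2, n - n / 2);
        return (left >= 0) ? left : right;    /* priority toward lower indices */
    }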


[Figure 2.3: Processor Interface — N receivers (Rec 0 through Rec N-1) feed per-channel address filters and queues; a resolver produces the enables that select which queue drives the shared data bus toward the processor.]

Etched micro-mirrors [25][26] can be used to insert a signal into a fiber. Each node uses

a separate mirror and a separate laser source to insert each wavelength of its channel. It is

possible to integrate the transmitter sources on the same chip with the detectors and the associated

electronic circuits.

2.2 DESIGN COMPLEXITY AND SCALABILITY

The SOME-Bus has much more functionality than traditional crossbar architectures. A

major advantage of this architecture is that, due to the multiple broadcast capability, no node is

ever blocked from transmitting by another transmitter, no arbitration is required and network

bandwidth scales directly with the number of nodes. No communication is ever blocked through

contention for shared switching logic. With N nodes, the diameter of the SOME-Bus is 1, the

time needed for all-to-all communication with distinct messages is O(N) and the time needed for

synchronization is O(1). Unlike a fully-connected point-to-point network, where the number of

transmitters and channels increases as O(N²), the number of transmitters and channels of the SOME-


Bus is O(N), considerably smaller than the number required in other popular architectures, such as the hypercube or the torus. The number of receivers is N², which is larger than the number required in

other architectures. They are arranged so that N receivers are fabricated as amorphous silicon

structures constructed as a thin film directly on the surface of a digital CMOS device, with no

lithography required. Because of the low conductivity of the amorphous silicon layer, no

subsequent patterning is required, and therefore the yield and cost of the receiver is determined by

the yield and cost of the CMOS device itself. Since the receiver does not need to perform any

routing, its hardware complexity (including detector, logic, and packet memory storage) is small, keeping its cost small as well. The full receiver array can be implemented on a single chip even for

large values of N (N > 128). Therefore, the total receiver cost is approximately O(N) instead of

O(N²).

Network scalability, whether based on switches or the SOME-Bus, is a very important

issue. Since achieving “supercomputing” performance requires a large number of processors, and

since the topology of the network becomes critical as the network size increases, it is important to

examine the issue of scalability in the sense of doubling the number of processing nodes, rather

than incrementally increasing the network size.

While it may be easier to incrementally expand a switch-based network by adding

switches and nodes, the resulting hot spots and/or bottlenecks in such a system would cause a

dramatic loss of performance. Only under very strong assumptions of locality is it possible to

incrementally scale up a switch-based network with no loss of performance. If the application

does not exhibit complete locality, then incremental expansion of this network adversely affects

its performance.

The SOME-Bus is by comparison easier to scale than a switch-based network. It relies on

a set of N receivers integrated on a single chip. Since most of the network cost is in the fiber optic

ribbon, a system with fewer than N nodes may be constructed by using a SOME-Bus segment

that accommodates the number of needed nodes (and would leave a portion of the receiver chip


unused). Such a system may scale up to N by optically joining additional SOME-Bus segments.

Once N is reached, it is not possible to add one more node. System expansion is then achieved by

incorporating a second N-receiver chip in each node (and using additional SOME-Bus networks

for the necessary channels). Figure 2.4 shows the expansion of such a system from N to 2N

nodes, which is accomplished by using four SOME-Bus segments to create twice the number of

channels (and each channel is twice as long to accommodate the additional nodes). Since

information flows only in one direction on each SOME-Bus between the two halves of the

network, amplifiers may be easily placed between SOME-Bus segments as Figure 2.4 shows.

[Figure 2.4: Extending the SOME-Bus from N to 2N Nodes — processors P0 through P(N-1) and PN through P(2N-1), each with a queue network, attached to channels 0 through 2N-1 carried on SOME-Bus segments 0 and 1.]


CHAPTER 3: DISTRIBUTED SHARED MEMORY ON THE SOME-BUS

Communication between processors in a distributed system can occur via message

passing or Distributed Shared Memory (DSM). Message passing systems tend to have a high

level of integration between the processor and network. The user specifies the communication

operations through operating system or library calls that perform basic send and receive

operations. The message passing approach to interprocessor communication places a significant

burden on the programmer and introduces a considerable software overhead.

DSM systems can be easier to program than systems that use a message-passing model.

A DSM system hides the mechanism of interprocessor communication by presenting the

programmer with a shared memory model logically implemented in a physically distributed

memory system. The DSM programming model decreases the burden placed on the programmer

and provides a higher-level of portability for applications.

This chapter describes the implementation of Distributed Shared Memory on the SOME-

Bus. The basic DSM operation is described and then details about the Cache and Directory

controllers are provided. Several designs for the Input Queue Structure and Resolver design are

described as well. Finally, three mechanisms used to evaluate the performance of proposed fault

tolerance protocols (Trace-Driven Simulator, Statistical Simulator, and Queuing Network

Theoretical Model) are described in detail.

3.1 BASIC DSM OPERATION ON THE SOME-BUS

A natural implementation of a DSM interconnected by the SOME-Bus architecture is a

multiprocessor consisting of a set of nodes organized as a CC-NUMA system where the shared

virtual address space is distributed across the local memories of the processing nodes. The

SOME-Bus provides an advantage over software-based cache coherence approaches because it


supports a strong integration of the transmitter, receiver and cache controller hardware producing

an effective system-wide hardware-based cache coherence mechanism.

There are two main approaches for maintaining cache coherence: snooping and directory-based techniques. A disadvantage of snooping approaches is that the interconnection network saturates quickly as the number of processors increases. The SOME-Bus

interconnection network will not saturate in this manner since every node has a dedicated output

channel and there is no contention with other nodes when sending messages. The cache

controller, however, can become saturated as the number of nodes in the system increases due to

the increase in cache consistency traffic. When directory-based techniques are used to maintain

cache coherence, it is not necessary for the cache controller in every node to receive all messages

relating to cache consistency, but only the messages that pertain to data blocks resident in that

node’s cache. The Directory controller will multicast Invalidation and Downgrade requests only

to the affected nodes rather than simply broadcasting them. In the SOME-Bus, messages are still

broadcast over the sending node's output channel, but the decision to accept or reject an input

message is performed at the receiver input rather than the cache controller of each remote node.

In order to multicast a message on the SOME-Bus, a list of destinations can be added to the

message header. The receiver logic of a particular node can determine if a message is intended

for that node by examining the header of the incoming message, and reject the message if

necessary by not placing it in the input queue.

The multiprocessor examined in this thesis consists of a set of nodes organized as a CC-

NUMA system interconnected by the SOME-Bus architecture. The shared virtual address space is

distributed across local memories that can be accessed both by the local processor and by

processors from remote nodes, with different access latencies. Each node contains a processor

with cache, memory, an output channel and a receiver that can receive messages simultaneously

on all N channels.


A sequential-consistency model is adopted and statically distributed directories are used

to enforce a Write-Invalidate protocol. Cache blocks can be in the state INVALID, SHARED, or

EXCLUSIVE. Directory entries can be in the state UNOWNED, SHARED, or EXCLUSIVE.

Each directory entry is associated with a bit vector (the copyset) that identifies the processors

with a copy of the data block corresponding to that entry. A node that contains the portion of

global memory and the corresponding directory entry for a particular data block is known as the

Home node for that block.
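A minimal C sketch of these structures follows; the copyset width, the block-to-home mapping, and all names are assumptions used only for illustration, not the thesis's implementation.

    #include <stdint.h>

    #define NODES 128                          /* assumed system size */

    enum cache_state { CACHE_INVALID, CACHE_SHARED, CACHE_EXCLUSIVE };
    enum dir_state   { DIR_UNOWNED,   DIR_SHARED,   DIR_EXCLUSIVE   };

    /* One directory entry per data block, kept at the block's home node. */
    struct dir_entry {
        enum dir_state state;
        uint8_t copyset[NODES / 8];            /* bit i set => node i holds a copy */
    };

    /* Static distribution of blocks to home nodes; the interleaving used
     * here (modulo the node count) is an assumption. */
    static int home_node(uint32_t block_number)
    {
        return (int)(block_number % NODES);
    }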

Multithreaded execution is assumed where each processor executes a program that

contains a set of parallel threads. A thread remains in the RUNNING state until it encounters a

cache miss. If the miss can be serviced locally without coordinating with other nodes through

Invalidation or Downgrade messages, the thread waits for the memory access to complete and

then continues running. If the cache miss requires a message to be sent to the home directory

controller, the thread is blocked and another thread is chosen from the pool of threads that are

ready to run. Each node contains four major components:

1. The processor handles all activities related to the scheduling of the threads.

2. The cache controller fills requests for data from the threads. When it encounters a cache miss

that cannot be serviced locally, the cache controller sends a Data or Ownership request

message to the home node that contains the directory entry for that block. A cache block

chosen for replacement in the cache results in a victim message to the home directory. If the

block was in EXCLUSIVE state, the victim message contains the writeback data. Otherwise

the victim message is sent so the directory controller can remove the node from the copyset.

3. A directory controller maintains the directory information for the portion of main memory

that is located at its node and receives and processes Data and Ownership requests from the

cache controllers. When processing the requests, the directory controller issues any required

Invalidation or Downgrade messages and then collects the acknowledge messages. A


waiting-message queue is used to store requests waiting for acknowledge messages so the

directory can service multiple requests simultaneously.

4. The channel controller receives messages from the cache or directory controllers and delivers

them to the destination node. If the source and destination nodes of the message are different,

the message is considered to be remote and is placed on the output queue associated with the

output channel of the source node. When the channel becomes available, the message is

transmitted and arrives at the input queue of the cache or directory controller at the

destination node. Messages that are broadcast or multicast arrive simultaneously at the

destination input queues. If the source and destination node of a message are the same, the

message is considered to be local and the channel controller places the message directly on

the input queue of the controller (directory or cache) to which the message is directed.

Messages delivered locally in this manner are not placed on the node’s output queue and do

not contribute to the channel utilization of the node.
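The local/remote decision made by the channel controller in item 4 can be summarized by the following sketch; the queue type, the enqueue helper, and the message fields are assumptions, not the thesis's implementation.

    struct queue;                                     /* assumed queue type */
    void enqueue(struct queue *q, void *msg);         /* assumed helper */

    struct dsm_message { int src, dst; int for_directory; /* payload omitted */ };

    /* Remote messages go to the node's output channel; local messages go
     * straight to the local cache or directory input queue and therefore
     * add nothing to the node's channel utilization. */
    static void channel_deliver(struct dsm_message *m,
                                struct queue *output_channel,
                                struct queue *local_cache_in,
                                struct queue *local_dir_in)
    {
        if (m->src != m->dst)
            enqueue(output_channel, m);       /* broadcast when the channel is free */
        else if (m->for_directory)
            enqueue(local_dir_in, m);
        else
            enqueue(local_cache_in, m);
    }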

3.2 ORGANIZATION OF CACHE AND DIRECTORY CONTROLLERS

Incoming messages may be either application-related messages destined for the processor

or coherence messages directed to either the cache or the directory controller. Achieving high

performance depends critically on the way in which messages are removed from the input queues

at the node receiver and delivered to the processor, cache controller or directory controller.

In a typical system architecture, shown in Figure 3.1, the channel interface is connected

to an I/O bus, which is connected to the processor bus through a bridge. Because the I/O bus is

designed to support several I/O devices, its bandwidth is usually relatively small compared to the

main processor-memory bus.

In the SOME-Bus architecture, in order to implement DSM as efficiently as possible, the

channel controller must be integrated as closely as possible with the cache and the directory. This

organization, shown in Figure 3.2, offers the fastest access to data in the cache and the directory


and results in smaller latencies. The flow of typical DSM messages is also shown in the figure.

Message passing can easily be done using this organization as Figure 3.3 shows.

Figure 3.4 shows the major blocks of the channel controller as well as the relating

functionality of the cache and the directory. A processor reference may result in a cache miss and

a subsequent interaction with the directory. If necessary, a request is created by the directory and

is forwarded to the channel controller, which creates a request message and enqueues it in the output channel queue for transmission.

[Figure 3.1: Typical System Architecture — CPU, cache and memory on the processor bus, connected through a bridge to an I/O bus that holds the channel interface.]


[Figure 3.2: Computer System Architecture for DSM and Message Passing — the channel controller (with DMA) and the directory are tightly coupled to the CPU, cache and memory on the processor bus, with DATA-REQ and DATA-ACK messages flowing between cache, directory and channel.]

[Figure 3.3: Computer System Architecture for Message Passing — messages move between the channel controller (with DMA) and memory, across the bridge to the CPU and cache.]


Figure 3.4: SOME-Bus Channel Controller, Cache and Directory

To support message passing, a direct connection is created between the channel controller

and the main memory. Small messages may be read by the processor directly from the selected

queue. Large messages can be transferred from the proper queue into the regular memory in a

cut-through manner by the DMA controller. Figure 3.4 shows DMA channels connecting the

transmitter and receiver to the memory bus.

To support DSM, the channel receiver must pass incoming messages to either the cache

or the directory. As Figure 3.4 shows, incoming Data or Ownership request messages are sent to

the directory, while invalidation requests are sent to the cache. Similarly, incoming invalidation

acknowledgments are collected by the directory and Data-Ack or Ownership-Ack are sent to the

cache. In addition, there is communication between the cache and directory as needed to complete the protocol.
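The receiver-side dispatch just described can be summarized as follows; the enum and the predicate are illustrative assumptions that abbreviate the message names used in the text.

    enum dsm_msg_kind {
        DATA_REQ, OWNERSHIP_REQ,          /* requests from remote cache controllers     */
        INVALIDATE_REQ, DOWNGRADE_REQ,    /* coherence actions on locally cached blocks */
        INVALIDATE_ACK,                   /* acknowledgments collected by the directory */
        DATA_ACK, OWNERSHIP_ACK           /* replies that complete a local cache miss   */
    };

    /* Incoming requests and invalidation acknowledgments go to the directory;
     * invalidations, downgrades and Data/Ownership acknowledgments go to the cache. */
    static int goes_to_directory(enum dsm_msg_kind k)
    {
        return k == DATA_REQ || k == OWNERSHIP_REQ || k == INVALIDATE_ACK;
    }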


3.3 INPUT QUEUE STRUCTURE AND RESOLVER DESIGN

Figure 3.5 shows a simple implementation of a receiver with a single resolver. The

queue structure has a status register with an attribute that indicates whether there are messages targeted for either the cache or the directory. The resolver determines the next queue from which

a message is removed by generating the index of the next queue to be polled among the queues

with the corresponding attribute. Both cache and directory keep requesting the next message as

long as the corresponding status bit indicates that there are such messages.

The message data is placed on a bus shared by the Cache and Directory controller and

then received by the controller to which the message is destined. There are several disadvantages

of this implementation, however, due to contention between the cache and directory controllers.

Assume the cache controller is busy processing a previously received message and the directory

controller is idle and waiting for the next message to process. If the input buffers at the receiver

are implemented as a circular queue, a situation known as the head-of-the-line (HOL) blocking

can occur. In this case, if the first message in all of the queues is a message destined for the

cache controller, HOL blocking occurs when the queues contain messages that could be delivered

to the directory controller immediately but must remain in the queue until the messages directed

to the cache controller are removed. The result of this is that the directory controller remains idle

unnecessarily because it cannot reach into the queue to pull out the required message. In

addition, only one message can be removed from the input channel queues at a time because the

Cache and Directory controller are sharing the same data path.


[Figure 3.5: Single Resolver Implementation — input channel queues 0 through n-1, each with front/rear pointers, data buffers and an attribute bit, feed a common queue data bus; an attribute network and a single resolver produce the binary code of the selected queue and the per-queue request, acknowledge and enable signals.]

One solution to the HOL blocking problem is to organize the input buffers as a linked list

rather than a circular queue. This approach allows messages to be removed from the middle of

the queue thereby avoiding the HOL blocking. Figure 3.6 illustrates an implementation of this

approach. In order to ensure that each controller removes its messages from the input buffers in

the order in which they were received, separate cache and directory linked lists can be

implemented on top of the linked list used to maintain the input channel buffers. Separate pointers,

one for the cache controller and one for the directory controller, are used to obtain the next

message to be removed by each controller. Unlike a FIFO queue structure, the linked list

implementation of the input message queue requires the ability to extract data from any buffer.

The separate buffers are connected to an internal queue data bus. Information from the resolver

and the Attribute network is used to select which buffer will place its data on the Internal Queue

data bus.
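A minimal C sketch of this double chaining follows; the structure and field names are assumptions chosen only to match the description above.

    /* Each buffered message is linked twice: once in arrival order on its input
     * channel, and once on the chain belonging to the controller (cache or
     * directory) that will consume it, so either controller can dequeue its own
     * messages in arrival order without head-of-line blocking the other. */
    struct buffer {
        struct buffer *next_arrival;       /* arrival-order list of this channel   */
        struct buffer *next_same_ctrl;     /* next message for the same controller */
        int            for_directory;      /* 0 = cache controller, 1 = directory  */
        /* message contents omitted */
    };

    struct input_queue {
        struct buffer *front,       *rear;        /* arrival-order list         */
        struct buffer *cache_front, *cache_rear;  /* cache-controller chain     */
        struct buffer *dir_front,   *dir_rear;    /* directory-controller chain */
    };

    static void queue_insert(struct input_queue *q, struct buffer *b)
    {
        b->next_arrival = b->next_same_ctrl = 0;
        if (q->rear) q->rear->next_arrival = b; else q->front = b;
        q->rear = b;
        if (b->for_directory) {
            if (q->dir_rear) q->dir_rear->next_same_ctrl = b; else q->dir_front = b;
            q->dir_rear = b;
        } else {
            if (q->cache_rear) q->cache_rear->next_same_ctrl = b; else q->cache_front = b;
            q->cache_rear = b;
        }
    }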


[Figure 3.6: Separate Cache and Directory Message Chains — each buffer carries a cache (CA) or directory (DIR) attribute; in addition to the arrival-order queue (QueueFront/QueueRear), separate FrontCache/RearCache and FrontDir/RearDir pointers chain the buffers destined for each controller, and the attribute network selects which buffer drives the internal queue data bus.]

To maintain high performance, the receiver can be capable of extracting two messages

from the channel input queues simultaneously. One message is directed to the cache and one to

the directory and a demultiplexer is used to direct the message to one of two data paths, each one

dedicated to one of the controllers. In this case, two resolvers, one per controller, are required to

select the next queue from which a message will be removed and placed on the associated data

path. Figure 3.7 illustrates the dual resolver implementation.

The queue structure has a status register with separate attributes to indicate if there exist

messages targeted for the cache or for the directory. As before, both cache and directory keep

requesting the next message as long as the corresponding status bit indicates that there are such

messages. Each resolver generates the index of the next queue to be polled among the queues


with the corresponding attribute. If the two resolvers decide to read the same queue, one

controller will have to wait or the resolvers must be capable of coordinating so that the same

queue is not selected. If the resolvers are not coordinated and it is possible for both controllers to

request access to the same queue, the cache controller should have priority because it is also

serving the processor (Data-Ack or Ownership-Ack messages) and the amount of time the

processor is stalled waiting for data should be minimized.
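The conflict rule for uncoordinated resolvers can be stated compactly as below; the interface is hypothetical and only illustrates the priority just described.

    /* If both resolvers select the same input queue in a cycle, the cache
     * controller proceeds (it is also serving the processor) and the
     * directory controller waits; -1 denotes "no grant this cycle". */
    static void arbitrate(int cache_pick, int dir_pick,
                          int *cache_grant, int *dir_grant)
    {
        *cache_grant = cache_pick;
        *dir_grant   = (dir_pick >= 0 && dir_pick == cache_pick) ? -1 : dir_pick;
    }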

[Figure 3.7: Dual Resolver Implementation — each input channel queue holds buffers tagged with cache (CA) or directory (DIR) attributes; separate attribute networks, a cache resolver and a directory resolver generate independent request, acknowledge and enable signals, and demultiplexers place messages on a cache common queue bus and a directory common queue bus so both controllers can be served simultaneously.]

3.4 PERFORMANCE ANALYSIS

In the remaining sections of this chapter, three distinct sets of results are presented:

1) The first set of results is obtained by the trace-driven simulator described in Section 3.4.1. The trace-driven simulation provides a detailed model of the processes and memory and the DSM

operation of every node on the SOME-Bus and keeps track of every memory access by each

processor and its effect on individual data blocks.


2) A statistical simulation of a simplified queuing network model of the SOME-Bus is also presented. In this simulation, the DSM operation is modeled in a probabilistic manner: an event may have one or more specific outcomes, each with an associated probability, and when an event occurs, one of the related outcomes is selected according to the associated probabilities (a minimal sketch of this outcome selection follows the list). The statistical simulation is a simplified version of the trace-driven simulation and does not require maintaining the detailed state of the DSM system.

3) The final set of results is provided by a theoretical model that closely approximates the

distribution-driven simulation model, and makes additional simplifying assumptions to allow

a closed-form solution based on traditional queuing-network theory.

The three approaches described above are used so that they may validate each other. One

important result is that a system with such complex behavior can be modeled by relatively simple

and tractable models, which can provide reasonable estimates of processor utilization and

message latencies, using parameters that characterize real applications.

3.4.1 Trace-Driven Simulation

A trace-driven simulator was created in order to examine the performance of Fault

Tolerance and DSM on the SOME-Bus architecture. The parameters of the CC-NUMA

multiprocessor to be simulated are specified as inputs to the simulator. Examples of these

parameters include the number of nodes, the number of threads per node, amount of memory

allocated to each node and the cache structure of each node (cache size, number of cache blocks

and number of bytes per block).

The applications executed by the simulator consist of a sequence of memory references

and a type of access (read or write) for each address. The applications used with the simulator

come from actual multiprocessor address trace files and from artificially generated address

streams. Each thread has its own address sequence and stops running when it has finished

processing all of its memory references. The execution of the application is complete when all

threads have finished processing their respective memory references. The simulated execution of

the application is briefly described below. A more detailed description is provided later in this

chapter.

A memory reference from the currently running thread’s address sequence is examined to

see if the memory reference will hit in the cache. If the address causes a cache hit the thread

remains in the RUNNING state and the next address in the thread’s sequence will be examined on

the next simulator clock cycle.

If the memory reference causes a miss in the cache but can be obtained from the local

memory, the thread waits for the access to complete in the SUSPENDED state. When the access

completes the thread transitions back to the RUNNING state and the next address in the thread’s

sequence will be examined on the next simulator clock cycle.

If the memory reference causes a miss in the cache and the memory access cannot be

filled by the local node, the thread is placed in the BLOCKED state and another thread from the

pool of ready threads is chosen and placed in the RUNNING state. The new thread will begin

running by processing the next address in its sequence on the next simulator clock cycle.
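A condensed sketch of this per-reference decision is shown below, assuming the 64-byte block size used in the simulations. The function name classify_reference and the set/range representation of the cache contents and local memory are illustrative simplifications; in particular, the sketch ignores the coherence state of the block, which (as discussed later in this chapter) can force even a locally homed block to wait for remote coordination.

    from enum import Enum

    class State(Enum):
        RUNNING = 1     # cache hit: the thread keeps issuing references
        SUSPENDED = 2   # local-memory miss: wait for the access, no context switch
        BLOCKED = 3     # remote miss: context switch to another ready thread

    def classify_reference(address, cached_blocks, local_lo, local_hi, block_bytes=64):
        """Return the state the running thread enters for one memory reference.

        cached_blocks: set of block numbers currently held in this node's cache.
        [local_lo, local_hi): byte-address range owned by this node's local memory.
        """
        if address // block_bytes in cached_blocks:
            return State.RUNNING
        if local_lo <= address < local_hi:
            return State.SUSPENDED
        return State.BLOCKED

    if __name__ == "__main__":
        cached = {10, 11}
        print(classify_reference(10 * 64, cached, 0, 4096))     # State.RUNNING
        print(classify_reference(50 * 64, cached, 0, 4096))     # State.SUSPENDED
        print(classify_reference(200_000, cached, 0, 4096))     # State.BLOCKED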

The operation of the processor, directory controller, cache controller and channel

controller is simulated in detail in order to provide information needed to compare the

performance of each component for the DSM system before and after the proposed Fault

Tolerance approaches have been applied.

All simulation results are based on time units equal to one clock cycle. The performance

of the multiprocessor is evaluated in terms of processor and channel utilization, the number of

simulation cycles required to execute an application (address traces or synthetic workload),

average waiting times for the queues, and average round-trip times for Data and Ownership

request messages. In addition, performance characteristics relating to the checkpointing

procedures are also presented. All simulations reported in this paper assume a SOME-Bus system

configuration with 32 nodes (2 threads per node), 64 cache blocks per node, 64 bytes per block

and either 2785 or 6828 memory blocks per node. The amount of memory per node was chosen in

order to satisfy the memory requirements of the applications for which the multiprocessor address

trace files were obtained.

There are two mechanisms for providing the simulator with multiprocessor applications:

address trace files and artificially generated address streams. A set of three multiprocessor

address trace files was obtained from the trace database supported by the Parallel Architecture

Research Laboratory (PARL) at New Mexico State University. Multiprocessor traces for the

programs called speech, simple and weather were obtained from the TraceBase website1. Details

about these three applications are provided in [7]. The weather application models the

atmosphere around the globe using finite difference methods to solve a set of partial differential

equations describing the state of the atmosphere. Speech uses a modified Viterbi search algorithm

to find the best match between paths through a directed graph representing a dictionary and

another through a directed graph representing spoken input in order to provide the lexical

decoding stage for a spoken language understanding system developed at MIT. Simple is an

application that models fluids, using finite difference methods to solve a set of behavior-

describing equations.

Table 3.1 provides a comparison between trace files weather, speech and simple. The

processor utilization is very low for all three traces. The reasons for this can be seen in Figure

3.8, which provides the distribution of the memory references by showing the number of

references that were directed to each node's local memory space for each of the three applications.

It can be seen from this figure that all three of the applications contain hotspots. An analysis of

the trace address references indicated that in both the weather and simple applications, there is a

very large number of references directed to a single data byte. The cause of this behavior is a

barrier synchronization that occurs in both of the applications. Repeated accesses are made to a

1 http://tracebase.nmsu.edu/tracebase.html

barrier flag as processors spin-lock on it. In order to reduce the hot spots created by the

synchronization variable, the simulator code was modified in order to redirect accesses to these

variables to another node that was the target for no other memory references.

Hot spots were also observed in the Speech application trace, but for different reasons.

The hot spots were due to the frequent accesses to the data structures comprising the two

dictionaries used in the application. Instead of a single data block being continuously accessed,

the entire dictionary structure is scanned repeatedly. In this case, the hot spot can be avoided by

dividing the dictionary data structure into smaller units that can be distributed among the nodes.

Figure 3.9 provides the resulting distribution of the address references, after the highly accessed

data blocks were relocated as described above and Table 3.2 shows the corresponding

performance characteristics. The barrier synchronization prevents the processor utilization in the weather and simple applications from improving significantly after the reorganization of the highly accessed data blocks. The processor utilization in speech was raised by 3% but is still comparatively low, although the total execution time was shortened by 80%. The total execution times of the weather and simple applications were lowered by 8% and 10%, respectively.

Table 3.1: Comparison of Multiprocessor Trace Files

weather speech simp

Address References per Node 496,313 183,932 422,345

Simulation Time 7,129,800 76,281,918 14,077,177

Processor Utilization 11.38% 0.64% 6.16%

Channel Utilization 31.84% 6.58% 20.28%

Avg Time Msg Spends in Output Q 103.046 575.627 222.37

Cache Read Ratio 90.34% 78.24% 91.80%

Cache Miss Ratio 2.46% 16.33% 3.83%

Cache Remote Miss Ratio 2.97% 18.03% 4.06%


Figure 3.8: Distribution of Address References


Figure 3.9: Distribution of Address References, after Relocation

Table 3.2: Simulation Data, after Relocation

weather speech simp

Address References per Node 496,313 183,932 422,345

Simulation Time 6,540,632 14,678,029 12,691,795

Processor Utilization 11.59% 3.56% 7.53%

Channel Utilization 32.32% 41.37% 26.62%

Avg Time Msg Spends in Output Q 95.122 75.713 196.353

Cache Read Ratio 90.34% 78.25% 91.80%

Cache Miss Ratio 2.39% 16.22% 3.89%

Cache Remote Miss Ratio 2.89% 18.28% 4.11%

Characteristics specific to certain applications such as those just described can have a

very large impact on the performance of a system on which the application is being simulated.

As can be seen in Tables 3.1 and 3.2, applications with similar basic characteristics such as

read/write ratio and miss ratio can provide very different performance due to other contributing

factors. The behavior of multiprocessor applications is a complex interaction of read and write

memory operations, cache miss rate, memory access patterns (temporal and spatial locality) and

interprocessor sharing of data. There is no such thing as a "typical" application and therefore

system evaluation must include a wide range of different types of applications in order to assure

satisfactory performance under most conditions. The limited availability of multiprocessor traces

that cover a wide range of application behaviors is a major disadvantage of this approach.

When using simulation as a tool for evaluating new architectures or approaches, the use

of synthetically generated workloads can provide a much larger degree of flexibility than actual

traces. For this reason, there is a need for effective synthetic workload generation that can

provide reasonable coverage over a large range of application behaviors.

The behavior of real applications can be approximated by adjusting input parameters used

for the address generation scheme. Differences between the application behavior for synthetic

workloads and the address trace files can be attributed to temporal variations in the applications,

such as bursts of accesses to memory locations on the same node, to asymmetrical memory

reference patterns (some nodes get a much higher or much lower percentage of the memory

references than other nodes), or to the pattern of alternating read and write operations to a

particular memory block. These types of differences can also be found in real world applications.

Artificially generated workloads also provide an opportunity to experiment with other system

parameters such as the number of threads or number of nodes for which the address trace files

have fixed values.

For synthetic workloads used in our simulations, the memory access behavior of the

application can be specified by associating probabilities with a set of options for the next address

to be accessed. The set of available options for our synthetic workload generator was chosen to

show the differences in the proposed fault tolerance protocols described in the next chapter. The

application characteristics that are to be compared are the cache hit ratio, the degree of locality

and the degree of data sharing between the nodes.

In order to study the behavior of the system under varying degrees of loads and

application behaviors, it is necessary to characterize the traffic in the network and through the

processing nodes, in terms of the address patterns caused by the application. The miss rate (and

especially the remote miss rate) is not sufficient to characterize the application behavior. In the

analysis described here, several parameters are used to indicate the degree of locality and sharing

of data between threads at different nodes.

When a node is ready to generate a new memory reference, that memory reference can be

an address that is currently in the node’s cache, an address that falls within the node’s local

memory, or an address that falls within another node’s memory. In addition, the memory

reference could specify a block that is currently being accessed by or has recently been accessed

by another node. Selection of proper probability values associated with each of the behaviors

described above is used to control the degree of spatial locality for the memory references and the amount of interaction between threads. Specifically, the following parameters are used:

1. Parameter L indicates the probability that the reference is to the node's local memory

2. Parameter N indicates the probability that the reference is to a designated neighboring

node’s memory2.

3. Parameter S indicates the probability that the reference belongs to any remote node’s

memory space and has been accessed by another thread in the system in the past.

4. The probability that the reference is to one of the blocks already in the cache is 1 - (L+N+S).
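A minimal sketch of how such a generator might choose the target region of the next reference from L, N and S is shown below. The function name and the four-way classification are illustrative only, and the actual workload generator must additionally pick a concrete block address within the chosen region.

    import random

    def next_reference_target(L, N, S, rng):
        """Pick the target region of the next synthetic reference.

        L, N and S are the probabilities defined in items 1-3 above; with
        probability 1 - (L + N + S) the reference falls on a block that is
        already in the cache (item 4).
        """
        r = rng.random()
        if r < L:
            return "local"              # the node's own memory
        if r < L + N:
            return "neighbor"           # the designated neighboring node's memory
        if r < L + N + S:
            return "shared-remote"      # a remote block already touched by another thread
        return "cached"                 # one of the blocks already in the cache

    if __name__ == "__main__":
        rng = random.Random(42)
        counts = {"local": 0, "neighbor": 0, "shared-remote": 0, "cached": 0}
        for _ in range(100_000):
            # Roughly the N10 case listed below: L = 0.10, S = 0.02, with N = 0.05 here.
            counts[next_reference_target(0.10, 0.05, 0.02, rng)] += 1
        print(counts)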

Applications of interest can be described by the following cases:

1. Threads tend to access arbitrary blocks at a designated neighboring node and the current

node. This case indicates a smaller degree of locality within the current node and a

smaller degree of sharing between threads. The parameters of interest are L=0.01, S=0.01

and N=[0.01,...,0.10]. (Case N01).

2. There is little sharing between threads. Within the current node there is a larger degree of

locality. The parameters of interest are L=0.10, S=0.02 and N=[0.01,...,0.10]. (Case N10).

3. Threads tend to access blocks at any remote node that have already been accessed by

another thread. This case indicates a smaller degree of locality within the current node

but a larger degree of sharing between threads. The parameters of interest are L=0.01,

N=0.01 and S=[0.01,...,0.10]. (Case S01).

4. Threads tend to access blocks at any remote node that have already been accessed by

another thread. Within the current node there is a larger degree of locality. The

parameters of interest are L=0.10, N=0.02 and S=[0.01,...,0.10]. (Case S10).

2 Although there are no neighbors in the SOME-Bus, a node X may be neighboring some other Node Y in a functional sense. As discussed in following chapters, Node Y is a neighbor to Node X if Node Y contains a copy of Node X's recovery memory.

Figures 3.10 and 3.11 show the processor and channel utilization for these four types of

application behavior. The horizontal axis is the remote cache miss ratio, i.e. the percentage of

cache misses that result in a request to a remote node. The figures show the effect of placing

emphasis on each one of the parameters. The S01 and S10 application cases have a larger channel

utilization and a slightly lower processor utilization than the N01 and N10 application cases, causing an increase in the job queue waiting time. The reason for this is that additional messages

must be sent in a DSM system upon a write request to a block that is shared by other nodes. In

addition to the Data or Ownership request message and corresponding acknowledge message,

invalidation and invalidation-acknowledge messages would be required. If the requested block

were in the Exclusive state, a writeback message would also be required. When there is no

sharing of data, the message exchange is limited to a Data or Ownership request and the

corresponding acknowledge.


Figure 3.10: Processor Utilization and Channel Utilization: Low Level of Locality


The Case S01 and S10 experiments represent the more conventional behavior of a program. Cases N01 and N10 are more specialized because they reflect direct interaction between two nodes. The N10 case shows more local activity, which results in lower channel utilization than N01. The differences between Cases S01 and S10 are due to the probability that the memory reference can be found in the requesting node's local memory. When the data can be read from the node's local memory, no external messages are generated; this can be seen in the lower channel utilization of Case S10.


Figure 3.11: Processor Utilization and Channel Utilization: High Level of Locality

A memory reference may be processed locally (without sending messages to another

node) depending upon the type of request (Data or Ownership), the state of the block (SHARED

or EXCLUSIVE), and if the block is SHARED, on how many nodes are sharing at the time the

request for the block is made.


3.4.2 Decision Tree

A Decision Tree provides all possible sequences of events that can occur during the

processing of a memory reference for any combination of access type (read or write), location of

the memory block (local or remote) and state of the memory block (UNOWNED, SHARED,

EXCLUSIVE). The tree provides a mechanism for comparing the behavior of different

applications in terms of the network traffic generated as a result of the processing of the address

references.

Each path in the tree provides a weighted cost in terms of network traffic for all messages

generated along that path. Identifying the paths that are more frequently taken by different types

of applications allows the different fault tolerance algorithms discussed in the following chapters

to be optimized in order to reduce the additional network traffic due to fault tolerance.

As mentioned previously, each thread has associated with it a sequence of memory

references. These references can be predetermined (and read from a file) as in the case of the

multiprocessor traces, or can be generated at runtime using probabilistic values for specified

behavior (cache hit, locality, sharing of data between nodes).

When the thread begins to process the next memory reference, it determines whether the

memory access is a read or write operation. If the access type is a read operation, the branch of

the tree that will be traversed while processing the memory reference is shown in Figure 3.12.

This section of the tree can be exited via six separate paths identified by the letters A-F. The

particular branch taken at each fork in the tree will affect the size and number of messages that

will appear on the SOME-Bus channels as part of the processing of this memory reference.


Figure 3.12: Decision Tree Branch for Read Accesses

There are three major possibilities for the number and type of messages generated during

the progression through the tree. One possibility is that no messages are generated at all. If the

reference causes a cache hit, the thread will remain in the RUNNING state and will process a new

memory reference on the next clock cycle. The second possibility occurs when the reference is a

cache miss, but can be serviced from the local memory on the node without coordinating with

other nodes through Invalidation or Downgrade messages. In this case, the thread is suspended

during the reading/writing of the local memory and then returns to the RUNNING state and will

process a new memory reference on the next clock cycle. A context switch does not occur in this

situation. The third possibility occurs when the memory access causes a cache miss and the state

of the data block requires either Invalidation or Downgrade messages to be sent to nodes with a

copy of the data block. In this case a message must be sent to the home directory for the data

block. If the home node for the data block is the same node that is requesting the data, the

message will be local (it does not appear on the output queue or channel); otherwise the message is remote and will be placed on the sending node's output queue and then eventually transmitted across the node's output channel.

The size and number of messages that are generated along a particular tree path

determines the network cost associated with that path. The network cost of sending a message is

found by multiplying the size of the message in bytes by the number of clock cycles required to

transmit a single byte. Local messages do not contribute to the network cost. There are two types

of messages in terms of message size. Messages that contain a header and no data payload, such

as request or Invalidation messages have a message size of SH. Messages that have both a header

and a data payload are of size SD. In general, SD = SH + number of bytes in a cache block. The

network cost associated with each of the paths A-F is provided in Figure 3.12.

The tree branch shown in Figure 3.12 contains six possible paths for the processing of the

memory reference depending on the state of the data block and the location of the home node for

the data block. Tree Path A is taken if Node A is processing a read request for an address that is

found in the cache in either SHARED or EXCLUSIVE state. This path ends with a cache hit and

no messages are generated. For this reason, the network traffic cost of Path A is 0.

If a thread from Node A begins to process a read request for an address that misses in the

cache and Node A is the Home node for the memory block and the block is currently in the

EXCLUSIVE state (owned by Node C), Tree Path B is taken. The messages generated along this

path include a local Data request message to the directory controller of Node A (network cost =

0), a remote Downgrade request message sent to the cache controller on Node C (network cost =

SH), a remote DowngradeWB-Ack (containing the data block’s current value) sent from Node C

to the directory controller on Node A (network cost = SD), and a local Data-Ack message

containing the data block which is sent to the cache controller on Node A (network cost = 0).

Since local messages do not appear in the output queue or on the channel, the total network cost

of this path is SH + SD for the Downgrade request and DowngradeWB-Ack messages.

In Tree Paths C and D, the memory reference misses in the cache, but is found in

the SHARED or UNOWNED state in the memory and therefore no Downgrade request messages

are required. In Path C, the data is found in the local memory of Node A and therefore no

messages are exchanged between the cache and directory controller. The network cost of Path C

is 0. In Path D, the home for the data block is not Node A so the messages generated along this

path include a remote Data Request message and a remote Data-Ack message, both of which

contribute to the network cost of this path.

Tree Paths E and F are similar to Tree Path B except that the home node for the data block is not Node A. In Path E, a remote Data request message, a local Downgrade message, a local DowngradeWB message and a remote Data-Ack message are generated. Only the remote Data

request and Data-Ack contribute to the network cost. The Downgrade and DowngradeWB

messages are local in this case because the cache in the home node of the data block is the current

owner of the data block. This results in a network cost of SH + SD for Path E. In Path F, the

current owner is Node C and all message generated along this path (Data request, Downgrade,

DowngradeWB, Data-Ack) are remote messages that contribute to the network cost of this path,

resulting in a total network cost of 2 * (SH + SD).
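The read-path costs discussed above can be collected into a small lookup, as sketched below. The helper name read_path_cost is illustrative and not taken from the simulator; the default values SH = 10 and SD = 74 correspond to a 10-byte header and a 64-byte block, the values used with the tables later in this chapter.

    def read_path_cost(path, SH=10, SD=74):
        """Network cost charged to the channels when a read reference follows a path A-F."""
        costs = {
            "A": 0,                # cache hit: no messages
            "B": SH + SD,          # home is local, block EXCLUSIVE at a remote node:
                                   #   remote Downgrade request + remote DowngradeWB-Ack
            "C": 0,                # miss served by local memory, block not modified
            "D": SH + SD,          # remote home, block not modified:
                                   #   remote Data request + remote Data-Ack
            "E": SH + SD,          # remote home itself owns the block: Downgrade
                                   #   messages stay local, only request + Data-Ack count
            "F": 2 * (SH + SD),    # remote home, block owned by a third node: request,
                                   #   Downgrade, DowngradeWB and Data-Ack are all remote
        }
        return costs[path]

    if __name__ == "__main__":
        for p in "ABCDEF":
            print(p, read_path_cost(p))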

In Table 3.3, the average number of times a particular path A-F is taken by a node is

provided for the Application Cases N01, S01, N10, and S10 described above. Columns A-F show

the number of memory references that were processed on a particular path and the column

entitled Total References shows the number of memory references executed by each node in the

application after the first 1000 clock cycles. The first 1000 clock cycles are ignored in each

experiment in order to avoid the startup transient behavior.

The previous discussion dealt with the portion of the Decision Tree dedicated to read

accesses. The write access portion of the tree has many more paths and is presented here in two

parts. The first part is shown in Figure 3.13 and contains the paths for write accesses that hit in

the cache. The second part is shown in Figure 3.14 and contains the paths for write accesses for

blocks that are not in the cache.

Table 3.3: Decision Tree Path Distribution for Read Accesses

Total References  Application  A  B  C  D  E  F
19629.126  Case N01: N = .01  17148.250  0.313  180.469  315.125  0.031  11.719
19719.253  Case N01: N = .02  17044.469  0.281  176.438  505.000  0.063  9.281
19835.565  Case N01: N = .05  16595.813  0.313  179.781  1056.281  0.000  6.063
19899.345  Case N01: N = .10  15777.063  0.313  181.875  1945.281  0.000  4.594
19719.253  Case S01: S = .01  17044.469  0.281  176.438  505.000  0.063  9.281
19796.501  Case S01: S = .02  16950.406  0.250  178.594  663.844  0.094  16.438
19888.283  Case S01: S = .05  16479.469  0.313  177.594  1195.875  0.281  36.844
19930.939  Case S01: S = .10  15623.375  0.344  182.563  2059.156  1.250  74.406
19726.471  Case N10: N = .01  15474.188  1.906  1782.875  471.969  0.094  19.500
19769.784  Case N10: N = .02  15326.000  2.844  1781.813  669.375  0.000  17.875
19852.627  Case N10: N = .05  14830.750  3.156  1781.844  1225.531  0.125  10.406
19915.096  Case N10: N = .10  14005.156  2.875  1790.750  2121.344  0.031  7.875
19722.22  Case S10: S = .01  15458.969  3.219  1756.125  511.375  0.031  9.656
19769.784  Case S10: S = .02  15326.000  2.844  1781.813  669.375  0.000  17.875
19853.876  Case S10: S = .05  14857.875  2.719  1795.063  1196.781  0.125  24.000
19938.909  Case S10: S = .10  13986.688  2.750  1804.688  2121.875  1.594  33.375
992166.6563  weather  959301.688  9.406  1326.375  8937.500  1.094  618.156
367841.0313  speech  228797.313  305.594  2375.219  27607.594  2.344  9925.156
844409.4063  simple  803736.125  24.969  1934.656  21050.219  1.219  526.219

Figure 3.13 contains 5 paths, H-L. Path H is taken when the data block is found in the

cache in EXCLUSIVE state and results in a cache hit with a network cost of 0. Paths I and J are

taken when the block was found in the cache, but it was in the SHARED state, requiring an

upgrade to the EXCLUSIVE state. In this case, only the Ownership for the block is desired; it is not necessary to provide the cache with the data for the block since the block is already contained in the cache. In Path I, Node A has the only copy of the block and is also the home for the data

block. The messages generated on this path include a local Upgrade request and a local Upgrade-

Ack. The network cost of Path I is 0.

Path J is similar to Path I except that more than one copy of the data block exists in the

caches of other nodes. This requires an Invalidation message to be multicast to the sharing nodes

and the corresponding Invalidation-Ack messages to be collected. The messages generated along

this path include a local Upgrade request, a single Invalidation message multicast, multiple

remote Invalidation-Ack messages, and a local Upgrade-Ack. The network cost of this path is

related to the invalidation of the copies of the data block. There can be multiple Invalidation-Ack

messages that result from a single Invalidation request for a shared data block. The parameter

Ninv shown in Figure 3.13 is the average number of nodes found sharing a data block when an

Invalidation request is made for a block in the SHARED state. If the data block is in the state

EXCLUSIVE, there can only be one owner and therefore only one Invalidation-WB message. If

there are Ninv sharers of the data block, there will be Ninv Invalidation-Ack messages. The

difference between an Upgrade-Ack message and an Ownership-Ack message is that the latter contains the data block (network cost of SD) and the Upgrade-Ack does not (network cost of SH).

Paths K and L are similar to paths I and J except that Node A is not the home node for the

data block. The messages generated along path K include a remote Upgrade request and a remote

Upgrade-Ack. The messages generated along path L include a remote Upgrade request, a remote

Invalidation message, multiple remote Invalidation-Ack messages and a remote Upgrade-Ack.
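The corresponding write-hit costs depend on Ninv, the average number of sharers that must be invalidated. A small illustrative lookup (again with a hypothetical helper name) summarizes the message lists just described; only header-sized messages are involved because the requesting cache already holds the data.

    def write_hit_path_cost(path, Ninv, SH=10):
        """Network cost of the write-hit paths H-L described above."""
        costs = {
            "H": 0,                      # block already EXCLUSIVE in the cache: no messages
            "I": 0,                      # local upgrade, Node A is the only sharer
            "J": (1 + Ninv) * SH,        # local upgrade: Invalidation multicast + Ninv Acks
            "K": 2 * SH,                 # remote Upgrade request + remote Upgrade-Ack
            "L": (3 + Ninv) * SH,        # remote upgrade: request, Invalidation,
                                         #   Ninv Invalidation-Acks, Upgrade-Ack
        }
        return costs[path]

    if __name__ == "__main__":
        print(write_hit_path_cost("L", Ninv=4))   # (3 + 4) * 10 = 70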


Figure 3.13: Decision Tree Path Distribution for Write Accesses to Blocks in Cache


Figure 3.14: Decision Tree Path Distribution for Write Accesses to Blocks not in Cache

Figure 3.14 contains 8 paths, M-T. Paths M-P are taken when Node A is not the home

node for the data block. In paths M and N the block is found to be in EXCLUSIVE state. Path M

generates a remote Ownership request, a local Invalidation message, a single local

InvalidationWB-Ack message, and a remote Ownership-Ack message. The network cost is due to

the Ownership request and Ownership-Ack messages. In Path N, the owner of the block is not the

home node and therefore all messages generated along this path (Ownership request, Invalidation,

InvalidationWB, and Ownership-Ack) are remote messages and contribute to the network cost of

this path.

In Path O, the data block is found to be UNOWNED so the only messages are the remote

Ownership request and Ownership-Ack. In Path P, the data block is shared by caches on nodes

other than Node A. This path generates a remote Ownership request, a single remote Invalidation

message multicast, multiple remote Invalidation-Ack messages, and a remote Ownership-Ack.

Paths Q, R, and S are similar to Paths P, N, and O, respectively, except that the Ownership request and Ownership-Ack messages are local and do not contribute to the network cost of the path. In the case of Path S, the data block is UNOWNED, so no Invalidation messages are necessary, and because the home for the data block is local, no Ownership request message needs to be sent; no messages are generated at all on this path, resulting in a network cost of 0.

Tables 3.4 and 3.5 provide the average number of times a node takes a particular path H-S for the Application Cases N01, S01, N10, and S10. Columns H-S show the number of memory references that were processed on a particular path, and the column entitled Total References shows the number of memory references executed by each node in the application after the first 1000 clock cycles.

The net cost for each path in the system is shown in Tables 3.6, 3.7 and 3.8. The total

network cost is obtained by multiplying the number of times that each path was taken by the

network cost associated with taking that path. The values used in the tables are SH = 10 and SD = 74. The column Ninv contains the average number of nodes containing a copy of the data block when

an Invalidation message was sent for a block in the SHARED state.
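As a worked example of this bookkeeping, the sketch below recomputes the Total_A entry of Table 3.6 for Case N01, N = .01 from the corresponding path counts in Table 3.3 and the read-path costs given earlier. Only the read paths are included here; the Traffic figure in Table 3.9 also adds the write-path and victim-message totals and divides by the simulation time.

    # Path costs for read accesses with SH = 10 and SD = 74.
    SH, SD = 10, 74
    path_cost = {"A": 0, "B": SH + SD, "C": 0, "D": SH + SD,
                 "E": SH + SD, "F": 2 * (SH + SD)}

    # Average number of references per node on each read path (Table 3.3, Case N01, N = .01).
    path_count = {"A": 17148.250, "B": 0.313, "C": 180.469,
                  "D": 315.125, "E": 0.031, "F": 11.719}

    total_a = sum(path_count[p] * path_cost[p] for p in path_cost)
    print(round(total_a, 3))   # 28468.188, the Total_A entry for this case in Table 3.6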

Table 3.4: Decision Tree Path Distribution for Write Hit Accesses

Total References  Application  H  I  J  K  L
19629.126  Case N01: N = .01  1912.313  0.031  0.000  0.156  3.625
19719.253  Case N01: N = .02  1904.938  0.000  0.000  0.313  2.094
19835.565  Case N01: N = .05  1859.875  0.031  0.000  0.719  0.688
19899.345  Case N01: N = .10  1749.563  0.000  0.000  1.250  0.656
19719.253  Case S01: S = .01  1904.938  0.000  0.000  0.313  2.094
19796.501  Case S01: S = .02  1888.125  0.000  0.000  0.313  3.031
19888.283  Case S01: S = .05  1835.625  0.031  0.000  0.438  2.438
19930.939  Case S01: S = .10  1731.188  0.063  0.000  1.469  0.688
19726.471  Case N10: N = .01  1720.406  0.188  0.000  0.063  5.813
19769.784  Case N10: N = .02  1695.313  0.188  0.000  0.281  3.375
19852.627  Case N10: N = .05  1663.344  0.000  0.000  0.719  1.063
19915.096  Case N10: N = .10  1551.594  0.063  0.000  1.281  0.375
19722.22  Case S10: S = .01  1723.500  0.000  0.000  0.219  1.844
19769.784  Case S10: S = .02  1695.313  0.188  0.000  0.281  3.375
19853.876  Case S10: S = .05  1638.281  0.156  0.000  0.156  3.688
19938.909  Case S10: S = .10  1552.250  0.031  0.000  0.625  0.781
992166.6563  weather  2767.219  68.688  8.094  2941.406  553.031
367841.0313  speech  69303.938  42.156  3.000  273.969  9739.688
844409.4063  simple  3864.375  78.688  12.000  3219.063  233.063

Table 3.5: Decision Tree Path Distribution for Write Miss Accesses

Total References  Application  M  N  O  P  Q  R  S
19629.126  Case N01: N = .01  0.031  1.281  28.500  8.000  0.125  0.094  19.063
19719.253  Case N01: N = .02  0.000  0.844  52.094  5.250  0.000  0.000  18.188
19835.565  Case N01: N = .05  0.000  0.813  112.500  3.125  0.125  0.094  19.344
19899.345  Case N01: N = .10  0.000  0.594  215.906  2.594  0.031  0.031  19.594
19719.253  Case S01: S = .01  0.000  0.844  52.094  5.250  0.000  0.000  18.188
19796.501  Case S01: S = .02  0.031  2.188  63.969  9.781  0.125  0.031  19.281
19888.283  Case S01: S = .05  0.031  4.125  113.250  21.844  0.094  0.031  20.000
19930.939  Case S01: S = .10  0.281  7.906  181.656  46.219  0.094  0.031  20.250
19726.471  Case N10: N = .01  0.000  1.938  40.000  10.938  1.156  0.281  195.156
19769.784  Case N10: N = .02  0.000  1.594  63.313  10.250  1.281  0.438  195.844
19852.627  Case N10: N = .05  0.031  1.156  132.844  5.063  1.594  0.438  194.563
19915.096  Case N10: N = .10  0.000  0.563  233.063  3.094  1.531  0.438  195.063
19722.22  Case S10: S = .01  0.000  0.750  52.219  4.438  1.656  0.188  198.031
19769.784  Case S10: S = .02  0.000  1.594  63.313  10.250  1.281  0.438  195.844
19853.876  Case S10: S = .05  0.000  2.844  119.906  13.031  1.469  0.219  197.563
19938.909  Case S10: S = .10  0.188  3.594  210.594  20.188  1.375  0.469  197.844
992166.6563  weather  0.750  416.656  3567.906  14.844  0.500  6.281  150.406
367841.0313  speech  1.844  49.313  376.938  44.313  0.688  1.719  156.406
844409.4063  simple  1.500  327.875  4845.094  125.531  7.406  5.781  200.969

Table 3.6: Net Cost for Paths A Through F

Application  Ninv  A  B  C  D  E  F  Total_A
Case N01: N = .01  6.43883  0.000  26.292  0.000  26470.500  2.604  1968.792  28468.188
Case N01: N = .02  6.285106  0.000  23.604  0.000  42420.000  5.292  1559.208  44008.104
Case N01: N = .05  4.97619  0.000  26.292  0.000  88727.604  0.000  1018.584  89772.480
Case N01: N = .10  4.6  0.000  26.292  0.000  163403.604  0.000  771.792  164201.688
Case S01: S = .01  6.285106  0.000  23.604  0.000  42420.000  5.292  1559.208  44008.104
Case S01: S = .02  6.130435  0.000  21.000  0.000  55762.896  7.896  2761.584  58553.376
Case S01: S = .05  3.219231  0.000  26.292  0.000  100453.500  23.604  6189.792  106693.188
Case S01: S = .10  1.421543  0.000  28.896  0.000  172969.104  105.000  12500.208  185603.208
Case N10: N = .01  6.947644  0.000  160.104  0.000  39645.396  7.896  3276.000  43089.396
Case N10: N = .02  6.480084  0.000  238.896  0.000  56227.500  0.000  3003.000  59469.396
Case N10: N = .05  4.044534  0.000  265.104  0.000  102944.604  10.500  1748.208  104968.416
Case N10: N = .10  2.91875  0.000  241.500  0.000  178192.896  2.604  1323.000  179760.000
Case S10: S = .01  5.897638  0.000  270.396  0.000  42955.500  2.604  1622.208  44850.708
Case S10: S = .02  6.480084  0.000  238.896  0.000  56227.500  0.000  3003.000  59469.396
Case S10: S = .05  5.637457  0.000  228.396  0.000  100529.604  10.500  4032.000  104800.500
Case S10: S = .10  1.476923  0.000  231.000  0.000  178237.500  133.896  5607.000  184209.396
weather  1.026346  0.000  790.104  0.000  750750.000  91.896  103850.208  855482.208
speech  1.62919  0.000  25669.896  0.000  2319037.896  196.896  1667426.208  4012330.896
simple  1.095569  0.000  2097.396  0.000  1768218.396  102.396  88404.792  1858822.980

Table 3.7: Net Cost for Paths H Through L

Application  Ninv  H  I  J  K  L  Total_B
Case N01: N = .01  6.43883  0.000  0.000  0.000  3.120  342.158  351.716
Case N01: N = .02  6.285106  0.000  0.000  0.000  6.260  194.430  206.975
Case N01: N = .05  4.97619  0.000  0.000  0.000  14.380  54.876  74.232
Case N01: N = .10  4.6  0.000  0.000  0.000  25.000  49.856  79.456
Case S01: S = .01  6.285106  0.000  0.000  0.000  6.260  194.430  206.975
Case S01: S = .02  6.130435  0.000  0.000  0.000  6.260  276.743  289.134
Case S01: S = .05  3.219231  0.000  0.000  0.000  8.760  151.625  163.604
Case S01: S = .10  1.421543  0.000  0.000  0.000  29.380  30.420  61.222
Case N10: N = .01  6.947644  0.000  0.000  0.000  1.260  578.257  586.464
Case N10: N = .02  6.480084  0.000  0.000  0.000  5.620  319.953  332.053
Case N10: N = .05  4.044534  0.000  0.000  0.000  14.380  74.883  93.308
Case N10: N = .10  2.91875  0.000  0.000  0.000  25.620  22.195  50.734
Case S10: S = .01  5.897638  0.000  0.000  0.000  4.380  164.072  174.350
Case S10: S = .02  6.480084  0.000  0.000  0.000  5.620  319.953  332.053
Case S10: S = .05  5.637457  0.000  0.000  0.000  3.120  318.549  327.307
Case S10: S = .10  1.476923  0.000  0.000  0.000  12.500  34.965  48.942
weather  1.026346  0.000  0.000  164.012  58828.120  22266.942  81260.100
speech  1.62919  0.000  0.000  78.876  5479.380  450868.663  456428.548
simple  1.095569  0.000  0.000  251.468  64381.260  9545.256  74179.080

Table 3.8: Net Cost for Paths M Through S

Application  Ninv  M  N  O  P  Q  R  S  Total_C
Case N01: N = .01  6.43883  2.604  215.208  2394.000  7472.000  106.250  7.896  0.000  10195.354
Case N01: N = .02  6.285106  0.000  141.792  4375.896  4903.500  0.000  0.000  0.000  9421.188
Case N01: N = .05  4.97619  0.000  136.584  9450.000  2918.750  106.250  7.896  0.000  12619.480
Case N01: N = .10  4.6  0.000  99.792  18136.104  2422.796  26.350  2.604  0.000  20687.646
Case S01: S = .01  6.285106  0.000  141.792  4375.896  4903.500  0.000  0.000  0.000  9421.188
Case S01: S = .02  6.130435  2.604  367.584  5373.396  9135.454  106.250  2.604  0.000  14985.288
Case S01: S = .05  3.219231  2.604  693.000  9513.000  20402.296  79.900  2.604  0.000  30690.800
Case S01: S = .10  1.421543  23.604  1328.208  15259.104  43168.546  79.900  2.604  0.000  59838.362
Case N10: N = .01  6.947644  0.000  325.584  3360.000  10216.092  982.600  23.604  0.000  14907.880
Case N10: N = .02  6.480084  0.000  267.792  5318.292  9573.500  1088.850  36.792  0.000  16285.226
Case N10: N = .05  4.044534  2.604  194.208  11158.896  4728.842  1354.900  36.792  0.000  17473.638
Case N10: N = .10  2.91875  0.000  94.584  19577.292  2889.796  1301.350  36.792  0.000  23899.814
Case S10: S = .01  5.897638  0.000  126.000  4386.396  4145.092  1407.600  15.792  0.000  10080.880
Case S10: S = .02  6.480084  0.000  267.792  5318.292  9573.500  1088.850  36.792  0.000  16285.226
Case S10: S = .05  5.637457  0.000  477.792  10072.104  12170.954  1248.650  18.396  0.000  23987.896
Case S10: S = .10  1.476923  15.792  603.792  17689.896  18855.592  1168.750  39.396  0.000  38357.426
weather  1.026346  63.000  69998.208  299704.104  13864.296  425.000  527.604  0.000  384582.212
speech  1.62919  154.896  8284.584  31662.792  41388.342  584.800  144.396  0.000  82219.810
simple  1.095569  126.000  55083.000  406987.896  117245.954  6295.100  485.604  0.000  586223.554

Table 3.9 gives a summary of the network cost totals for application cases N01, S01,

N10, and S10. SimTime is the execution time of the application in simulation clock cycles.

Total_A, Total_B, and Total_C are the network cost values taken from Tables 3.6, 3.7 and 3.8.

There are two types of messages that are not taken into account by the Decision Tree and they

relate to capacity cache misses. Victim messages are sent to the home directory to remove a node

from the copyset when a shared block is removed from the cache, and VictimWB messages

contain writeback data for a cache block that was in the EXCLUSIVE state when it was chosen as

a victim by the cache controller. The VictimWB and Victim columns in the table provide the

network cost for these two types of messages. The Total column is the sum of the individual

network cost totals. The Traffic column is the value in the Total column divided by the

Simulation Time and gives a measure of the channel utilization as calculated using the Decision

Tree and the victim message information. The ChanUtil column is the channel utilization

measured in the simulation by taking the ratio of the total number of cycles the channel was busy

transmitting data over the total number of clock cycles in the simulation. Table 3.9 shows that the

Decision Tree accurately describes the contribution of each path in the decision tree to the overall

channel utilization observed for the application as a whole.

Table 3.9: Decision Tree Traffic vs. Channel Utilization

Application  Sim Time  Total_A  Total_B  Total_C  VICTIM_WB  VICTIM  Total  Traffic  Chan Util
Case N01: N = .01  59741  28,468.19  351.72  10,195.35  14.38  216.81  39,246.45  66%  64%
Case N01: N = .02  82153  44,008.10  206.98  9,421.19  36.47  433.78  54,106.52  66%  74%
Case N01: N = .05  148445  89,772.48  74.23  12,619.48  98.88  1,004.91  103,569.97  70%  83%
Case N01: N = .10  256286  164,201.69  79.46  20,687.65  204.16  1,895.22  187,068.17  73%  87%
Case S01: S = .01  82153  44,008.10  206.98  9,421.19  36.47  433.78  54,106.52  66%  74%
Case S01: S = .02  108094  58,553.38  289.13  14,985.29  48.03  571.31  74,447.14  69%  72%
Case S01: S = .05  193752  106,693.19  163.60  30,690.80  90.13  1,143.56  138,781.28  72%  73%
Case S01: S = .10  333431  185,603.21  61.22  59,838.36  146.28  2,095.38  247,744.45  74%  74%
Case N10: N = .01  85816  43,089.40  586.46  14,907.88  23.63  338.00  58,945.37  69%  66%
Case N10: N = .02  108389  59,469.40  332.05  16,285.23  46.00  561.00  76,693.67  71%  72%
Case N10: N = .05  175771  104,968.42  93.31  17,473.64  117.50  1,167.44  123,820.30  70%  80%
Case N10: N = .10  290052  179,760.00  50.73  23,899.81  218.94  2,075.25  206,004.74  71%  84%
Case S10: S = .01  84186  44,850.71  174.35  10,080.88  35.53  436.81  55,578.28  66%  73%
Case S10: S = .02  108389  59,469.40  332.05  16,285.23  46.00  561.00  76,693.67  71%  72%
Case S10: S = .05  193622  104,800.50  327.31  23,987.90  101.00  1,093.13  130,309.83  67%  72%
Case S10: S = .10  328579  184,209.40  48.94  38,357.43  189.09  2,107.19  224,912.04  68%  75%
weather  6540632  855,482.21  81,260.10  384,582.21  6,485.13  6,033.00  1,333,842.65  20%  32%
speech  14678029  4,012,330.90  456,428.55  82,219.81  510.53  21,519.50  4,573,009.29  31%  41%
simple  12691795  1,858,822.98  74,179.08  586,223.55  7,961.22  18,123.06  2,545,309.90  20%  27%

Table 3.10: Distribution of Read or Write Misses Through the Decision Tree Paths

Application  B C D E F G I J K L M N O P Q R S
Case N01: N = .01  32% 55% 2% 5% 1% 3%
Case N01: N = .02  23% 66% 1% 7% 2%
Case N01: N = .05  13% 77% 8% 1%
Case N01: N = .10  8% 82% 9%
Case S01: S = .01  23% 66% 1% 7% 2%
Case S01: S = .02  19% 69% 2% 7% 1% 2%
Case S01: S = .05  11% 76% 2% 7% 1% 1%
Case S01: S = .10  7% 80% 3% 7% 2%
Case N10: N = .01  70% 19% 2% 8%
Case N10: N = .02  65% 24% 2% 7%
Case N10: N = .05  53% 36% 4% 6%
Case N10: N = .10  41% 49% 5% 4%
Case S10: S = .01  69% 20% 2% 8%
Case S10: S = .02  65% 24% 2% 7%
Case S10: S = .05  53% 36% 4% 6%
Case S10: S = .10  41% 48% 5% 4%
weather  7% 48% 3% 16% 3% 2% 19%
speech  5% 54% 19% 19%
simple  6% 65% 2% 10% 1% 15%

Table 3.10 shows the percent of read or write misses that were processed through each of

the Tree Paths, rounded to the nearest 1%. This table provides a mechanism to compare the

behavior of the simulated applications. It also shows that our approach to generating synthetic

workloads produces application behavior that resembles that of the trace files.

Path C is traversed when the address references are local. The fact that the trace

applications have a very low degree of locality (due to the frequent remote barrier accesses and

the remote dictionary data structure access) accounts for the differences between the synthetic

and trace-based applications for Path C.

Path F handles read accesses to remote data blocks that are in the Exclusive state and

Path L handles upgrade requests for remote data blocks that are shared by nodes other than the

requesting node. Speech has a higher value in both of these paths due to the scanning of the dictionary data and the subsequent update to a shared global database.

3.4.3 Queuing Network Theoretical Model

The system consisting of the processor, the cache, the directory and the channel

controller can be represented by a set of queues through which messages of different types flow.

Figure 3.15 shows the queue organization. Although no messages are delivered directly to the

processor, the arrival of Data-Ack and Ownership-Ack messages causes threads to become ready

for execution and therefore, affects the processor operation. Messages of other types used to

support coherence arrive at either the directory or the cache queue. Data and ownership request

messages from remote nodes arrive at the directory queue. Since the directory may have to send

and receive coherence messages in response to the arrival of a single data and ownership request

message, there is also a pending queue where such data and ownership request messages wait

until the coherence transactions are performed. Figure 3.16 shows the flow of messages through

several nodes for the case where a processor read operation causes a miss to a remote block that is

in EXCLUSIVE state.


Figure 3.15: Processor, Cache, Directory and Channel Queues


Figure 3.16: Message Flow in Queue system

In this section, a model of a SOME-Bus system is developed and results are compared to

previous results from the Trace-driven simulation of the SOME-Bus. Theoretical analysis of the

SOME-Bus system is based on queuing network models.

In a SOME-Bus system with N nodes, each node contains a processor with cache,

memory, an output channel and a receiver that can receive messages simultaneously on all N

channels. Each node also contains a directory that maintains coherence information on the section

of the distributed memory that is implemented in that node (the home node).

A multithreaded execution model is assumed: each processor executes a program that

contains a set of M parallel threads. The node in which a group of M threads executes is called

the host or owner node. Threads can only be executed in their owner node. A thread continues

execution until it encounters a global cache miss that requires data or permission from a remote

node. Then, it becomes suspended until the required action is completed (data transferred from a

remote memory or permission received), at which time it becomes ready for execution and

eventually resumes execution. A thread is executed on the processor for runtime R before

becoming suspended on a global cache miss.

A cache miss causes a request message to be enqueued for transmission on the output

channel. After transfer time T expires, the message is enqueued in an input queue at the receiver

of the destination (remote) node, is serviced by the directory at the remote node and another

message is sent back to the originating node with data or acknowledgment. The remote node

memory requires time L to assemble the response message. Time L is spent by the directory

controller, which must perform the necessary accesses on the memory, create the response

message and enqueue it for transmission on the output channel of the remote node. As part of

servicing a message, a directory may send messages to other nodes and receive data or

acknowledgments. A global cache miss may be due to a read or write miss at the local cache.

Accordingly, a Data request or Ownership request message is sent to the directory of the home

node. When the message is received, the required data block may be in SHARED state or

EXCLUSIVE state. The home directory may send a data message back to the requestor (in the

case of reading a shared block) or it may send Downgrade or Invalidation messages and collect

Invalidation Acknowledge (Invalidation-Ack) messages.

Since M indicates the maximum number of outstanding requests that a node may have

before it blocks, it is assumed that when a node has less than M outstanding requests, it generates

messages with mean interval of R time units between requests. R is the mean thread runtime. A

request message generated by a node is directed to any other node with equal probability. These

types of assumptions on the node operations are similar to the ones made in [39] and to the ones

used in [1] and [15], which study the performance of DSM in a torus system.

Since a message is sent by a node to the home node, which responds with another

message back to the originating node, this type of operation can be represented by a multi-chain,

closed queuing network, where messages receive complicated forms of service at certain server

stations. The M threads owned by each node form a separate class of messages with population

equal to M. When m < M threads have outstanding messages, the remaining M-m messages are

served in FCFS order at the processor. A message receives service (of geometrically distributed

time with mean R) at the owner processor, is enqueued at the output channel of the owner node,

receives service (of time T) by the channel, is enqueued in a receiver input queue at a remote

node, receives service (of time L) by the directory at the remote node, and is similarly transferred

back to the owner node through the remote output channel and owner receiver input queue.

Due to the symmetry of the system, all directory controllers behave in a similar fashion.

Therefore, a system with N=M+1 nodes shows the same performance as a system with N > M+1

nodes. Since the number of threads per node is typically small, a relatively small queuing

network representing a SOME-Bus system with M+1 nodes can be used to calculate performance

measures. Node P (P=0,1,...,M) owns M regular data messages that circulate through the

processor, the directory controllers of node P and the other M nodes and the necessary channels.

These messages represent the Data request and Ownership request messages from a remote node

to the home node and the Data-Ack and Ownership-Ack messages sent back to the originating

node. Different chains are used to distinguish between messages belonging to different nodes, so

that messages in chain P belong to node P. Messages owned by node P are directed to the

directory of any node, including the owner. Messages that arrive at node P from other nodes and

do not belong to chain P are sent to the directory of node P. Figure 3.17 shows the queuing

network resulting from a system with N=4 nodes (M=3), and traces the paths of the four chains

present in the system.

[Figure 3.17: Four-node Queuing Network. Each of the four nodes is modeled by its DIR, CHAN and PROC servers with queues Q00 through Q11; the diagram traces the paths of the four chains through the network.]

The complexity of the model arises from the fact that there are several other messages

that are used only to maintain coherence. These messages do not pass through the processors.

Instead they are generated by the directory controllers, go through the channels, possibly interact

with cache controllers and return to the originating directory controllers. Figure 3.18 shows the

traffic through one node.

[Figure 3.18: Traffic Through a Single Node in the Queuing Network. The node is modeled by its PROC, DIR and CHAN servers with queues Q00, Q01 and Q02; chain-0 requests go to the local directory with probability PLM and to the channel with probability 1-PLM, while traffic from the other chains and the coherence traffic also pass through the directory and channel.]

The service time at the processor is the thread run time. The coherence traffic determines

the service time at the directory controller. We assume that a data message caused by a cache

miss is due to a read reference with probability prd, and that a block may be found in shared state

with probability psh. There are four distinct cases, each with a different action by the directory

controller at the home node:

1. Data-request to a block in shared state (probability prd * psh). The directory controller

discards the data-request message and returns a data-acknowledge message back to the

requesting node. From the queuing network point of view there is no distinction between

a data-request and a data-acknowledge message. This activity can simply be viewed as a

data-request message performing one trip from its own processor to the home directory

controller and back to its own processor. Its service time at the home directory controller

is the time L1 that the directory needs to compose the data-acknowledge message with the

copy of the requested block.

2. Data-request to a block in EXCLUSIVE state (probability prd * (1-psh)). A Downgrade

message is sent by the home directory to the node with ownership of the block. That node

sends a DowngradeWB-Ack message to the home node, who sends a Data-Ack to the

requesting node.

3. Ownership request to a block in EXCLUSIVE state (probability (1-prd) * (1-psh)). An

Invalidation message is sent by the home directory to the node with ownership of the

block. That node sends an InvalidationWB-Ack (Invalidation Acknowledge with write-

back) message to the home node, who sends an Ownership-Ack to the requesting node.

4. Ownership request to a block in shared state (probability (1-prd) * psh). The home

directory controller broadcasts invalidations to all nodes having a copy of the requested

block. It collects all invalidate-acknowledge messages and then sends an Ownership-Ack

message to the requesting node.

The average service time experienced by a data message at the home directory controller

is therefore

L = L1 * prd * psh + 2 * (L2 + L3)*(prd * (1-psh)+(1-prd) * (1-psh)) + L4 * (1-prd) * psh (3.1)

where L1 is the time that the directory needs to compose the data-acknowledge message

with the copy of the requested block, L2 is the channel transfer time, L3 is the average channel

queue time and L4 = (L2+L3)+L5 is the mean time to send the Invalidation message and collect the

Invalidation-Ack messages.

We assume that the cache controller works at much higher speed than the network, so

that the response time of the cache controller can be ignored. If the number of blocks which must

be invalidated on each Ownership request is fixed at Ninv, then L5 is the mean value of a random

variable equal to the maximum of Ninv identical random variables, each being equal to the service

and queuing time at the channel server. In several applications and in most experiments described

here, it is observed that typically one Invalidation message is sent. In that case, L5 = L2+L3 and

L4 = 2 * (L2+L3). Then (3.1) becomes

L = L1 * prd * psh + 2 *(L2+L3) * (prd * (1-psh)+(1-prd) * (1-psh)) + 2 *(L2+L3) * (1-prd) * psh

L = L1 * prd * psh + 2 * (L2 + L3) * (prd * (1-psh) + (1-prd) * (1-psh) + (1-prd) * psh )

L = L1 * prd * psh + 2 * (L2 + L3) * (prd * (1-psh) + (1-prd))

L = L1 * prd * psh + 2 * (L2 + L3) * (1 - prd * psh) (3.2)
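As a concrete illustration, the short sketch below (Python; the numeric values are hypothetical and not taken from the simulations) evaluates the directory service time L of (3.2) from prd, psh, L1, L2 and L3:

    def directory_service_time(p_rd, p_sh, L1, L2, L3):
        # Mean directory service time L from equation (3.2).
        # A Data request to a SHARED block costs only the compose time L1;
        # every other case costs one coherence round trip, 2*(L2 + L3),
        # assuming a single Invalidation or Downgrade message suffices.
        return L1 * p_rd * p_sh + 2.0 * (L2 + L3) * (1.0 - p_rd * p_sh)

    # Hypothetical values: 70% reads, 60% shared blocks, L1 = 20 cycles,
    # channel transfer L2 = 30 cycles, channel queuing L3 = 0 cycles.
    L = directory_service_time(p_rd=0.7, p_sh=0.6, L1=20, L2=30, L3=0)
    print("mean directory service time L =", L, "cycles")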

In fact, the contribution of Invalidation messages to the directory service time is even

smaller, because it is possible that the requested block is in the UNOWNED state and no

Invalidation messages need to be sent.

Second, the coherence traffic is also passing through the channels. The interference with

the data messages can be approximated by the fact that coherence messages absorb some fraction

of the service rate of the channel server, making the channel service time of the data messages

appear larger. This apparent service time is

T' = ( T * (KD +KC ) ) / KD (3.3)

where T is the real channel mean transfer time, KD=M(M+1) is the total number of data

messages in the system, and KC is the mean number of coherence traffic messages present in the

system. Since coherence messages are created due to the arrival of data messages into the

underlying coherence mechanism, KC can be calculated using Little’s result: KC = λD * L4, where
λD is the arrival rate of data messages into the underlying coherence mechanism and L4 =
2(L2+L3) is the time that a coherence message exists in the system. The arrival rate is
λD = (1 - prd psh) * KD / RD, where RD is the data message mean roundtrip time, and the (1 - prd psh) factor is

necessary because data-request messages to a block in shared state do not result in additional

coherence traffic.
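The calculation of the apparent channel service time can be sketched as follows (Python; all numeric values are hypothetical):

    def apparent_channel_time(T, M, p_rd, p_sh, L2, L3, R_D):
        # Apparent channel service time T' from equation (3.3).
        # K_D = M*(M+1) data messages circulate in the reduced (M+1)-node model.
        # Coherence messages are counted with Little's result:
        # K_C = lambda_D * L4, with lambda_D = (1 - p_rd*p_sh) * K_D / R_D
        # and L4 = 2*(L2 + L3) the lifetime of a coherence message.
        K_D = M * (M + 1)
        L4 = 2.0 * (L2 + L3)
        lambda_D = (1.0 - p_rd * p_sh) * K_D / R_D
        K_C = lambda_D * L4
        return T * (K_D + K_C) / K_D

    # Hypothetical example: M = 4 threads per node, roundtrip time R_D = 400 cycles.
    print(apparent_channel_time(T=30, M=4, p_rd=0.7, p_sh=0.6, L2=30, L3=0, R_D=400))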

Assuming geometrically distributed processing times, this queuing network model is in a

condition that may be solved by standard techniques, with the only exception that the service time

at the directory server depends on the queuing time at the channel server queue. It is possible to

solve the model iteratively, by assuming an initial L3 value, and repeatedly solving the model and

updating L3 until the current and next values of L3 are sufficiently close. Such an iteration does

converge, because a large (or small) assumed value of L3 results in low (or large) channel

utilization and a small (or large, respectively) subsequent value of L3. However, the simulation

results show a moderate channel utilization, which results in relatively small queuing times. To

keep the queuing network model simple, we set L3 = 0, and essentially ignore the dependence of

the directory service time on the channel wait time. Using the reduced channel service rate, and

the estimate of the directory service time, the closed queuing network with 3(M+1) queues,

(M+1) chains and M messages in each chain can be solved using techniques described in [4].

Comparisons with simulation results indicate that such an assumption is justified.
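If the dependence on L3 is retained, the fixed-point iteration described above can be sketched as follows (Python). Here solve_closed_network is only a placeholder for the closed-network solution technique of [4]; the snippet illustrates the update loop, not the solver itself:

    def iterate_channel_wait(solve_closed_network, params, tol=1e-3, max_iter=50):
        # Iterate on the channel queuing time L3 until successive values agree.
        # solve_closed_network(params, L3) is assumed to solve the closed queuing
        # network for the given L3 and return the resulting mean channel queuing
        # time; setting L3 = 0 and stopping after one step corresponds to the
        # simplification used in the text.
        L3 = 0.0
        for _ in range(max_iter):
            L3_next = solve_closed_network(params, L3)
            if abs(L3_next - L3) < tol:
                return L3_next
            L3 = L3_next
        return L3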

To perform a comparison between simulation and theoretical results, the parameters of

the queuing network model must be related to some key measurements of the simulation.

Specifically, the inputs to the theoretical model are the mean service times at the processor,

directory and channel servers, and the probability that a miss can be satisfied by a block in the

local node memory. The directory and channel service times are calculated using (3.2) and (3.3).

The processor service time and the local probability are obtained from the simulation results.

From the Trace-driven simulation we extract a number of parameter values that are used

to characterize the application, and especially its effect on the processor and the network. In

general, the proper values of these parameters may be measured or estimated given a specific

application, or common values applicable to a wide range of applications may be used. The

relevant parameters are

PRead = probability that a memory access is a read

PMiss = probability of a cache miss

PLoca = probability of local access

PModi = probability that a node finds one of its blocks modified by another node

PUpgr = probability that a write access finds the required block in cache in shared state

PShmo = probability that a node finds one of its blocks SHARED or EXCLUSIVE and not

UNOWNED

Pmo = probability that a node performs a remote access and finds the required block in

EXCLUSIVE state

Parameter PLoca is used to indicate the degree that individual processes in the application

tend to access the local memory of the node that they run on. Similarly, PModi, PUpgr and Pmo

indicate the degree of interaction between processes as they read and write in shared areas of the

distributed memory. Parameter PShmo also relates to process locality and indicates the degree that

blocks tend to be accessed repeatedly and consequently are found in SHARED or EXCLUSIVE

state and not UNOWNED state. From these parameters, we calculate the following

PWrit = 1 - PRead = probability that a memory access is a write
PHit = 1 - PMiss = probability of a cache hit
PRemo = 1 - PLoca = probability of remote access
Pdl = PRead * PMiss * PLoca * PModi = Prob. of Data request to the local directory
Puol = PWrit * PHit * PUpgr * PLoca = Prob. of Upgrade request to the local directory
Prol = PWrit * PMiss * PLoca * PShmo = Prob. of Ownership request to the local directory
Pdr = PRead * PMiss * PRemo = Prob. of Data request to a remote directory
Puor = PWrit * PHit * PUpgr * PRemo = Prob. of Upgrade request to a remote directory
Pror = PWrit * PMiss * PRemo = Prob. of Ownership request to a remote directory

Pol = Puol + Prol = Prob. of Ownership request (including Upgrade) to the local directory
Por = Puor + Pror = Prob. of Ownership request (including Upgrade) to a remote directory
Pml = Pdl + Pol = Prob. of any request to the local directory
Pmr = Pdr + Por = Prob. of any request to a remote directory
Pam = Pml + Pmr = Prob. of any request
1 / Pam = mean processor run time before a message is generated
PLM = Pml / (Pml + Pmr) = Prob. that a request message is sent to the local directory
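For reference, the parameter derivation above translates directly into code; the sketch below (Python) computes the derived probabilities and the mean run time from the measured application parameters:

    def derived_parameters(P_read, P_miss, P_loca, P_modi, P_upgr, P_shmo):
        # Derive the request probabilities used by the queuing network model.
        P_writ = 1.0 - P_read            # probability that an access is a write
        P_hit = 1.0 - P_miss             # probability of a cache hit
        P_remo = 1.0 - P_loca            # probability of a remote access

        P_dl = P_read * P_miss * P_loca * P_modi     # Data request, local directory
        P_uol = P_writ * P_hit * P_upgr * P_loca     # Upgrade request, local directory
        P_rol = P_writ * P_miss * P_loca * P_shmo    # Ownership request, local directory
        P_dr = P_read * P_miss * P_remo              # Data request, remote directory
        P_uor = P_writ * P_hit * P_upgr * P_remo     # Upgrade request, remote directory
        P_ror = P_writ * P_miss * P_remo             # Ownership request, remote directory

        P_ml = P_dl + P_uol + P_rol      # any request to the local directory
        P_mr = P_dr + P_uor + P_ror      # any request to a remote directory
        P_am = P_ml + P_mr               # any request at all
        R = 1.0 / P_am                   # mean processor run time before a message
        P_LM = P_ml / (P_ml + P_mr)      # request goes to the local directory

        return {"R": R, "P_LM": P_LM, "P_ml": P_ml, "P_mr": P_mr}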

3.4.4 Comparison Between the Simulators and the Theoretical Model

Figures 3.19 through 3.30 show the processor utilization, channel utilization and channel

queue waiting time obtained from the Trace-driven simulation, the Statistical simulation and the

queuing network model described above. A comparison is provided for each of the four

application cases (N01, N10, S01, S10).

The figures show that all three approaches provide very similar values for Processor

Utilization and Channel Utilization. The main differences between the two simulators and

queuing model are message size, victim messages (writeback and notification only victims), and

the number of invalidations. The theoretical model does not take into consideration the victim

messages, which will cause the values for the channel utilizations and the waiting time in the

output queue to be lower than the Trace-driven simulation. As the probability of local accesses

increases, a larger percentage of the victim messages will be local and will not contribute to either the

channel utilization or the waiting time in the output queue. Figure 3.19 shows that as the

probability of data sharing between the Home and Home2 nodes increases, the channel utilization

also increases. This is due to the increase in the number of victim messages. In Figure 3.20,

when the Remote Miss ratio is low, the fact that the probability of a local access is higher has the

predominant effect and the channel utilization is lower. As the probability that data is shared

between Home and Home2 increases, so does the number of victim messages in the system.

Internode sharing of data increases for both S01 and S10 cases and this causes the number of

upgrade messages to increase (as opposed to Ownership messages). Upgrade messages do not

carry data and therefore make a smaller contribution to the channel utilization. As the sharing

increases, data blocks in the cache tend to be invalidated more than being victimized. In Figures

3.21 and 3.22 the theoretical and Trace-driven values for the channel utilization are much closer.

The Statistical simulator has a higher channel utilization in all cases mainly because it uses a

fixed message size. For the same reason, the Statistical simulator also has a longer waiting time

at the output queue. In the theoretical model, the message size is exponentially distributed. In

the Trace-driven simulation, the messages are of two sizes, 74 bytes for a message containing

data and 10 bytes for a message that does not contain data.

The Statistical simulator and the theoretical model assume low values for Ninv. The

value of Ninv depends on the state of the memory block when a write request arrives at the

directory. If the block is UNOWNED, Ninv = 0; if the block is in the EXCLUSIVE state, Ninv =
1; if the block is SHARED, Ninv equals the number of sharers. The Trace-driven simulator is the only one

capable of accurately determining the value for Ninv. The Decision Tree Paths that are affected

by Ninv are J, L, P, and Q. Tables 3.7 and 3.8 indicate that the paths P and L are major

contributors to the overall network cost of an application due to the higher values for Ninv.

The difference in the processor utilization and the channel utilization for all three

approaches is less than 10%. The difference in output channel waiting time between the Trace-

Driven simulator and the Theoretical model is approximately 5-10 clock cycles. The similarities

in the results from each of the approaches tend to validate each other and provide confidence in

the results presented in this thesis. Another important result of this analysis is that a system with

such complex behavior can be modeled by relatively simple and tractable models, which can

provide reasonable estimates of processor utilization and message latencies, using parameters that

characterize real applications.


[Figure 3.19: Channel Utilization: Case N01 (channel utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]

[Figure 3.20: Channel Utilization: Case N10 (channel utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]


[Figure 3.21: Channel Utilization: Case S01 (channel utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]

[Figure 3.22: Channel Utilization: Case S10 (channel utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]


[Figure 3.23: Channel Queue Waiting Time: Case N01 (waiting time in clock cycles versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]

[Figure 3.24: Channel Queue Waiting Time: Case N10 (waiting time in clock cycles versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]


[Figure 3.25: Channel Queue Waiting Time: Case S01 (waiting time in clock cycles versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]

[Figure 3.26: Channel Queue Waiting Time: Case S10 (waiting time in clock cycles versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]


[Figure 3.27: Processor Utilization: Case N01 (processor utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]

[Figure 3.28: Processor Utilization: Case N10 (processor utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]


[Figure 3.29: Processor Utilization: Case S01 (processor utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]

[Figure 3.30: Processor Utilization: Case S10 (processor utilization in percent versus probability of remote miss; curves: T_Sim_FT0, S_Sim_FT0, TH_FT0)]


CHAPTER 4: FAULT TOLERANCE

A multiprocessor system based on the SOME-Bus can be composed of hundreds of

nodes. As the number of nodes in the system increases, the likelihood that the system will

experience temporary faults causing application errors or complete node failures also increases.

For this reason, the ability to tolerate node faults and failures becomes essential for parallel

applications with large execution times.

A popular approach to fault tolerance is known as Backward Error Recovery. Backward

Error Recovery enables an application that encounters an error to restart its execution from an

earlier, error-free state. This is achieved through the periodic saving of system information, which

is restored when an error is detected. The information that is collected represents a snapshot of

the program and processor state and is referred to as a checkpoint or recovery point. The two

main requirements of a checkpoint are that it contains a consistent system state and that it is saved

to a stable storage medium.

This chapter presents four protocols for fault tolerance based on consistent checkpointing.
The first protocol, FT0, is the traditional approach to consistent checkpointing. The remaining
three protocols, FT1, FT2 and FT3, take advantage of the inherent multicast capabilities of the
SOME-Bus in order to reduce the fault-tolerance overhead observed in FT0. In addition,
the optimized protocols are shown to improve the performance of the DSM in some applications.

4.1 FAULT TOLERANCE AND DISTRIBUTED SHARED MEMORY ON THE SOME-BUS


Typically, in order to ensure that the checkpoint is consistent over the entire system, the

processes are synchronized before the checkpoint is established and then resume normal

execution after the checkpoint has been saved. In order to avoid excessive memory requirements,

it is only necessary to keep a single checkpoint during normal operation. When a new checkpoint

is created, it replaces the old one. Since it is possible to have an error during the establishment of

a checkpoint, it is crucial that the operation be atomic. This means that the entire checkpoint must

successfully be created before replacing the old checkpoint. If the entire checkpoint cannot be

created, the previous one should be preserved.

The checkpoint must be saved to a storage medium that can be accessed after a failure

occurs. Saving large amounts of data to disk can be extremely time consuming. Another approach

that has been proposed [18] is to utilize the memory of another node to store checkpoints. This

approach provides the necessary requirement to tolerate a single node failure, or multiple node

failures where all copies of the checkpoint do not reside on the failed nodes. Storing the

checkpoint in the memory of another node instead of writing the checkpoint to disk reduces the

amount of time that is required to save the checkpoint.

A number of recoverable distributed shared memory systems were described in chapter 1.

Many of these approaches use the built-in mechanism for data replication and transfer that exists

in a DSM to hide some of the overhead of fault tolerance. The DSM system described in this

thesis differs from those described in chapter 1 because it is hardware-based, implements the

sequential consistency model, uses the cache block as the unit of coherence, and in some

situations is capable of using the data maintained for fault tolerance to improve the performance

of the DSM. In the following sections, four fault tolerance protocols, FT0, FT1, FT2 and FT3,

are presented which exploit the capabilities of the SOME-Bus Architecture to provide a

recoverable DSM with little or no performance degradation over the basic DSM system.

4.2 FT0 PROTOCOL

In the FT0 protocol, each node keeps a full copy, or checkpoint, of its local memory,

known as its recovery memory. In addition, every node also contains a copy of the recovery

memory of another node. Node X contains its own local memory and recovery memory as well as

a copy of Node Y’s recovery memory, as shown in Figure 4.1. If Node Y encounters an

application error that does not contaminate its recovery memory, Node Y can use its own

recovery memory to restore the previous checkpoint state and then restart along with the rest of

the nodes in the system. In the case of a complete failure and replacement of Node Y, the copy of

Node Y’s recovery data that resides on Node X can be used to initialize the replacement node. In

either case, after the memory of Node Y has been restored, the system can perform the Backward

Error Recovery.


[Figure 4.1: FT0 Memory Organization of SOME-Bus Node. Each of Nodes X, Y and Z holds its cache, its local memory, its own recovery data and the recovery data of one other node (X backs up Y, Y backs up Z, Z backs up X).]

Initially the recovery memory is created by making a copy of the local memory.

However, it is not necessary to copy the entire contents of the local memory when subsequent

checkpoints are taken. Instead the recovery data can be incrementally updated since it is only

necessary to save data that has been modified since the last checkpoint [18]. The process of

creating a checkpoint involves three phases: the synchronization phase, cache writeback phase

and the recovery memory update phase, which are described below.

The first phase is the synchronization phase. At system start up, one of the processors is

designated the coordinator for all checkpoint procedures. The coordinator sends a message to all

processors directing them to suspend normal operation and synchronize in order to begin the

checkpointing process. During the synchronization phase, the processors do not issue new

requests for data. Threads that are either in the ready or running state are suspended. If the

processor has threads that are blocked waiting for the response to a previously issued data

request, it waits for the outstanding request to complete before notifying the coordinator that the

processor is ready to begin the checkpoint. When all outstanding data requests have completed

and all processors have indicated their readiness to begin the checkpoint, the cache writeback

phase begins.

During the cache writeback phase, the cache controller in each node searches the cache

looking for blocks in the EXCLUSIVE state. The cache controller combines the data blocks that

must be written back to a particular node (the home node for the data blocks) into a single

message. The number of messages sent by the cache controller depends on the number of home

nodes for which the cache contains a data block in the EXCLUSIVE state. After each cache

controller sends the necessary writeback messages, the main memory of every node contains the

most recent value of all data blocks in preparation for the checkpoint. At the end of the cache

writeback phase, all blocks in the cache will be in the SHARED state (the issue of changing the state to SHARED is discussed later in this section). After the cache writeback

phase is complete, the checkpointing process enters the recovery memory update phase.

In the recovery memory update phase, the directory controller in each node copies the

blocks in main memory that have been modified since the previous checkpoint to the recovery

memory. At the same time, these modified blocks are added to an update message that is sent to

the backup node so that it may update the backup copy of the recovery memory. A fault that

occurs during this phase could cause the recovery memory to become corrupted. For this reason,

the recovery memory update operation must be atomic. The first step is to create a temporary

copy of the recovery memory with the required updates. When the temporary copy of the

recovery memory on every node has been updated successfully, the processors synchronize again

and the newly created copy of the recovery memory becomes the recovery memory for the

current checkpoint and the previous recovery memory is deleted. All nodes synchronize again and

then resume normal operation. If any node has a failure while the new copy of the recovery data

is being created, the checkpointing procedure can be stopped and then restarted using the original

recovery memory that remained unmodified. The original recovery memory is not deleted until

every node indicates that it has successfully created the new recovery memory.
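To summarize the three phases, the following sketch (Python-style pseudocode) outlines the coordinator's view of the FT0 checkpoint procedure. The node objects and primitives used here (broadcast, barrier, writeback_exclusive_blocks and so on) are hypothetical names chosen for illustration only, not the interface of the simulator:

    def ft0_checkpoint(coordinator, nodes):
        # Phase 1: synchronization. Suspend ready/running threads, let
        # outstanding data requests complete, and wait until every node is ready.
        coordinator.broadcast("BEGIN_CHECKPOINT")
        for node in nodes:
            node.suspend_threads()
            node.wait_for_outstanding_requests()
        coordinator.barrier(nodes)

        # Phase 2: cache writeback. Each cache flushes its EXCLUSIVE blocks,
        # combining blocks with the same home node into a single message;
        # the blocks are left in the SHARED state.
        for node in nodes:
            node.writeback_exclusive_blocks()
        coordinator.barrier(nodes)

        # Phase 3: recovery memory update. Each node copies its modified blocks
        # into a temporary recovery copy and sends the same blocks to its backup
        # node; the new copy replaces the old checkpoint only after every node
        # reports that it was created successfully (atomic commit).
        for node in nodes:
            node.build_temporary_recovery_copy()
        coordinator.barrier(nodes)
        for node in nodes:
            node.commit_recovery_copy()
        coordinator.broadcast("RESUME")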

A “modified” attribute is added to the directory entry for each memory block to indicate

that the block has been modified in the interval between two checkpoints. After the recovery

memory has been updated, all of the “modified” attributes are cleared since all cache blocks are

in the SHARED state. When a directory controller grants ownership of a block to a requesting

node, the “modified” attribute is set so that the block will become part of the next checkpoint.

Downgrading all cache blocks in the EXCLUSIVE state to the SHARED state during the

cache writeback phase can cause a rush of Ownership requests immediately following the

checkpoint creation. Another approach is to have the cache controllers perform the writeback but

keep the blocks in the EXCLUSIVE state. This optimization may require some additional

complexity when the recovery memory is updated. If the cache blocks are allowed to remain in

the EXCLUSIVE state after they have been written back to main memory during the cache

writeback phase, the “modified” attribute for these blocks should not be cleared after the recovery

memory has been updated. If the behavior of the application is such that multiple write operations

are performed to the cache blocks once ownership has been obtained, allowing the blocks to

remain in the EXCLUSIVE state after the writeback can prevent a burst of ownership (or

upgrade) requests following the checkpoint. However, if the blocks tend to only be written once

or twice before being requested by another thread, allowing a cache to maintain ownership of the

EXCLUSIVE blocks following the writeback could needlessly increase the size of the update

messages that are transmitted to the backup nodes during the recovery memory update phase.

This occurs because the “modified” attribute for any block in the EXCLUSIVE state will remain

set after the checkpoint even though the data block might not be written again by the thread that

owns it before the next checkpoint is created. The approach used in this thesis is to change all

EXCLUSIVE blocks to shared once the cache writeback operation is performed. The main reason

for this is that the experimental results show that the major contributor to the network traffic

overhead associated with fault tolerance is the size of the recovery update message that is

exchanged between each node and its assigned backup node during the recovery memory update

phase.

The FT0 protocol does not cause any overhead in terms of network traffic associated with

fault-tolerance related messages during normal execution. The overhead occurs only when a

checkpoint is being created and is caused by the messages related to the cache writeback phase

and the exchange of update messages between each node and its backup node during the recovery

memory update phase. The total execution time for an application is the combination of the time

required to perform the normal activities required by the application and the time required to

create and maintain the checkpoint information.

The amount of time required to create a checkpoint depends on the number and size of

messages that must be exchanged during the creation of the checkpoint. For this reason, the

number of cache blocks that need to be written back to main memory during the cache writeback

phase and the number of blocks whose “modified” attribute is set and therefore must be added to

the update message in the recovery memory update phase have a direct effect on the overhead

associated with the checkpoint procedure. Depending on the memory access pattern, as the

interval between consecutive checkpoints increases, the number of modified memory blocks may

also increase, resulting in larger update messages and consequently more time spent creating each

checkpoint. The number of cache blocks that must be written back also contributes to the

checkpoint time. The number of EXCLUSIVE blocks in the cache at any particular time is related

to the read probability and miss probability of the application.

Figures 4.2 through 4.5 show the effect that the checkpoint interval has on the overhead

associated with the FT0 protocol. Each figure provides results for one of the application classes

described in Chapter 3 (Case N01, Case N10, Case S01 and Case S10 respectively). Each figure

provides the proportion of the checkpoint creation time required for each of the three phases: the

synchronization phase, the cache writeback phase, and the recovery memory update phase. The

data is provided in terms of the number of simulation clock cycles required to perform the

activity.

Figure 4.2 provides results for the Case N01 class of applications described in Chapter 3.

For the Case N01 class of applications the parameter L and the parameter S are held constant and

set to .01. The parameter N takes on the values .01, .02, .05 and .10. The application type

represented by this combination of L, S, and N is one in which there is a small degree of locality

within a particular node as well as a small degree of sharing between threads. The data in Figure

4.2 is arranged in four pairs where each pair corresponds to one of the four possible values for the

parameter N. Within each pair, the results provided are for checkpoint intervals of 10,000 and

30,000 memory references. The simulated applications consist of 30,000 memory references,

resulting in 3 checkpoints for the first data column (I=10) and a single checkpoint for the second

data column (I=30) in each pair.

The data in Figure 4.3 is organized in pairs as described for the previous figure. Figure

4.3 provides results for the Case N10 class of applications in which the parameter N and the

parameter S are the same values as the N01 case but the L parameter is .10. The application type

represented by Case N10 is one in which there is a larger degree of locality within a particular

node and there is little sharing of data between threads.


[Figure 4.2: Checkpoint Intervals of 10,000 and 30,000 for the Case N01 Applications (simulation cycles spent in the Synchronization, Cache Writeback and Recovery Memory Writeback phases for N = .01, .02, .05, .10 and I = 10, 30)]

[Figure 4.3: Checkpoint Intervals of 10,000 and 30,000 for the Case N10 Applications (simulation cycles spent in the Synchronization, Cache Writeback and Recovery Memory Writeback phases for N = .01, .02, .05, .10 and I = 10, 30)]

Figure 4.4 provides results for the Case S01 class of applications in which the parameter

L and the parameter N are held constant and set to .01. The parameter S takes on the values .01,

.02, .05 and .10. The application type represented by Case S01 is one in which there is a smaller

degree of locality within a particular node and there is a larger degree of sharing of data between

threads.

[Figure 4.4: Checkpoint Intervals of 10,000 and 30,000 for the Case S01 Applications (simulation cycles spent in the Synchronization, Cache Writeback and Recovery Memory Writeback phases for S = .01, .02, .05, .10 and I = 10, 30)]

Figure 4.5 provides results for the Case S10 class of applications in which the parameters

N and S are the same values as the S01 case but the L parameter is .10. The application type

represented by Case S10 is one in which there is a larger degree of locality within a particular

node and there is a larger degree of sharing of data between threads.


[Figure 4.5: Checkpoint Intervals of 10,000 and 30,000 for the Case S10 Applications (simulation cycles spent in the Synchronization, Cache Writeback and Recovery Memory Writeback phases for S = .01, .02, .05, .10 and I = 10, 30)]

Figures 4.2 through 4.5 show that in all cases, the recovery memory update phase is

responsible for the majority of the time required to create a new checkpoint. The figures also

show that the time required to perform the recovery memory update phase increases as the

checkpoint interval increases for all four of the application classes. The amount of increase in the

time to perform the recovery memory update phase is directly related to the number of data

blocks that are modified during the checkpoint interval. The number of data blocks modified

during the checkpoint interval depends on the behavior of the application and this is the reason

for the differences in the amount of additional time that is required to perform the recovery memory

update in Figures 4.2 through 4.5. These application behavior characteristics are described

below.

The probability that a reference will be to a block already in the node’s cache is 1-

(L+N+S). The Case N01 and S01 applications have a cache miss ratio in the range of 3% to 12%

while the Case N10 and S10 applications have a cache miss ratio in the range of 13% to 22%.

The difference in the cache miss ratio is due to the increase in the L parameter from .01 to .10.

The miss ratio is the reason for the increase in the number of modified data blocks for Case N10

over Case N01 and between Case S10 and Case S01. An increase in the number of modified data

blocks during the interval between checkpoints causes an increase in the time required to perform

the recovery memory update phase of the checkpoint procedure.

For Case N01 and N10, the probability of sharing is low and constant and as the N

parameter is increased from .01 to .10, the number of remote accesses increases but the majority

of these accesses are to unique remote locations. This results in a larger number of data blocks

being modified as the checkpoint interval increases. The Case S01 and S10 application classes

have an increased probability of sharing. In these applications, the N parameter is held constant

and the S parameter is increased from .01 to .10. The S parameter corresponds to the probability

that a thread will choose an address that has been accessed previously. For this reason, once a

data block has been accessed, there is an increased probability that the data block will be chosen

again during the interval between the checkpoints. For the same cache miss ratio, the N01 and

N10 class applications will modify more data blocks than the S01 and S10 class applications

because the latter tend to pick data blocks that have already been modified, resulting in fewer
modified blocks overall. For this reason, the Case S01 and Case S10 applications do not exhibit

the same level of increase in the number of modified blocks when the S parameter is increased as

the Case N01 and Case N10 applications do when the N parameter is increased.

The average number of cache blocks that must be written to main memory for each

checkpoint ranges between 12 and 15 blocks for all experiments and is not affected by the length

of the interval between the checkpoints. A method for theoretically determining these values

based on a set of parameters that describe the application is presented below.

Assumptions for the proposed model are a DSM architecture with a Write-Invalidate

cache coherency policy. The system has N nodes, M memory blocks per node, and C cache

blocks per node. The replacement strategy for the cache is the random algorithm with a uniform

distribution. All memory accesses happen with a common clock and in every clock cycle each

processor generates a memory reference. In a single clock period, a particular memory block can

only be referenced by one of the nodes.

The memoryless property of a Markov process requires that the future of the process

depends only upon the present state of the process and not upon its history. A discrete time

Markov Chain (MC) assumes a discrete state space, with state transitions

occurring at integer units of time.

If the number of EXCLUSIVE blocks in a particular cache is used to represent the state of the cache,

the system can be modeled as a discrete-time MC if it can be shown that the state of the cache at

time t+1 depends only upon the state of the cache at time t and not any previous state. Assuming

a Write-Invalidate cache coherency protocol, changes in the number of modified blocks in a

cache occur as the result of write requests by the local processor, incoming Invalidation or

Downgrade requests, or selection of EXCLUSIVE blocks as victims to be removed from the

cache. Which of these operations occurs at a particular time t depends upon the application

behavior or more specifically upon the sequence of accesses to memory generated by the

processors in the system. In order for the system to be modeled as a MC, neither the type of

operation (read or write) nor the selection of memory address can be related in any way to the

state of the cache. In other words, the program flow of the application (and consequently the

sequence of memory accesses) cannot be influenced by the state of the cache. An example of the

type of interaction that is not allowed would be for a node to purposely avoid accessing a

memory block that is in the Exclusive state in another cache. In this thesis, the memory

references are assumed to be generated by the process without knowledge of and completely

independent of the state of any cache in the system. With this assumption, the behavior of a

cache can be approximated using a discrete-time Markov chain. The states (Si for i = 0 to C) used

in the model represent the number of cache blocks in the EXCLUSIVE state in a single cache.

Figure 4.6 shows the state-transition-rate diagram for the MC described above.

[Figure 4.6: State-transition-rate Diagram. A birth-death chain with states 0 through C, transition rates λi from state i to state i+1 and μi from state i to state i-1.]

The caches are assumed to be identical in design and therefore behave identically under

the same conditions. The model is described from the point of view of a single cache, which will

be referred to as the target cache. The local node is the node that contains the target cache and

other nodes are referred to as remote. Since all N nodes generate a memory reference in a single

clock period, when a memory block is chosen, the probability that the node that accessed the

block is the local node is (1/N). The probability that the node that accessed the memory block

was one of the remote nodes in the system is 1-(1/N). The probability of a write access is Pw and

the probability of a read access is 1-Pw. The equations for the transition probabilities will be

presented with the assumption that memory blocks are randomly accessed with a uniform

distribution. The adjustments to the equations that are required when the memory access pattern

is not uniformly distributed are also discussed.

In order to move from state Si to state Si+1, there must be a write access from the local

node which causes the number of EXCLUSIVE blocks in the cache to be increased by 1. This

can be achieved in one of two ways described below.

1. the access is a write operation (Pw) from the local node (1/N) to a block that is in the

cache (C/NM) but not in the EXCLUSIVE state (1 - (i/C)).

2. the access is a write operation (Pw) from the local node (1/N) to a block that is not already
in the cache (1 - C/NM) and that the victim block was not in the EXCLUSIVE state (1 -

(i/C)).

Equation (4.1) provides λi, the probability of moving from state Si to state Si+1:

λi = Pw * (1/N) * [ (C/NM) * (1 - i/C) + (1 - C/NM) * (1 - i/C) ]        (4.1)

λi = Pw * (1/N) * (1 - i/C)

In order to move from state Si to state Si-1, there must be an access that causes the number of
EXCLUSIVE blocks in the cache to be decreased by 1. This can be achieved in one of the two
ways described below. Equation (4.2) provides μi, the probability of moving from state Si to state Si-1.

1. The access is a read operation (1-Pw) from the local node (1/N) to a block that is not in

the cache (1 - C/NM) and the resulting victim block is in the EXCLUSIVE state (i/C).

2. The access is either a read or write operation from a remote node (1-1/N) to a block that

is in the cache (C/NM) and in the EXCLUSIVE state (i/C).

μi = (1 - Pw) * (1/N) * (1 - C/NM) * (i/C) + (1 - 1/N) * (C/NM) * (i/C)        (4.2)

Let PBF be the probability that the memory block is found in the cache. If uniform

memory access is assumed, the PBF is equal to C/NM and is independent of whether the node

accessing the memory block is the local node or one of the remote nodes. Multiprocessor

applications typically do not have uniform memory access patterns. If there is a high level of

spatial or temporal locality of reference, then there is a higher probability that the memory block

selected by the local node will fall within the local memory space of the node. If memory access

is uniform, PBF=C/NM but if instead, a node always picks a memory block within its own local

memory space (with a uniform distribution), PBF=C/M. When there are higher levels of spatial

locality in applications PBF is not the same when the node is local and when it is remote. If the

application exhibits a large amount of interprocessor sharing of data, PBF for the remote nodes will

increase.

Let PBF_L be the probability that the memory block is found in the cache when the block

was selected by the local node, and PBF_R be the probability that the block was selected by one of

the remote nodes. The necessary changes to the transition probabilities are shown in (4.3) and

(4.4).

λi = Pw * (1/N) * [ PBF_L * (1 - i/C) + (1 - PBF_L) * (1 - i/C) ]        (4.3)

λi = Pw * (1/N) * (1 - i/C)

μi = (1 - Pw) * (1/N) * (1 - PBF_L) * (i/C) + (1 - 1/N) * PBF_R * (i/C)        (4.4)

Once the MC is solved and the state probabilities are found, the expected value for the

number of EXCLUSIVE cache blocks can be determined using (4.5), where C is the number of

cache blocks in each node and i is the number of EXCLUSIVE cache blocks in state i.

mean number of EXCLUSIVE cache blocks = Σ (from i = 0 to C) of i * Si        (4.5)

Figures 4.7 and 4.8 show state probabilities for a system with 32 nodes, 2868 memory

blocks per node, and 64 cache blocks per node using Equations 4.3 and 4.4. Figure 4.7 shows the

state probabilities when Pw=.10, PBF_R = C/MN=.0003 and PBF_L ranges from .0003 to 1. As PBF_L

increases the steady state probability of being in the higher states also increases. Figure 4.8

shows the state probabilities when Pw=.10, PBF_L = .9 and PBF_R ranges from 0 to 1. As PBF_R

increases the steady state probability of being in the lower states also increases. Table 4.1

provides the average number of cache blocks that were written back at checkpoint time.
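A small numerical sketch of this calculation (Python; the parameter values below mirror the ones quoted above, but the code itself is only illustrative) builds the birth-death chain of (4.3) and (4.4), solves for the steady-state probabilities and evaluates (4.5):

    def mean_exclusive_blocks(N, C, Pw, P_bf_local, P_bf_remote):
        # States 0..C count the EXCLUSIVE blocks in one cache.
        # Upward rates follow (4.3), downward rates follow (4.4).
        lam = [Pw * (1.0 / N) * (1.0 - i / C) for i in range(C)]
        mu = [(1.0 - Pw) * (1.0 / N) * (1.0 - P_bf_local) * (i / C)
              + (1.0 - 1.0 / N) * P_bf_remote * (i / C) for i in range(1, C + 1)]

        # Birth-death steady state: S[i+1] = S[i] * lam[i] / mu[i], then normalize.
        S = [1.0]
        for i in range(C):
            S.append(S[i] * lam[i] / mu[i])
        total = sum(S)
        S = [s / total for s in S]

        # Equation (4.5): expected number of EXCLUSIVE cache blocks.
        return sum(i * s for i, s in enumerate(S))

    # 32 nodes, 64 cache blocks per node, Pw = .10; PBF_R = .0003 as in
    # Figure 4.7, with a hypothetical local hit probability PBF_L = .9.
    print(mean_exclusive_blocks(N=32, C=64, Pw=0.10,
                                P_bf_local=0.9, P_bf_remote=0.0003))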

[Figure 4.7: State Probabilities, Pw = .10 and PBF_R = .0003 (state probability versus state number; curves for PBF_L from .0003 to 1.0)]


[Figure 4.8: State Probabilities, Pw = .10 and PBF_L = .9 (state probability versus state number; curves for PBF_R from 0 to .1)]

Table 4.1: Average Number of Cache Blocks Written Back at Checkpoint Time

                      Modified Cache Blks
Case N01: N=.01       15
Case N01: N=.02       15.09
Case N01: N=.05       13.63
Case N01: N=.10       13.28
Case S01: S=.01       15.09
Case S01: S=.02       14.97
Case S01: S=.05       12.72
Case S01: S=.10       11.19
Case N10: N=.01       14.78
Case N10: N=.02       13.5
Case N10: N=.05       11.88
Case N10: N=.10       11.63
Case S10: S=.01       13.94
Case S10: S=.02       13.5
Case S10: S=.05       12.53
Case S10: S=.10       11.69
wthr_FT0              28.07
simp_FT0              17.93
speech_FT0            2.3


4.3 FT1 PROTOCOL

As shown previously for the FT0 protocol, the largest source of overhead from

checkpointing is the amount of time required to transfer the updates for the recovery memory to

the back-up node. The FT1 Protocol reduces the overhead associated with checkpointing by

eliminating the update message exchange during the recovery memory update phase of the

checkpoint creation process. Instead of exchanging large update messages at checkpoint time, the

backup node keeps a consistent backup copy of the primary node’s local memory in addition to

the backup copy of the node’s recovery memory. During the recovery memory update phase of

the checkpoint process, the backup copy of the recovery memory is updated from the

corresponding backup copy of local memory, requiring no additional network communication.

Node X keeps the backup copy of Node Y’s local memory consistent by receiving a copy of any

cache writeback messages to Node Y as they occur. This approach ensures that the backup node

has all the information necessary to update the backup recovery memory without relying on the

home node to explicitly send it at checkpoint time.

Figure 4.9 shows the organization of a node’s memory for the FT1 Protocol. Node X

contains its own local memory and recovery memory as well as a copy of Node Y’s local

memory and Node Y’s recovery memory. Node X is the Home node for data blocks located in its

own local memory (Nx_H). In addition, Node X is the Home2 node for data blocks located in the

copy of Node Y’s local memory, which resides on Node X (Nx_H2). Similarly Node Y contains

Home memory (Ny_H) and Home2 memory (Ny_H2) for data blocks located in the copy of Node

Z’s local memory, which resides on Node Y.


[Figure: memory organization of Nodes X, Y and Z; each node holds a cache, its own Home local memory and recovery data, plus a Home2 copy of the next node's local memory and that node's recovery data (regions Nx_H, Nx_H2, Ny_H, Ny_H2, Nz_H, Nz_H2).]

Figure 4.9: FT1 Memory Organization of SOME-Bus Node
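To make the organization concrete, the sketch below (Python, with illustrative field names that are not taken from the thesis) models the four memory regions a node holds under FT1 and the checkpoint-time step in which the backup recovery memory is refreshed from the local Home2 copy, which is the step that needs no network traffic.

```python
# Minimal sketch of the per-node memory regions in the FT1 organization of
# Figure 4.9. Field names are hypothetical; contents are modelled as dicts
# mapping block address -> value.

from dataclasses import dataclass, field

@dataclass
class FT1Node:
    home_memory: dict = field(default_factory=dict)      # Nx_H: blocks this node is Home for
    home2_memory: dict = field(default_factory=dict)     # Nx_H2: consistent copy of the backed-up node's memory
    recovery_self: dict = field(default_factory=dict)    # recovery data for this node
    recovery_backup: dict = field(default_factory=dict)  # backup of the other node's recovery memory

    def recovery_memory_update(self):
        # Recovery-memory update phase of a checkpoint under FT1: the backup
        # recovery memory is refreshed from the local Home2 copy, so no update
        # messages have to cross the network.
        self.recovery_backup.update(self.home2_memory)
```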

Data blocks in the Home memory can be in the state EXCLUSIVE, SHARED, or

UNOWNED but data blocks in the Home2 memory can only be in the state SHARED or

INVALID. In Figure 4.10, Node Y is Home for the data block DataB, and Node X is Home2 for

DataB. If DataB is in the state SHARED or UNOWNED in Ny_H, the copy of DataB in Nx_H2

will be in the state SHARED. If DataB is in the EXCLUSIVE state in Ny_H, then the copy of

DataB located in Nx_H2 will be in the state INVALID.
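This state rule is small enough to state as a single mapping; the sketch below (Python, with the states written as plain strings) is only an illustration of the invariant just described.

```python
# Sketch of the FT1 invariant relating a block's Home state to the state of
# its copy in Home2 memory (SHARED/UNOWNED at Home -> SHARED at Home2;
# EXCLUSIVE at Home -> INVALID at Home2).

def home2_state(home_state: str) -> str:
    """State the Home2 copy of a block must have, given the Home state."""
    if home_state in ("SHARED", "UNOWNED"):
        return "SHARED"       # Home2 holds a valid copy
    if home_state == "EXCLUSIVE":
        return "INVALID"      # a cache owns the block; the Home2 copy is stale
    raise ValueError(f"unknown Home state: {home_state}")

assert home2_state("UNOWNED") == "SHARED"
assert home2_state("EXCLUSIVE") == "INVALID"
```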

The FT1 protocol must keep the Home2 local memory consistent with the Home local

memory in order to guarantee the correct values of each data block are used to update the backup

recovery memory during the recovery memory update phase of the checkpoint process. Any

additional messages that must be sent in order to keep the Home2 local memory consistent cause

overhead in terms of additional network traffic during normal operation. In FT0, the overhead

due to fault tolerance occurs only at checkpoint time.


[Figure: Nodes X and Y with processes Px and Py; data block DataB resides in Node Y's Home memory (Ny_H) and in Node X's Home2 memory (Nx_H2).]

Figure 4.10: Local and Remote Requests for Data Block DataB

4.3.1 Home2 Consistency

In order to correctly update its copy of the Node Y recovery memory at checkpoint time,

Node X must dynamically keep its Home2 memory (Nx_H2) consistent with Node Y’s Home

memory (Ny_H). Initially Nx_H2 is an exact copy of Ny_H and all blocks in both Nx_H2 and

Ny_H are in the SHARED state. When an Ownership request for a block in Ny_H is received by

Node Y’s directory controller, an Invalidation message is sent to Node X and the corresponding

data block in Nx_H2 is changed from SHARED to INVALID and an Acknowledge message is

sent from Node X to Node Y. Upon receipt of the Invalidation-Ack message, Node Y changes the

state of the block in Ny_H to EXCLUSIVE and grants the requesting node ownership of the data

block.

Keeping Home2 consistent means that every time ownership of a block is requested, the

Home2 node must receive an Invalidation request along with any other nodes sharing the block.

The Invalidation request is multicast to all nodes with a copy of the block. Individual


Invalidation-Ack messages must then be collected from all recipients of the Invalidation request. The additional Invalidation-Ack message that must always be collected from the Home2 node increases the time the directory controller spends invalidating all nodes.

As long as the Invalidation-Ack messages are being collected from several other nodes,

the time spent collecting the additional one from Home2 is relatively small. The time necessary to

send the Invalidation request is equal to the sum of two variables: the waiting time at the channel

queue (wch) and the transfer time of the Invalidation message (ttrans). The time necessary to collect

a single Invalidation-Ack message is the sum of the waiting time at the cache input queue (wca), the service time at the remote cache controller (sca), the waiting time at the channel queue (wch), and the transfer time of the Invalidation-Ack message (ttrans).

Let X1,X2,...,XNinv be random variables representing the time required to collect

Invalidation-Ack messages from Ninv nodes that received the Invalidation request. The random

variable Xi is the time required to collect an Invalidation-Ack message from the ith node, where i = 1 to Ninv. Let Xi = wca(i) + sca(i) + wch(i) + ttrans(i).

Simulations show that sca and wca are negligible. The transfer time ttrans is constant and

small for Invalidation-Ack messages. Simulations indicate that wch can be approximated with an

exponential distribution. Let Tinv be the time to send an Invalidation request and collect all Ninv of the Invalidation-Ack messages, where Tinv = wch + ttrans + max(Xi : i = 1 to Ninv).

Let X1,X2,...,Xn be mutually independent, identically distributed continuous random variables, and let Y1,Y2,...,Yn be these variables arranged in increasing order, so that Y1 = min(X1,X2,...,Xn) and Yn = max(X1,X2,...,Xn). The random variable Yk is called the kth-order statistic [37]. If each Xi is exponentially distributed with parameter λ, the order statistic Yn-k+1 is hypoexponentially distributed [37] as shown in Equation (4.6). The density of Yn-k+1 is shown in Equation (4.7) and the means of Yn and Yn+1 are provided in Equation (4.8).


\[ Y_{n-k+1} \sim \mathrm{HYPO}\bigl(n\lambda,\ (n-1)\lambda,\ \ldots,\ k\lambda\bigr) \tag{4.6} \]

\[ f_{Y_{n-k+1}}(t) = \sum_{i=k}^{n} a_i\,\lambda_i\,e^{-\lambda_i t}, \qquad \text{where } \lambda_i = i\lambda \tag{4.7} \]

\[ \varphi_n = E[Y_n] = \sum_{i=1}^{n} \frac{1}{\lambda_i} = \sum_{i=1}^{n} \frac{1}{i\lambda}, \qquad \varphi_{n+1} = E[Y_{n+1}] = \sum_{i=1}^{n+1} \frac{1}{i\lambda} \tag{4.8} \]

Equation (4.9) shows that the increase in the mean of Y when one additional variable Xn+1 is added is 1/((n+1)λ), which becomes relatively small when n ≥ 3. This analysis indicates that

sending an Invalidation request to Home2 and collecting the Invalidation-Ack does not add

significantly to the latency of the corresponding ownership request unless the Home2

Invalidation-Ack is the only one being collected. In this case the entire time spent sending the

Invalidation request to Home2 and collecting the Invalidation-Ack directly adds to the latency of

the ownership request.

\[ \varphi_{n+1} - \varphi_n = \frac{1}{(n+1)\lambda} \tag{4.9} \]
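A quick numeric check of this argument follows directly from Equation (4.8). The sketch below (Python, with an arbitrary illustrative rate lam rather than a value measured on the SOME-Bus) prints the mean time to collect the slowest of n acknowledgements and the increment contributed by one more.

```python
# Numeric illustration of Equations (4.8)-(4.9): the mean of the maximum of n
# i.i.d. Exp(lam) delays is the harmonic sum (1/lam) * H_n, and adding one more
# acknowledgement (e.g. the Home2 Invalidation-Ack) adds only 1/((n+1)*lam).
# lam is an illustrative rate, not a measured parameter.

def mean_of_max(n: int, lam: float) -> float:
    """E[Y_n] for the maximum of n i.i.d. Exp(lam) random variables."""
    return sum(1.0 / (i * lam) for i in range(1, n + 1))

lam = 1.0
for n in range(1, 6):
    increase = mean_of_max(n + 1, lam) - mean_of_max(n, lam)
    print(f"n={n}: E[Y_n]={mean_of_max(n, lam):.3f}, "
          f"one extra ack adds {increase:.3f} (= 1/((n+1)*lam))")
# For n >= 3 the increment (0.25, 0.20, ...) is small relative to E[Y_n],
# matching the claim that the Home2 ack only matters when it is the sole ack.
```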

Assume that a data block DataB is located in the local memory of Node Y and in the Home2 memory of Node X, as shown in Figure 4.10. Figure 4.11 illustrates the message transfers

required when Node Z requests ownership of DataB, which is in the shared state in the cache of

Node W. Node Z sends an Ownership Request message to Node Y, which is the Home node for

DataB. The directory controller at Node Y finds that DataB is in the shared state and sends an

Invalidation message to Node W in order to invalidate the copy of DataB in Node W’s cache.

The Invalidation message is also sent to Node X so that the copy of DataB in the Home2 memory

can also be invalidated. The directory controller waits until the Invalidation-Ack messages are


received from both Node W and Node X and then sends the Ownership-Ack message to Node Z.

If DataB was not shared by any node when the directory controller at Node Y received the

Ownership request from Node Z, an Invalidation message would still be sent to Node X so that

the copy of DataB in the Home2 memory will be invalidated. In the FT0 protocol, no

invalidation messages would be sent in this case.

[Figure: Node Z sends an Ownership Request to Node Y; Node Y sends Invalidation Requests to Node W and to Node X (Home2); both return Invalidation Acknowledgements; Node Y then sends the Ownership Acknowledgement to Node Z.]

Figure 4.11: Ownership Request Message Transfer with FT1 Protocol

The Home2 node must receive a copy of any Writeback message sent to the Home node

so that it can update the data block in the Home2 memory with the most recent value and change

its state from INVALID to SHARED. The cache controllers multicast Writeback messages for a

data block to both the Home and Home2 nodes. In the SOME-Bus, a message can be multicast by

having the source node specify the two destinations (Home Node and Home2 Node) in the

message header. Alternatively, the address filters at the SOME-Bus receivers can be programmed

to accept messages of certain types even when destined for another (specific) node. The second solution is more hardware-intensive but avoids increasing the message header size; the first solution, however, provides more flexibility in terms of the ability to change the node that will be designated the Home2 node.
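The receiver-side effect of the two options can be sketched as follows; the field names and types below are hypothetical simplifications, not part of the SOME-Bus specification.

```python
# Sketch of the two writeback-multicast options discussed above: (1) the source
# names both Home and Home2 in the message header, or (2) the Home2 receiver's
# address filter is programmed to accept certain message types addressed to
# another node. All names are illustrative.

from dataclasses import dataclass, field
from typing import List, Set

@dataclass
class Message:
    msg_type: str             # e.g. "VictimWB", "InvalidateWB", "DowngradeWB"
    destinations: List[int]   # option 1: both Home and Home2 ids in the header

@dataclass
class ReceiverFilter:
    node_id: int
    promiscuous_types: Set[str] = field(default_factory=set)  # option 2

    def accepts(self, msg: Message) -> bool:
        return (self.node_id in msg.destinations            # explicitly addressed to us
                or msg.msg_type in self.promiscuous_types)  # or snooped by a programmed filter

# Option 1: the writeback names Home (node 5) and Home2 (node 0) in its header.
wb = Message("VictimWB", destinations=[5, 0])
assert ReceiverFilter(node_id=0).accepts(wb)

# Option 2: Home2 programs its filter for writeback types; the header only
# needs to name the Home node, at the cost of extra filter hardware.
home2 = ReceiverFilter(node_id=0,
                       promiscuous_types={"VictimWB", "InvalidateWB", "DowngradeWB"})
assert home2.accepts(Message("InvalidateWB", destinations=[5]))
```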


In the SOME-Bus, each node has a separate receiver and associated input queue for every

channel in the system. All messages sent by a node are received by other nodes in the order in

which they were sent and are stored in the associated input queue. The order in which the

processor sees messages sent from different nodes is determined by the order in which messages

are removed from the set of receiver queues. Since there are multiple queues at each node input,

the arrival order of messages in different input queues cannot be determined by simple inspection

of the corresponding positions in the queues. Consequently, such a simple inspection cannot

determine the last value written to a data block whose ownership changes between several nodes.

A data block in the Home2 memory changes from state SHARED to INVALID when

Home2 receives an Invalidation message from the Home node. In order for Home2 to change the

state of the data block from INVALID to SHARED, it must be able to determine 1) when the data

block changes to the SHARED state in the Home node and 2) which Writeback message in the

Home2 input queues corresponds to the most recently written value for the data block.

Ownership Acknowledge messages are multicast from the Home node to the requesting

node and to Home2 in order to allow Home2 directory controller to keep track of the transfer of

ownership of a data block, and therefore to determine the most recently written value for that data

block. Anytime there is a transfer of ownership of a data block, there must be a corresponding

InvalidateWB or VictimWB message in which the modified value of the data block is written

back to main memory. Similarly when a block changes from EXCLUSIVE to SHARED, the

resulting Data Acknowledge message can be matched with the Downgrade writeback message.

Since all messages originating from the Home node are received by the Home2 node in the same

order in which they were transmitted, Home2 determines the correct order of the InvalidateWB,

VictimWB or DowngradeWB messages received from any of the nodes by examining the

sequence of Data-Ack and Ownership-Ack messages in the receiver queue associated with the

Home node.


Home2 must receive a copy of all coherence-related messages so that it knows the order

in which the Data-Ack and Ownership-Ack messages were sent by Home. The FT1 protocol

does not permit Ownership requests to be filled locally because there would be no Invalidation

message (or subsequent writeback message) sent to Home2 and no Ownership-Ack message

appearing in the Home2 node for the corresponding change of ownership. Under the FT1 approach, every Ownership request must result in a multicast to Home2 of the Ownership-Ack message, as well as of the InvalidationWB, DowngradeWB or VictimWB message when the cache eventually gives up ownership of the data.

Writeback messages are sent by the current owner of a block for two reasons. The first

reason is that another node requests access to the block, either read access or ownership. The

second reason is that a cache may choose to replace a block that is in EXCLUSIVE state. In

this case, the cache must send a copy of that block to the Home node.

If node A wishes to access a block for which node Z is the owner, node A sends an

Ownership or Data request to the Home node Y, and node Y sends an Invalidation or Downgrade

request to the current owner Z. Node Z responds with an Invalidation-Ack or Downgrade-Ack

message to Home node Y. The Home node Y then sends an Ownership-Ack message to the new

owner node A.

Figures 4.12a and 4.12b show an example where three nodes (A, B and C) take turns in

writing and reading a block whose Home is node H (and Home2 is node H2). Figure 4.13 shows

the messages that have arrived at the corresponding input queues of Home2 node H2 under the

assumption that H2 has been busy serving other queues during the time that those messages

arrived.


[Figure: message time line in which nodes A and B successively acquire ownership of a block whose Home is node H and Home2 is node H2, showing the Invalidation requests, the Invalidation-Ack and InvalidationWB-Ack responses, and the Ownership-Ack messages multicast to H2 at each transfer.]

Figure 4.12a: Message Time Line


[Figure: continuation of the time line; node C acquires ownership, then node A issues a Data request, causing a Downgrade request to C, a DowngradeWB-Ack to H and H2, and a Data-Ack multicast to A and H2.]

Figure 4.12b: Message Time Line (continued)

As the figures show, nodes A, B, and C have acquired ownership of the block and have

sent writeback messages to node H. All the relevant messages have also been received by node

H2. (In the following, the notation Qn is used to indicate the queue in node H2 that receives

messages from node n). As Figure 4.13 shows, three writeback messages destined for the same

memory block can be found in queues QA, QB and QC. Since these queues are distinct, it is not

possible to distinguish which writeback message arrived last just by examining the contents of

these three queues. The messages in queue QH must be used to determine the proper order. As

Figure 4.13 shows, the writeback messages can be matched with the corresponding Invalidation

Request message or Downgrade request messages sent by Home to the owner node at different


times. Since all messages in queue QH originate at the same node, they are stored in-order and

can be used to determine the order of the writeback messages. While there are writeback

messages in any input queue in Home2, its directory controller examines queue QH and dequeues

either an Invalidation or a Downgrade request message, uses the information contained in that

message to determine the queue where the corresponding writeback message is stored and

dequeues it. The Home2 directory controller extracts the data from the writeback message and

writes it in the corresponding block in the Home2 memory.

[Figure: receiver queues at the Home2 node H2, one per sending node; queues for A, B and C hold their writeback messages, while queue QH holds the Invalidation/Downgrade requests and Ownership/Data acknowledgements sent by H, in transmission order.]

Figure 4.13: Messages at the Receiver Queues of the Home2 Node
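The ordering procedure just described can be sketched compactly. The code below (Python) is an illustrative simplification with hypothetical message fields: it assumes the matching writeback is at the head of the owning node's queue and ignores unrelated traffic, but it follows the rule of using QH as the ordering reference.

```python
# Sketch of the Home2 ordering procedure: dequeue Invalidation/Downgrade
# requests from QH (the queue fed by the Home node), use each one to locate the
# queue holding the matching writeback, dequeue it and apply its data. Message
# dictionaries and queue handling are hypothetical simplifications.

from collections import deque

def apply_writebacks(q_home: deque, per_node_queues: dict, home2_memory: dict):
    """Drain pending writebacks in the order implied by Home's request stream."""
    while q_home:
        req = q_home.popleft()
        if req["type"] not in ("INV_REQ", "DOWNGRADE_REQ"):
            continue                                  # acknowledgements handled elsewhere
        owner_q = per_node_queues[req["target"]]      # queue of the node Home addressed
        wb = owner_q.popleft()                        # matching writeback from that node
        home2_memory[wb["block"]] = wb["data"]        # apply value; block becomes SHARED

# Example: Home asked node A, then node B, to give up the same block.
qH = deque([{"type": "INV_REQ", "target": "A", "block": 7},
            {"type": "INV_REQ", "target": "B", "block": 7}])
queues = {"A": deque([{"block": 7, "data": "value_from_A"}]),
          "B": deque([{"block": 7, "data": "value_from_B"}])}
mem = {}
apply_writebacks(qH, queues, mem)
assert mem[7] == "value_from_B"    # B's writeback is the most recent value
```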

Writeback messages are also generated because a cache may choose to replace a block that

is in EXCLUSIVE state. Then the cache must send a copy of that block to the Home node. Such a

message is called victim-writeback. The existence of victim-writeback messages presents a

complication to the procedure described above, because victim-writeback messages are not

generated as a response to a message from the Home node, and therefore the directory in Home2

cannot directly use the Invalidation or Downgrade Request messages in queue QH to determine

the correct arrival order. When a node (say node Z) generates a victim-writeback message it must

have already acquired ownership of that block in one of two ways: either 1) the block was in


SHARED state, in which case Home sent an Invalidation request, collected the Invalidation-Ack

messages, and replied to node Z with an Ownership-Ack message, or 2) the block was in

EXCLUSIVE state, in which case the previous owner of the block was notified by the Home

node to return an InvalidateWB-Ack message after which Home replied to node Z with an

Ownership-Ack message.

To determine the proper order, the Home2 directory controller operates in the following way when it encounters a victim-writeback message from node Z in queue QZ: it locates in QH the matching Ownership-Ack message from the Home node to node Z. If the subsequent acknowledgment message in QH (either Data-Ack or Ownership-Ack) is found without an intervening Invalidation or Downgrade request to node Z, then the VictimWB occurred in between these two acknowledgement messages. It should be noted that a VictimWB message can be sent before an Invalidation or Downgrade message that is already en route from Home has been received by node Z. In this case, the VictimWB message takes the place of the expected InvalidationWB or DowngradeWB message in QH.
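That placement rule can be expressed as a scan of QH starting at node Z's Ownership-Ack; the sketch below (Python, hypothetical message fields) is only an illustration of the rule as described above, not the controller implementation.

```python
# Hypothetical sketch of the victim-writeback placement rule: a VictimWB from
# node Z is ordered after Z's Ownership-Ack in QH and before the next
# acknowledgement, unless a request to Z intervenes, in which case the VictimWB
# stands in for the writeback that request expected.

def victim_wb_slot(q_home: list, z: str) -> int:
    """Index in q_home after which node Z's VictimWB is ordered."""
    # Z must have acquired ownership earlier, so its Ownership-Ack is in QH.
    start = next(i for i, m in enumerate(q_home)
                 if m["type"] == "OWN_ACK" and m["to"] == z)
    for i in range(start + 1, len(q_home)):
        m = q_home[i]
        if m["type"] in ("INV_REQ", "DOWNGRADE_REQ") and m["target"] == z:
            # The VictimWB crossed a request already en route from Home and
            # takes the place of the expected InvalidateWB/DowngradeWB.
            return i
        if m["type"] in ("DATA_ACK", "OWN_ACK"):
            # No intervening request to Z: the VictimWB happened between Z's
            # Ownership-Ack and this acknowledgement.
            return i
    return len(q_home)   # no later acknowledgement yet; VictimWB is most recent
```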

4.3.2 Using Home2 Memory to Fill Read Requests

The data replication that occurs as a result of keeping the Home2 local memory

consistent can be used to enhance the performance of the DSM system. This performance

enhancement is achieved by allowing a Data request to be filled from either the Home or Home2 memory. When the Home2 memory is used to fill a Data request, the network traffic and service latency for the data request are reduced since the request is filled from the local memory rather

than from memory on a remote node. As long as a block in Nx_H2 is in the SHARED state, it can

be used to fill a read request from a process at Node X locally.

An Ownership request, however, cannot be serviced locally under the FT1 Protocol.

Consider Process Py from Figure 4.10, running on Node Y. Py issues a write request for DataB

that misses in the cache. The FT0 protocol allows the request to be serviced locally if the block is


in the UNOWNED state since Node Y is Home for DataB, traversing Path S in Figure 3.14. Since

no other node has a copy of the block, the transfer of ownership to Node Y does not require

messages to be sent to other nodes. If the block was either in the state SHARED or EXCLUSIVE,

an Invalidation message would be sent to the nodes that are sharing the data block before

ownership of the block could be given to Py. In the FT1 protocol, the Home2 node (in this case

Node X) will always have a copy of the data block and must receive the Invalidation message.

The Decision Tree described previously for FT0 requires changes to the network cost for

several paths as well as structural changes to the tree itself. Figure 4.14 shows the Decision Tree

for Read Requests under the FT1 protocol. The network cost information is provided for each

individual message transfer as well as a total cost for each tree path. The parameters used in the

network cost values include SH, the cost of sending a message without data, SD, the cost of sending

a message containing a data block, and Ninv, the average number of nodes found in the copyset

when an ownership request is received for a data block in the SHARED state.

A new path, G, is added to the Decision Tree shown in Figure 4.14. This path is taken

when the requested data block is found in the SHARED state in the Home2 memory of the

requesting node and can be used to fill the request locally. Path G provides a 0 cost path for

requests that can be filled from the Home2 memory, in the FT0 protocol, these requests would

normally have been associated with path D. In this case FT1 provides a savings of (SH + SD) over

the FT0 protocol. Other changes in network cost between the FT0 and FT1 protocols include an

increase of SD to Path B, because the Local Data Acknowledge message of FT0 is remote in FT1

because it must be multicast to the requesting node and to the Home2 node. The network cost of

Path E is also increased by SD because the Local Downgrade Writeback message of FT0 is now

remote in FT1, because it must be multicast to both the Home and Home2 nodes.


[Figure: decision tree for a read reference under FT1, showing the message sequence and per-message network cost along each path A-G, including the new path G in which a Home2 read miss is filled locally at zero network cost.]

Figure 4.14: Decision Tree for FT1 Read Request


Figure 4.15 shows the modification to the Decision Tree for Write Requests that Miss in

the cache under the FT1 protocol. In all paths M-S, an Invalidation message must be sent to the

Home2 node and the resulting Invalidation-Ack message must be collected. The additional

Invalidation-Ack message adds SH to every path in the tree. Since Invalidation messages are

multicast, the only paths which will have an added network cost for sending an extra Invalidation

message to the Home2 node are path O and path S, the ones that would not have resulted in any

Invalidation messages under the FT0 protocol. The Invalidation messages have a network cost of

SH. In paths M, N and R, the block for which ownership is being requested is in the state

EXCLUSIVE. In this case, an InvalidationWB-Ack will be multicast from the current owner to

both Home and Home2 in response to the Invalidation message. The writeback message would

have occurred locally in path M for FT0, but is remote in the FT1 protocol and adds a network

cost of SD to path M. In all paths M-S, the Ownership-Ack message must be multicast to the

requesting node and the Home2 node. This requirement adds SD to the network cost of Path Q, R

and S.

Figure 4.16 provides the FT1 protocol Decision Tree for a write request that is found in

the cache. Paths I-L result in upgrade requests for the data block. In paths I and J, the data block

is located in the local memory of the requesting node resulting in a local Upgrade request and

local Ownership Acknowledge under the FT0 protocol. In the FT1 protocol a copy of the

Ownership Acknowledge must be sent to the Home2 node. Since it is not necessary to send the

data block itself, only the notification of the Ownership Acknowledge, the additional message to

the Home2 node increases the network cost of these paths by SH. Paths I and K contain an

Invalidation message that is sent to Home2 for a data block in which the only copy of the data

was the one that was about to be upgraded. The additional network cost for the FT1 protocol for

these paths is (2 * SH) for the Invalidation message and the Invalidation-Ack message. In Paths J,

and L, an Invalidation message is multicast to all nodes on the copyset for the data block (Ninv


nodes). In this case the additional network cost for the FT1 protocol is SH due to the additional

Invalidation-Ack message that must be collected from the Home2 node.

[Figure: decision tree for a write reference that misses in the cache under FT1, showing the message sequence and total network cost for paths M-S, each of which now includes an Invalidation request to Home2 and the corresponding Invalidation-Ack.]

Figure 4.15: Decision Tree for FT1 Write Miss


[Figure: decision tree for a write reference that hits in the cache under FT1, showing the upgrade paths H-L and the additional Invalidation and Ownership-Ack messages sent to Home2.]

Figure 4.16: Decision Tree for FT1 Write Hit


Table 4.2 summarizes the differences in the network cost associated with all Decision Tree Paths A-S for FT0 and FT1. As mentioned previously, SH is the network cost of a message without data, SD is the network cost of a message with one data block, and Ninv is the number of nodes found on the copyset when an ownership request is made for a block in the SHARED state. The

paths that have the largest difference in network cost between FT1 and FT0 are paths B, E, G, M,

R and S. FT0 does not have a path G. Any request that would be serviced via path G would have

been serviced via path D in FT0. For this reason FT1 has a benefit for path G of (SH + SD). The remaining paths incur additional network cost under the FT1 protocol.

Table 4.2: Comparison of the Network Cost of the Tree Paths for FT0 and FT1

Path  FT0 network cost        FT1 network cost         Difference
A     0                       0
B     SH + SD                 SH + (2 * SD)            SD
C     0                       0
D     SH + SD                 SH + SD
E     SH + SD                 SH + (2 * SD)            SD
F     2 * (SH + SD)           2 * (SH + SD)
G     N/A                     0                        SH + SD
H     0                       0
I     0                       3 * SH                   3 * SH
J     SH * (1 + Ninv)         SH * (Ninv + 3)          2 * SH
K     2 * SH                  4 * SH                   2 * SH
L     SH * (3 + Ninv)         SH * (Ninv + 4)          SH
M     SH + SD                 (2 * SD) + (3 * SH)      SD + (2 * SH)
N     2 * (SH + SD)           (2 * SD) + (3 * SH)      SH
O     SH + SD                 (3 * SH) + SD            2 * SH
P     SD + (SH * (2 + Ninv))  SD + (SH * (Ninv + 3))   SH
Q     SH * (1 + Ninv)         SD + (SH * (Ninv + 2))   SH
R     SH + SD                 (2 * SD) + (2 * SH)      SH + SD
S     0                       SD + (2 * SH)            SD + (2 * SH)
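Taking the message sizes used later in the simulations (SH = 10 bytes for a header-only message, SD = 74 bytes for a message carrying one data block) as illustrative inputs, the sketch below evaluates a few rows of Table 4.2. The path formulas are transcribed from the table; the copyset size Ninv is an arbitrary assumption.

```python
# Sketch comparing FT0 and FT1 network cost for a few tree paths from Table 4.2.
# SH and SD are the header-only and data-carrying message costs; Ninv is an
# assumed copyset size for illustration only.

def cost_ft0(path, SH, SD, Ninv):
    return {"B": SH + SD, "G": SH + SD,        # a path-G request is path D under FT0
            "J": SH * (1 + Ninv),
            "M": SH + SD, "O": SH + SD, "S": 0}[path]

def cost_ft1(path, SH, SD, Ninv):
    return {"B": SH + 2 * SD, "G": 0,
            "J": SH * (Ninv + 3),
            "M": 2 * SD + 3 * SH, "O": 3 * SH + SD, "S": SD + 2 * SH}[path]

SH, SD, Ninv = 10, 74, 2
for p in ("B", "G", "J", "M", "O", "S"):
    diff = cost_ft1(p, SH, SD, Ninv) - cost_ft0(p, SH, SD, Ninv)
    print(f"path {p}: FT1 - FT0 = {diff:+} bytes")
# Path G is the only path where FT1 is cheaper (-(SH + SD) = -84 bytes here);
# the other paths pay a modest per-miss penalty for keeping Home2 consistent.
```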


The FT1 protocol removes the burst of network traffic related to updating the backup

Recovery Memory that occurs in the FT0 protocol, but produces overhead during normal

operation unlike the FT0 protocol. This overhead is due to the network load caused by the

necessity of sending additional messages to the Home2 node including: Invalidation requests,

writeback (InvalidationWB, DowngradeWB, or VictimWB) messages, and Ownership-Ack and Data-Ack messages. For every Ownership request, an Invalidation-Ack message must be collected

from Home2 in addition to any other Invalidation-Ack messages collected from nodes on the

copyset.

4.3.3 Performance

In Chapter 3, a set of synthetically generated applications was described and compared to

each other in terms of performance for a basic DSM system that directly relates to the FT0

protocol during normal operation (no checkpointing). Differences in application behavior were

obtained by varying three parameters L, N and S.

1. Parameter L is the probability that the memory reference will reside in the node’s

local memory.

2. Parameter N is the probability that the reference will be to the local memory of the

node for which the requesting node has a backup copy. (Node X is Home for DataB,

Node Y is Home2 for DataB, N is the probability that Node Y will choose a memory

reference that resides in the local memory of Node X).

3. Parameter S is the probability that the reference belongs to a remote node’s memory

space and has been accessed by another thread in the system some time previously.

4. The probability that the reference will be one of the blocks already in the cache is 1 - (L+N+S). (A sketch of how these parameters drive a synthetic reference stream follows this list.)
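The sketch below (Python) shows how the L, N and S probabilities could select the target of each synthetic reference; the region names and the uniform random draw are illustrative simplifications, not the thesis trace generator.

```python
# Minimal sketch of a synthetic reference stream driven by the parameters
# L, N and S described above. Region names are illustrative only.

import random

def next_reference_target(L: float, N: float, S: float) -> str:
    """Pick the memory region a synthetic reference falls in."""
    r = random.random()
    if r < L:
        return "local"          # block lives in this node's own (Home) memory
    if r < L + N:
        return "home2"          # block lives in the memory this node backs up
    if r < L + N + S:
        return "shared-remote"  # remote block already touched by another thread
    return "cache"              # probability 1 - (L + N + S): block already cached

# Case N10 example values from the text: L = 0.10, S = 0.02, N in [0.01, 0.10].
counts = {}
for _ in range(10_000):
    t = next_reference_target(L=0.10, N=0.05, S=0.02)
    counts[t] = counts.get(t, 0) + 1
print(counts)   # roughly 10% local, 5% home2, 2% shared-remote, 83% cache
```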


Applications of interest can be described by the following cases:

1. Threads tend to access arbitrary blocks at the Home2 node and the current node. This

case indicates a smaller degree of locality within the current node and a smaller

degree of sharing between threads. The parameters of interest are L=0.01, S=0.01 and

N=[0.01,...,0.10]. (Case N01).

2. Threads tend to access arbitrary blocks within the Home2 memory of a node. Within

the current node there is a larger degree of locality. The parameters of interest are

L=0.10, S=0.02 and N=[0.01,...,0.10]. (Case N10).

3. Threads tend to access blocks at any remote node that have already been accessed by

another thread. Within the current node, threads tend to access arbitrary blocks. This

case indicates a smaller degree of locality within the current node but a larger degree

of sharing between threads. The parameters of interest are L=0.01, N=0.01 and

S=[0.01,...,0.10]. (Case S01).

4. Threads tend to access blocks at any remote node that have already been accessed by

another thread. Within the current node there is a larger degree of locality. The

parameters of interest are L=0.10, N=0.02 and S=[0.01,...,0.10]. (Case S10).

The main advantage FT1 has over the FT0 protocol during normal operation is the ability

to fill Data requests from the Home2 memory. The main disadvantage of FT1 protocol during

normal operation is the fact that messages that would occur locally (with a network cost of 0) in

the FT0 protocol, must be multicast to Home2 and therefore become remote (with a network cost

> 0) in the FT1 protocol. The synthetic applications described above were simulated for each

protocol, FT0 and FT1 in order to compare their performance over a range of application types.

These experiments assume that the size of a message header (SH) is 10 bytes and the size of a

message with one data block (SD) is 74 bytes.


Table 4.3 shows the percentage of cache misses that traverse the specified Tree path(s) for the Case N01, Case N10, Case S01 and Case S10 sets of applications. The Tree paths included are those through which at least .1% of the cache misses are handled. If fewer than .1% of the cache misses use a particular path, the path was not included in the table. Tree paths G, O and S

are the paths that are used more frequently and have a difference in network cost between FT0

and FT1 as shown in Table 4.2.

Table 4.3: Distribution of Tree Path Usage for Cache Misses (columns B, C, D, F, G, L, N, O, P, S)

Case N01: N=.01   29.82%  28.23%  1.09%  30.65%  5.96%  0.76%  3.36%
Case N01: N=.02   22.65%  21.12%  0.69%  45.58%  6.86%  0.42%  2.53%
Case N01: N=.05   12.55%  12.21%  0.27%  64.82%  8.56%  0.16%  1.40%
Case N01: N=.10    7.52%   7.15%  0.14%  75.23%  9.15%  0.78%
Case N10: N=.01   22.65%  21.12%  0.69%  45.58%  6.86%  0.42%  2.53%
Case N10: N=.02   18.54%  32.97%  1.26%  37.06%  0.18%  6.85%  0.96%  2.01%
Case N10: N=.05   11.60%  50.70%  2.95%  24.52%  0.32%  6.61%  2.04%  1.19%
Case N10: N=.10    7.12%  62.46%  4.22%  16.09%  0.11%  0.45%  5.89%  2.97%  0.77%
Case S01: S=.01   69.16%  13.21%  0.29%   7.27%  2.14%  0.15%  7.65%
Case S01: S=.02    0.10%  64.44%  11.94%  0.27%  13.29% 2.70%  0.12%  7.16%
Case S01: S=.05    0.11%  52.99%  10.05%  0.23%  26.59% 4.10%  5.91%
Case S01: S=.10   40.88%   7.93%  0.13%  41.00%  5.45%  4.58%
Case S10: S=.01    0.13%  69.11%  6.61%   0.19%  14.16% 2.24%  7.55%
Case S10: S=.02    0.10%  64.44%  11.94%  0.27%  13.29% 2.70%  0.12%  7.16%
Case S10: S=.05   52.86%  24.99%  0.52%  11.43%  3.89%  0.29%  5.90%
Case S10: S=.10   41.04%  38.76%  0.82%   9.47%  4.83%  0.47%  4.49%

Figures 4.17 and 4.18 compare the processor utilization and the execution time for the

FT0 and FT1 protocols for Case N01. In each of the four Case N01 applications (N=.01, N=.02,

N=.05 and N=.10), FT1 provides better performance than FT0. As the N parameter increases, the

difference between the FT0 and FT1 performance also increases. The reason for this in that in the

N01 applications, there is a larger probability that the thread will access a data block within its

Home2 memory. As the probability that the address is located in the Home2 memory increases,

so does the probability that the Home2 memory will be able to fill a read request locally.


[Figure: processor utilization for FT0 and FT1 at N = .01, .02, .05 and .10.]

Figure 4.17: Processor Utilization, Case N01

[Figure: execution time in simulation clock cycles for FT0 and FT1 at N = .01, .02, .05 and .10.]

Figure 4.18: Execution Time, Case N01


Figures 4.19 and 4.20 compare the processor utilization and the execution time for the

FT0 and FT1 protocols for Case N10. In the N10 applications, there is a larger probability that

the thread will access a data block within its own local memory. The FT0 protocol can support both local read requests and local ownership requests; the FT1 protocol, however, does not support local ownership requests.

When the N parameter is .01 and .02, the FT0 protocol performs better due to the higher

percentage of local memory references. As the N parameter is increased to .05 and .10, the FT1

algorithm performs better due to the higher percentage of memory references that fall within the

requesting node’s Home2 memory.

[Figure: processor utilization for FT0 and FT1 at N = .01, .02, .05 and .10.]

Figure 4.19: Processor Utilization, Case N10


[Figure: execution time in simulation clock cycles for FT0 and FT1 at N = .01, .02, .05 and .10.]

Figure 4.20: Execution Time, Case N10

Figures 4.21 and 4.22 compare the processor utilization and the execution time for the

FT0 and FT1 protocols for Case S01. The Case S01 applications have a low probability that the

memory reference will be to the local memory and a higher probability that the memory reference

will be remote and to a data block that has been previously accessed. For this class of

applications, the FT1 algorithm performs better than the FT0 algorithm due to the ability to fill

read requests with the Home2 memory. The difference in performance is not as large as it is for

the Case N01 applications because the percentage of references that fall within the Home2

memory is not as high.


[Figure: processor utilization for FT0 and FT1 at S = .01, .02, .05 and .10.]

Figure 4.21: Processor Utilization, Case S01

[Figure: comparison of FT0 and FT1 at S = .01, .02, .05 and .10.]

Figure 4.22: Execution Time, Case S01


Figures 4.23 and 4.24 compare the processor utilization and the execution time for the

FT0 and FT1 protocols for Case S10. In the S10 applications, there is a larger probability that the

thread will access a data block within its own local memory. For this class of applications, the

performance of the FT0 and FT1 algorithms is approximately the same. The advantage that the FT0 algorithm has for local references balances the advantage that the FT1 algorithm has for the references that fall within the Home2 memory in this application class.

[Figure: processor utilization for FT0 and FT1 at S = .01, .02, .05 and .10.]

Figure 4.23: Processor Utilization, Case S10


[Figure: execution time in simulation clock cycles for FT0 and FT1 at S = .01, .02, .05 and .10.]

Figure 4.24: Execution Time, Case S10

4.3.4 Implementation Issues

There are several possibilities for the implementation of the node structure for the FT1

approach. Separate directory controllers could be used for the Home and Home2 memory within

a node. Alternatively there could be a single directory controller whose capabilities are extended

so that it can maintain both Home and Home2 memories. One argument for having a single

directory controller is that a node would have one physical memory that would be logically

divided into sections serving as Home, Home2 and recovery memory sections. In this situation,

two separate controllers would have to coordinate access to the memory bus and therefore one

controller would often be waiting for the other to finish using the memory bus.

The Home directory controller receives Data or Ownership requests from the cache

controllers and issues Invalidation or Downgrade request messages if necessary, waits for

acknowledgements, and then sends the Data-Ack or Ownership-Ack messages (usually along


with the data block being requested). The Home2 directory controller receives and processes any

messages necessary to keep the Home2 copy of the memory consistent with the Home memory.

The performance of the Home directory controller directly impacts the performance of

the multiprocessor system in terms of the service latency of the Data or Ownership request. The

service latency is the length of time between the issue of the request until the receipt of the reply.

When servicing an Ownership request, the directory may have to issue a number of Invalidation

messages and wait for the resulting acknowledgments. If the acknowledgments are not received

and processed in a timely manner, the cache controller must wait longer before receiving the data.

Longer service latencies can cause the processor to spend more time in an idle state waiting for

the data it requires to continue processing.

The response time of the Home2 directory controller for handling incoming messages is

important in two cases. The first case is the ability to allow a read miss from the local cache to be

filled with a block from the Home2 memory if it is in the SHARED state. The advantage of the

Home2 memory providing the data block to the cache is the reduction in service latency for the

read request since messages do not have to be exchanged with the Home node. The performance

of the Home2 directory controller also affects the amount of time required to determine the most

recent value written to a data block when it comes time to create a new checkpoint. The most

recent value written to the data block must be determined before the recovery memory can be

updated.

The speed with which the Home messages are processed will impact the performance of

the system. The processing of the Home2 messages, however, is not as time-critical. If there were

two directory controllers (Home and Home2) and two input queues, the Home directory

controller could be given higher priority for the memory bus and therefore continue to process

messages with no reduction in performance. The Home2 directory controller could process its

messages when it doesn’t interfere with the Home directory controller. This actually leaves a

reasonable amount of time since the home directory spends time waiting for acknowledgments


before reading or writing to the memory. The performance degradation resulting from the increased service latencies of a single controller performing both the Home and Home2 directory functions may warrant having separate controllers and possibly separate queues. In the simulations, it is assumed that there are separate Home and Home2 directory controllers and associated input queues.

4.4 FT2 PROTOCOL

Although the FT1 Protocol outperforms the FT0 protocol when a large number of Data

Requests can be filled by the Home2 memory, in cases where this is not possible, the FT1

Protocol exhibits an increasing loss in performance as the degree of spatial locality in the

application increases. This loss in performance is due to the fact that Ownership/Upgrade requests

cannot be performed locally because every write request must cause an Invalidation message to

be sent to the Home2 node, even if no other node has a copy of the data.

The performance loss for the FT1 Protocol is not merely due to the additional network

traffic, but also to the increase in the service latency of an ownership request. The increase in

service latency is due to the necessity of invalidating the data block in the Home2 node and then

waiting for the acknowledge to be received by the Home node. When the only Invalidation

message required is the one to Home2, the delay is significant because in FT0 there would be no

Invalidation messages and the Ownership-Ack message would be sent immediately.

If the ownership request is for a data block in either the UNOWNED state or the

SHARED state where the only sharer is the node requesting the ownership, the FT2 protocol

allows the Home node to send an Invalidation message to Home2 and then grant the requesting

node ownership of the data block, without waiting for the Invalidation-Ack message from

Home2. In either of these cases, Home2 is the only node that must be invalidated before the

ownership is transferred. If the data block was in the EXCLUSIVE state or in the SHARED state

with multiple sharers, invalidations must be sent to nodes other than the Home2 node and the


resulting Acknowledgements must be collected in order to guarantee sequential consistency.

With this optimization, the FT2 protocol decreases the service latency for Ownership or Upgrade

requests for a block that is either unowned or shared only by the node requesting ownership for

the block.

The FT2 protocol awards ownership to a node without waiting for the Acknowledgment

from Home2. In order to ensure that sequential consistency is not violated, three orderings of the set of interactions between the Home and Home2 nodes must be considered. In [8] a sequentially consistent execution is defined to be one in which the result that is produced is the same as one obtained by any of the possible interleavings of operations that preserve program order within

a single process.

Referring to Figure 4.25, assume that both Node X and Node Y have a copy of DataB in

the SHARED state in their respective caches. Process Px issues a write access to DataB. In this

case, the cache controller on Node X will send an Ownership request to Node Y and wait for the

corresponding Ownership-Ack before the write operation completes. The Home directory (on

Node Y) will cause the local cache (on Node Y) to invalidate the block and then send an

Invalidation request message to the Home2 directory (on Node X) and send the Ownership-Ack

to Node X. This scenario poses no problems for the FT2 protocol: the cache controller on Node X is waiting for the data block to come from Home and will not attempt to access the Home2 value in the meantime, so sequential consistency is maintained.

Assume that caches on both Node X and Node Y again have DataB in the SHARED

state. Process Py issues a write access to DataB. In this case, the Home directory controller (on

Node Y) changes the state of the block to a transient “busy” state just as it normally does when it

is waiting to gather Invalidation-Ack or Downgrade-Ack messages, sends an Invalidation request

to the Home2 directory (on Node X) and immediately grants ownership of the block to the Node

Y cache. If another request for the data block reaches the Home node while the block is in the

“busy” state, the request is placed on the waiting queue. When the Home node receives the


Invalidation-Ack message from Home2, the state of the block is changed to EXCLUSIVE and the

Home directory controller can process any requests for the block that are on the waiting queue.

Process Py is granted ownership immediately and may begin writing to the Data block before the

Invalidation-Ack message is received from the Home2 node. Whether the Home directory

controller waits for the acknowledge message from Home2 before granting the ownership or not,

the Home2 node will eventually receive the invalidation and stop reading from DataB in its

cache. The Home2 node will then invalidate the cache and send the Invalidation-Ack message

back to the Home node. The resulting global order of operations preserves the program order of

each of the processes and is a valid sequential interleaving of the program orders of all processes,

so sequential consistency is not violated.

[Figure: Nodes X and Y with processes Px and Py, both holding DataB; Node Y is Home and Node X is Home2 for the block.]

Figure 4.25: Ownership Request Scenarios


As before, assume that caches on both Node X and Node Y have DataB in the SHARED

state. This time both process Py and process Px decide to write to DataB at the same time. The

Home directory will see the request from Py first. Since only Home2 has a copy of the data block,

Py is allowed to proceed with the write while DataB is placed in the “busy” state. If Px issues a

write access to DataB after it receives the Invalidation message from Home, the effect is the same

as the previous case that was shown to preserve sequential consistency.

If Px issues the write access to DataB before it receives the Invalidation message from

Home, it sends an Ownership request to the Home directory and waits for the reply. When the

Home directory receives the Ownership Request from Home2, DataB is still in the “busy” state

because the Home has not received the Invalidation-Ack from Node X. The request is placed on

the waiting queue. When Home2 receives the Invalidation request, it invalidates the cache block

and sends an Invalidation-Ack to the Home node. When the Home node receives the

Invalidation-Ack, the state of the block is changed to EXCLUSIVE and the waiting queue is

checked for requests. When the Ownership Request message from Node X is retrieved from the

waiting queue, the Home directory causes the DataB block in the cache on Node Y to be

invalidated and the Ownership-Ack message is sent to Node X. Once the data block becomes

busy, it stays busy until the acknowledge message is received from Home2, even if the process

finished writing earlier. For this reason successive write operations appear in the same order for

all processes and the result of execution is the same as if the operations of all processes were

interleaved and program order was preserved in each process, i.e. sequential consistency is

preserved.
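The behaviour argued above can be condensed into a small directory sketch. The class below (Python, with hypothetical names and message plumbing) models only the FT2 fast path and the BUSY/defer mechanism described in the text; it is illustrative rather than the thesis implementation, and the full FT1 invalidate-and-collect path for blocks held by other caches is not shown.

```python
# Hypothetical sketch of the FT2 optimization: the Home directory grants
# ownership without waiting for Home2's Invalidation-Ack when the block is
# UNOWNED or SHARED only by the requester, marks the block BUSY, and defers
# any later requests until the ack arrives.

from collections import deque

class HomeDirEntry:
    def __init__(self):
        self.state = "UNOWNED"          # UNOWNED | SHARED | EXCLUSIVE | BUSY
        self.sharers = set()
        self.waiting = deque()          # requests deferred while BUSY

    def ownership_request(self, requester, send):
        if self.state == "BUSY":
            self.waiting.append(requester)   # serialize behind the outstanding Home2 ack
            return
        fast_path = (self.state == "UNOWNED"
                     or (self.state == "SHARED" and self.sharers <= {requester}))
        send("INV_REQ", "HOME2")             # Home2 always holds a copy under FT1/FT2
        if fast_path:
            self.state = "BUSY"              # FT2: grant now, collect the ack later
            self.sharers = set()
            send("OWN_ACK", requester)
        # Otherwise the other sharers/owner must be invalidated and all acks
        # collected before the grant, exactly as in FT1 (not shown).

    def home2_inv_ack(self):
        """Home2's Invalidation-Ack arrived: the speculative grant is complete."""
        self.state = "EXCLUSIVE"
        # Requests queued while BUSY are now served through the normal path.

# Usage: two writers race for an UNOWNED block; the second is deferred until
# Home2 acknowledges, so a single global write order is preserved.
log = []
entry = HomeDirEntry()
entry.ownership_request("Py", lambda *m: log.append(m))
entry.ownership_request("Px", lambda *m: log.append(m))   # placed on the waiting queue
entry.home2_inv_ack()
print(entry.state, list(entry.waiting), log)
```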

The FT2 protocol reduces the service latency for Ownership requests for unowned data

blocks by granting Ownership of the data block before the Invalidation-Ack message has been

received from the Home2 node. In addition, the FT2 protocol reduces the network cost that is

involved in the multicasting of all Ownership and Data Acknowledge messages in FT1. The

Ownership Acknowledge message is sent to Home2 for the purposes of tracking the changes in


ownership for a data block. The data block value included in the Ownership Acknowledge

message is not used by Home2 because the data block is in the INVALID state in the Home2

memory and will be modified again before being written back by the new owner. For this reason,

it is not necessary to send the data block to Home2, only the notification of the ownership

Acknowledge. If the Home node is remote to the requesting node, multicasting the Ownership

Acknowledge message to Home2 and the requesting node does not increase the network cost over

what it would be in FT0. If however, the Home node is local to the requesting cache, the

Ownership Acknowledge would be handled locally (network cost of 0). Any message sent to

Home2 in this case will have a network cost proportional to the message size (SD). Since it is not

necessary to send the data to the Home2 node, a reduction in network cost can be achieved by

sending two Ownership Acknowledge messages. One Ownership Acknowledge message containing

the data will be sent locally to the requesting cache. Another Ownership Acknowledge message

that does not contain the data block (message size of SH) will be sent remotely to Home2.
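As a rough illustration of this saving, the fragment below models the network cost of the Ownership Acknowledge traffic for a block whose Home is local to the requesting cache, following the description in the text. The numeric sizes S_H and S_D are invented placeholders (only their relative magnitudes matter), so this is a sketch of the accounting, not measured values.

    # Illustrative message sizes (placeholders) following the SH/SD notation:
    # a header-only message versus a message carrying a full data block.
    S_H = 8
    S_D = 72

    def ownership_ack_network_cost(protocol):
        """Network cost when the Home node is local to the requesting cache."""
        if protocol == "FT0":
            return 0        # acknowledge handled entirely within the node
        if protocol == "FT1":
            return S_D      # one multicast carrying the data block to Home2
        if protocol == "FT2":
            return S_H      # data delivered locally; only a header-sized
                            # Ownership Acknowledge is sent remotely to Home2
        raise ValueError(protocol)

    if __name__ == "__main__":
        for p in ("FT0", "FT1", "FT2"):
            print(p, ownership_ack_network_cost(p))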

4.4.1 Performance

Figures 4.26 through 4.29 provide the execution time for the Case N01, Case S01, Case N10 and Case S10 application classes, respectively. In all of the experiments reported in this section, each thread processed 10,000 memory references. These figures show that the performance increase provided by the FT2 protocol over the FT1 protocol can be observed when the degree of locality of the application is higher (Case N10 and Case S10), since both enhancements are aimed at the situation where there is an Ownership or Upgrade request for a data block that is local to the requesting node and is either unowned or shared only by the requesting node. If the degree of locality is lower (Case N01 and Case S01), there is little difference in the performance of the FT1 and FT2 protocols.


Figure 4.26: Execution Time, Case N01 (y-axis: simulation clock cycles; bars: FT1 and FT2 at N = .01, .02, .05, .10)

Figure 4.27: Execution Time, Case S01 (y-axis: simulation clock cycles; bars: FT1 and FT2 at S = .01, .02, .05, .10)


Figure 4.28: Execution Time, Case N10 (y-axis: simulation clock cycles; bars: FT1 and FT2 at N = .01, .02, .05, .10)

Figure 4.29: Execution Time, Case S10 (y-axis: simulation clock cycles; bars: FT1 and FT2 at S = .01, .02, .05, .10)



In general, the FT2 protocol provides better performance than the FT1 protocol when the

application exhibits a high degree of locality. However, when an application exhibits a high degree of locality and there are multiple threads per process, it is possible for the DSM to become unstable and for the queues to fill, bringing the system to a halt. This can occur because the cache is granted ownership of a block that is unowned or shared only by the requesting cache without waiting for the Invalidation-Ack from Home2. As described previously, this does not pose a problem with regard to sequential consistency; however, without a limiting condition (such as waiting for an acknowledgment), the number of outstanding Invalidation-Ack messages in the system can become large enough to adversely affect the performance of the DSM.

If every thread in a process issues an ownership request to a different memory block

which is resident in the local memory of that node and is unowned or only shared by the

requesting cache, ownership requests and acknowledgments can be issued one right after the

other without waiting for the corresponding Invalidation-Ack message from Home2. This results

in the output queue of the Home node being filled with Invalidation requests and Ownership

acknowledgments destined for Home2. Assuming the system is symmetric and all nodes are

behaving similarly, the queues fill up and the response to other requests that involve Invalidation-

Ack or Downgrade-Ack becomes slower and slower. This problem worsens as the number of

threads increases, the write ratio increases, or the degree of locality of the application increases.

In order to prevent this situation, a limit is set on the number of outstanding Invalidation-Ack

messages from Home2 for blocks that are unowned or shared only by the requesting cache. When

this limit is reached, all subsequent ownership requests must wait for the Home2 Invalidation-

Ack message before the cache is awarded ownership of the block.
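A minimal sketch of this limiting condition follows, assuming a simple counter per node (the names and the limit value are hypothetical, not taken from the simulator): ownership is granted early only while the number of Home2 Invalidation-Acks still outstanding is below the limit; otherwise the request falls back to waiting for the acknowledgment.

    class EarlyGrantThrottle:
        """Tracks Home2 Invalidation-Acks outstanding for early ownership grants."""

        def __init__(self, limit=4):       # the limit value here is illustrative
            self.limit = limit
            self.outstanding = 0

        def may_grant_early(self):
            # True if ownership may be granted before Home2's Invalidation-Ack.
            return self.outstanding < self.limit

        def record_early_grant(self):
            self.outstanding += 1

        def record_home2_ack(self):
            self.outstanding -= 1

    # Usage: if throttle.may_grant_early(), grant ownership immediately and call
    # record_early_grant(); otherwise hold the request until record_home2_ack()
    # brings the count back under the limit.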



4.5 FT3 PROTOCOL

Although the FT2 Protocol provides some improvement over the FT1 protocol, the

network cost for Decision Tree paths that involve transferring a data block (writeback messages

as well as Data and Ownership acknowledge messages) still causes performance loss in paths

where the data transfer would otherwise have been done locally. The FT3 Protocol reduces the

network costs incurred in FT1 and FT2 in three ways.

Cache controllers do not send either DowngradeWB or InvalidationWB messages to the

Home2 node. This is possible because all InvalidationWB or DowngradeWB messages occur

as the result of an Ownership or Data request to a block in the EXCLUSIVE state. Once the

Home directory controller receives the writeback message, it sends the Data or Ownership

Acknowledge message with the data block to the requesting cache. In order to reduce the

overhead of sending InvalidationWB or DowngradeWB messages for locally owned data blocks,

the new value for the memory block can be sent to the Home2 directory controller as part of the subsequent Data or Ownership Acknowledge messages, thereby avoiding the cost of sending local writeback messages only to the Home2 directory, as would occur for Decision Tree paths E and

M. Victim writeback messages, however, are not the result of a request from another node and

the next Data or Ownership request for the block could occur after the next checkpoint is taken.

For this reason, Victim writeback messages must be sent to both the Home and Home2 directory

controllers.

When a cache chooses a locally owned data block to be replaced in the cache, the Victim

writeback message would have been handled locally in FT0 but causes a remote message to be

sent in either FT1 or FT2. The next optimization in the FT3 protocol is to eliminate the necessity

of sending a remote message only to Home2. Instead, the update to Home2 can be postponed: when a node other than the home node later requests the data block, Home2 receives the update through the multicast of the Data Acknowledge message that is sent to the remote node.


If another node does not request the data block before the next checkpoint occurs, the

Home2 node will not have all the information it requires to update its copy of the recovery

memory. This situation is handled by associating a “writeback-pending” attribute with the

directory entry for each data block. When a home node receives a local victim writeback

message, it sets the “writeback-pending” attribute to indicate that the data has not been written

back to the Home2 node. At checkpoint time, each node will go through the directory entries for

its local memory. All data blocks that have the “writeback-pending” attribute set are added to a

message that will be sent to the Home2 node providing it with the most recent value for these data

blocks. This message is similar to the recovery memory update message sent by the FT0 protocol

at checkpoint time but will contain fewer data blocks.
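The checkpoint-time sweep described above could look like the following sketch (all names are assumptions for illustration): blocks whose "writeback-pending" attribute is set hold a newer value in the Home node's memory than in Home2's recovery memory, so they are gathered into a single update message.

    def build_recovery_update(directory, local_memory):
        """Collect blocks whose latest value has not yet reached Home2.

        directory    : dict mapping block id -> directory entry (a dict here)
        local_memory : dict mapping block id -> current block value
        Returns the (block id, value) pairs to send to Home2 in one
        recovery-memory update message; clears each writeback-pending flag.
        """
        update = []
        for block_id, entry in directory.items():
            if entry.get("writeback_pending"):
                update.append((block_id, local_memory[block_id]))
                entry["writeback_pending"] = False
        return update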

The FT3 protocol removes the requirement to multicast the Ownership Acknowledge

message to the requesting node and Home2. The data in an Ownership Acknowledge message is

not useful to the Home2 node because the data block in the Home2 memory remains in the

INVALID state until a Data Acknowledgement message is received by the Home2 node. As long

as a node has the ownership of a block, it is possible the block will be modified. The only way to

get the most recent value of a data block is to either receive the writeback message that occurs

when the node relinquishes ownership of the block or to get the data block as part of a future Data

Acknowledge message. Furthermore, it is only necessary to send the Home2 node a Data

Acknowledge message when the data block has just transitioned from the EXCLUSIVE state to

the SHARED state. It is not necessary to multicast the Data Acknowledge message to Home2 if

the block was already in the SHARED state when the requesting node made a read request. The

“writeback-pending” attribute is used to indicate that the Home2 node has not received the latest

value of a data block and is set anytime a writeback of any kind occurs. The attribute is cleared

when a Data Acknowledge message is multicast to the Home2 node. Each time the home

directory is ready to send a Data Acknowledge to another node, it checks the “writeback-pending”

attribute. If the attribute is set, the Data Acknowledge message is multicast to the Home2 node,


otherwise it is not. A major advantage of this approach is that no additional effort is needed to

determine the correct order of updates to a data block when the updates consist of Writeback messages received from different caches. All messages sent from the Home node are seen in the same order by all other nodes. Therefore, the most recent value of a data block is the

value received in the most recent Data Acknowledge message.
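Putting these pieces together, the rule that keeps Home2 current in FT3 can be sketched as follows (again with assumed names): every writeback sets the "writeback-pending" attribute, and a Data Acknowledge is multicast to Home2 only while that attribute is set, which also clears it.

    def on_writeback(entry):
        # Any writeback (Victim, InvalidationWB, DowngradeWB) means Home2 no
        # longer holds the latest value of this block.
        entry["writeback_pending"] = True

    def issue_data_ack(entry, requester, home2, data, send):
        # Home directory about to send a Data Acknowledge: include Home2 in
        # the multicast only if its copy is stale, then mark it current again.
        if entry.get("writeback_pending"):
            send([requester, home2], "Data-Ack", data)
            entry["writeback_pending"] = False
        else:
            send([requester], "Data-Ack", data)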

4.5.1 Performance

The total execution time, including normal execution and checkpointing time, is compared for the FT0, FT1, FT2 and FT3 protocols for each of the four synthetic workloads in Figures 4.30 through 4.33 and for the multiprocessor trace files in Figure 4.34. The synthetic

applications contained 30,000 memory references and created three checkpoints during the

execution of the application.

In the Case N01 applications, there is a low probability (.01) that a memory reference

will be for a block that is in the local memory of the requesting node. There is also a low level of

sharing data blocks between all the nodes. Figure 4.30 shows the performance for each of the

protocols as the probability that the backup node (Home2) will reference a data block in the local memory of the primary node (Home) is increased. Increasing the probability that data will be shared between the Home and Home2 nodes increases the probability that read misses on the Home2 node can be filled from the Home2 memory, where they otherwise would have required a remote access. Under these conditions, FT1, FT2 and FT3 provide an improvement over FT0. The

results from these experiments show that even for very small increases in the probability of

sharing data between the Home and Home2 nodes, the performance of the system increases

significantly.

In the Case S01 and Case S10 applications, very little effort is made to encourage sharing of data

between the Home and Home2 nodes in particular (probability of .01). The probability that a

memory reference is to a remote block that has been accessed before is increased. This type of


application is one in which there is an increasing degree of sharing data structures between all of

the nodes in the system. The Case S01 experiments shown in Figure 4.32 have a small

probability of a node accessing a block in its own local memory (.01) and the Case S10

experiments shown in Figure 4.33 have a larger probability (.10) of a node accessing a local

block. In all experiments shown in Figures 4.32 and 4.33, the probability that the Home and Home2 nodes in particular share data is .02. The Case S01 experiments in Figure 4.32 show that FT1, FT2 and FT3 perform better than FT0, both in terms of normal execution time and in the

time required to create checkpoints. In addition FT1, FT2, and FT3 provide similar performance.

The reason for this is that the features of the FT2 and FT3 protocols are aimed at reducing the

penalties from write requests that are local in FT0 but must be handled as a remote request in the

FT1 protocol. Since the probability of accessing local memory is small in these experiments, the

optimizations of FT2 and FT3 are not needed. The higher probability of accessing local memory in the Case S10 experiments shown in Figure 4.33 allows the optimizations of FT2 and FT3 to be utilized. Although all three protocols (FT1, FT2, and FT3) provide better performance than FT0, in each experiment the FT2 and FT3 algorithms provided an improvement over the FT1 algorithm. In these experiments, the FT3 algorithm has a lower normal execution time than FT2, but the time required to send the update message containing the local Victim writeback messages that were not sent to Home2 is high enough to make the total execution time higher than

FT2.

In order for FT3 to spend a smaller amount of time in the Recovery Memory Update

phase of the checkpoint procedure, the local memory accesses must be to a smaller number of

local data blocks. If the number of local memory accesses is high but the set of blocks being accessed is small (i.e., ongoing updates to a particular data structure), the FT3 protocol will not only eliminate the local Victim writeback messages that the FT2 protocol requires but will also have a significantly smaller update message at checkpoint time. In other words, a


small number of local data blocks were obtained Exclusively and then written back a large

number of times.

Figure 4.30: Total Execution Time, Case N01 (y-axis: clock cycles; bars: normal execution and total checkpoint for FT0-FT3 with increasing probability of neighbor access, N = .01, .02, .05, .10)


Figure 4.31: Total Execution Time, Case N10 (y-axis: clock cycles; bars: normal execution and total checkpoint for FT0-FT3 with increasing probability of neighbor access, N = .01, .02, .05, .10)

Figure 4.32: Total Execution Time, Case S01 (y-axis: clock cycles; bars: normal execution and total checkpoint for FT0-FT3 with increasing level of sharing, S = .01, .02, .05, .10)


Figure 4.33: Total Execution Time, Case S10 (y-axis: clock cycles; bars: normal execution and total checkpoint for FT0-FT3 with increasing level of sharing, S = .01, .02, .05, .10)

Figure 4.34 provides a comparison of the protocols for the multiprocessor trace files. The

weather application provides an example of a situation where the FT3 protocol performs better

than the FT2 protocol. In this case, the FT0 protocol has a smaller normal execution time than

any of the other protocols but the time to take a checkpoint causes the total execution time to be

higher than FT1, FT2 or FT3.

In the simple application, there are a large number of local accesses to different local data

blocks and a large degree of sharing between the nodes. In this case, most of the local

ownership requests would have to be handled remotely by either FT0 or FT1 because the

requested data is rarely found in the UNOWNED state. The FT2 algorithm has a higher

execution time due to the necessity of multicasting the local victim writeback messages to the

Home2 node. Although the FT3 algorithm does not send the local victim writeback messages to

Home2, the number of blocks that were accessed during the interval between checkpoints is high


enough to cause the size of the update message sent during the Update Recovery Memory phase

of the checkpoint to approach that of the FT0 protocol. The FT1 algorithm performs

best in this situation, because waiting for the acknowledge message from Home2 limits the

number of outstanding ownership requests in the system. If the number of outstanding ownership

requests is too high, the output queues tend to fill up, which causes the entire system to experience

higher latencies in servicing the cache requests for data.

All four protocols perform similarly in the speech application for the simple reason that a

small number of data blocks are modified during the interval between checkpoints, making the

time required to take a checkpoint small in all cases, even FT0. In addition, there is a very low

probability of locality or sharing between the Home and Home2 nodes, which results in similar

execution times during normal operations for all protocols.

Figure 4.34: Execution Time, Trace Files (y-axis: clock cycles; bars: normal operation and checkpoint for FT0-FT3 on the weather, simple, and speech traces)



CHAPTER 5: CONCLUSIONS

In this thesis, characteristics required for interconnection networks to provide DSM systems with the means to achieve efficient, low-latency communication were discussed. In

Chapters two and three, the architecture of the SOME-Bus was presented along with possible

implementations of the internal components necessary to provide a fully hardware-supported

DSM system. A method of generating synthetic workloads was presented along with a Decision

Tree structure capable of providing detailed information on the behavior of applications and the

message traffic generated as a result of changes in application behavior. A Trace-driven

simulator was used to compare performance measures such as processor and channel utilization

as well as application runtime for the set of multiprocessor traces and synthetic workloads used to

evaluate the performance of DSM on the SOME-Bus. Also in chapter three, a theoretical model

was developed for the SOME-Bus and solved as a closed queuing network. A simulator based on

the network queuing model was developed in which parameters that characterize behavior

patterns of specific applications can be incorporated into the model. Values for Processor and

Channel utilization as well as waiting time in the output channel queue were compared for the

Trace-driven and Distribution-driven simulators as well as the theoretical model. The results of

the three approaches were very close, with differences less than 10% for processor and channel utilization, showing that complex behavior present in DSM systems can effectively be modeled by

relatively simple queuing models using parameters that characterize real applications.

In Chapter four, a set of protocols was presented to achieve fault tolerance with little or

no network overhead when implemented on the SOME-Bus. The advantages and disadvantages

of each protocol were presented in terms of specific application characteristics such as locality

and degree of internode sharing of data. Applications implementing the protocols were analyzed

using the Decision Tree structure in order to determine the source of higher levels of network

traffic. Several versions of the protocols were proposed in order to take advantage of features


such as locality and communication patterns that occur frequently between pairs of nodes.

Results show that for the synthetic workloads, not only was the overhead for fault tolerance

hidden, but the performance of the DSM was improved by the addition of the protocols.

Characteristics such as hot spots and a small degree of locality prevented the multiprocessor

traces from benefiting from the protocols with the exception of simple, which demonstrated a

performance increase when the FT1 protocol was applied. The speech application was successful

in hiding the fault tolerance overhead and weather allowed the overhead to be minimized.

The work described in this thesis demonstrates the benefits of an interconnection network architecture that is optimized for the types of communication inherent in DSM, namely support for frequent, efficient multicast of relatively small messages. A DSM

implemented within such a framework has the potential to realize not only an increase in the level

of fault tolerance of the system but also a simultaneous increase in the performance of the overall

DSM system.


VITA

Diana Lynn Hecht was born in Tuscaloosa, Alabama, in 1964. Diana Hecht received her

B.S.E. in 1995 and M.S.E. in 1999 from the University of Alabama in Huntsville. As an

undergraduate, Diana was a Co-op Student working at NASA, where she received a U.S. Patent

for her contribution to the design of an Imaging Phosphor Scanner for application in X-Ray

Crystallography. After receiving her B.S.E., Diana was employed by VME Microsystems Inc,

where she developed low-level software for embedded systems before returning to school to

pursue her M.S.E. and Ph.D. in Computer Engineering. While pursuing her graduate studies,

Diana held a teaching and research assistantship and has extensive teaching experience in the

areas of hardware design and Real-time Systems Software. Diana’s past research activities

included the development of a parallel version of an existing Optical Plume Anomaly Detection

software application (sponsored by NASA). The scheduling issues that arose from the

transformation of the application became the focus of her master’s thesis. Additional research

efforts involved Fault Tolerance for Low-Bandwidth Networks aimed at the Rapid Force

Projection Initiative System Network (sponsored by the US Army Aviation and Missile

Command). Diana has also been involved in research sponsored by the National Science

Foundation for the development of a prototype of the Simultaneous Optical Multiprocessor

Exchange Bus (SOME-bus). Diana Hecht is currently working as a research engineer for Rydal

Research and Development where she is involved in research and development efforts in low

latency processor interconnects and represents Rydal Research in the RapidIO Trade Association.