design, development, and simulation/experimental …heath/masters_proj_report_venugopal...1 design,...
TRANSCRIPT
1
Design, Development, and Simulation/Experimental Validation of a
Crossbar Interconnection Network for a Single-Chip Shared Memory
Multiprocessor Architecture
Master’s Project Report
June, 2002
Venugopal Duvvuri
Department of Electrical and Computer Engineering
University Of Kentucky
Under the Guidance of
Dr. J. Robert Heath
Associate Professor
Department of Electrical and Computer Engineering
University of Kentucky
2
Table of Contents
Topic Page Number
ABSTRACT 3
Chapter 1: Introduction, Background, and Positioning of Research 4
Chapter 2: Types of Interconnect Systems 8
Chapter 3: Multistage Interconnection Systems Complexity 16
Chapter 4: Design of the Crossbar Interconnect Network 28
Chapter 5: VHDL Design Capture, Simulation, Synthesis and
Implementation Flow 35
Chapter 6: Design Validation via Post-Implementation Simulation Testing 39
Chapter 7: Experimental Prototype Development, Testing, and Validation
Results 61
Chapter 8: Conclusions 65
References 66
Appendix A: Interconnect Network and Memory VHDL Code (Version 1) 67
Appendix B: Interconnect Network VHDL Code (Version 2) 76
3
ABSTRACT
This project involves modeling, design, Hardware Description Language (HDL) design
capture, synthesis, HDL simulation testing, and experimental validation of an
interconnect network for a Hybrid Data/Command Driven Computer Architecture
(HDCA) system, which is a single-chip shared memory multiprocessor architecture
system. Various interconnect topologies that may meet the requirements of the HDCA
system are studied and evaluated related to utilization within the HDCA system. It is
determined the Crossbar topology best meets the HDCA system requirements and it is
therefore used as the interconnect network of the HDCA system. The design capture,
synthesis, simulation and implementation is done in VHDL using XILINX Foundation
CAD software. A small reduced scale prototype design is implemented in a PROM based
Spartan XL Field Programmable Gate Array (FPGA) chip which is successfully
experimentally tested to validate the design and functionality of the described crossbar
interconnect network.
4
Chapter 1
Introduction, Background, and Positioning of Research
This project is first, the study of different kinds of interconnect networks that may
meet the requirements of a Hybrid Data/Command Driven Architecture (HDCA)
multiprocessor system [2,5,6] shown in Figure 1.1. The project then involves Vhsic
Hardware Description Language (VHDL) [10] description, synthesis, simulation testing,
and experimental prototype testing of an interconnect network which acts as a Computing
Element (CE) to data memory circuit switch for the HDCA system.
The HDCA system is a multiprocessor shared memory architecture. The shared
memory is organized as a number of individual memory blocks as shown in Figures 1.1,
3.4a, and 4.1 and is explained in detail in later chapters. This kind of memory
organization is required by this architecture. If two or more processors want to
communicate with memory locations within the same memory block, lower priority
processors have to wait until the highest priority processor gets its transaction done. Only
the highest priority processor will receive a request grant and the requests from other
lower priority processors must be queued and these requests are processed only after the
completion of the first highest priority transaction. The interconnect network to be
designed should be able to connect requesting processors on one side of the interconnect
network to the memory blocks on the other side of the interconnect network. The
efficiency of the interconnect network increases as the possible number of parallel
connections between the processors and the memory blocks increases. Interconnection
networks play a central role in determining the overall performance of a multiprocessor
system. If the network cannot provide adequate performance for a particular application,
nodes (or CE processors in this case) will frequently be forced to wait for data to arrive.
In this project, different types of interconnect networks, which may be applicable
to a HDCA system are addressed, and advantages and disadvantages of these
interconnects are discussed. Different types of interconnects, their routing mechanism,
and the complexity factor in designing the interconnects is also described in detail. This
project includes design and VHDL description and synthesis of a interconnect network
5
based on the crossbar topology, the topology which best meets HDCA system
requirements.
Input FIFOs
RAM Data Memory
CE-Data Memory Interconnect
CE0 CE1 CEn-1
Q Q Q
CE-Mapper
Control
Token Router
Control Token
Mapper (CTM) File Large File Memory File
CE-File Interconnect
=> Muiltifunctional Queue Q
… Inputs …
Outputs
Figure 1.1: Single-Chip Reconfigurable HDCA System (Large File Memory May Be Off-Chip)
FIFO
…...
…
… … …
…………………………….
.
.
.
…
…
6
The crossbar topology is a very popular interconnect network in industry today.
Interconnects are applicable to different kinds of systems having their own requirements.
In some systems, such as distributed memory systems, there should be a way that the
processors can communicate with each other. A crossbar topology (single sided topology)
[1] can be designed to meet the requirement of inter-processor communication and is
unique to distributed memory systems, because in distributed memory systems,
processors do not share a common memory. All the processors in the system are directly
connected to their own memory and caches. Any processor cannot directly access another
processor's memory. All communication between the processors is made possible through
the interconnection network. Hence, there is a need for inter-processor communication in
distributed memory architectures. The crossbar topology suitable for these architectures
is the single-sided crossbar network. All the processors are connected to an
interconnection network and communication between any two processors is possible. The
HDCA system does not need an interconnect that supports inter-processor
communication, as it is a shared memory architecture. For the shared memory
architectures, a double-sided crossbar network can be used as the interconnect network.
This design needs some kind of priority logic, which prioritizes conflicting requests for
memory accesses by the processors. This also requires a memory organization which is
shared by all processors. The HDCA system requires the memory to be divided into
memory blocks, each block containing memory locations with different address ranges.
The actual interconnect design is a combination of a crossbar interconnect (double sided
topology) [1], priority logic, and a shared memory organization. Another interconnect
architecture has been implemented as the interconnect for the CE to Data memory circuit
switch in an initial prototype of the HDCA system [2]. The initial HDCA prototype
assumes no processor conflicts in accessing a particular memory block, which can be
handled in the design presented here by the priority logic block. The input queue depth of
individual CE processors of a HDCA system is used by the priority logic block of the
proposed interconnect network in granting requests to the processor having the deepest
queue depth. The presented design is specific to the CE to Data memory circuit switch for
a HDCA system. The detailed crossbar interconnect network design is described in
7
Chapter 4. VHDL design capture, synthesis and implementation procedures are discussed
in Chapter 5. Chapter 6 includes the VHDL simulation testing setup and results. A test
case is described in chapter 6 which was tested during pre-synthesis HDL simulation,
post-synthesis HDL simulation, and the post-implementation HDL simulation process. In
Chapter 7 an experimental prototype of the crossbar interconnect network is developed
and tested to validate the presented interconnect architecture, design, and functionality.
8
Chapter 2
Types of Interconnect Systems
Interconnect networks can be classified as static or dynamic [11]. In the case of a
static interconnection network, all connections are fixed, i.e. the processors are wired
directly, whereas in the latter case there are routing switches in between. The decision
whether to use a static or dynamic interconnection network depends on the type of
problem to be solved by the computer system utilizing the interconnect system.
Generally, static topologies are suitable for problems whose communication patterns can
be predicted a priori reasonably well, whereas dynamic topologies (switching networks),
though more expensive, are suitable for a wider class of problems. Static networks are
mainly used in message passing networks and are mainly used for inter-processor
communications.
Types of Static Networks:
1. Star connected network:
Figure 2.1: Star Connected Network
In a star topology there is one central node computer, to which all other node
computers are connected; each node has one connection, except the center node, which
has N-1 connections. Routing in stars is trivial. If one of the communicating nodes is the
center node, then the path is just the edge connecting them. If not, the message is routed
from the source node to the center node, and from there to the destination node. Star
9
networks are not suitable for large systems, since the center node will become a
bottleneck with an increasing number of processors. A typical Star connected network is
shown in Figure 2.1.
2. Meshes:
Figures 2.2 and 2.3 show a typical 1-Dimensional (D) mesh and 2-D mesh
respectively. The simplest and cheapest way to connect the nodes of a parallel computer
is to use a one-dimensional mesh. Each node has two connections and boundary nodes
have one. If the boundary nodes are connected to each other, we have a ring, and all
nodes have two connections. The one-dimensional mesh can be generalized to a k-
dimensional mesh, where each node (except boundary nodes) has 2k connections. In
meshes, the dimension-order routing technique is used [12]. That is, routing is performed
in one dimension at a time. In a three-dimensional mesh for example, a message's path
from node (a,b,c) to the node (x,y,z) would be moved along the first dimension to node
(x,b,c), then, along the second dimension to node (x,y,c), and finally, in the third
dimension to the destination-node (x,y,z). This type of topology is not suitable to build
large-scale computers, since there is a wide range of latencies (the latency between
neighboring processors is much lower than between not-neighbors), and secondly the
maximum latency grows with the number of processors.
Figure 2.2: 1 - D Mesh
Figure 2.3: 2 - D Mesh
10
3. Hypercubes:
The hypercube topology is one of the most popular and used in many large-scale
systems. A k-dimensional hypercube has 2k nodes, each with k connections. Figure 2.4
shows a 4-D hypercude. Hypercubes scale very well, the maximum latency in a k-
dimensional (or "k-ary") hypercube is log2 N, with N = 2k.
An important property of hypercube interconnects is the relationship between
node-number and which nodes are connected together. The rule is that any two nodes in
the hypercube, whose binary representations differ in exactly one bit, are connected
together. For example in a four-dimensional hypercube, node 0 (0000) is connected to
node 1 (0001), node 2 (0010), node 4 (0100) and node 8 (1000). This numbering scheme
is called the Gray code scheme. A hypercube connected in this fashion is shown in Figure
2.5.
A k-dimensional hypercube is nothing more than a k-dimensional mesh with only
two nodes in each dimension, and thus the routing algorithm is the same as for meshes;
apart from one difference. The path from node A to node B is calculated by simply
calculating the Exclusive-OR, X = A XOR B, from the binary representations for node A
and B. If the ith bit in X is '1' the message is moved to the neighboring node in the ith
dimension. If the ith bit is '0', the message is not moved. This means, that it takes at most
log2 N steps for a message to reach its destination (where N is the number of nodes in the
hypercube).
Figure 2.4: 4-D Hypercube
11
Figure 2.5: Gray Code Scheme in Hypercube
Types of Dynamic Networks:
1. Bus-based networks:
They are the simplest and an efficient solution when the cost and a moderate
number of processors are involved. Their main drawback is a bottleneck to the memory
when the number of processors becomes large and also a single point of failure hangs the
system. To overcome these problems to some extent, several parallel buses can be
incorporated. Figure 2.6 shows a bus-based network incorporated with a single bus.
P0 P1 Pn-1
Figure 2.6: Bus-based Network
0000 0010
0100
0101
0001
0110
0111
0011
1000 1010
1100
1101
1001
1110
1111
1011
12
2. Crossbar switching networks:
Figure 2.7 shows a double-sided crossbar network having ‘n’ processors (Pi) and
‘m’ memory blocks (Mi). All the processors in a crossbar network have dedicated buses
directly connected to all memory blocks. This is a non-blocking network, as a connection
of a processor to a memory block does not block the connection of any other processor to
any other memory block. In spite of high speed, their use is normally limited to those
systems containing 32 or fewer processors, due to non-linear (n x m) complexity and
cost. They are applied mostly in multiprocessor vector computers and in multiprocessor
systems with multilevel interconnections.
M2M1 Mm
P1
P2
Pn
Figure 2.7: Crossbar Network
3. Multistage interconnection networks:
Multistage networks are designed with several small dimension crossbar
networks. Input/Output connection establishment is done in two or more stages. Figures
13
2.8 and 2.9 shown below are Benes multistage and Clos multistage networks. These are
non-blocking networks and are suitable for very large systems as their complexity is
much less than that of Crossbar networks. The main disadvantage of these networks is
latency. The latency increases with the size of the network.
II
BN/d,d
I
Xd, d
II
BN/d,d
II
BN/d,d
I
Xd, d
III
Xd, d
III
Xd, d
III
Xd, d
I
Xd, d
P0
P1
Pd-1
Pd
P2d-1
Pn-d
Pn-1
Pd
Pd+1
Pn-d+1
M0
M1
M0
Md-1
Md
Md+1
M2d-1
Mn-d
Mn-d+1
Mn-1
Figure 2.8: Benes Network (BN,d)
14
I(1)
I(2)
I(C1)
III(1)
III(2)
III(C2)
II(1)
II(2)
II(K)
P0
P1
Pd-1
Pd
P2d-1
Pn-d
Pn-1
PdPd+1
Pn-d+1
M0
M1
Md-1
Md
Md+1
M2d-1
Mn-d
Mn-d+1
Mn-1
Figure 2.9: Clos Network
15
Some of the multistage networks are compared with crossbar networks, primarily
from a stand-point of complexity, in the next chapter. Table 2.10 shows some general
properties of bus, crossbar and multistage interconnection networks.
Property Bus Crossbar Multistage
Speed Low High High
Cost Low High Moderate
Reliability Low High High
Complexity Low High Moderate
Table 2.1: Properties of Various Interconnection Network Topologies
16
Chapter 3
Multistage Interconnection Network Complexity
For the HDCA system, the desired interconnect should be able to establish non-
blocking, high speed connections from the requesting Computing Elements (CEs) to the
memory blocks. The interconnect should be able to sense if there are any conflicts such
as two or more processors requesting connection to the same memory block and give
only a highest priority processor the ability to connect. The multistage Benes network and
Clos network, their complexity comparison with a crossbar network, and advantages and
disadvantages of these candidate networks is discussed in this chapter.
Crossbar Topology:
A crossbar network is a highly non-blocking, very reliable, very high-speed
network. Figure 2.7 is a typical single stage crossbar network with N inputs/processors
and M outputs/memory blocks. It is denoted by XN,M . The complexity (Crosspoint
count) of a Crossbar network is given by N x M. Complexity increases with an increase
in number of inputs or number of outputs. This is the main disadvantage of the Crossbar
network. Hence there is less scope for scalability of crossbar networks. The crossbar
topology implemented for shared memory architectures is referred to as a double-sided
crossbar network.
Benes Network:
A Benes network is a multistage, non-blocking network.. For any value of N, d
should be chosen so that LogdN is an integer. The number of stages for a N x N Benes
network is given by (2logdN-1) and it has (N/d) crossbar switches in each stage. Hence
BN,d is implemented with [[(N/d).(2logdN-1)] crossbar switches. The general
architecture of a Benes network (BN,d) is shown in Figure 2.8.
In the figure,
N: Number of Inputs or Outputs,
17
d: Dimension of each crossbar switch ( Xd,d ) ,
I: First stage switch = Xd,d ,
II: Middle stage switch = BN/d,d
III: Last stage switch = Xd,d ,
The complexity (crosspoint count) of the network is given by [(N/d).(2logdN-1) .
d2 ]. Network latency is a factor of (2logdN-1), because of (2logdN-1) stages between
input stage and output stage. There are different possible routings from any input to
output. It is a limited scalability architecture. For a BN,d implementation, N has to be a
power of d. For all other configurations, a higher order Benes network can be used, but at
the cost of some hardware wastage. The main disadvantage of this network is the network
latency and limited scalability. For very large networks, a Benes network implementation
is very cost effective.
Clos Network:
Figure 2.9 shows a typical N x M Clos network represented by CN,M. The blocks
I, III are always crossbar switches and II is a crossbar switch for a 3 stage Clos network.
In implementations of higher order Clos networks, II is a lower order Clos network.
For example, for a 5 stage Clos implementation, II is a three-stage Clos network.
N: Number of processors
M: Number of Memory blocks
K: Number of Second stage switches
C1: Number of First stage switches
C2: Number of Third stage switches
For a three stage Clos network, I = X N/C1,K
, II = X C1,C2 , III = X K,M/C2 and
the condition for non-blocking Clos implementation is K = N/C1 + M/C2 - 1.
A three stage Clos implementation for N = 16, M = 32, C1 = 4, C2 = 8 has K = 16/4 +
32/8 - 1 = 7. Each 1st stage switch becomes a 4 x 7 crossbar switch and the 2nd stage
18
switch becomes a 4 x 8 Crossbar switch and each third stage switch becomes a Crossbar
switch of size 7 x 4. (I = X4,7 II = X4,8 III = X7,4 ). The complexity of a Clos network
is given by C clos = [K(N +M) + K(C1.C2 ) ]. Using the non-blocking condition, K =
N/C1 + M/C2 - 1. For N = M & C1 = C2, K = 2N/C1 - 1 and hence
C clos = (2N/C1 - 1) {2N + C12 }.
For an optimum crosspoint count for non-blocking Clos networks, N/C1 = (N/2)1/2
= > C12 = 2N
=> C clos = ((2N)1/2 - 1). 4N. (Approximately)
The main advantage of a Clos network implementation is its scalability. A Clos
network can be implemented for any non-prime value of N. The disadvantages of this
implementation are network latency and implementation for small systems. The network
latency is a factor of the number of intermediate stages between the input stage and the
output stage.
From the complexity comparison shown in Table 3.1 and charts shown in Figures
3.2 and 3.3, it can be analyzed that the crossbar topology for small systems and the Benes
network for large systems, match the requirements of the interconnect network for a
HDCA system. The number of processors on the input side and number of memory
blocks on the output side are assumed to be ‘N’ for simplicity in comparison of the
topologies. This assumption holds for any rectangular size implementations of these
topologies, which is not possible in the Benes network. The complexity comparison table
for the three topologies studied so far is given in Table 3.1. In the table “I” is the
complexity, and “II” is the corresponding network implementation for the values of N for
the respective topologies. Chart 1, shown in Figure 3.2, is the graph of complexity of the
three topologies versus N, the number of processors or memory blocks, for lower values
of N (N <= 16). Chart 2, shown in Figure 3.3, is the graph of complexity of the three
topologies versus N, the number of processors or memory blocks, for higher values of N
(N >= 16).
19
Table 3.1: Complexity Comparison Table
N Crossbar Benes Clos
I II I II I II
2 4 X(2,2) 4 B(2,2) 4 C(2,2)
3 9 X(3,3) 9 B(3,3) 9 C(3,3)
4 16 X(4,4) 24 B(4,2) 36 C(4,2)
5 25 X(5,5) 25 B(5,5) 25 C(5,5)
6 36 X(6,6) 80 B(8,2) 63 C(6,3)
7 49 X(7,7) 80 B(8,2) 96 C(8,4)
8 64 X(8,8) 80 B(8,2) 96 C(8,4)
9 81 X(9,9) 81 B(9,3) 135 C(9,3)
10 100 X(10,10) 224 B(16,2) 135 C(10,5)
11 121 X(11,11) 224 B(16,2) 180 C(12,6)
12 144 X(12,12) 224 B(16,2) 180 C(12,6)
13 169 X(13,13) 224 B(16,2) 189 C(14,7)
14 196 X(14,14) 224 B(16,2) 189 C(14,7)
15 225 X(15,15) 224 B(16,2) 275 C(15,5)
16 256 X(16,16) 224 B(16,2) 278 C(16,8)
32 1024 X(32,32) 576 B(32,2) 896 C(32,8)
64 4096 X(64,64) 1408 B(64,2) 2668 C(64,16)
81 6561 X(81,81) 1701 B(81,3) 4131 C(81,9)
128 16384 X(128,128) 3328 B(128,2) 7680 C(128,16)
20
2 3 4 5 6 7 8 9 10 11 12 13 14 15 160
255075
100125150175200225250275300
Chart 1
CrossbarBenesClos
N
Com
plex
ity
Figure 3.2: Complexity Chart for N <= 16
16 32 64 81 1280
2500
5000
7500
10000
12500
15000
17500
Chart 2
CrossbarBenesClos
N
Com
plex
ity
Figure 3.3: Complexity Chart for N >= 16
21
The following equations are used to calculate the complexity of all three topologies for
the different configurations given in Table 3.1.
C clos = (2N/C1 - 1) {2N + C12 },
For N/C1 = (N/2)1/2, taken to the closest integer value.
C benes = [(N/d).(2logdN-1). d2 ],
C crossbar = N2
From Table 3.1 and the Charts in Figures 3.2 and 3.3, the crossbar topology has
the lowest complexity for the values of N < = 16. Hence the crossbar network is the best
interconnect implementation for systems that have not more than 16 processors/memory
blocks since the hardware required for the implementation is less than all other possible
implementations, it is faster than any other network, and it is a non-blocking network as
every input has connection capability to every output in the system. And for the systems
having more than 16 x 16 configurations, and less than 64 x 64 configurations, the
designer has to tradeoff between speed and complexity. Because, for the multistage
networks such as the Benes network, complexity is less than that of the crossbar network
but at the cost of speed, as speed of multistage networks is much lower than that of the
crossbar network. For systems having more than 64 x 64 configurations, the Benes
network proves to be the best implementation.
The HDCA system normally requires an interconnect with a complexity less than
256. The crossbar implemented interconnect best suits the system as it has minimum
complexity for the sizes of interconnect needed by the HDCA system, it is highly non-
blocking as no processor has to share any bus with any other processor and it is a very
high speed implementation as it has only one intermediate stage between processors and
memory blocks.
22
Multiprocessor Shared Memory Organization:
INTERCONNECT
MB[0]
MB[1]
MB[M-1]
P[0]
P[1]
P[N-1]
PI[0]
PI[1]
PI[N-1]
IM[0]
IM[1]
IM[M-1]
Figure 3.4a: Multi-processor, Interconnect and Shared Memory Organization
Figure 3.4a shows the organization of a multiprocessor shared memory
organization. Figure 3.4b shows the organization of the shared memory used in the
HDCA system. By making 2c = M, the shared memory architecture in Figure 3.4b can be
used as the shared memory for the HDCA system. Figure 3.4c shows the organization of
each memory block. In each memory block there are 2b addressable locations each of ‘a’
bits width.
23
MB [0]
MB [1]
MB [2]
MB [2c -1]
Figure 3.4b: Shared Memory Organization
24
Figure 3.4c: Organization of each Memory Block
Related Research in Crossbar Networks:
The crossbar switches (networks) of today see a wide use in a variety of
applications including network switching, parallel computing, and various
telecommunications applications. By using the Field Programmable Gate Arrays
(FPGAs) or the Complex Programmable Logic Devices (CPLDs) to implement the
crossbar switches, design engineers have the flexibility to customize the switch to suit
their specific design goals, as well as obtain switch configurations not available with off-
the-shelf parts. In addition, the use of in-system programmable devices allows the switch
0
1
2b - 1
01a - 1
25
to be reconfigured if design modifications become necessary. There are two types of
implementations possible based on the crossbar topology. One is the single-sided
crossbar implementation as shown in Figure 3.5 and the other is the double-sided
crossbar implementation as shown in Figure 2.7.
Figure 3.5: Single-sided Crossbar network
The single-sided crossbar network is usually implemented and utilized in
distributed memory systems, where all the nodes (or processors) connected to the
interconnect need to communicate to each other. Whereas in the double-sided crossbar
networks which are usually utilized as the interconnect between processors and memory
blocks in a multiprocessor shared memory architecture as shown in Figures 3.4a, 3.4b,
and 3.4c, processors need to communicate with memory blocks but not processors with
processors and memory blocks with memory blocks.
A crossbar network is implemented with serial buses or parallel buses. In a serial-
bus, crossbar network implementation, addresses and data are sent by the processor
through a single-bit bus in a serial fashion, which are fetched by the memory blocks at
the rate of 1-bit on every active edge of the system clock. Some of the conventional
crossbar switches use this protocol for the crossbar interconnect network. The other
implementation is the parallel-bus crossbar network implementation. This
implementation is much faster than the serial-bus implementation. All the memory blocks
0
1
2
N- 2
N- 1
26
fetch addresses on one clock and data on the following clocks. This implementation
consumes more hardware than the serial-bus crossbar network implementation but is a
much faster network implementation and is hence used in some high performance
multiprocessor systems.
The main issue in implementing a crossbar network is arbitration of processor
requests for memory accesses. The processor request arbitration comes into picture when
two or more processors request for memory access within the same memory block. There
are different protocols that may be followed in designing the interconnect. One of them is
a round robin protocol. In the case of conflict among processor requests for memory
accesses within the same memory block, requests are granted to the processors in a round
robin fashion. A fixed priority protocol assigns fixed priorities to the processors. In case
of conflict, the processor having the highest priority ends up having its request granted all
the time. For a variable priority protocol, as will be used in an HDCA system, priorities
assigned to processors dynamically vary over time based on some parameter (metric) of
the system. In the HDCA system of Figure 1.1, all processors are dynamically assigned a
priority depending upon their input Queue (Q) depth at any time. In the case of conflict,
the processor having the highest (deepest) queue depth at that point of time gets the
request grant. A design engineer has to choose among the above mentioned protocols and
various kinds of implementations depending on the system requirements in designing a
crossbar network. The HDCA system will need some kind of arbitration in case of
processor conflicts.
The interconnect design presented in this project is closely related to the design of
a crossbar interconnect for the distributed memory systems presented in [3]. Both the
designs address the possibility of conflicts between the processor requests for the
memory accesses within the same memory block. Both the designs use parallel address
and data buses between every processor and every memory block. The design presented
in this project is different from the interconnect network of [3] in two ways. Firstly, the
crossbar interconnect network presented in this project is suitable for the shared memory
architecture. The crossbar topology used in this project is a double-sided topology
whereas a single-sided topology is used in the design of the crossbar interconnect
network of [3]. The priority arbitration scheme proposed in the interconnect network of
27
[3] uses a fixed priority scheme based on the physical distance between the processors
and gives the closest processor the highest priority and farthest processor the lowest
priority. The priority arbitration scheme presented in this design uses the input queue
depth of the processors in determining the priorities. The HDCA system requires a
double-sided, parallel bus crossbar network, with variable priority depending upon input
queue depth of the processors. In this project, a double-sided, parallel-bus crossbar
network using variable priority protocol is designed, implemented and tested as the
interconnect for the HDCA system. The detailed design of the implementation is
described in the next chapter.
28
Chapter 4
Design of the Crossbar Interconnect Network
This chapter presents the detailed design of a crossbar interconnect which meets
the requirements of the HDCA system of Figure1.1. The organization of processors,
interconnect and memory blocks is shown in Figure 3.4a.
Shared Memory Organization:
The shared memory organization used in this project is shown in Figures 3.4b and 3.4c.
From Figure 3.4b, there are (2c = M) memory blocks and the organization of each
memory block is shown in Figure 3.4c. In each memory block, there are 2b addressable
locations, of ‘a’ bits width. Hence the main memory which includes all the memory
blocks has 2b+c addressable locations of ‘a’ bits width. Hence the width of the address bus
of each processor is (b + c) bits wide and the data bus of each processor is ‘a’ bits wide.
Signal Description:
The schematic equivalent to the behavioral VHDL description of the interconnect
is shown in Figure 4.1. In general, a processor ‘i’ of Figure 3.4a has CTRL[i], RW[i],
ADDR[i], QD[i] as inputs to the interconnect. CTRL[i] of processor ‘i’ goes high when it
wants a connection to a memory block. RW[i] goes high when it wants to read and goes
low when it wants to write. ADDR[i] is the (b + c) bit address of the memory location
and ADDR_BL[i] is the ‘c’ bit address of the memory block with which the processor
wants to communicate. The memory block is indicated by the ‘c’ MSBs of ADDR[i].
QD[i] is the queue depth of the processor. FLAG[i], an input to the processor goes high
granting the processor’s request. FLAG[i] is decided by the priority logic of the
interconnect network. PI[i], DM[i][j] and IM[j] of Figure 4.1 are different types of buses
used in the interconnection network. The bus structure of these buses is shown in Figures
4.2 and 4.3.
At any time, processors represented by ‘i’, can request access to memory blocks
represented by MB[j]. That means CTRL[i] of those processors go high and have
memory block address, ie ADDR_BL[i] = j. Hence in Figure 4.2, the bus PI[i] of the
29
P [ 0 ] M B [ 0 ]P R L [ 0 ]
D E C [ 0 ]
P I [ 0 ]
M B _ A D D R [ 0 ]
I M [ 0 ]
P [ N - 1 ] M B [ M - 1 ]P R L [ M - 1 ]
D E C [ N - 1 ]
P I [ N - 1 ]
M B _ A D D R [ N - 1 ]
I M [ M - 1 ]
P [ 1 ] M B [ 1 ]P R L [ 1 ]
D E C [ 1 ]
P I [ 1 ]
M B _ A D D R [ 1 ]
I M [ 1 ]
D M [ 0 ] [ 0 ]
D M [ N - 1 ] [ M - 1 ]
Figure 4.1: Block Diagram of the Crossbar Interconnect Network
30
C T R L [ i ]
R W [ i ]
F L A G [ i ]
A D D R [ i ]
D A T A [ i ]
Q D E P [ i ]
B + C
A
Figure 4.2: PI[i] and DM[i][j] Bus Structures. (The PI[i] Bus and DM[i][j] Bus Have the
Same Set of Signal Lines as Shown in This Figure)
C T R L [ i ]
R W [ i ]
F L A G [ i ]
A D D R [ i ]
D A T A [ i ]
B + C
A
Figure 4.3: IM[j] Bus Structure
processors gets connected to the bus DM[i][j], through the decode logic DEC[i] of Figure
4.1 and shown again in Figure 4.4. As shown in Figure 4.4, ADDR_BL[i] of the
requesting processor is decoded by decoder DEC[i], and connects PI[i] to the DM[i][j]
output bus of DEC[i]. Every memory block has a priority logic block, PRL[j], as shown
in Figure 4.5. The function of this logic block is to grant a request to the processor having
the deepest queue depth among the processors requesting memory access to the same
memory block. As shown in Figure 4.5, once processor ‘i’ gets a grant from the priority
logic PRL[j] via the FLAG[i] signal of the DM[i][j] and PI[i] busses shown in Figures 4.1
and 4.2, the DM[i][j] bus is connected to the IM[j] bus by MUX[j] of Figure 4.5. Thus a
31
connection is established via PI[i], DM[i][j] and IM[j] between processor ‘i’ and memory
block ‘j’. This connection remains active as long as the processor holds deepest queue
depth or CTRL[i] of the processor is active. A priority logic block gives a grant only to
the highest
Figure 4.4: Decode Logic (DEC[i])
Figure 4.5: Priority Logic (PRL[j])
DEC[i]
PI[i]
DM[i][0]
DM[i][1]
DM[i][M-1]
ADDR_BL[i]
MUX[j]
PRL_LOGIC[j]
PROC SEL[j]
IM[j]
DM[0][j]
DM[1][j]
DM[N-1[j]
32
Figure 4.6: Priority Control Flow Chart for PR_LOGIC[j] in Figure 4.5.
ctrl[0] = '1' &mbaddr[0] = j
max=qd[0] i = 0max=0
ctrl[1] = '1'& mbaddr[1] = j& qd[1] >= max
max = max i = i
flag[i] = '0' max = qd[1] i = 1
ctrl[2] ='1'& mbaddr[2] =j& qd[2] >= max
flag[i] = '0' max = qd[2] i = 2
max = max i = i
ctrl[3] = '1'& mbaddr[3] = j& qd[3] >= max
flag[i] = '0' max = qd[3] i = 3
max = max i =i
flag[i] = '1'
ctrl[i] = '1'
flag[i] = '0'
T
F
T
T
T
T
F
F
F
F
33
priority processor. The queue depth of the processors is used in determining the priority.
In cases of processors having the same queue depth the processor having highest
processor identification number gets the highest priority. A processor can access a
particular memory block as long as it has highest priority to that memory block. If some
other processor gets highest priority for that particular memory block, the processor,
which is currently accessing the memory block gets its connection disconnected. It will
have to wait until it gets the highest priority for accessing that block again.
The flow chart showing the algorithmic operation of the jth priority logic block as
shown in Figure 4.5 is shown in Figure 4.6 above. To fully follow the flow chart of
Figure 4.6, we must reconcile signal names used in Figure 4.2 and Figure 4.6. The ‘c’
MSBs of ADDR[i] of Figure 4.2 correspond to the ‘mbaddr[x]’ of Figure 4.6 where x has
an integer value ranging from ‘0’ to (2C – 1). QDEP[i] of Figure 4.2 is the same as ‘qd[i]’
of Figure 4.6. The number of processors in the figure are assumed to be ‘4’ but the
algorithm holds true for any number of processors. The PR_LOGIC[j] block of the
priority logic of Figure 4.5 compares the current maximum queue depth with the queue
depth of every processor starting from the 0th processor. This comparison is done only for
those processors whose CTRL is in the logic ‘1’ state and are requesting memory access
to that memory block where the priority logic operation is performed. The integer value
‘i’ shown in Figure 4.6, is the identification number of the processor having deepest
queue depth at that time. After completion of processor prioritizing, the processor ‘i’
which has the deepest queue depth gets its request granted (FLAG[i] goes high) to access
that memory block. This logic operation is structurally equivalent to the schematic shown
in Figure 4.5, in which PROC_SEL[j] ( = ‘i’ in the flowchart shown in Figure 4.6) acts as
the select input to the multiplexer MUX[j]. This condition is achieved in VHDL code
(Appendix A and Appendix B) by giving memory access to any processor (it’s FLAG[i]
is set equal to logic ‘1’) only if its CTRL[i] is in the logic ‘1’ state and it has the deepest
queue depth among the processors requesting access within the same memory block.
The Interconnect gives all the processors the flexibility of simultaneous reads or
writes for those processors that are granted requests by the priority logic. In the best case
all processors will have their requests granted. This is the case when CTRL of all
processors is ‘1’ and no two processors have the same ADDR_BL. In this case the binary
34
value of FLAG is ‘1111’, after the completion of all iterations of the priority logic. The
VHDL description of the crossbar interconnect network implementation has a single
function which describes the processor prioritization done in all the memory blocks (ie
the corresponding priority logic blocks). The described function works for any number of
processors or memory blocks.
Figures 4.1 through 4.6 show the block level design of the interconnect network.
In the best case when all processors access different memory blocks, all the processors
receive request grants (FLAGs) in the logic ‘1’ state and all get their connections to
different memory blocks. In this case the Interconnect is used to the fullest of its capacity,
having different processors communicating with different memory blocks
simultaneously.
35
Chapter 5
VHDL Design Capture, Simulation, Synthesis, and Implementation
Flow
VHDL, the Very High Speed Integrated Circuit Hardware Description Language,
became a key tool for design capture and development of personal computers, cellular
telephones, and high-speed data communications devices during the 1990s. VHDL is a
product of the Very High Speed Integrated Circuits (VHSIC) program funded by the
department of defense in the 1970s and 80s. VHDL provides both low-level and high-
level language constructs that enable designers to describe small and large circuits and
systems. It provides portability of code between simulation and synthesis tools, as well as
device-independent design. It also facilitates converting a design from a programmable
logic to an Application Specific Integrated Circuit (ASIC) implementation. VHDL is an
industry standard for the description, simulation, modeling and synthesis of digital
circuits and systems. The main reason behind the growth in the use of VHDL can be
attributed to synthesis, the reduction of a design description to a lower-level circuit
representation. A design can be created and captured using VHDL without having to
choose a device for implementation. Hence it provides a means for a device-independent
design. Device-independent design and portability allows benchmarking a design using
different device architectures and different synthesis tools.
Electronic Design Automation (EDA) Design Tool Flow:
The EDA design tool flow is as follows:
• Design description (capture) in VHDL
• Pre-synthesis simulation for design verification/validation
• Synthesis
• Post-synthesis simulation for design verification/validation
• Implementation (Map, Place and Route)
• Post-implementation simulation for design verification/validation
• Design optimization
36
• Final implementation to FPGA/CPLD/ or ASIC technology.
The inputs to the synthesis EDA software tool are the VHDL code, synthesis directives
and the device technology selection. Synthesis directives include different kinds of
external as well as internal directives that influence the device implementation process.
The required device selection is done during this process.
Field Programmable Gate Array (FPGA):
The FPGA architecture is an array of logic cells (blocks) that communicate with
one another and with I/O via wires within routing channels. Like a semi-custom gate
array, which consists of an array of transistors, an FPGA consists of an array of logic
cells [8,10]. A FPGA chip consists of an array of logic blocks and routing channels as
shown in Figure 5.1. Each circuit or system must be mapped into the smallest square
FPGA that can accommodate it.
Figure 5.1: FPGA Architecture
Each logic block contains or consists of a number of RAM based Look Up Tables
(LUTs) used for logic function implementation and D-type flip-flops in addition to
several multiplexers used for signal routing or logic function implementation. FPGA
37
routing can be segmented and/or un-segmented. Un-segmented routing is when each
wiring segment spans only one logic block before it terminates in a switch box. By
turning on some of the programmable switches within a switch box, longer paths can be
constructed. The design for this project is implemented (prototyped) to a Spartan XL
FPGA chip which is a XILINX product [8]. It is a PROM based FPGA.
The one that is used for this project is an XCS10PC84 from the XL family. The
FPGA is implemented with a regular, flexible, programmable architecture of
Configurable Logic Blocks (CLBs), routing channels and surrounded by I/O devices. The
FPGA is provided with a clock rate of 50 Mhz. There are two more configurable clocks
on the chip. The Spartan XCS10PC84XL is an 84- pin device with 466 logic cells and is
approximately equivalent to 10,000 gates. Typically, the gate range for the XL chips will
be from 3000-12,000. The XCS10 has a 14x14 CLB matrix with 196 total CLBs. There
are 616 flip-flops in the chip and the maximum available I/Os on the chip are 112.
Digilab XL Prototype Board:
Digilab XL prototype boards [8] feature a Xilinx Spartan FPGA (either 3.3V or
5V) and all the I/O devices needed to implement a wide variety of circuits. The FPGA on
the board can be programmed directly from a PC using an included JTAG cable, or from
an on-board PROM. A view of the board is shown in the figure below. The board has one
internally generated clock and two configurable clocks.
Figure 5.2: Digilab Spartan XL Prototyping Board
38
The Digilab prototype board contains 8 LEDs and a seven-segment display which
can be used for monitoring prototype input/output signals of interest when testing the
prototype system programmed into the Spartan XL chip on the board.
VHDL Design Capture:
Behavioral VHDL description was used in design capture and coding of the
crossbar interconnect network design and logic. The structural equivalents of two
behavioral VHDL descriptions is shown in Figures 6.1 and 6.2 of the next chapter. The
scenario of a processor trying to access a particular memory block and whether its request
is granted (FLAG = ‘1’) or rejected (FLAG = ‘0’) can be generalized to all processors
and all memory blocks. Hence, the main VHDL code has a function ‘flg’, an entity
‘main’ and a process ‘P1’ and it is possible to increase the number of processors, memory
blocks (and address bus) , memory locations in each memory block (and address bus),
width of data bus, and the width of the processor queue depth bus.
Appendix A contains the VHDL code, structured as shown in Figure 6.1, which
describes the crossbar interconnect network assuming the number of processors, the
number of memory blocks, and the number of addressable locations in each memory
block to be ‘4’. The input queue depth of each processor is 4-bits wide. This VHDL code
is described considering the crossbar interconnect network and the shared memory as a
single functional unit as shown in Figure 6.1. This code is tested for correct design
capture and interconnect network functionality via the pre-synthesis, post-synthesis and
post-implementation VHDL simulation stages and is downloaded onto the XILINX based
Spartan XL FPGA [7,8] for prototype testing and evaluation. Appendix B contains the
VHDL code describing only the crossbar interconnection network as a single functional
unit as depicted in Figure 6.2. This code has more I/O pins than the previous code. This
code was tested via pre-synthesis and post-synthesis VHDL simulation. With the
exceptions of the I/O pins and shared memory, the functionality of both VHDL
interconnect network descriptions is the same and is identical to the description of the
crossbar interconnect network organization and architecture design described in Chapter
4.
39
Chapter 6
Design Validation via Post-Implementation Simulation Testing
There are two VHDL code descriptions of the interconnect network, one in
Appendix A and the other in Appendix B. Both have the same interconnect network
functionality but different modular structures. The VHDL code in Appendix A is
described considering both the crossbar interconnect network and the shared memory as a
single block as shown in Figure 6.1. The VHDL code described in Appendix A has only
processors to interface to the interconnect.
Figure 6.1: Block Diagram of VHDL Code Described in Appendix A.
MB0 MODULE main MB1 MB2 MB3
data_in
addr bus
qdep
ctrl
rw
clk
rst
flag
data_out
16
16
16
4
4
4
16
40
Figure 6.2: Block Diagram of VHDL Code Described in Appendix B.
main_ic
addr_prc
16
data_in_prc
16
qdep
16
4 ctrl
4 rw
clk
rst
data_out_prc
16
flag
4
8
addr_mem
16
data_in_mem 16
data_out_mem
4
rw_mem
41
The VHDL code in Appendix B describes the crossbar interconnect network as a
single block as shown in Figure 6.2. The VHDL code in Appendix B has two interfaces,
the processors to interconnect network interface and interconnect to memory blocks
interface. In both cases (Appendix A and Appendix B) the VHDL code describes a
crossbar interconnect network interfaced to four processors and four memory blocks and
each memory block has 4 addressable locations. The data bus of each processor is taken
to be 4-bits wide. Entity 'main' corresponds to the interconnect module described in
Appendix A and entity 'main_ic' corresponds to the interconnect described in Appendix
B. The VHDL descriptions of the interconnect network in Appendices A and B are
written in a generic parameterized manner such that additional processors and memory
modules may be interfaced to the interconnect and also, the size of the memory modules
may be increased. The size of the prototyped interconnect was kept small so that it would
fit into the Xilinx Spartan XL chip on the prototype board shown in Figure 5.2. No
functionality of the interconnect network was compromised by keeping the prototype to a
small size.
The VHDL description in Appendix A has a 16-bit ‘data_in’ input port to the
interconnect in which 4-bit data buses of all the 4 processors are included as shown in
Figure 6.3. The same is the case of ‘addr_bus’, ‘qdep’, ‘ctrl’, ‘rw’, ‘data_out’, ‘flag’.
Various scenarios such as:
1. all the processors writing to different memory locations in different memory ,
2. all the processors reading from different memory locations in different memory,
3. two or more processors requesting access to different memory locations within the
same memory block, and only the highest priority processor getting the grant to access
the memory block,
4. two or more processors requesting access to the same memory location, and only the
highest priority processor getting the grant to access the memory block,
5. two or more processors requesting access to the same memory block, and some of the
processors having the same queue depth, with only the highest priority processor getting
the grant to access the memory block,
42
Figure 6.3: Input Data Bus Format
6. one or more processors in idle state,
are tested in the module 'main' during the pre-synthesis, post-synthesis and post-
implementation simulations. This module is also downloaded onto a Xilinx based Spartan
XL FPGA chip and is tested under the various scenarios described above. Figures 6.4,
6.5, and 6.6 show the post-implementation simulation tracers, which show the behaviour
of the interconnect network and shared memory under different scenarios, described in
VHDL code in Appendix A.
The same coding style is followed in the VHDL code for ‘main_ic’ described in
Appendix B. The above mentioned scenarios of processor requests are tested on the
module 'main_ic' also during the pre-synthesis, post-synthesis simulations. Figures 6.7
and 6.8 show the post-synthesis simulation tracers, which shows the behaviour of the
interconnect network under different scenarios.
D A T A 0
D A T A 1
D A T A 2
D A T A 3
4
4
4
4
16
DATA_IN[15:0]
0
15
43
The simulation tracers in Figure 6.4 – 6.6 show the behaviour of the interconnect
module output ‘data_out’ and shared memory in different scenarios, which are explained
below. A testcase ‘top’ is developed to generate input stimulus to the module ‘main’ and
to display the control signals, address, data of all processors on LEDs of the Spartan XL
FPGA chip. ‘scnr’, ‘pid’, ‘addr’ and ‘data’ signals observed on the simulation tracer are
used in developing the testcase ‘top’, which is described in detail in Chapter 7. In this
chapter, the input stimulus and output (data_out and data in memory locations) observed
on the simulation tracers is discussed. (All the data mentioned in different scenarios and
that are shown on the simulation tracers are represented in hexadecimal system)
Scenario 0:
Input stimulus:
data_in <= x"4321" ;
addr_bus <= x"FB73" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"F" ;
In this case, processor '0' is requesting memory access within memory block '0'
(location '3'), processor '1' within memory block '1' (location '3') , processor '2' within
memory block '2' (location '3') and processor '3' within memory block '3' (location '3').
Hence there is no conflict between the processors for memory accesses. Processors '0', '1',
'2' and '3' (from 16-bit 'data_in' bus ) are writing '1', '2', '3' and '4' to the corresponding
memory locations. As all the processors get the memory access and hence the data '1', '2',
'3' and '4' is written to the memory location '3' in each of the four memory blocks. This
data can be observed in the corresponding memory locations, on the simulation tracer 1
in Figure 6.4, for ‘scnr’ = ‘0’.
Scenario 1:
Input stimulus:
data_in <= x"26FE" ;
44
addr_bus <= x"37BF" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"0" ;
In this case, processor '0' is requesting memory access within memory block '3'
(location '3'), processor '1' within memory block '2' (location '3') , processor '2' within
memory block '1' (location '3') and processor '3' within memory block '0' (location '3').
Hence there is no conflict between the processors for memory accesses. Processors '0', '1',
'2' and '3' are reading '4', '3', '2' and '1' from the corresponding memory locations. As all
the processors get the memory access and hence the data '4', '3', '2' and '1' is read from to
the memory locations 'F', 'B', '7' and '3'. This data can be observed on the ‘data_out’ bus,
on the simulation tracer 1 in Figure 6.4, for ‘scnr’ = ‘1’.
Scenarios '0' and '1' test the case of data exchange between processors. In the
scenario '0', processor '0' writes '1' to memory location '3' in memory block '0' and
processor '3' writes '4' to memory location '3' in memory block '3'. In scenario '1'
processor '0' reads '4' (Data written by processor '3' in scenario '0') from memory location
'3' in memory block '3' and processor '3' reads '1' (Data written by processor '0' in
scenario '0') from memory location '3' in memory block '0'. Similarly data exchange
between processors '1' and '2' is also tested.
Scenario 2:
Input stimulus:
data_in <= x"AAAA" ;
addr_bus <= x"CD56" ;
qdep <= x"EFFF" ;
ctrl <= x"F" ;
rw <= x"5" ;
45
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting for
memory access within the same memory block '1', but to different memory locations '2'
and '1'. Processor ‘1’ gets the access as its processor ID is greater than that of processor
‘0’ and queue depths of the processors is same. So processor ‘1’ reads ‘0’ from the
memory location '1' in the memory block '1'. This data is observed on ‘data_out’ bus on
the simulation tracer ‘1’ shown in Figure 6.4, for ‘scnr’ = ‘2’. Similarly, processor ‘2’
(for memory write) and processor ‘3’ (for memory read) are requesting for memory
access within same memory block '3', but to different memory locations '0' and '1'.
Processor ‘2’ gets the access as its queue depth is greater than that of processor ‘3’. So
processor ‘2’ writes 'A' to the memory location '1' in the memory block '3'. This data is
observed in the memory location ‘D’ on the simulation tracer ‘1’ shown in Figure 6.4, for
‘scnr’ = ‘2’.
Scenario 3:
Input stimulus:
data_in <= x"5555" ;
addr_bus <= x"DC65" ;
qdep <= x"4434" ;
ctrl <= x"F" ;
rw <= x"5" ;
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting for
memory access within same memory block '1', but to different memory locations '1' and
'2'. Processor ‘0’ gets the access as its queue depth is greater than that of processor ‘1’. So
processor ‘0’ writes ‘5’ to the memory location '1' in the memory block '1'. This data is
observed in the memory location ‘5’, on the simulation tracer ‘2’ shown in Figure 6.5, for
‘scnr’ = ‘3’. Similarly, processor ‘2’ (for memory write) and processor ‘3’ (for memory
read) are requesting for memory access within same memory block '3', but to different
memory locations '0' and '1'. Processor ‘3’ gets the access as its processor ID is greater
than that of processor ‘2’ and queue depths of the processors is same. So processor ‘3’
reads 'A' (which was written by processor ‘1’ in the previous scenario) from the memory
46
location '1' in the memory block '3'. This is observed on ‘data_out’ bus , on the simulation
tracer ‘2’ shown in Figure 6.5, for ‘scnr’ = ‘3’.
Scenario 4:
Input stimulus:
data_in <= x"9999" ;
addr_bus <= x"3399" ;
qdep <= x"4EE4" ;
ctrl <= x"F" ;
rw <= x"C" ;
In this case, processor ‘0’ and processor ‘1’ are requesting for memory access
(read) with same memory location ‘1’ within same memory block ‘2’. Processor ‘1’ gets
the priority as its queue depth is greater than that of processor ‘0’. Hence processor ‘1’
reads ‘0’ (which is reset value) from memory location ‘9’ and is observed on ‘data_out’
bus on the simulation tracer ‘2’, shown in Figure 6.5, for ‘scnr’ = ‘4’. Processor ‘2’ and
processor ‘3’ are requesting for memory access (write) with same memory location
within same memory block. Processor ‘2’ gets the priority as its queue depth is greater
than that of processor ‘3’. Hence processor ‘2’ writes ‘9’ to the memory location ‘3’ in
memory block ‘0’. This data is observed in the memory location ‘3’, on the simulation
tracer ‘2’ shown in Figure 6.5, for ‘scnr’ = ‘4’.
Scenario 5:
Input stimulus:
data_in <= x"EEEE" ;
addr_bus <= x"7654" ;
qdep <= x"5556" ;
ctrl <= x"F" ;
rw <= x"1" ;
47
In this case, processor ‘0’ (write), processor ‘1’ (read), processor ‘2’ (read) and
processor ‘3’ (read) are requesting for memory access within same memory block ‘1’, but
to different memory locations. Processor ‘0’ gets the priority as it has greatest queue
depth of all the processors. Hence processor ‘0’ writes 'E' to the memory location '0' in
the memory block '1'. This data is observed in the memory location ‘4’, on the simulation
tracer ‘2’ shown in Figure 6.5, for ‘scnr’ = ‘5’.
Scenario 6:
Input stimulus:
data_in <= x"CCCC" ;
addr_bus <= x"6666" ;
qdep <= x"5555" ;
ctrl <= x"E" ;
rw <= x"0" ;
In this case, processor ‘1’ (read), processor ‘2’ (read) and processor ‘3’ (read) are
requesting for memory access with the same memory location ‘2’ in the memory block
‘1’. Processor ‘3’ gets the priority as its has greatest processor ID of all the processors.
Hence processor ‘3’ reads '0' (reset value) from the memory location '2' in the memory
block '1', and is observed on ‘data_out’ bus on the simulation tracer ‘3’, shown in Figure
6.6, for ‘scnr’ = ‘6’.
Scenario 7:
Input stimulus:
data_in <= x"FFFF" ;
addr_bus <= x"EEAE" ;
qdep <= x"0011" ;
ctrl <= x"0" ;
rw <= x"0" ;
48
In this case, all processors are in idle state. No transactions are performed through
the interconnect. Hence no changes are observed in any of the memory locations or on
‘data_bus’ and are observed on the simulation tracer ‘3’, shown in Figure 6.6, for ‘scnr’ =
‘7’.
The simulation tracers ‘1’, ‘2’ and ‘3’ are shown in Figures 6.4, 6.5 and 6.6 in the
next three pages.
49
Figure 6.4: Simulation tracer 1
50
Figure 6.5: Simulation tracer 2
51
Figure 6.6: Simulation tracer 3
52
The simulation tracers in Figure 6.7 and 6.8 show the behaviour of the
interconnect module ‘main_ic’ in different scenarios, which are explained below. In this
chapter, the input stimulus and the output of the module ‘main_ic’ under each scenario,
observed on the simulation tracers is discussed. (All the data mentioned in different
scenarios and that are shown on the simulation tracers are represented in hexadecimal
system)
Scenario 0:
Input stimulus:
data_in_prc <= x"4321" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"FB73" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"F" ;
In this case, processor '0' is requesting memory access within memory block '0'
(location '3'), processor '1' within memory block '1' (location '3'), processor '2' within
memory block '2' (location '3') and processor '3' within memory block '3' (location '3').
There is no conflict between the processors for memory access, so ‘flag’ is '1' for all
processors. Processors '0', '1', '2' and '3' write '1', '2', '3' and '4' respectively. Since all
the processors get memory access, the data '1', '2', '3' and '4' (from the ‘data_in_prc’
bus) are written to the hexbit (a 4-bit binary value represented as one hexadecimal digit)
of the ‘data_out_mem’ bus corresponding to each memory block, ‘addr_mem’ for each
memory block gets ‘3’, indicating the location address within that memory block, and
‘rw_mem’ for all memory blocks becomes ‘1’, indicating a memory write operation. This
can be observed on simulation tracer 4 in Figure 6.7, for ‘addr_prc’ = ‘FB73’ and
‘rst’ = ‘0’.
Scenario 1:
Input stimulus:
data_in_prc <= x"26FE" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"37BF" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"0" ;
In this case, processor '0' is requesting memory access within memory block '3'
(location '3'), processor '1' within memory block '2' (location '3'), processor '2' within
memory block '1' (location '3') and processor '3' within memory block '0' (location '3').
There is no conflict between the processors for memory access. Processors '0', '1',
'2' and '3' read 'F', 'E', 'D' and 'C' respectively from the ‘data_in_mem’ bus. This data
can be observed on the ‘data_out_prc’ bus, on simulation tracer 4 in Figure 6.7, for
‘addr_prc’ = ‘37BF’. It is also observed that ‘addr_mem’ for each memory block gets ‘3’,
indicating the location address within that memory block, and ‘rw_mem’ for all memory
blocks becomes ‘0’, indicating a memory read operation.
Scenario 2:
Input stimulus:
data_in_prc <= x"AAAA" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"CD56" ;
qdep <= x"EFFF" ;
ctrl <= x"F" ;
rw <= x"5" ;
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting
memory access within the same memory block '1', but to different memory locations '2'
and '1'. Processor ‘1’ gets the access because its processor ID is greater than that of
processor ‘0’ and the queue depths of the two processors are the same. So processor ‘1’
reads ‘D’ from the ‘data_in_mem’ bus. This data is observed on the ‘data_out_prc’ bus
on simulation tracer ‘4’, shown in Figure 6.7, for ‘addr_prc’ = ‘CD56’. Similarly,
processor ‘2’ (write) and processor ‘3’ (read) are requesting memory access within the
same memory block '3', but to different memory locations '1' and '0'. Processor ‘2’ gets
the access because its queue depth is greater than that of processor ‘3’. So processor ‘2’
writes 'A' to the hexbit of the ‘data_out_mem’ bus corresponding to memory block ‘3’,
and the data is observed on the ‘data_out_mem’ bus on simulation tracer ‘4’, shown in
Figure 6.7, for ‘addr_prc’ = ‘CD56’. ‘flag’ is ‘1’ for only two processors (processors ‘1’
and ‘2’), ‘addr_mem’ corresponding to memory blocks ‘1’ and ‘3’ becomes ‘1’, indicating
the location address within those memory blocks, and ‘rw_mem’ for those blocks
becomes ‘0’ and ‘1’, indicating ‘read’ and ‘write’ operations respectively.
Scenario 3:
Input stimulus:
data_in_prc <= x"5555" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"DC65" ;
qdep <= x"4434" ;
ctrl <= x"F" ;
rw <= x"5" ;
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting
memory access within the same memory block '1', but to different memory locations '1'
and '2'. Processor ‘0’ gets the access because its queue depth is greater than that of
processor ‘1’. So processor ‘0’ writes ‘5’ to the hexbit of the ‘data_out_mem’ bus
corresponding to memory block '1'. This data is observed on that hexbit of the
‘data_out_mem’ bus on simulation tracer ‘4’, shown in Figure 6.7, for ‘addr_prc’ =
‘DC65’. Similarly, processor ‘2’ (write) and processor ‘3’ (read) are requesting memory
access within the same memory block '3', but to different memory locations '0' and '1'.
Processor ‘3’ gets the access because its processor ID is greater than that of processor ‘2’
and the queue depths of the two processors are the same. So processor ‘3’ reads 'F' from
the hexbit of the ‘data_in_mem’ bus corresponding to memory block ‘3’. This is observed
on the hexbit of the ‘data_out_prc’ bus corresponding to processor ‘3’, on simulation
tracer ‘4’, shown in Figure 6.7, for ‘addr_prc’ = ‘DC65’. ‘flag’ is ‘1’ for only two
processors (processors ‘0’ and ‘3’), ‘addr_mem’ corresponding to memory blocks ‘1’ and
‘3’ becomes ‘1’, indicating the location address within those memory blocks, and
‘rw_mem’ for those blocks becomes ‘1’ and ‘0’, indicating ‘write’ and ‘read’ operations
respectively.
Scenario 4:
Input stimulus:
data_in_prc <= x"9999" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"3399" ;
qdep <= x"4EE4" ;
ctrl <= x"F" ;
rw <= x"C" ;
In this case, processor ‘0’ and processor ‘1’ are requesting memory access
(read) to the same memory location ‘1’ within the same memory block ‘2’. Processor ‘1’
gets priority because its queue depth is greater than that of processor ‘0’. Hence
processor ‘1’ reads ‘E’ from the hexbit of the ‘data_in_mem’ bus corresponding to
memory block ‘2’, which is observed on the ‘data_out_prc’ bus on simulation tracer ‘5’,
shown in Figure 6.8, for ‘addr_prc’ = ‘3399’. Processor ‘2’ and processor ‘3’ are
requesting memory access (write) to the same memory location within the same memory
block. Processor ‘2’ gets priority because its queue depth is greater than that of
processor ‘3’. Hence processor ‘2’ writes ‘9’ to the hexbit of the ‘data_out_mem’ bus
corresponding to memory block ‘0’. This data is observed on the ‘data_out_mem’ bus on
simulation tracer ‘5’, shown in Figure 6.8, for ‘addr_prc’ = ‘3399’. ‘flag’ is ‘1’ for only
two processors (processors ‘1’ and ‘2’), ‘addr_mem’ corresponding to memory blocks ‘0’
and ‘2’ becomes ‘3’ and ‘1’ respectively, indicating the location addresses within those
memory blocks, and ‘rw_mem’ for those blocks becomes ‘1’ and ‘0’, indicating ‘write’
and ‘read’ operations respectively.
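What the tracer shows for a granted request maps directly onto the memory-side interface of ‘main_ic’. Condensed from process P1 of Appendix B, the granted-request path for processor ‘0’ is:

  -- On a grant, processor 0's address hexbit steers its request to the
  -- addressed memory block (addr_prc(3 downto 2) selects the block).
  if (flag(0) = '1') then
    addr_mem_2d(conv_integer(addr_prc(3 downto 2))) <= addr_prc(1 downto 0);
    rw_mem(conv_integer(addr_prc(3 downto 2)))      <= rw(0);
    if (rw(0) = '1') then
      -- write: route the processor's data hexbit toward the memory block
      data_out_mem_2d(conv_integer(addr_prc(3 downto 2))) <= data_in_prc(3 downto 0);
      data_out_prc(3 downto 0) <= (others => 'Z');
    else
      -- read: route the memory block's data hexbit back to the processor
      data_out_prc(3 downto 0) <= data_in_mem_2d(conv_integer(addr_prc(3 downto 2)));
    end if;
  end if;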
Scenario 5:
Input stimulus:
data_in_prc <= x"EEEE" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"7654" ;
qdep <= x"5556" ;
ctrl <= x"F" ;
rw <= x"1" ;
In this case, processor ‘0’ (write), processor ‘1’ (read), processor ‘2’ (read) and
processor ‘3’ (read) are requesting memory access within the same memory block ‘1’, but
to different memory locations. Processor ‘0’ gets priority because it has the greatest
queue depth of all the processors. Hence processor ‘0’ writes 'E' to the hexbit of the
‘data_out_mem’ bus corresponding to memory block '1'. This data is observed on
simulation tracer ‘5’, shown in Figure 6.8, for ‘addr_prc’ = ‘7654’. ‘flag’ is ‘1’ for only
processor ‘0’, ‘addr_mem’ corresponding to memory block ‘1’ becomes ‘0’, indicating
the location address within that memory block, and ‘rw_mem’ for that memory block
becomes ‘1’, indicating a ‘write’ operation.
Scenario 6:
Input stimulus:
data_in_prc <= x"CCCC" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"6666" ;
qdep <= x"5555" ;
ctrl <= x"E" ;
rw <= x"0" ;
In this case, processor ‘1’ (read), processor ‘2’ (read) and processor ‘3’ (read) are
requesting memory access to the same memory location ‘2’ in memory block ‘1’.
Processor ‘3’ gets priority because it has the greatest processor ID of all the requesters.
Hence processor ‘3’ reads 'D' from the hexbit of the ‘data_in_mem’ bus corresponding to
memory block '1', which is observed on the ‘data_out_prc’ bus on simulation tracer ‘5’,
shown in Figure 6.8, for ‘addr_prc’ = ‘6666’. ‘flag’ is ‘1’ for only processor ‘3’,
‘addr_mem’ corresponding to memory block ‘1’ becomes ‘2’, indicating the location
address within that memory block, and ‘rw_mem’ for that memory block becomes ‘0’,
indicating a ‘read’ operation.
Scenario 7:
Input stimulus:
data_in_prc <= x"FFFF" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"EEAE" ;
qdep <= x"0011" ;
ctrl <= x"0" ;
rw <= x"0" ;
In this case, all processors are in the idle state. No transactions are performed through the interconnect. Hence no changes are observed in any of the memory locations or on the ‘data_out_prc’ bus, as shown on simulation tracer ‘5’ in Figure 6.8 for ‘addr_prc’ = ‘EEAE’.
The simulation tracers ‘4’ and ‘5’ are shown in Figures 6.7 and 6.8 below.
Figure 6.7: Simulation tracer 4
Figure 6.8: Simulation tracer 5
The post-implementation simulation tracers shown in Figures 6.4, 6.5 and 6.6
show that the interconnect network module ‘main’, described in Appendix A,
performed correctly from a functional standpoint for all scenarios. The
post-implementation simulation tracers shown in Figures 6.7 and 6.8 show that the
interconnect network module ‘main_ic’, described in Appendix B, performed correctly
from a functional standpoint for all scenarios that were successfully tested on the
interconnect module ‘main’.
Chapter 7
Experimental Prototype Development, Testing and Validation Results
The VHDL coded interconnect network system and interfaced memory modules
described in the code of Appendix A were synthesized, implemented, and programmed
into a Xilinx Spartan XL FPGA chip using the Xilinx Foundation 3.1i CAD tool
set [7]. The FPGA chip used as the target in downloading the VHDL code generated
bit stream was the earlier mentioned Xilinx XCS10PC84 from the Spartan XL family.
Since the module ‘main’ has a large number of I/O pins, a testcase ‘top’ was developed
and described in VHDL; it generates stimulus for the inputs of ‘main’ under the various
scenarios and routes a selected set of outputs of ‘main’ to the LEDs of the prototype
board for output observation and evaluation. Figure 7.1 shows the block diagram of
‘top’.
Figure 7.1: Top Level Block Diagram of Testcase ‘top’.
[Block diagram: a Stimulus generator, driven by ‘clk’, ‘rst’ and the 3-bit ‘scnr’ input, feeds the module ‘main’; both the generated ‘Input’ signals and the ‘Output’ of ‘main’ feed a ‘display’ block which, selected by the 3-bit ‘pid’ input, drives the 4-bit ‘addr’ and 4-bit ‘data’ outputs.]
The function of ‘top’ is to generate stimulus for the input pins of module ‘main’.
The 3-bit primary input ‘scnr’ (scenario) to the testcase ‘top’ determines which set of
inputs is applied to the module ‘main’. The signal ‘Input’ shown in Figure 7.1 is
the stimulus given to all the input ports of module ‘main’, and the signal ‘Output’ is the
output of module ‘main’. Both ‘Input’ and ‘Output’ are input ports to the block
‘display’. There are 8 different scenarios that can be tested experimentally, and there are 8
LEDs on the proto-board which are used to display and verify inputs and outputs of the
under-test module ‘main’. Another function of ‘top’ is to display the required set of
inputs or outputs depending upon the value of the 3-bit primary input ‘pid’ to the testcase
‘top’; the 4-bit ‘addr’ and 4-bit ‘data’ are the primary outputs of the testcase ‘top’, and
‘pid’ determines the set of signals that is displayed on the 8 LEDs. For example, a ‘000’
value of ‘pid’ displays the 4-bit input queue depth ‘qdep’ of processor ‘0’ on the 4-bit
‘data’, the ‘flag’ of processor ‘0’ on ‘addr’ bits ‘3’ and ‘2’, the ‘rw’ of processor ‘0’ on
‘addr’ bit ‘1’ and the ‘ctrl’ of processor ‘0’ on ‘addr’ bit ‘0’. A ‘001’ value of ‘pid’
displays the 4-bit ‘addr_bus’ hexbit of processor ‘0’ on ‘addr’ and, depending on ‘rw’,
the 4-bit ‘data_in’ or ‘data_out’ hexbit of processor ‘0’ on the ‘data’ pins. In a similar
fashion, ‘010’ and ‘011’ values of ‘pid’ display the corresponding signals of processor
‘1’, ‘100’ and ‘101’ values of ‘pid’ display the corresponding signals of processor ‘2’,
and ‘110’ and ‘111’ values of ‘pid’ display the corresponding signals of processor ‘3’.
A 50 MHz clock generated internally on the FPGA chip is another primary input to the
testcase ‘top’ and hence to module ‘main’. During the implementation stage, the inputs
and outputs are assigned to the I/O pins of the FPGA chip; this information is entered in
the ‘top.ucf’ file. The scenario select bits scnr(0), scnr(1) and scnr(2) are assigned to pins
‘p25’, ‘p26’ and ‘p27’, which are switches 4, 3 and 2 on the proto-board. The display
select bits pid(0), pid(1) and pid(2) are assigned to pins ‘p19’, ‘p20’ and ‘p23’, which are
switches 8, 7 and 6 on the proto-board. The 4-bit ‘addr’ is assigned to pins ‘p66’, ‘p67’,
‘p68’ and ‘p69’, which are LEDs 4, 3, 2 and 1 on the proto-board, and the 4-bit ‘data’ is
assigned to pins ‘p60’, ‘p61’, ’p62’ and ‘p65’, which are LEDs 8, 7, 6 and 5 on the
proto-board. The reset signal ‘rst’ is assigned to ‘p28’, which is switch 1 on the
proto-board. By changing switches 4, 3 and 2 on the proto-board, different sets of inputs
are given to the module ‘main’, and the corresponding interconnection network behavior
and outputs of module ‘main’ can be seen on the 8 LEDs for any processor by selection
through switches 8, 7 and 6 on the proto-board.
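For concreteness, the ‘pid’ decode for processor ‘0’, condensed from the ‘disply’ process of Appendix A (the remaining ‘pid’ values handle processors ‘1’ through ‘3’ analogously), is:

  case pid(2 downto 0) is
    when "000" =>
      data    <= qdep(3 downto 0);    -- queue depth of processor 0
      addr(0) <= ctrl(0);             -- ctrl of processor 0
      addr(1) <= rw(0);               -- rw of processor 0
      addr(2) <= flag(0);             -- flag of processor 0,
      addr(3) <= flag(0);             --   duplicated on addr bits 3 and 2
    when "001" =>
      if (rw(0) = '0') then
        data <= data_out(3 downto 0); -- read: data returned from memory
      else
        data <= data_in(3 downto 0);  -- write: data sent to memory
      end if;
      addr <= addr_bus(3 downto 0);   -- address hexbit of processor 0
    when others =>
      null;
  end case;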
All the scenarios that were successfully tested during post-implementation
simulation were also tested on the proto-board. The scenarios are described earlier in
Chapter 6 and in the comments of Appendix A, and the simulation tracers for these
scenarios are shown in Figures 6.4, 6.5 and 6.6. The post-implementation simulation
results and the experiments performed on the proto-board showed consistent results,
demonstrating that the interconnect network performed correctly from a functional
standpoint for all scenarios.
Interconnect Network Performance:
The design was tested on the proto-board at 50 MHz, but setup-time and hold-time
violations were observed on the post-synthesis simulation tracer at a 50 MHz clock
frequency in some cases where the inputs change frequently. These violations were not
observed on the simulation tracer at a 20 MHz clock frequency. The total delay between
any processor and any memory block, which includes connection establishment and
network latency, is about 10 ns; this is within one-half the clock period when operated at
20 MHz. Hence a processor can access data from two different memory blocks on two
successive falling edges of ‘clk’. A similar scenario was tested in both post-implementation
simulation and experimental testing on the proto-board. This can be observed on
simulation tracer ‘1’ shown in Figure 6.4, where processor ‘1’ reads ‘3’ from memory
location ‘3’ in memory block ‘3’ (shown at ‘scnr’ = ‘1’) and reads ‘0’ from memory
location ‘2’ in memory block ‘1’ (shown at ‘scnr’ = ‘2’). On the simulation tracer the
scenario does not change from ‘1’ to ‘2’ on successive clocks, but similar data transfers
between any processor and any memory block can occur on two successive clocks.
At this 20 MHz clock frequency, the maximum bandwidth possible for data transfers
through the interconnect is 40 MB/s. This is the case when all the processors access
different memory blocks without any conflict at any memory block. On every falling edge
of ‘clk’, 4 bits of data can be transferred between any processor and any memory block.
Hence, in the best case, when all processors access different memory blocks, 16 bits
(2 bytes) of data are transferred through the interconnect network on every falling edge of
‘clk’. The connection establishment between any processor and any memory block does
not take more than one-half of a clock period, as required for the transfer to complete on
the falling edge.
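As a quick check of these figures (a back-of-the-envelope calculation, assuming one 16-bit, four-hexbit transfer per clock period at 20 MHz):

    16 bits/clock × (20 × 10^6 clocks/s) = 320 Mbit/s = 40 Mbyte/s

A single processor moving one 4-bit hexbit per clock therefore sustains one quarter of this, or 10 Mbyte/s.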
Shown below is the resource utilization of the testcase ‘top’ implemented on a Spartan
XL (XCS10PC84) FPGA chip. This data gives the number of Configurable Logic Blocks
(CLBs), flip-flops, latches, Look-Up Tables (LUTs), and the equivalent gate count used in
implementing the testcase ‘top’. This data is generated during the post-implementation
process.
Resource Utilization Summary of Spartan XL FPGA (XCS10PC84) Chip for Testcase ‘top’ of Figure 7.1 Programmed to Chip:

Number of CLBs: 196 out of 196 (100%)
CLB Flip Flops: 96
CLB Latches: 0
4-input LUTs: 354
3-input LUTs: 130
Total equivalent gate count for design: 3,273
Chapter 8
Conclusions
A modular and scalable architecture and design for a crossbar
interconnect network of an HDCA single-chip multiprocessor system was first
presented. Its development, simulation validation, and experimental hardware
prototype validation were also successfully accomplished in this project. The
design capture, simulation (pre-synthesis, post-synthesis and post-
implementation) and implementation were done in VHDL using Xilinx
Foundation PC based CAD software. The design was implemented as a
prototype in a PROM based Xilinx Spartan XL FPGA chip. The FPGA chip
used as the target in downloading the VHDL code generated bit stream was
the XCS10PC84 from the Xilinx Spartan XL family. The pre-synthesis,
post-synthesis and post-implementation VHDL functional simulation results
obtained from the designed interconnect network matched the FPGA chip
hardware prototype experimental results for all test scenarios, and all original
design specifications were met.
References

1. George Broomell and J. Robert Heath, “Classification Categories and Historical Development of Circuit Switching Topologies”, ACM Computing Surveys, Vol. 15, No. 2, pp. 95-133, June 1983.
2. J. Robert Heath, Paul Maxwell, Andrew Tan, and Chameera Fernando, “Modeling, Design, and Experimental Verification of Major Functional Units of a Scalable Run-Time Reconfigurable and Dynamic Hybrid Data/Command Driven Single-Chip Multiprocessor Architecture and Development and Testing of a First-Phase Prototype”, Private Communication, 2002.
3. M. L. Bos, “Design of a Chip Set for a Parallel Computer based on the Crossbar Interconnection Principle”, Proceedings of the 38th Midwest Symposium on Circuits and Systems, 1995, Vol. 2, pp. 752-756, 1996.
4. Enrique Coen-Alfaro and Gregory W. Donohoe, “A Comparison of Circuits for On-Chip Programmable Crossbar Switches”, www.mrc.unm.edu/symposiums/symp02/dsp/donohoe/coen_donohoe.doc.
5. J. R. Heath and S. Balasubramanian, “Development, Analysis, and Verification of a Parallel Dataflow Computer Architectural Framework and Associated Load-Balancing Strategies and Algorithms via Parallel Simulation”, SIMULATION, Vol. 69, No. 1, pp. 7-25, July 1997.
6. J. Robert Heath, George Broomell, Andrew Hurt, Jim Cochran, and Liem Le, “A Dynamic Pipeline Computer Architecture for Data Driven Systems”, Research Project Report, Contract No. DASG60-79-C-0052, University of Kentucky Research Foundation, Lexington, KY 40506, Feb. 1982.
7. Xilinx Foundation Software Version 3.1 Manual, www.xilinx.com.
8. Datasheets for Spartan XL Series Devices, www.digilent.cc.
9. Kevin Skahill, VHDL for Programmable Logic, Addison-Wesley, 1993.
10. Sudhakar Yalamanchali, Introductory VHDL: From Simulation to Synthesis, Prentice Hall, 2001.
11. http://www.gup.uni-linz.ac.at/thesis/diploma/christian_schaubschlaeger/html/chapter02a5.html
12. http://www.crhc.uiuc.edu/ece412/lectures/lecture20.pdf
Appendix A
Interconnect Network and Memory VHDL Code (Version 1)
The VHDL Code for the Crossbar Interconnect Network of the HDCA
System and Its Shared Memory Organization as Structured in Figures 7.1
and 6.1.
-- This VHDL code describes the Crossbar interconnect network and shared
-- memory organization with only one interface, on the processor side.
-- Equivalent Block diagram for module 'top' is shown in Figure 7.1
-- Equivalent Block diagram for module 'main' is shown in Figure 6.1

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity top is
  port (clk:  in std_logic;
        rst:  in std_logic;
        scnr: in std_logic_vector(2 downto 0);
        pid:  in std_logic_vector(2 downto 0);
        data: out std_logic_vector(3 downto 0);
        addr: out std_logic_vector(3 downto 0));
end top;

architecture test_top of top is

  component main is
    port (ctrl:     in std_logic_vector(3 downto 0);
          qdep:     in std_logic_vector(15 downto 0);
          addr_bus: in std_logic_vector(15 downto 0);
          data_in:  in std_logic_vector(15 downto 0);
          rw:       in std_logic_vector(3 downto 0);
          clk:      in std_logic;
          rst:      in std_logic;
          flag:     inout std_logic_vector(3 downto 0);
          data_out: out std_logic_vector(15 downto 0));
  end component main;

  signal ctrl:     std_logic_vector(3 downto 0);
  signal flag:     std_logic_vector(3 downto 0);
  signal qdep:     std_logic_vector(15 downto 0);
  signal addr_bus: std_logic_vector(15 downto 0);
  signal data_in:  std_logic_vector(15 downto 0);
  signal data_out: std_logic_vector(15 downto 0);
  signal rw:       std_logic_vector(3 downto 0);

begin

  -- This is equivalent to the Stimulus generator in Figure 7.1.
  -- All these scenarios are discussed in Chapter 6.
  stim_gen: process (scnr) is
  begin
    case scnr(2 downto 0) is
      when "000" =>
        -- all processors write to memory locations in different memory blocks
        data_in  <= x"4321";
        addr_bus <= x"FB73";
        qdep     <= x"1234";
        ctrl     <= x"F";
        rw       <= x"F";
      when "001" =>
        -- all processors read from memory locations in different memory
        -- blocks, in the reverse order
        data_in  <= x"26FE";
        addr_bus <= x"37BF";
        qdep     <= x"1234";
        ctrl     <= x"F";
        rw       <= x"0";
      when "010" =>
        -- processors 2 (write) and 3 (read) to different memory locations in
        -- the same memory block; processor 2 gets priority as its qdepth is greater.
        -- processors 0 (write) and 1 (read) to different memory locations in
        -- the same memory block; processor 1 gets priority as its processor id is greater.
        data_in  <= x"AAAA";
        addr_bus <= x"CD56";
        qdep     <= x"EFFF";
        ctrl     <= x"F";
        rw       <= x"5";
      when "011" =>
        -- processors 2 (write) and 3 (read) to different memory locations in
        -- the same memory block; processor 3 gets priority as its processor id is greater.
        -- processors 0 (write) and 1 (read) to different memory locations in
        -- the same memory block; processor 0 gets priority as its qdepth is greater.
        data_in  <= x"5555";
        addr_bus <= x"DC65";
        qdep     <= x"4434";
        ctrl     <= x"F";
        rw       <= x"5";
      when "100" =>
        -- processors 2 (write) and 3 (write) to the same memory location;
        -- processor 2 gets priority as its qdepth is greater.
        -- processors 0 (read) and 1 (read) from the same memory location;
        -- processor 1 gets priority as its qdepth is greater.
        data_in  <= x"9999";
        addr_bus <= x"3399";
        qdep     <= x"4EE4";
        ctrl     <= x"F";
        rw       <= x"C";
      when "101" =>
        -- processors 0 (write), 1 (read), 2 (read) and 3 (read) to different
        -- memory locations in the same memory block;
        -- processor 0 gets priority as its qdepth is greater.
        data_in  <= x"EEEE";
        addr_bus <= x"7654";
        qdep     <= x"5556";
        ctrl     <= x"F";
        rw       <= x"1";
      when "110" =>
        -- processors 1 (read), 2 (read) and 3 (read) to the same memory
        -- location (processor 0 is idle);
        -- processor 3 gets priority as its processor id is greater.
        data_in  <= x"CCCC";
        addr_bus <= x"6666";
        qdep     <= x"5555";
        ctrl     <= x"E";
        rw       <= x"0";
      when others =>
        -- all processors in the idle state; flag is "0000"
        data_in  <= x"FFFF";
        addr_bus <= x"EEAE";
        qdep     <= x"0011";
        ctrl     <= x"0";
        rw       <= x"0";
    end case;
  end process stim_gen;

  INST: main port map (clk => clk, rst => rst, data_in => data_in,
                       qdep => qdep, addr_bus => addr_bus, rw => rw,
                       ctrl => ctrl, flag => flag, data_out => data_out);

  -- This is equivalent to the 'display' block in Figure 7.1.
  disply: process (ctrl, scnr, pid) is
  begin
    case pid(2 downto 0) is
      when "000" =>
        data    <= qdep(3 downto 0);
        addr(0) <= ctrl(0);
        addr(1) <= rw(0);
        addr(2) <= flag(0);
        addr(3) <= flag(0);
      when "001" =>
        if (rw(0) = '0') then
          data <= data_out(3 downto 0);
        else
          data <= data_in(3 downto 0);
        end if;
        addr <= addr_bus(3 downto 0);
      when "010" =>
        data    <= qdep(7 downto 4);
        addr(0) <= ctrl(1);
        addr(1) <= rw(1);
        addr(2) <= flag(1);
        addr(3) <= flag(1);
      when "011" =>
        if (rw(1) = '0') then
          data <= data_out(7 downto 4);
        else
          data <= data_in(7 downto 4);
        end if;
        addr <= addr_bus(7 downto 4);
      when "100" =>
        data    <= qdep(11 downto 8);
        addr(0) <= ctrl(2);
        addr(1) <= rw(2);
        addr(2) <= flag(2);
        addr(3) <= flag(2);
      when "101" =>
        if (rw(2) = '0') then
          data <= data_out(11 downto 8);
        else
          data <= data_in(11 downto 8);
        end if;
        addr <= addr_bus(11 downto 8);
      when "110" =>
        data    <= qdep(15 downto 12);
        addr(0) <= ctrl(3);
        addr(1) <= rw(3);
        addr(2) <= flag(3);
        addr(3) <= flag(3);
      when others =>
        if (rw(3) = '0') then
          data <= data_out(15 downto 12);
        else
          data <= data_in(15 downto 12);
        end if;
        addr <= addr_bus(15 downto 12);
    end case;
  end process disply;

end architecture test_top;

-- The main interconnect module
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity main is
  port (ctrl:     in std_logic_vector(3 downto 0);
        qdep:     in std_logic_vector(15 downto 0);
        addr_bus: in std_logic_vector(15 downto 0);
        data_in:  in std_logic_vector(15 downto 0);
        rw:       in std_logic_vector(3 downto 0);
        clk:      in std_logic;
        rst:      in std_logic;
        flag:     inout std_logic_vector(3 downto 0);
        data_out: out std_logic_vector(15 downto 0));
end main;

architecture main_arch of main is

  type qd is array (3 downto 0) of std_logic_vector(3 downto 0);
  type data_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type addr_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type mb is array (3 downto 0) of std_logic_vector(1 downto 0);
  type mem_array is array (15 downto 0) of std_logic_vector(3 downto 0);

  -- This function does the priority logic for all the memory blocks.
  -- It is schematically equivalent to Figures 4.5 and 4.6 in the report.
  -- It can work for any number of processors and memory blocks by changing
  -- the 'i' and 'j' loop bounds.
  function flg (qdep, addr_bus, ctrl: std_logic_vector) return std_logic_vector is
    variable qdvar:   std_logic_vector(3 downto 0);
    variable flag:    std_logic_vector(3 downto 0);
    variable qdv:     std_logic_vector(3 downto 0);
    variable gnt:     std_logic;
    variable a:       integer;
    variable b:       integer;
    variable memaddr: mb;
    variable qd_arr:  qd;
  begin
    qd_arr(0) := qdep(3 downto 0);
    qd_arr(1) := qdep(7 downto 4);
    qd_arr(2) := qdep(11 downto 8);
    qd_arr(3) := qdep(15 downto 12);
    memaddr(0) := addr_bus(3 downto 2);
    memaddr(1) := addr_bus(7 downto 6);
    memaddr(2) := addr_bus(11 downto 10);
    memaddr(3) := addr_bus(15 downto 14);
    L1: for i in 0 to 3 loop
      L2: for j in 0 to 3 loop
        if (ctrl(j) = '0') then
          flag(j) := '0';
          qdv(j)  := '0';
        elsif (memaddr(j) = i) then
          qdv(j) := '1';
        else
          qdv(j) := '0';
        end if;
      end loop L2;
      qdvar := "0000";
      gnt   := '0';
      L3: for k in 0 to 3 loop
        if qdv(k) = '1' then
          if qdvar <= qd_arr(k) then
            qdvar := qd_arr(k);
            a     := k;
            gnt   := '1';
          else
            flag(k) := '0';
          end if;
        end if;
      end loop L3;
      if (gnt = '1') then
        flag(a) := '1';
      end if;
    end loop L1;
    return (flag);
  end flg;

  signal memory: mem_array;

begin

  P1: process (ctrl, clk, qdep, addr_bus, rst, data_in) is
  begin
    if (rst = '1') then
      flag <= "0000";
    else
      flag <= flg(qdep, addr_bus, ctrl);
      -- Memory transaction.
      -- The conditional statements make sure that the connection is
      -- established before the memory transaction.
      -- Equivalent to Figures 4.4 and 4.1 after the completion of the
      -- priority logic operation.
      -- This routine is to be repeated for each addition of a processor.
      if (clk'event and clk = '0') then
        if (flag(0) = '1') then
          if (rw(0) = '1') then
            memory(conv_integer(addr_bus(3 downto 0))) <= data_in(3 downto 0);
            data_out(3 downto 0) <= (others => 'Z');
          else
            data_out(3 downto 0) <= memory(conv_integer(addr_bus(3 downto 0)));
          end if;
        end if;
        if (flag(1) = '1') then
          if (rw(1) = '1') then
            memory(conv_integer(addr_bus(7 downto 4))) <= data_in(7 downto 4);
            data_out(7 downto 4) <= (others => 'Z');
          else
            data_out(7 downto 4) <= memory(conv_integer(addr_bus(7 downto 4)));
          end if;
        end if;
        if (flag(2) = '1') then
          if (rw(2) = '1') then
            memory(conv_integer(addr_bus(11 downto 8))) <= data_in(11 downto 8);
            data_out(11 downto 8) <= (others => 'Z');
          else
            data_out(11 downto 8) <= memory(conv_integer(addr_bus(11 downto 8)));
          end if;
        end if;
        if (flag(3) = '1') then
          if (rw(3) = '1') then
            memory(conv_integer(addr_bus(15 downto 12))) <= data_in(15 downto 12);
            data_out(15 downto 12) <= (others => 'Z');
          else
            data_out(15 downto 12) <= memory(conv_integer(addr_bus(15 downto 12)));
          end if;
        end if;
      end if;
    end if;
  end process P1;

end main_arch;
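A minimal simulation driver for the code above is sketched below; it is not part of the original report. It instantiates ‘top’, generates a 20 MHz clock (the frequency validated in Chapter 7), releases reset, and steps ‘scnr’ through the eight scenarios. The entity name ‘top_tb’ and the timing constants are illustrative assumptions.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity top_tb is
end top_tb;

architecture sim of top_tb is
  signal clk  : std_logic := '0';
  signal rst  : std_logic := '1';
  signal scnr : std_logic_vector(2 downto 0) := "000";
  signal pid  : std_logic_vector(2 downto 0) := "000";
  signal data : std_logic_vector(3 downto 0);
  signal addr : std_logic_vector(3 downto 0);
begin
  uut: entity work.top
    port map (clk => clk, rst => rst, scnr => scnr, pid => pid,
              data => data, addr => addr);

  clk <= not clk after 25 ns;  -- 50 ns period = 20 MHz

  stim: process
  begin
    wait for 100 ns;
    rst <= '0';                -- release reset
    for s in 0 to 7 loop       -- walk through scenarios 0..7
      scnr <= conv_std_logic_vector(s, 3);
      wait for 200 ns;         -- hold each scenario for a few clocks
    end loop;
    wait;                      -- end of simulation
  end process stim;
end sim;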
Appendix B
Interconnect Network VHDL Code (Version 2)
The VHDL Code for the Crossbar Interconnect Network as Structured in
Figure 6.2.
-- This VHDL code describes the Crossbar interconnect network with two
-- interfaces, one on the processor side and the other on the shared memory side.
-- Equivalent Block diagram for module 'main_ic' is shown in Figure 6.2

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity main_ic is
  port (ctrl:         in std_logic_vector(3 downto 0);
        qdep:         in std_logic_vector(15 downto 0);
        addr_prc:     in std_logic_vector(15 downto 0);
        data_in_prc:  in std_logic_vector(15 downto 0);
        data_in_mem:  in std_logic_vector(15 downto 0);
        rw:           in std_logic_vector(3 downto 0);
        clk:          in std_logic;
        rst:          in std_logic;
        flag:         inout std_logic_vector(3 downto 0);
        addr_mem:     out std_logic_vector(7 downto 0);
        rw_mem:       out std_logic_vector(3 downto 0);
        data_out_prc: out std_logic_vector(15 downto 0);
        data_out_mem: out std_logic_vector(15 downto 0));
end main_ic;

architecture main_arch of main_ic is

  type qd is array (3 downto 0) of std_logic_vector(3 downto 0);
  type data_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type addr_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type mb is array (3 downto 0) of std_logic_vector(1 downto 0);

  -- This function does the priority logic for all the memory blocks.
  -- It is schematically equivalent to Figures 4.5 and 4.6 in the report.
  -- It can work for any number of processors and memory blocks by changing
  -- the 'i' and 'j' loop bounds.
  function flg (qdep, addr_prc, ctrl: std_logic_vector) return std_logic_vector is
    variable qdvar:   std_logic_vector(3 downto 0);
    variable flag:    std_logic_vector(3 downto 0);
    variable qdv:     std_logic_vector(3 downto 0);
    variable gnt:     std_logic;
    variable a:       integer;
    variable b:       integer;
    variable memaddr: mb;
    variable qd_arr:  qd;
  begin
    qd_arr(0) := qdep(3 downto 0);
    qd_arr(1) := qdep(7 downto 4);
    qd_arr(2) := qdep(11 downto 8);
    qd_arr(3) := qdep(15 downto 12);
    memaddr(0) := addr_prc(3 downto 2);
    memaddr(1) := addr_prc(7 downto 6);
    memaddr(2) := addr_prc(11 downto 10);
    memaddr(3) := addr_prc(15 downto 14);
    L1: for i in 0 to 3 loop
      L2: for j in 0 to 3 loop
        if (ctrl(j) = '0') then
          flag(j) := '0';
          qdv(j)  := '0';
        elsif (memaddr(j) = i) then
          qdv(j) := '1';
        else
          qdv(j) := '0';
        end if;
      end loop L2;
      qdvar := "0000";
      gnt   := '0';
      L3: for k in 0 to 3 loop
        if qdv(k) = '1' then
          if qdvar <= qd_arr(k) then
            qdvar := qd_arr(k);
            a     := k;
            gnt   := '1';
          else
            flag(k) := '0';
          end if;
        end if;
      end loop L3;
      if (gnt = '1') then
        flag(a) := '1';
      end if;
    end loop L1;
    return (flag);
  end flg;

  signal data_in_mem_2d:  data_array;
  signal data_out_mem_2d: data_array;
  signal addr_mem_2d:     mb;

begin

  data_in_mem_2d(0) <= data_in_mem(3 downto 0);
  data_in_mem_2d(1) <= data_in_mem(7 downto 4);
  data_in_mem_2d(2) <= data_in_mem(11 downto 8);
  data_in_mem_2d(3) <= data_in_mem(15 downto 12);

  -- 'clk' is included in the sensitivity list so that the clocked branch
  -- below is evaluated on clock edges.
  P1: process (rst, clk, addr_prc, ctrl) is
  begin
    if (rst = '1') then
      flag <= "0000";
      -- clear all four processor data-out hexbits on reset
      data_out_prc(3 downto 0)   <= (others => '0');
      data_out_prc(7 downto 4)   <= (others => '0');
      data_out_prc(11 downto 8)  <= (others => '0');
      data_out_prc(15 downto 12) <= (others => '0');
    else
      flag <= flg(qdep, addr_prc, ctrl);
      -- Memory transaction.
      -- The conditional statements make sure that the connection is
      -- established before the memory transaction.
      -- Equivalent to Figures 4.4 and 4.1 after the completion of the
      -- priority logic operation.
      -- This routine is to be repeated for each addition of a processor.
      if (clk'event and clk = '1') then
        if (flag(0) = '1') then
          addr_mem_2d(conv_integer(addr_prc(3 downto 2))) <= addr_prc(1 downto 0);
          rw_mem(conv_integer(addr_prc(3 downto 2))) <= rw(0);
          if (rw(0) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(3 downto 2))) <= data_in_prc(3 downto 0);
            data_out_prc(3 downto 0) <= (others => 'Z');
          else
            data_out_prc(3 downto 0) <= data_in_mem_2d(conv_integer(addr_prc(3 downto 2)));
          end if;
        end if;
        if (flag(1) = '1') then
          addr_mem_2d(conv_integer(addr_prc(7 downto 6))) <= addr_prc(5 downto 4);
          rw_mem(conv_integer(addr_prc(7 downto 6))) <= rw(1);
          if (rw(1) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(7 downto 6))) <= data_in_prc(7 downto 4);
            data_out_prc(7 downto 4) <= (others => 'Z');
          else
            data_out_prc(7 downto 4) <= data_in_mem_2d(conv_integer(addr_prc(7 downto 6)));
          end if;
        end if;
        if (flag(2) = '1') then
          addr_mem_2d(conv_integer(addr_prc(11 downto 10))) <= addr_prc(9 downto 8);
          rw_mem(conv_integer(addr_prc(11 downto 10))) <= rw(2);
          if (rw(2) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(11 downto 10))) <= data_in_prc(11 downto 8);
            data_out_prc(11 downto 8) <= (others => 'Z');
          else
            data_out_prc(11 downto 8) <= data_in_mem_2d(conv_integer(addr_prc(11 downto 10)));
          end if;
        end if;
        if (flag(3) = '1') then
          addr_mem_2d(conv_integer(addr_prc(15 downto 14))) <= addr_prc(13 downto 12);
          rw_mem(conv_integer(addr_prc(15 downto 14))) <= rw(3);
          if (rw(3) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(15 downto 14))) <= data_in_prc(15 downto 12);
            data_out_prc(15 downto 12) <= (others => 'Z');
          else
            data_out_prc(15 downto 12) <= data_in_mem_2d(conv_integer(addr_prc(15 downto 14)));
          end if;
        end if;
      end if;
      addr_mem(1 downto 0) <= addr_mem_2d(0);
      addr_mem(3 downto 2) <= addr_mem_2d(1);
      addr_mem(5 downto 4) <= addr_mem_2d(2);
      addr_mem(7 downto 6) <= addr_mem_2d(3);
      data_out_mem(3 downto 0)   <= data_out_mem_2d(0);
      data_out_mem(7 downto 4)   <= data_out_mem_2d(1);
      data_out_mem(11 downto 8)  <= data_out_mem_2d(2);
      data_out_mem(15 downto 12) <= data_out_mem_2d(3);
    end if;
  end process P1;

end main_arch;