design, development, and simulation/experimental …heath/masters_proj_report_venugopal...1 design,...
TRANSCRIPT
1
Design, Development, and Simulation/Experimental Validation of a
Crossbar Interconnection Network for a Single-Chip Shared Memory
Multiprocessor Architecture
Master’s Project Report
June, 2002
Venugopal Duvvuri
Department of Electrical and Computer Engineering
University Of Kentucky
Under the Guidance of
Dr. J. Robert Heath
Associate Professor
Department of Electrical and Computer Engineering
University of Kentucky
2
Table of Contents
Topic Page Number
ABSTRACT 3
Chapter 1: Introduction, Background, and Positioning of Research 4
Chapter 2: Types of Interconnect Systems 8
Chapter 3: Multistage Interconnection Systems Complexity 16
Chapter 4: Design of the Crossbar Interconnect Network 28
Chapter 5: VHDL Design Capture, Simulation, Synthesis and
Implementation Flow 35
Chapter 6: Design Validation via Post-Implementation Simulation Testing 39
Chapter 7: Experimental Prototype Development, Testing, and Validation
Results 61
Chapter 8: Conclusions 65
References 66
Appendix A: Interconnect Network and Memory VHDL Code (Version 1) 67
Appendix B: Interconnect Network VHDL Code (Version 2) 76
3
ABSTRACT
This project involves modeling, design, Hardware Description Language (HDL) design
capture, synthesis, HDL simulation testing, and experimental validation of an
interconnect network for a Hybrid Data/Command Driven Computer Architecture
(HDCA) system, which is a single-chip shared memory multiprocessor architecture
system. Various interconnect topologies that may meet the requirements of the HDCA
system are studied and evaluated related to utilization within the HDCA system. It is
determined the Crossbar topology best meets the HDCA system requirements and it is
therefore used as the interconnect network of the HDCA system. The design capture,
synthesis, simulation and implementation is done in VHDL using XILINX Foundation
CAD software. A small reduced scale prototype design is implemented in a PROM based
Spartan XL Field Programmable Gate Array (FPGA) chip which is successfully
experimentally tested to validate the design and functionality of the described crossbar
interconnect network.
4
Chapter 1
Introduction, Background, and Positioning of Research
This project is first, the study of different kinds of interconnect networks that may
meet the requirements of a Hybrid Data/Command Driven Architecture (HDCA)
multiprocessor system [2,5,6] shown in Figure 1.1. The project then involves Vhsic
Hardware Description Language (VHDL) [10] description, synthesis, simulation testing,
and experimental prototype testing of an interconnect network which acts as a Computing
Element (CE) to data memory circuit switch for the HDCA system.
The HDCA system is a multiprocessor shared memory architecture. The shared
memory is organized as a number of individual memory blocks as shown in Figures 1.1,
3.4a, and 4.1 and is explained in detail in later chapters. This kind of memory
organization is required by this architecture. If two or more processors want to
communicate with memory locations within the same memory block, lower priority
processors have to wait until the highest priority processor gets its transaction done. Only
the highest priority processor will receive a request grant and the requests from other
lower priority processors must be queued and these requests are processed only after the
completion of the first highest priority transaction. The interconnect network to be
designed should be able to connect requesting processors on one side of the interconnect
network to the memory blocks on the other side of the interconnect network. The
efficiency of the interconnect network increases as the possible number of parallel
connections between the processors and the memory blocks increases. Interconnection
networks play a central role in determining the overall performance of a multiprocessor
system. If the network cannot provide adequate performance for a particular application,
nodes (or CE processors in this case) will frequently be forced to wait for data to arrive.
In this project, different types of interconnect networks, which may be applicable
to a HDCA system are addressed, and advantages and disadvantages of these
interconnects are discussed. Different types of interconnects, their routing mechanism,
and the complexity factor in designing the interconnects is also described in detail. This
project includes design and VHDL description and synthesis of a interconnect network
5
based on the crossbar topology, the topology which best meets HDCA system
requirements.
Input FIFOs
RAM Data Memory
CE-Data Memory Interconnect
CE0 CE1 CEn-1
Q Q Q
CE-Mapper
Control
Token Router
Control Token
Mapper (CTM) File Large File Memory File
CE-File Interconnect
=> Muiltifunctional Queue Q
… Inputs …
Outputs
Figure 1.1: Single-Chip Reconfigurable HDCA System (Large File Memory May Be Off-Chip)
FIFO
…...
…
… … …
…………………………….
.
.
.
…
…
6
The crossbar topology is a very popular interconnect network in industry today.
Interconnects are applicable to different kinds of systems having their own requirements.
In some systems, such as distributed memory systems, there should be a way that the
processors can communicate with each other. A crossbar topology (single sided topology)
[1] can be designed to meet the requirement of inter-processor communication and is
unique to distributed memory systems, because in distributed memory systems,
processors do not share a common memory. All the processors in the system are directly
connected to their own memory and caches. Any processor cannot directly access another
processor's memory. All communication between the processors is made possible through
the interconnection network. Hence, there is a need for inter-processor communication in
distributed memory architectures. The crossbar topology suitable for these architectures
is the single-sided crossbar network. All the processors are connected to an
interconnection network and communication between any two processors is possible. The
HDCA system does not need an interconnect that supports inter-processor
communication, as it is a shared memory architecture. For the shared memory
architectures, a double-sided crossbar network can be used as the interconnect network.
This design needs some kind of priority logic, which prioritizes conflicting requests for
memory accesses by the processors. This also requires a memory organization which is
shared by all processors. The HDCA system requires the memory to be divided into
memory blocks, each block containing memory locations with different address ranges.
The actual interconnect design is a combination of a crossbar interconnect (double sided
topology) [1], priority logic, and a shared memory organization. Another interconnect
architecture has been implemented as the interconnect for the CE to Data memory circuit
switch in an initial prototype of the HDCA system [2]. The initial HDCA prototype
assumes no processor conflicts in accessing a particular memory block, which can be
handled in the design presented here by the priority logic block. The input queue depth of
individual CE processors of a HDCA system is used by the priority logic block of the
proposed interconnect network in granting requests to the processor having the deepest
queue depth. The presented design is specific to the CE to Data memory circuit switch for
a HDCA system. The detailed crossbar interconnect network design is described in
7
Chapter 4. VHDL design capture, synthesis and implementation procedures are discussed
in Chapter 5. Chapter 6 includes the VHDL simulation testing setup and results. A test
case is described in chapter 6 which was tested during pre-synthesis HDL simulation,
post-synthesis HDL simulation, and the post-implementation HDL simulation process. In
Chapter 7 an experimental prototype of the crossbar interconnect network is developed
and tested to validate the presented interconnect architecture, design, and functionality.
8
Chapter 2
Types of Interconnect Systems
Interconnect networks can be classified as static or dynamic [11]. In the case of a
static interconnection network, all connections are fixed, i.e. the processors are wired
directly, whereas in the latter case there are routing switches in between. The decision
whether to use a static or dynamic interconnection network depends on the type of
problem to be solved by the computer system utilizing the interconnect system.
Generally, static topologies are suitable for problems whose communication patterns can
be predicted a priori reasonably well, whereas dynamic topologies (switching networks),
though more expensive, are suitable for a wider class of problems. Static networks are
mainly used in message passing networks and are mainly used for inter-processor
communications.
Types of Static Networks:
1. Star connected network:
Figure 2.1: Star Connected Network
In a star topology there is one central node computer, to which all other node
computers are connected; each node has one connection, except the center node, which
has N-1 connections. Routing in stars is trivial. If one of the communicating nodes is the
center node, then the path is just the edge connecting them. If not, the message is routed
from the source node to the center node, and from there to the destination node. Star
9
networks are not suitable for large systems, since the center node will become a
bottleneck with an increasing number of processors. A typical Star connected network is
shown in Figure 2.1.
2. Meshes:
Figures 2.2 and 2.3 show a typical 1-Dimensional (D) mesh and 2-D mesh
respectively. The simplest and cheapest way to connect the nodes of a parallel computer
is to use a one-dimensional mesh. Each node has two connections and boundary nodes
have one. If the boundary nodes are connected to each other, we have a ring, and all
nodes have two connections. The one-dimensional mesh can be generalized to a k-
dimensional mesh, where each node (except boundary nodes) has 2k connections. In
meshes, the dimension-order routing technique is used [12]. That is, routing is performed
in one dimension at a time. In a three-dimensional mesh for example, a message's path
from node (a,b,c) to the node (x,y,z) would be moved along the first dimension to node
(x,b,c), then, along the second dimension to node (x,y,c), and finally, in the third
dimension to the destination-node (x,y,z). This type of topology is not suitable to build
large-scale computers, since there is a wide range of latencies (the latency between
neighboring processors is much lower than between not-neighbors), and secondly the
maximum latency grows with the number of processors.
Figure 2.2: 1 - D Mesh
Figure 2.3: 2 - D Mesh
10
3. Hypercubes:
The hypercube topology is one of the most popular and used in many large-scale
systems. A k-dimensional hypercube has 2k nodes, each with k connections. Figure 2.4
shows a 4-D hypercude. Hypercubes scale very well, the maximum latency in a k-
dimensional (or "k-ary") hypercube is log2 N, with N = 2k.
An important property of hypercube interconnects is the relationship between
node-number and which nodes are connected together. The rule is that any two nodes in
the hypercube, whose binary representations differ in exactly one bit, are connected
together. For example in a four-dimensional hypercube, node 0 (0000) is connected to
node 1 (0001), node 2 (0010), node 4 (0100) and node 8 (1000). This numbering scheme
is called the Gray code scheme. A hypercube connected in this fashion is shown in Figure
2.5.
A k-dimensional hypercube is nothing more than a k-dimensional mesh with only
two nodes in each dimension, and thus the routing algorithm is the same as for meshes;
apart from one difference. The path from node A to node B is calculated by simply
calculating the Exclusive-OR, X = A XOR B, from the binary representations for node A
and B. If the ith bit in X is '1' the message is moved to the neighboring node in the ith
dimension. If the ith bit is '0', the message is not moved. This means, that it takes at most
log2 N steps for a message to reach its destination (where N is the number of nodes in the
hypercube).
Figure 2.4: 4-D Hypercube
11
Figure 2.5: Gray Code Scheme in Hypercube
Types of Dynamic Networks:
1. Bus-based networks:
They are the simplest and an efficient solution when the cost and a moderate
number of processors are involved. Their main drawback is a bottleneck to the memory
when the number of processors becomes large and also a single point of failure hangs the
system. To overcome these problems to some extent, several parallel buses can be
incorporated. Figure 2.6 shows a bus-based network incorporated with a single bus.
P0 P1 Pn-1
Figure 2.6: Bus-based Network
0000 0010
0100
0101
0001
0110
0111
0011
1000 1010
1100
1101
1001
1110
1111
1011
12
2. Crossbar switching networks:
Figure 2.7 shows a double-sided crossbar network having ‘n’ processors (Pi) and
‘m’ memory blocks (Mi). All the processors in a crossbar network have dedicated buses
directly connected to all memory blocks. This is a non-blocking network, as a connection
of a processor to a memory block does not block the connection of any other processor to
any other memory block. In spite of high speed, their use is normally limited to those
systems containing 32 or fewer processors, due to non-linear (n x m) complexity and
cost. They are applied mostly in multiprocessor vector computers and in multiprocessor
systems with multilevel interconnections.
M2M1 Mm
P1
P2
Pn
Figure 2.7: Crossbar Network
3. Multistage interconnection networks:
Multistage networks are designed with several small dimension crossbar
networks. Input/Output connection establishment is done in two or more stages. Figures
13
2.8 and 2.9 shown below are Benes multistage and Clos multistage networks. These are
non-blocking networks and are suitable for very large systems as their complexity is
much less than that of Crossbar networks. The main disadvantage of these networks is
latency. The latency increases with the size of the network.
II
BN/d,d
I
Xd, d
II
BN/d,d
II
BN/d,d
I
Xd, d
III
Xd, d
III
Xd, d
III
Xd, d
I
Xd, d
P0
P1
Pd-1
Pd
P2d-1
Pn-d
Pn-1
Pd
Pd+1
Pn-d+1
M0
M1
M0
Md-1
Md
Md+1
M2d-1
Mn-d
Mn-d+1
Mn-1
Figure 2.8: Benes Network (BN,d)
14
I(1)
I(2)
I(C1)
III(1)
III(2)
III(C2)
II(1)
II(2)
II(K)
P0
P1
Pd-1
Pd
P2d-1
Pn-d
Pn-1
PdPd+1
Pn-d+1
M0
M1
Md-1
Md
Md+1
M2d-1
Mn-d
Mn-d+1
Mn-1
Figure 2.9: Clos Network
15
Some of the multistage networks are compared with crossbar networks, primarily
from a stand-point of complexity, in the next chapter. Table 2.10 shows some general
properties of bus, crossbar and multistage interconnection networks.
Property Bus Crossbar Multistage
Speed Low High High
Cost Low High Moderate
Reliability Low High High
Complexity Low High Moderate
Table 2.1: Properties of Various Interconnection Network Topologies
16
Chapter 3
Multistage Interconnection Network Complexity
For the HDCA system, the desired interconnect should be able to establish non-
blocking, high speed connections from the requesting Computing Elements (CEs) to the
memory blocks. The interconnect should be able to sense if there are any conflicts such
as two or more processors requesting connection to the same memory block and give
only a highest priority processor the ability to connect. The multistage Benes network and
Clos network, their complexity comparison with a crossbar network, and advantages and
disadvantages of these candidate networks is discussed in this chapter.
Crossbar Topology:
A crossbar network is a highly non-blocking, very reliable, very high-speed
network. Figure 2.7 is a typical single stage crossbar network with N inputs/processors
and M outputs/memory blocks. It is denoted by XN,M . The complexity (Crosspoint
count) of a Crossbar network is given by N x M. Complexity increases with an increase
in number of inputs or number of outputs. This is the main disadvantage of the Crossbar
network. Hence there is less scope for scalability of crossbar networks. The crossbar
topology implemented for shared memory architectures is referred to as a double-sided
crossbar network.
Benes Network:
A Benes network is a multistage, non-blocking network.. For any value of N, d
should be chosen so that LogdN is an integer. The number of stages for a N x N Benes
network is given by (2logdN-1) and it has (N/d) crossbar switches in each stage. Hence
BN,d is implemented with [[(N/d).(2logdN-1)] crossbar switches. The general
architecture of a Benes network (BN,d) is shown in Figure 2.8.
In the figure,
N: Number of Inputs or Outputs,
17
d: Dimension of each crossbar switch ( Xd,d ) ,
I: First stage switch = Xd,d ,
II: Middle stage switch = BN/d,d
III: Last stage switch = Xd,d ,
The complexity (crosspoint count) of the network is given by [(N/d).(2logdN-1) .
d2 ]. Network latency is a factor of (2logdN-1), because of (2logdN-1) stages between
input stage and output stage. There are different possible routings from any input to
output. It is a limited scalability architecture. For a BN,d implementation, N has to be a
power of d. For all other configurations, a higher order Benes network can be used, but at
the cost of some hardware wastage. The main disadvantage of this network is the network
latency and limited scalability. For very large networks, a Benes network implementation
is very cost effective.
Clos Network:
Figure 2.9 shows a typical N x M Clos network represented by CN,M. The blocks
I, III are always crossbar switches and II is a crossbar switch for a 3 stage Clos network.
In implementations of higher order Clos networks, II is a lower order Clos network.
For example, for a 5 stage Clos implementation, II is a three-stage Clos network.
N: Number of processors
M: Number of Memory blocks
K: Number of Second stage switches
C1: Number of First stage switches
C2: Number of Third stage switches
For a three stage Clos network, I = X N/C1,K
, II = X C1,C2 , III = X K,M/C2 and
the condition for non-blocking Clos implementation is K = N/C1 + M/C2 - 1.
A three stage Clos implementation for N = 16, M = 32, C1 = 4, C2 = 8 has K = 16/4 +
32/8 - 1 = 7. Each 1st stage switch becomes a 4 x 7 crossbar switch and the 2nd stage
18
switch becomes a 4 x 8 Crossbar switch and each third stage switch becomes a Crossbar
switch of size 7 x 4. (I = X4,7 II = X4,8 III = X7,4 ). The complexity of a Clos network
is given by C clos = [K(N +M) + K(C1.C2 ) ]. Using the non-blocking condition, K =
N/C1 + M/C2 - 1. For N = M & C1 = C2, K = 2N/C1 - 1 and hence
C clos = (2N/C1 - 1) {2N + C12 }.
For an optimum crosspoint count for non-blocking Clos networks, N/C1 = (N/2)1/2
= > C12 = 2N
=> C clos = ((2N)1/2 - 1). 4N. (Approximately)
The main advantage of a Clos network implementation is its scalability. A Clos
network can be implemented for any non-prime value of N. The disadvantages of this
implementation are network latency and implementation for small systems. The network
latency is a factor of the number of intermediate stages between the input stage and the
output stage.
From the complexity comparison shown in Table 3.1 and charts shown in Figures
3.2 and 3.3, it can be analyzed that the crossbar topology for small systems and the Benes
network for large systems, match the requirements of the interconnect network for a
HDCA system. The number of processors on the input side and number of memory
blocks on the output side are assumed to be ‘N’ for simplicity in comparison of the
topologies. This assumption holds for any rectangular size implementations of these
topologies, which is not possible in the Benes network. The complexity comparison table
for the three topologies studied so far is given in Table 3.1. In the table “I” is the
complexity, and “II” is the corresponding network implementation for the values of N for
the respective topologies. Chart 1, shown in Figure 3.2, is the graph of complexity of the
three topologies versus N, the number of processors or memory blocks, for lower values
of N (N <= 16). Chart 2, shown in Figure 3.3, is the graph of complexity of the three
topologies versus N, the number of processors or memory blocks, for higher values of N
(N >= 16).
19
Table 3.1: Complexity Comparison Table
N Crossbar Benes Clos
I II I II I II
2 4 X(2,2) 4 B(2,2) 4 C(2,2)
3 9 X(3,3) 9 B(3,3) 9 C(3,3)
4 16 X(4,4) 24 B(4,2) 36 C(4,2)
5 25 X(5,5) 25 B(5,5) 25 C(5,5)
6 36 X(6,6) 80 B(8,2) 63 C(6,3)
7 49 X(7,7) 80 B(8,2) 96 C(8,4)
8 64 X(8,8) 80 B(8,2) 96 C(8,4)
9 81 X(9,9) 81 B(9,3) 135 C(9,3)
10 100 X(10,10) 224 B(16,2) 135 C(10,5)
11 121 X(11,11) 224 B(16,2) 180 C(12,6)
12 144 X(12,12) 224 B(16,2) 180 C(12,6)
13 169 X(13,13) 224 B(16,2) 189 C(14,7)
14 196 X(14,14) 224 B(16,2) 189 C(14,7)
15 225 X(15,15) 224 B(16,2) 275 C(15,5)
16 256 X(16,16) 224 B(16,2) 278 C(16,8)
32 1024 X(32,32) 576 B(32,2) 896 C(32,8)
64 4096 X(64,64) 1408 B(64,2) 2668 C(64,16)
81 6561 X(81,81) 1701 B(81,3) 4131 C(81,9)
128 16384 X(128,128) 3328 B(128,2) 7680 C(128,16)
20
2 3 4 5 6 7 8 9 10 11 12 13 14 15 160
255075
100125150175200225250275300
Chart 1
CrossbarBenesClos
N
Com
plex
ity
Figure 3.2: Complexity Chart for N <= 16
16 32 64 81 1280
2500
5000
7500
10000
12500
15000
17500
Chart 2
CrossbarBenesClos
N
Com
plex
ity
Figure 3.3: Complexity Chart for N >= 16
21
The following equations are used to calculate the complexity of all three topologies for
the different configurations given in Table 3.1.
C clos = (2N/C1 - 1) {2N + C12 },
For N/C1 = (N/2)1/2, taken to the closest integer value.
C benes = [(N/d).(2logdN-1). d2 ],
C crossbar = N2
From Table 3.1 and the Charts in Figures 3.2 and 3.3, the crossbar topology has
the lowest complexity for the values of N < = 16. Hence the crossbar network is the best
interconnect implementation for systems that have not more than 16 processors/memory
blocks since the hardware required for the implementation is less than all other possible
implementations, it is faster than any other network, and it is a non-blocking network as
every input has connection capability to every output in the system. And for the systems
having more than 16 x 16 configurations, and less than 64 x 64 configurations, the
designer has to tradeoff between speed and complexity. Because, for the multistage
networks such as the Benes network, complexity is less than that of the crossbar network
but at the cost of speed, as speed of multistage networks is much lower than that of the
crossbar network. For systems having more than 64 x 64 configurations, the Benes
network proves to be the best implementation.
The HDCA system normally requires an interconnect with a complexity less than
256. The crossbar implemented interconnect best suits the system as it has minimum
complexity for the sizes of interconnect needed by the HDCA system, it is highly non-
blocking as no processor has to share any bus with any other processor and it is a very
high speed implementation as it has only one intermediate stage between processors and
memory blocks.
22
Multiprocessor Shared Memory Organization:
INTERCONNECT
MB[0]
MB[1]
MB[M-1]
P[0]
P[1]
P[N-1]
PI[0]
PI[1]
PI[N-1]
IM[0]
IM[1]
IM[M-1]
Figure 3.4a: Multi-processor, Interconnect and Shared Memory Organization
Figure 3.4a shows the organization of a multiprocessor shared memory
organization. Figure 3.4b shows the organization of the shared memory used in the
HDCA system. By making 2c = M, the shared memory architecture in Figure 3.4b can be
used as the shared memory for the HDCA system. Figure 3.4c shows the organization of
each memory block. In each memory block there are 2b addressable locations each of ‘a’
bits width.
23
MB [0]
MB [1]
MB [2]
MB [2c -1]
Figure 3.4b: Shared Memory Organization
24
Figure 3.4c: Organization of each Memory Block
Related Research in Crossbar Networks:
The crossbar switches (networks) of today see a wide use in a variety of
applications including network switching, parallel computing, and various
telecommunications applications. By using the Field Programmable Gate Arrays
(FPGAs) or the Complex Programmable Logic Devices (CPLDs) to implement the
crossbar switches, design engineers have the flexibility to customize the switch to suit
their specific design goals, as well as obtain switch configurations not available with off-
the-shelf parts. In addition, the use of in-system programmable devices allows the switch
0
1
2b - 1
01a - 1
25
to be reconfigured if design modifications become necessary. There are two types of
implementations possible based on the crossbar topology. One is the single-sided
crossbar implementation as shown in Figure 3.5 and the other is the double-sided
crossbar implementation as shown in Figure 2.7.
Figure 3.5: Single-sided Crossbar network
The single-sided crossbar network is usually implemented and utilized in
distributed memory systems, where all the nodes (or processors) connected to the
interconnect need to communicate to each other. Whereas in the double-sided crossbar
networks which are usually utilized as the interconnect between processors and memory
blocks in a multiprocessor shared memory architecture as shown in Figures 3.4a, 3.4b,
and 3.4c, processors need to communicate with memory blocks but not processors with
processors and memory blocks with memory blocks.
A crossbar network is implemented with serial buses or parallel buses. In a serial-
bus, crossbar network implementation, addresses and data are sent by the processor
through a single-bit bus in a serial fashion, which are fetched by the memory blocks at
the rate of 1-bit on every active edge of the system clock. Some of the conventional
crossbar switches use this protocol for the crossbar interconnect network. The other
implementation is the parallel-bus crossbar network implementation. This
implementation is much faster than the serial-bus implementation. All the memory blocks
0
1
2
N- 2
N- 1
26
fetch addresses on one clock and data on the following clocks. This implementation
consumes more hardware than the serial-bus crossbar network implementation but is a
much faster network implementation and is hence used in some high performance
multiprocessor systems.
The main issue in implementing a crossbar network is arbitration of processor
requests for memory accesses. The processor request arbitration comes into picture when
two or more processors request for memory access within the same memory block. There
are different protocols that may be followed in designing the interconnect. One of them is
a round robin protocol. In the case of conflict among processor requests for memory
accesses within the same memory block, requests are granted to the processors in a round
robin fashion. A fixed priority protocol assigns fixed priorities to the processors. In case
of conflict, the processor having the highest priority ends up having its request granted all
the time. For a variable priority protocol, as will be used in an HDCA system, priorities
assigned to processors dynamically vary over time based on some parameter (metric) of
the system. In the HDCA system of Figure 1.1, all processors are dynamically assigned a
priority depending upon their input Queue (Q) depth at any time. In the case of conflict,
the processor having the highest (deepest) queue depth at that point of time gets the
request grant. A design engineer has to choose among the above mentioned protocols and
various kinds of implementations depending on the system requirements in designing a
crossbar network. The HDCA system will need some kind of arbitration in case of
processor conflicts.
The interconnect design presented in this project is closely related to the design of
a crossbar interconnect for the distributed memory systems presented in [3]. Both the
designs address the possibility of conflicts between the processor requests for the
memory accesses within the same memory block. Both the designs use parallel address
and data buses between every processor and every memory block. The design presented
in this project is different from the interconnect network of [3] in two ways. Firstly, the
crossbar interconnect network presented in this project is suitable for the shared memory
architecture. The crossbar topology used in this project is a double-sided topology
whereas a single-sided topology is used in the design of the crossbar interconnect
network of [3]. The priority arbitration scheme proposed in the interconnect network of
27
[3] uses a fixed priority scheme based on the physical distance between the processors
and gives the closest processor the highest priority and farthest processor the lowest
priority. The priority arbitration scheme presented in this design uses the input queue
depth of the processors in determining the priorities. The HDCA system requires a
double-sided, parallel bus crossbar network, with variable priority depending upon input
queue depth of the processors. In this project, a double-sided, parallel-bus crossbar
network using variable priority protocol is designed, implemented and tested as the
interconnect for the HDCA system. The detailed design of the implementation is
described in the next chapter.
28
Chapter 4
Design of the Crossbar Interconnect Network
This chapter presents the detailed design of a crossbar interconnect which meets
the requirements of the HDCA system of Figure1.1. The organization of processors,
interconnect and memory blocks is shown in Figure 3.4a.
Shared Memory Organization:
The shared memory organization used in this project is shown in Figures 3.4b and 3.4c.
From Figure 3.4b, there are (2c = M) memory blocks and the organization of each
memory block is shown in Figure 3.4c. In each memory block, there are 2b addressable
locations, of ‘a’ bits width. Hence the main memory which includes all the memory
blocks has 2b+c addressable locations of ‘a’ bits width. Hence the width of the address bus
of each processor is (b + c) bits wide and the data bus of each processor is ‘a’ bits wide.
Signal Description:
The schematic equivalent to the behavioral VHDL description of the interconnect
is shown in Figure 4.1. In general, a processor ‘i’ of Figure 3.4a has CTRL[i], RW[i],
ADDR[i], QD[i] as inputs to the interconnect. CTRL[i] of processor ‘i’ goes high when it
wants a connection to a memory block. RW[i] goes high when it wants to read and goes
low when it wants to write. ADDR[i] is the (b + c) bit address of the memory location
and ADDR_BL[i] is the ‘c’ bit address of the memory block with which the processor
wants to communicate. The memory block is indicated by the ‘c’ MSBs of ADDR[i].
QD[i] is the queue depth of the processor. FLAG[i], an input to the processor goes high
granting the processor’s request. FLAG[i] is decided by the priority logic of the
interconnect network. PI[i], DM[i][j] and IM[j] of Figure 4.1 are different types of buses
used in the interconnection network. The bus structure of these buses is shown in Figures
4.2 and 4.3.
At any time, processors represented by ‘i’, can request access to memory blocks
represented by MB[j]. That means CTRL[i] of those processors go high and have
memory block address, ie ADDR_BL[i] = j. Hence in Figure 4.2, the bus PI[i] of the
29
P [ 0 ] M B [ 0 ]P R L [ 0 ]
D E C [ 0 ]
P I [ 0 ]
M B _ A D D R [ 0 ]
I M [ 0 ]
P [ N - 1 ] M B [ M - 1 ]P R L [ M - 1 ]
D E C [ N - 1 ]
P I [ N - 1 ]
M B _ A D D R [ N - 1 ]
I M [ M - 1 ]
P [ 1 ] M B [ 1 ]P R L [ 1 ]
D E C [ 1 ]
P I [ 1 ]
M B _ A D D R [ 1 ]
I M [ 1 ]
D M [ 0 ] [ 0 ]
D M [ N - 1 ] [ M - 1 ]
Figure 4.1: Block Diagram of the Crossbar Interconnect Network
30
C T R L [ i ]
R W [ i ]
F L A G [ i ]
A D D R [ i ]
D A T A [ i ]
Q D E P [ i ]
B + C
A
Figure 4.2: PI[i] and DM[i][j] Bus Structures. (The PI[i] Bus and DM[i][j] Bus Have the
Same Set of Signal Lines as Shown in This Figure)
C T R L [ i ]
R W [ i ]
F L A G [ i ]
A D D R [ i ]
D A T A [ i ]
B + C
A
Figure 4.3: IM[j] Bus Structure
processors gets connected to the bus DM[i][j], through the decode logic DEC[i] of Figure
4.1 and shown again in Figure 4.4. As shown in Figure 4.4, ADDR_BL[i] of the
requesting processor is decoded by decoder DEC[i], and connects PI[i] to the DM[i][j]
output bus of DEC[i]. Every memory block has a priority logic block, PRL[j], as shown
in Figure 4.5. The function of this logic block is to grant a request to the processor having
the deepest queue depth among the processors requesting memory access to the same
memory block. As shown in Figure 4.5, once processor ‘i’ gets a grant from the priority
logic PRL[j] via the FLAG[i] signal of the DM[i][j] and PI[i] busses shown in Figures 4.1
and 4.2, the DM[i][j] bus is connected to the IM[j] bus by MUX[j] of Figure 4.5. Thus a
31
connection is established via PI[i], DM[i][j] and IM[j] between processor ‘i’ and memory
block ‘j’. This connection remains active as long as the processor holds deepest queue
depth or CTRL[i] of the processor is active. A priority logic block gives a grant only to
the highest
Figure 4.4: Decode Logic (DEC[i])
Figure 4.5: Priority Logic (PRL[j])
DEC[i]
PI[i]
DM[i][0]
DM[i][1]
DM[i][M-1]
ADDR_BL[i]
MUX[j]
PRL_LOGIC[j]
PROC SEL[j]
IM[j]
DM[0][j]
DM[1][j]
DM[N-1[j]
32
Figure 4.6: Priority Control Flow Chart for PR_LOGIC[j] in Figure 4.5.
ctrl[0] = '1' &mbaddr[0] = j
max=qd[0] i = 0max=0
ctrl[1] = '1'& mbaddr[1] = j& qd[1] >= max
max = max i = i
flag[i] = '0' max = qd[1] i = 1
ctrl[2] ='1'& mbaddr[2] =j& qd[2] >= max
flag[i] = '0' max = qd[2] i = 2
max = max i = i
ctrl[3] = '1'& mbaddr[3] = j& qd[3] >= max
flag[i] = '0' max = qd[3] i = 3
max = max i =i
flag[i] = '1'
ctrl[i] = '1'
flag[i] = '0'
T
F
T
T
T
T
F
F
F
F
33
priority processor. The queue depth of the processors is used in determining the priority.
In cases of processors having the same queue depth the processor having highest
processor identification number gets the highest priority. A processor can access a
particular memory block as long as it has highest priority to that memory block. If some
other processor gets highest priority for that particular memory block, the processor,
which is currently accessing the memory block gets its connection disconnected. It will
have to wait until it gets the highest priority for accessing that block again.
The flow chart showing the algorithmic operation of the jth priority logic block as
shown in Figure 4.5 is shown in Figure 4.6 above. To fully follow the flow chart of
Figure 4.6, we must reconcile signal names used in Figure 4.2 and Figure 4.6. The ‘c’
MSBs of ADDR[i] of Figure 4.2 correspond to the ‘mbaddr[x]’ of Figure 4.6 where x has
an integer value ranging from ‘0’ to (2C – 1). QDEP[i] of Figure 4.2 is the same as ‘qd[i]’
of Figure 4.6. The number of processors in the figure are assumed to be ‘4’ but the
algorithm holds true for any number of processors. The PR_LOGIC[j] block of the
priority logic of Figure 4.5 compares the current maximum queue depth with the queue
depth of every processor starting from the 0th processor. This comparison is done only for
those processors whose CTRL is in the logic ‘1’ state and are requesting memory access
to that memory block where the priority logic operation is performed. The integer value
‘i’ shown in Figure 4.6, is the identification number of the processor having deepest
queue depth at that time. After completion of processor prioritizing, the processor ‘i’
which has the deepest queue depth gets its request granted (FLAG[i] goes high) to access
that memory block. This logic operation is structurally equivalent to the schematic shown
in Figure 4.5, in which PROC_SEL[j] ( = ‘i’ in the flowchart shown in Figure 4.6) acts as
the select input to the multiplexer MUX[j]. This condition is achieved in VHDL code
(Appendix A and Appendix B) by giving memory access to any processor (it’s FLAG[i]
is set equal to logic ‘1’) only if its CTRL[i] is in the logic ‘1’ state and it has the deepest
queue depth among the processors requesting access within the same memory block.
The Interconnect gives all the processors the flexibility of simultaneous reads or
writes for those processors that are granted requests by the priority logic. In the best case
all processors will have their requests granted. This is the case when CTRL of all
processors is ‘1’ and no two processors have the same ADDR_BL. In this case the binary
34
value of FLAG is ‘1111’, after the completion of all iterations of the priority logic. The
VHDL description of the crossbar interconnect network implementation has a single
function which describes the processor prioritization done in all the memory blocks (ie
the corresponding priority logic blocks). The described function works for any number of
processors or memory blocks.
Figures 4.1 through 4.6 show the block level design of the interconnect network.
In the best case when all processors access different memory blocks, all the processors
receive request grants (FLAGs) in the logic ‘1’ state and all get their connections to
different memory blocks. In this case the Interconnect is used to the fullest of its capacity,
having different processors communicating with different memory blocks
simultaneously.
35
Chapter 5
VHDL Design Capture, Simulation, Synthesis, and Implementation
Flow
VHDL, the Very High Speed Integrated Circuit Hardware Description Language,
became a key tool for design capture and development of personal computers, cellular
telephones, and high-speed data communications devices during the 1990s. VHDL is a
product of the Very High Speed Integrated Circuits (VHSIC) program funded by the
department of defense in the 1970s and 80s. VHDL provides both low-level and high-
level language constructs that enable designers to describe small and large circuits and
systems. It provides portability of code between simulation and synthesis tools, as well as
device-independent design. It also facilitates converting a design from a programmable
logic to an Application Specific Integrated Circuit (ASIC) implementation. VHDL is an
industry standard for the description, simulation, modeling and synthesis of digital
circuits and systems. The main reason behind the growth in the use of VHDL can be
attributed to synthesis, the reduction of a design description to a lower-level circuit
representation. A design can be created and captured using VHDL without having to
choose a device for implementation. Hence it provides a means for a device-independent
design. Device-independent design and portability allows benchmarking a design using
different device architectures and different synthesis tools.
Electronic Design Automation (EDA) Design Tool Flow:
The EDA design tool flow is as follows:
• Design description (capture) in VHDL
• Pre-synthesis simulation for design verification/validation
• Synthesis
• Post-synthesis simulation for design verification/validation
• Implementation (Map, Place and Route)
• Post-implementation simulation for design verification/validation
• Design optimization
36
• Final implementation to FPGA/CPLD/ or ASIC technology.
The inputs to the synthesis EDA software tool are the VHDL code, synthesis directives
and the device technology selection. Synthesis directives include different kinds of
external as well as internal directives that influence the device implementation process.
The required device selection is done during this process.
Field Programmable Gate Array (FPGA):
The FPGA architecture is an array of logic cells (blocks) that communicate with
one another and with I/O via wires within routing channels. Like a semi-custom gate
array, which consists of an array of transistors, an FPGA consists of an array of logic
cells [8,10]. A FPGA chip consists of an array of logic blocks and routing channels as
shown in Figure 5.1. Each circuit or system must be mapped into the smallest square
FPGA that can accommodate it.
Figure 5.1: FPGA Architecture
Each logic block contains or consists of a number of RAM based Look Up Tables
(LUTs) used for logic function implementation and D-type flip-flops in addition to
several multiplexers used for signal routing or logic function implementation. FPGA
37
routing can be segmented and/or un-segmented. Un-segmented routing is when each
wiring segment spans only one logic block before it terminates in a switch box. By
turning on some of the programmable switches within a switch box, longer paths can be
constructed. The design for this project is implemented (prototyped) to a Spartan XL
FPGA chip which is a XILINX product [8]. It is a PROM based FPGA.
The one that is used for this project is an XCS10PC84 from the XL family. The
FPGA is implemented with a regular, flexible, programmable architecture of
Configurable Logic Blocks (CLBs), routing channels and surrounded by I/O devices. The
FPGA is provided with a clock rate of 50 Mhz. There are two more configurable clocks
on the chip. The Spartan XCS10PC84XL is an 84- pin device with 466 logic cells and is
approximately equivalent to 10,000 gates. Typically, the gate range for the XL chips will
be from 3000-12,000. The XCS10 has a 14x14 CLB matrix with 196 total CLBs. There
are 616 flip-flops in the chip and the maximum available I/Os on the chip are 112.
Digilab XL Prototype Board:
Digilab XL prototype boards [8] feature a Xilinx Spartan FPGA (either 3.3V or
5V) and all the I/O devices needed to implement a wide variety of circuits. The FPGA on
the board can be programmed directly from a PC using an included JTAG cable, or from
an on-board PROM. A view of the board is shown in the figure below. The board has one
internally generated clock and two configurable clocks.
Figure 5.2: Digilab Spartan XL Prototyping Board
38
The Digilab prototype board contains 8 LEDs and a seven-segment display which
can be used for monitoring prototype input/output signals of interest when testing the
prototype system programmed into the Spartan XL chip on the board.
VHDL Design Capture:
Behavioral VHDL description was used in design capture and coding of the
crossbar interconnect network design and logic. The structural equivalents of two
behavioral VHDL descriptions is shown in Figures 6.1 and 6.2 of the next chapter. The
scenario of a processor trying to access a particular memory block and whether its request
is granted (FLAG = ‘1’) or rejected (FLAG = ‘0’) can be generalized to all processors
and all memory blocks. Hence, the main VHDL code has a function ‘flg’, an entity
‘main’ and a process ‘P1’ and it is possible to increase the number of processors, memory
blocks (and address bus) , memory locations in each memory block (and address bus),
width of data bus, and the width of the processor queue depth bus.
Appendix A contains the VHDL code, structured as shown in Figure 6.1, which
describes the crossbar interconnect network assuming the number of processors, the
number of memory blocks, and the number of addressable locations in each memory
block to be ‘4’. The input queue depth of each processor is 4-bits wide. This VHDL code
is described considering the crossbar interconnect network and the shared memory as a
single functional unit as shown in Figure 6.1. This code is tested for correct design
capture and interconnect network functionality via the pre-synthesis, post-synthesis and
post-implementation VHDL simulation stages and is downloaded onto the XILINX based
Spartan XL FPGA [7,8] for prototype testing and evaluation. Appendix B contains the
VHDL code describing only the crossbar interconnection network as a single functional
unit as depicted in Figure 6.2. This code has more I/O pins than the previous code. This
code was tested via pre-synthesis and post-synthesis VHDL simulation. With the
exceptions of the I/O pins and shared memory, the functionality of both VHDL
interconnect network descriptions is the same and is identical to the description of the
crossbar interconnect network organization and architecture design described in Chapter
4.
39
Chapter 6
Design Validation via Post-Implementation Simulation Testing
There are two VHDL code descriptions of the interconnect network, one in
Appendix A and the other in Appendix B. Both have the same interconnect network
functionality but different modular structures. The VHDL code in Appendix A is
described considering both the crossbar interconnect network and the shared memory as a
single block as shown in Figure 6.1. The VHDL code described in Appendix A has only
processors to interface to the interconnect.
Figure 6.1: Block Diagram of VHDL Code Described in Appendix A.
MB0 MODULE main MB1 MB2 MB3
data_in
addr bus
qdep
ctrl
rw
clk
rst
flag
data_out
16
16
16
4
4
4
16
40
Figure 6.2: Block Diagram of VHDL Code Described in Appendix B.
main_ic
addr_prc
16
data_in_prc
16
qdep
16
4 ctrl
4 rw
clk
rst
data_out_prc
16
flag
4
8
addr_mem
16
data_in_mem 16
data_out_mem
4
rw_mem
41
The VHDL code in Appendix B describes the crossbar interconnect network as a
single block as shown in Figure 6.2. The VHDL code in Appendix B has two interfaces,
the processors to interconnect network interface and interconnect to memory blocks
interface. In both cases (Appendix A and Appendix B) the VHDL code describes a
crossbar interconnect network interfaced to four processors and four memory blocks and
each memory block has 4 addressable locations. The data bus of each processor is taken
to be 4-bits wide. Entity 'main' corresponds to the interconnect module described in
Appendix A and entity 'main_ic' corresponds to the interconnect described in Appendix
B. The VHDL descriptions of the interconnect network in Appendices A and B are
written in a generic parameterized manner such that additional processors and memory
modules may be interfaced to the interconnect and also, the size of the memory modules
may be increased. The size of the prototyped interconnect was kept small so that it would
fit into the Xilinx Spartan XL chip on the prototype board shown in Figure 5.2. No
functionality of the interconnect network was compromised by keeping the prototype to a
small size.
The VHDL description in Appendix A has a 16-bit ‘data_in’ input port to the
interconnect in which 4-bit data buses of all the 4 processors are included as shown in
Figure 6.3. The same is the case of ‘addr_bus’, ‘qdep’, ‘ctrl’, ‘rw’, ‘data_out’, ‘flag’.
Various scenarios such as:
1. all the processors writing to different memory locations in different memory ,
2. all the processors reading from different memory locations in different memory,
3. two or more processors requesting access to different memory locations within the
same memory block, and only the highest priority processor getting the grant to access
the memory block,
4. two or more processors requesting access to the same memory location, and only the
highest priority processor getting the grant to access the memory block,
5. two or more processors requesting access to the same memory block, and some of the
processors having the same queue depth, with only the highest priority processor getting
the grant to access the memory block,
42
Figure 6.3: Input Data Bus Format
6. one or more processors in idle state,
are tested in the module 'main' during the pre-synthesis, post-synthesis and post-
implementation simulations. This module is also downloaded onto a Xilinx based Spartan
XL FPGA chip and is tested under the various scenarios described above. Figures 6.4,
6.5, and 6.6 show the post-implementation simulation tracers, which show the behaviour
of the interconnect network and shared memory under different scenarios, described in
VHDL code in Appendix A.
The same coding style is followed in the VHDL code for ‘main_ic’ described in
Appendix B. The above mentioned scenarios of processor requests are tested on the
module 'main_ic' also during the pre-synthesis, post-synthesis simulations. Figures 6.7
and 6.8 show the post-synthesis simulation tracers, which shows the behaviour of the
interconnect network under different scenarios.
D A T A 0
D A T A 1
D A T A 2
D A T A 3
4
4
4
4
16
DATA_IN[15:0]
0
15
43
The simulation tracers in Figure 6.4 – 6.6 show the behaviour of the interconnect
module output ‘data_out’ and shared memory in different scenarios, which are explained
below. A testcase ‘top’ is developed to generate input stimulus to the module ‘main’ and
to display the control signals, address, data of all processors on LEDs of the Spartan XL
FPGA chip. ‘scnr’, ‘pid’, ‘addr’ and ‘data’ signals observed on the simulation tracer are
used in developing the testcase ‘top’, which is described in detail in Chapter 7. In this
chapter, the input stimulus and output (data_out and data in memory locations) observed
on the simulation tracers is discussed. (All the data mentioned in different scenarios and
that are shown on the simulation tracers are represented in hexadecimal system)
Scenario 0:
Input stimulus:
data_in <= x"4321" ;
addr_bus <= x"FB73" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"F" ;
In this case, processor '0' is requesting memory access within memory block '0'
(location '3'), processor '1' within memory block '1' (location '3') , processor '2' within
memory block '2' (location '3') and processor '3' within memory block '3' (location '3').
Hence there is no conflict between the processors for memory accesses. Processors '0', '1',
'2' and '3' (from 16-bit 'data_in' bus ) are writing '1', '2', '3' and '4' to the corresponding
memory locations. As all the processors get the memory access and hence the data '1', '2',
'3' and '4' is written to the memory location '3' in each of the four memory blocks. This
data can be observed in the corresponding memory locations, on the simulation tracer 1
in Figure 6.4, for ‘scnr’ = ‘0’.
Scenario 1:
Input stimulus:
data_in <= x"26FE" ;
44
addr_bus <= x"37BF" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"0" ;
In this case, processor '0' is requesting memory access within memory block '3'
(location '3'), processor '1' within memory block '2' (location '3') , processor '2' within
memory block '1' (location '3') and processor '3' within memory block '0' (location '3').
Hence there is no conflict between the processors for memory accesses. Processors '0', '1',
'2' and '3' are reading '4', '3', '2' and '1' from the corresponding memory locations. As all
the processors get the memory access and hence the data '4', '3', '2' and '1' is read from to
the memory locations 'F', 'B', '7' and '3'. This data can be observed on the ‘data_out’ bus,
on the simulation tracer 1 in Figure 6.4, for ‘scnr’ = ‘1’.
Scenarios '0' and '1' test the case of data exchange between processors. In the
scenario '0', processor '0' writes '1' to memory location '3' in memory block '0' and
processor '3' writes '4' to memory location '3' in memory block '3'. In scenario '1'
processor '0' reads '4' (Data written by processor '3' in scenario '0') from memory location
'3' in memory block '3' and processor '3' reads '1' (Data written by processor '0' in
scenario '0') from memory location '3' in memory block '0'. Similarly data exchange
between processors '1' and '2' is also tested.
Scenario 2:
Input stimulus:
data_in <= x"AAAA" ;
addr_bus <= x"CD56" ;
qdep <= x"EFFF" ;
ctrl <= x"F" ;
rw <= x"5" ;
45
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting for
memory access within the same memory block '1', but to different memory locations '2'
and '1'. Processor ‘1’ gets the access as its processor ID is greater than that of processor
‘0’ and queue depths of the processors is same. So processor ‘1’ reads ‘0’ from the
memory location '1' in the memory block '1'. This data is observed on ‘data_out’ bus on
the simulation tracer ‘1’ shown in Figure 6.4, for ‘scnr’ = ‘2’. Similarly, processor ‘2’
(for memory write) and processor ‘3’ (for memory read) are requesting for memory
access within same memory block '3', but to different memory locations '0' and '1'.
Processor ‘2’ gets the access as its queue depth is greater than that of processor ‘3’. So
processor ‘2’ writes 'A' to the memory location '1' in the memory block '3'. This data is
observed in the memory location ‘D’ on the simulation tracer ‘1’ shown in Figure 6.4, for
‘scnr’ = ‘2’.
Scenario 3:
Input stimulus:
data_in <= x"5555" ;
addr_bus <= x"DC65" ;
qdep <= x"4434" ;
ctrl <= x"F" ;
rw <= x"5" ;
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting for
memory access within same memory block '1', but to different memory locations '1' and
'2'. Processor ‘0’ gets the access as its queue depth is greater than that of processor ‘1’. So
processor ‘0’ writes ‘5’ to the memory location '1' in the memory block '1'. This data is
observed in the memory location ‘5’, on the simulation tracer ‘2’ shown in Figure 6.5, for
‘scnr’ = ‘3’. Similarly, processor ‘2’ (for memory write) and processor ‘3’ (for memory
read) are requesting for memory access within same memory block '3', but to different
memory locations '0' and '1'. Processor ‘3’ gets the access as its processor ID is greater
than that of processor ‘2’ and queue depths of the processors is same. So processor ‘3’
reads 'A' (which was written by processor ‘1’ in the previous scenario) from the memory
46
location '1' in the memory block '3'. This is observed on ‘data_out’ bus , on the simulation
tracer ‘2’ shown in Figure 6.5, for ‘scnr’ = ‘3’.
Scenario 4:
Input stimulus:
data_in <= x"9999" ;
addr_bus <= x"3399" ;
qdep <= x"4EE4" ;
ctrl <= x"F" ;
rw <= x"C" ;
In this case, processor ‘0’ and processor ‘1’ are requesting for memory access
(read) with same memory location ‘1’ within same memory block ‘2’. Processor ‘1’ gets
the priority as its queue depth is greater than that of processor ‘0’. Hence processor ‘1’
reads ‘0’ (which is reset value) from memory location ‘9’ and is observed on ‘data_out’
bus on the simulation tracer ‘2’, shown in Figure 6.5, for ‘scnr’ = ‘4’. Processor ‘2’ and
processor ‘3’ are requesting for memory access (write) with same memory location
within same memory block. Processor ‘2’ gets the priority as its queue depth is greater
than that of processor ‘3’. Hence processor ‘2’ writes ‘9’ to the memory location ‘3’ in
memory block ‘0’. This data is observed in the memory location ‘3’, on the simulation
tracer ‘2’ shown in Figure 6.5, for ‘scnr’ = ‘4’.
Scenario 5:
Input stimulus:
data_in <= x"EEEE" ;
addr_bus <= x"7654" ;
qdep <= x"5556" ;
ctrl <= x"F" ;
rw <= x"1" ;
47
In this case, processor ‘0’ (write), processor ‘1’ (read), processor ‘2’ (read) and
processor ‘3’ (read) are requesting for memory access within same memory block ‘1’, but
to different memory locations. Processor ‘0’ gets the priority as it has greatest queue
depth of all the processors. Hence processor ‘0’ writes 'E' to the memory location '0' in
the memory block '1'. This data is observed in the memory location ‘4’, on the simulation
tracer ‘2’ shown in Figure 6.5, for ‘scnr’ = ‘5’.
Scenario 6:
Input stimulus:
data_in <= x"CCCC" ;
addr_bus <= x"6666" ;
qdep <= x"5555" ;
ctrl <= x"E" ;
rw <= x"0" ;
In this case, processor ‘1’ (read), processor ‘2’ (read) and processor ‘3’ (read) are
requesting for memory access with the same memory location ‘2’ in the memory block
‘1’. Processor ‘3’ gets the priority as its has greatest processor ID of all the processors.
Hence processor ‘3’ reads '0' (reset value) from the memory location '2' in the memory
block '1', and is observed on ‘data_out’ bus on the simulation tracer ‘3’, shown in Figure
6.6, for ‘scnr’ = ‘6’.
Scenario 7:
Input stimulus:
data_in <= x"FFFF" ;
addr_bus <= x"EEAE" ;
qdep <= x"0011" ;
ctrl <= x"0" ;
rw <= x"0" ;
48
In this case, all processors are in idle state. No transactions are performed through
the interconnect. Hence no changes are observed in any of the memory locations or on
‘data_bus’ and are observed on the simulation tracer ‘3’, shown in Figure 6.6, for ‘scnr’ =
‘7’.
The simulation tracers ‘1’, ‘2’ and ‘3’ are shown in Figures 6.4, 6.5 and 6.6 in the
next three pages.
49
Figure 6.4: Simulation tracer 1
50
Figure 6.5: Simulation tracer 2
51
Figure 6.6: Simulation tracer 3
52
The simulation tracers in Figure 6.7 and 6.8 show the behaviour of the
interconnect module ‘main_ic’ in different scenarios, which are explained below. In this
chapter, the input stimulus and the output of the module ‘main_ic’ under each scenario,
observed on the simulation tracers is discussed. (All the data mentioned in different
scenarios and that are shown on the simulation tracers are represented in hexadecimal
system)
Scenario 0:
Input stimulus:
data_in_prc <= x"4321" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"FB73" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"F" ;
In this case, processor '0' is requesting memory access within memory block '0'
(location '3'), processor '1' within memory block '1' (location '3'), processor '2' within
memory block '2' (location '3') and processor '3' within memory block '3' (location '3').
There is no conflict between the processors for memory access, so ‘flag’ is '1' for all
processors. Processors '0', '1', '2' and '3' write '1', '2', '3' and '4' respectively. Since all
the processors get memory access, the data '1', '2', '3' and '4' (from the ‘data_in_prc’
bus) are written to the hexbit (a 4-bit binary value represented as one hexadecimal digit)
of the ‘data_out_mem’ bus corresponding to each memory block, ‘addr_mem’ for each
memory block gets ‘3’, indicating the location address within that memory block, and
‘rw_mem’ for all memory blocks becomes ‘1’, indicating a memory write operation. This
can be observed on simulation tracer 4 in Figure 6.7, for ‘addr_prc’ = ‘FB73’ and
‘rst’ = ‘0’.
Scenario 1:
Input stimulus:
data_in_prc <= x"26FE" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"37BF" ;
qdep <= x"1234" ;
ctrl <= x"F" ;
rw <= x"0" ;
In this case, processor '0' is requesting memory access within memory block '3'
(location '3'), processor '1' within memory block '2' (location '3'), processor '2' within
memory block '1' (location '3') and processor '3' within memory block '0' (location '3').
There is no conflict between the processors for memory access. Processors '0', '1',
'2' and '3' read 'F', 'E', 'D' and 'C' respectively from the ‘data_in_mem’ bus. This data
can be observed on the ‘data_out_prc’ bus, on simulation tracer 4 in Figure 6.7, for
‘addr_prc’ = ‘37BF’. It is also observed that ‘addr_mem’ for each memory block gets ‘3’,
indicating the location address within that memory block, and ‘rw_mem’ for all memory
blocks becomes ‘0’, indicating a memory read operation.
Scenario 2:
Input stimulus:
data_in_prc <= x"AAAA" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"CD56" ;
qdep <= x"EFFF" ;
ctrl <= x"F" ;
rw <= x"5" ;
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting
memory access within the same memory block '1', but to different memory locations '2'
and '1'. Processor ‘1’ gets the access because its processor ID is greater than that of
processor ‘0’ and the queue depths of the two processors are the same. So processor ‘1’
reads ‘D’ from the ‘data_in_mem’ bus. This data is observed on the ‘data_out_prc’ bus
on simulation tracer ‘4’, shown in Figure 6.7, for ‘addr_prc’ = ‘CD56’. Similarly,
processor ‘2’ (write) and processor ‘3’ (read) are requesting memory access within the
same memory block '3', but to different memory locations '1' and '0'. Processor ‘2’ gets
the access because its queue depth is greater than that of processor ‘3’. So processor ‘2’
writes 'A' to the hexbit of the ‘data_out_mem’ bus corresponding to memory block ‘3’,
and the data is observed on the ‘data_out_mem’ bus on simulation tracer ‘4’, shown in
Figure 6.7, for ‘addr_prc’ = ‘CD56’. ‘flag’ is ‘1’ for only two processors (processors ‘1’
and ‘2’), ‘addr_mem’ corresponding to memory blocks ‘1’ and ‘3’ becomes ‘1’, indicating
the location address within those memory blocks, and ‘rw_mem’ for those blocks
becomes ‘0’ and ‘1’, indicating ‘read’ and ‘write’ operations respectively.
Scenario 3:
Input stimulus:
data_in_prc <= x"5555" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"DC65" ;
qdep <= x"4434" ;
ctrl <= x"F" ;
rw <= x"5" ;
In this case, processor ‘0’ (write) and processor ‘1’ (read) are requesting
memory access within the same memory block '1', but to different memory locations '1'
and '2'. Processor ‘0’ gets the access because its queue depth is greater than that of
processor ‘1’. So processor ‘0’ writes ‘5’ to the hexbit of the ‘data_out_mem’ bus
corresponding to memory block '1'. This data is observed on that hexbit of the
‘data_out_mem’ bus on simulation tracer ‘4’, shown in Figure 6.7, for ‘addr_prc’ =
‘DC65’. Similarly, processor ‘2’ (write) and processor ‘3’ (read) are requesting memory
access within the same memory block '3', but to different memory locations '0' and '1'.
Processor ‘3’ gets the access because its processor ID is greater than that of processor ‘2’
and the queue depths of the two processors are the same. So processor ‘3’ reads 'F' from
the hexbit of the ‘data_in_mem’ bus corresponding to memory block ‘3’. This is observed
on the hexbit of the ‘data_out_prc’ bus corresponding to processor ‘3’, on simulation
tracer ‘4’, shown in Figure 6.7, for ‘addr_prc’ = ‘DC65’. ‘flag’ is ‘1’ for only two
processors (processors ‘0’ and ‘3’), ‘addr_mem’ corresponding to memory blocks ‘1’ and
‘3’ becomes ‘1’, indicating the location address within those memory blocks, and
‘rw_mem’ for those blocks becomes ‘1’ and ‘0’, indicating ‘write’ and ‘read’ operations
respectively.
Scenario 4:
Input stimulus:
data_in_prc <= x"9999" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"3399" ;
qdep <= x"4EE4" ;
ctrl <= x"F" ;
rw <= x"C" ;
In this case, processor ‘0’ and processor ‘1’ are requesting memory access
(read) to the same memory location ‘1’ within the same memory block ‘2’. Processor ‘1’
gets priority because its queue depth is greater than that of processor ‘0’. Hence
processor ‘1’ reads ‘E’ from the hexbit of the ‘data_in_mem’ bus corresponding to
memory block ‘2’, which is observed on the ‘data_out_prc’ bus on simulation tracer ‘5’,
shown in Figure 6.8, for ‘addr_prc’ = ‘3399’. Processor ‘2’ and processor ‘3’ are
requesting memory access (write) to the same memory location within the same memory
block. Processor ‘2’ gets priority because its queue depth is greater than that of
processor ‘3’. Hence processor ‘2’ writes ‘9’ to the hexbit of the ‘data_out_mem’ bus
corresponding to memory block ‘0’. This data is observed on the ‘data_out_mem’ bus on
simulation tracer ‘5’, shown in Figure 6.8, for ‘addr_prc’ = ‘3399’. ‘flag’ is ‘1’ for only
two processors (processors ‘1’ and ‘2’), ‘addr_mem’ corresponding to memory blocks ‘0’
and ‘2’ becomes ‘3’ and ‘1’ respectively, indicating the location addresses within those
memory blocks, and ‘rw_mem’ for those blocks becomes ‘1’ and ‘0’, indicating ‘write’
and ‘read’ operations respectively.
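What the tracer shows for a granted request maps directly onto the memory-side interface of ‘main_ic’. Condensed from process P1 of Appendix B, the granted-request path for processor ‘0’ is:

  -- On a grant, processor 0's address hexbit steers its request to the
  -- addressed memory block (addr_prc(3 downto 2) selects the block).
  if (flag(0) = '1') then
    addr_mem_2d(conv_integer(addr_prc(3 downto 2))) <= addr_prc(1 downto 0);
    rw_mem(conv_integer(addr_prc(3 downto 2)))      <= rw(0);
    if (rw(0) = '1') then
      -- write: route the processor's data hexbit toward the memory block
      data_out_mem_2d(conv_integer(addr_prc(3 downto 2))) <= data_in_prc(3 downto 0);
      data_out_prc(3 downto 0) <= (others => 'Z');
    else
      -- read: route the memory block's data hexbit back to the processor
      data_out_prc(3 downto 0) <= data_in_mem_2d(conv_integer(addr_prc(3 downto 2)));
    end if;
  end if;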
Scenario 5:
Input stimulus:
data_in_prc <= x"EEEE" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"7654" ;
qdep <= x"5556" ;
ctrl <= x"F" ;
rw <= x"1" ;
In this case, processor ‘0’ (write), processor ‘1’ (read), processor ‘2’ (read) and
processor ‘3’ (read) are requesting memory access within the same memory block ‘1’, but
to different memory locations. Processor ‘0’ gets priority because it has the greatest
queue depth of all the processors. Hence processor ‘0’ writes 'E' to the hexbit of the
‘data_out_mem’ bus corresponding to memory block '1'. This data is observed on
simulation tracer ‘5’, shown in Figure 6.8, for ‘addr_prc’ = ‘7654’. ‘flag’ is ‘1’ for only
processor ‘0’, ‘addr_mem’ corresponding to memory block ‘1’ becomes ‘0’, indicating
the location address within that memory block, and ‘rw_mem’ for that memory block
becomes ‘1’, indicating a ‘write’ operation.
Scenario 6:
Input stimulus:
data_in_prc <= x"CCCC" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"6666" ;
qdep <= x"5555" ;
ctrl <= x"E" ;
rw <= x"0" ;
In this case, processor ‘1’ (read), processor ‘2’ (read) and processor ‘3’ (read) are
requesting memory access to the same memory location ‘2’ in memory block ‘1’.
Processor ‘3’ gets priority because it has the greatest processor ID of all the requesters.
Hence processor ‘3’ reads 'D' from the hexbit of the ‘data_in_mem’ bus corresponding to
memory block '1', which is observed on the ‘data_out_prc’ bus on simulation tracer ‘5’,
shown in Figure 6.8, for ‘addr_prc’ = ‘6666’. ‘flag’ is ‘1’ for only processor ‘3’,
‘addr_mem’ corresponding to memory block ‘1’ becomes ‘2’, indicating the location
address within that memory block, and ‘rw_mem’ for that memory block becomes ‘0’,
indicating a ‘read’ operation.
Scenario 7:
Input stimulus:
data_in_prc <= x"FFFF" ;
data_in_mem <= x"FEDC" ;
addr_prc <= x"EEAE" ;
qdep <= x"0011" ;
ctrl <= x"0" ;
rw <= x"0" ;
In this case, all processors are in the idle state. No transactions are performed through the interconnect. Hence no changes are observed in any of the memory locations or on the ‘data_out_prc’ bus, as shown on simulation tracer ‘5’ in Figure 6.8 for ‘addr_prc’ = ‘EEAE’.
The simulation tracers ‘4’ and ‘5’ are shown in Figures 6.7 and 6.8 below.
Figure 6.7: Simulation tracer 4
Figure 6.8: Simulation tracer 5
The post-implementation simulation tracers shown in Figures 6.4, 6.5 and 6.6
show that the interconnect network module ‘main’, described in Appendix A,
performed correctly from a functional standpoint for all scenarios. The
post-implementation simulation tracers shown in Figures 6.7 and 6.8 show that the
interconnect network module ‘main_ic’, described in Appendix B, performed correctly
from a functional standpoint for all scenarios that were successfully tested on the
interconnect module ‘main’.
Chapter 7
Experimental Prototype Development, Testing and Validation Results
The VHDL coded interconnect network system and interfaced memory modules
described in the code of Appendix A were synthesized, implemented, and programmed
into a Xilinx Spartan XL FPGA chip using the Xilinx Foundation 3.1i CAD tool
set [7]. The FPGA chip used as the target in downloading the VHDL code generated
bit stream was the earlier mentioned Xilinx XCS10PC84 from the Spartan XL family.
Since the module ‘main’ has a large number of I/O pins, a testcase ‘top’ was developed
and described in VHDL; it generates stimulus for the inputs of ‘main’ under the various
scenarios and routes a selected set of outputs of ‘main’ to the LEDs of the prototype
board for output observation and evaluation. Figure 7.1 shows the block diagram of
‘top’.
Figure 7.1: Top Level Block Diagram of Testcase ‘top’.
[Block diagram: a Stimulus generator, driven by ‘clk’, ‘rst’ and the 3-bit ‘scnr’ input, feeds the module ‘main’; both the generated ‘Input’ signals and the ‘Output’ of ‘main’ feed a ‘display’ block which, selected by the 3-bit ‘pid’ input, drives the 4-bit ‘addr’ and 4-bit ‘data’ outputs.]
The function of ‘top’ is to generate stimulus for the input pins of module ‘main’.
The 3-bit primary input ‘scnr’ (scenario) to the testcase ‘top’ determines which set of
inputs is applied to the module ‘main’. The signal ‘Input’ shown in Figure 7.1 is
the stimulus given to all the input ports of module ‘main’, and the signal ‘Output’ is the
output of module ‘main’. Both ‘Input’ and ‘Output’ are input ports to the block
‘display’. There are 8 different scenarios that can be tested experimentally, and there are 8
LEDs on the proto-board which are used to display and verify inputs and outputs of the
under-test module ‘main’. Another function of ‘top’ is to display the required set of
inputs or outputs depending upon the value of the 3-bit primary input ‘pid’ to the testcase
‘top’; the 4-bit ‘addr’ and 4-bit ‘data’ are the primary outputs of the testcase ‘top’, and
‘pid’ determines the set of signals that is displayed on the 8 LEDs. For example, a ‘000’
value of ‘pid’ displays the 4-bit input queue depth ‘qdep’ of processor ‘0’ on the 4-bit
‘data’, the ‘flag’ of processor ‘0’ on ‘addr’ bits ‘3’ and ‘2’, the ‘rw’ of processor ‘0’ on
‘addr’ bit ‘1’ and the ‘ctrl’ of processor ‘0’ on ‘addr’ bit ‘0’. A ‘001’ value of ‘pid’
displays the 4-bit ‘addr_bus’ hexbit of processor ‘0’ on ‘addr’ and, depending on ‘rw’,
the 4-bit ‘data_in’ or ‘data_out’ hexbit of processor ‘0’ on the ‘data’ pins. In a similar
fashion, ‘010’ and ‘011’ values of ‘pid’ display the corresponding signals of processor
‘1’, ‘100’ and ‘101’ values of ‘pid’ display the corresponding signals of processor ‘2’,
and ‘110’ and ‘111’ values of ‘pid’ display the corresponding signals of processor ‘3’.
A 50 MHz clock generated internally on the FPGA chip is another primary input to the
testcase ‘top’ and hence to module ‘main’. During the implementation stage, the inputs
and outputs are assigned to the I/O pins of the FPGA chip; this information is entered in
the ‘top.ucf’ file. The scenario select bits scnr(0), scnr(1) and scnr(2) are assigned to pins
‘p25’, ‘p26’ and ‘p27’, which are switches 4, 3 and 2 on the proto-board. The display
select bits pid(0), pid(1) and pid(2) are assigned to pins ‘p19’, ‘p20’ and ‘p23’, which are
switches 8, 7 and 6 on the proto-board. The 4-bit ‘addr’ is assigned to pins ‘p66’, ‘p67’,
‘p68’ and ‘p69’, which are LEDs 4, 3, 2 and 1 on the proto-board, and the 4-bit ‘data’ is
assigned to pins ‘p60’, ‘p61’, ’p62’ and ‘p65’, which are LEDs 8, 7, 6 and 5 on the
proto-board. The reset signal ‘rst’ is assigned to ‘p28’, which is switch 1 on the
proto-board. By changing switches 4, 3 and 2 on the proto-board, different sets of inputs
are given to the module ‘main’, and the corresponding interconnection network behavior
and outputs of module ‘main’ can be seen on the 8 LEDs for any processor by selection
through switches 8, 7 and 6 on the proto-board.
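For concreteness, the ‘pid’ decode for processor ‘0’, condensed from the ‘disply’ process of Appendix A (the remaining ‘pid’ values handle processors ‘1’ through ‘3’ analogously), is:

  case pid(2 downto 0) is
    when "000" =>
      data    <= qdep(3 downto 0);    -- queue depth of processor 0
      addr(0) <= ctrl(0);             -- ctrl of processor 0
      addr(1) <= rw(0);               -- rw of processor 0
      addr(2) <= flag(0);             -- flag of processor 0,
      addr(3) <= flag(0);             --   duplicated on addr bits 3 and 2
    when "001" =>
      if (rw(0) = '0') then
        data <= data_out(3 downto 0); -- read: data returned from memory
      else
        data <= data_in(3 downto 0);  -- write: data sent to memory
      end if;
      addr <= addr_bus(3 downto 0);   -- address hexbit of processor 0
    when others =>
      null;
  end case;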
All the scenarios that were successfully tested during post-implementation
simulation were also tested on the proto-board. The scenarios are described earlier in
Chapter 6 and in the comments of Appendix A, and the simulation tracers for these
scenarios are shown in Figures 6.4, 6.5 and 6.6. The post-implementation simulation
results and the experiments performed on the proto-board showed consistent results,
demonstrating that the interconnect network performed correctly from a functional
standpoint for all scenarios.
Interconnect Network Performance:
The design was tested on the proto-board at 50 MHz, but setup-time and hold-time
violations were observed on the post-synthesis simulation tracer at a 50 MHz clock
frequency in some cases where the inputs change frequently. These violations were not
observed on the simulation tracer at a 20 MHz clock frequency. The total delay between
any processor and any memory block, which includes connection establishment and
network latency, is about 10 ns; this is within one-half the clock period when operated at
20 MHz. Hence a processor can access data from two different memory blocks on two
successive falling edges of ‘clk’. A similar scenario was tested in both post-implementation
simulation and experimental testing on the proto-board. This can be observed on
simulation tracer ‘1’ shown in Figure 6.4, where processor ‘1’ reads ‘3’ from memory
location ‘3’ in memory block ‘3’ (shown at ‘scnr’ = ‘1’) and reads ‘0’ from memory
location ‘2’ in memory block ‘1’ (shown at ‘scnr’ = ‘2’). On the simulation tracer the
scenario does not change from ‘1’ to ‘2’ on successive clocks, but similar data transfers
between any processor and any memory block can occur on two successive clocks.
At this 20 MHz clock frequency, the maximum bandwidth possible for data transfers
through the interconnect is 40 MB/s. This is the case when all the processors access
different memory blocks without any conflict at any memory block. On every falling edge
of ‘clk’, 4 bits of data can be transferred between any processor and any memory block.
Hence, in the best case, when all processors access different memory blocks, 16 bits
(2 bytes) of data are transferred through the interconnect network on every falling edge of
‘clk’. The connection establishment between any processor and any memory block does
not take more than one-half of a clock period, as required for the transfer to complete on
the falling edge.
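As a quick check of these figures (a back-of-the-envelope calculation, assuming one 16-bit, four-hexbit transfer per clock period at 20 MHz):

    16 bits/clock × (20 × 10^6 clocks/s) = 320 Mbit/s = 40 Mbyte/s

A single processor moving one 4-bit hexbit per clock therefore sustains one quarter of this, or 10 Mbyte/s.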
Shown below is the resource utilization of the testcase ‘top’ implemented on a Spartan
XL (XCS10PC84) FPGA chip. This data gives the number of Configurable Logic Blocks
(CLBs), flip-flops, latches, Look-Up Tables (LUTs), and the equivalent gate count used in
implementing the testcase ‘top’. This data is generated during the post-implementation
process.
Resource Utilization Summary of Spartan XL FPGA (XCS10PC84) Chip for Testcase ‘top’ of Figure 7.1 Programmed to Chip:

Number of CLBs: 196 out of 196 (100%)
CLB Flip Flops: 96
CLB Latches: 0
4-input LUTs: 354
3-input LUTs: 130
Total equivalent gate count for design: 3,273
Chapter 8
Conclusions
A modular and scalable architecture and design for a crossbar
interconnect network of an HDCA single-chip multiprocessor system was first
presented. Its development, simulation validation, and experimental hardware
prototype validation were also successfully accomplished in this project. The
design capture, simulation (pre-synthesis, post-synthesis and post-
implementation) and implementation were done in VHDL using Xilinx
Foundation PC based CAD software. The design was implemented as a
prototype in a PROM based Xilinx Spartan XL FPGA chip. The FPGA chip
used as the target in downloading the VHDL code generated bit stream was
the XCS10PC84 from the Xilinx Spartan XL family. The pre-synthesis,
post-synthesis and post-implementation VHDL functional simulation results
obtained from the designed interconnect network matched the FPGA chip
hardware prototype experimental results for all test scenarios, and all original
design specifications were met.
References

1. George Broomell and J. Robert Heath, “Classification Categories and Historical Development of Circuit Switching Topologies”, ACM Computing Surveys, Vol. 15, No. 2, pp. 95-133, June 1983.
2. J. Robert Heath, Paul Maxwell, Andrew Tan, and Chameera Fernando, “Modeling, Design, and Experimental Verification of Major Functional Units of a Scalable Run-Time Reconfigurable and Dynamic Hybrid Data/Command Driven Single-Chip Multiprocessor Architecture and Development and Testing of a First-Phase Prototype”, Private Communication, 2002.
3. M. L. Bos, “Design of a Chip Set for a Parallel Computer based on the Crossbar Interconnection Principle”, Proceedings of the 38th Midwest Symposium on Circuits and Systems, 1995, Vol. 2, pp. 752-756, 1996.
4. Enrique Coen-Alfaro and Gregory W. Donohoe, “A Comparison of Circuits for On-Chip Programmable Crossbar Switches”, www.mrc.unm.edu/symposiums/symp02/dsp/donohoe/coen_donohoe.doc.
5. J. R. Heath and S. Balasubramanian, “Development, Analysis, and Verification of a Parallel Dataflow Computer Architectural Framework and Associated Load-Balancing Strategies and Algorithms via Parallel Simulation”, SIMULATION, Vol. 69, No. 1, pp. 7-25, July 1997.
6. J. Robert Heath, George Broomell, Andrew Hurt, Jim Cochran, and Liem Le, “A Dynamic Pipeline Computer Architecture for Data Driven Systems”, Research Project Report, Contract No. DASG60-79-C-0052, University of Kentucky Research Foundation, Lexington, KY 40506, Feb. 1982.
7. Xilinx Foundation Software Version 3.1 Manual, www.xilinx.com.
8. Datasheets for Spartan XL Series Devices, www.digilent.cc.
9. Kevin Skahill, VHDL for Programmable Logic, Addison-Wesley, 1993.
10. Sudhakar Yalamanchali, Introductory VHDL: From Simulation to Synthesis, Prentice Hall, 2001.
11. http://www.gup.uni-linz.ac.at/thesis/diploma/christian_schaubschlaeger/html/chapter02a5.html
12. http://www.crhc.uiuc.edu/ece412/lectures/lecture20.pdf
Appendix A
Interconnect Network and Memory VHDL Code (Version 1)
The VHDL Code for the Crossbar Interconnect Network of the HDCA
System and Its Shared Memory Organization as Structured in Figures 7.1
and 6.1.
-- This VHDL code describes the Crossbar interconnect network and shared
-- memory organization with only one interface, on the processor side.
-- Equivalent Block diagram for module 'top' is shown in Figure 7.1
-- Equivalent Block diagram for module 'main' is shown in Figure 6.1

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity top is
  port (clk:  in std_logic;
        rst:  in std_logic;
        scnr: in std_logic_vector(2 downto 0);
        pid:  in std_logic_vector(2 downto 0);
        data: out std_logic_vector(3 downto 0);
        addr: out std_logic_vector(3 downto 0));
end top;

architecture test_top of top is

  component main is
    port (ctrl:     in std_logic_vector(3 downto 0);
          qdep:     in std_logic_vector(15 downto 0);
          addr_bus: in std_logic_vector(15 downto 0);
          data_in:  in std_logic_vector(15 downto 0);
          rw:       in std_logic_vector(3 downto 0);
          clk:      in std_logic;
          rst:      in std_logic;
          flag:     inout std_logic_vector(3 downto 0);
          data_out: out std_logic_vector(15 downto 0));
  end component main;

  signal ctrl:     std_logic_vector(3 downto 0);
  signal flag:     std_logic_vector(3 downto 0);
  signal qdep:     std_logic_vector(15 downto 0);
  signal addr_bus: std_logic_vector(15 downto 0);
  signal data_in:  std_logic_vector(15 downto 0);
  signal data_out: std_logic_vector(15 downto 0);
  signal rw:       std_logic_vector(3 downto 0);

begin

  -- This is equivalent to the Stimulus generator in Figure 7.1.
  -- All these scenarios are discussed in Chapter 6.
  stim_gen: process (scnr) is
  begin
    case scnr(2 downto 0) is
      when "000" =>
        -- all processors write to memory locations in different memory blocks
        data_in  <= x"4321";
        addr_bus <= x"FB73";
        qdep     <= x"1234";
        ctrl     <= x"F";
        rw       <= x"F";
      when "001" =>
        -- all processors read from memory locations in different memory
        -- blocks, in the reverse order
        data_in  <= x"26FE";
        addr_bus <= x"37BF";
        qdep     <= x"1234";
        ctrl     <= x"F";
        rw       <= x"0";
      when "010" =>
        -- processors 2 (write) and 3 (read) to different memory locations in
        -- the same memory block; processor 2 gets priority as its qdepth is greater.
        -- processors 0 (write) and 1 (read) to different memory locations in
        -- the same memory block; processor 1 gets priority as its processor id is greater.
        data_in  <= x"AAAA";
        addr_bus <= x"CD56";
        qdep     <= x"EFFF";
        ctrl     <= x"F";
        rw       <= x"5";
      when "011" =>
        -- processors 2 (write) and 3 (read) to different memory locations in
        -- the same memory block; processor 3 gets priority as its processor id is greater.
        -- processors 0 (write) and 1 (read) to different memory locations in
        -- the same memory block; processor 0 gets priority as its qdepth is greater.
        data_in  <= x"5555";
        addr_bus <= x"DC65";
        qdep     <= x"4434";
        ctrl     <= x"F";
        rw       <= x"5";
      when "100" =>
        -- processors 2 (write) and 3 (write) to the same memory location;
        -- processor 2 gets priority as its qdepth is greater.
        -- processors 0 (read) and 1 (read) from the same memory location;
        -- processor 1 gets priority as its qdepth is greater.
        data_in  <= x"9999";
        addr_bus <= x"3399";
        qdep     <= x"4EE4";
        ctrl     <= x"F";
        rw       <= x"C";
      when "101" =>
        -- processors 0 (write), 1 (read), 2 (read) and 3 (read) to different
        -- memory locations in the same memory block;
        -- processor 0 gets priority as its qdepth is greater.
        data_in  <= x"EEEE";
        addr_bus <= x"7654";
        qdep     <= x"5556";
        ctrl     <= x"F";
        rw       <= x"1";
      when "110" =>
        -- processors 1 (read), 2 (read) and 3 (read) to the same memory
        -- location (processor 0 is idle);
        -- processor 3 gets priority as its processor id is greater.
        data_in  <= x"CCCC";
        addr_bus <= x"6666";
        qdep     <= x"5555";
        ctrl     <= x"E";
        rw       <= x"0";
      when others =>
        -- all processors in the idle state; flag is "0000"
        data_in  <= x"FFFF";
        addr_bus <= x"EEAE";
        qdep     <= x"0011";
        ctrl     <= x"0";
        rw       <= x"0";
    end case;
  end process stim_gen;

  INST: main port map (clk => clk, rst => rst, data_in => data_in,
                       qdep => qdep, addr_bus => addr_bus, rw => rw,
                       ctrl => ctrl, flag => flag, data_out => data_out);

  -- This is equivalent to the 'display' block in Figure 7.1.
  disply: process (ctrl, scnr, pid) is
  begin
    case pid(2 downto 0) is
      when "000" =>
        data    <= qdep(3 downto 0);
        addr(0) <= ctrl(0);
        addr(1) <= rw(0);
        addr(2) <= flag(0);
        addr(3) <= flag(0);
      when "001" =>
        if (rw(0) = '0') then
          data <= data_out(3 downto 0);
        else
          data <= data_in(3 downto 0);
        end if;
        addr <= addr_bus(3 downto 0);
      when "010" =>
        data    <= qdep(7 downto 4);
        addr(0) <= ctrl(1);
        addr(1) <= rw(1);
        addr(2) <= flag(1);
        addr(3) <= flag(1);
      when "011" =>
        if (rw(1) = '0') then
          data <= data_out(7 downto 4);
        else
          data <= data_in(7 downto 4);
        end if;
        addr <= addr_bus(7 downto 4);
      when "100" =>
        data    <= qdep(11 downto 8);
        addr(0) <= ctrl(2);
        addr(1) <= rw(2);
        addr(2) <= flag(2);
        addr(3) <= flag(2);
      when "101" =>
        if (rw(2) = '0') then
          data <= data_out(11 downto 8);
        else
          data <= data_in(11 downto 8);
        end if;
        addr <= addr_bus(11 downto 8);
      when "110" =>
        data    <= qdep(15 downto 12);
        addr(0) <= ctrl(3);
        addr(1) <= rw(3);
        addr(2) <= flag(3);
        addr(3) <= flag(3);
      when others =>
        if (rw(3) = '0') then
          data <= data_out(15 downto 12);
        else
          data <= data_in(15 downto 12);
        end if;
        addr <= addr_bus(15 downto 12);
    end case;
  end process disply;

end architecture test_top;

-- The main interconnect module
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity main is
  port (ctrl:     in std_logic_vector(3 downto 0);
        qdep:     in std_logic_vector(15 downto 0);
        addr_bus: in std_logic_vector(15 downto 0);
        data_in:  in std_logic_vector(15 downto 0);
        rw:       in std_logic_vector(3 downto 0);
        clk:      in std_logic;
        rst:      in std_logic;
        flag:     inout std_logic_vector(3 downto 0);
        data_out: out std_logic_vector(15 downto 0));
end main;

architecture main_arch of main is

  type qd is array (3 downto 0) of std_logic_vector(3 downto 0);
  type data_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type addr_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type mb is array (3 downto 0) of std_logic_vector(1 downto 0);
  type mem_array is array (15 downto 0) of std_logic_vector(3 downto 0);

  -- This function does the priority logic for all the memory blocks.
  -- It is schematically equivalent to Figures 4.5 and 4.6 in the report.
  -- It can work for any number of processors and memory blocks by changing
  -- the 'i' and 'j' loop bounds.
  function flg (qdep, addr_bus, ctrl: std_logic_vector) return std_logic_vector is
    variable qdvar:   std_logic_vector(3 downto 0);
    variable flag:    std_logic_vector(3 downto 0);
    variable qdv:     std_logic_vector(3 downto 0);
    variable gnt:     std_logic;
    variable a:       integer;
    variable b:       integer;
    variable memaddr: mb;
    variable qd_arr:  qd;
  begin
    qd_arr(0) := qdep(3 downto 0);
    qd_arr(1) := qdep(7 downto 4);
    qd_arr(2) := qdep(11 downto 8);
    qd_arr(3) := qdep(15 downto 12);
    memaddr(0) := addr_bus(3 downto 2);
    memaddr(1) := addr_bus(7 downto 6);
    memaddr(2) := addr_bus(11 downto 10);
    memaddr(3) := addr_bus(15 downto 14);
    L1: for i in 0 to 3 loop
      L2: for j in 0 to 3 loop
        if (ctrl(j) = '0') then
          flag(j) := '0';
          qdv(j)  := '0';
        elsif (memaddr(j) = i) then
          qdv(j) := '1';
        else
          qdv(j) := '0';
        end if;
      end loop L2;
      qdvar := "0000";
      gnt   := '0';
      L3: for k in 0 to 3 loop
        if qdv(k) = '1' then
          if qdvar <= qd_arr(k) then
            qdvar := qd_arr(k);
            a     := k;
            gnt   := '1';
          else
            flag(k) := '0';
          end if;
        end if;
      end loop L3;
      if (gnt = '1') then
        flag(a) := '1';
      end if;
    end loop L1;
    return (flag);
  end flg;

  signal memory: mem_array;

begin

  P1: process (ctrl, clk, qdep, addr_bus, rst, data_in) is
  begin
    if (rst = '1') then
      flag <= "0000";
    else
      flag <= flg(qdep, addr_bus, ctrl);
      -- Memory transaction.
      -- The conditional statements make sure that the connection is
      -- established before the memory transaction.
      -- Equivalent to Figures 4.4 and 4.1 after the completion of the
      -- priority logic operation.
      -- This routine is to be repeated for each addition of a processor.
      if (clk'event and clk = '0') then
        if (flag(0) = '1') then
          if (rw(0) = '1') then
            memory(conv_integer(addr_bus(3 downto 0))) <= data_in(3 downto 0);
            data_out(3 downto 0) <= (others => 'Z');
          else
            data_out(3 downto 0) <= memory(conv_integer(addr_bus(3 downto 0)));
          end if;
        end if;
        if (flag(1) = '1') then
          if (rw(1) = '1') then
            memory(conv_integer(addr_bus(7 downto 4))) <= data_in(7 downto 4);
            data_out(7 downto 4) <= (others => 'Z');
          else
            data_out(7 downto 4) <= memory(conv_integer(addr_bus(7 downto 4)));
          end if;
        end if;
        if (flag(2) = '1') then
          if (rw(2) = '1') then
            memory(conv_integer(addr_bus(11 downto 8))) <= data_in(11 downto 8);
            data_out(11 downto 8) <= (others => 'Z');
          else
            data_out(11 downto 8) <= memory(conv_integer(addr_bus(11 downto 8)));
          end if;
        end if;
        if (flag(3) = '1') then
          if (rw(3) = '1') then
            memory(conv_integer(addr_bus(15 downto 12))) <= data_in(15 downto 12);
            data_out(15 downto 12) <= (others => 'Z');
          else
            data_out(15 downto 12) <= memory(conv_integer(addr_bus(15 downto 12)));
          end if;
        end if;
      end if;
    end if;
  end process P1;

end main_arch;
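A minimal simulation driver for the code above is sketched below; it is not part of the original report. It instantiates ‘top’, generates a 20 MHz clock (the frequency validated in Chapter 7), releases reset, and steps ‘scnr’ through the eight scenarios. The entity name ‘top_tb’ and the timing constants are illustrative assumptions.

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;

entity top_tb is
end top_tb;

architecture sim of top_tb is
  signal clk  : std_logic := '0';
  signal rst  : std_logic := '1';
  signal scnr : std_logic_vector(2 downto 0) := "000";
  signal pid  : std_logic_vector(2 downto 0) := "000";
  signal data : std_logic_vector(3 downto 0);
  signal addr : std_logic_vector(3 downto 0);
begin
  uut: entity work.top
    port map (clk => clk, rst => rst, scnr => scnr, pid => pid,
              data => data, addr => addr);

  clk <= not clk after 25 ns;  -- 50 ns period = 20 MHz

  stim: process
  begin
    wait for 100 ns;
    rst <= '0';                -- release reset
    for s in 0 to 7 loop       -- walk through scenarios 0..7
      scnr <= conv_std_logic_vector(s, 3);
      wait for 200 ns;         -- hold each scenario for a few clocks
    end loop;
    wait;                      -- end of simulation
  end process stim;
end sim;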
Appendix B
Interconnect Network VHDL Code (Version 2)
The VHDL Code for the Crossbar Interconnect Network as Structured in
Figure 6.2.
-- This VHDL code describes the Crossbar interconnect network with two
-- interfaces, one on the processor side and the other on the shared memory side.
-- Equivalent Block diagram for module 'main_ic' is shown in Figure 6.2

library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
use IEEE.std_logic_unsigned.all;

entity main_ic is
  port (ctrl:         in std_logic_vector(3 downto 0);
        qdep:         in std_logic_vector(15 downto 0);
        addr_prc:     in std_logic_vector(15 downto 0);
        data_in_prc:  in std_logic_vector(15 downto 0);
        data_in_mem:  in std_logic_vector(15 downto 0);
        rw:           in std_logic_vector(3 downto 0);
        clk:          in std_logic;
        rst:          in std_logic;
        flag:         inout std_logic_vector(3 downto 0);
        addr_mem:     out std_logic_vector(7 downto 0);
        rw_mem:       out std_logic_vector(3 downto 0);
        data_out_prc: out std_logic_vector(15 downto 0);
        data_out_mem: out std_logic_vector(15 downto 0));
end main_ic;

architecture main_arch of main_ic is

  type qd is array (3 downto 0) of std_logic_vector(3 downto 0);
  type data_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type addr_array is array (3 downto 0) of std_logic_vector(3 downto 0);
  type mb is array (3 downto 0) of std_logic_vector(1 downto 0);

  -- This function does the priority logic for all the memory blocks.
  -- It is schematically equivalent to Figures 4.5 and 4.6 in the report.
  -- It can work for any number of processors and memory blocks by changing
  -- the 'i' and 'j' loop bounds.
  function flg (qdep, addr_prc, ctrl: std_logic_vector) return std_logic_vector is
    variable qdvar:   std_logic_vector(3 downto 0);
    variable flag:    std_logic_vector(3 downto 0);
    variable qdv:     std_logic_vector(3 downto 0);
    variable gnt:     std_logic;
    variable a:       integer;
    variable b:       integer;
    variable memaddr: mb;
    variable qd_arr:  qd;
  begin
    qd_arr(0) := qdep(3 downto 0);
    qd_arr(1) := qdep(7 downto 4);
    qd_arr(2) := qdep(11 downto 8);
    qd_arr(3) := qdep(15 downto 12);
    memaddr(0) := addr_prc(3 downto 2);
    memaddr(1) := addr_prc(7 downto 6);
    memaddr(2) := addr_prc(11 downto 10);
    memaddr(3) := addr_prc(15 downto 14);
    L1: for i in 0 to 3 loop
      L2: for j in 0 to 3 loop
        if (ctrl(j) = '0') then
          flag(j) := '0';
          qdv(j)  := '0';
        elsif (memaddr(j) = i) then
          qdv(j) := '1';
        else
          qdv(j) := '0';
        end if;
      end loop L2;
      qdvar := "0000";
      gnt   := '0';
      L3: for k in 0 to 3 loop
        if qdv(k) = '1' then
          if qdvar <= qd_arr(k) then
            qdvar := qd_arr(k);
            a     := k;
            gnt   := '1';
          else
            flag(k) := '0';
          end if;
        end if;
      end loop L3;
      if (gnt = '1') then
        flag(a) := '1';
      end if;
    end loop L1;
    return (flag);
  end flg;

  signal data_in_mem_2d:  data_array;
  signal data_out_mem_2d: data_array;
  signal addr_mem_2d:     mb;

begin

  data_in_mem_2d(0) <= data_in_mem(3 downto 0);
  data_in_mem_2d(1) <= data_in_mem(7 downto 4);
  data_in_mem_2d(2) <= data_in_mem(11 downto 8);
  data_in_mem_2d(3) <= data_in_mem(15 downto 12);

  -- 'clk' is included in the sensitivity list so that the clocked branch
  -- below is evaluated on clock edges.
  P1: process (rst, clk, addr_prc, ctrl) is
  begin
    if (rst = '1') then
      flag <= "0000";
      -- clear all four processor data-out hexbits on reset
      data_out_prc(3 downto 0)   <= (others => '0');
      data_out_prc(7 downto 4)   <= (others => '0');
      data_out_prc(11 downto 8)  <= (others => '0');
      data_out_prc(15 downto 12) <= (others => '0');
    else
      flag <= flg(qdep, addr_prc, ctrl);
      -- Memory transaction.
      -- The conditional statements make sure that the connection is
      -- established before the memory transaction.
      -- Equivalent to Figures 4.4 and 4.1 after the completion of the
      -- priority logic operation.
      -- This routine is to be repeated for each addition of a processor.
      if (clk'event and clk = '1') then
        if (flag(0) = '1') then
          addr_mem_2d(conv_integer(addr_prc(3 downto 2))) <= addr_prc(1 downto 0);
          rw_mem(conv_integer(addr_prc(3 downto 2))) <= rw(0);
          if (rw(0) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(3 downto 2))) <= data_in_prc(3 downto 0);
            data_out_prc(3 downto 0) <= (others => 'Z');
          else
            data_out_prc(3 downto 0) <= data_in_mem_2d(conv_integer(addr_prc(3 downto 2)));
          end if;
        end if;
        if (flag(1) = '1') then
          addr_mem_2d(conv_integer(addr_prc(7 downto 6))) <= addr_prc(5 downto 4);
          rw_mem(conv_integer(addr_prc(7 downto 6))) <= rw(1);
          if (rw(1) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(7 downto 6))) <= data_in_prc(7 downto 4);
            data_out_prc(7 downto 4) <= (others => 'Z');
          else
            data_out_prc(7 downto 4) <= data_in_mem_2d(conv_integer(addr_prc(7 downto 6)));
          end if;
        end if;
        if (flag(2) = '1') then
          addr_mem_2d(conv_integer(addr_prc(11 downto 10))) <= addr_prc(9 downto 8);
          rw_mem(conv_integer(addr_prc(11 downto 10))) <= rw(2);
          if (rw(2) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(11 downto 10))) <= data_in_prc(11 downto 8);
            data_out_prc(11 downto 8) <= (others => 'Z');
          else
            data_out_prc(11 downto 8) <= data_in_mem_2d(conv_integer(addr_prc(11 downto 10)));
          end if;
        end if;
        if (flag(3) = '1') then
          addr_mem_2d(conv_integer(addr_prc(15 downto 14))) <= addr_prc(13 downto 12);
          rw_mem(conv_integer(addr_prc(15 downto 14))) <= rw(3);
          if (rw(3) = '1') then
            data_out_mem_2d(conv_integer(addr_prc(15 downto 14))) <= data_in_prc(15 downto 12);
            data_out_prc(15 downto 12) <= (others => 'Z');
          else
            data_out_prc(15 downto 12) <= data_in_mem_2d(conv_integer(addr_prc(15 downto 14)));
          end if;
        end if;
      end if;
      addr_mem(1 downto 0) <= addr_mem_2d(0);
      addr_mem(3 downto 2) <= addr_mem_2d(1);
      addr_mem(5 downto 4) <= addr_mem_2d(2);
      addr_mem(7 downto 6) <= addr_mem_2d(3);
      data_out_mem(3 downto 0)   <= data_out_mem_2d(0);
      data_out_mem(7 downto 4)   <= data_out_mem_2d(1);
      data_out_mem(11 downto 8)  <= data_out_mem_2d(2);
      data_out_mem(15 downto 12) <= data_out_mem_2d(3);
    end if;
  end process P1;

end main_arch;