
GAPH Application Note

GAPP 01 (v1.0) June 30, 2004

MultiNoC: A Multiprocessing System Enabled by a Network on Chip
Authors: Leandro Heleno Möller, Aline Vieira de Mello, Everton Carara

1 DESIGN OVERVIEW
Due to technological evolution, the complexity of designs and, consequently, the number of intellectual property cores (IPs) in the same integrated circuit are increasing. This increase leads to research on new intra-chip interconnection structures able to cope with the communication requirements of future System-on-Chip (SoC) architectures. Also of importance is the current trend of increasing the number of embedded processors in SoCs, leading to the concept of sea-of-processors systems [HEN03].

The MultiNoC system implements a programmable on-chip multiprocessing platform built on top of an efficient, low-area-overhead intra-chip interconnection scheme. The employed interconnection structure is a Network on Chip [BEN02], or NoC, called Hermes, a wormhole packet-switching communication module developed by the authors. NoCs are emerging as a possible solution to the constraints of existing interconnection architectures, due to the following characteristics: (i) energy efficiency and reliability; (ii) bandwidth scalability compared to traditional bus architectures; (iii) reusability; (iv) distributed routing decisions.

The router is the main component of a NoC, responsible for transferring packets between IPs [RIP01]. The Hermes router has a single, centralized control logic and up to five bi-directional ports: East, West, North, South and Local. Each port has an input buffer for temporary storage of packets. The Local port establishes the communication between the router and its local IP. The other ports connect the router to its neighbor routers. The control logic implements the arbitration logic and a routing algorithm. At an operating frequency of 50 MHz and a word (flit) size of 8 bits, the theoretical peak throughput of each Hermes router is 1 Gbit/s.
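The note does not derive the 1 Gbit/s figure. One reading consistent with the handshake protocol described in Section 3.1 (each flit takes at least 2 clock cycles) assumes all five ports transmitting simultaneously:

5\;\text{ports} \times \frac{8\,\text{bits/flit}}{2\,\text{cycles/flit}} \times 50\,\text{MHz} = 1\,\text{Gbit/s}

This is a back-of-the-envelope check, not a derivation given by the authors.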

Applications running on MultiNoC work as follows. A host processor feeds MultiNoC with application instructions and data. After this initialization procedure, MultiNoC executes the algorithm. After execution finishes, the output data can be read back by the host. Clearly, any sequential or parallel algorithm suitably adapted to the MultiNoC structure can be executed. The MultiNoC system ships with a parallel sort application, detailed in Section 6, which illustrates how the system can be employed and how MultiNoC functionality can be defined through programming.

The limitations of the current version of the MultiNoC system arise from the area restrictions of the employed FPGA, as well as from the choice of a low-cost, low-performance serial external communication interface. The approach can in principle be extended to any number of processor IPs and/or memory IPs, exploiting the natural scalability of NoCs. It can also be adapted to faster external interface protocols, such as USB, PCI or FireWire.

2 BLOCK DIAGRAM
MultiNoC comprises four IP cores connected to the Hermes NoC, as illustrated in Figure 1:

2 R8 embedded processors. The R8 is a 16-bit load-store processor architecture, containing a 16x16-bit register file and supporting execution of 36 distinct instructions. Each R8 processor has an attached local memory for program and data (1K 16-bit words), acting as a unified cache.

1 memory IP, implemented with 4 Block RAMs, resulting in a capacity of 1024 16-bit words.

1 RS-232 serial IP, providing bi-directional communication with a host processor.

The MultiNoC system is a NUMA (non-uniform memory access) architecture, in which each processor has its own local memory, but can also have access to memory owned by other processors, or to remote memory.

The external interface of the MultiNoC system is very simple. It comprises 4 signals:


reset: initializes the MultiNoC system;

clock: the basic synchronization signal;

tx: data from the host computer to the MultiNoC system;

rx: data from the MultiNoC system to the host computer.

Figure 1: MultiNoC system block diagram.

3 DETAILED DESCRIPTION
The next sections present each of the MultiNoC IP cores in detail.

3.1 Hermes IP core

To understand how communication works in the MultiNoC system, some basic definitions need to be advanced:

A message is a certain amount of data to be exchanged between parts of the system.

A packet is the unit of data sent across a network.

In the context of NoCs, packets are a fraction of a message. Packets are often composed of a header, a payload and a trailer.

The Hermes NoC employs packet switching, a communication mechanism in which packets are individually routed between nodes, with no previously established communication path [GUE00].

In particular, the Hermes NoC uses the wormhole packet switching mode to avoid the need for large buffer spaces [MOH98]. A packet is transmitted between routers in units called flits (flow control digits – the smallest unit of flow control). Only the header flit has the routing information. Thus, the rest of the flits that compose a packet must follow the same path reserved for the header.

The routing algorithm defines the path taken by a packet between the source and the destination. The Hermes NoC employs the deterministic XY routing algorithm.

The Hermes NoC follows a mesh topology, as depicted in Figure 2, chosen to facilitate routing, IP core placement and chip layout generation. The routers in MultiNoC use an 8-bit flit size, and the number of flits in a packet is limited to 2^(flit size in bits), i.e. 2^8 = 256 flits here. The first and second flits of a packet carry header information: the first, named the header flit, is the address of the target router, and the second is the number of flits in the packet payload. Each router must have a unique address in the network. To simplify routing on the network, this address is expressed in XY coordinates, where X represents the horizontal position and Y the vertical position.
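To make the packet structure concrete, the following C sketch assembles a minimal Hermes packet with 8-bit flits. The bit-level layout of the XY address inside the header flit (X in the upper nibble, Y in the lower) is an assumption for illustration only; the note does not specify it.

#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch: assemble a Hermes packet with 8-bit flits.
   Assumed header layout: X coordinate in bits 7..4, Y in bits 3..0. */
size_t assemble_packet(uint8_t x, uint8_t y,
                       const uint8_t *payload, uint8_t len,
                       uint8_t *out /* room for len + 2 flits */)
{
    out[0] = (uint8_t)(((x & 0x0F) << 4) | (y & 0x0F)); /* header flit: target address */
    out[1] = len;                                       /* second flit: payload size   */
    for (uint8_t i = 0; i < len; i++)
        out[2 + i] = payload[i];                        /* remaining flits: payload    */
    return (size_t)len + 2;                             /* total number of flits       */
}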

Figure 2: An example 3x3 mesh topology for the Hermes NoC. C marks IP cores; router addresses indicate the XY position in the network.

An asynchronous handshake protocol is used between neighbor routers. The physical interface between routers is presented in Figure 3 and is composed of the following signals:

tx: control signal indicating data availability;

data_out: data to be sent;

ack_tx: control signal indicating successful data reception;

rx: control signal indicating data availability;

data_in: data to be received;

ack_rx: control signal indicating successful data reception.

Figure 3: Physical interface between each pair of routers.

The basic Hermes router has a control logic and five bi-directional ports: East, West, North, South, and Local. In MultiNoC, the number of ports in each router is limited to three, due to the reduced size of the network (2x2). Each port has an input buffer for temporary storage of information. The Local port establishes communication between the router and its local core and is always present. The other ports of the router are connected to neighbor routers, as shown in Figure 4.

Figure 4: Hermes router architecture. B indicates input buffers.

The control logic implements the routing and arbitration algorithms. When a router receives a header flit, arbitration is performed; if the incoming packet request is granted, the XY routing algorithm is executed to connect the input port to the correct output port. If the chosen port is busy, the header flit, as well as all subsequent flits of the packet, is blocked in the input buffers. The routing request for this packet remains active until a connection is established in some future execution of the procedure in this router. When the XY routing algorithm finds a free output port, the connection between the input port and the output port is established. After all flits of the packet have been routed, the connection is closed.
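A minimal C sketch of the XY routing decision follows. Correcting X before Y is the defining property of XY routing; the mapping of increasing Y to the North port is an assumption here, and the real Hermes hardware implements this logic as RTL, not software.

#include <stdint.h>

typedef enum { EAST, WEST, NORTH, SOUTH, LOCAL } port_t;

/* Deterministic XY routing: correct the X coordinate first,
   then the Y coordinate; deliver locally when both match. */
port_t xy_route(uint8_t cur_x, uint8_t cur_y, uint8_t dst_x, uint8_t dst_y)
{
    if (dst_x > cur_x) return EAST;
    if (dst_x < cur_x) return WEST;
    if (dst_y > cur_y) return NORTH; /* assumed orientation */
    if (dst_y < cur_y) return SOUTH;
    return LOCAL;                    /* arrived at the target router */
}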

A router can establish up to five connections simultaneously. Arbitration logic is used to grant access to an output port when two or more input ports request the same output port at the same time. A round-robin arbitration scheme is used to avoid starvation.

When a flit is blocked in a given router, the performance of the network is affected, since several flits belonging to the same packet may be blocked across several intermediate routers. To lessen this performance loss, a 2-flit buffer is added to each router input port, reducing the number of routers affected by blocked flits. Larger buffers can provide enhanced NoC performance, but MultiNoC employs small buffers to cope with FPGA area restrictions. The inserted buffers work as circular FIFOs.

The minimal latency in clock cycles to transfer a packet from source to destination is given by:

latency = \left( \sum_{i=1}^{n} R_i + P \right) \times 2

where n is the number of routers in the communication path (source and target included), R_i is the time required by the routing algorithm at router i (at least 7 clock cycles), and P is the packet size in flits. The sum is multiplied by 2 because each flit requires at least 2 clock cycles to be sent, due to the handshake protocol.
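Transcribed directly into C (illustrative only; the 7-cycle minimum per router comes from the text above):

#include <stddef.h>

/* Minimal source-to-destination latency in clock cycles:
   (sum of per-router routing times + packet size in flits) * 2 */
unsigned min_latency(const unsigned *r, size_t n, unsigned p)
{
    unsigned sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += r[i];      /* routing time at router i (>= 7 cycles) */
    return (sum + p) * 2; /* 2 cycles per flit due to the handshake */
}

For example, a two-router path (source and target adjacent) at the 7-cycle minimum with a 34-flit packet gives (7 + 7 + 34) x 2 = 96 cycles.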

Packet Formats

The Hermes NoC in the MultiNoC system internally supports eight distinct packet formats, which define the set of services offered by the communication network to the IP cores connected to it. These packet formats are presented in Figure 5, each identified by a binary code in the command flit, the second flit of the payload (the 4th flit in the packet). Observe that the scanf and scanf return commands have the same identifier, being used in different directions (scanf: Processor → Serial → Host; scanf return: Host → Serial → Processor). The packet denominations/functions are listed below and summarized in a C enumeration after the list:

1. read from memory (command 0), is used to request data from memory;

2. write in memory (command 1), is used to store data into some memory of the system;

3. activate processor (command 2), initiates the processor, which then starts executing instructions from the first position of its local memory;

4. printf (command 3), is used by processors to send data to the host computer;

5. scanf (command 4), is used by processors to request user input data from the host computer;

6. scanf return (command 4), receives the requested input data from the host computer;

7. notify (command 8), is used to wake up a processor that has been blocked by a wait command;

8. read return (command 9), receives instruction and data from memory.
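Collected into a C enumeration (the numeric codes come from the list above; the identifier names are invented for readability):

/* Command codes carried in the command flit (see Figure 5). */
enum multinoc_cmd {
    CMD_READ_MEM    = 0, /* read from memory                 */
    CMD_WRITE_MEM   = 1, /* write in memory                  */
    CMD_ACTIVATE    = 2, /* activate processor               */
    CMD_PRINTF      = 3, /* processor-to-host output         */
    CMD_SCANF       = 4, /* shared by scanf and scanf return */
    CMD_NOTIFY      = 8, /* wake a processor from wait       */
    CMD_READ_RETURN = 9  /* data returned from memory        */
};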


Figure 5: MultiNoC packet formats.

3.2 Serial IP Core

The Serial IP core is responsible for providing communication between the user working on a host computer and the modules of the system connected through the NoC. This communication is performed through a standard RS-232 serial interface.

Figure 6 presents the Serial IP external interface. The signals at the top of the figure connect the module with the host computer. The signals at the bottom connect the module with the NoC. The function of each signal is:

txd: receives data serially from the host computer.

rxd: sends data serially to the host computer.

tx, data_out, ack_tx, rx, data_in, ack_rx: interface with the HERMES NoC.

Figure 6: Serial IP external interface.

The first task of the Serial IP is to adjust the transmission/reception data rate of the MultiNoC to the host transmission/reception data rate. This is done by sending a 55H byte from the host computer to the Serial IP. This procedure must be performed each time the MultiNoC system is initialized. After initialization, this module waits for data from either the host computer or the HERMES NoC, forwarding the received data to the opposite side.

The basic function of the Serial IP is to assemble and disassemble packets. When information comes from the host computer, the Serial IP creates a valid NoC packet. When a packet (with 8-bit flits) is received from the NoC, it must be disassembled and sent serially to the host computer.

The Serial IP accepts seven commands. Four commands are issued by the host computer: (i) read from memory; (ii) write in memory; (iii) activate processor; (iv) scanf return. The upper part of Figure 7 presents the byte structure of the messages sent by the host computer for each command, while the lower part displays the packets as assembled by the Serial IP and sent to the NoC. Each write or read message generates a separate packet. Burst transactions are not allowed in the MultiNoC system due to the area overhead they impose, which explains the missing command codes in the Packet Formats section above.


Figure 7: Data received from the host computer by the Serial IP, and corresponding packets sent to the HERMES NoC.

The other three commands accepted by the Serial IP travel from the HERMES NoC to the host computer: (i) printf; (ii) scanf; (iii) read return. The upper part of Figure 8 presents the packets coming from the NoC, while the lower part displays the corresponding messages sent to the host computer.

The printf and scanf commands are preceded by four 55H bytes, which indicate that a command rather than data information is coming, followed by the source address, indicating which module has sent the printf or scanf command.

Figure 8: Packets received from the HERMES NoC by the Serial IP, and corresponding data sent to the host computer.

3.3 Memory IP core

The Memory IP core provides storage for data and/or instructions, and can be accessed through the processor-memory bus or through the NoC. Three Memory IP cores are used in the MultiNoC system: two are internally connected to an R8 processor to form the processor IP core and one is an independently accessible remote memory.

Each Memory IP contains 4 BlockRAM modules, each organized as 1024 4-bit words, plus control logic to arbitrate access to the memory banks. Figure 9 shows the external interface of the Memory IP core and the BlockRAM organization. The memory banks are accessed in parallel, reading and writing 16-bit words. This access may be done through the processor interface or the NoC interface. The processor interface does not exist in the remote memory IP core.

Figure 9: Memory IP block diagram.

The Memory IP external interface is composed of the following signals:

clock: system clock (not shown in Figure 9);

reset: when asserted, initializes the control logic (not shown in Figure 9);

addressCore: system address assigned to this IP (not shown in Figure 9);

tx, data_out, ack_tx, rx, data_in, ack_rx: interface with the HERMES NoC;

interface with the processor:
- ceR8: enables memory read/write operations;
- rwR8: read or write operation selection;
- addrR8: processor address bus (16-bit);
- dinR8: processor input data bus (16-bit);
- doutR8: processor output data bus (16-bit);
- busyNoCR8: signals to the memory banks that an operation of the processor with the HERMES NoC is under way (e.g. I/O, read/write);
- busyNoCMem: signals to the processor that an operation of the memory with the HERMES NoC is under way (e.g. read return).

The busyNoCR8 and busyNoCMem signals prevent the processor and the memory from using the same NoC interface simultaneously. The highest priority for accessing the memory banks is given to the processor.

3.4 Processor IP core

The Processor IP external interface, as well as its two main internal modules, is presented in Figure 10. The Processor IP includes the R8 soft-core processor [GAP04a]; a Memory IP, acting as a unified cache; and control logic responsible for interfacing these modules to the HERMES NoC. The R8 soft core was chosen for its simplicity, low area footprint and flexibility to be modified.

Figure 10: Processor IP block diagram.

The R8 processor is a Von Neumann architecture (unified instruction/data memory), with a CPI (clocks per instruction) between 2 and 4. The complete processor instruction set is presented in Table 1. The datapath contains 16 general-purpose registers, a 16-bit instruction register (IR), a program counter (PC), a stack pointer (SP), and 4 status flags (negative, zero, carry and overflow).

Table 1: R8 processor instruction-set architecture.

The Processor IP control logic commands the execution of the R8 processor, putting it in a wait state each time the processor executes a load-store instruction (see the waitR8 signal in Figure 10). Load-store operations can access: (i) the local memory; (ii) a remote memory; (iii) I/O devices; (iv) other processors, for synchronization purposes. The next three sections detail these different access modes.

Memory Accesses

To determine which device of the MultiNoC system an R8 load-store instruction is accessing, address ranges were defined for each memory. The address ranges of the MultiNoC system are presented in Figure 11.


Figure 11: C code illustrating the access to local/remote memories. The Processor IP control logic arbitrates which memory block is accessed.

I/O Operations

The I/O operations are mapped to memory address FFFFH. Thus, when an ST instruction is executed at this address a printf is performed, and when an LD instruction is executed a scanf is performed. (A combined sketch of the memory-mapped addresses appears at the end of the next section.)

Synchronization Operations

Multiprocessor systems require synchronization mechanisms among processors to implement distributed applications. Synchronization among processors can be done through shared memory or through explicit message exchange. Given the use of a NoC, the second mechanism was the natural choice.

The Processor IP control logic implements the wait and notify commands, both memory-mapped. The wait command blocks the execution of the processor until a notify command is received. A wait is identified by the execution of a store instruction (ST) at address FFFEH, storing the number of the processor that will later restart this processor by executing an ST with a notify command. A notify is identified by the execution of an ST at address FFFDH, storing the number of the processor to be restarted.

For example, when the R8 processor with address 1 executes the instruction “ST R3, R1, R2” (R2=2, R1=0, R3=FFFEH), the Processor IP control logic pauses the R8 processor until it receives a packet with a notify command from the IP with address 2 (see Figure 5, command 8). The R8 processor with address 2 should execute “ST R3, R1, R2” (R2=1, R1=0, R3=FFFDH) to create the notify command. Clearly, the result of the addition R1+R2 indicates the target processor, in this last case processor 1.
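The memory-mapped addresses from this section and from the I/O Operations section can be pictured together in a short hypothetical C sketch (st() stands in for the R8 store instruction; the helper names are invented for illustration):

#include <stdint.h>

#define IO_ADDR     0xFFFFu /* ST performs printf, LD performs scanf */
#define WAIT_ADDR   0xFFFEu /* ST blocks until the peer notifies     */
#define NOTIFY_ADDR 0xFFFDu /* ST wakes the target processor         */

extern void st(uint16_t addr, uint16_t value); /* R8 store instruction */

void wait_for(uint16_t peer)   { st(WAIT_ADDR, peer);     } /* block until notify from 'peer' */
void notify(uint16_t target)   { st(NOTIFY_ADDR, target); } /* wake processor 'target'        */
void print_word(uint16_t word) { st(IO_ADDR, word);       } /* printf to the host             */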

4 SYSTEM PROTOTYPING
Due to the scarce area offered by the Spartan-IIe XC2S200E device [XIL04], the original design of the MultiNoC IPs was reduced:

R8 processors: four shift/rotate operations were suppressed from the original R8 instruction-set architecture;

Serial IP: read and write burst operations were suppressed, simplifying the finite state machine responsible for controlling this IP;

HERMES-NoC: the ideal flit size would be 16, equal to the processor word size. The flit width was adjusted to 8, and the input buffers were reduced from 8 to 2 flits.

The original 50 MHz clock of the prototyping board was divided by two using a clkdll component. The frequency had to be reduced because the timing analysis tool estimated a maximum operating frequency of only 21.23 MHz. Although the employed frequency (25 MHz) is higher than this estimate, the circuit worked correctly.

The Leonardo Spectrum tool was used to synthesize the MultiNoC system. Some constraints had to be set to obtain a working prototype: the design was optimized for delay, with maximum fanout equal to 20, use of low-skew lines, and hierarchy preservation (to make the use of the floorplanning tool possible). During physical synthesis it is also necessary to use the speed optimization strategy.

The MultiNoC system uses 98% of the available slices and 78% of the LUTs, according to the physical synthesis area report presented in Figure 12.


Figure 12: MultiNoC physical synthesis area report.

It is important to stress the value of floorplanning in designs that use most of the FPGA surface, since this creates a complex optimization problem. The use of synthesis and implementation options alone was not sufficient to make the design fit in the restricted area of the XC2S200E device; even multiple alternative choices of synthesis parameters could not adequately handle the 98% slice occupation. Figure 13 illustrates the MultiNoC floorplan that enabled physical synthesis to succeed. The reasoning behind the placement of the IPs is as follows:

the NoC IP is placed in the middle of the FPGA, offering easy access to it from all IPs in the system;

the Serial IP is placed next to the I/O pins responsible for the data transmission/reception to reduce global wire length and routing congestion;

the Processor IPs are placed on the left/right sides of the FPGA, near the corresponding BlockRAMs, for the same reason as the previous item;

the Memory IP (the smallest IP) is placed in the remaining area.

Figure 13: MultiNoC design floorplan.

The NoC accounts for an important part of the design area when compared to the other IPs. In fact, NoCs trade increased area for increased bandwidth (and thus performance). However, NoCs are in principle designed for much bigger systems than this prototype: it is not uncommon to consider NoCs a feasible communication medium for systems containing more than a hundred IPs (e.g. 10x10 NoCs) [KUM02]. When more area is available, the IPs connected to the NoC can grow in area and functionality, while the router area remains constant. The NoC dimensions then scale less than the IPs, becoming a very small fraction of the whole system, typically less than 10% or even 5%.

5 GETTING STARTED
Figure 14 illustrates the data flow to execute the MultiNoC system on a Digilab 2E development board featuring the Xilinx Spartan-IIe XC2S200E FPGA. Each step of this flow is detailed below.


Figure 14: MultiNoC system flow diagram.

Figure 15: A simple assembly code example.

5.1 Write the Assembly Code
The assembly code of the R8 processor must be written according to the R8 instruction set, presented in Table 1. Figure 15 presents an example of R8 assembly code, which sums n values entered by the user. The number of values and the values themselves are read by the MultiNoC using scanf instructions. The result is presented by the printf instruction and is also stored at address 20H of the processor local memory.

5.2 Simulate the Assembly Code
The R8 Simulator environment (Figure 16) allows writing, simulating and debugging assembly code, automatically generating the object code that must be opened in the current version of the Serial software and sent to the R8 processor. Unfortunately, the R8 Simulator is not able to simulate a multiprocessed application. The software can be downloaded at [GAP04b].

1. open the R8 Simulator;

2. load the ASM file, using the Load item of the File menu;

3. simulate the assembly code, interacting with the navigation buttons;

4. verify correct algorithm functionality through analysis of the R8 registers and memory values;

5. modify the assembly code until the algorithm produces the correct answer.

Note that when the assembly file (.asm) is loaded (step 2), the R8 Simulator automatically generates a text file (.txt) with the object code presented in Figure 17. An additional line must be added by the user at the beginning of this file, containing the write command as presented in Figure 7. This line is composed of the write memory command (01), the target IP address to receive the write command (01), the number of words in the object code (10H) and the initial address for this write command (00 00H).
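A hypothetical helper that emits such a header line for a generated object-code file (the byte layout follows Figure 7 as described above; the function itself is not part of the MultiNoC tool set):

#include <stdio.h>

/* Emit the header line of a 'write in memory' command:
   command (01), target IP, word count, start address (high, low). */
void write_header(FILE *f, unsigned target_ip, unsigned n_words, unsigned start)
{
    fprintf(f, "01 %02X %02X %02X %02X\n",
            target_ip, n_words, (start >> 8) & 0xFF, start & 0xFF);
}

For the example in the text, write_header(f, 0x01, 0x10, 0x0000) produces the line “01 01 10 00 00”.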


Figure 16: R8 Simulator user interface. The highlighted numbers correspond to the steps described before.

Figure 17: Object code to be written in P1 processor local memory.

Figure 18: Using the Impact software to configure the FPGA.

5.3 Configure the FPGA

To configure the FPGA, connect one of the supported configuration cables to the prototyping board and turn the board on. The MultiNoC system was tested only with the JTAG Parallel Cable 3. Start the Xilinx Impact software and detect the correct device, the XC2S200E. Then, select the multinoc.bit bitstream distributed with this application note (step 1 in Figure 18) and download it into the device (step 2 in Figure 18).

5.4 Reset the FPGA

After downloading the multinoc.bit bitstream into the FPGA, the next step is a hardware reset. Press button BTN1 of the Digilab 2E development board to set all system modules to their initial state. This button is highlighted in Figure 19.

Figure 19: Digilab 2E development board.

5.5 Start the Serial Software

The Serial software enables the host computer to communicate with the Spartan-IIe device.

This software requires JDK v1.2.2 (or later) and the javax.comm API (Java Communication API). The Serial software distribution is available at [GAP04c].


Follow the help.htm file to install the software and the communication API.

Start the Serial software, setting the parameters (step 1 in Figure 20) to: baud rate equal to 4800 bps, no parity, eight data bits, one stop bit, and configuration port COM1 (computer dependent). Next, connect the software to the prototyping board by selecting the connect option (step 2 in Figure 20).

Figure 20: Serial software implemented by the authors.
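The Serial software is a Java application, but the same port settings can be reproduced with any serial tool. As a sketch, a POSIX C version of the configuration (the device path is an assumption; on Windows the COM1 setting from the text applies instead):

#include <fcntl.h>
#include <termios.h>
#include <unistd.h>

/* Open a serial port at 4800 bps, 8 data bits, no parity, 1 stop bit,
   matching the parameters required by the MultiNoC Serial software. */
int open_multinoc_port(const char *path /* e.g. "/dev/ttyS0" */)
{
    int fd = open(path, O_RDWR | O_NOCTTY);
    if (fd < 0) return -1;

    struct termios tio;
    if (tcgetattr(fd, &tio) < 0) { close(fd); return -1; }
    cfmakeraw(&tio);                   /* raw byte stream, 8 data bits, no parity */
    cfsetispeed(&tio, B4800);          /* 4800 bps input                          */
    cfsetospeed(&tio, B4800);          /* 4800 bps output                         */
    tio.c_cflag &= ~CSTOPB;            /* one stop bit                            */
    if (tcsetattr(fd, TCSANOW, &tio) < 0) { close(fd); return -1; }
    return fd;
}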

5.6 Synchronize SW/HW

The MultiNoC system must receive the host computer baud rate from the Serial software in order to correctly receive data, as explained in Section 3.2. Select the sync.txt file, as shown in Figure 21, and press send to transmit the 55H value to the MultiNoC system.

Figure 21: Synchronization data.

If the file names are not available in the choice box, the files can be opened through the File menu, option Open.

5.7 Send Generated Object Code

The sum_hw.txt file, presented in Figure 17, must be sent by the Serial software to the MultiNoC system. Figure 22 presents the Serial software with the object code loaded, ready to fill the P1 processor local memory.

Figure 22: R8 object code loaded on Serial software.

5.8 Fill Memory Contents

Optionally, data may also be sent to the remote memory or to the local memories of the MultiNoC. Sending data to a memory is similar to sending the object code to the instruction memory.

5.9 Activate Processors

The system is now ready to execute the application by sending the activate processor command, as presented in Figure 23. The first byte sent indicates the activate processor command (02) and the second indicates the P1 processor address (01).

Figure 23: Activate P1 Processor.


5.10 I/O Operations

To execute I/O operations, the Serial software provides an interaction monitor for each processor. These monitors are configured through the processor.cfg file of the Serial software. For the two processors of the MultiNoC system, processor.cfg is configured as presented in Figure 24.

Figure 24: Processor configuration in the processor.cfg file.

The I/O steps to execute the sum application on the P1 processor are described below; the numbering corresponds to the labels in Figure 25. Figure 25(a) shows the input operations and Figure 25(b) presents the same monitor with an output operation showing the result of the application execution:

1. Enter the number of input values. In this case, five values will be added;

2. Enter the five values to be added. Each number is entered in the blue field at the bottom of the monitor;

3. The result of the addition operation is displayed.

Figure 25: I/O steps performed during execution of the sum application.

5.11 Debug

There are two ways of verifying correct functionality of the prototype. One is directly reading memory values (step 1 in Figure 26); the other is using printf instructions executed by some processor (step 2 in Figure 26). Printf instructions are part of the application code and help verify intermediate values. Memory reads can be used to verify the memory contents at the end of execution. In the example presented in Figure 26, the user has typed “00 01 01 00 20”, meaning a read operation (00) from the P1 processor local memory (01), reading just one memory position (01) starting at address 0020H (see Figure 7).

Figure 26: Two ways of debugging the MultiNoC system: 1) read operation; 2) printf operation.

If the application results present incorrect values after executing and debugging the system, a possible source of error is the synchronization between processors.

6 DEMO APPLICATION
Multiprocessed applications have to be adequately partitioned between the processors of the MultiNoC system during the mapping phase. The object code of each processor must use the synchronization operations carefully to avoid, e.g., deadlock or indefinite postponement situations. Another issue to take into account during application development is that the device used to implement the MultiNoC system imposes a 1024-word limit on each local memory and on the remote memory as well.

The demo application should be representative enough to illustrate the mapping issues. It should also use the Hermes NoC intensively, providing performance data to compare the network-on-chip solution against traditional bus architectures.

To fulfill these requirements, a parallel sorting application was chosen, operating on a 512-word integer vector stored in the remote memory.

The parallel sorting application combines the bubble sort and merge algorithms. Bubble sort is executed in parallel by both processors (P1 and P2), each on one half of the vector, while the merge step is executed by a single processor (P1) to generate the final result from the two sorted halves.

When processor P1 finishes sorting, it executes a wait instruction, blocking until it receives a notify instruction from processor P2. This pause synchronizes the end of sorting on both P1 and P2. After receiving the notify instruction, processor P1 executes the merge algorithm over all values.
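In outline, and reusing the hypothetical wait_for/notify helpers sketched in Section 3.4 (the sort routines stand in for the shipped R8 object code):

#include <stdint.h>

extern uint16_t vec[512];                               /* stored in the remote memory */
extern void bubble_sort(uint16_t *v, int lo, int hi);   /* assumed helper              */
extern void merge_halves(uint16_t *v, int mid, int hi); /* assumed helper              */
extern void wait_for(uint16_t peer);
extern void notify(uint16_t target);

void p1_main(void)               /* processor P1 */
{
    bubble_sort(vec, 0, 255);    /* sort the first half         */
    wait_for(2);                 /* block until P2 has finished */
    merge_halves(vec, 255, 511); /* merge the two sorted halves */
}

void p2_main(void)               /* processor P2 */
{
    bubble_sort(vec, 256, 511);  /* sort the second half        */
    notify(1);                   /* wake P1 so it can merge     */
}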

To execute the sorting application on the MultiNoC system, the general flow diagram presented in Figure 14 is followed. Some specific points are detailed below.

The codeP1.asm and codeP2.asm assembly codes were written and validated in the R8 Simulator environment to generate the processor object codes. Remember that a write command must be inserted in the first line of each object code, as explained in Section 5.2. The codeP1_hw.txt and codeP2_hw.txt files must then be sent by the Serial software to the MultiNoC system, filling the local memories of processors P1 and P2.

Afterwards, the user has to feed the remote memory with the 512 values to be sorted, by sending the files vector_part1.txt and vector_part2.txt, presented in Figure 27. The 512 values are divided into two files because message sizes are expressed in 8 bits (see Figure 7). The headers of these files express a write command (01) to the remote memory (11) of 256 words (FF) starting at the first memory position (0000H) for the first vector part, and continuing at position 0100H for the second part.

Figure 27: Data values to be written in the remote memory.
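A hypothetical generator for these two files, reusing the write_header() helper sketched in Section 5.2 (the word-count byte FF is copied from the text; the two-bytes-per-word output format is an assumption based on Figure 27):

#include <stdio.h>

extern void write_header(FILE *f, unsigned target_ip, unsigned n_words, unsigned start);

/* Emit one half of the input vector as a MultiNoC write-command file. */
void write_vector_part(const char *name, const unsigned short *v,
                       unsigned count, unsigned start_addr)
{
    FILE *f = fopen(name, "w");
    if (!f) return;
    write_header(f, 0x11, 0xFF, start_addr); /* remote memory has address 11 */
    for (unsigned i = 0; i < count; i++)     /* one 16-bit word per line     */
        fprintf(f, "%02X %02X\n", (v[i] >> 8) & 0xFF, v[i] & 0xFF);
    fclose(f);
}

/* write_vector_part("vector_part1.txt", vec,       256, 0x0000);
   write_vector_part("vector_part2.txt", vec + 256, 256, 0x0100); */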

After filling local and remote memories, the user must activate the processors by issuing “02 01” (activate P1 processor) and “02 10” (activate P2 processor).

To obtain the sorted vector, the user must write two read commands in the Serial Software, as presented in Figure 28. The first position of the sorted vector is written at address 0200H of the remote memory.

Figure 28: Reading the sorted vector.

Summarizing, the steps to execute the demo application are:

1) Download multinoc.bit into the prototyping platform and press the reset button;

2) Start the Serial software;

3) Send the following files/commands: sync.txt, codeP1_hw.txt, codeP2_hw.txt, vector_part1.txt, vector_part2.txt, activateP1P2.txt, result_part1.txt and result_part2.txt.

7 CONCLUSIONS AND FUTURE IDEAS
The MultiNoC system is in fact an exercise in implementing and making available a design platform on top of which applications can be effectively and rapidly prototyped. This follows a recently proposed design paradigm called platform-based design, as opposed to the existing paradigms of system-level design and component-based design [KEU00].

Future research with the MultiNoC system includes the development of a multiprocessor simulator. Such a tool is important for detecting distributed application errors and for synchronizing software running on different processors. Another important tool is a C compiler to automatically generate R8 assembly code, allowing faster software implementation. An operating system could also be implemented to execute different tasks concurrently.

Mapping the MultiNoC system onto a larger FPGA device would allow increasing the NoC dimensions and the number of IPs connected to it. Such a system can be composed of more instances of the presented pre-designed and pre-verified IP cores, adopting the concept of design reuse, or of newly implemented IP cores. Increasing the number of identical IPs enhances the degree of parallelism, while increasing the variety of IPs adds new functionality to the MultiNoC system.

One of the current research foci is partial and dynamic reconfiguration applied to the MultiNoC system. Partial and dynamic reconfiguration allows, for example, IP core positions to be modified at runtime, favoring IP communication with improved throughput. Reconfiguration can also be used to reduce system area through insertion and removal of IP cores on demand.

The MultiNoC system can be very useful for undergraduate and graduate students learning concepts in hardware description languages, distributed systems, parallel processing, hardware development environments, design prototyping, and the application of digital systems concepts in related disciplines. Additionally, the MultiNoC system has a low area overhead and can be prototyped on small devices, decreasing the cost of acquiring it in large numbers for academia.

8 REFERENCES

[BEN02] Benini, L.; De Micheli, G. “Networks on chips: a new SoC paradigm”. IEEE Computer, vol. 35(1), pp. 70-78, 2002.

[GAP04a] GAPH – Hardware Design Support Group. “R8 Processor – Architecture and Organization”. http://www.inf.pucrs.br/~gaph/Projects/R8/public/R8_arq_spec_eng.pdf.

[GAP04b] GAPH – Hardware Design Support Group. “R8 Simulator”. http://www.inf.pucrs.br/~gaph/homepage/download/software/simulatorR8.zip.

[GAP04c] GAPH – Hardware Design Support Group. “Serial Software”. http://www.inf.pucrs.br/~gaph/homepage/download/software/SerialSoftware.zip.

[GUE00] Guerrier, P.; Greiner, A. “A generic architecture for on-chip packet-switched interconnections”. In: DATE'00, pp. 250-256.

[HEN03] Henkel, J. “Closing the SoC Design Gap”. IEEE Computer, vol. 36(9), pp. 119-121, 2003.

[KEU00] Keutzer, K.; Newton, A.R.; Rabaey, J.M.; Sangiovanni-Vincentelli, A. “System-level design: orthogonalization of concerns and platform-based design”. IEEE Transactions on CAD of Integrated Circuits and Systems, vol. 19(12), pp. 1523-1543, 2000.

[KUM02] Kumar, S.; et al. “A Network on Chip Architecture and Design Methodology”. In: IEEE Computer Society Annual Symposium on VLSI (ISVLSI'02), Apr. 2002, pp. 105-112.

[MOH98] Mohapatra, P. “Wormhole routing techniques for directly connected multicomputer systems”. ACM Computing Surveys, vol. 30(3), Sep. 1998, pp. 374-410.

[RIP01] Rijpkema, E.; Goossens, K.; Wielage, P. “A Router Architecture for Networks on Silicon”. In: 2nd Workshop on Embedded Systems (PROGRESS'2001), Nov. 2001, pp. 181-188.

[XIL04] Xilinx, Inc. http://www.xilinx.com.
