chapter 4 gals architecture -...


CHAPTER 4

GALS ARCHITECTURE

The aim of this chapter is to implement an application on a GALS architecture. Synchronous and asynchronous implementations of an FFT design are compared. When the FFT design is implemented in a GALS architecture, the power consumption is reduced and the maximum frequency of operation increases.

4.1 CLOCK DISTRIBUTION

Distributing a high-frequency clock signal throughout the die is a problem, as skew becomes inevitable. In most processors, a phase-locked loop (PLL) generates a high-frequency clock signal from a slow external clock.

There are two approaches to distribute the clock:

H-tree distribution

Buffer driven clock distribution

A combination of a metal grid and a tree of buffers is used to distribute the clock throughout the chip. The metal grid provides an ideal path for low skew, as it provides direct interconnections between clock pins.

In most clock hierarchies, the global clock is sub-divided into major clocks and finally into local clocks. This approach helps to reduce the skew within the synchronous blocks themselves (Restle et al 2001). The basic distribution of the global clock into smaller independent local clocks is shown in Figure 4.1.


Figure 4.1 Global clock distribution

The H-Tree structure has low latency, less wiring and low power

consumption. Trees work well if the clock loading is uniform across the chip

area; unfortunately, most microprocessors have widely varying clock loads.

This is the main reason why the H-tree clock distribution scheme is used for asynchronous circuit design. The advantages of the H-tree structure are that the ideal 50% duty cycle is maintained, reflections are prevented, and rise and fall times are preserved. The load capacitance is also reduced, which results in a lower fan-out for our design.

In buffer-driven clock distribution, the buffer acts as a driver to the integrated circuit. Buffers are used to amplify the clock signal and to reduce the skew between the intermediate clock domains (Olsson et al 2000).

Using a PLL to generate the clock can pose a significant problem when the clock has to be pausable, so we use a ring oscillator to generate the clock and distribute it through the chip area. A PLL also produces more noise than a ring oscillator. A crystal or ring oscillator is therefore used for clock generation. The local clock generator is constructed from an inverter and a delay line, similar to an inverter ring oscillator (Nilisson et al 1996). The problem with using inverters alone as a delay line is that it is difficult to tune the clock period accurately, as process variations and changes in temperature affect the delay. Hence, accurate delay lines have been developed which are capable of maintaining a stable clock frequency. These use a global reference clock for calibration. They can use either standard cells or full-custom blocks for the tunable delay and have been shown to maintain a frequency within 1% of the chosen value. Alternatively, a Pierce crystal oscillator with a stable frequency can be used for clock generation.

4.2 GALS LOCAL CLOCK CIRCUITRY DESIGN

Local clock generation is carried out using on-chip clock generators, so the design is governed by multiple clocks. With multiple clocks, synchronization between them is a major problem, as this configuration is more prone to failures.

Globally asynchronous locally synchronous (GALS) systems

combine the benefits of synchronous and asynchronous systems (Chapiro et al

1984). Modules can be designed like modules in a globally synchronous

design (Muttersback et al 2000), using the same tools and methodologies.

Each block is independently clocked, which helps to alleviate clock skew.

Connections between the synchronous blocks are asynchronous. When data

enters a synchronous system from an asynchronous environment, registers at

the input are prone to metastability. To avoid this, the arrival of data is

indicated by an asynchronous handshaking protocol. When data arrives, the

locally generated clock is paused: in practice, the rising edge of the clock is

delayed. Once data has safely arrived, the clock can be released so data is

latched with zero probability of metastability on the data path. Early work on

GALS systems (Yen et al 1998) introduced clock stretching or pausing.

Synchronization can be done using two methods: stretchable clocking and pausable clocking. In the stretchable clocking scheme, the clock period is unknown during the stretch time needed to attain synchronization, whereas in the pausable clocking scheme the period after the pause is known and many synchronization failures can be averted. Owing to the inconsistency of the stretchable clock, we adopt the pausable clocking scheme for our synchronization methodology.

We have proposed a new asynchronous communication

methodology for the communication between any two synchronous blocks.

Methods to achieve asynchronous communication:

Handshake circuits

Muller C elements

Stretchable clock/pausable clock

4.2.1 Handshake circuits

The handshake circuits are used to generate signals in response to

their input signals (from environment or so called dummy signals) which in

combination form a specific protocol to implement asynchronous data

communication between two modules. The design of the handshake circuits

depends on the specific data communication protocols, the structures of

modules and the organization of systems. The asynchronous handshake is

shown in Figure 4.2. Two or four-phase protocols and various data encoding

schemes are used for asynchronous data communication.

The main disadvantage of the 2-phase handshake protocol is that its state after a handshake differs from the state before that handshake. A 4-phase handshake uses two more transitions to return to the initial state. This can be seen in Figures 4.3 and 4.4. In comparison with the 2-phase handshake protocol, a 4-phase protocol is slower and consumes more power, but in general its circuits are simpler and less expensive. Owing to the complexity of the control circuits for implementing 2-phase protocols and the large overhead of dual-rail data transfer for generating the completion signal, four-phase bundled-data communication is commonly employed in most fully asynchronous systems. Other types of handshaking are also possible; one proposed circuit has only a single wire, with active circuits at both ends to pull the wire up or down.

The protocols introduced above all assume that the sender is the active party that initiates the data transfer over the channel. This is known as a ‘push channel’. The opposite, in which the receiver asks for new data, is also possible and is called a ‘pull channel’. The directions of the request and acknowledge signals are then reversed, and the validity of the data is indicated by the acknowledge signal from the sender to the receiver.
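The difference between the two protocols can be illustrated with a small behavioural sketch. It is written in Python purely for illustration (it is not the circuit of Figures 4.3 and 4.4), and the function names and the signal names `req` and `ack` are assumptions. It lists the wire transitions for a number of transfers and shows that one 2-phase transfer leaves both wires high, whereas one 4-phase transfer returns them to zero at the cost of two extra transitions.

```python
def four_phase(transfers: int):
    """Transitions of a return-to-zero (4-phase) handshake as (signal, level) pairs."""
    events = []
    for _ in range(transfers):
        events += [("req", 1), ("ack", 1), ("req", 0), ("ack", 0)]
    return events


def two_phase(transfers: int):
    """Transitions of a 2-phase handshake: every edge, rising or falling, is an event."""
    events, req, ack = [], 0, 0
    for _ in range(transfers):
        req ^= 1                      # request toggles
        events.append(("req", req))
        ack ^= 1                      # acknowledge toggles in response
        events.append(("ack", ack))
    return events


for name, protocol in (("2-phase", two_phase), ("4-phase", four_phase)):
    events = protocol(1)
    final_levels = dict(events)       # last level seen on each wire
    print(name, len(events), "transitions, final levels:", final_levels)
```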

Figure 4.2 Asynchronous handshake


Figure 4.3 Two phase circuit

Figure 4.4 Four phase circuit


4.2.2 Muller C-element

The Muller C element is an important state holding component of

asynchronous circuits. When both inputs are 0, the output of the Muller C

element is set to 0 and when they are 1, the output changes to 1. Since

handshaking involves cyclic transitions between 0 and 1, it is clear that the

Muller C element is a fundamental component and is the AND function for

two events.

The fundamental circuit implementing four-phase handshaking

protocol between asynchronous blocks can be composed of Muller C

elements which are shown in Figure 4.5. The Muller C element truth table is

shown in Table 4.1. For a successful data transfer, the handshake circuits will

go through the following signal transitions viewed from outside of the

interface circuits: Rin+, Aout+, Rout+, Ain+, Rin-, Aout-, Rout- and Ain- by

which data communications in the whole system are consecutively coupled

with behaviors of subsequent modules.

Figure 4.5 Muller C element circuit diagram


Table 4.1 Truth table for Muller C element

A B Y

0 0 0

0 1 No change

1 0 No change

1 1 1
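The truth table can be expressed as a short behavioural model. The Python sketch below is only an illustration of the state-holding behaviour; the class name `MullerC` and the choice of a low initial output are assumptions, not part of the design described here.

```python
class MullerC:
    """Two-input Muller C element: the output follows the inputs when they
    agree and holds its previous value when they differ."""

    def __init__(self, initial: int = 0) -> None:
        self.y = initial              # assumed reset state of the output

    def update(self, a: int, b: int) -> int:
        if a == b:                    # rows "0 0" and "1 1" of the truth table
            self.y = a
        return self.y                 # rows "0 1" and "1 0": no change


c = MullerC()
trace = [c.update(a, b) for a, b in [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]]
print(trace)  # [0, 0, 1, 1, 0]
```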

4.2.3 Stretchable clock controller

The main motivation behind machines with stretchable clocks has been to avoid metastability problems. A stretchable clock can stretch a clock phase for an unbounded period of time, during which inputs and outputs become valid (Moore et al 2000). Stretchable clocks are therefore suitable for interaction with the globally asynchronous characteristics of a GALS system. The stretchable clock consists of a ring oscillator and a Muller C element, as shown in Figure 4.6. The Muller C element is used for the safety and reliability of the clock. If Stretch is not asserted (low), the output and inputs of the C element follow the signal transitions of Figure 4.7. If Stretch is asserted (high), the input Xa is set low; the output of the C element could be either low or high at that moment, but it will eventually be held low. Stretch+ thus postpones the next rising edge of lclk. An OR gate is used to combine multiple requests for stretching the clock.


Figure 4.6 Implementation of a stretchable clock

Figure 4.7 Stretchable clock signal transition graph

4.2.4 Pausable clock controller

To make the clock pausable, a mutual-exclusion (ME) element is added to the ring, as shown in Figure 4.8. It arbitrates between the rising edge of the clock and an incoming request (Yen et al 1996). Hence, the clock is prevented from rising while the input registers are being enabled by the request, and metastability is prevented. For each bundle of data, a port controller, a request signal and an ME element are required. The rising clock edge is permitted to occur only when all of the ME elements have been locked out by the clock.

Figure 4.8 Clock pausing scheme

A problem with clock pausing is that the clock is delayed as it is

distributed across the clock domain, but the clock must be paused at its source

(Yen et al 1999). When the clock releases the ME elements, there may still be

a clock edge in the buffer tree. Hence, it is possible that registers will be

enabled as the clock rises, as shown in Figure 4.9. However, while the source

clock is high, ME elements will remain locked so for this phase of the cycle,

no requests are permitted. For this reason, we must ensure that the delay of

the clock buffer is shorter than the duration of the high phase of the clock.

Limiting this delay limits the size of the clock tree, hence defining the size of

GALS blocks.
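The constraint stated above, that the clock buffer delay must be shorter than the high phase of the clock, can be written as a one-line check. The Python sketch below is illustrative only: the function name, the 50% duty cycle and the example tree delays are assumptions, and the 64 MHz figure is simply borrowed from the results reported later in this chapter.

```python
def clock_tree_delay_ok(period_ns: float, tree_delay_ns: float,
                        duty_cycle: float = 0.5) -> bool:
    """True if the buffer-tree delay is shorter than the high phase of the clock."""
    high_phase_ns = period_ns * duty_cycle
    return tree_delay_ns < high_phase_ns


period_ns = 1e3 / 64.0                      # 15.625 ns period for a 64 MHz local clock
print(clock_tree_delay_ok(period_ns, 5.0))  # True: 5 ns is below the 7.8125 ns high phase
print(clock_tree_delay_ok(period_ns, 9.0))  # False: the clock tree is too deep for this block
```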


Figure 4.9 Clock buffering problem

The standard components that can be placed around synchronous modules to provide the handshake signals and make them GALS modules are called asynchronous wrappers (Carlsson et al 2002). In GALS systems, interface circuits built from Muller C elements cannot be applied directly to data communication between two locally synchronous modules. The significant difference between GALS and fully asynchronous systems is that the activation of data input and output must be synchronized with the local clocks. Hence, for GALS systems to complete data transfers between the locally synchronous modules, it is necessary to use special ports (Shengxian et al 2002) that implement the handshaking and the stretching of the local clocks.

Such LS modules with an asynchronous wrapper could be

connected with ease (Hanck et al 1994). Port controllers are required to

generate and accept handshaking signals at the inputs and outputs of modules.

These port controllers are asynchronous state machines, which are similar to synchronous state machines except that they respond to changes on their inputs rather than to a clock. To simplify the design of the asynchronous

wrapper, but without loss of generality, we assume the handshake circuits in

GALS work in the following mode:


a) The request for data communication is always activated by a data output interface circuit, namely the W-port, which belongs to a master LS module. The data input interface circuit, namely the R-port, which belongs to a slave LS module, is always passive and accepts the data.

b) When the W-port activates data output, it may stop its internal clock and wait for the acknowledgement from the corresponding R-port. Likewise, when the R-port starts reading data, it must maintain its state until the W-port sends a request. This means that every activation of each port completes an effective data transmission.

c) Both the W-port and the R-port are independently enabled by

the internal requests from their own LS modules. This is the

case of data communications in many GALS systems.

4.3 IMPLEMENTATION OF FFT USING GALS TECHNOLOGY

FFT processors are used in a wide range of applications today. The FFT is not only a very important block in broadband systems, digital TV, etc., but is also used in areas such as radar, medical electronics and real-time systems (Torkelson et al 1996). The workload of FFT computations is high, and a better approach than a general-purpose processor is required to fulfil the requirements at a reasonable cost. For instance, application-specific processors, algorithm-specific processors or ASICs could be the solution to these problems. An ASIC is chosen here because of its lower power consumption and higher throughput.

In synchronous circuits, the clock signal switches in every clock cycle and drives a large capacitance. As a result, the clock signal is a major source of dynamic power dissipation. For example, studies have attributed up to 30% of the total power dissipation in a general-purpose processor to the clock network (Hemani et al 1999). Globally asynchronous locally synchronous circuits are those in which each block works synchronously internally but asynchronously with respect to the other blocks. Asynchronous circuits activate only the components that are necessary to perform a given operation, by means of local handshake protocols.

Asynchronous circuits have many advantages over synchronous

ones, like:

Performance of an asynchronous system depends on the

average case latency, not the worst case latency as in

synchronous circuits.

Global clock timing problems are avoided.

The power consumption can be lower in asynchronous

circuits, despite the fact that they require more hardware.

4.4 FFT ALGORITHMS

The mixed-radix algorithm is used in this implementation. Any composite number N can be split into a product of prime factors. In each stage, a single prime factor is taken into consideration. For a prime number of points, the transform is calculated directly as a DFT, so each stage involves the calculation of a DFT whose length is the prime factor, with the proper selection of input points. This step is followed by the proper twiddling of the outputs of that stage. The same procedure is repeated for the other prime factors, and the net result is the FFT of the N-point input. A reordering algorithm is used to obtain the FFT output in natural order: the order in which the FFT algorithm produces its outputs is known, and the outputs are rearranged accordingly.
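The decomposition described above can be sketched in a few lines of Python. This is a recursive decimation-in-time formulation written for illustration only; it is not the VHDL implementation of this chapter, and the function names are assumptions. Because of the recursive formulation, the outputs already appear in natural order, so the separate reordering step is not needed in this sketch. In each recombination, the factor exp(-2πi·r·k/N) combines the twiddle multiplication with the prime-factor-point DFT.

```python
import cmath


def prime_factors(n: int) -> list:
    """Prime factors of n in ascending order, e.g. 12 -> [2, 2, 3]."""
    factors, p = [], 2
    while p * p <= n:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    if n > 1:
        factors.append(n)
    return factors


def dft(x):
    """Direct DFT, used for prime lengths."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n))
            for k in range(n)]


def mixed_radix_fft(x):
    """Mixed-radix FFT: peel off one prime factor p of the length per level."""
    n = len(x)
    if n <= 1:
        return list(x)
    p = prime_factors(n)[0]
    if p == n:                       # prime length: fall back to the direct DFT
        return dft(x)
    m = n // p
    # p interleaved sub-sequences x[r], x[r+p], ... of length m each
    subs = [mixed_radix_fft(x[r::p]) for r in range(p)]
    # twiddle and combine with a p-point DFT for every output index k
    return [sum(subs[r][k % m] * cmath.exp(-2j * cmath.pi * r * k / n)
                for r in range(p))
            for k in range(n)]


print(prime_factors(12))  # [2, 2, 3]
```

For a 12-point input, for example, the recursion uses the factors 2, 2 and 3 in turn and ends in 3-point direct DFTs.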

4.5 FFT ARCHITECTURE

Several architectures are available for implementing the FFT; they are listed below:

Array architecture: used for short FFT computations; it requires extensive chip area.

Column architecture: obtained by collapsing all the columns of the array architecture into one column.

Pipelined architecture: useful for FFTs that require high data throughput.

Reconfigurable architecture: suited to FPGA implementation; it requires less area but gives lower data throughput.

New FFT architecture: GALS implementation to reduce power consumption.

The array architecture can only be used for very short FFTs,

because of the extensive use of chip-area. The column architecture uses an

approach that requires less area on the chip than the array architecture. The

architecture is still not small enough to be taken into account for long FFTs.

The advantages of the pipelined architecture are high data throughput, relatively small area and a simple control unit. If the design uses synchronous operation, delays have to be added between each radix-2² stage and between the two butterfly elements inside the radix-2² stage. This results in the need for more hardware and an increase in the latency between input and output frames. The latency is not a big problem, but the extra hardware increases the die size and the power consumption. The reconfigurable architecture can be implemented on an FPGA (since it supports dynamic reconfiguration); its data throughput is lower, but its area is smaller than that of the other architectures.

The above problem could be solved in two ways, either changing

the control unit or creating a system consisting of locally synchronous blocks

communicating asynchronously (GALS) (Jonas et al 2003). The first choice

of keeping the FFT completely synchronous would increase the complexity of

the control unit, resulting in a system that is harder to understand. The second

choice would only have a slightly different control structure, but very similar

to the original one. The blocks would also be more separated from each other

functionally, which could be a good property when improving the design later

in the future. These pros and cons lead to the implementation of the FFT

processor as a GALS-system.

4.6 GALS IMPLEMENTATION

The implementation of the FFT using the GALS architecture consists of three basic building blocks:

Input block

Central block

Output block


4.6.1 Features of architecture

The implemented processor is a 16-bit fixed-point processor (8 bits for the integer part and 8 bits for the fractional part) for both the real and imaginary parts, to enable high range and resolution.

The resolution of this processor is thus $2^{-8} = 0.00390625$.

This FFT processor can handle from 2-point FFT to 32-point

FFT.

The processor makes use of a single multiplier, which is implemented using the modified Booth’s algorithm.

Only one block is active at a time in the processor, thus

reducing power consumption.

The input block, after getting the inputs, sends a request signal

to the central block and places the data in the bus which goes

to central block.

The same process is followed in between the central block and

the output block.

To enable the features of asynchronous circuits, each block

has dedicated request and acknowledge pins.

The sine and cosine functions are implemented in this module.

Hence, twiddle factors required are generated using these

functions.
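The number format and the twiddle-factor generation listed above can be illustrated as follows. This is a hedged Python sketch, not the VHDL sine and cosine components of the design: the helper names, the use of a signed two's-complement word, rounding to nearest and saturation on overflow are all assumptions made for the example.

```python
import math

FRAC_BITS = 8                 # 8 fractional bits, as stated in the feature list
SCALE = 1 << FRAC_BITS        # 256, so one least-significant bit is 2**-8


def to_q8_8(value: float) -> int:
    """Quantise to a 16-bit word with 8 fractional bits (assumed signed, saturating)."""
    raw = round(value * SCALE)
    return max(-32768, min(32767, raw))


def from_q8_8(word: int) -> float:
    """Convert a quantised word back to a real number."""
    return word / SCALE


def twiddle_q8_8(k: int, n: int):
    """Real and imaginary parts of exp(-2*pi*i*k/n) in the 8.8 format."""
    angle = 2.0 * math.pi * k / n
    return to_q8_8(math.cos(angle)), to_q8_8(-math.sin(angle))


print(from_q8_8(1))        # 0.00390625, the resolution quoted above
print(twiddle_q8_8(1, 8))  # (181, -181), i.e. about 0.707 - 0.707j
```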


4.6.2 Input block

The inputs to the processor are provided through this block. The input to the FFT processor has 32 points. If all 32 points were to be provided as input simultaneously, the processor would need 32 × 16 × 2 = 1024 pins for the data bits alone, which is not practically feasible. The input block therefore has the following pins: 16-bit data lines, 5-bit selection lines, 5-bit selection1 lines, Global Enable and Local Enable pins (active high), Clock and Start pins, 64 × 16 output lines, and request and acknowledge handshaking signals to communicate with the next block. The internal structure of the input block is shown in Figure 4.10.

Figure 4.10 Internal structure of input block


The Global Enable pin should be held high throughout the FFT computation. The Local Enable pin is driven high only while inputs are being provided to the processor. The Start signal is given after all the inputs have been fed to the processor for FFT computation. The input block then sends a request signal and places the 64 × 32 bits of data at the output pins. After receiving the acknowledge signal from the central block, the input block turns the request signal low.

4.6.3 Central block

This block, also called the FFT computational block, is the heart of the processor. It receives its input from the input block and delivers the result to the output block. It is enabled when the input block sends a request handshaking signal. Following the completion of the computations, it sends a request signal to the output block. The inputs to this block are 64 × 16 data lines, Global Enable, clock, select1 lines and the handshaking signal pairs (reqa, acka) and (reqc, ackc). The block comprises three sets of 64 32-bit registers and generates the twiddle factors using dedicated sine and cosine function generation components.

The central block’s main operation is to select the inputs from the registers and generate the corresponding twiddle factors, using the sine and cosine generating components, for computing the prime-factor-point DFT. This operation is performed using three nested loops to reduce the size of the program. The first loop specifies the stage of the FFT computation. The second loop specifies the set within the stage specified by the first loop. The third loop specifies the point to be selected within the specified set.
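The stage, set and point loop structure can be sketched as follows. To keep the example short, this Python sketch assumes a power-of-two length and radix-2 butterflies with a bit-reversed input order, which is a simplification of the mixed-radix scheme used in the processor; the function names are hypothetical.

```python
import cmath


def bit_reverse_permute(x):
    """Reorder the input so that an in-place radix-2 FFT yields natural-order output."""
    n = len(x)
    bits = n.bit_length() - 1
    out = [0j] * n
    for i in range(n):
        rev = int(format(i, f"0{bits}b")[::-1], 2) if bits else 0
        out[rev] = x[i]
    return out


def fft_three_loops(x):
    """Radix-2 FFT organised as three nested loops: stage, set, point."""
    n = len(x)
    assert n and n & (n - 1) == 0, "this sketch handles power-of-two lengths only"
    a = bit_reverse_permute(x)
    size = 2
    while size <= n:                        # loop 1: stage of the computation
        half = size // 2
        for start in range(0, n, size):     # loop 2: set within the stage
            for j in range(half):           # loop 3: point within the set
                w = cmath.exp(-2j * cmath.pi * j / size)   # twiddle factor
                u, v = a[start + j], w * a[start + j + half]
                a[start + j], a[start + j + half] = u + v, u - v
        size *= 2
    return a


print(fft_three_loops([1, 0, 0, 0, 0, 0, 0, 0]))  # an impulse transforms to eight ones
```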


This FFT computation block then enables the output block by

sending a request signal. The output block samples the data produced by the

computational block after giving the acknowledge signal.

4.6.4 Output block

The positive edge of the acknowledge pulse is the instant at which the data from the central block are sampled and stored in the temporary pages. This block is used to display the output. It receives its input from the central block and consists of input lines (64 × 16), output lines (1 × 16), global enable, select lines (5), select1 lines (5) and a clock signal. In both the real and imaginary parts, 8 bits are allocated for the integer part and 8 bits for the fractional part. The internal structure of the output block is shown in Figure 4.11.

Figure 4.11 Internal structure of output block


4.7 IMPLEMENTATION OF GALS

The implementation makes use of the GALS technology to interface the input, central and output blocks. The input and output blocks run at the same clock speed, while the central block runs on a different clock, in keeping with the GALS approach. No global clock is used in asynchronous circuits; instead, some form of handshaking is used in the communication between systems. GALS systems are not completely asynchronous; they consist of synchronous sub-systems communicating asynchronously. In Figure 4.12, the LS-system is a locally synchronous system, Req is short for request and Ack is short for acknowledge. Req and Ack perform the handshaking. In this project, a push communication channel is used, which means that the producer of the data initiates the handshaking.

A 4-phase handshaking cycle is performed in the following way

(Req+ means Req goes high): Req+, Ack+, Req- and finally Ack-. Data should

be valid between Req+ and Ack-, but is often sampled on the Ack+ edge.

Figure 4.12 GALS asynchronous communication

The GALS technology is implemented in the three blocks of the

FFT processor as follows: The data transfer is always along a single direction

i.e. the data always flows from the input block to the central block and then


from the central block to the output block. So the input block is designed with

a single pair of request and acknowledge signals. The central block has to

communicate with input block as well as the output block and the data

transfer is only one way. Hence, the central block is designed with two sets of

handshaking signals (request and acknowledge signals). The output block

communicates only with the central block and the data transfer is only one

way and for this reason, output block is designed with a single pair of the

handshaking signals. The GALS FFT processor and variable point FFT

architecture are shown in Figures 4.13 and 4.14 respectively.

Figure 4.13 GALS FFT processor implementation

Figure 4.14 Variable point FFT architecture


The input block and the central block communicate through the

reqa (request) and the acka (acknowledge) signals. The reqa signal is raised

by the input block and the central block responds by raising the acka signal

and sampling the data from the data bus. The input block then turns the reqa

signal low and the central block follows this by turning the acka signal low. This

same process applies to the communication between the central block and the

output block. The signals involved in this asynchronous communication are

reqc and ackc handshaking pair of signals.
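The order of events on the two handshaking pairs can be traced with a small behavioural sketch. It is written in Python for illustration and is not the VHDL of the processor; the function name is hypothetical, and the doubling step is only a placeholder standing in for the FFT computation of the central block.

```python
def four_phase_transfer(req_name, ack_name, data, log):
    """Push-channel transfer: the sender raises req, the receiver samples on ack+."""
    log.append(f"{req_name}+")   # sender places the data on the bus and raises request
    sampled = list(data)         # receiver samples the data bus ...
    log.append(f"{ack_name}+")   # ... and raises acknowledge
    log.append(f"{req_name}-")   # sender turns the request low
    log.append(f"{ack_name}-")   # receiver turns the acknowledge low
    return sampled


log = []
samples = [1.0, 2.0, 3.0, 4.0]                       # data held by the input block
at_central = four_phase_transfer("reqa", "acka", samples, log)
result = [2 * v for v in at_central]                 # placeholder for the FFT computation
at_output = four_phase_transfer("reqc", "ackc", result, log)
print(log)
```

The printed log lists the reqa/acka cycle followed by the reqc/ackc cycle, which matches the one-way flow from the input block through the central block to the output block.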

4.8 SIMULATION RESULTS

All the design blocks are implemented in VHDL, and the simulation result is shown in Figure 4.15.

Figure 4.15 Output waveform of GALS implemented FFT processor


4.9 SYNTHESIZED REPORT

The FFT processor is synthesized for 8-points. Since this is a

variable point processor, the number of loops i.e., the corresponding number

of operations varies with the number of points. The synthesized 8-point FFT

result is shown in Figure 4.18. The tool used for synthesizing is Leonardo

Spectrum.

The report obtained by synthesizing the 8-point FFT processor, taken as a snapshot from the system monitor, is shown in Figure 4.16.

Figure 4.16 Synthesized report for 8-point FFT processor


This synthesis report was obtained before the implementation of GALS in the processor. According to the report, the LC utilization is 5010, which for the Altera APEX20KE technology corresponds to 60.22%. The clock frequency report generated by Leonardo Spectrum, also obtained as a snapshot, is shown in Table 4.2.

Table 4.2 Clock frequency report

The clock frequency report gives a clear picture of the frequency of operation of the blocks. The input block and the output block can operate at a maximum frequency of 64.0 MHz. The central block can operate at a maximum frequency of 13.7 MHz; since it involves numerous computations, it operates slowly with respect to the others. When the time taken to compute the FFT for a set of eight points is considered, the frequency of operation is determined by the input and output blocks, since eight clock cycles of a 64 MHz clock take longer than a single clock period of a 13.7 MHz clock.
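As a quick check of this comparison, using only the two maximum frequencies quoted above:

```latex
8 \times \frac{1}{64\ \text{MHz}} = 125\ \text{ns}
\;>\;
\frac{1}{13.7\ \text{MHz}} \approx 73.0\ \text{ns}
```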


The RTL schematic diagram obtained using the Leonardo Spectrum

tool is shown in Figure 4.17.

Figure 4.17 RTL schematic of 8-point FFT processor


The technology schematic diagram is shown in Figure 4.18.

Figure 4.18 Technology schematic of 8-point FFT processor

4.10 CONCLUSIONS

The various modules are implemented using GALS technology, and each module runs at its own frequency. If the modules operated synchronously, the critical path would be longer, since it would have to account for the input block delay, the central block delay and the output block delay (total delay = 1/64 MHz + 1/64 MHz + 1/13.7 MHz ≈ 104.2 ns, equivalent to about 9.6 MHz), and the throughput would be reduced.

If the whole design is operated at 64 MHz, the central block takes a few clock cycles to finish its operation, and the delay of the whole block is 15.625 ns. If the FFT design uses the GALS approach, the power required (P ∝ f, i.e. P1 ∝ 64 and P2 ∝ 13.7) is 13.7/64 ≈ 21% of that of the synchronous operation. The speed of each block is shown in Table 4.3.

Table 4.3 Speed of operation of various blocks

Sl. No.   Description                             Delay (speed of operation)
1         Input block – GALS implementation       64 MHz
2         Output block – GALS implementation      64 MHz
3         Central block – GALS implementation     13.7 MHz
4         Synchronous implementation              9.6 MHz
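The figures in Table 4.3 and the power estimate can be reproduced with a few lines of arithmetic. The Python sketch below uses only the two reported maximum frequencies; the variable names are assumptions, and power is taken to be simply proportional to frequency, as in the text. Summing the three clock periods gives about 104.2 ns, which corresponds to the 9.6 MHz listed for the synchronous implementation.

```python
F_IO_MHZ = 64.0         # input and output blocks
F_CENTRAL_MHZ = 13.7    # central (FFT computation) block


def period_ns(f_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1e3 / f_mhz


# Synchronous case: the critical path spans input, central and output block delays.
total_delay_ns = 2 * period_ns(F_IO_MHZ) + period_ns(F_CENTRAL_MHZ)
equivalent_mhz = 1e3 / total_delay_ns

# Power proportional to frequency: central block at 13.7 MHz instead of 64 MHz.
power_ratio = F_CENTRAL_MHZ / F_IO_MHZ

print(f"{total_delay_ns:.1f} ns, {equivalent_mhz:.1f} MHz, {power_ratio:.0%}")
# 104.2 ns, 9.6 MHz, 21%
```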

Another advantage is that global clock skew is avoided, since the design is divided into three blocks and each block operates on its own clock at a different frequency.
