CHAPTER 4
GALS ARCHITECTURE
The aim of this chapter is to implement an application on a GALS
architecture. Synchronous and asynchronous implementations of an FFT
design are compared. The power consumption is reduced and the
maximum frequency of operation increases when the FFT design is
implemented in a GALS architecture.
4.1 CLOCK DISTRIBUTION
Distributing a high-frequency clock signal throughout the die is a
problem, as skew becomes inevitable. In most processors, a PLL (phase-locked
loop) generates a high-frequency clock signal from a slow external clock.
There are two approaches to distribute the clock:
H - Tree distribution
Buffer driven clock distribution
A combination of a metal grid and a tree of buffers is used to
distribute the clock throughout the chip. The metal grid provides an ideal
path for low skew, as it provides direct interconnections between clock pins.
In most clock hierarchies, there is a transition from global
clocks, which are sub-divided into major clocks, and finally to local clocks.
This approach helps in reducing the skew within the synchronous
blocks themselves (Restle et al 2001). The basic distribution of a global clock into
smaller independent local clocks is shown in Figure 4.1.
Figure 4.1 Global clock distribution
The H-tree structure has low latency, less wiring and low power
consumption. Trees work well if the clock loading is uniform across the chip
area; unfortunately, most microprocessors have widely varying clock loads.
This is the main reason why the H-tree clock distribution scheme is used for
asynchronous circuit design. The advantages of the H-tree structure are
that a duty cycle close to the ideal 50% is maintained, reflections are
prevented, and rise and fall times are preserved. The load capacitance is
also reduced, which results in a lower fan-out for our design.
In buffer-driven clock distribution, the buffer acts as a driver to the
integrated circuit. Buffers are used to amplify the clock signal and to reduce
the skew between the intermediate clock domains (Olsson et al 2000).
Using a PLL to generate the clock might pose a significant problem
when it comes to a pausable clock, so a ring oscillator is used to generate
the clock and distribute it through the chip area. A PLL also produces more
noise than a ring oscillator. A crystal or ring oscillator is therefore used
for clock generation. The local clock generator is constructed from an
inverter and a delay line, similar to an inverter ring oscillator (Nilisson
et al 1996). The problem with using inverters alone as a delay line is that
it is difficult to tune the clock period accurately, as process variations
and changes in temperature affect the delay. Hence, accurate delay lines
have been developed which are capable of maintaining a stable clock
frequency. These use a global reference clock for calibration. They can use
either standard cells or full-custom blocks for the tunable delay and have
been shown to maintain a frequency within 1% of the chosen value.
Alternatively, a Pierce crystal oscillator with a stable frequency can be
used for clock generation.
4.2 GALS LOCAL CLOCK CIRCUITRY DESIGN
Local clock generation is carried out using on-chip clock
generators, so the design is governed by multiple clocks. With
multiple clocks, synchronization between them is a major problem, as
this configuration is more prone to failures.
Globally asynchronous locally synchronous (GALS) systems
combine the benefits of synchronous and asynchronous systems (Chapiro et al
1984). Modules can be designed like modules in a globally synchronous
design (Muttersback et al 2000), using the same tools and methodologies.
Each block is independently clocked, which helps to alleviate clock skew.
Connections between the synchronous blocks are asynchronous. When data
enters a synchronous system from an asynchronous environment, registers at
the input are prone to metastability. To avoid this, the arrival of data is
indicated by an asynchronous handshaking protocol. When data arrives, the
locally generated clock is paused: in practice, the rising edge of the clock is
delayed. Once data has safely arrived, the clock can be released so data is
latched with zero probability of metastability on the data path. Early work on
GALS systems (Yen et al 1998) introduced clock stretching or pausing.
Synchronization is done using one of two methods: the stretchable
clock or the pausable clock. In the stretchable clocking scheme, the clock
period is unknown during the stretch time needed to attain
synchronization, but in the pausable clocking scheme, the period after the
pause is known and many synchronization failures can be averted. Due to
the inconsistency of the stretchable clock, we adopt the
pausable clocking scheme for our synchronization methodology.
We have proposed a new asynchronous communication
methodology for the communication between any two synchronous blocks.
Methods to achieve asynchronous communication:
Handshake circuits (clocking versus handshaking)
Muller C elements
Stretchable clock/pausable clock
4.2.1 Handshake circuits
Handshake circuits generate signals in response to
their input signals (from the environment, or so-called dummy signals), which in
combination form a specific protocol for asynchronous data
communication between two modules. The design of the handshake circuits
depends on the specific data communication protocol, the structure of the
modules and the organization of the system. The asynchronous handshake is
shown in Figure 4.2. Two-phase or four-phase protocols and various data
encoding schemes are used for asynchronous data communication.
The main disadvantage of the 2-phase handshake protocol is that the
state after a handshake differs from the state before it. A 4-phase protocol
uses two more transitions to return to the initial state. This can be seen in
Figures 4.3 and 4.4. In comparison with the 2-phase handshake protocol, a
4-phase protocol is slower and consumes more power, but its circuits are in
general simpler and less expensive. Due to the complexity of the control
circuits for implementing 2-phase protocols and the large overhead of
dual-rail data transfer for generating the completion signal, four-phase
bundled-data communication is commonly employed in most fully asynchronous
systems. Other types of handshaking are also possible; one proposed scheme
uses only one wire, with active circuits at both ends to pull the wire up or
down.
The protocols introduced above all assume that the sender is the
active party that initiates the data transfer over the channel. This is known as a
‘push channel’. The opposite, in which the receiver asks for new data, is also
possible and is called a ‘pull channel’. The directions of the request and
acknowledge signals are then reversed, and the validity of the data is
indicated by the acknowledge signal from the sender to the receiver.
Figure 4.2 Asynchronous handshake
Figure 4.3 Two phase circuit
Figure 4.4 Four phase circuit
4.2.2 Muller C-element
The Muller C element is an important state-holding component of
asynchronous circuits. When both inputs are 0, the output of the Muller C
element is set to 0, and when both are 1, the output changes to 1; for any
other input combination the output holds its previous value. Since
handshaking involves cyclic transitions between 0 and 1, the
Muller C element is a fundamental component: it acts as the AND function for
two events.
The fundamental circuit implementing four-phase handshaking
protocol between asynchronous blocks can be composed of Muller C
elements which are shown in Figure 4.5. The Muller C element truth table is
shown in Table 4.1. For a successful data transfer, the handshake circuits will
go through the following signal transitions viewed from outside of the
interface circuits: Rin+, Aout+, Rout+, Ain+, Rin-, Aout-, Rout- and Ain- by
which data communications in the whole system are consecutively coupled
with behaviors of subsequent modules.
Figure 4.5 Muller C element circuit diagram
Table 4.1 Truth table for Muller C element
A B Y
0 0 0
0 1 No change
1 0 No change
1 1 1
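The behaviour in Table 4.1 can be illustrated with a small behavioural model. This is only a sketch in Python, not the circuit of Figure 4.5; the class name `MullerC` and the method `update` are hypothetical names chosen for illustration.

```python
class MullerC:
    """Behavioural model of a two-input Muller C element (illustrative only)."""

    def __init__(self, initial=0):
        self.y = initial  # state held by the element

    def update(self, a, b):
        # The output follows the inputs only when they agree;
        # for (0, 1) and (1, 0) the previous output is held ("No change").
        if a == b:
            self.y = a
        return self.y


c = MullerC()
inputs = [(0, 0), (0, 1), (1, 1), (1, 0), (0, 0)]
trace = [c.update(a, b) for a, b in inputs]
print(trace)  # [0, 0, 1, 1, 0]
```

The output rises only after both inputs have risen and falls only after both have fallen, which is why the element can be read as an AND function for two events.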
4.2.3 Stretchable clock controller
The main motivation behind machines with stretchable clocks has
been to avoid metastability problems. A stretchable clock can stretch a clock
phase for an unbounded period of time; in the meantime, inputs and outputs
become valid (Moore et al 2000). Stretchable clocks are therefore suitable
for interaction with the globally asynchronous characteristics of a GALS
system. The stretchable clock consists of a ring oscillator and a Muller C
element, as shown in Figure 4.6. The Muller C element is used for the safety
and reliability of the clock. If Stretch is not asserted (low), the output
and inputs of the C element follow the signal transitions of Figure 4.7. If
Stretch is asserted (high), the input Xa is set low; the output of the C
element could at that moment be either low or high, but it will eventually
be held at a low level. Stretch+ thus postpones the next rising edge of
lclk. An OR gate is used to combine multiple requests for stretching the
clock.
Figure 4.6 Implementation of a stretchable clock
Figure 4.7 Stretchable clock signal transition graph
4.2.4 Pausable clock controller
To make the clock pausable, a mutual exclusion (ME) element is added to the
ring, as shown in Figure 4.8. This arbitrates between the rising edge of the
clock and an incoming request (Yen et al 1996). Hence, the clock is prevented
from rising while the input registers are being enabled by the request, and
metastability is prevented. For each bundle of data, a port controller, a
request signal and an ME element are required. The rising clock edge is
permitted to occur only when all of the ME elements have been locked out
by the clock.
Figure 4.8 Clock pausing scheme
A problem with clock pausing is that the clock is delayed as it is
distributed across the clock domain, but the clock must be paused at its source
(Yen et al 1999). When the clock releases the ME elements, there may still be
a clock edge in the buffer tree. Hence, it is possible that registers will be
enabled as the clock rises, as shown in Figure 4.9. However, while the source
clock is high, ME elements will remain locked so for this phase of the cycle,
no requests are permitted. For this reason, we must ensure that the delay of
the clock buffer is shorter than the duration of the high phase of the clock.
Limiting this delay limits the size of the clock tree, hence defining the size of
GALS blocks.
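The constraint stated above can be written compactly. The symbols are assumptions introduced here for illustration: $t_{\text{buf}}$ is the delay of the clock buffer tree, $T_{\text{clk}}$ the local clock period and $D$ the duty cycle of the source clock.

```latex
t_{\text{buf}} \;<\; t_{\text{high}} \;=\; D \cdot T_{\text{clk}}
```

As a purely illustrative example, a 64 MHz local clock with an assumed 50% duty cycle has $T_{\text{clk}} = 15.625$ ns and $t_{\text{high}} = 7.8125$ ns, so the buffer tree of that block would have to introduce less than about 7.8 ns of delay.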
Figure 4.9 Clock buffering problem
The standard components that can be placed around synchronous
modules to provide the handshake signals and make them GALS modules are
called asynchronous wrappers (Carlsson et al 2002). In GALS systems, an
interface circuit built from Muller C elements cannot be applied directly to
data communication between two locally synchronous (LS) modules, because,
unlike in fully asynchronous systems, the activation of data input and
output must be synchronized with the local clocks. Hence, for GALS systems
to complete data transfers between the locally synchronous modules, it is
necessary to use special ports (Shengxian et al 2002) to implement the
handshaking and the stretching of local clocks.
LS modules with such an asynchronous wrapper can be
connected with ease (Hanck et al 1994). Port controllers are required to
generate and accept handshaking signals at the inputs and outputs of modules.
These port controllers are asynchronous state machines, which respond to
changes on their inputs rather than to a clock. To simplify the design of the
asynchronous wrapper, but without loss of generality, we assume that the
handshake circuits in GALS work in the following mode:
a) The request for data communication is always activated by a
data output interface circuit, namely the W-port, which is
attached to a master LS module. The data input interface
circuit, namely the R-port, which is attached to a slave LS
module, is always passive and accepts the data.
b) When the W-port activates data output, it might stop its
internal clock and wait for the acknowledgement from the
corresponding R-port. Likewise, when the R-port initiates
reading data, it must maintain its state until the W-port
sends a request. This means that every activation of a port
completes an effective data transmission.
c) Both the W-port and the R-port are independently enabled by
the internal requests from their own LS modules. This is the
case for data communication in many GALS systems.
4.3 IMPLEMENTATION OF FFT USING GALS TECHNOLOGY
FFT processors are involved in a wide range of applications today.
They are not only a very important block in broadband systems, digital TV,
etc., but are also used in areas such as radar, medical electronics and
real-time systems (Torkelson et al 1996). The workload for FFT computations
is high, and a better approach than a general-purpose processor is required
to fulfill the requirements at a reasonable cost. Application-specific
processors, algorithm-specific processors or ASICs could be the solution to
these problems. An ASIC is the choice here because of its lower power
consumption and higher throughput.
In synchronous circuits, the clock signal switches at every clock
cycle and drives a large capacitance. As a result, the clock signal is a major
source of dynamic power dissipation. For example, studies have attributed
up to 30% of the total power dissipation in a general-purpose processor to
the clock network (Hemani et al 1999). Globally asynchronous locally
synchronous circuits are circuits in which the different blocks work
asynchronously with respect to one another but synchronously internally.
Asynchronous circuits activate only the components that are necessary to
perform the given operation, by the use of local handshake protocols.
Asynchronous circuits have many advantages over synchronous
ones, like:
Performance of an asynchronous system depends on the
average case latency, not the worst case latency as in
synchronous circuits.
Global clock timing problems are avoided.
The power consumption can be lower in asynchronous
circuits, despite the fact that they require more hardware.
4.4 FFT ALGORITHMS
The mixed radix algorithm is used in this implementation. Any
composite number N can be split into a product of prime factors. In each
stage, a single prime factor is taken into consideration. In this
implementation, the FFT for a prime number of points is calculated directly
through the DFT, so each stage involves calculation of the
prime-factor-point DFT with the proper selection of points. This step is
followed by proper twiddling of the output of the stage. The same procedure
is repeated for the other prime factors. The net result of this process is
the FFT of the N-point input. Because the order in which the FFT algorithm
produces its output is known, a reordering algorithm is then used to
rearrange the output into natural order.
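One stage of this procedure can be written out as a worked equation. The notation is assumed here for illustration and is not taken from the original design: $N = p \cdot m$ with $p$ the prime factor handled in the stage, $W_N = e^{-2\pi i/N}$, input index $n = j + mq$ and output index $k = r + pk'$. A decimation-in-frequency form is assumed, which matches the order of steps described above (prime-point DFT, twiddling, then reordering).

```latex
X[r + p\,k'] \;=\; \sum_{j=0}^{m-1}
  \Bigl[\, W_N^{\,jr} \sum_{q=0}^{p-1} x[j + mq]\; W_p^{\,qr} \Bigr]\, W_m^{\,jk'},
\qquad 0 \le r < p,\quad 0 \le k' < m
```

The inner sum is the $p$-point DFT of the selected points, the factor $W_N^{jr}$ is the twiddle applied to the stage output, and the outer sum is an $m$-point DFT that is decomposed in the same way using the remaining prime factors. Because output $r + pk'$ comes from sub-sequence $r$, the results emerge in a digit-reversed order, which is why a final reordering step is needed.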
4.5 FFT ARCHITECTURE
There are various architectures available to implement the FFT, and they
are listed below:
Array architecture: Used for short FFT computations; it
requires extensive chip area.
Column architecture: Obtained by collapsing all the columns
of the array architecture into one column.
Pipelined architecture: Useful for FFTs that require high
data throughput.
Reconfigurable architecture: Useful for FFTs implemented on
an FPGA; the implementation requires less area but gives lower
data throughput.
New FFT architecture: GALS implementation to reduce
power consumption.
The array architecture can only be used for very short FFTs,
because of its extensive use of chip area. The column architecture uses an
approach that requires less area on the chip than the array architecture, but
it is still not small enough to be considered for long FFTs.
The advantages of the pipelined architecture are high data throughput,
relatively small area and a simple control unit. If the design uses
synchronous operation, delays have to be added between each radix-$2^2$
stage and between the two butterfly elements inside the radix-$2^2$ stage.
This results in the need for more hardware and an increase in the latency
between input and output frames. The latency is not a big problem, but the
extra hardware increases the die size and the power consumption. The
reconfigurable architecture can be implemented using an FPGA (since it
supports dynamic reconfiguration); its data throughput is lower, but its
area is smaller than that of the other architectures.
The above problem could be solved in two ways: either by changing
the control unit or by creating a system consisting of locally synchronous
blocks communicating asynchronously (GALS) (Jonas et al 2003). The first
choice, keeping the FFT completely synchronous, would increase the
complexity of the control unit, resulting in a system that is harder to
understand. The second choice would have only a slightly different control
structure, very similar to the original one. The blocks would also be more
separated from each other functionally, which could be a good property when
improving the design later. These pros and cons lead to the implementation
of the FFT processor as a GALS system.
4.6 GALS IMPLEMENTATION
The implementation of the FFT using the GALS architecture consists of three
basic building blocks:
Input block
Central block
Output block
4.6.1 Features of architecture
The implemented processor is a 16-bit fixed-point processor
(8 bits for the integer part and 8 bits for the fractional part) for
the real and imaginary parts, to enable high range and resolution.
The resolution of this processor is thus $2^{-8}$ = 0.00390625.
This FFT processor can handle from 2-point FFT to 32-point
FFT.
The processor makes use of a single multiplier, which is
implemented using the modified Booth’s algorithm.
Only one block is active at a time in the processor, thus
reducing power consumption.
The input block, after getting the inputs, sends a request signal
to the central block and places the data in the bus which goes
to central block.
The same process is followed in between the central block and
the output block.
To enable the features of asynchronous circuits, each block
has dedicated request and acknowledge pins.
The sine and cosine functions are implemented in this module.
Hence, twiddle factors required are generated using these
functions.
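The 8.8 number format and the generation of twiddle factors from sine and cosine can be illustrated with a short sketch. This is a software illustration in Python, not the processor's hardware; the function names are hypothetical, and a signed two's-complement representation with saturation is an assumption that the text does not state.

```python
import math

FRAC_BITS = 8            # 8 fractional bits, as in the feature list
SCALE = 1 << FRAC_BITS   # 256 steps per unit


def to_q8_8(value):
    """Quantize a real number to a 16-bit value with 8 fractional bits.

    Assumption: signed two's complement with saturation at the range limits.
    """
    raw = round(value * SCALE)
    return max(-32768, min(32767, raw))


def from_q8_8(raw):
    """Convert the integer representation back to a real number."""
    return raw / SCALE


def twiddle_q8_8(k, n):
    """Real and imaginary parts of exp(-2*pi*i*k/n), quantized to 8.8."""
    angle = -2.0 * math.pi * k / n
    return to_q8_8(math.cos(angle)), to_q8_8(math.sin(angle))


resolution = from_q8_8(1)      # smallest representable step
re, im = twiddle_q8_8(1, 8)    # twiddle factor W_8^1
print(resolution, re, im)
```

One least-significant bit corresponds to 1/256 = 0.00390625, the resolution quoted in the feature list, so each twiddle factor is stored with a quantization error of at most half that step.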
4.6.2 Input block
The inputs to the processor are provided through this block. The
input to the FFT processor is 32 points. If all 32 points were to be provided
as input simultaneously, the processor would need 32 × 16 × 2 = 1024 pins for
data bits alone, which is not practically feasible. The input block consists of
the following pins: 16-bit data lines, 5-bit selection lines, 5-bit selection1
lines, Global Enable and Local Enable pins (active high), Clock and Start
pins, 64 × 16 output lines, and request and acknowledge handshaking signals to
communicate with the next block. The internal structure of the input block is
shown in Figure 4.10.
Figure 4.10 Internal structure of input block
The Global Enable pin should be held high throughout the process of
FFT computation. The Local Enable pin is held high only when providing inputs
to the processor. The start signal is given after feeding all the inputs to the
processor for FFT computation. The input block then sends a request signal
and places the data of 64 × 32 bits at the output pins. After receiving the
acknowledge signal from the central block, the input block turns the request
signal low.
4.6.3 Central block
This block, also called the FFT computational block, is the heart of
the processor. This block receives its input from the input block and outputs
the result to the output block. It is enabled by the input block, which
sends a request handshaking signal. Following the completion of the
computations, this block sends a request signal to the output block. The inputs
to this block are 64 × 16 data lines, Global enable, clock, select1 lines and the
handshaking signal pairs (reqa, acka) and (reqc, ackc). This block comprises
three sets of 64 32-bit registers. This block generates the twiddle factors
by using the dedicated sine and cosine function generation components.
The central block’s main operation is to select the inputs from the
registers and the corresponding twiddle factor generation using the sine and
cosine generating components for computing the prime factor-point DFT.
This operation is performed by using three loops to reduce the size of the
program. The first loop is for specifying the stage of the FFT Computation.
The second loop is for specifying the set in the stage specified by the first
loop. The third loop is for specifying the point to be selected in a specified
set.
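The three-loop structure described above can be sketched in software. The following Python code is only an illustration under stated assumptions: it uses floating-point complex arithmetic rather than the 16-bit hardware format, it assumes a decimation-in-frequency ordering, and the names `prime_factors` and `mixed_radix_fft` are hypothetical. It is not the VHDL of the central block.

```python
import cmath


def prime_factors(n):
    """Return the prime factors of n in ascending order, with repetition."""
    factors, p = [], 2
    while n > 1:
        while n % p == 0:
            factors.append(p)
            n //= p
        p += 1
    return factors


def mixed_radix_fft(x):
    """Mixed-radix FFT (decimation in frequency) with a final reordering pass."""
    n_total = len(x)
    data = [complex(v) for v in x]
    factors = prime_factors(n_total)
    block = n_total
    for p in factors:                              # loop 1: stage (one prime factor)
        m = block // p
        for base in range(0, n_total, block):      # loop 2: set within the stage
            for j in range(m):                     # loop 3: point within the set
                pts = [data[base + j + m * q] for q in range(p)]
                for r in range(p):
                    # p-point DFT of the selected points
                    acc = sum(pts[q] * cmath.exp(-2j * cmath.pi * q * r / p)
                              for q in range(p))
                    # twiddle factor applied to the stage output
                    acc *= cmath.exp(-2j * cmath.pi * j * r / block)
                    data[base + r * m + j] = acc
        block = m
    # Reordering: map each storage position to its natural frequency index.
    out = [0j] * n_total
    for pos in range(n_total):
        k, weight, rem, size = 0, 1, pos, n_total
        for p in factors:
            size //= p
            digit, rem = divmod(rem, size)
            k += digit * weight
            weight *= p
        out[k] = data[pos]
    return out
```

Calling `mixed_radix_fft([1, 0, 0, 0, 0, 0])` handles the factors 2 and 3 of a 6-point input in two stages; for a 32-point input all five stages use the factor 2. The selected points are copied before being overwritten, so each stage can work in place on a single set of registers.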
This FFT computation block then enables the output block by
sending a request signal. The output block samples the data produced by the
computational block after giving the acknowledge signal.
4.6.4 Output block
The positive edge of the acknowledge pulse is the instant at which
the data from the central block is sampled and stored in the temporary pages.
This block is used to display the output. It receives its input from the
central block. It consists of input lines (64 × 16), output lines (1 × 16),
global enable, select lines (5), select1 lines (5) and a clock signal. In both
the real and imaginary parts, 8 bits are allocated for the integer part and
8 bits for the fractional part. The internal structure of the output block is
shown in Figure 4.11.
Figure 4.11 Internal structure of output block
4.7 IMPLEMENTATION OF GALS
The implementation makes use of the GALS technology to
interface the input, central and output blocks. The input and output
blocks are made to run at the same clock speed. The central block runs at a
different clock, thus making the design compatible with the GALS technology.
No global clock is used in asynchronous circuits; instead, some form of
handshaking is used in the communication between systems. These systems
are not completely asynchronous; they consist of synchronous sub-systems
communicating asynchronously. In Figure 4.12, the LS-system is a locally
synchronous system, Req is short for request and Ack is short for
acknowledge. Req and Ack perform the handshaking. In this project, a push
communication channel is used, which means that the producer of data
initiates the handshaking.
A 4-phase handshaking cycle is performed in the following way
(Req+ means Req goes high): Req+, Ack+, Req- and finally Ack-. Data should
be valid between Req+ and Ack-, but is often sampled on the Ack+ edge.
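The four-phase cycle on a push channel can be illustrated with a small event-level simulation. This is a sketch in Python under simplifying assumptions (no clocks, no delays, one signal change per step); the names `Channel`, `producer_step`, `consumer_step` and `run` are hypothetical and do not come from the design.

```python
class Channel:
    """Request, acknowledge and data wires between two LS systems."""

    def __init__(self):
        self.req = 0
        self.ack = 0
        self.data = None


def producer_step(ch, pending, log):
    """Sender side of a push channel; returns True if it changed a signal."""
    if ch.req == 0 and ch.ack == 0 and pending:
        ch.data = pending.pop(0)   # data is placed on the bus before Req+
        ch.req = 1
        log.append("Req+")
        return True
    if ch.req == 1 and ch.ack == 1:
        ch.req = 0
        log.append("Req-")
        return True
    return False


def consumer_step(ch, received, log):
    """Receiver side; samples the data on the Ack+ edge."""
    if ch.req == 1 and ch.ack == 0:
        received.append(ch.data)
        ch.ack = 1
        log.append("Ack+")
        return True
    if ch.req == 0 and ch.ack == 1:
        ch.ack = 0
        log.append("Ack-")
        return True
    return False


def run(words):
    """Transfer every word with one complete four-phase cycle each."""
    ch, pending, received, log = Channel(), list(words), [], []
    while producer_step(ch, pending, log) or consumer_step(ch, received, log):
        pass
    return received, log


received, log = run([10, 20])
print(received)  # [10, 20]
print(log[:4])   # ['Req+', 'Ack+', 'Req-', 'Ack-']
```

Both wires return to zero after every word, so each transfer starts from the same initial state; this is the property that distinguishes the four-phase protocol from the two-phase one.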
Figure 4.12 GALS asynchronous communication
The GALS technology is implemented in the three blocks of the
FFT processor as follows. The data transfer is always in a single direction,
i.e. the data always flows from the input block to the central block and then
from the central block to the output block. The input block is therefore
designed with a single pair of request and acknowledge signals. The central
block has to communicate with the input block as well as the output block,
and the data transfer is only one way. Hence, the central block is designed
with two sets of handshaking signals (request and acknowledge signals). The
output block communicates only with the central block and the data transfer
is only one way; for this reason, the output block is designed with a single
pair of handshaking signals. The GALS FFT processor and the variable point
FFT architecture are shown in Figures 4.13 and 4.14 respectively.
Figure 4.13 GALS FFT processor implementation
Figure 4.14 Variable point FFT architecture
The input block and the central block communicate through the
reqa (request) and acka (acknowledge) signals. The reqa signal is raised
by the input block, and the central block responds by raising the acka signal
and sampling the data from the data bus. The input block then turns the reqa
signal low, and the central block follows by turning the acka signal low. The
same process applies to the communication between the central block and the
output block. The signals involved in this asynchronous communication are
the reqc and ackc handshaking pair.
4.8 SIMULATION RESULTS
All the design blocks are implemented in VHDL, and the
simulation result is shown in Figure 4.15.
Figure 4.15 Output waveform of GALS implemented FFT processor
4.9 SYNTHESIZED REPORT
The FFT processor is synthesized for 8 points. Since this is a
variable-point processor, the number of loops, i.e. the corresponding number
of operations, varies with the number of points. The synthesized 8-point FFT
result is shown in Figure 4.18. The tool used for synthesis is Leonardo
Spectrum.
The report obtained by synthesizing the 8-point FFT processor,
taken as a snapshot from the system monitor, is shown in Figure 4.16.
Figure 4.16 Synthesized report for 8-point FFT processor
This synthesis report was obtained before the implementation of
GALS in the processor. According to the report, the LC utilization is 5010,
which for the Altera APEX20KE technology corresponds to 60.22%. The clock
frequency report generated by Leonardo Spectrum, also obtained as a
snapshot, is shown in Table 4.2.
Table 4.2 Clock frequency report
The clock frequency report in Table 4.2 gives a clear picture of the
frequency of operation of the blocks. The input block and the
output block can operate at a maximum frequency of 64.0 MHz. The central
block can operate at a maximum frequency of 13.7 MHz. Since the central
block involves numerous computation processes, it operates
slowly with respect to the others. When the time taken to compute the FFT
for a set of eight points is considered, the frequency of operation is
determined by the input and output blocks, since the time for eight clock
cycles of a 64 MHz clock is greater than a single clock period of a
13.7 MHz clock.
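The comparison in the last sentence can be checked directly from the two reported frequencies:

```latex
8 \times \frac{1}{64\ \text{MHz}} = 8 \times 15.625\ \text{ns} = 125\ \text{ns}
\;>\;
\frac{1}{13.7\ \text{MHz}} \approx 73.0\ \text{ns}
```

Transferring eight points at 64 MHz therefore takes longer than one clock period of the central block.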
The RTL schematic diagram obtained using the Leonardo Spectrum
tool is shown in Figure 4.17.
Figure 4.17 RTL schematic of 8-point FFT processor
The technology schematic diagram is shown in Figure 4.18.
Figure 4.18 Technology schematic of 8-point FFT processor
4.10 CONCLUSIONS
The various modules are implemented using GALS technology, and
each module can run at a different frequency. If the modules operate
synchronously, the critical path is lengthened, since it has to account
for the input block delay, the central block delay and the output block
delay (total delay = 1/64 MHz + 1/64 MHz + 1/13.7 MHz ≈ 104.2 ns, which
corresponds to about 9.6 MHz). The throughput is therefore reduced.
If the module is operated at 64 MHz, the central block will
take a few clock cycles to finish the operation. The delay of the whole block
will be 15.625 ns per clock cycle. If the FFT design uses the GALS approach,
the power required will be (P ∝ f, i.e. P1 ∝ 64 and P2 ∝ 13.7)
13.7/64 ≈ 21% of that of the synchronous operation. The speed of each block
is shown in Table 4.3.
Table 4.3 Speed of operation of various blocks
Sl. No.   Description                              Delay (speed of operation)
1         Input block – GALS implementation        64 MHz
2         Output block – GALS implementation       64 MHz
3         Central block – GALS implementation      13.7 MHz
4         Synchronous implementation               9.6 MHz
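The figures in Table 4.3 can be reproduced with a few lines of arithmetic. This Python sketch only re-evaluates the expressions quoted in this section from the reported 64 MHz and 13.7 MHz values; the variable names are hypothetical, and treating power as directly proportional to frequency is the same simplification used in the text.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


f_io_mhz = 64.0        # input and output blocks
f_central_mhz = 13.7   # central block

# Synchronous case: the three block delays add up on the critical path.
total_delay_ns = 2 * period_ns(f_io_mhz) + period_ns(f_central_mhz)
sync_freq_mhz = 1000.0 / total_delay_ns

# Power ratio under the assumption that power is proportional to frequency.
power_ratio = f_central_mhz / f_io_mhz

print(round(total_delay_ns, 1), round(sync_freq_mhz, 1), round(power_ratio, 2))
```

The summed delay is a little over 104 ns, which corresponds to the 9.6 MHz entry for the synchronous implementation, and the frequency ratio gives the quoted figure of about 21%.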
Another advantage is that global clock skew is avoided, since
the design is divided into three blocks and each block operates on its own
local clock.