unit ii : cpld & fpga architecture & applications

8/13/2019 UNIT II : CPLD & FPGA ARCHITECTURE & APPLICATIONS


Dr.Y.Narasimha Murthy [email protected]


INTRODUCTION : The Xilinx Programmable Gate Array, known as a Logic Cell Array (LCA),

is a high-density CMOS IC that combines user programmability with the flexibility of a gate

array architecture and the economy and testability of standard products. Xilinx reprogrammable

architectures are used because of their flexibility, low prices for small quantities, testability and

short development time. Most design changes can be implemented by reprogramming the LCAs.

Thus, the use of LCAs allows a design to go directly from schematic capture to a production board. The programmable logic blocks in the Xilinx family of FPGAs are called Configurable Logic Blocks (CLBs). The Xilinx architecture uses CLBs, I/O blocks, a switch matrix and an external memory chip to realize a logic function. The external memory stores the interconnection information, so the device can be reprogrammed simply by changing the configuration data stored in that memory.

XILINX Logic Cell Array : The Logic Cell Array is the architecture that Xilinx introduced in 1985 for its FPGA devices. It is effectively a proprietary, trademarked feature of Xilinx FPGA devices. The Xilinx LCA architecture consists of three major components:

(i). Configurable Logic Blocks (CLBs)

(ii). Input/Output Blocks (IOBs)

(iii). Programmable Interconnect

In addition, configuration memory is used to hold the configuration program bits which control the configuration of the CLBs, IOBs and interconnect.

This LCA architecture consists of an interior matrix of logic blocks and a surrounding ring of I/O

interface blocks. Interconnect resources occupy the channels between the rows and columns of

logic blocks and between the logic blocks and I/O blocks. Like a microprocessor, the LCA is a

program-driven logic device. The functions of the LCA's configurable logic blocks and I/O

blocks and their interconnection are controlled by a configuration program stored in an on-chip

memory. The configuration program is loaded automatically from an external memory on power-

up or on command, or is programmed by a microprocessor as part of system initialization.




As shown in the diagram below, the configuration memory consists of a distributed array of static memory cells. During configuration each cell is written through the data line, and during a read-back operation it is read through the same data line.

During normal operation the pass transistor is off and the cell provides continuous configuration control. There are five methods for loading configuration program data into the configuration memory: two of them load the data serially and three load it in a byte-wide parallel manner.

The LCA performance is determined by the speed of the logic, the storage elements and the programmable interconnect. It is specified by the maximum toggle rate of a logic block storage element configured as a toggle flip-flop. For typical applications, system clock rates are one-third to one-half of the maximum flip-flop toggle rate.

The core of the LCA is a matrix of identical Configurable Logic Blocks (CLBs). Each CLB contains programmable combinational logic and storage registers. The combinational logic section of the block is capable of implementing any Boolean function of its input variables. The registers can be loaded from the combinational logic or directly from a CLB input. The register outputs can be inputs to the combinational logic via an internal feedback path.




The periphery of the Logic Cell Array is made up of user-programmable input/output blocks (IOBs). Each block can be programmed independently to be an input, an output or a bi-directional pin with three-state control. Inputs can be programmed to recognize either TTL or CMOS thresholds. Each IOB also includes flip-flops that can be used to buffer inputs and outputs.

The flexibility of the LCA is due to resources that permit program control of the interconnection of any two points on the chip. The LCA interconnection resources include a two-layer metal network of lines that run horizontally and vertically in the rows and columns between the CLBs. Programmable switches connect the inputs and outputs of IOBs and CLBs to the nearest metal lines. Cross-point switches and interchanges at the intersections of rows and columns allow signals to be switched from one path to another. Long lines run the entire length or breadth of the chip, bypassing interchanges to provide distribution of critical signals with minimum delay or skew.

Configurable Logic Block (CLB) : The core of the FPGA is a matrix of identical Configurable Logic Blocks (CLBs). Each CLB contains a combinational logic array, program-controlled data multiplexers, and flip-flops. The CLB also contains RAM memory cells and can be programmed to realize any function of five variables or any two functions of four variables. The functions are stored in truth-table form, so the number of gates required to realize a function is not important. In the figure below, each trapezoidal block represents a multiplexer, which can be programmed to select one of its inputs. The block diagram of the CLB is shown below.

The array of CLBs provides the functional elements from which the user's logic is constructed. The logic blocks are arranged in a matrix within the perimeter of IOBs. For example, the XC3020A has 64 such blocks arranged in 8 rows and 8 columns. The development system is used to compile the configuration data which is to be loaded into the internal configuration memory to define the operation and interconnection of each block. User definition of CLBs and their interconnecting networks may be done by automatic translation from a schematic-capture logic diagram or, optionally, by installing library or user macros.

Each CLB has a combinatorial logic section, two flip-flops, and an internal control section. There are: five logic inputs (A, B, C, D and E); a common clock input (K); an asynchronous direct RESET input (RD); and an enable clock (EC). All may be driven from the interconnect resources adjacent to the blocks. Each CLB also has two outputs (X and Y) which may drive interconnect networks.

Data input for either flip-flop within a CLB is supplied from the function F or G outputs of the combinatorial logic, or from the block input, DI. Both flip-flops in each CLB share the asynchronous RD which, when enabled and High, is dominant over clocked inputs. All flip-flops are reset by the active-Low chip input, RESET, or during the configuration process. The flip-flops share the enable clock (EC) which, when Low, recirculates the flip-flops' present states and inhibits response to the data-in or combinatorial function inputs on a CLB. The user may enable these control inputs and select their sources. The user may also select the clock net input (K), as well as its active sense within each CLB. This programmable inversion eliminates the need to route both phases of a clock signal throughout the device.
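The priority between RD, EC and the data input described above can be summarized in a small behavioural sketch. This is only an illustration: the function name `clb_flip_flop_next` is hypothetical, both control inputs are assumed to be enabled, and the asynchronous reset is simplified by evaluating it at the same moment as the clock edge.

```python
def clb_flip_flop_next(q, d, ec, rd):
    """Next state of one CLB flip-flop, evaluated at an active edge of clock K.

    q: present state, d: selected data input (F, G or DI),
    ec: enable clock, rd: asynchronous direct reset (both assumed enabled).
    """
    if rd:        # RD High dominates the clocked inputs
        return 0
    if not ec:    # EC Low: the present state recirculates
        return q
    return d      # EC High: the flip-flop captures the data input


# Walk one flip-flop through a short stimulus sequence of (d, ec, rd) values.
state = 0
trace = []
for d, ec, rd in [(1, 1, 0), (0, 0, 0), (0, 1, 0), (1, 1, 1)]:
    state = clb_flip_flop_next(state, d, ec, rd)
    trace.append(state)
print(trace)
```

In the second step the data input is 0 but the state stays at 1, because EC is Low and the flip-flop holds its present value.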

The combinatorial-logic portion of the CLB uses a 32 by 1 look-up table to implement Boolean functions. Variables selected from the five logic inputs and the two internal block flip-flops are used as table address inputs. The combinatorial propagation delay through the network is independent of the logic function generated and is spike-free for single-input variable changes. Partial functions of six or seven variables are implemented by using the input variable (E) to dynamically select between two functions of four different variables. For the two functions of four variables each, the independent results (F and G) may be used as data inputs to either flip-flop or either logic block output. For the single function of five variables and the merged functions of six or seven variables, the F and G outputs are identical. Symmetry of the F and G functions and the flip-flops allows the interchange of CLB outputs to optimize routing efficiencies of the networks interconnecting the CLBs and IOBs.
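The look-up table idea can be illustrated with a minimal sketch: the function is evaluated once to fill a truth table, and afterwards the inputs simply address that table. The class name `LookupTable` and the example functions are assumptions chosen for illustration, not part of any Xilinx tool.

```python
class LookupTable:
    """A K-input, 1-bit-wide lookup table: 2**K stored bits addressed by the inputs."""

    def __init__(self, k, func):
        self.k = k
        # "Configuration": evaluate func once per address and store the truth table.
        self.bits = [
            int(bool(func(*[(addr >> i) & 1 for i in range(k)])))
            for addr in range(2 ** k)
        ]

    def read(self, *inputs):
        # The inputs form the memory address; input 0 is the least significant bit.
        assert len(inputs) == self.k
        addr = sum((bit & 1) << i for i, bit in enumerate(inputs))
        return self.bits[addr]


# One function of five variables (A..E) in a 32 x 1 table: odd parity.
parity5 = LookupTable(5, lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e)

# Two independent functions of four variables, each using a 16 x 1 half.
f = LookupTable(4, lambda a, b, c, d: a & b & c & d)
g = LookupTable(4, lambda a, b, c, d: (a | b) & (c ^ d))

print(len(parity5.bits), parity5.read(1, 0, 1, 1, 0))
print(f.read(1, 1, 1, 1), g.read(0, 1, 1, 0))
```

Because a read is a single memory access regardless of the stored function, the sketch also shows why the propagation delay does not depend on the logic function implemented.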

Input/Output Blocks (I/O Blocks):

The periphery of the Logic Cell Array is made up of user-programmable input/output blocks (IOBs). Each block can be programmed independently to be an input, an output or a bi-directional pin with three-state control. So, each user-configurable IOB provides an interface between the external package pin of the device and the internal user logic. Each IOB includes both registered and direct input paths, and provides a programmable 3-state output buffer which may be driven by a registered or direct output signal. Configuration options allow each IOB an inversion, a controlled slew rate and a high-impedance pull-up. Each input circuit also provides input clamping diodes for electrostatic protection, and circuits to inhibit latch-up produced by input currents.

The IOB also includes input and output storage elements and I/O options selected by configuration memory cells. A choice of two clocks is available on each die edge. The polarity of each clock line (not each flip-flop or latch) is programmable. A clock line that triggers the flip-flop on the rising edge is an active-Low Latch Enable (latch transparent) signal, and vice versa. Passive pull-up can only be enabled on inputs, not on outputs. All user inputs are programmed for TTL or CMOS thresholds.

The input-buffer portion of each IOB provides threshold detection to translate external signals

applied to the package pin to internal logic levels. The global input-buffer threshold of the IOBs

can be programmed to be compatible with either TTL or CMOS levels. The buffered input signal

drives the data input of a storage element, which may be configured as either a flip-flop or a

latch. The clocking polarity (rising/falling edge-triggered flip-flop, High/Low transparent latch)

is programmable for each of the two clock lines on each of the four die edges. Note that a clock

line driving a rising edge-triggered flip-flop makes any latch driven by the same line on the same

edge Low-level transparent and vice versa (falling edge, High transparent). All Xilinx primitives

in the supported schematic-entry packages, however, are positive edge-triggered flip-flops or

High transparent latches. When one clock line must drive flip-flops as well as latches, it is

necessary to compensate for the difference in clocking polarities with an additional inverter




either in the flip-flop clock input or the latch-enable input. I/O storage elements are reset during

configuration or by the active-Low chip RESET input. Both direct input (from IOB pin I) and

registered input (from IOB pin Q) signals are available for interconnect.

Programmable-interconnection resources in the Field Programmable Gate Array provide routing

paths to connect inputs and outputs of the IOBs and CLBs into logic networks. Interconnections

between blocks are composed of a two-layer grid of metal segments. Specially designed pass

transistors, each controlled by a configuration bit, form programmable interconnect points (PIPs)

and switching matrices used to implement the necessary connections between selected metal

segments and block pins.

The figure below is an example of a routed net. The development system provides automatic

routing of these interconnections. Interactive routing is also available for design optimization.

The inputs of the CLBs or IOBs are multiplexers which can be programmed to select an input

network from the adjacent interconnect segments. Since the switch connections to block inputs

are unidirectional, as are block outputs, they are usable only for block input connection and not

for routing. The figure below illustrates routing access to logic block input variables, control inputs

and block outputs.




Three types of metal resources are provided to fulfill various network interconnect requirements:

(i). General Purpose Interconnect

(ii). Direct Connection

(iii). Long lines (multiplexed busses and wide AND gates)

General Purpose Interconnect

It consists of a grid of five horizontal and five vertical metal segments located between the rows

and columns of logic and IOBs. Each segment is the height or width of a logic block. Switching

matrices join the ends of these segments and allow programmed interconnections between the

metal grid segments of adjoining rows and columns. The switches of an un-programmed device

are all non-conducting. The connections through the switch matrix may be established by the

automatic routing or by selecting the desired pairs of matrix pins to be connected or

disconnected.

Special buffers within the general interconnect areas provide periodic signal isolation and

restoration for improved performance of lengthy nets. The interconnect buffers are available to

propagate signals in either direction on a given general interconnect segment. These bidirectional

(bidi) buffers are found adjacent to the switching matrices, above and to the right. The other PIPs




adjacent to the matrices are accessed to or from Long lines. The development system

automatically defines the buffer direction based on the location of the interconnection network

source. The delay calculator of the development system automatically calculates and displays the

block, interconnect and buffer delays for any paths selected. Generation of the simulation net list

with a worst-case delay model is provided.

Direct Interconnect

Direct interconnect provides the most efficient implementation of networks between adjacent

CLBs or I/O Blocks. Signals routed from block to block using the direct interconnect exhibit

minimum interconnect propagation and use no general interconnect resources.

For each CLB, the X output may be connected directly to the B input of the CLB immediately to

its right and to the C input of the CLB to its left. The Y output can use direct interconnect to

drive the D input of the block immediately above and the A input of the block below. Direct

interconnect should be used to maximize the speed of high-performance portions of logic. Where

logic blocks are adjacent to IOBs, direct connect is provided alternately to the IOB inputs (I) and




outputs (O) on all four edges of the die. The right edge provides additional direct connects from

CLB outputs to adjacent IOBs.
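The CLB-to-CLB direct connections described above can be tabulated with a short sketch. It is illustrative only: the function name `direct_interconnect` is hypothetical, the 8 by 8 array size is borrowed from the XC3020A example, row 0 is assumed to be the top row, and the direct connects between edge CLBs and IOBs are left out.

```python
def direct_interconnect(row, col, rows=8, cols=8):
    """List (output, target_row, target_col, target_input) for the CLB at (row, col).

    Assumes row 0 is the top row and column 0 the left-most column.
    """
    candidates = [
        ("X", row, col + 1, "B"),   # X drives B of the CLB to the right
        ("X", row, col - 1, "C"),   # X drives C of the CLB to the left
        ("Y", row - 1, col, "D"),   # Y drives D of the CLB above
        ("Y", row + 1, col, "A"),   # Y drives A of the CLB below
    ]
    # Keep only neighbours that exist inside the CLB array.
    return [c for c in candidates if 0 <= c[1] < rows and 0 <= c[2] < cols]


print(direct_interconnect(3, 3))   # interior CLB: four direct connections
print(direct_interconnect(0, 0))   # corner CLB: only right and below neighbours
```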

Long lines

The Long lines bypass the switch matrices and are intended primarily for signals that must travel a long distance, or must have minimum skew among multiple destinations. Long lines run vertically and horizontally the height or width of the interconnect area. Each interconnection column has three vertical Long lines, and each interconnection row has two horizontal Long lines.

Vertical and Horizontal Long Lines




Two additional Long lines are located adjacent to the outer sets of switching matrices. Long lines

can be driven by a logic block or IOB output on a column-by-column basis. This capability

provides a common low skew control or clock line within each column of logic blocks. Isolation

buffers are provided at each input to a Long line and are enabled automatically by the

development system when a connection is made.

Technology Mapping for FPGA :

An FPGA consists of a regular array of logic blocks that implement combinational and

sequential logic functions and a user programmable routing network that provides connections

between the logic blocks. In conventional ASIC implementation technologies such as Mask

Programmed Gate Arrays (MPGAs) and Standard Cells the connections between logic blocks

are implemented by metallization at a fabrication facility. In an FPGA the connections are

implemented in the field using the user programmable routing network. This reduces

manufacturing turn-around times drastically from weeks to minutes and reduces prototype

costs.

The limitations, however, are the density and performance penalties associated with user-programmable routing. The programmable connections, which consist of metal wire segments connected by programmable switches, occupy greater area and incur greater delay than simple metal wires. To reduce the density penalty, FPGA architectures employ highly functional logic blocks, such as lookup tables, that reduce the total number of logic blocks and hence the number of programmable connections needed to implement a given application. These complex logic blocks also reduce the performance penalty by reducing the number of logic blocks and programmable connections on the critical paths in the circuit.

The high functionality of FPGA logic blocks presents new challenges for logic synthesis. Technology mapping addresses these challenges for FPGAs that use lookup tables to implement combinational logic. Technology mapping is the process of transforming a technology-independent Boolean network into a technology-dependent network. For example, a K-input lookup table (LUT) is a digital memory that can implement any Boolean function of K variables. The K inputs are used to address a 2^K by 1-bit memory that stores the truth table of the Boolean function. Lookup tables have been shown to be an area-efficient method of implementing combinational functions, and the delays of LUT-based FPGAs are low compared to the delays of FPGAs using other types of logic blocks. The goal of technology mapping is to reduce area, delay or a combination of both.

Technology mapping is the logic synthesis task that is directly concerned with selecting the circuit elements used to implement the optimized circuit. Previous approaches to technology mapping have focused on using circuit elements from a limited set of simple gates. However, such approaches are inappropriate for complex logic blocks where each logic block can implement a large number of functions. A K-input lookup table can implement 2^(2^K) different functions. For values of K greater than 3, the number of different functions becomes too large for conventional technology mapping. Therefore, new approaches to technology mapping are required for LUT-based FPGAs.
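A quick calculation shows how fast the number of functions grows. The sketch below simply counts the ways of filling the 2^K truth-table bits; the helper name `lut_function_count` is an illustrative choice.

```python
def lut_function_count(k):
    """Number of distinct Boolean functions of k variables: one per truth-table filling."""
    table_bits = 2 ** k          # a K-input LUT stores 2**K bits
    return 2 ** table_bits       # each bit is independently 0 or 1


counts = {k: lut_function_count(k) for k in range(1, 6)}
for k, n in counts.items():
    print(f"K={k}: {2 ** k:2d} memory bits, {n} functions")
```

For K = 3 there are 256 functions, for K = 4 there are 65536, and for K = 5 more than four billion, which is why a library with one entry per function quickly becomes impractical.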

Library-Based Technology Mapping : In library based mapping, gates or components are

selected from a technology library to implement a circuit. Hence it is also referred to as library

binding. So, this method generates a technology mapping for a given Boolean network using a

characterized cell library with the objective of cost optimization or delay optimization. Standard

Cells and Mask Programmed Gate Arrays implement combinational functions using a limited

set of simple gates. For such ASIC technologies library-based technology mapping is very

useful.

In this methodology the set of available circuit elements is represented as a library of functions, and the construction of the optimized circuit is divided into three sub-problems: (i). Decomposition, (ii). Matching and (iii). Covering.

The original network is first decomposed into a canonical representation that uses limited fan in

NAND nodes. This decomposition guarantees that there will be no nodes in the network that are

too large to be implemented by any library element provided the library includes NAND gates

that reach the fan in limit.

After decomposition the network is partitioned into a forest of trees. The optimal sub circuit

covering each tree is constructed and finally the circuit covering the entire network is assembled

from these sub circuits. To form the forest of trees, the decomposed network is partitioned at fan

out nodes into a set of single output sub networks.

Each of these sub networks is either a tree or a leaf DAG (Directed Acyclic Graph). A leaf DAG

is a multi input single output DAG where only the input nodes have fan out greater than one.

Each leaf DAG is converted into a tree by creating a unique instance of every input node for

each of its multiple fan-out edges. The optimal circuit implementing each tree is constructed

using a dynamic programming traversal that proceeds from the leaf nodes to the root node.
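The partitioning step can be sketched as follows. This is a minimal illustration under stated assumptions: the network is given as a dictionary `fanins` mapping each gate to its fan-in nodes, primary inputs are the names that never appear as keys, and the function names and the small example network are hypothetical.

```python
def tree_roots(fanins, outputs):
    """Roots of the forest: primary outputs plus every gate with fan-out greater than one."""
    fanout = {}
    for ins in fanins.values():
        for src in ins:
            fanout[src] = fanout.get(src, 0) + 1
    roots = set(outputs)
    roots.update(n for n in fanins if fanout.get(n, 0) > 1)
    return roots


def extract_tree(root, fanins, roots):
    """Nested-tuple tree rooted at `root`; it stops at primary inputs and at other roots."""
    def build(node, is_root):
        if node not in fanins or (node in roots and not is_root):
            return node                      # leaf: a fresh instance for every fan-out edge
        return (node,) + tuple(build(src, False) for src in fanins[node])
    return build(root, True)


# Hypothetical network: gate g1 feeds both g2 and g3, so the network is cut at g1.
fanins = {"g1": ["a", "b"], "g2": ["g1", "c"], "g3": ["g1", "d"], "out": ["g2", "g3"]}
roots = tree_roots(fanins, ["out"])
forest = {r: extract_tree(r, fanins, roots) for r in sorted(roots)}
print(forest)
```

Each tree in `forest` can then be covered independently, with `g1` treated as an ordinary leaf of the tree rooted at `out`.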

For every node in the tree an optimal circuit implementing the sub tree extending from the node

to the leaf nodes is constructed. This circuit consists of a library element that matches a sub

function rooted at the node and previously constructed circuits implementing its inputs. The cost

of the circuit is calculated from the cost of the matched library element and the cost of the

circuits implementing its inputs.

To find the lowest-cost circuit, the DAGON algorithm first finds all library elements that match sub

functions rooted at the node. The cost of the circuit using each of these candidate library

elements is then calculated and the lowest-cost circuit is retained. The set of library elements

is found by searching through the library and using tree matching to determine if each library

element matches a sub function rooted at the node.

As an example, consider the library shown in figure (a) below and the circuit shown in figure (b). The circuit elements are standard cells and their costs are given in terms of the area of the cells. The costs of the INV, NAND-2 and AOI-21 cells are 2, 3 and 4 respectively. In figure (b) the only library element matching at node E is the NAND-2, and the cost of the optimal circuit implementing node E is therefore 3. At node C the only matching library element is also the NAND-2. The cost of the NAND-2 is 3 and the cost of the optimal circuit implementing its input E is also 3. Therefore, the cumulative cost of the optimal circuit implementing node C is 6.

Finally the algorithm reaches node A. For node A there are two matching library elements: the INV as used in figure (b) and the AOI-21 as used in figure (c). The circuit constructed using the INV matching A includes a NAND-2 implementing node B, a NAND-2 implementing node C, an INV implementing node D and a NAND-2 implementing node E. The cumulative cost of this circuit is 13. The circuit constructed using the AOI-21 matching A includes a NAND-2 implementing node E. The cumulative cost of this circuit is 7. The circuit using the AOI-21 is therefore the optimal circuit implementing node A.
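The cost calculation in this example can be reproduced with a small dynamic-programming sketch. It is not the DAGON implementation: the tree shape is an assumption reconstructed from the node descriptions in the text (A = INV(B), B = NAND(C, D), C = NAND(E, r), D = INV(s), E = NAND(p, q)), the primary-input names p, q, r, s are hypothetical, and the pattern matcher is deliberately simple.

```python
from functools import lru_cache

# Library cells as (name, area cost, pattern). Lower-case strings in a pattern are
# wildcards that bind to any sub-tree (the inputs of the cell).
LIBRARY = [
    ("INV", 2, ("INV", "a")),
    ("NAND-2", 3, ("NAND", "a", "b")),
    ("AOI-21", 4, ("INV", ("NAND", ("NAND", "a", "b"), ("INV", "c")))),
]


def match(pattern, node):
    """Return the sub-trees bound to the pattern's inputs, or None if there is no match."""
    if isinstance(pattern, str):
        return [node]
    if isinstance(node, str) or pattern[0] != node[0] or len(pattern) != len(node):
        return None
    orders = [node[1:]]
    if node[0] == "NAND":                     # NAND inputs are interchangeable
        orders.append((node[2], node[1]))
    for children in orders:
        bound = []
        for sub_pattern, child in zip(pattern[1:], children):
            sub = match(sub_pattern, child)
            if sub is None:
                break
            bound.extend(sub)
        else:
            return bound
    return None


@lru_cache(maxsize=None)
def best_cover(node):
    """(cost, cell name) of the cheapest circuit implementing the tree rooted at node."""
    if isinstance(node, str):                 # primary input: nothing to implement
        return (0, None)
    best = None
    for name, cost, pattern in LIBRARY:
        inputs = match(pattern, node)
        if inputs is None:
            continue
        total = cost + sum(best_cover(i)[0] for i in inputs)
        if best is None or total < best[0]:
            best = (total, name)
    return best


E = ("NAND", "p", "q")
C = ("NAND", E, "r")
D = ("INV", "s")
B = ("NAND", C, D)
A = ("INV", B)
print(best_cover(E), best_cover(C), best_cover(A))
```

Under these assumptions the sketch gives a cost of 3 at E, 6 at C and 7 at A using the AOI-21, against 2 + 11 = 13 when the INV is chosen at A, which agrees with the figures quoted in the text.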

The major obstacle to applying library-based technology mapping to LUT circuits is the large number of different functions that a K-input LUT can implement. The function implemented by a K-input LUT is determined by the values stored in its 2^K memory bits. Since each bit can independently be either 0 or 1, there are 2^(2^K) different Boolean functions of K variables.

For values of K greater than 3, the library required to represent a K-input LUT becomes very large. The size of the library can be reduced by noting that some patterns are equivalent after a permutation of inputs. The inversion of outputs or inputs, which is trivially accomplished with a LUT, can also produce equivalent patterns.

Another alternative is to use a partial library tuned to take advantage of the network structure likely to be produced by technology-independent logic optimization. The limitation of this approach is that it precludes some opportunities for optimization of the final circuit.

LUT-based Technology Mapping:


The limitations of earlier technology mapping approaches paved the way for the development of technology mapping that deals specifically with LUT circuits. The first LUT-based technology mappers appeared in the 1990s, and were later improved to optimize the delay performance of LUT circuits by minimizing the number of levels of LUTs in the final circuit.

In LUT-based FPGAs (for example, Xilinx FPGAs) the building blocks are LUTs and flip-flops. In a LUT-based FPGA chip the basic programmable logic block is a K-input lookup table (K-LUT), which can implement any Boolean function of up to K variables. Technology mapping in LUT-based FPGA designs covers a general Boolean network using K-LUTs to obtain a functionally equivalent K-LUT network. The main objectives in LUT mapping are:

(i). Cost-optimal mapping, i.e. minimizing the number of LUTs and the number of CLBs

(ii). Delay-optimal mapping, i.e. minimizing the number of LUT levels and the delays (including routing delays)

(iii). Maximizing the routability of the mapping schemes.

LUT-based technology mapping can be implemented using two types of algorithms:

(a). The area algorithm and (b). The delay algorithm

The Area Algorithm :

A circuit can be implemented by a given FPGA only if the number of logic blocks in the circuit

does not exceed the available number of logic blocks and the required connections between the

logic blocks do not exceed the capacity of the routing network. The area algorithm minimizes






In figure (a) the shaded OR node is not decomposed and 5 levels of LUTs are required to implement the network. However, if the OR node is decomposed into the two nodes shown in figure (b), then only 4 levels of LUTs are required.

The delay algorithm, like the area algorithm, first partitions the original network into a forest of trees, maps each tree separately into a circuit of K-input LUTs and then assembles the circuit implementing the entire network from the circuits implementing the trees. The trees are mapped in a breadth-first order proceeding from the primary inputs toward the primary outputs. This ensures that when each tree is mapped, the trees implementing its leaf nodes have already been mapped.

The overall strategy employed by the delay algorithm is to minimize the number of levels of

LUTs by minimizing the depth of every path in the final circuit. This can result in a circuit that

contains a large number of LUTs.
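The quantity that the delay algorithm minimizes, the number of LUT levels, can be computed for a mapped circuit with a short sketch. The dictionary format and the two example circuits are assumptions made for illustration; both implement a four-input function with two-input LUTs.

```python
def lut_levels(fanins, outputs):
    """Depth of a LUT circuit: primary inputs are level 0, each LUT adds one level."""
    memo = {}

    def level(node):
        if node not in fanins:               # primary input
            return 0
        if node not in memo:
            memo[node] = 1 + max(level(src) for src in fanins[node])
        return memo[node]

    return max(level(out) for out in outputs)


# The same four-input function mapped two ways onto 2-input LUTs.
chain = {"l1": ["a", "b"], "l2": ["l1", "c"], "l3": ["l2", "d"]}
balanced = {"l1": ["a", "b"], "l2": ["c", "d"], "l3": ["l1", "l2"]}
print(lut_levels(chain, ["l3"]), lut_levels(balanced, ["l3"]))
```

Both mappings use three LUTs, but the balanced one needs two levels instead of three, so a delay-driven mapper would prefer it.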

MULTIPLEXER BASED TECHNOLOGY MAPPING:

Multiplexer-based technology mapping is used in Actel FPGAs and in recent Xilinx Virtex-6 FPGA devices, because their logic block architectures are MUX-based. In Actel FPGAs the multiplexers are small, which suits the objectives of area optimization and minimum delay.

Circuits usually contain a large number of multiplexers (MUXes). This is mainly true for circuits that are automatically synthesized from high-level descriptions. MUXes exist in the data-paths of circuits, where they are used to route operands to operators. Also, the control logic is frequently specified as a CASE statement in HDL descriptions, and MUXes arise from the direct translation of CASE statements into a logic-level description. Cell libraries, too, contain various choices of MUXes. Cell implementations make use of the fact that a pass-gate implementation of a MUX is both faster and smaller. In the case of MUX-based FPGAs like Actel, MUXes are naturally present in the virtual library. Thus, a method for mapping MUXes in the unmapped network to those in the library is desirable.

The significance of multiplexer synthesis is mainly due to the fact that multiplexer tree circuits suit FPGAs such as the ACT family from Actel, where the basic building block consists of multiplexers. Each basic building block of the ACT family allows the implementation of a multiplexer (a) and, in the case of the ACT 1 family, the implementation of three hierarchical multiplexers (b), which is denoted by act0. The ACT 2 family allows only a restricted realization of three hierarchical multiplexers, as can be seen in Fig. (b).

Basic building block of the ACT family: (a) ACT 1 family; (b) ACT 2 family.

The main objective behind MUX-based technology mapping is to describe a combinational circuit in terms of Boolean equations and realize it using the minimum number of basic blocks of the target MUX-based architecture, while minimizing the delay on the critical path.

In this algorithm an appropriate base function, a library of cells and a set of pattern graphs are selected. As an example, let us select a 2-to-1 multiplexer as the base function.
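Why a 2-to-1 multiplexer works as a base function can be seen from Shannon expansion: any Boolean function can be split on one variable into two cofactors, and a 2-to-1 MUX selects between them. The sketch below is a generic illustration of that idea, not the mapping algorithm itself; the function names and the example function are hypothetical.

```python
from itertools import product


def mux2(d0, d1, sel):
    """2-to-1 multiplexer: pass d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def build_mux_tree(func, n, fixed=()):
    """Shannon-expand func over variables 0..n-1 into nested ("MUX", var, lo, hi) tuples."""
    if len(fixed) == n:
        return int(bool(func(*fixed)))          # constant leaf: 0 or 1
    lo = build_mux_tree(func, n, fixed + (0,))  # cofactor with this variable = 0
    hi = build_mux_tree(func, n, fixed + (1,))  # cofactor with this variable = 1
    if lo == hi:                                # variable has no effect here: no MUX needed
        return lo
    return ("MUX", len(fixed), lo, hi)


def evaluate(tree, inputs):
    """Evaluate a MUX tree using only mux2 on constants and input variables."""
    if not isinstance(tree, tuple):
        return tree
    _, var, lo, hi = tree
    return mux2(evaluate(lo, inputs), evaluate(hi, inputs), inputs[var])


def f(a, b, c):
    return (a & b) | c


tree = build_mux_tree(f, 3)
ok = all(evaluate(tree, bits) == f(*bits) for bits in product((0, 1), repeat=3))
print(tree, ok)
```

The resulting tree contains only 2-to-1 multiplexers with constants at the leaves, which is the form that a MUX-based mapper then covers with library patterns.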




The figure shows two MUX structures, STRUCT and STRUCT1. Four pattern graphs are constructed for STRUCT1, as shown in the figure below. If a function is realizable by one STRUCT1 block, it uses either all the multiplexers, two of them, or just one. The pattern graphs are in one-to-one correspondence with these possibilities. So, only a very small set of patterns is needed to capture all possible functions realizable by one STRUCT1 block. From the figure it is clear that the pattern graph uses all the multiplexers.

The introduction of the OR gate at the select input of MUX3 increases the number of functions realized by the block. From an algorithmic point of view it creates some problems, but a modification of the algorithm is considered to handle the occurrence of the OR structure.
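A block of three multiplexers with an OR gate on the select input of the output multiplexer can be modelled in a few lines. This is a sketch under assumptions: the pin names (a0, a1, sa, b0, b1, sb, s0, s1) are illustrative labels for a STRUCT1-style block, and the two mapped functions are hypothetical examples.

```python
from itertools import product


def mux2(d0, d1, sel):
    """2-to-1 multiplexer: pass d1 when sel is 1, otherwise d0."""
    return d1 if sel else d0


def struct1(a0, a1, sa, b0, b1, sb, s0, s1):
    """Two input MUXes feed a third MUX whose select is the OR of s0 and s1."""
    mux1 = mux2(a0, a1, sa)
    mux2_out = mux2(b0, b1, sb)
    return mux2(mux1, mux2_out, s0 | s1)


def and2(x, y):
    # Uses two MUXes: the output MUX selects on x, the lower MUX passes y.
    return struct1(0, 0, 0, 0, 1, y, x, 0)


def or3(x, y, z):
    # Uses the OR gate: x | y picks the constant 1, otherwise the upper MUX passes z.
    return struct1(0, 1, z, 1, 1, 0, x, y)


and_ok = all(and2(x, y) == (x & y) for x, y in product((0, 1), repeat=2))
or_ok = all(or3(x, y, z) == (x | y | z) for x, y, z in product((0, 1), repeat=3))
print(and_ok, or_ok)
```

The three-input OR fits in a single block only because of the OR gate on the select input, which illustrates how that gate enlarges the set of functions one block can realize.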

The advantage of MUX-based technology mapping is that it generates optimal mappings, which are often much better than those produced by conventional heuristic techniques. Moderately large circuits can be mapped optimally in a small amount of time. Very large circuits can be mapped near-optimally by partitioning the circuits and mapping each partition individually.




---------xxx--------------

