
The Usage of Dual Edge Triggered Flip-flops

in Low Power, Low Voltage Applications

by

Wai Man Chung

A thesis

presented to the University of Waterloo

in fulfillment of the

thesis requirement for the degree of

Master of Applied Science

in

Electrical and Computer Engineering

Waterloo, Ontario, Canada, January 2003

© Wai Man Chung, 2003

I hereby declare that I am the sole author of this thesis.

I authorize the University of Waterloo to lend this thesis to other institutions or individuals

for the purpose of scholarly research.

Wai Man Chung

I authorize the University of Waterloo to reproduce this thesis by photocopying or other

means, in total or in part, at the request of other institutions or individuals for the purpose

of scholarly research.

Wai Man Chung


The University of Waterloo requires the signatures of all persons using or photocopying

this thesis. Please sign below, and give address and date.


Acknowledgements

I would like to thank...

Professor Manoj Sachdev

for the opportunity of working on this project, his guidance and especially his patience;

Dr. A. Opal and Dr. M.A. Hasan, my thesis readers,

for their invaluable time in reviewing my thesis;

Tim Lo and Sarmad Musa,

who have shown me what it is like to work in a real team;

Bhaskar Chatterjee, Oleg Semonov, Mohamed Elgebaly

for their insightful discussions;

Phil Regier,

whom I have given tremendous trouble to;

Wendy Boles and Wendy Gauthier, our helpful and friendly secretaries,

for their help and great smiles;

Gennum Corporation’s HIP group, Mr. Rob Cram in particular,

for their interest and support;

Canadian Microelectronics Corporation (CMC)

for providing resources and for their support;

My friends, Ying, Jen, Chris, Hadi, Ed, Ka Lok, Zhinian, Dorothy, etc.,

for bringing me so much laughter!!

My best friend Shannen for her sense of humor and encouragement;

My best friend Nora, who has listened and cheered me all along;

My family for their love, care and faith in me;

God, our heavenly Father, for His blessings.


Abstract

In research on low-power, low-voltage VLSI circuits, the use and implementation of dual edge triggered flip-flops (DETFFs) have gained attention at the gate level of design. The main advantage of a DETFF is that it maintains a constant throughput while operating at only half the clock frequency. This thesis compares four previously published static DETFFs, together with our proposed design, in terms of performance, power dissipation, and suitability for low-voltage, low-power applications. For each DETFF, the optimal delay, power consumption, and energy are determined as the primary figures of merit. The proposed design demonstrates the lowest energy at low voltages. To illustrate the advantages of DETFFs over conventional single-edge triggered flip-flops (SETFFs), a digital half-band FIR filter is designed, implemented, and used as a benchmark circuit for further investigation. The implementation of the FIR filter with DETFFs exhibits a power saving of 38% over the implementation with SETFFs.


Contents

1 Introduction
1.1 Motivation
1.2 Thesis Outline
2 Low Power and Low Voltage CMOS Design
2.1 Low Voltage
2.2 Parallelism
2.3 Pipeline
2.4 Switching Activity Reduction
2.5 Switching Capacitance Reduction
3 Dual Edge Triggered Flip-Flop
3.1 Types of Flip-flops
3.2 Dual Edge-Triggered Flip-flops
3.2.1 DETFF implementations
3.3 Analysis of Dual-Edge Triggered Flip-flops
3.3.1 Power Consumption of a DETFF
3.3.2 Timing Characterization of a DETFF
3.3.3 Simulation
3.4 Results
3.4.1 Discussion
3.5 Other Considerations
3.5.1 Parallel Interconnects and Clock Requirements
3.5.2 Design for testability
4 Hearing Aids and Digital Filters
4.1 Hearing Aids
4.2 Digital Filters
4.2.1 Half-Band FIR
4.2.2 Implementation of FIR - Direct Form Structure
4.3 Design and Implementation of a Chebyshev Half-Band FIR Filter
4.3.1 Number representation
4.3.2 Processing Blocks
5 Results and Discussions
5.1 Measurements
5.2 Discussions
6 Conclusions
6.1 Future Work
A Glossary of Terms


List of Tables

3.1 Simulation parameters
3.2 Optimal parameters for DETFFs studied [1]
3.3 Performance Characteristics for DETFFs studied [1]
3.4 Summary of DETFF performance as Vdd reduces [1]
4.1 Specifications for Chebyshev half-band FIR filter with N=39
5.1 Back annotation parameters for DETFFs and standard SETFF
5.2 Simulation Results for Chebyshev half-band FIR filter operating at 0.9 V


List of Figures

2.1 A generic data path of a synchronous system
2.2 Parallel architecture operating at lower clock rate and with reduced Vdd
2.3 Pipeline architecture
2.4 Latch with clock gating [2]
2.5 Interconnect Capacitance
3.1 Classic implementation of DETFF
3.2 DETFF proposed in [3], DETgago
3.3 DETFF proposed in [4], DETllopis
3.4 DETFF proposed in [5], DETpedram
3.5 DETFF proposed in [6], DETstrollo
3.6 Proposed DETFF, DETproposed
3.7 The simulation testbench for flip-flops
3.8 DETFF internal node transitions given input sequence of 1010101000
3.9 PDPCQ vs tCQ, used to determine the initial optimization point [1]
3.10 Power consumption dependence on data transition activity α [1]
3.11 Power consumption dependence on supply voltage [1]
3.12 tCQ as a function of supply voltage [1]
3.13 PDP dependency as a function of supply voltage [1]
3.14 DETFF proposed in [4] with unidirectional characteristic
3.15 DETFF using XOR operation to generate delayed clock pulses [7]
3.16 DETFF proposed in [8] with DET pulse generator
4.1 FIR filter implementing with direct form structure
4.2 Block Diagram of the Chebyshev half-band FIR filter design
4.3 (a) Bit-serial adder (b) Bit-serial subtractor [9]
4.4 Serial/Parallel multiplier
4.5 Structures for shift registers
5.1 DETgago with reset active low control
5.2 DETproposed with reset active low control
5.3 Frequency response of the Chebyshev half-band FIR filter
5.4 Improved implementation of DETgago with reset active low control
5.5 Layout of the half-band FIR filter implemented with standard SETFF
5.6 Layout of the half-band FIR filter implemented with standard DETFF


Chapter 1

Introduction

Today's technologies make possible powerful computing devices with multimedia capabilities. Consumer attitudes are shifting towards better accessibility and mobility, which has created demand for an ever-increasing number of portable applications requiring low power and high throughput. For example, notebook and handheld computers are now made with computational capabilities competitive with those of desktop machines. Equally demanding are personal communication applications in a pocket-sized device, in which not only voice but also data and video are transmitted via wireless links. It is important that these high computational capabilities be placed in a low-power, portable environment. The weight and size of these portable devices are determined by the amount of power required. Battery lifetime is crucial for such products; hence, a well-planned low-energy design strategy must be in place [10, 11].

As the density of integrated circuits and the size of chips and systems continue to grow, it becomes more and more difficult to provide adequate cooling for the systems [10]. In addition to heat removal, there are also economic and environmental reasons for low-power development. In the United States, computer equipment accounts for about 2-3% of national electricity consumption. This figure is expected to increase with the tremendous growth in household computer applications, Web phones, handheld computers, and Internet terminals [11, 12, 13]. These economic and environmental considerations have created a requirement for energy-efficient computers.

To meet the demands of computationally intensive applications, clock rates are steadily increasing, and clock jitter and clock skew account for an increasingly significant part of the clock cycle. The energy consumed by low-skew clock distribution networks is continually growing. Clock-related power consumption can exceed 30-40% of the total power of a microprocessor and is becoming a larger fraction of the chip power. In addition, the number of logic gate delays in a clock period is reduced by 25% per generation and is approaching a value of 10 or smaller beyond the 0.13 µm technology generation. As a result, the latency of flip-flops and latches is becoming a larger portion of the cycle time. To achieve a design that is both high-performance and power-efficient, careful attention must be paid to the design of the flip-flops and latches [7, 14].

1.1 Motivation

Energy consumption is the product of average power and delay, and power is proportional to the square of the supply voltage. Voltage reduction is an efficient way to reduce power consumption; yet it also reduces logic speed. For signal processing applications such as digital filters, however, it is important to maintain a given level of computation or throughput. Hence, parallel architectures should be used to maintain the throughput at a reduced supply voltage [10]. The dual-edge triggered flip-flop is a device-level realization of this concept. It achieves the same data throughput at one half of the clock frequency, thus relaxing the power and clock uncertainty requirements [14].


1.2 Thesis Outline

This thesis focuses on the applicability of DETFFs in low-power, low-voltage applications. Chapter 2 provides background on low-power and low-voltage applications and considerations. Techniques for low power and low voltage, such as parallelism, are also described in this chapter. Chapter 3 first describes all the DETFFs investigated in this study, including a newly proposed DETFF. It states the analysis methodology used and outlines the simulation testbench and parameters. The DETFF optimization procedure is also explained, followed by simulation results. Chapter 4 provides a brief introduction to a stringent low-power, low-voltage application: hearing aids. It then describes a Chebyshev half-band finite impulse response (FIR) digital filter and its implementation. This FIR filter is used as a benchmark circuit to further investigate the usage of DETFFs in practical low-power, low-voltage applications. Chapter 5 summarizes the simulation and layout results of the FIR filter. Finally, the discussion and conclusions are presented in Chapter 6.

Chapter 2

Low Power and Low Voltage CMOS

Design

The design of portable devices requires consideration for peak power consumption to ensure

reliability and proper operation. However, the time averaged power is often more critical

as it is linearly related to the battery life. There are four sources of power dissipation

in digital CMOS circuits: switching power, short-circuit power, leakage power and static

power. The following equation describes these four components of power:

$$P_{avg} = P_{switching} + P_{short\text{-}circuit} + P_{leakage} + P_{static} \tag{2.1}$$

$$P_{avg} = \alpha C_L V_{dd} V_s f_{ck} + I_{sc} V_{dd} + I_{leakage} V_{dd} + I_{static} V_{dd} \tag{2.2}$$

Pswitching is the switching power. For a properly designed CMOS circuit, this power

component usually dominates, and may account for more than 90% of the total power [10,

15]. α denotes the transition activity factor, which is defined as the average number of

power consuming transitions that is made at a node in one clock period. Vs is the voltage

swing, where in most cases it is the same as the supply voltage, Vdd. CL is the node


capacitance. It can be broken into three components, the gate capacitance, the diffusion

capacitance, and the interconnect capacitance. The interconnect capacitance is in general

a function of the placement and routing. fck is the frequency of clock. The switching power

for static CMOS is derived as follows [11].

During the low to high output transition, the path from Vdd to the output node is con-

ducting to charge CL. Hence, the energy provided by the supply source is

$$E = \int_0^{\infty} V_{dd}\, I(t)\, dt \tag{2.3}$$

where $I(t) = \frac{V_s}{R}\, e^{-t/RC_L}$ is the current drawn from the supply. Here, R is the resistance of the path between Vdd and the output node. Therefore, the energy can be rewritten as

$$E = C_L V_{dd} V_s \tag{2.4}$$

During the high to low transition, no energy is supplied by the source. Hence, the average

power consumed during one clock cycle is

$$P = \frac{E_{per\ cycle}}{T} = C_L V_{dd} V_s f_{ck} \tag{2.5}$$

Eq. (2.4) and Eq. (2.5) estimate the energy and the power of a single gate only. From a

system point of view, α is used to account for the actual number of gates switching at a

point in time.
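As a quick numerical check of Eq. (2.3) and Eq. (2.4), the following Python sketch integrates the supply current of a charging RC node and compares the result with the closed form CL Vdd Vs. The component values (10 fF, 1.8 V, 5 kΩ) and the function name `supply_energy` are illustrative assumptions, not values taken from this thesis.

```python
import math

def supply_energy(c_load, v_dd, v_swing, r_path, steps=20_000, t_end_taus=40.0):
    """Numerically integrate V_dd * I(t) over a low-to-high output transition.

    I(t) = (V_s / R) * exp(-t / (R * C_L)), as in the text below Eq. (2.3).
    The integral is truncated after t_end_taus time constants.
    """
    tau = r_path * c_load
    dt = t_end_taus * tau / steps
    energy = 0.0
    for k in range(steps):
        t = (k + 0.5) * dt  # midpoint rule
        current = (v_swing / r_path) * math.exp(-t / tau)
        energy += v_dd * current * dt
    return energy

# Assumed example values (hypothetical, for illustration only)
C_L = 10e-15   # node capacitance in farads
V_DD = 1.8     # supply voltage in volts
V_S = 1.8      # rail-to-rail swing in volts
R = 5e3        # path resistance in ohms

numeric = supply_energy(C_L, V_DD, V_S, R)
closed_form = C_L * V_DD * V_S  # Eq. (2.4)
print(f"numeric = {numeric:.6e} J, closed form = {closed_form:.6e} J")
```

Under these assumptions the integral agrees with Eq. (2.4) and does not depend on R, which only sets how quickly the node charges.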

Pshort−circuit is the short-circuit power. It is a type of dynamic power and is typically

much smaller than Pswitching [11]. Isc is known as the direct-path short circuit current.

It refers to the conducting current from power supply directly to ground when both the

NMOS and PMOS transistors are simultaneously active during switching [10].

Pleakage is the leakage power. Ileakage refers to the leakage current. It is primarily deter-

mined by fabrication technology considerations and originates from two sources. The first

is the reverse leakage current of the parasitic drain-/source-substrate diodes. This current


is in the order of a few femtoamperes per diode, which translates into a few microwatts

of power for a million transistors. The second source is the subthreshold current of MOS-

FETs, which is in the order of a few nanoamperes. For a million transistors, the total

subthreshold leakage current results in a few milliwatts of power [10, 11].

Pstatic is the static power and Istatic is the static current. This current arises from circuits that have a constant source of current between the power supplies, such as bias circuitry and pseudo-NMOS logic families [10]. For the CMOS logic family, power is dissipated only when the circuits switch, with no static power consumption [15].

Energy is independent of the clock frequency. Reducing the frequency lowers the power consumption but does not change the energy required to perform a given operation, as indicated by Eq. (2.4) and Eq. (2.5). It is important to note that battery life is determined by energy consumption, whereas heat dissipation considerations are related to power consumption [11].
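The distinction between power and energy can be illustrated with a small Python sketch based on the switching term of Eq. (2.2). The activity factor, capacitance, clock frequencies, and cycle count below are hypothetical values chosen only for the example.

```python
import math

def switching_power(alpha, c_load, v_dd, v_swing, f_clk):
    """Switching term of Eq. (2.2), in watts."""
    return alpha * c_load * v_dd * v_swing * f_clk

def energy_for_task(alpha, c_load, v_dd, v_swing, f_clk, n_cycles):
    """Energy for a task that needs n_cycles clock cycles: power times duration."""
    duration = n_cycles / f_clk
    return switching_power(alpha, c_load, v_dd, v_swing, f_clk) * duration

# Hypothetical operating point
ALPHA, C_L, V_DD, N_CYCLES = 0.25, 1e-12, 1.8, 1000

p_fast = switching_power(ALPHA, C_L, V_DD, V_DD, 100e6)
p_slow = switching_power(ALPHA, C_L, V_DD, V_DD, 50e6)
e_fast = energy_for_task(ALPHA, C_L, V_DD, V_DD, 100e6, N_CYCLES)
e_slow = energy_for_task(ALPHA, C_L, V_DD, V_DD, 50e6, N_CYCLES)

print(f"power:  {p_fast:.3e} W at 100 MHz, {p_slow:.3e} W at 50 MHz")
print(f"energy: {e_fast:.3e} J at 100 MHz, {e_slow:.3e} J at 50 MHz")
```

Halving the clock halves the power but doubles the run time, so the energy for the task is unchanged in this model.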

Four factors influence the power dissipation of CMOS circuits: technology, circuit design style, architecture, and algorithm. The challenge of meeting the conflicting goals of high performance and low-power system operation has motivated the development of low-power process technologies and the scaling of device feature sizes. Design considerations for low power should be carried out at all levels of the design hierarchy, namely 1) fundamental, 2) material, 3) device, 4) circuit, and 5) system [10, 15].

2.1 Low Voltage

Power consumption is proportional to both the voltage swing (Vs) and the supply voltage (Vdd), as indicated in Eq. (2.5). For most CMOS logic families, the swing is typically rail-to-rail. Hence, power consumption is proportional to the square of the supply voltage, Vdd. Lowering Vdd is therefore an efficient way to reduce both energy and power, presuming that the signal voltage swing can be freely chosen. This comes, however, at the expense of circuit delay. The delay, td, can be shown to be proportional to $V_{dd}/(V_{dd} - V_T)^{\gamma}$. The exponent γ is between 1 and 2. It tends to be closer to 1 for MOS transistors in the deep sub-micrometer region, where carrier velocity saturation may occur, and increases toward 2 for longer-channel transistors [11].
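The effect of the exponent γ on the delay penalty can be sketched in Python using the proportionality quoted above. The threshold voltage (0.5 V) and the reference supply (1.8 V) are assumed values for illustration, and only delay ratios are meaningful because the proportionality constant is unknown.

```python
def relative_delay(v_dd, v_t, gamma, v_ref):
    """Delay at v_dd relative to the delay at v_ref, using t_d ~ V_dd / (V_dd - V_T)^gamma."""
    if v_dd <= v_t or v_ref <= v_t:
        raise ValueError("model only applies for supply voltages above V_T")

    def raw(v):
        return v / (v - v_t) ** gamma

    return raw(v_dd) / raw(v_ref)

# Hypothetical device parameters
V_T = 0.5
V_REF = 1.8

for gamma in (1.0, 1.5, 2.0):
    slowdown = relative_delay(0.9, V_T, gamma, V_REF)
    print(f"gamma={gamma}: delay at 0.9 V is {slowdown:.2f}x the delay at {V_REF} V")
```

In this model the slowdown from a given voltage reduction is smaller for γ close to 1 than for γ close to 2.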

The current technology trends are towards smaller feature sizes and lower supply voltages. Lowering Vdd leads to increased circuit delays and therefore lower functional throughput. Smaller feature size, however, reduces gate delay, since gate speed is inversely proportional to the square of the effective channel length of the devices. In addition, thinner gate oxides impose voltage limitations for reliability reasons. Hence, the supply voltage must be lowered for smaller geometries. The net effect is that circuit performance improves as CMOS technologies scale down, despite the Vdd reduction. New technology has therefore made it possible to meet the conflicting requirements of low power and high throughput [10, 11, 15].

The techniques currently used to scale the supply voltage include optimizing the technology and device reliability, trading off area for low power in an architecture-driven approach, and exploiting concurrency through algorithmic transformations [10]. Reference [11] has shown that the optimum supply voltage for CMOS technology is equal to $3V_{th}/(3-\gamma)$. Hence, voltage scaling is limited by the threshold voltage Vth.
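The text does not state which figure of merit this optimum refers to. One assumption that reproduces the expression is to minimize the energy-delay product, with energy proportional to the square of Vdd and delay proportional to Vdd/(Vdd − Vth)^γ as quoted earlier in this section. The Python sketch below checks that assumption numerically with a simple grid search; the threshold voltage of 0.5 V and the function names are hypothetical.

```python
def energy_delay_product(v_dd, v_th, gamma):
    """Relative energy-delay product: V_dd^2 * V_dd / (V_dd - V_th)^gamma."""
    return v_dd ** 3 / (v_dd - v_th) ** gamma

def optimum_vdd_numeric(v_th, gamma, v_max=5.0, step=1e-3):
    """Grid search for the supply voltage that minimizes the energy-delay product."""
    n = int(round((v_max - v_th) / step))
    candidates = [v_th + k * step for k in range(1, n + 1)]
    return min(candidates, key=lambda v: energy_delay_product(v, v_th, gamma))

def optimum_vdd_formula(v_th, gamma):
    """Closed form quoted from Reference [11]: 3 * V_th / (3 - gamma)."""
    return 3.0 * v_th / (3.0 - gamma)

V_TH = 0.5  # assumed threshold voltage in volts

for gamma in (1.0, 1.5, 2.0):
    numeric = optimum_vdd_numeric(V_TH, gamma)
    formula = optimum_vdd_formula(V_TH, gamma)
    print(f"gamma={gamma}: grid search {numeric:.3f} V, formula {formula:.3f} V")
```

Under this assumption the grid search and the closed form agree to within the grid step, and the optimum scales directly with Vth.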

In applications such as digital processing, where the throughput is of more concern

than the speed, architecture can be designed to reduce the supply voltage at the expense

of speed without throughput degradation. Hence, the performance of the system can be

maintained. This can be achieved by using parallelism and/or pipelining. Both techniques

will be discussed next [11].


2.2 Parallelism

Power dissipation increases linearly with frequency, as shown in Eq. (2.2). The system clock typically switches at the highest frequency on the chip. Hence, clock switching has a considerable impact on power. Clock power can be as much as twice the logic power for static logic and three times the logic power for dynamic logic. To minimize clock power, a given system should be operated at the minimum frequency necessary to attain the required level of performance. Logic parallelism allows operation at a lower frequency while maintaining the desired throughput. It also allows for a reduced supply voltage [15].

Consider a generic data path, which consists of two latches which synchronize the data flow, as depicted in Figure 2.1. The energy consumed by this implementation is $E = C_L V_{dd}^2$.

Figure 2.1: A generic data path of a synchronous system

Now, consider the parallel structure illustrated in Figure 2.2. If the logic is duplicated

n times, then the input can be fed to each logic block at a lower frequency. In fact, the

latches to each block are clocked at one n-th the frequency of the latch in Figure 2.1. The

output of the parallel blocks is sent to the output latch through a multiplexer.

Figure 2.2: Parallel architecture operating at lower clock rate and with reduced Vdd

In this implementation, the clock rate of each logic block is reduced to f/n. Hence, the delay requirement can be relaxed by a factor of n, while the throughput remains the same as in the non-parallel case. Furthermore, by allowing for increased block delay, the supply voltage can be scaled down by a factor of n. The optimum supply voltage for the parallel architecture, Vdd(//), neglecting Vth, is expressed as

$$V_{dd(//)} = \frac{V_{dd}}{n} \tag{2.6}$$

And if the power overhead due to the multiplexer is neglected, the energy for the parallel

architecture, E(//), can be expressed as

$$E_{(//)} = C_L V_{dd(//)}^2 = E\,\frac{V_{dd(//)}^2}{V_{dd}^2} = \frac{E}{n^2} \tag{2.7}$$

where E is the energy consumed by the generic, non-parallel structure.
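The arithmetic of Eq. (2.6) and Eq. (2.7) can be tabulated with a few lines of Python. This is only a sketch of the idealized model: it neglects Vth and the multiplexer overhead, as the text does, and the 100 MHz clock is an assumed example value.

```python
def parallel_energy_ratio(n):
    """E(//) / E from Eq. (2.7), using V_dd(//) = V_dd / n from Eq. (2.6)."""
    voltage_scale = 1.0 / n
    return voltage_scale ** 2

def per_block_clock(f_clk, n):
    """Clock frequency seen by each of the n duplicated logic blocks."""
    return f_clk / n

F_CLK = 100e6  # assumed clock frequency of the non-parallel data path, in Hz

for n in (1, 2, 4):
    block_clock = per_block_clock(F_CLK, n)
    throughput = n * block_clock  # results per second from all blocks combined
    print(f"n={n}: block clock {block_clock / 1e6:.0f} MHz, "
          f"throughput {throughput / 1e6:.0f} M results/s, "
          f"energy ratio {parallel_energy_ratio(n):.4f}")
```

The combined throughput stays at the original rate for every n, while the energy ratio falls as 1/n² in this idealized model.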

This method works well for computationally intensive functions; however, it incurs a large area cost because it requires duplicating hardware. Using this architecture-driven voltage scaling strategy, parallel logic can provide the same throughput as the original logic while operating at greatly reduced frequency and voltage [15]. Eq. (2.7) indicates that the energy saving in a parallel architecture is proportional to the square of the voltage scaling factor [11]. Parallelism can be applied at different levels of a design: system, architecture, circuit/logic, device, etc. The DET flip-flop is considered a device-level implementation of parallelism.

2.3 Pipeline

Pipelining is another approach to relax the speed requirement without degrading the throughput. In this approach, a logic block is broken down into i sub-logic blocks, and latches are inserted between them, as shown in Figure 2.3. The delay of each sub-logic block is therefore reduced, which allows the supply voltage, Vdd, to be reduced. It should be noted that the pipeline approach may offer power savings comparable to the parallel architecture but with less area overhead [11]. In some applications, the two approaches, pipelining and parallelism, can be applied together to achieve higher power/energy savings.

Figure 2.3: Pipeline architecture

2.4 Switching Activity Reduction

CMOS circuits dissipate power only when switching; it is therefore important to minimize the switching activity in low-power applications. Switching decreases when the data rate is low. Switching activity can also be reduced by circuit and architectural optimizations that exploit data correlation. For instance, human speech exhibits a higher correlation than random data [10]. Switching activity can be reduced through algorithmic optimization, architecture optimization, logic topology, and circuit optimization, which are discussed below.

Algorithmic optimization depends heavily on the application and on the characteristics of the data. Furthermore, the data representation may have a significant impact on the switching activity. Recent research shows that using a Gray code for address bits, where the data changes sequentially, results in fewer transitions than using binary code. Moreover, sign-magnitude notation is more efficient than two's complement notation for data that changes sign frequently. A change in sign causes transitions of all the most significant bits in the two's complement representation, whereas only the sign bit changes in sign-magnitude notation.
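Both effects can be counted directly. The Python sketch below tallies the bit flips for a 4-bit address counter in binary and Gray code, and for an 8-bit value alternating between +1 and −1 in two's complement and sign-magnitude notation. The word widths and test sequences are assumptions chosen for illustration.

```python
def to_gray(n):
    """Reflected binary Gray code of a non-negative integer."""
    return n ^ (n >> 1)

def bit_toggles(codes):
    """Total number of bit flips between consecutive code words."""
    return sum(bin(a ^ b).count("1") for a, b in zip(codes, codes[1:]))

def twos_complement(x, bits=8):
    """Two's complement encoding of x as an unsigned bit pattern."""
    return x & ((1 << bits) - 1)

def sign_magnitude(x, bits=8):
    """Sign-magnitude encoding: top bit is the sign, remaining bits hold |x|."""
    sign = 1 << (bits - 1) if x < 0 else 0
    return sign | abs(x)

# Sequential addresses 0..15 on a 4-bit bus
addresses = list(range(16))
binary_toggles = bit_toggles(addresses)
gray_toggles = bit_toggles([to_gray(a) for a in addresses])

# A small signal that changes sign on every sample
samples = [1, -1, 1, -1]
tc_toggles = bit_toggles([twos_complement(s) for s in samples])
sm_toggles = bit_toggles([sign_magnitude(s) for s in samples])

print(binary_toggles, gray_toggles)  # 26 15
print(tc_toggles, sm_toggles)        # 21 3
```

For these sequences, Gray coding flips exactly one bit per address increment, and sign-magnitude flips only the sign bit at each sign change.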

Architecture optimization can be achieved through delay balancing, precomputation logic, and power management schemes. Balanced tree topologies are often used to balance path delays and hence reduce glitching. Precomputation logic predicts the output signal one clock cycle ahead with minimal circuit overhead. It generally allows only a small subset of the inputs to pass to the combinational blocks, and hence minimizes the switching activity of the system as a whole. Figure 2.4 shows a latch with clock gating. The XOR gate compares the values of D and Q. If D and Q are the same, the output of the XOR gate is 0, and the AND gate prevents the clock from triggering the latch. If D and Q differ, the XOR-AND logic passes the clock signal. This scheme eliminates unnecessary clock switching internal to the latch.
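The XOR-AND gating described above can be mimicked with a cycle-level Python model. This is a behavioural sketch only: it treats each clock cycle as one step and ignores glitches and the timing of the enable signal, and the input sequence is a made-up example.

```python
def gated_latch(data, q_init=0):
    """Cycle-level model of a latch whose clock is gated by (D XOR Q).

    Returns the number of clock pulses that reach the latch and the
    sequence of stored values after each cycle.
    """
    q = q_init
    pulses = 0
    outputs = []
    for d in data:
        enable = d ^ q       # XOR gate: 1 only when D differs from Q
        if enable:           # AND gate passes the clock pulse to the latch
            pulses += 1
            q = d            # latch captures the new value
        outputs.append(q)
    return pulses, outputs

data = [0, 0, 1, 1, 1, 0, 0, 1]
pulses, outputs = gated_latch(data)
print(f"{pulses} gated clock pulses for {len(data)} cycles")
print(outputs == data)
```

In this example the latch still follows the data, but it is clocked only in the cycles where the stored value actually changes.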

Figure 2.4: Latch with clock gating [2]

Power management is one of the most effective approaches to switching activity reduction. This power-down method puts circuits into a sleep mode when they are idle. It can be applied at different levels of the hierarchy, from the module to the chip level, and even at the printed circuit board level. Circuit optimization may come down to the choice of logic family as well as gate topology. The selection is also application oriented [11].

2.5 Switching Capacitance Reduction

Energy consumption is proportional to the switching capacitance, as shown in Eq. (2.4). Switching capacitance consists of transistor parasitic capacitance as well as wire capacitance from metal interconnects. In general, the lower the transistor count, the smaller the parasitic capacitances of the gate oxide and the source/drain diffusions. The Complementary Passgate Logic (CPL) family has the lowest transistor count compared to the dynamic and static logic families. The interconnect capacitance can be further divided into three main components, as shown in Figure 2.5: parallel plate capacitance, fringing field capacitance, and wire-wire capacitance. These three are interrelated through the width (W) and height (H) of the wire, as well as the thickness of the dielectric (tox).

As tox increases, the parallel plate capacitance decreases. But when tox becomes comparable to W and H, the fringing field effect dominates. When W is much larger than tox, the parallel plate capacitance dominates. But when W is smaller than H, the wire-wire capacitance dominates. The optimum ratio for minimum capacitance is obtained when W/H is 1.75.

Figure 2.5: Interconnect Capacitance

Moreover, for low-power design, the rule is to size up only the transistors on critical paths to meet the speed requirement, and to keep the remaining transistors at minimum size as far as possible. Layout optimization is also crucial. An appropriate layout style minimizes not only the diffusion capacitances but also the interconnect length, and hence leads to significant power savings [10, 11].

Chapter 3

Dual Edge Triggered Flip-Flop

In a synchronous system, operations and data sequences take place with a fixed and predetermined time relationship. The timing of computations is controlled by flip-flops and latches together with a global clock, as shown in Figure 2.1. Flip-flops and latches are clocked storage elements, which store the values applied to their inputs. They are classified according to their behaviour during the clock phases. A latch is level sensitive: it is transparent and propagates its input to the output during one clock phase (clock low or high), while holding its value during the other clock phase. A flip-flop is edge triggered: it captures its input and propagates it to the output at a clock edge (rising or falling), while keeping the output constant at all other times. The design of these clocked storage elements is highly dependent on the clocking strategy and circuit topology [9, 15, 16]. This research focuses on synchronous systems with an edge-triggered clocking strategy; hence, only flip-flops are discussed. In particular, dual edge-triggered flip-flops are introduced and explored.
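The difference between single-edge and dual-edge triggering can be sketched with a tick-based behavioural model in Python. The model is an idealization under stated assumptions: time is discretized into ticks, each data word is held for four ticks, setup and hold times are ignored, and the function names are hypothetical.

```python
def square_clock(period, n_ticks):
    """Clock sampled once per tick: low for the first half period, then high."""
    half = period // 2
    return [(t // half) % 2 for t in range(n_ticks)]

def capture(clock, data, dual_edge):
    """Return the values captured at active clock edges.

    The value captured at an edge between tick t-1 and tick t is data[t-1],
    i.e. the input that was stable just before the edge.
    """
    captured = []
    for t in range(1, len(clock)):
        rising = clock[t - 1] == 0 and clock[t] == 1
        falling = clock[t - 1] == 1 and clock[t] == 0
        if rising or (dual_edge and falling):
            captured.append(data[t - 1])
    return captured

def clock_transitions(clock):
    """Number of level changes in the clock waveform."""
    return sum(1 for a, b in zip(clock, clock[1:]) if a != b)

WORD_TICKS = 4                      # assumed duration of one data word, in ticks
words = [3, 1, 4, 1, 5, 9, 2, 6]    # arbitrary example data
data = [w for w in words for _ in range(WORD_TICKS)]
n_ticks = len(data) + 1

set_clock = square_clock(WORD_TICKS, n_ticks)       # clock frequency f
det_clock = square_clock(2 * WORD_TICKS, n_ticks)   # clock frequency f/2

set_out = capture(set_clock, data, dual_edge=False)
det_out = capture(det_clock, data, dual_edge=True)

print(set_out == words, det_out == words)
print(clock_transitions(set_clock), clock_transitions(det_clock))
```

Both models capture every word of the stream, while the dual-edge version needs half as many clock transitions for the same throughput.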


3.1 Types of Flip-flops

A storage element generally stores its value as charge on a capacitor. A CMOS flip-flop can be static or dynamic, depending on how it retains its value against charge leakage. A static flip-flop retains its value using positive feedback, while a dynamic flip-flop requires periodic refreshing of the charge.

Besides the method of retaining the stored value, flip-flops are also classified by their topologies. Three types are briefly discussed in the following: master-slave flip-flops, pulse-based flip-flops, and amplifier-based flip-flops.

The master-slave flip-flop is the most commonly used flip-flop topology in low-power applications. It is composed of a master latch cascaded with a slave latch. These two latches are active during opposite clock phases.

The pulse-based flip-flop is popular for its soft clock edge property, which allows time borrowing and alleviates the clock skew penalty, just as a level-sensitive latch does. It also provides superior latency and is capable of incorporating complex logic. The hybrid latch flip-flop (HLFF) and the semidynamic flip-flop (SDFF) are two practical examples of pulse-based flip-flops. The HLFF is a latch with a brief transparent pulse derived from the global clock edge [17]. The SDFF is composed of a dynamic stage coupled to a static stage [18, 19].

The amplifier-based flip-flop is mainly designed as a de-skewing element [16]. The sense amplifier-based flip-flop (SAFF) is an example of an amplifier-based flip-flop. It incorporates a precharged sense amplifier in the first stage to generate a negative pulse, and a set-reset (SR) latch in the second stage to capture and hold the result [20].

For the critical paths of a design, a small flip-flop delay is crucial while power con-

sumption is a secondary concern. Therefore, pulse-based flip-flops, which have very short

latency, are appropriate for these types of applications. For paths that are not critical in

the design, lower power consumption can be achieved by employing static flip-flops [7].


With low power and low voltage applications in mind, static flip-flops are the focus of this analysis. Static flip-flops are of two main types: those built from logic gates and those built from transmission gates. Strictly speaking, transmission gate inputs and outputs have a slightly smaller capacitance than inverter outputs, since transmission gates do not exhibit the Miller effect. However, the gate-based static flip-flop is found to have lower power dissipation than the transmission gate flip-flop, in spite of the fact that it uses considerably more transistors. The reason is that it has fewer clocked transistors [15].

In addition, many other features can be incorporated into a flip-flop to enhance it. For instance, a conditional shutoff capability can reduce the sensitivity to variations in the sampling window of a pulse-based flip-flop. A conditional capture feature can improve statistical power reduction [20]. In Reference [7], pulse-based and master-slave flip-flops are integrated into a new topology to further improve latency and power efficiency.

3.2 Dual Edge-Triggered Flip-flops

As discussed in chapter 2, clock related power is one of the most significant components of

the dynamic power consumption. The total clock related power dissipation in synchronous

VLSI circuits is further divided into three major components [4, 21]: (i) power dissipation

in the clock network, (ii) power dissipation in the clock buffers, and (iii) power dissipation

in the flip-flops. The total power dissipation of the clock network depends on both the

clock frequency and the data rate, and can be computed based on Eq. (2.5):

$$P_{CK} = V_{dd}^{2}\left[f_{CK}\left(C_{CK} + C_{ff,CK}\right) + f_{D}\,C_{ff,D}\right] \qquad (3.1)$$


where

fCK is the clock frequency;

fD is the average data rate;

CCK is the total capacitance seen by the clock network;

Cff,CK is the capacitance of the clock path seen by the flip-flop;

Cff,D is the capacitance of the data path seen by the flip-flop.

From Eq. (3.1), it is obvious that the clock power can be reduced if any of the parameters

on the right hand side of the equation is reduced. The reduction of Vdd is already the

trend of contemporary design, and it has the strongest impact on the PCK expression. By

reducing the overall capacitance of the clock network, CCK , the power dissipation may also

be reduced. For instance, the capacitance can be reduced by proper design of clock drivers

and buffers. Similarly, by reducing the capacitance inside a flip-flop, Cff,CK and Cff,D,

power may also be reduced.

Furthermore, the clock power dissipation is linearly proportional to the clock frequency. Although the clock frequency is determined by the system specifications, it can be reduced with the use of dual edge triggered flip-flops (DETFFs). As its name implies, a DETFF responds to both rising and falling clock edges. Hence, it can halve the clock frequency while keeping the same data throughput. As a result, the power consumption of the clock distribution network is reduced, making DETFFs desirable for low power applications. Even for high performance applications, the usage of DETFFs offers certain benefits: since the clock speed is reduced by a factor of two, one does not need to propagate a relatively high speed clock signal.
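As a rough numerical illustration of Eq. (3.1), the following Python sketch evaluates the clock power at a given clock frequency and again at half that frequency. The function name and all numerical values (capacitances, data rate) are assumptions chosen for illustration, not figures from this study, and the sketch assumes that the DETFF presents the same clock capacitance as the SETFF.

```python
def clock_power(vdd, f_ck, f_d, c_ck, c_ff_ck, c_ff_d):
    """Eq. (3.1): P_CK = Vdd^2 * [f_CK * (C_CK + C_ff,CK) + f_D * C_ff,D]."""
    return vdd ** 2 * (f_ck * (c_ck + c_ff_ck) + f_d * c_ff_d)


# Hypothetical values (assumptions for illustration only).
VDD = 1.8          # supply voltage, V
F_D = 100e6        # average data rate, Hz
C_CK = 10e-12      # clock network capacitance, F
C_FF_CK = 2e-12    # flip-flop clock-path capacitance, F
C_FF_D = 1e-12     # flip-flop data-path capacitance, F

# Same data rate, clock at 1 GHz (single-edge) versus 500 MHz (dual-edge).
p_single = clock_power(VDD, 1e9, F_D, C_CK, C_FF_CK, C_FF_D)
p_dual = clock_power(VDD, 500e6, F_D, C_CK, C_FF_CK, C_FF_D)

print(f"single-edge: {p_single * 1e3:.2f} mW")
print(f"dual-edge:   {p_dual * 1e3:.2f} mW")
```

Only the clock-frequency term is halved; the data-dependent term is unchanged, so the saving approaches a factor of two only when the clock term dominates.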

A classic double-edge triggered flip-flop can be implemented as in Figure 3.1. In this classic configuration, two level-sensitive latches of opposite polarity are connected in parallel, and their outputs are multiplexed at the output stage [8].


[Schematic; labelled signals: D, CK, internal nodes Z and Y, multiplexer inputs 1 and 0, output Q]

Figure 3.1: Classic implementation of DETFF
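The behaviour of the classic configuration can be sketched with a small behavioural model in Python. This is an idealized, zero-delay illustration rather than a circuit-level description of Figure 3.1: the class and function names are hypothetical, and the model assumes that the multiplexer always selects the latch that is currently holding its value.

```python
class Latch:
    """Level-sensitive latch: transparent while the clock equals active_level."""

    def __init__(self, active_level):
        self.active_level = active_level
        self.q = 0

    def update(self, d, ck):
        if ck == self.active_level:
            self.q = d      # transparent: output follows the input
        return self.q       # otherwise the stored value is held


class ClassicDETFF:
    """Two opposite-polarity latches in parallel, followed by a multiplexer."""

    def __init__(self):
        self.low_latch = Latch(active_level=0)    # transparent while CK = 0
        self.high_latch = Latch(active_level=1)   # transparent while CK = 1

    def update(self, d, ck):
        held_low = self.low_latch.update(d, ck)
        held_high = self.high_latch.update(d, ck)
        # Select the latch that is currently opaque, so that Q shows the
        # value captured at the most recent clock edge.
        return held_low if ck == 1 else held_high


def sample_on_both_edges(data_bits):
    """Apply one data bit before each clock edge and record Q after the edge."""
    ff = ClassicDETFF()
    ck = 0
    outputs = []
    for d in data_bits:
        ff.update(d, ck)                    # data settles while CK is stable
        ck ^= 1                             # rising or falling clock edge
        outputs.append(ff.update(d, ck))    # Q just after the edge
    return outputs


print(sample_on_both_edges([1, 0, 1, 1, 0]))
```

In this model one data bit is captured per clock edge, that is, two bits per clock period, which is the property that allows the clock frequency to be halved.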

If the clock load of the DETFF is not significantly larger than that of the traditional single-edge triggered flip-flop (SETFF), the power in the clock distribution network is reduced by as much as a factor of two. Because the clock distribution power is a large fraction of the total power of a synchronous VLSI system, significant overall power savings are possible [7].

3.2.1 DETFF implementations

A few previously reported DETFFs along with a newly proposed DETFF are analyzed in

this study for their performance and applicability in low power and low voltage applications.

DETgago

The flip-flop proposed in [3], DETgago, is illustrated in Figure 3.2. Nodes N2, N3, N4, and N5 represent parallel connections between input buffers and latches. The appropriate phase of the clock and its complement connect and disconnect the input buffers and storage elements from the power supply and ground. When CK is high, the top input buffer and the bottom latch are active while the bottom input buffer and the top latch are inactive, and vice versa. As a result, it has potential for low power applications. Although the complete isolation of the active and inactive parts of the circuit helps save power, it leads to a larger delay.

[Schematic; labelled signals: D, CK, internal nodes N2, N3, N4, and N5, output Q]

Figure 3.2: DETFF proposed in [3], DETgago

DETllopis

Figure 3.3 shows the circuit implementation of DETllopis, proposed in [4], which is a modified version of the DETFF proposed earlier in [22]. Complementary transmission and logic gates are employed here to balance the output rise and fall times of the original DETFF. This modification improves the power and the latency at the expense of an increased total transistor count.

[Schematic; labelled signals: D, CK, Q]

Figure 3.3: DETFF proposed in [4], DETllopis

DETpedram

Pedram et al. proposed a DETFF, DETpedram, as shown in Figure 3.4 [5]. In this DETFF, the roles of the clock enable signal and the input data signal are reversed in the feedback transmission gate loops of the storage latches. This implementation reduces the transistor count at the expense of increased latency. Consider an operation at a falling clock edge. CK is high initially, and the upper latch is active. Now, if D is 1, the input transmission gates pass a 1 to N1 and N2 becomes a 0. N2 then switches on P-passgate M6, which passes a 0 (the complement of CK) onto N1. This creates contention at node N1 and hence increases the delay. However, when CK switches to low, the input transmission gates are closed and M6 now passes a 1 (the complement of CK) onto N1, further reinforcing the stored value. A similar approach can be used to study the case of D equal to 0.

DETstrollo

A DETFF proposed by Strollo et al. in [6] is illustrated in Figure 3.5, DETstrollo. This

DETFF is a pulse-based single latch DETFF. Its operation is based on pulse triggering


[Schematic; labelled signals: D, CK, internal nodes N1 and N2, transistors M5 and M6, output Q]

Figure 3.4: DETFF proposed in [5], DETpedram

that is created by its internal clock buffers. The input passgates in series perform an AND operation to provide a short transparent pulse. Since the input passgates are of N-type, the PMOS transistors are used to restore the value at N1 to full swing. The weakly on PMOS (that is, the PMOS whose gate is tied to ground) can help minimize the parasitic capacitance. The width of the transparent pulse is crucial in this design. Hence, the proper operation of this DETFF is highly dependent on the sizing and the propagation delay of the internal clock buffers.

Proposed DETFF

The proposed DETFF, DETproposed, is illustrated in Figure 3.6. It consists of two sets of

back-to-back inverters as storage elements. A true and complement combination of input

data and clock signals controls the latching of the data value in these storage elements.

When CK is high, node N7 is pre-discharged to 0. If D is 1, then N1 is pulled down to 0.


[Schematic; labelled signals: D, CK, internal clocks CK1 and CK2, internal node N1, output Q]

Figure 3.5: DETFF proposed in [6], DETstrollo

Else if D is 0, then N2 is 0 and N1 becomes a 1. The main advantage of this configuration

is that it avoids stacking PMOS transistors. As a consequence, low voltage and low power

operation becomes feasible.

[Schematic; labelled signals: D, CK, internal nodes N1, N2, N3, N4, N7, and N8, output Q]

Figure 3.6: Proposed DETFF, DETproposed


3.3 Analysis of Dual-Edge Triggered Flip-flops

Several metrics are available for comparative analysis of digital circuits. For example, power consumption, delay and latency, energy or power-delay product (PDP), energy-delay product (EDP), and energy-delay-squared product (ED2P) have been reported by several researchers [23, 24]. In general, a PDP-based metric is appropriate for low power portable systems in which battery life is the primary index of energy efficiency. This is in contrast with EDP or ED2P, where delay is weighted more heavily for high performance systems [23].

In this study, we are primarily interested in DETFF usage for low power, low voltage applications. Therefore, PDP is selected as the primary figure of merit. However, scaling Vdd down reduces the energy consumption but increases the delay, so energy alone is not a sufficient metric for low voltage applications. The energy-delay product, on the other hand, accounts for both the energy and the delay, and hence will be used as well.
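To make the difference between these metrics concrete, the following Python sketch computes PDP, EDP, and ED2P from a power and a delay. The total power and tCQ values are those listed in Table 3.2; the function and variable names are illustrative only.

```python
def energy_metrics(power_w, delay_s):
    """Return (PDP, EDP, ED2P) for a given average power and delay."""
    pdp = power_w * delay_s     # energy per operation, J
    edp = pdp * delay_s         # energy-delay product, J*s
    ed2p = edp * delay_s        # energy-delay-squared product, J*s^2
    return pdp, edp, ed2p


# (total power in uW, tCQ in ps) as listed in Table 3.2.
cells = {
    "DETpedram": (324.9, 233.1),
    "DETllopis": (175.0, 237.5),
    "DETgago": (166.2, 202.2),
    "DETstrollo": (237.8, 214.4),
    "DETproposed": (218.4, 161.3),
}

results = {
    name: energy_metrics(p_uw * 1e-6, t_ps * 1e-12)
    for name, (p_uw, t_ps) in cells.items()
}

for name, (pdp, edp, ed2p) in results.items():
    print(f"{name:12s} PDP = {pdp * 1e15:6.1f} fJ   EDP = {edp:.3e} J*s")

best_pdp = min(results, key=lambda n: results[n][0])
best_edp = min(results, key=lambda n: results[n][1])
print("lowest PDP:", best_pdp, "| lowest EDP:", best_edp)
```

With the Table 3.2 values, the cell with the lowest PDP (DETgago) is not the cell with the lowest EDP (DETproposed), which shows how weighting delay more heavily can change the ranking.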

In particular, this analysis is similar to the comparative technique described by Sto-

janovic et al. [25]. Their study establishes a set of guidelines for objective comparisons of

single edge triggered (SET) latches and flip-flops. The details of power and delay param-

eters employed in this study are defined in the following two subsections:

3.3.1 Power Consumption of a DETFF

There are three main components of power dissipation of a flip-flop:

(a) Internal power dissipation of the flip-flop represents the power consumed by the internal

and input nodes during latching operations, including the power dissipated driving the

output load.


(b) Local clock power dissipation represents the portion of the power dissipated in the

clock buffer that is driving the clock input of the flip-flop.

(c) Local data power dissipation represents the portion of the power dissipated in the logic

gate that is driving the data input of the flip-flop.

The clock power dissipation is determined solely by the clock load of the flip-flop,

whereas the distribution of the internal and data power dissipation is affected by the

structure and operation of the latching element itself as well as the input switching ac-

tivity [20]. The sum of these three components is referred to as the total power (PTOT ).

All three components of power require independent estimation in any comparative analysis

because, inherently, a tradeoff exists between the three. If a comparison is made without taking all three components into account, the results may be misleading.

3.3.2 Timing Characterization of a DETFF

There are two delay parameters of interest in this study. The first delay is the time

measured between the clock edge and the output edge, or tCQ. The second delay is the

time measured between the input data edge and the output edge, or tDQ. The latter

parameter is often referred to as the latency of a flip-flop. It is composed of tCQ and

tDC (the data setup time). Since there are two parallel paths in a classic DETFF, two

characteristics are obtained. One corresponds to the rising edge of the clock and the other

one corresponds to the falling edge of the clock. These two characteristics are independent

of each other and generally are not the same. Hence, the latency of a DETFF is defined


as:

$$t_{d1} = t_{CQ,LH} + t_{DC,HL}$$

$$t_{d2} = t_{CQ,HL} + t_{DC,LH}$$

$$t_{DQ} = \max(t_{d1}, t_{d2})$$

where tCQ,LH and tCQ,HL are the clock-to-output times at the rising and falling clock edges respectively, and tDC,HL and tDC,LH are the setup times required at the falling and rising clock edges respectively [14]. Thus, the latency of a DETFF is computed indirectly as the maximum tDQ over rising and falling data transitions for both rising and falling clock edges.
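The latency definition above can be written as a short Python helper. The function name and the timing values are hypothetical and serve only to show the arithmetic.

```python
def detff_latency(t_cq_lh, t_cq_hl, t_dc_hl, t_dc_lh):
    """Return (t_d1, t_d2, t_DQ) following the definitions above."""
    t_d1 = t_cq_lh + t_dc_hl
    t_d2 = t_cq_hl + t_dc_lh
    return t_d1, t_d2, max(t_d1, t_d2)


# Hypothetical timing values in picoseconds.
t_d1, t_d2, t_dq = detff_latency(
    t_cq_lh=200.0, t_cq_hl=215.0, t_dc_hl=50.0, t_dc_lh=40.0
)
print(f"t_d1 = {t_d1} ps, t_d2 = {t_d2} ps, t_DQ = {t_dq} ps")
```

The slower of the two paths sets the latency, so a DETFF with unbalanced paths is penalized even if one of its paths is fast.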

Latency is significant in a synchronous system because the system’s cycle time depends on the longest delay of the network [16]. However, tCQ is equally important for this comparison, since the setup time is often also a function of the independent variable of the simulations. This is true in the optimization process, where changes in the transistor widths affect the setup time, and in the supply voltage analysis, where the voltage is the independent variable.

For completeness, the setup and hold times, the maximum data rate, and the total transistor width are included as additional flip-flop performance metrics. Total transistor width is used as a measure of the flip-flop area, since the physical layout is not available at this point.

3.3.3 Simulation

A tradeoff between speed and power consumption is often possible, and it is normally

determined by the application. Hence, a given flip-flop can either be optimized for high

performance or low power. However, when both power dissipation and performance are

critical, one needs to determine a design that operates at the optimum. At this optimum


operating point, the power-delay product is minimum, i.e. optimal energy utilization for a

given clock frequency. However, since the optimal delay and power parameters cannot be

obtained in a single step, the energy optimization procedure is often iterative [25].

Testbench

For this study, a 0.18 µm CMOS technology is used. Apart from the supply voltage analysis, all simulations are carried out at nominal conditions: Vdd = 1.8 V and room temperature (25 °C). The clock frequency is kept at 500 MHz, which for DETFFs is equivalent to 1 GHz for SETFFs. Details of the simulation parameters are summarized in Table 3.1.

Table 3.1: Simulation parameters

Technology:          0.18 µm CMOS
MOSFET model:        BSIM3 Level 49
Nominal conditions:  Vdd = 1.8 V, T = 25 °C

Signal   Frequency   Rise time   Fall time   Duty cycle   Sequence length
Clock    500 MHz     100 ps      100 ps      50%          n/a
Data     n/a         100 ps      100 ps      n/a          16 clock cycles

The testbench for this study is illustrated in Figure 3.7. Input buffers are used to provide realistic clock and data signals. A fanout of five inverters (approximately 32 fF in 0.18 µm technology) is used as the nominal load for each DETFF. These inverters, in turn, drive a capacitive load CL of 25 fF each, to simulate the loading from the previous logic stages, as well as the following stages. All measurements are taken over a 16-cycle data sequence of alternating 1’s and 0’s. As mentioned before, the total power dissipation is composed of three components. They are represented and calculated in the testbench as follows:

(a) Local data power represents the portion of power dissipated in the grey inverter driving

the data input of the flip-flop.

(b) Local clock power represents the portion of power dissipated in the black inverter,

which drives the clock input of the flip-flop.

(c) Internal power consumption is the intrinsic power dissipated on switching the internal

nodes of the flip-flop.

[Schematic; labelled signals: Clock, Data, flip-flop terminals CK, D, and Q, load capacitors CL]

Figure 3.7: The simulation testbench for flip-flops

In order to compute the local data power and the local clock power, the flip-flop under test is initially disconnected, and the power dissipated by the grey inverter and by the black inverter is recorded. The flip-flop is then connected to the testbench for performance analysis, and the power consumed by the grey and black inverters is recorded again. The local data power is then calculated as the difference between the two power dissipations of the grey inverter. Likewise, the local clock power is computed as the difference between the two power consumption values of the black inverter.
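The bookkeeping of this differential measurement can be sketched in Python as follows. The class name, function name, and every power value are hypothetical; in practice the four driver measurements and the internal power would come from circuit simulation.

```python
from dataclasses import dataclass


@dataclass
class DriverPower:
    """Average power of one driving inverter, in watts."""

    without_ff: float   # measured with the flip-flop disconnected
    with_ff: float      # measured with the flip-flop connected

    def local(self):
        # Extra power drawn by the driver because of the flip-flop's load.
        return self.with_ff - self.without_ff


def total_power(clock_driver, data_driver, internal):
    """P_TOT = local clock power + local data power + internal power."""
    return clock_driver.local() + data_driver.local() + internal


# Hypothetical measurements, chosen only to exercise the arithmetic.
black = DriverPower(without_ff=40e-6, with_ff=58e-6)   # clock inverter
grey = DriverPower(without_ff=30e-6, with_ff=41e-6)    # data inverter
p_tot = total_power(black, grey, internal=190e-6)

print(f"local clock power: {black.local() * 1e6:.1f} uW")
print(f"local data power:  {grey.local() * 1e6:.1f} uW")
print(f"total power:       {p_tot * 1e6:.1f} uW")
```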

Size Optimization of DETFF

Due to the inter-relationship between transistor sizes, the sizing of the flip-flops is optimized using a line optimization algorithm. Starting with an initial guess in which all the devices are minimum sized, the dimension of the inverter driving the output Q is optimized first. The remaining devices are then optimized one at a time, working backward from the Q output to the D input. This sequence of one-dimensional optimizations is iterated until the power-delay product stops decreasing [26].
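A minimal Python sketch of such a sequence of one-dimensional optimizations is given below. It is not the procedure of [26] itself: the cost function `toy_pdp` is a made-up stand-in for a simulated power-delay product, the candidate widths are arbitrary, and all names are hypothetical. In a real flow each cost evaluation would be a circuit simulation.

```python
def line_optimize(cost, widths, candidates, max_sweeps=20):
    """Minimize cost(widths) by repeated one-dimensional searches.

    widths     -- initial widths, ordered from the output (Q) stage back to D
    candidates -- allowed widths tried for each device in turn
    """
    widths = list(widths)
    best = cost(widths)
    for _ in range(max_sweeps):
        improved = False
        for i in range(len(widths)):          # Q-side device first
            for w in candidates:
                trial = widths[:i] + [w] + widths[i + 1:]
                c = cost(trial)
                if c < best:
                    best, widths, improved = c, trial, True
        if not improved:                      # cost stopped decreasing
            break
    return widths, best


def toy_pdp(widths, c_load=8.0, p_fixed=10.0):
    """Made-up cost: power grows with width, stage delay falls with width."""
    power = p_fixed + sum(widths)
    delay = c_load / widths[0]                # output stage drives the load
    for driven, driver in zip(widths, widths[1:]):
        delay += driven / driver              # each stage loads its driver
    return power * delay


start = [1, 1, 1]                             # all devices minimum sized
sized, pdp = line_optimize(toy_pdp, start, candidates=[1, 2, 3, 4, 6, 8])
print("initial cost:", toy_pdp(start))
print("optimized widths:", sized, "cost:", pdp)
```

Because each pass changes one width at a time, the result is only guaranteed to be a point that no single-width change can improve, which is why the sweeps are repeated until the cost stops decreasing.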

During the size optimization, a data transition probability, α, of 0.5 is assumed. The

critical path is first identified. The width of the NMOS transistor, wn, is then selected

as the parameter of interest. The sizing of the PMOS transistors that are located on the

critical path is kept at a certain ratio with respect to wn. This ratio is determined by

balancing the rising and falling edges of the output waveform of a test inverter. Note that

this ratio changes with NMOS sizing. Moreover, transmission gates and transistors that

are not located on the critical path are implemented with relatively smaller sizes.

Delay and power are measured as functions of wn. The measured power is the sum of

all three components discussed earlier, whereas the delay is expressed by tCQ. Once the

power and delay measurements are obtained, the PDPCQ is calculated as the product of the

power and delay. Subsequently, PDPCQ is plotted as a function of tCQ. The initial PDPCQ point is taken as the minimum point of the PDPCQ versus tCQ curve. If no minimum point exists, the operating point with the minimum tCQ for a given energy is selected as the initial PDPCQ point to begin the optimization process. Once the initial PDPCQ point is determined for each flip-flop, the flip-flops are further optimized using the iterative line optimization method, until the best PDPCQ and PDPDQ are found.

Dependence on Data Transition Probability/Activity α

Generally in a VLSI circuit, each flip-flop could have input data with different transition

probability, α. As a consequence, it is interesting to observe the behaviour of flip-flop

power-delay product as a function of α [26].

Furthermore, the power saving of using DETFF is strongly dependent on α. Recall

that power dissipation of a CMOS circuit is

$$P_{D} = \frac{1}{2} f_{D} V_{dd}^{2} \sum_{j} \alpha_{j} C_{j} \qquad (3.2)$$

The power dissipation due to the switching of the clock nodes is

$$P_{CK} = f_{CK} V_{dd}^{2} C_{CK} \qquad (3.3)$$

In a SET-based system, fD,SET = fCK,SET, as there is at most one signal change in one clock period in the absence of glitching, whereas in a DET-based system, fD,DET = 2 × fCK,DET, as there are at most two signal changes in one clock period. For a fixed data throughput, fD,SET = fD,DET = f. Hence,

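$$P_{SET} = f V_{dd}^{2} C_{CK,SET} + \frac{1}{2} f V_{dd}^{2} \sum_{j} \alpha_{j,SET} C_{j,SET} \qquad (3.4)$$

$$P_{DET} = \frac{1}{2} f V_{dd}^{2} C_{CK,DET} + \frac{1}{2} f V_{dd}^{2} \sum_{j} \alpha_{j,DET} C_{j,DET} \qquad (3.5)$$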

In addition, it is interesting to note that the transition probability of the internal nodes of a classic DETFF is the same as that of a SETFF. Consider the classic DETFF illustrated in Figure 3.1 and a master-slave SETFF. As shown in Figure 3.8, the transition probability


[Timing diagram: waveforms of D, CK, Z, Y, and Q]

Figure 3.8: DETFF internal node transitions given input sequence of 1010101000

of the DETFF’s internal nodes is equal to that of the D input. Hence, αY = αZ = αQ = αD. Similarly, for the master-slave SETFF, αX = αQ = αD.

Therefore, (3.4) and (3.5) can be rewritten as:

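$$P_{SET} = f V_{dd}^{2} C_{CK,SET} + \frac{1}{2} f V_{dd}^{2} \alpha_{D} C_{\alpha,SET} \qquad (3.6)$$

$$P_{DET} = \frac{1}{2} f V_{dd}^{2} C_{CK,DET} + \frac{1}{2} f V_{dd}^{2} \alpha_{D} C_{\alpha,DET} \qquad (3.7)$$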

where Cα,SET = CD,SET + CX,SET + CQ,SET and Cα,DET = CD,DET + CY,DET + CZ,DET + CQ,DET. The power saving is characterized by the ratio, η, between the DETFF and SETFF power dissipation:

$$\eta = \frac{P_{DET}}{P_{SET}} = \frac{C_{CK,DET} + \alpha_{D} C_{\alpha,DET}}{2 C_{CK,SET} + \alpha_{D} C_{\alpha,SET}} \qquad (3.8)$$

As demonstrated, if α is low, then the reduced clock frequency of the DETFF may result in significant power savings. For a large number of low power applications, the transition activity of the input data is indeed approximately one-tenth of the clock signal activity [7]. For high input activity, the αDCα terms dominate. In fact, the total switched capacitance on the clock line is actually larger in DETFF structures than in SETFF structures. This requires larger buffers in the clock tree and hence increases the overall power dissipated by the clock. Therefore, the energy saving from using DETFFs is due to the halved clock frequency and not to the value of the clock capacitance [26].
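The dependence of η on the data activity in Eq. (3.8) can be explored with a few lines of Python. The capacitance values below are hypothetical (in arbitrary units) and assume, in line with the remark above, that the DETFF has the larger clock and data-dependent capacitances; they are not measurements from this study.

```python
def power_ratio(alpha_d, c_ck_det, c_a_det, c_ck_set, c_a_set):
    """Eq. (3.8): eta = P_DET / P_SET for a fixed data throughput."""
    return (c_ck_det + alpha_d * c_a_det) / (2 * c_ck_set + alpha_d * c_a_set)


# Hypothetical capacitances in arbitrary units (assumptions only).
C_CK_DET, C_A_DET = 12.0, 30.0
C_CK_SET, C_A_SET = 10.0, 20.0

for alpha in (0.0, 0.1, 0.5, 1.0):
    eta = power_ratio(alpha, C_CK_DET, C_A_DET, C_CK_SET, C_A_SET)
    print(f"alpha = {alpha:.1f}  eta = {eta:.3f}")
```

With these assumed values η is 0.6 at α = 0 and exceeds 1 at α = 1, which illustrates that the benefit of the halved clock frequency shrinks as the data-dependent term grows.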

Once the DETFFs are optimized, they are simulated at different data activity rates: 0 (all zeros and all ones), 0.5, and 1. This is to determine the efficiency and performance of each DETFF for a wide range of data activities. As discussed above, the total power consumption of a DETFF consists of three separate components. Owing to the diverse design styles, these components can vary from flip-flop to flip-flop. As a result, the total power consumption of a flip-flop may change depending on the data activity. Therefore, it is desirable to simulate the various DETFFs with different data activities. It is, however, worth noting that this behaviour is independent of the α value assumed in the optimization procedure.

Influence of Supply Voltage

The nominal power supply voltage for 0.18 µm technology is 1.8 V. However, for bat-

tery operated systems, the power supply voltage is reduced drastically to lower the power

consumption. Also, an efficient low voltage flip-flop should demonstrate a lower rate of

incremental delay as the power supply voltage is reduced. Therefore delay, power, and

energy of all the DETFFs are computed as a function of supply voltage. Again, since the

setup time increases with reduced supply voltage, the simulations require relaxed setup

time conditions to provide results over a wide range. Hence in this analysis, tCQ and

PDPCQ are determined for precise results.


3.4 Results

All five DETFFs under study have been optimized as described in Section 3.3.3. It is

found that the delay decreases as the width increases until the minimum point is reached,

if such a point exists. At this point, any further increase in the width does not result in

any further appreciable decrease in the delay. On the contrary, owing to the increased

parasitics associated with the increased width, the delay may increase. On the other hand,

for all the DETFFs, PTOT increases monotonically as the width increases. PDPCQ is

then determined by multiplying PTOT by tCQ for the corresponding width. Furthermore,

by combining the tCQ and the PDPCQ curves, we can plot PDPCQ versus tCQ, which is

illustrated in Figure 3.9. These curves represent the first step of the optimization process.

The slopes of the PDPCQ curves in Figure 3.9 indicate the sensitivity of the flip-flops to delay as the width varies. When tCQ is small, PDPCQ is large since the total power dominates the product at larger widths. As the width decreases, the power consumption decreases; however, the delay is inversely related to the width. This remains true until the local minimum is reached. At this point, both the power and delay increase because of the weakened driver strength. Figure 3.9 also depicts the spread of DETFF performance in terms of PDPCQ and delay. As shown, the performance of the DETFFs studied is comparable. PDPCQ ranges from 30 fJ to 75 fJ and delay ranges from 200 ps to 300 ps.

Table 3.2: Optimal parameters for DETFFs studied [1]

Cell          Clock power  Data power  Internal power  Total power  tCQ    PDPCQ  tDQ    PDPDQ
              (µW)         (µW)        (µW)            (µW)         (ps)   (fJ)   (ps)   (fJ)
DETpedram     17.6         65.6        241.7           324.9        233.1  75.7   245.3  79.7
DETllopis     17.0         4.6         153.4           175.0        237.5  41.6   312.3  54.7
DETgago       23.2         11.6        131.4           166.2        202.2  33.6   262.2  43.6
DETstrollo    30.0         13.4        194.5           237.8        214.4  51.0   235.3  56.0
DETproposed   18.1         10.9        189.4           218.4        161.3  35.2   230.5  50.3


[Plot: PDPCQ (J) versus tCQ (s) under relaxed setup time condition, with one curve each for DETpedram, DETllopis, DETgago, DETstrollo, and DETproposed]

Figure 3.9: PDPCQ vs tCQ, used to determine the initial optimization point [1]

Table 3.3: Performance characteristics for DETFFs studied [1]

Cell          Setup (ps)   Hold (ps)   Max. data rate (GHz)   Total width (µm)
DETpedram     17.9         34.0        1.75                   23.0
DETllopis     80.3         -15.7       2.22                   37.7
DETgago       49.5         -5.7        2.63                   44.6
DETstrollo    -41.4        85.9        2.22                   40.5
DETproposed   76.9         -5.1        1.56                   56.1

The initial optimization points are then extracted from Figure 3.9 and an iterative

process is used to complete the optimization process. The goal of the optimization is to


minimize the energy consumption, PDPDQ. The different DETFFs are compared in terms

of power, delay and energy. The final optimal parameters are summarized in Table 3.2.

The first column of Table 3.2 lists the DETFFs and the second column displays the three

components of power dissipation and the total power consumption. The third and fourth

columns report the delay and energy consumption, CQ and DQ respectively. Table 3.3

lists the other performance characteristics, such as setup and hold times, maximum data

rate and total transistor width. As shown in the tables, DETpedram consumes the most power, due to exceptionally large internal and data power dissipation. This also leads to the highest energy consumption. However, it has the smallest total transistor width. DETllopis has the largest delay, yet the smallest consumption of clock and data power. DETgago consumes the least internal and total power, and hence the least energy. DETstrollo consumes the most clock power, yet this does not affect its overall performance compared to the other DETFFs studied. DETproposed has the smallest delay, but it requires the largest total width.

After the DETFFs are optimized, they are simulated at different data activity rates. The results are shown in Figure 3.10. In general, applications with α = 1 exhibit the largest total power consumption. Clock power dissipation is rather constant over all data activity rates. Data and internal power consumption increase as the data activity increases. One exception is DETpedram. When the data sequence consists of all zeros, its internal power is remarkably large. For the case of all ones, on the other hand, the internal power is especially small, whereas the data power is notably larger. However, the data power at α = 0.5 and at α = 1 is almost the same. Furthermore, DETpedram demonstrates the worst power consumption at all data rates, except when α = 1. DETgago is the best in terms of power dissipation at all data rates. The total power consumption of DETllopis is very close to that of DETgago at all data activities. DETproposed has power consumption similar to that of DETgago, except in the case of α = 1, in which it exhibits a substantially large internal power dissipation.

[Bar chart: Power Distribution as a function of Data Activity. Vertical axis: power consumption (µW), 0 to 500. Each bar is split into clock power, data power, and internal power, for DETpedram, DETproposed, DETstrollo, DETgago, and DETllopis at α = 0 (all 0's), α = 0 (all 1's), α = 0.5, and α = 1]

Figure 3.10: Power consumption dependence on data transition activity α [1]

Table 3.4: Summary of DETFF performance as Vdd reduces [1]

              Vdd = 0.9 V                       Vdd = 1.3 V                       Vdd = 1.6 V
Cell          tCQ (ps)  PTOT (µW)  PDPCQ (fJ)   tCQ (ps)  PTOT (µW)  PDPCQ (fJ)   tCQ (ps)  PTOT (µW)  PDPCQ (fJ)
DETpedram     734.1     77.3       56.7         329.5     172.2      56.7         244.2     257.9      63.0
DETllopis     762.8     75.4       57.5         350.7     117.2      41.1         264.8     152.4      40.4
DETgago       721.2     37.0       26.7         335.3     89.1       29.9         253.3     143.1      36.2
DETstrollo    failed    failed     failed       932.4     118.4      110.4        262.2     183.2      48.0
DETproposed   445.6     51.2       22.8         233.7     111.7      26.1         180.0     174.8      31.5

† CQ-delay and PDP as a function of supply voltage with relaxed setup time

The performance of the DETFFs under reduced voltage conditions is depicted in Figures 3.11, 3.12, and 3.13. Figure 3.11 plots the total power consumption of the DETFFs as a function of supply voltage. DETgago exhibits the lowest power consumption. DETproposed shows the second lowest power consumption at low supply voltage. DETllopis has the second best power dissipation near the nominal supply voltage; however, by the time the supply voltage drops to 1.4 V, its power starts to exceed that of DETproposed. The worst power consumption is exhibited by DETpedram. The power consumption curve of DETstrollo is somewhat misleading, since it fails to function below 1.3 V. Figure 3.12 depicts the tCQ of the DETFFs as a function of supply voltage. DETproposed exhibits the lowest delay. On the other hand, DETstrollo demonstrates the worst delay and quickly fails to latch below 1.3 V. All the other DETFFs have similar delay at all supply voltages tested. Figure 3.13 plots the PDPCQ as a function of supply voltage. The best energy consumption versus supply voltage is seen from the proposed DETFF, but DETgago is comparable. DETpedram and DETstrollo have similar energy dissipation at half of the nominal supply voltage. The results are further summarized in Table 3.4.
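One way to read Table 3.4 with respect to low voltage suitability is to compare how much each flip-flop's tCQ grows between 1.6 V and 0.9 V. The Python sketch below does this using the tCQ values listed in Table 3.4; the ratio itself is an illustrative figure introduced here, not a metric defined in this study, and the names are hypothetical.

```python
# tCQ in ps at each supply voltage, as listed in Table 3.4.
# None marks the DETstrollo entry reported as "failed" at 0.9 V.
T_CQ_PS = {
    "DETpedram": {0.9: 734.1, 1.3: 329.5, 1.6: 244.2},
    "DETllopis": {0.9: 762.8, 1.3: 350.7, 1.6: 264.8},
    "DETgago": {0.9: 721.2, 1.3: 335.3, 1.6: 253.3},
    "DETstrollo": {0.9: None, 1.3: 932.4, 1.6: 262.2},
    "DETproposed": {0.9: 445.6, 1.3: 233.7, 1.6: 180.0},
}


def delay_growth(cell, v_low=0.9, v_high=1.6):
    """Ratio tCQ(v_low) / tCQ(v_high); None if the cell failed at v_low."""
    t_low = T_CQ_PS[cell][v_low]
    t_high = T_CQ_PS[cell][v_high]
    return None if t_low is None else t_low / t_high


growth = {cell: delay_growth(cell) for cell in T_CQ_PS}
working = {cell: g for cell, g in growth.items() if g is not None}
most_tolerant = min(working, key=working.get)

for cell, g in growth.items():
    print(f"{cell:12s}", "failed" if g is None else f"{g:.2f}x")
print("smallest delay growth:", most_tolerant)
```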

3.4.1 Discussion

DETpedram consumes the most data power in this study. It is found that the high data

and internal power dissipation is a result of the positive feedback of the transmission gate

loop at the input end of the flip-flop. In the feedback path of the latches, the input data

controls the passing of the clock signals. For instance from Fig. 3.4, when D=0 and CK=1,

M1 turns on. Hence, Node A discharges to 0 and Node B switches to 1. Node B then

switches M2 on. As a result, M1 and M2 attempt to write 0 and (VDD - Vtn) voltages

simultaneously onto Node A. This voltage conflict is present until the clock changes state.

Such a conflict results in a degraded noise margin. This has two implications. First, this

structure allows large current to flow through the transmission gates at the input. Second,


[Plot: Total Power vs Supply Voltage under relaxed setup time condition. Horizontal axis: VDD (V), 0.9 to 1.6. Vertical axis: PTOT (W), 0 to 300e-6. One curve each for DETpedram, DETllopis, DETgago, DETstrollo, and DETproposed]

Figure 3.11: Power consumption dependence on supply voltage [1]

the degraded voltage level at Node A also causes a direct path current in the subsequent

inverters. Hence, large data and internal power dissipation results. In addition, both data

power and internal power depend on the data level rather than the data activity. When

D=0, NMOS pass gates are active through the input loop, while the PMOS is active in

the inverter that follows the loop. The opposite is true for D=1. In either case, PMOS

transistors draw more current. The all 0’s and all 1’s cases are extreme examples of this

effect. Despite the large data power consumption, its clock power dissipation is small

because of the local clock buffers. The absence of local data buffers brings into question

the robustness of the flip-flop. The transparent nature of the pass gates fails to secure

unidirectional data flow. Furthermore, its energy consumption at low supply voltage is approximately twice as high as that of the proposed DETFF. Hence, the usage of DETpedram in


[Plot: CQ delay vs Supply Voltage under relaxed setup time condition. Horizontal axis: VDD (V), 0.9 to 1.6. Vertical axis: tCQ (s), 100e-12 to 1e-9. One curve each for DETpedram, DETllopis, DETgago, DETstrollo, and DETproposed]

Figure 3.12: tCQ as a function of supply voltage [1]

low voltage and low power applications is not recommended.

DETllopis has the best clock and data power dissipation. Its clock power consumption is low because of the small clock capacitance, whereas its data power dissipation is low due to the use of an inverting input buffer. Although it has one of the smallest power consumptions at all data activities, it has the longest delay at nominal voltage, since the data must propagate through more logic stages than in the other DETFF configurations. This leads to comparatively large energy consumption at nominal conditions. As the supply voltage is reduced, its total power consumption drops at a much lower rate and its delay rises at a slightly higher rate compared to the other DETFFs studied. Hence, it has higher energy consumption at low voltage. Therefore, its application under low voltage conditions is limited, and its best energy consumption is seen around 1.5 V.


[Plot: PDPCQ vs Supply Voltage under relaxed setup time condition. Horizontal axis: VDD (V), 0.9 to 1.6. Vertical axis: PDPCQ (J), 20e-15 to 120e-15. One curve each for DETpedram, DETllopis, DETgago, DETstrollo, and DETproposed]

Figure 3.13: PDP dependency as a function of supply voltage [1]

DETgago is found to be the most energy efficient DETFF in all circumstances under

nominal conditions in this study. Its superior low power performance is mainly due to the

complete isolation of the elements when they are not in use, which demonstrates its

suitability for low power applications. Under low supply voltage conditions, although it has the lowest power

consumption, its delay is higher than that of the proposed DETFF. This results in

a slightly higher energy consumption than DETproposed at low supply voltage.

DETstrollo consumes the largest clock power because of the chain of internal clock

buffers. The delay through these clock buffers defines the activation pulse for the flip-flop.

The definition of the activation pulse width is crucial to its operation. As the supply

voltage is reduced, the activation pulse width varies, which causes the delay to increase at a

much higher rate. The delay rapidly approaches the clock pulse width, at which point the flip-flop fails to latch

the input data. Therefore, it is not suitable for use in low voltage environments.

DETproposed has superior delay because of the use of NMOS transistors and the avoidance

of PMOS transistor stacking in its design. However, its inferior slew rate leads to

an especially prominent power consumption at high data rates. As a result, its overall

energy consumption at nominal condition is close to that of DETgago, which has the lowest energy

dissipation. Under reduced supply voltage, DETproposed has the second best power

consumption and the best delay, which results in the best energy consumption at low supply

voltage. Hence, it has promising usage in low energy, low voltage applications.

The proposed design is an attempt to design a low voltage DETFF. Although DETproposed

can achieve good performance, it is found that the complete isolation of the deactivated

elements, as in the case of DETgago, is a key to low power dissipation. However,

DETproposed has been shown to operate the most efficiently at low supply voltage. Hence,

DETproposed and DETgago are recommended for further research in low power, low voltage

subsystems. In the next chapter, these two DETFFs will be implemented in a digital

filter, along with a standard SETFF.

3.5 Other Considerations

3.5.1 Parallel Interconnects and Clock Requirements

Although many DETFFs have been proposed, their use is still uncommon. There are

several reasons why DETFFs are not yet popular in VLSI circuits. A DETFF requires a more

complex implementation and denser interconnects than SET structures.

In DETFFs, latches are connected in parallel. This results in a higher number of internal

nodes, as well as higher node capacitances, including those on the data and clock inputs. In

fact, these byproducts of the DETFF offset some of the benefits of a reduced clock rate [7,

14, 26]. For instance, the setup and hold times of DETFFs are typically larger than

those of conventional flip-flops [27]. Thus, DETFFs become less attractive for high

performance applications. Furthermore, DETFFs pay a penalty in design area [22, 27].

The larger number of transistors and the increased number of interconnects make the

footprint of a DETFF much larger than that of a conventional SETFF. Nevertheless,

careful floorplanning and layout can be used to minimize the length of wires, which in

turn reduces interconnect capacitance and keeps these penalties to a minimum.

In addition, a DETFF captures data on both clock edges; therefore, a duty cycle of

50% is required. As such, the specification on jitter tolerance is more stringent. With the

trend of increasing clock frequency, it is more difficult to control the clock duty cycle and

both edges of the clock in the clock distribution system [26]. Hence, local clock regeneration

or a more complex system phase-locked loop is necessary.

3.5.2 Design for testability

In a typical VLSI system, millions of transistors are involved. In such an overwhelming

situation, testing and debugging would seem to be impossible to accomplish in a timely

manner without the aid of automated tools. Controllability and observability are two

important attributes in testing and debugging. Controllability is the ability to establish

a specific signal value at each node in a circuit by setting values on the circuit’s inputs.

Observability is the ability to determine the signal value at any node in a circuit by

retrieving values on the circuit’s outputs. The ability to observe a snapshot of the operation

at a particular cycle or cycles, and/or to control the state of the system at a desired cycle,

is an invaluable tool for debugging.

Testability is a design characteristic that allows the status of a device to be determined,

the isolation of faults to be performed quickly and effectively, and the tests to be developed

in a cost effective manner. It strongly influences the testing time and cost. Design

for testability (DFT) techniques are design efforts employed to ensure that a device is

testable [28, 29].

Most DFT techniques require circuit modifications and affect factors such as logic

complexity, die area, I/O pins, and circuit delay. Increasing logic complexity results in

increased power consumption and decreased yield. Hence, a critical balance between the

extent of DFT used and the gain achieved should be sought with care. In brief, DFT is

used to reduce test generation costs, enhance the fault coverage of tests, and hence reduce

defect levels. When a good DFT is applied, it can also reduce test length, tester memory

and test application time.

One of the most popular DFT techniques is scanning, with the use of scan registers.

A scan register is a flip-flop or a latch with both shift and parallel load capability. The

storage cells in the register are used as observation points and/or input controls. The

implementation of scan can be achieved at a relatively small cost to latency and die area,

and its benefits far outweigh the added hardware complexity.

The parallel structure of the DETFF also presents new challenges in forming scan chains.

Llopis and Sachdev are among the first to have addressed this challenge. They proposed

a DETFF, as illustrated in Figure 3.14, that is made unidirectional on the data paths [4].

Good as it seems, the area and delay penalties are rather large.

Recent research has explored alternatives to the classic configurations of

DETFF, in which only one latch is used. Since there exists only one data path, the implementation

of scan is much easier than in the case of the classic DETFF [7]. But how can one

single latch capture and store data on both rising and falling edges of the clock signal? The

focus has been shifted to the multiplexing mechanism at the input. A new clock signal is

generated from the original system-wide clock signal, and the input multiplexer thereby

ensures the data capture at each clock edge of the system clock.

Figure 3.14: DETFF proposed in [4] with unidirectional characteristic (schematic with data input D, output Q, clock CK and internal nodes N1 and N2)

Figures 3.15 and 3.16 are two examples of this new direction of DETFF implementation.

In Figure 3.15, an XOR operation is performed on delayed clock pulses that are generated

by an inverter chain. This results in a new clock signal of twice the frequency of the system

clock. It is then fed into a standard SETFF. In essence, it is a SETFF that operates locally at twice

the system clock rate in order to maintain the data throughput. The area

overhead is much less than that of a classic implementation. However, the XOR gate

delay adds to the overall flip-flop delay and degrades its maximum clock frequency.

Figure 3.15: DETFF using XOR operation to generate delayed clock pulses [7] (schematic with clock CK, internal signals X and Y, data input D and output Q)


Figure 3.16 presents an efficient realization of a dual edge-triggered pulse generator. This

generator is composed of an inverter chain followed by a pair of parallel transmission gates.

It exhibits minimum area overhead and no delay penalty. It also has good clock skew

absorption.

Figure 3.16: DETFF proposed in [8] with DET pulse generator (schematic with data input D, output Q, clock CK and internal nodes N1 and N2)

From the above discussion, it is important to note that in the implementations of

DETFF, it is crucial to achieve DET features without increasing the loads of either the

clock or the input data. The DETFF structures should also facilitate the implementation

of DFT for the ease of debugging and robust designs.

Chapter 4

Hearing Aids and Digital Filters

To further analyze the DETFF performance and possible power savings in

low power, low voltage VLSI circuits, digital filters for hearing aid applications are explored.

4.1 Hearing Aids

The strictest set of requirements in portable audio systems is found in hearing aids. There

are four common types of hearing aids, which differ in size and location on the users. They

can be worn behind-the-ear, in-the-ear, in-the-canal, and completely-in-the-canal (CIC).

The CIC type is the smallest and most discreet hearing aid, and it can be fitted entirely within the bony canal.

Not only does it have a superior sound solution, but it also offers better cosmetic appeal

since it is completely invisible to the outsider. In general, smaller hearing aids have fewer

controls over acoustical adjustments and smaller batteries, so they do not last as long as

larger hearing aids. However, smaller hearing aids offer a more natural sound because of

the characteristics of the ear.

Until very recently all contemporary hearing aids were designed utilizing analog

circuitry. The shortcoming of these analog aids is that their electroacoustic characteristics

could not be modified as easily to suit individual patient requirements. As the technology

advances, digital programmable hearing devices are becoming readily available at more

affordable prices. In these devices, the input signal is digitized, then processed with dig-

ital signal processing circuitry. With the algorithmic fitting methods, the electroacoustic

performance and sound output can be adjusted to meet individual hearing loss require-

ments [30, 31].

Hearing aid devices have challenging requirements in terms of power and area due

to the small battery capacity and limited dimensions. Since hearing aids are battery

powered, the maximum supply current is 1.5 mA, with an operating voltage range of 1 to 1.5 V.

The core of a hearing aid consists of a Digital Signal Processor (DSP), which executes

basic filter algorithms such as finite impulse response as well as special adaptive algorithms

used for noise reduction. The basic building blocks of these filters are adders, multipliers,

multiply-accumulators and memories in addition to counters, registers, multiplexers and

demultiplexers. In this study, only the digital filters are covered and used as a benchmark

circuit for the comparison. Its implementation with DETFF is discussed next.

4.2 Digital Filters

Digital filters can be categorized into two classes known as FIR (finite-length impulse

response) and IIR (infinite-length impulse response) filters. Advantages of FIR filters

over IIR filters are that they are guaranteed to be stable and to have a linear-phase re-

sponse. Linear-phase FIR filters are widely used in digital communication systems, speech

and image processing systems, spectral analysis, and particularly in applications where

nonlinear-phase response distortion cannot be tolerated. Digital FIR filters are limited to


designs that have transfer functions with effective poles at the origin of the z-plane, while

IIR filters can have poles anywhere within the unit circle. Hence, in IIR filters, the poles

can be used to improve the frequency selectivity. As a consequence, the required filter

order is much lower for IIR as compared to FIR filters. However, it is still not possible to

have exact linear-phase IIR filters. Furthermore, FIR filters are straightforward to design

by using CAD tools. One of the major drawbacks of FIR filters is that large amounts of

memory and arithmetic processing are needed. These make them unattractive in many

applications. IIR filters, on the other hand, require much less memory and fewer arith-

metic operations, but they are difficult to design and they suffer from stability problems.

Although the design is much more demanding, the use of an IIR filter may result in a lower

system cost and higher performance [9, 32].

The primary goal of this study is to determine the usage of DETFFs in a low power and

low voltage VLSI application, such as hearing aids, and the possible power saving. Hence,

an FIR filter, which is simpler to design, is selected to minimize the design effort.

4.2.1 Half-Band FIR

FIR filters are nonrecursive filters, which are always stable. They cannot

sustain any type of parasitic oscillation, except when they are part of a recursive loop.

They generate little round-off noise. However, they require a large number of arithmetic

operations and large memories.

The transfer function of an FIR filter of order M is

$$H(z) = \sum_{n=0}^{M} h(n) z^{-n} = \frac{h(0) z^{M} + h(1) z^{M-1} + \dots + h(M-1) z + h(M)}{z^{M}}$$


where h(n) is the impulse response. Instead of using the order of the filter to describe an

FIR filter, it is customary to use the length (N) of the impulse response. N is equal to

M + 1 in this case.

The FIR filters of interest are filters with linear phase. A linear-phase response has

constant group delay, which implies a pure delay of the signal. The group delay is defined

as the negative rate of change of the total phase shift with respect to angular frequency. For linear-

phase FIR filters, it is equal to

$$\tau_g(\omega T) = -\frac{\partial \Phi(\omega T)}{\partial \omega} = \frac{N-1}{2} T$$

Linear-phase filters are useful in applications where frequency dispersion effects must

be minimized, such as in speech processing systems of hearing aids. The impulse response

of linear-phase filters exhibits symmetry or antisymmetry. Therefore, the transfer function

can be rewritten, using a real function $H_R$, as

$$H(e^{j\omega T}) = e^{j\Phi(\omega T)} H_R(e^{j\omega T}) \qquad (4.1)$$

where $\Phi(\omega T) = c - \omega \tau_g(\omega T)$, with c = 0 and c = π/2 for symmetric and antisymmetric

impulse responses, respectively. $H_R(e^{j\omega T})$ is referred to as the zero-phase response. For an

FIR filter with an antisymmetric impulse response and odd N, the zero-phase response is

$$H_R(e^{j\omega T}) = \sum_{n=1}^{(N-1)/2} 2\, h\!\left(\frac{N-1}{2} - n\right) \sin(\omega T n) \qquad (4.2)$$

Many DSP schemes exploit the fact that the impulse responses of certain types

of FIR filters, such as half-band FIR filters, contain a large number of zeros. The required number of arithmetic

operations can therefore be reduced since it is unnecessary to perform multiplications

by coefficients that are zero.

An even-order (N odd) half-band FIR filter has a zero-phase function that is antisymmetric

about π/2. Hence, for a lowpass half-band FIR filter, $H_R(e^{j\omega T}) = 1 - H_R(e^{j(\pi - \omega T)})$.

Then every other coefficient in the impulse response is zero, except for the one in the center,

which is 0.5. It is called half-band because the bandwidth is about half of the whole

frequency band. The symmetry implies that the relation between the cutoff angle and

stopband angle is $\omega_c T + \omega_s T = \pi$ and that the passband and stopband deviations are equal.

The reduction in the number of arithmetic operations is significant although in practice

the required filter order is slightly higher than that for a corresponding linear-phase filter.

The normalized zero-phase function is $H_R(e^{j\pi/2}) = 0.5$, implying an attenuation of 6 dB.

It should be noted that the coefficients are nonzero for odd-order half-band filters [9, 33].
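The zero-coefficient property described above can be illustrated with a short sketch. It is an assumption-laden illustration only: it uses a Hamming-windowed ideal half-band response rather than the Chebyshev design method adopted later in this thesis, and the function name `halfband_taps` is hypothetical. It shows that, for a length of 39, every second tap away from the center vanishes and the center tap is 0.5.

```python
import math

def halfband_taps(length):
    """Windowed-sinc half-band lowpass taps (cutoff at a quarter of the sampling rate).

    Illustrative only: a Hamming window is assumed here; this is not the
    Chebyshev coefficient design used for the filter in this thesis.
    """
    if length % 2 == 0:
        raise ValueError("length N must be odd (even filter order)")
    centre = (length - 1) // 2
    taps = []
    for n in range(length):
        k = n - centre  # offset from the centre tap
        if k == 0:
            ideal = 0.5  # centre coefficient of a half-band filter
        else:
            ideal = math.sin(math.pi * k / 2) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (length - 1))
        taps.append(ideal * window)
    return taps

taps = halfband_taps(39)
centre = (len(taps) - 1) // 2
# Indices of coefficients that actually require a multiplication
nonzero = [n for n, h in enumerate(taps) if abs(h) > 1e-12]
print(len(nonzero), "of", len(taps), "taps are nonzero")
```

Because the remaining taps are symmetric about the center, a hardware realization could in principle share multiplications between mirrored taps as well; that refinement is not shown here.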

4.2.2 Implementation of FIR - Direct Form Structure

Only a few structures are of interest for the realization of FIR filters. One of the best and

yet simplest structures is the direct form, as depicted in Figure 4.1.

Figure 4.1: FIR filter implemented with direct form structure (tapped delay line of unit delays D holding x(n) to x(n-38), with coefficients h(0) to h(38) and output y(n))

The direct form FIR filter of order M (length N = M + 1) can be described by a single

difference equation:

$$y(n) = \sum_{k=0}^{M} h(k)\, x(n-k) \qquad (4.3)$$

The required numbers of multiplications and additions are N and N − 1 respectively.

This structure is suitable for implementation on processors that are efficient in computing

sum-of-products. Most standard signal processors provide special features to support sum-

of-product computations, i.e., a multiplier-accumulator and hardware implementation of

loops and circular memory addressing. The signal levels in this structure are inherently

scaled except for the output [9, 32].

As shown in Eq. (4.3), digital FIR filters consist of a series of multiplications of samples

of the input by constant coefficients and of additions of these products. Building blocks

include memory cells, multipliers, adders, and a programmer to control the sequence of

operations [34].
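As a minimal software sketch of the difference equation in Eq. (4.3), the direct form can be written as a delay line followed by a sum of products. The function name `direct_form_fir` and the zero initial state of the delay line are assumptions made for illustration; this is a behavioural model, not the hardware implementation described later.

```python
def direct_form_fir(h, x):
    """Direct form FIR: y(n) = sum_{k=0}^{M} h(k) x(n-k), with x(n) = 0 for n < 0."""
    delay_line = [0.0] * len(h)  # holds x(n), x(n-1), ..., x(n-M)
    y = []
    for sample in x:
        # Shift the new sample in and drop the oldest one
        delay_line = [sample] + delay_line[:-1]
        acc = 0.0
        for coeff, tap in zip(h, delay_line):
            acc += coeff * tap  # N multiplications per output sample
        y.append(acc)
    return y

# Feeding a unit impulse returns the coefficients themselves
impulse_response = direct_form_fir([0.25, 0.5, 0.25], [1.0, 0.0, 0.0, 0.0])
step_response = direct_form_fir([0.25, 0.5, 0.25], [1.0, 1.0, 1.0, 1.0])
```

The inner loop is the sum-of-products that a multiplier-accumulator would execute once per coefficient.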

4.3 Design and Implementation of a Chebychev Half-

Band FIR Filter

The human audible frequency range is 20 Hz to 22 kHz and the human speech bandwidth is 100 Hz to 8 kHz.

By the Nyquist sampling theorem, data is only valid up to Fs/2, where Fs is the sampling

frequency of the input data. Hence, the maximum input data frequency should be less than Fs/2.

The bandwidth of a half-band filter is therefore (1/2)(Fs/2) = Fs/4. In order to cover the human

speech spectrum up to 8 kHz, a sampling frequency of 32 kHz is required.

Chebyshev polynomial has been chosen as the filter approximation method to determine

the filter coefficients. It is because linear-phase, equal-ripple FIR filters are naturally

describable in terms of Chebyshev polynomials. The design method for filter coefficients


described in [35] is adopted here. A filter length of 39 is first selected. The minimum stopband

loss and the passband ripple are found to be 36.3 dB and 0.26 dB respectively. The details

of the filter specification are listed in Table 4.1.

Table 4.1: Specifications for Chebychev half-band FIR filter with N=39

length:              39
cutoff freq.:        1.507
min. stopband loss:  36.3 dB
passband ripple:     0.26 dB

4.3.1 Number representation

Before describing the implementation of the design, its number system must be defined

first. Performance of the processing elements with respect to speed, chip area, and power

dissipation depends on the number representation used. In this research, two’s-complement

representation and fractional fixed-point arithmetic are employed.

Two’s complement representation is the most common type of arithmetic used in dig-

ital signal processing. It is a subset of binary representation of numbers. One of the

main advantages of a complement representation is that addition and subtraction can be

performed without regard to the sign of the operands. The value of a normalized Wd-bit

binary word (x) in two’s complement representation is

$$x = -x_0 + \sum_{i=1}^{W_d-1} x_i 2^{-i} \qquad (4.4)$$

For x > 0, two’s complement has the same binary word as signed-magnitude representation.

The negative value of a number in two’s complement representation can be obtained

from the corresponding positive number by adding $2^{-(W_d-1)}$, the weight of the least-significant bit, to the bit-complement.

A useful property of two’s complement representation is that, if the sum lies in the

proper range, several two’s complement numbers can be added even though the partial

sums may temporarily overflow the available number range. Thus, the numbers can be

added in arbitrary order without considering possible overflow as long as the final sum lies

within the proper range. Hence, arithmetic operation such as addition, subtraction, and

multiplication, are simple to implement in two’s complement representation, since they are

independent of the signs of the numbers involved.
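The properties above can be checked with a small sketch of Eq. (4.4). The helper names (`encode`, `value`, `negate`, `add_wrap`) are hypothetical, and a word length of 16 bits is assumed to match the filter design. The example shows a partial sum that overflows while the final sum is still correct.

```python
WD = 16  # word length in bits (assumed, matching the 16-bit data words)

def encode(x, wd=WD):
    """Fractional two's complement word for -1 <= x < 1, stored as an unsigned int."""
    return int(round(x * (1 << (wd - 1)))) & ((1 << wd) - 1)

def value(word, wd=WD):
    """Eq. (4.4): x = -x0 + sum_{i=1}^{wd-1} x_i 2^-i, where x0 is the sign (top) bit."""
    x0 = (word >> (wd - 1)) & 1
    fraction_bits = word & ((1 << (wd - 1)) - 1)
    return -x0 + fraction_bits / (1 << (wd - 1))

def negate(word, wd=WD):
    """Bit-complement plus one least-significant bit, i.e. plus 2^-(wd-1)."""
    mask = (1 << wd) - 1
    return ((~word & mask) + 1) & mask

def add_wrap(a, b, wd=WD):
    """Addition that silently wraps on overflow, as a plain wd-bit adder does."""
    return (a + b) & ((1 << wd) - 1)

# 0.75 + 0.5 overflows the range [-1, 1), but the final sum 0.625 does not
partial = add_wrap(encode(0.75), encode(0.5))
final = add_wrap(partial, encode(-0.625))
print(value(partial), value(final))
```

The wrapped partial sum reads as a negative number, yet the final result is recovered once the last term brings the total back into range.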

In fractional fixed-point arithmetic, the k leftmost digits represent the

integer part and the Wd − k remaining digits represent the fractional part. For example,

the binary representation of the number x with k = 2 is $x_0 x_1 . x_2 \cdots x_{W_d-1}$. The radix

point is not stored in the fixed-point representation; instead, its position is understood.

The advantage of fractional fixed-point arithmetic is that parasitic oscillations are more

easily suppressed. It also requires less chip area and is much faster than floating-point

arithmetic. Hence, in most VLSI circuits for dedicated DSP applications calculations are

done using fractional fixed-point arithmetic [9].

In the Chebychev half-band FIR filter design, 16-bit word length is selected for both

data inputs and filter coefficients.

4.3.2 Processing Blocks

Bit-serial arithmetic is a viable alternative in digital signal processing applications to traditional

bit-parallel arithmetic. Bit-serial arithmetic significantly reduces chip area by

eliminating wide buses and simplifies wire routing. The speed penalty is not as large as

it seems. In fact, the difference in speed is much smaller than expected because of the long carry propagation

paths in parallel arithmetic [9, 36].

The comparison of power consumption is, however, more complicated. Bit-parallel

arithmetic suffers from energy losses in glitches that occur when the carry propagates.

Yet, the glitches will be fewer if successive data are strongly correlated. Driving long and

wide buses consumes large amounts of power. Bit-serial arithmetic, on the other hand, will

only perform useful computations without any glitches, but require more clocked elements

that will consume significant amounts of power. Power-efficient realization of the clocked

elements is then more important [9]. Since the focus of this research is on the efficiency

and performance of DETFFs, bit-serial arithmetic is selected for the design of Chebyshev

half-band digital filter core.

The architecture of the FIR filter depicted in Figure 4.1 has been modified to reduce the

number of multipliers used. The block diagram of the design is illustrated in Figure 4.2.

Input data of 16-bit word length arrives at a rate of Fs. The data is then stored

in a queue through a mux stage. The mux stage selects whether to push new data into the

queue or to loop the output of the queue back to its beginning. New data is pushed

into the queue 1/39 of the time. The queue needs to operate 39 times faster than the

input data rate, since only one multiplier and accumulator set is available while 39 coefficients

have to be processed. Because the filter core uses bit-serial arithmetic, it is required

to operate at (Wc + Wd) × 39 × Fs, where Wc is the word length of the filter coefficients. In

bit-serial arithmetic, the numbers are normally processed with the least-significant bit first.

SRin is a parallel-to-serial shift register followed by a serial-parallel multiplier. The

accumulator adds up all the products. SRout stores the accumulated sum and feeds

the sum back to the accumulator. It also drives the output buffers when all multiplications

with the 39 coefficients and additions are completed. These processing blocks are described

in more detail in the following.

Figure 4.2: Block Diagram of the Chebychev half-band FIR filter design (blocks shown: MUX, Queue with push and pop at 39Fs, SRin, multiplier, ROM holding 39 coefficients of 16 bits, adder, SRout and output buffer; 16-bit input at Fs and output at Fs)
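The operating rates implied by this architecture can be worked out numerically. The sketch below plugs in the values stated in this chapter (Fs = 32 kHz, 16-bit data and coefficients, 39 coefficients); the function name `filter_clock_rates` is hypothetical, and the final line assumes, as a general property of dual edge triggering rather than a measured result, that a DETFF-based core can run at half the clock frequency for the same throughput.

```python
def filter_clock_rates(fs_hz, wd_bits, wc_bits, taps):
    """Return (queue rate, bit-serial core rate) in Hz for the single-MAC architecture."""
    queue_rate = taps * fs_hz                        # queue pushes and pops at 39 * Fs
    core_rate = (wc_bits + wd_bits) * taps * fs_hz   # (Wc + Wd) * 39 * Fs
    return queue_rate, core_rate

queue_rate, core_rate = filter_clock_rates(32_000, 16, 16, 39)
print(queue_rate / 1e6, "MHz queue rate")
print(core_rate / 1e6, "MHz bit-serial core rate")

# Assumption: with data captured on both clock edges, the clock frequency
# can be halved while the data rate above is maintained.
det_clock = core_rate / 2
```

With these numbers the bit-serial core rate comes to just under 40 MHz, well below the 1 GHz flip-flop testbench frequency quoted in the next chapter.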

Adder

Figure 4.3(a) shows a bit-serial adder, which is composed of a full adder (FA) and a D

flip-flop. This adder is also called carry-save adder, since the carries are saved from one

bit position to the next. The D flip-flop is reset at the start of a computation to clear the

memory of the adder.

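Figure 4.3: (a) Bit-serial adder (b) Bit-serial subtractor [9] (each built from a full adder FA with inputs X and Y and a D flip-flop holding the carry C; the flip-flop has a Reset input in the adder, with output Sum, and a Set input in the subtractor, with output Diff)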

A subtractor can be obtained from the adder implementation by simply inverting one

of the addends and setting the D flip-flop at the beginning. The implementation of a

subtractor is illustrated in Figure 4.3(b).
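The behaviour of the bit-serial adder and subtractor can be sketched in software as follows. This is a behavioural model under stated assumptions: bits are processed least-significant bit first, the carry variable plays the role of the D flip-flop, and the helper names (`to_bits`, `from_bits`, `bit_serial_add`) are hypothetical.

```python
def to_bits(value, width):
    """LSB-first bit list of a two's complement integer."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Unsigned integer value of an LSB-first bit list."""
    return sum(bit << i for i, bit in enumerate(bits))

def bit_serial_add(x_bits, y_bits, subtract=False):
    """One full adder plus a carry flip-flop, clocked once per bit.

    For subtraction the y input is inverted and the carry flip-flop is
    set (instead of reset) at the start, giving x + (~y) + 1 = x - y.
    """
    carry = 1 if subtract else 0  # flip-flop set or reset before the first bit
    out = []
    for x, y in zip(x_bits, y_bits):
        if subtract:
            y ^= 1
        out.append(x ^ y ^ carry)                     # sum output of the full adder
        carry = (x & y) | (x & carry) | (y & carry)   # carry saved for the next bit
    return out

total = from_bits(bit_serial_add(to_bits(5, 8), to_bits(3, 8)))
difference = from_bits(bit_serial_add(to_bits(5, 8), to_bits(3, 8), subtract=True))
```

A word of width W therefore needs W clock cycles, with exactly one flip-flop of state regardless of the word length.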

Serial/Parallel Multiplier

Most bit-serial multipliers are based on the shift-and-add algorithm, where several bit-

products are added in each time slot. In a serial/parallel multiplier, the Wd-bit multi-


plicand, x, arrives bit-serially while the Wc-bit multiplier, a, is applied in a bit-parallel

format. Many different schemes for bit-serial multipliers are available. They differ mainly

in the order of bit-products generation and addition and in the handling of subtraction.

A common approach is to generate a row of bit-products in each time slot and then add

them concurrently.

Figure 4.4 illustrates an example of a serial/parallel multiplier based on carry-save adders.

It is composed of an AND stage and an array of one subtractor for the sign bit and Wc − 1

adders cascaded together. At the beginning of a computation, all the D flip-flops are

clocked to set the subtractor and reset the adders. The input data bits are broadcast

LSB-first bit-serially to the array, while the Wc-bit multiplier is applied in parallel. Each

x bit is first ANDed with the multiplier word, then the outputs are added/subtracted in the array.

As the D flip-flops are clocked, the sum-bits from the FAs are shifted one bit to the

right while each carry-bit is saved and added to the FA in the same stage at the

next clock. At each cycle, one partial product is formed and shift-accumulated. These

operations correspond to multiplying the accumulator contents by $2^{-1}$. After Wd clock

cycles, only the least-significant Wd product bits have been shifted out from the end of

the array. The most-significant Wc − 1 product bits remain in the accumulator as

a carry-save residue. These remaining product bits must be combined to form the whole

product by clocking through the accumulator. Hence, a bit-serial multiplication takes at

least Wd + Wc − 1 clock cycles. Two successive multiplications are therefore separated by

Wd + Wc clock cycles since one clock cycle is required to clear the accumulator [9, 37].
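The shift-and-add schedule described above can be mimicked with a word-level behavioural sketch. It is not a gate-level model of Figure 4.4: the accumulator is kept as a plain signed integer rather than a carry-save array, and the sign is handled by subtracting the partial product of the data sign bit, which is an assumption made to keep the sketch short. The function name `serial_parallel_multiply` is hypothetical.

```python
def serial_parallel_multiply(x, a, wd, wc):
    """LSB-first shift-and-add product of signed integers x (wd bits) and a (wc bits).

    Returns (product, clock cycles used). Behavioural sketch only; the most
    negative x times the most negative a would need one more result bit.
    """
    acc = 0
    product_bits = []
    for i in range(wd):
        x_bit = (x >> i) & 1         # serial data bit, least-significant first
        partial = a * x_bit          # row of bit-products (the AND stage)
        if i == wd - 1:
            partial = -partial       # the sign bit of x has negative weight
        acc += partial
        product_bits.append(acc & 1)  # one product bit leaves the array
        acc >>= 1                     # multiply accumulator contents by 2^-1
    for _ in range(wc - 1):
        product_bits.append(acc & 1)  # clock out the remaining high-order bits
        acc >>= 1
    width = wd + wc - 1
    product = sum(bit << i for i, bit in enumerate(product_bits))
    if product_bits[-1]:
        product -= 1 << width         # interpret the result as two's complement
    return product, len(product_bits)

product, cycles = serial_parallel_multiply(-12345, 23456, 16, 16)
```

For 16-bit data and coefficients the loop runs for Wd + Wc − 1 = 31 cycles, in line with the cycle count given in the text.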

In most serial-parallel multipliers, the speed limitation is the propagation delay of the

unpipelined data path. In this case it is the propagation time through one AND gate and one

full-adder [36]. In addition, the use of carry-save adders yields a regular hardware

structure. A weakness of serial/parallel multiplier architectures is that data and control signals

must be broadcast to the array. The loading on these wires, together with their physically

inherent RC constant, may impair the potential performance. This may be combated in two

ways: (i) inclusion of buffer/drivers before broadcasting; (ii) inclusion of pipelining latches

in all direct paths, every few stages in the accumulator [37].

Figure 4.4: Serial/Parallel multiplier (a row of AND gates driven by the multiplier bits a0 to a4 and the serial input bits x0, x1, x2, ..., x(Wd-1), feeding a chain of full adders FA and D flip-flops with output y)

Shift Registers

Shift registers are composed of multiplexers and flip-flops, except for the serial-in-serial-out

(SISO) structure. The function of the multiplexers is to select the inputs and load the

registers. In this filter design, the shift register at the input has a parallel-in-serial-out (PISO)

structure, whereas the one at the output is a simplification of the SISO and serial-in-parallel-out

(SIPO) structures. Basically, SRout is a SISO shift register with parallel access

points between flip-flops that drive the output buffers. These structures are illustrated in

Figure 4.5.

A queue is essentially an array of SISO shift registers. It is also referred to as a FIFO

(first-in-first-out).

Figure 4.5: Structures for shift registers (PISO, SISO and SIPO structures built from chains of D flip-flops, with load controls)

Chapter 5

Results and Discussions

5.1 Measurements

As noted in chapter 3, DETgago and DETproposed are selected to be implemented in a

Chebyshev digital filter for further analysis. These two DETFFs are modified with the

inclusion of reset and set input controls. Figures 5.1 and 5.2 illustrate the implementations

with reset control input. Similar realizations are carried out for the set control input. These

DETFFs are optimized, laid-out according to the 0.18 µm standard cell rules and extracted.

All the DETFF implementations have width of 21.12 µm, except for the DETgago with set

control which is 20.46 µm wide. Compared to the 0.18 µm standard SETFF with reset

and set inputs, the DETFFs are about 77% larger in area. This is expected because of the

more complex parallel interconnects of the DETFF structures.

The DETFFs together with the standard SETFF are simulated at a range of Vdd from

a nominal 1.8 V to 0.9 V. The same testbench and simulation parameters introduced in

section 3.3.3 are used here as well. Results are summarized in Table 5.1. As shown in

the table, the two DETFFs demonstrate superior power, delay and energy as compared

to the SETFF for the same throughput. DETproposed exhibits the best energy efficiency

among the three flip-flops while DETgago consumes the least power. DETFFs demonstrate

a power saving of 10% and an energy saving of as much as 23%.

Figure 5.1: DETgago with reset active low control (schematic with data input D, clock CK, reset R, output Q and internal nodes N2 to N5)
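The savings quoted above can be reproduced from the 1.8 V columns of Table 5.1. The sketch below assumes that the energy figure is the product of the summed power components (PCK + PD + PFF) and the delay, which matches the tabulated energies to two decimal places; the dictionary layout and function names are hypothetical.

```python
# Values at Vdd = 1.8 V copied from Table 5.1 (powers in uW, delay in ps)
TABLE_5_1_AT_1V8 = {
    "SETFF":       {"p_ck": 39.66, "p_d": 26.0,  "p_ff": 150.2, "t_d": 340.1},
    "DETgago":     {"p_ck": 25.04, "p_d": 33.12, "p_ff": 134.2, "t_d": 299.0},
    "DETproposed": {"p_ck": 25.42, "p_d": 30.06, "p_ff": 147.5, "t_d": 277.9},
}

def total_power_uw(cell):
    """Sum of the clock, data and flip-flop power components."""
    return cell["p_ck"] + cell["p_d"] + cell["p_ff"]

def energy_fj(cell):
    """Power-delay product: 1 uW * 1 ps = 1e-18 J = 1e-3 fJ."""
    return total_power_uw(cell) * cell["t_d"] * 1e-3

setff = TABLE_5_1_AT_1V8["SETFF"]
power_saving = 1 - total_power_uw(TABLE_5_1_AT_1V8["DETgago"]) / total_power_uw(setff)
energy_saving = 1 - energy_fj(TABLE_5_1_AT_1V8["DETproposed"]) / energy_fj(setff)

for name, cell in TABLE_5_1_AT_1V8.items():
    print(name, round(total_power_uw(cell), 2), "uW", round(energy_fj(cell), 2), "fJ")
```

Under this reading, the power saving refers to DETgago and the energy saving to DETproposed, each relative to the standard SETFF at the nominal supply.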

Furthermore, the standard SETFF fails to deliver the data throughput described in

the testbench at high clock frequency (1 GHz) for Vdd of 1 V or less. It can, however, work

at 1.05 V delivering the same throughput, with power, delay and energy of 68.19 µW,

846.4 ps and 57.72 fJ respectively. The standard SETFF can still operate properly at

lower clock frequencies, as in the case of the FIR filter used in this study. It is worth noting

that although the DETFFs dissipate the least power at Vdd = 0.9 V, the least energy

consumption occurs at a Vdd of around 1 V. The delay penalty well exceeds the power

savings at a Vdd of 0.9 V.

Figure 5.2: DETproposed with reset active low control (schematic with data input D, clock CK, reset R, output Q and internal nodes N1 to N4, N7 and N8)

Table 5.1: Back annotation parameters for DETFFs and standard SETFF

(PTOT: total power, td: delay, E: energy)

                 Vdd = 0.9 V               Vdd = 1 V                 Vdd = 1.8 V
Cell             PTOT    td      E         PTOT    td      E         PCK     PD      PFF     td      E
                 (µW)    (ps)    (fJ)      (µW)    (ps)    (fJ)      (µW)    (µW)    (µW)    (ps)    (fJ)
SETFF            -       -       -         -       -       -         39.66   26      150.2   340.1   73.41
DETgago          43.37   1212    52.56     54.45   877.5   47.78     25.04   33.12   134.2   299     57.52
DETproposed      44.89   951.6   42.71     55.91   710.1   39.7      25.42   30.06   147.5   277.9   56.41

The Chebyshev half-band FIR filter is verified functionally using Matlab. Figure 5.3

shows the frequency response of the half-band FIR filter.

Figure 5.3: Frequency response of the Chebyshev half-band FIR filter (plot of magnitude in dB versus w in rad/s)

The Chebyshev half-band FIR filter design is coded using RTL/VHDL, compiled, simulated,

and then synthesized using Synopsys. It is then laid out using the automated place-and-route

tools Design Planner and Silicon Ensemble. The design is ported into Cadence

for final post-layout simulation. Table 5.2 summarizes the results for power consumption

and filter core dimensions. The FIR filter with DETproposed is found to consume 38% less

power than that with the standard SETFF cell. However, the area overhead is approximately

70%.

Table 5.2: Simulation Results for Chebyshev half-band FIR filter operating at 0.9 V

FIR with         Total power (µW)    Core area (µm x µm)
SETFF            135.7               280 x 280
DETgago          -                   280 x 475
DETproposed      83.61               280 x 475
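The two percentages quoted for the filter follow directly from the entries of Table 5.2, as the short check below shows. The variable names are hypothetical; the numbers are those tabulated for the SETFF and DETproposed implementations at 0.9 V.

```python
# Table 5.2 entries: total power in uW and core dimensions in um
setff_power_uw, det_power_uw = 135.7, 83.61
setff_area_um2 = 280 * 280
det_area_um2 = 280 * 475

fir_power_saving = 1 - det_power_uw / setff_power_uw   # fraction of power saved
fir_area_overhead = det_area_um2 / setff_area_um2 - 1  # fractional area increase

print(f"power saving:  {fir_power_saving:.1%}")
print(f"area overhead: {fir_area_overhead:.1%}")
```

Both results round to the figures stated in the text, so the power saving and the area overhead are of comparable importance when judging the trade-off.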

The simulation data for DETgago is not available at this time because the reset control signal was implemented improperly. As shown in Figure 5.1, when reset is asserted, the output of the flip-flop appears to be reset. However, since the latches are not disconnected during reset, the value stored in the flip-flop remains when the reset is released. This causes a problem in resetting the filter at the start of simulation, but less so during filter operation, since the clock is always toggling. Hence, clocking with input data equal to 0 is initially necessary. An improved implementation of DETgago with reset, in which the latches are disconnected during reset, is illustrated in Figure 5.4.
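The reset behaviour described above can be illustrated with a small behavioural model. This is a minimal sketch assuming a generic latch-mux DETFF abstraction (two level-sensitive latches and an output multiplexer); it is not the transistor-level DETgago circuit of Figures 5.1 and 5.4. The class name, method name, and the `clear_latches_on_reset` flag are hypothetical, and "disconnecting" the latches is modelled simply as clearing them.

```python
class LatchMuxDetff:
    """Behavioural latch-mux DETFF with an active-low reset (illustrative only)."""

    def __init__(self, clear_latches_on_reset):
        # False: reset only forces the output (the problematic behaviour).
        # True:  reset also clears the stored values (the improved behaviour).
        self.clear_latches_on_reset = clear_latches_on_reset
        self.latch_hi = 0  # transparent while ck == 1, holds while ck == 0
        self.latch_lo = 0  # transparent while ck == 0, holds while ck == 1

    def step(self, d, ck, reset_n):
        """Apply one set of input levels and return the output Q."""
        if reset_n == 0 and self.clear_latches_on_reset:
            self.latch_hi = 0
            self.latch_lo = 0
        elif ck == 1:
            self.latch_hi = d
        else:
            self.latch_lo = d
        if reset_n == 0:
            return 0  # output is forced low while reset is asserted
        # The multiplexer selects the latch that is currently holding.
        return self.latch_lo if ck == 1 else self.latch_hi


def q_after_reset_release(clear_latches_on_reset):
    """Store a 1, pulse reset with the clock held high, return Q after release."""
    ff = LatchMuxDetff(clear_latches_on_reset)
    ff.step(d=1, ck=0, reset_n=1)  # latch_lo captures 1
    ff.step(d=1, ck=1, reset_n=1)  # rising edge: Q shows the stored 1
    ff.step(d=0, ck=1, reset_n=0)  # reset asserted: Q reads 0
    return ff.step(d=0, ck=1, reset_n=1)  # reset released, clock unchanged
```

In this model, `q_after_reset_release(False)` returns the stale 1 because the holding latch was never cleared, while `q_after_reset_release(True)` returns 0. Toggling the clock with D = 0 after the reset flushes the stale value in the first case, which mirrors the workaround described in the text.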


[Figure 5.4 schematic; labeled nodes: D, CK, R, N2, N3, N4, N5, Q]

Figure 5.4: Improved implementation of DETgago with reset active low control

5.2 Discussions

It should be noted that the implementation of the FIR filter with DETFFs is not as optimized as the one with SETFFs because the design tools are not compatible with DETFF cells. The layouts of the two filter implementations, with SETFF and with DETFF, are illustrated in Figures 5.5 and 5.6, respectively. At the time of writing, automated tools such as Synopsys, Design Planner, and Silicon Ensemble do not support devices or gates controlled by multiple clock signals; for instance, rising and falling clock edges are considered two clock signal controls. Simulation is feasible, but synthesis is not. Hence, in this study, many manual workarounds were needed to implement the FIR filter with DETFFs. This DETFF implementation is based on the SET implementation and is therefore far less optimized. If the automated tools were made to support multiple clock signal controls, the FIR area overhead of the DETFF implementation could have been much smaller and the interconnect routing could have been better optimized. The overhead could have been smaller than what is presented here, and hence more power savings could have been achieved. Therefore, advancement in automated tools is essential in order to take full advantage of DETFFs in VLSI designs.


Figure 5.5: Layout of the half-band FIR filter implemented with standard SETFF


Figure 5.6: Layout of the half-band FIR filter implemented with standard DETFF

Chapter 6

Conclusions

The proposed design is an attempt at a low power, low voltage DETFF. DETproposed exhibits the best energy efficiency, while DETgago consumes the least power among the various flip-flops studied. DETFFs demonstrate a power saving of 10% and an energy saving of as much as 23% compared to the standard SETFF. The standard SETFF, on the other hand, fails to deliver the data throughput described in the testbench at high clock frequency (1 GHz) and low voltage (1 V).
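The 10% and 23% figures are consistent with the 1.8 V columns of Table 5.1, the only supply voltage at which SETFF data is reported. The sketch below recomputes them under two assumptions that this section does not state outright: that total power is the sum PCK + PD + PFF, and that the quoted savings are taken at 1.8 V. All names are illustrative.

```python
# 1.8 V entries from Table 5.1: cell -> (PCK, PD, PFF) in µW and energy E in fJ.
POWER_UW = {
    "SETFF": (39.66, 26.0, 150.2),
    "DETgago": (25.04, 33.12, 134.2),
    "DETproposed": (25.42, 30.06, 147.5),
}
ENERGY_FJ = {"SETFF": 73.41, "DETgago": 57.52, "DETproposed": 56.41}


def saving_percent(value, baseline):
    """Reduction of `value` relative to `baseline`, in percent."""
    return 100.0 * (baseline - value) / baseline


def total_power_uw(cell):
    """Assumed total power: sum of the three tabulated components."""
    return sum(POWER_UW[cell])


power_savings = {
    cell: saving_percent(total_power_uw(cell), total_power_uw("SETFF"))
    for cell in ("DETgago", "DETproposed")
}
energy_savings = {
    cell: saving_percent(ENERGY_FJ[cell], ENERGY_FJ["SETFF"])
    for cell in ("DETgago", "DETproposed")
}

if __name__ == "__main__":
    for cell in ("DETgago", "DETproposed"):
        print(f"{cell}: power saving {power_savings[cell]:.1f}%, "
              f"energy saving {energy_savings[cell]:.1f}%")
```

Under these assumptions the largest power saving belongs to DETgago and the largest energy saving to DETproposed, in line with the statement that DETgago consumes the least power while DETproposed is the most energy efficient.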

In a digital filter setting, the benefits of DETFF are even more prominent. Global

clock net capacitance can be very large in a VLSI circuit. The usage of DETFF allows

one to maintain a constant throughput while operating at only half the clock frequency.

The FIR filter with DETproposed is found to consume 38% less power than the one with the standard SETFF cell. However, this comes at the price of a larger area, with an overhead of approximately 70%. Advancement in automated tools to support multiple clock control signals is essential in order to minimize the area overhead and to take full advantage of DETFFs in VLSI designs.
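The benefit of halving the clock frequency can be made concrete with the usual first-order model for the dynamic power of a full-swing clock net. This model is an assumption introduced here for illustration and is not stated in this chapter. The symbols C_CK (clock net capacitance) and f_CK (clock frequency) are likewise illustrative, and the clock net capacitance is assumed to be the same in both designs.

```latex
% First-order dynamic power of the global clock net (assumed model)
P_{CK}^{SET} = C_{CK}\, V_{dd}^{2}\, f_{CK}

% A DETFF design keeps the same throughput at half the clock frequency
P_{CK}^{DET} = C_{CK}\, V_{dd}^{2}\, \frac{f_{CK}}{2} = \frac{1}{2}\, P_{CK}^{SET}
```

Only the clock-net component is halved in this model. The power spent in the data path at constant throughput is unchanged, which is one reason the overall saving reported for the filter is below 50%.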

It has been shown that the usage of DETFFs in VLSI systems is beneficial in low

power, low voltage applications and in high speed applications. As illustrated in Table 5.1 of Section 5.1, the use of DETFFs in high speed applications can relax the clock rate and hence make low voltage operation feasible. As demonstrated in the half-band FIR filter design, DETFFs offer significant power savings over standard SETFFs in applications where constant data throughput is important and where the data transition probability, α, is low.

6.1 Future Work

The design flow and the automated place-and-route tools should be enhanced to include

and support the design and layout of DETFF implementations in VLSI circuits. This will

reduce the area overhead of the DETFF designs and improve their performance. It will

also encourage further research on DETFF circuits and topologies.

Appendix A

Glossary of Terms

BIST built-in self test

CMOS complementary metal oxide semiconductor

C2MOS clocked complementary metal oxide semiconductor

CPL complementary passgate logic

DET dual-edge triggered

DETFF dual-edge triggered flip-flop

DSP digital signal processing

FA full adder

FIFO first-in-first-out

FIR finite impulse response

IIR infinite impulse response

IC integrated circuit

I/O input and output

LSB least-significant bit

MSB most-significant bit

MOS metal oxide semiconductor

MOSFET metal oxide semiconductor field-effect transistor

NMOS n-channel metal oxide semiconductor

PDP power-delay product, which is also referred to as energy

PDPCQ energy calculated with clock-to-output delay

PDPDQ energy calculated with data-to-output delay

PIPO parallel-in-parallel-out

PISO parallel-in-serial-out

PMOS p-channel metal oxide semiconductor

RTL register transfer level

SAFF sense amplifier flip-flop

SDFF semi-dynamic flip-flop

SET single-edge triggered

SETFF single-edge triggered flip-flop

SIPO serial-in-parallel-out

SISO serial-in-serial-out

SOC system-on-a-chip

TSPC true single phase clock

VHDL VHSIC hardware description language

VHSIC very high speed integrated circuit

VLSI very large scale integration

ULSI ultra large scale integration

