
Transition Inversion based Low Power Coding for Buffered Bus Systems

By

Abinesh R

200742006

A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF

Master of Science (by Research) in

VLSI & Embedded Systems

Centre for VLSI & Embedded Systems Technologies International Institute of Information Technology

Hyderabad, India May 2010


Copyright © 2010 Abinesh R All Rights Reserved


Dedicated to my parents.


INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY Hyderabad, India

CERTIFICATE

It is certified that the work contained in this thesis, titled “Transition Inversion based

Low Power Coding for Buffered Bus Systems” by Abinesh R (200742006) submitted

in partial fulfilment for the award of the degree of Master of Science (by Research) in

VLSI & Embedded Systems, has been carried out under our supervision and it is not

submitted elsewhere for a degree.

Date: __________    Advisor: Dr. Suresh Purini, Asst. Professor, IIIT, Hyderabad

Date: __________    Advisor: Prof. Govindarajulu, Professor, IIIT, Hyderabad


Acknowledgements

I owe my deepest gratitude to my advisors, Dr. Suresh Purini and Professor

Govindarajulu whose encouragement, guidance and support enabled me to

accomplish this work.

I also thank Prof. M. Satyam for his feedback on various aspects of my work. I

also thank Prof. M.B.Srinivas for his feedback on the work. I am exceptionally

thankful to Bharghava for his valuable help and feedback on my work. I also

thank Mr. Deepak Tanna of Intel India for his comments on the practical aspects

of the work. I would like to thank all my friends and people in CVEST lab for the

terrific company during my study.

Finally, I want to thank my family for their unconditional love. Their constant encouragement and their faith in me have always given me the strength to try to achieve more and to be a better person.


Abstract

The field of electronics has undergone tremendous changes in recent times. Innovation has happened on multiple fronts, among which the growth of portable computing devices has led to a new wave of miniaturization and low power development.

The rise of high bandwidth internal devices such as GPUs (Graphics Processing Units) and multicore processors has led to the growth of high frequency busses such as PCI (Peripheral Component Interconnect) and the Northbridge/Southbridge FSB (Front Side Bus). The increasing use of multimedia streaming and external storage devices has also led to a massive increase in memory-intensive applications. The need for a simpler protocol has made most of these busses block data transfer systems, with end uses such as caches, DMA (Direct Memory Access) and video data transfer all tending towards block data transfer through some form of buffer. I/O pads driving external busses dissipate a major portion of the power in such systems, as they drive large capacitances and operate at a higher voltage. This forms a good part of the overall power consumption of a system.

In this thesis, a novel technique, Transition Inversion, is proposed to reduce the power consumption of buffered busses. This work outlines a data coding protocol by which bus transitions can be reduced for block data transfers such as DMA and cache line transfers. Block data transfers generally occur through data buffers. The prior knowledge of the data to be transmitted, available once it is stored in the buffer, is exploited in a serial fashion to reduce transitions on every bus line.

The technique considers a buffer of data to be a collection of bitstreams running in parallel over multiple lines. The transitions are counted in a bit serial fashion and the count is used to determine whether the transitions in any given bitstream have to be inverted. In this way the data running sequentially on any given line is encoded in such a way that it has a reduced number of transitions.

The technique is implemented in an optimized fashion using pipelining so

that it can be used in practical systems with only a slight compromise in

performance. This is achieved by calculating the decision as the data is being

loaded on to the buffer and doing the encoding on the fly. This is one aspect

which is lacking in most existing algorithms as they are not amenable to low

delay implementation. The critical parameter of delay, which limits bandwidth, is

reduced by 64% with the proposed technique and the pipelined implementation.

Also, the proposed technique and implementation do not require any extra bus lines, unlike most existing techniques. This avoids a rise in PCB (Printed Circuit Board) fabrication costs.


Theoretical analysis of transition inversion showed that its reduction in transitions is independent of increases in bus width. Most existing techniques suffer with increasing bus widths, requiring mitigation measures that themselves increase wiring requirements.

Transition inversion also scales more slowly in hardware complexity than a comparable technique, which makes it suitable for the increasing data widths of modern busses.

The analyses showed that the technique is suitable for modern bus architectures in terms of delay, space complexity and overhead power consumed. It was seen that the bandwidth need not be limited and that a good amount of power is still saved after accounting for the overhead.


Contents

List of Tables
List of Figures
List of Relevant Publications

Chapter 1 Introduction
    1.1 Low Power Systems
        1.1.1 System Architecture Level
        1.1.2 Circuit Level
    1.2 Computer Bus Systems
        1.2.1 Practical Considerations
    1.3 Transition Inversion
    1.4 Contributions
    1.5 Organization of Thesis

Chapter 2 Related Work
    2.1 General Busses
    2.2 Special Busses
    2.3 Bus Invert

Chapter 3 Transition Inversion
    3.1 System Scenario
        3.1.1 Mutual Capacitance
    3.2 Algorithm
    3.3 Explanation

Chapter 4 System Analysis
    4.1 Reduction Analysis
        4.1.1 Statistical Analysis
        4.1.2 Brute Force Analysis
        4.1.3 Reduction Efficiency
    4.2 Performance Penalty Analysis
    4.3 Tradeoff Analysis
    4.4 Power Analysis
    4.5 Error Detection Analysis

Chapter 5 Implementation and Analysis
    5.1 High Level Architecture
        5.1.1 Decision Circuit
        5.1.2 Encoder Circuit
        5.1.3 Decoder Circuit
    5.2 Complexity Analysis
        5.2.1 Decision Circuit
        5.2.2 Encoder/Decoder Circuit

Chapter 6 Experimental Analysis
    6.1 Benchmark Simulation
        6.1.1 Random Image Data
        6.1.2 SPEC2000 Benchmark
    6.2 Overall Delay Analysis
    6.3 Overall Power Analysis

Chapter 7 Conclusions
    7.1 Derivative Work
        7.1.1 Serial Busses
        7.1.2 Usage to Caches
        7.1.3 Usage to Clock Recovery
    7.2 Contributions
    7.3 Comparison
    7.4 Conclusions

Bibliography


List of Tables

Table 2.1 Bus Invert Decision Circuit
Table 2.2 Bus Invert Decision Encoder
Table 3.1 Block Arrangement
Table 3.2 Sample Coding Process
Table 4.1 Statistical Percentage Reduction
Table 4.2 Performance Metric, Ps variation with buffer depth (bitstream length)
Table 4.3 Error Detection Analysis
Table 6.1 A comparison of transition reduction for Bus invert and the Proposed Technique for Images
Table 6.2 Comparison of encoder delay in Bus Invert and Transition Inversion
Table 6.3 Power dissipation of extra circuitry for transition inversion


List of Figures

Figure 1.1 System Block Diagram
Figure 1.2 Component Block Diagram
Figure 2.1 Location of Bus Invert in Bus Chain
Figure 2.2 Bus Invert Block Diagram
Figure 3.1 Sample Bitstreams Waveform
Figure 4.1 Transition Reduction with Change in Buffer Depth
Figure 4.2 Effects of Frequency scaling with Buffer Depth
Figure 4.3 Proposed Technique Vs Parity Bit Technique
Figure 5.1 High Level Architecture
Figure 5.2 Decision Circuit (Transition Counter)
Figure 5.3 Encoder Circuit
Figure 5.4 Decoder Circuit
Figure 6.1 Comparison of Transition Inversion to Bus Invert for Images
Figure 6.2 Comparison of Transition Inversion to Bus Invert for SPEC2000
Figure 6.3 Transition reduction in Gray Coding
Figure 7.1 Serial Coding High Level Architecture
Figure 7.2 Cache Architecture for Transition Inversion


List of Relevant Publications

• Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, selected for publication in a special issue of Journal of Low Power Electronics, to appear in October 2010.

• Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, 23rd International Conference on VLSI Design, January 2010.

• Joint Winner of Intel Research Challenge (also known as Intel Scholar Program) 2008-2009. http://www.intel.com/cd/corporate/education/APAC/ENG/in/news/news43/419015.htm

• Abinesh R., Bharghava R., M.B. Srinivas, “Transition Inversion Based Low Power Data Coding Scheme for Synchronous Serial Communication”, IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 103-108, 2009.

• Bharghava R., Abinesh R., Suresh Purini, Govindarajulu Regeti, “Inexact Decision Circuits: An Application to Hamming Weight Threshold Voting”, selected for publication in a special issue of Journal of Low Power Electronics, to appear in October 2010.

• Bharghava R., Abinesh R., Suresh Purini, Govindarajulu Regeti, “Inexact Decision Circuits: An Application to Hamming Weight Threshold Voting”, 23rd International Conference on VLSI Design, January 2010.


Chapter 1

Introduction

Green computing has become an important phenomenon in system design in recent times. While high speed was considered a major thrust area for research a few years back, designing highly efficient systems has become the norm of the day [45][21][22][23]. Green computing or Green IT is used as an umbrella term for every type of system design, including mechanical, electronic and software design. In electronics, the effort to wring the last drop out of the available resources is concentrated mostly on designing low power systems [40][41][42]. Growth in this direction is also driven in no small measure by the explosive growth in portable computing devices. These battery driven devices have made power consumption an important parameter in system design and not something that is optimized as an afterthought.

1.1 Low Power Systems

Low power design, from a system perspective, happens at all levels of the digital electronic system stack, from the lowermost device level design to the topmost software design. At the intermediate levels too, a lot of effort is being expended to make systems run at low power while keeping the compromise in performance to a minimum. The increasing density of integrated circuits, as postulated by Moore’s law [50], makes low power systems even more important, since the power supply for such a densely integrated circuit may not keep pace in size with the miniaturization of the electronic components. Hence research is being carried out at all levels of the system stack. A system can consist of multiple components. They can be broadly classified, and a communication framework designed between them, as shown in Figure 1.1.

Figure 1.1 System Block Diagram

Each component in a system needs to communicate with the others through some form of communication bus mechanism, which is part of the component itself. The control core can be a CPU (Central Processing Unit), FPGA (Field Programmable Gate Array), microcontroller etc. The bus mechanism itself is standardized into multiple busses, each of which services specific parts of the system. A block diagram of a component from a communication perspective is shown in Figure 1.2. It splits a component into the component core, a medium access layer (MAC) and a physical layer. The component core is specific to the component, be it CPU, memory etc. The system architecture of each component is the high level representation of that component, while the circuit design is the exact design of that component.

Figure 1.2 Component Block Diagram

The proposed work deals exclusively with the physical layer of such a bus mechanism. The other layers of the stack, including the MAC, are left untouched in the proposed technique. Both the system architecture and the circuit design of the physical layer are dealt with in the proposed technique.


1.1.1 System Architecture Level

System architecture level work on processor design, operating systems and other higher levels of the stack has already borne fruit, with many innovations appearing in the market. The recession of 2007-2010 also contributed, since it drove innovation towards making systems more efficient so that costs could be kept low. One notable phenomenon to become popular out of the recession induced changes was the Netbook taking market share away from Laptops [5]. These computing devices were targeted at a lower price point than the then expensively priced Laptops. They were built on highly power efficient processor architectures (Intel Atom, Via Nano) and specially tuned versions of commercial operating systems (Windows XP with a netbook setting, Linux with features such as the tickless kernel [6]). Like any design tradeoff, these devices traded off performance for high efficiency [7]. This resulted in these devices sporting longer battery life than any laptop with a similar battery pack. Power efficiency came with the added incentive of a low price [8]. Rather than blindly using MHz (Megahertz) or MIPS (Million Instructions Per Second) as a metric for qualification, performance per watt (MHz/Watt or MIPS/Watt) has become the new metric for comparison. This is exemplified by the new list of the top 500 environmentally efficient supercomputers, which serves as an add-on to the well-known top 500 supercomputer list [19][20].


1.1.2 Circuit Level

Innovation goes into all levels of the stack. At the lower levels, innovations in device physics have contributed to decreasing power. But this may not continue for long, since many limiting factors come into the picture as the transistor feature size is reduced. They range from lithographic inaccuracies to process variations [9].

In another direction of research, power management protocols have been developed incorporating various features that take into account circuit related innovations. Some of the important features are voltage and clock scaling [10][12]. In these techniques, the clock frequency and core voltage of the processor are changed depending on the load. If the system is running fewer processes, it is scaled down to a lower voltage and clock frequency. This is done in a tightly coupled fashion between the CPU, motherboard and OS (Operating System). ACPI (Advanced Configuration and Power Interface) is one such widely used protocol that supports multiple types of states, such as processor, performance, global and device states [11]. Power management techniques are playing a wider role with the advent of multi-core processors. Multi-core processors, having multiple mostly similar cores, can be power managed based on load by switching cores on or off. Notable technologies to move out of research into the mainstream in recent times are Intel’s TurboBoost and AMD’s TurboCore [47].

Power management is also developing in the direction of having multiple circuits for the same functionality with different power/performance parameters and switching between them depending on the power saving required. One good example is the Optimus technology from Nvidia, which pairs a high performance Nvidia GPU with a low power Intel IGP (Integrated Graphics Processor). This technology seamlessly switches between the two chips depending on the performance required [24][25].

Another example is the Lenovo Ideapad U1. This notebook contains two separate computer systems in one casing. The LCD display itself contains an ARM processor based system that can be detached and used separately as a tablet. When the display is attached to the notebook chassis, it becomes an Intel processor based system. The two systems run separate OSs (Operating Systems) with separate memories [26][27][28]. Also, as per the law of diminishing returns, any small improvement in the output generally requires a large change in input once a high performance state has been achieved [46]. Most existing low power techniques work on the premise of tuning various parameters to attain low power rather than designing for low power.

In the following body of work, a novel method of designing off-chip communication buses is proposed and the results are explained to show its efficacy in reducing power consumption.

1.2 Computer Bus Systems

A typical computer system consists of various components, including the control core (CPU/FPGA etc), communication buses and various peripheral I/O (Input/Output) mechanisms. Busses constitute an important resource for addressing and data transfer in the implementation of most electronic systems. They are used throughout the system, from the basic addressing mechanism between CPUs and memory right up to the high bandwidth buses needed for graphics applications.

These off-chip busses consume dynamic power, $p_{org}$, which is given by

$$p_{org} = \alpha f C_T V_{dd}^2$$

where $V_{dd}$, $f$, $C_T$ and $\alpha$ represent the drain voltage, frequency of operation, line capacitance and switching activity respectively. Switching activity is a measure of how often the value of the data on a line changes. Dynamic power is consumed whenever there is a change of value on the line. Any transition, either $1 \rightarrow 0$ or $0 \rightarrow 1$, consumes this dynamic power.
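The relation between switching activity and dynamic power can be illustrated with a minimal Python sketch. The function names, the sample bitstream and the electrical values below are assumptions chosen purely for illustration and are not taken from the thesis; switching activity is estimated here as the fraction of clock intervals in which a single line toggles.

```python
def count_transitions(bits):
    """Count 0->1 and 1->0 changes between consecutive bits on one line."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def switching_activity(bits):
    """Estimate alpha as the fraction of intervals in which the line toggles."""
    if len(bits) < 2:
        return 0.0
    return count_transitions(bits) / (len(bits) - 1)


def dynamic_power(alpha, f_hz, c_farads, vdd_volts):
    """Evaluate p_org = alpha * f * C_T * Vdd^2 for a single bus line."""
    return alpha * f_hz * c_farads * vdd_volts ** 2


# Hypothetical example: one bus line carrying eight consecutive bits.
line = [0, 1, 1, 0, 1, 0, 0, 1]
alpha = switching_activity(line)

# Assumed illustrative values: 100 MHz bus, 10 pF line, 3.3 V I/O.
power_watts = dynamic_power(alpha, 100e6, 10e-12, 3.3)
print(f"alpha = {alpha:.3f}, dynamic power = {power_watts * 1e3:.3f} mW")
```

Since the power is linear in the switching activity, halving the number of transitions on a line halves its dynamic power at a fixed frequency, capacitance and voltage.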

Recent advances in computing uses such as graphics and scientific computing have raised data transfer requirements to such high levels that bus interfaces are constantly being pushed up to higher performance points with respect to bandwidth, usability etc. These applications are highly memory intensive rather than just CPU intensive. They need enormous amounts of data to be transferred for computation, which has increased the bandwidth requirements of off-chip busses. This in turn entails higher frequencies, which lead to higher power consumption.

Reducing this off-chip bus power consumption has become one of the key issues in low power system design. The fact that the power consumed in bus accesses accounts for a significant fraction of the total power consumed in VLSI (Very Large Scale Integration) systems has been independently established by many researchers [13][14][15]. This is because the self-capacitance of busses is quite large in comparison to the capacitance of other data-path units such as those within a CPU. The capacitance tends to be higher for an off-chip bus than for on-chip interconnects since the traces are longer. The busses are also operated at higher voltage levels than a CPU. The reduction in operating voltage achieved within a CPU could not be carried over to busses, since these are external to the chip and noise margins generally prevent any further reduction in off-chip bus voltages.

There are essentially two ways to reduce power consumption in busses. The first involves minimizing bus accesses, either by reducing the number of data-path units connected to large busses [14] or by reducing the number of accesses to READ/WRITE busses for large memory units through algorithm transformations [16]. The second is to reduce bus transition activity. In this regard many researchers have studied the reduction of bus transition activity by resorting to coding, similar to error-correcting codes [17][18]. This approach has been effective, but the delay caused by the encoder/decoder limits the maximum bandwidth at which the bus can operate. The extra circuitry causes a drop in performance, rendering it unsuitable for most bus systems. Moreover, the power consumed by the encoder and the decoder has to be less than the power saved as a result of activity reduction on the bus. These constraints, which are imposed on the encoder/decoder logic, limit the space of possible encoding solutions and have prevented any of these techniques from being used in practical systems.

1.2.1 Practical Considerations

Most bus encoding techniques involve overhead in terms of space complexity, delay and their own operational power. The delay of the circuit is the time the circuit takes to encode the data. This limits the bandwidth of the system, since the circuit must be able to encode a data element before the next one arrives. The overhead power incurred by the extra circuitry also reduces the effectiveness of the technique. This is particularly true for on-chip interconnects, where the interconnect voltage, capacitance and frequency will be comparable to those of the circuit itself. Off-chip busses stand a better chance with such techniques, since the I/O voltage and capacitance will be much higher than those of the internal circuitry. Still, most techniques have delay overheads that severely restrict their bandwidth. For these reasons, the bus systems in popular use at present have no form of low power bus coding.

1.3 Transition Inversion

Transition inversion is the proposed novel technique for reducing power consumption in buffered bus systems [1][2][3][4]. One aspect of buffered bus systems is that they can literally see into the data that is going to be put on the bus in the future. This is possible because a given block is transmitted only after it is completely loaded onto the buffer. In the proposed technique, transitions along a line are manipulated for reduction with less delay. The sequential nature of the technique makes it suitable for a pipelined approach, thus avoiding bandwidth limitations.
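The per-line idea of counting the transitions in a buffered bitstream and then deciding whether to invert them can be illustrated with a short Python sketch. It is only an illustration under stated assumptions and not the algorithm of Chapter 3: the decision rule (invert when more than half of the bit positions carry a transition), the initial line state of 0 and all function names are assumed here. How the invert decision reaches the receiver is not modelled; the flag is simply returned alongside the encoded stream.

```python
def transition_vector(bits, initial=0):
    """Mark each position with 1 where the line value changes, else 0."""
    prev = initial
    marks = []
    for b in bits:
        marks.append(b ^ prev)
        prev = b
    return marks


def rebuild_from_transitions(marks, initial=0):
    """Turn a transition vector back into the sequence of line values."""
    prev = initial
    values = []
    for m in marks:
        prev ^= m
        values.append(prev)
    return values


def encode_line(bits, initial=0):
    """Invert the transitions of one bitstream if that lowers their count.

    Assumed decision rule: invert when transitions exceed half the length.
    """
    marks = transition_vector(bits, initial)
    invert = 2 * sum(marks) > len(marks)
    if invert:
        marks = [m ^ 1 for m in marks]
    return rebuild_from_transitions(marks, initial), invert


def decode_line(encoded, invert, initial=0):
    """Undo encode_line given the invert flag (flag transport not modelled)."""
    marks = transition_vector(encoded, initial)
    if invert:
        marks = [m ^ 1 for m in marks]
    return rebuild_from_transitions(marks, initial)


# Hypothetical bitstream for one bus line, taken from a buffered block.
stream = [1, 0, 1, 0, 1, 1, 0, 1]
encoded, flag = encode_line(stream)
print(sum(transition_vector(stream)), sum(transition_vector(encoded)), flag)
assert decode_line(encoded, flag) == stream
```

Under this assumed rule an encoded bitstream of length n never carries more than n/2 transitions, because a stream with t transitions becomes one with n - t transitions when inverted.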

The one limitation of the proposed technique is that it can be used only with buffered busses, but this is not much of a practical limitation since most modern busses tend to be buffered in design. The proposed technique of transition inversion mitigates most of the practical drawbacks of existing comparable techniques. Various analyses supporting this point are explained in the following sections.

1.4 Contributions

The major contributions in this thesis are as follows:

1. A novel technique for reducing bus power is proposed that makes use of the buffered nature of modern busses.

2. An optimized circuit design for the technique has been produced using a pipelined approach, and a complexity analysis of the circuit design has been carried out.

3. The algorithm and the circuit have been analysed to test their efficacy.

4. The technique has been compared with other techniques.

5. Applications have been proposed as usage scenarios.


1.5 Organization of Thesis

A motivation for the work has been given in the preceding sections. The rest of the thesis is organized as follows:

Chapter 2 reviews some of the current work done in this field.

Chapter 3 explains the proposed technique of Transition Inversion.

Chapter 4 presents an analysis of the technique from an algorithmic perspective.

Chapter 5 describes the pipelined circuit implementation of the technique and

analyses it.

Chapter 6 analyses the experimental results of the designed circuit by using

various benchmarks and comparing with other techniques.

Chapter 7 describes applications that have been derived out of the technique and

concludes the work.


Chapter 2

Related Work

From a data-characteristic perspective, existing work on reducing

transitions has been developed both for general data and for data that has some

special nature. General data is the data that flows on a general bus and does not

follow any pattern. Special data is data with a specific pattern. Examples of

special data are address bus traffic, audio/video data etc. Most research has gone

into special busses, more specifically address busses, since it is much simpler

and the hardware overhead is not too much when compared to general busses.

The simplicity itself is in terms of the delay incurred with special busses which

generally has been shown to be less compared to that of general busses.

2.1 General Busses

General data can contain anything that runs on a general purpose bus like

PCI (Peripheral Component Interconnect), Northbridge/Southbridge FSB etc.

They are used as general purpose busses to interconnect various components in

a system. These busses can carry application binary data, user content etc. They

can also carry graphics/audio content, but such content is given no special attention and is

treated as general purpose data itself.


Of general purpose busses, research has gone into techniques that

depend on the data that is known only at runtime. One of the most often cited

encoding methods is the Bus Invert method [13]. Bus-invert selects between the

original and the inverted pattern in a way that minimizes the switching activity on

the bus. The resulting patterns together with an extra bus line (to notify whether

the original data or its complement has been sent) are signalled over the bus.

The proposed technique, also being designed for a general bus, will be

compared with bus invert in all respects in the following body of work. Bus invert

is discussed in detail in the following sections.

Other encoding techniques include Gray Coding [38] which takes the

serial XOR between data words to reduce transitions. This is traced to the fact

that Gray code always has been used for reducing transitions. Frequent Value

encoding (FVE) is another technique [30] that results in a significant reduction in

transitions, but has not been considered here, due to the excessive run time

overhead involved. It involves maintaining a codebook of frequently used values

and encoding them using code words with fewer transitions. It needs

significant memory area for the codebook and the overhead circuitry itself limits

bandwidth.

The proposed technique is for general busses and is compared with bus

invert and Gray coding in the upcoming chapters.


2.2 Special Busses

Special data are those data that are recognized by the system itself as

serving some very specific purpose, typically without software intervention.

Examples can include the HDA (High Definition Audio) audio chipsets found in

most modern machines. These chipsets by themselves have no functionality to

encode/decode/convert audio. Other Codec (Coder Decoder) chips are

connected to them by a special purpose bus that is specifically designed for

audio transfer. This bus is also defined as a standard so that HDA chipsets and

Codec chipsets can be used interchangeably. Whatever data that is transferred

on this bus between the HDA chip and the Codec chip, it has audio data inherent

to it. This special nature of the bus has been used in research to design low

overhead compression techniques which lead to less use of busses and

thus lower power. One example is that of FLAC (free lossless audio codec) which

pushes most of processing onto the encoder keeping the decoder simple. With

the decoding happening on-chip which operates at a lower voltage than offchip

busses, power is reduced if the bus is used less. Since most audio usage is in

terms of playback or decoding, this also leads to less processing and thus

less power [48].

Another example of such a special bus that is a bit more complex is the

texture memory transfer in GPUs. GPUs make use of special texture memory to

store textures which will get warped onto 3D (3 Dimensional) models to generate

realistic looking 3D graphics. The bus that interconnects a GPU to such a

memory is also specific in nature. Here also techniques have been proposed in


literature to encode data [29]. It involves coding the image data such that it is

compressed and thus less data is put on the bus. This of course involves

computation to extract the data at the GPU, but the savings made on an offchip

bus tend to be greater than the overhead of internal computation.

In the domain of address busses, Musoll et al. proposed the working zone

method [15]. Their method takes advantage of the fact that data accesses tend to

remain in a small set of working zones. This technique sends only the offset of

the location being addressed with respect to the previous addressed location

along with information about the current working zone. This too entails limitations

on bandwidth since a search of the set of working zones has to be performed

before an address can be sent. Another popular technique is Asymptotic Zero-

Transition Encoding [14] which operates under the fact that most addresses tend

to be consecutive. So the receiver device can predict the address itself and be

ready with data. Only exceptions to a previously agreed protocol need to be

transmitted by the sender.

Another technique to reduce transitions is Gray coding for

addresses. Most of the existing techniques make use of the repeating patterns in

address buses to reduce address bus transitions [33][34][35][36]. Gray coding

works by taking a bit serial XOR between address words, which generally tend to

be consecutive, giving rise to data that can be encoded with fewer

bits.
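As a small illustration of why Gray coding suits consecutive addresses, the sketch below uses the standard binary-reflected Gray code (each word XORed with itself shifted right by one bit). This is an assumption about the variant; the cited works may use a different formulation, and the function names are hypothetical.

```python
def binary_to_gray(value: int) -> int:
    """Binary-reflected Gray code: XOR each bit with its more significant neighbour."""
    return value ^ (value >> 1)


def hamming_distance(a: int, b: int) -> int:
    """Number of bus lines that toggle when the bus goes from word a to word b."""
    return bin(a ^ b).count("1")


# Eight consecutive addresses, sent first as plain binary and then as Gray code.
addresses = list(range(8))
gray_codes = [binary_to_gray(a) for a in addresses]

binary_toggles = sum(hamming_distance(a, b) for a, b in zip(addresses, addresses[1:]))
gray_toggles = sum(hamming_distance(a, b) for a, b in zip(gray_codes, gray_codes[1:]))

print(binary_toggles, gray_toggles)
```

For this consecutive sequence every Gray-coded step toggles exactly one line, while the plain binary sequence toggles several lines whenever a carry ripples.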

There is no existing literature on bus coding methodologies for block data

transfer, other than Serial Bus Invert [37], which encodes blocks of data rather


than individual data words. Serial Bus Invert is similar to bus invert, the difference

being that the decision bits are transmitted as a word at the end of the block data

transfer. The technique proposed in this work is compared with Bus Invert and Gray

coding [38][32].

2.3 Bus Invert

Bus invert works by counting the number of transitions, which involves

XORing of the present and previous data. If the number of transitions is more

than half the bus width, the inverted data is transmitted, else the original data is

transmitted. A separate line is also added to the bus which will carry the decision.

This is an overhead to the design of the system which requires extra circuitry as

well as traces. The decision bit will signify whether the data that is on the bus is

the original data or its complement.

Figure 2.1 Location of Bus Invert in Bus Chain

The bus invert logic has to be added to an existing bus interface just

where the external interface begins. It can be placed just before the level


converter, which converts the chip internal voltage levels to external voltage

levels. The logic takes its input from the bus core system

and processes it before feeding the level converter. A block diagram for the location of

the bus invert logic in a bus chain is shown in Figure 2.1. The bus invert algorithm

is explained below:

_______________________________________________________________ Algorithm 1: Bus Invert_____________________________________________

1. Count the transitions between the data on the bus and the next data

that is to be put on bus

2. if transitions count < half of the bus width

3. Assign next data to bus

4. else

5. Invert the next data and assign the complement to bus

________________________________________________________________

A block diagram of the Bus Invert system is shown in Figure 2.2. A

sample coding process for a sample data is shown in Table 2.1.

Figure 2.2 Bus Invert Block Diagram


Table 2.1 Bus Invert Decision Circuit

Bit No.                                                  1 2 3 4 5 6 7 8
Current Data on bus                                      1 0 1 0 1 0 1 1
Next Data to be put on Bus                               0 1 1 1 0 1 0 1
XOR of present and next data (Transition Vector)         1 1 0 1 1 1 1 0

In the given example the number of transitions is 6 which is more than half

the bus width, 4. So the data is inverted and then sent. The decision is sent on a

separate line. An XOR between the current data and the next data that is put on

the bus shows that the transitions are reduced to 2 which is also given by (N-t)

where N is the bus width and t is the original number of transitions. The encoding

process is shown in Table 2.2.

Table 2.2 Bus Invert Decision Encoder

Bit No.                          1 2 3 4 5 6 7 8
Next Data to be put on Bus       0 1 1 1 0 1 0 1
Next Data that is put on Bus     1 0 0 0 1 0 1 0
Current Data on bus              1 0 1 0 1 0 1 1
XOR of current and next data     0 0 1 0 0 0 0 1
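The bus invert decision and encoding steps can be sketched in a few lines. This is only an illustrative software model, not the circuit: the function name is hypothetical, words are written as bit strings, and the word is inverted only when the transition count is strictly more than half the bus width, as described in the text.

```python
def bus_invert(current: str, nxt: str) -> tuple[str, int]:
    """Return (word actually driven on the bus, value of the extra invert line)."""
    # Transition vector: positions where the bus would have to toggle.
    transitions = sum(c != n for c, n in zip(current, nxt))
    if 2 * transitions > len(nxt):  # more than half the bus width
        inverted = "".join("1" if bit == "0" else "0" for bit in nxt)
        return inverted, 1
    return nxt, 0


# Values taken from Tables 2.1 and 2.2.
sent, invert_line = bus_invert("10101011", "01110101")
remaining = sum(c != s for c, s in zip("10101011", sent))
print(sent, invert_line, remaining)
```

With the values of Tables 2.1 and 2.2, the six original transitions drop to N - t = 2 on the data lines. The toggling of the extra invert line itself is not counted in this sketch.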

The whole operation involves a chain of full adders to count the transitions

and then perform another XOR on the data that has to be sent. All these

operations have to be done before the next data is to be put on the bus. The


chain of full adders operating on the output of the array of XOR gates contributes to

the delay in taking a decision. This delay is the parameter which limits the

maximum bandwidth of the system. Beyond this the encoder delay also has to be

taken into account which involves a parallel XOR to perform controlled inversion.

This entire set of operations has to be over by the time the next data arrives

leading to a restriction on bandwidth.

Existing work in the field of low power bus coding has been discussed with

special attention paid to Bus Invert. As seen the major limiting factor for most of

these techniques is that of delay and the following limit on bandwidth.

To mitigate the issues, the technique of Transition Inversion is proposed and

discussed in the next chapter.


Chapter 3

Transition Inversion

This chapter details the proposed algorithm to be used in inverting the

transition states to achieve power reduction.

3.1 System Scenario

The algorithm is proposed specifically for offchip block data transfer

busses. In block data transfer, data is generally loaded onto a buffer and then put

on the bus. Each line in the bus is a serial line that will transmit one particular bit

position of all data words that are put on the bus. A typical block can be as shown

in Table 3.1. The buffer will usually be able to hold more data than just one

block of data but transmission will still happen with a granularity of one block.

Table 3.1 Block Arrangement

Buffer data    Bit Pattern

Data 1 D1,8 D1,7 D1,6 D1,5 D1,4 D1,3 D1,2 D1,1

Data 2 D2,8 D2,7 D2,6 D2,5 D2,4 D2,3 D2,2 D2,1

Data 3 D3,8 D3,7 D3,6 D3,5 D3,4 D3,3 D3,2 D3,1

Data 4 D4,8 D4,7 D4,6 D4,5 D4,4 D4,3 D4,2 D4,1

Data 5 D5,8 D5,7 D5,6 D5,5 D5,4 D5,3 D5,2 D5,1

Data 6 D6,8 D6,7 D6,6 D6,5 D6,4 D6,3 D6,2 D6,1

Data 7 D7,8 D7,7 D7,6 D7,5 D7,4 D7,3 D7,2 D7,1

Data 8 D8,8 D8,7 D8,6 D8,5 D8,4 D8,3 D8,2 D8,1


The bits taken in the vertical direction form a bitstream that travels in a line.

Taking the bus into account, it is a collection of bitstreams running in parallel. The

columns in the table represent the lines of the bus. The rows represent each data

element of the buffer. When transmitting, the bits from each element travel in

parallel across the lines. In the perspective of a line, bits of all data elements of

one position are transmitted sequentially. This forms bitstreams on all lines

composed of corresponding bits of all data elements.
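The row/column view of Table 3.1 can be illustrated with a short sketch. It assumes, purely for illustration, that each buffered word is written as a bit string; the function name and the sample block are hypothetical.

```python
def line_bitstreams(block: list[str]) -> list[str]:
    """Turn a block of data words (rows) into one bitstream per bus line (columns)."""
    # zip(*block) walks the block column by column, i.e. one bus line at a time.
    return ["".join(column) for column in zip(*block)]


# A small hypothetical block: three words on a four-line bus.
block = ["1010", "0110", "1100"]
streams = line_bitstreams(block)
print(streams)
```

Each returned string is the sequence of bits that one line carries while the block is transmitted, which is the unit the proposed technique operates on.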

The algorithm is developed for offchip bus systems for a variety of reasons.

One reason as mentioned previously is that the saving that can be achieved out

of an offchip bus can be much higher than an onchip interconnect. Designing

around delay restrictions will be made slightly easier since the bus frequency

generally tends to be lower than the CPU frequency. Also, the voltage and

capacitance of an offchip bus tend to be higher than those of internal circuitry.

Thus the extra power consumed by the overhead circuitry will not lead to a

significant reduction in overall power saving as will be shown in the following

chapters.

3.1.1 Mutual Capacitance

Another factor that is important in bus design is that of mutual

capacitance that can lead to cross talk and its own power dissipation. Cross talk

is the interference of one line with the neighbouring lines. If there is considerable


mutual capacitance between lines, high frequency signals can easily leak out to

the neighbouring lines and corrupt the data.

With offchip busses, self capacitance plays a major role compared to that

of mutual capacitance. This is because mutual capacitance falls off exponentially

with increase in trace spacing. The trace spacing of an offchip bus is generally

much higher compared to that of an onchip bus. Thus effects of mutual

capacitance can be ignored for most calculations done on offchip busses [39].

3.2 Algorithm

Before transmission, the number of transitions on a line is counted. This is

just counting the transitions of the bitstream in that line. This can be done by a

simple XOR gate between consecutive bits and counting the number of ‘1’s. If the

number of transitions is more than half the number of data words, the transition

states between the bits are inverted: each transition is made a non-transition

and vice versa. If not, the bit stream is transmitted as such. In case

transition inversion is needed, the scheme operates by observing the transition

states between any 2 bits and setting the encoded second bit to be the same as

the previous encoded bit if there is a transition. If there is no transition, the

previous encoded bit is inverted. The decision bit signifying transition inversion is

transmitted before transmitting the encoded data. Also, the first bit of the

bitstream is transmitted as such. This has to be done on all lines. Since the data


is sent as a block, the extra bits on each line will signal for the respective bit

streams. The algorithms for encoder and decoder are discussed below.

_______________________________________________________________ Algorithm 2: Transition Inversion Encoder______________________________

1. Count the transitions between the bits of the bitstream of a line as it is

being loaded into buffer

2. if transitions count < half of the buffer depth

3. Assign the unmodified bitstream to the bus line

4. else

5. Transmit decision bit

6. Transmit first bit of bit stream

7. for the rest of the bits in bitstream

8. If present bit on line ≠ next bit

9. Assign present bit as next bit

10. else

11. Assign complement of present bit as next bit

12. end if

13. end for

14. end if

________________________________________________________________

_______________________________________________________________ Algorithm 3: Transition Inversion Decoder______________________________

1. if decision bit signifies inversion

2. Take the first bit as first decoded bit


3. for the rest of the bits in the incoming bitstream

4. if next received bit ≠ previous received bit

5. Assign previous decoded bit as next decoded bit

6. else

7. Assign complement of previous decoded bit as next decoded bit

8. end if

9. end for

10. else

11. Take the first bit as first decoded bit

12. for the rest of the bits in the incoming bitstream

13. Assign decoded bit as incoming bit

14. end for

15. end if

________________________________________________________________

The transitions in the bit stream, transmitted on a line, can be reduced by

the aforementioned scheme. Each line is processed independent of each other. If

there is a need for transition inversion, then the following steps are followed to

obtain encoded data. Let the data bit that is to be transmitted next be bd and

previous data bit be bdp. The previous transmitted bit is btp. The next transmitted

bit will be

bt = btp if bd ≠ bdp

= !btp if bd = bdp
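The encoding rule above can be sketched for a single bus line as follows. This is a software model under stated assumptions, not the circuit of Chapter 5: the function names are hypothetical, the bitstream is a string of '0'/'1' characters, the decision bit is returned separately, and inversion is applied when the transition count is at least half the buffer depth (the complement of the "less than half" test in Algorithm 2).

```python
def count_transitions(bits: str) -> int:
    """Number of positions where consecutive bits of the stream differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def ti_encode(bits: str) -> tuple[int, str]:
    """Return (decision bit, bitstream to transmit) for one bus line."""
    if not bits or 2 * count_transitions(bits) < len(bits):
        return 0, bits  # few transitions: send the stream unmodified
    encoded = [bits[0]]  # the first bit is always sent as such
    for prev_data, data in zip(bits, bits[1:]):
        if data != prev_data:
            # Transition in the data becomes a non-transition on the line.
            encoded.append(encoded[-1])
        else:
            # Non-transition in the data becomes a transition on the line.
            encoded.append("1" if encoded[-1] == "0" else "0")
    return 1, "".join(encoded)


print(ti_encode("10101011"))
```

For the stream 10101011 this yields the decision bit 1 and the encoded stream 11111110, which has a single transition.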


In the receiver, the reverse logic needs to be applied. When the bit stream has

been signaled as modified, the following steps are followed to decode data.

The previous and current received bits are assumed to be ‘brp’ and ‘br’

respectively. The previous decoded bit is assumed to be bdp. The current

decoded bit will be

bd = bdp if brp ≠ br

= !bdp if brp = br
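A matching sketch of the decoding rule is given below, under the same illustrative assumptions (hypothetical function name, bit strings, decision bit passed in separately from the data bits).

```python
def ti_decode(decision: int, received: str) -> str:
    """Recover the original bitstream of one bus line from the received one."""
    if not decision or not received:
        return received  # stream was sent unmodified
    decoded = [received[0]]  # the first bit was transmitted as such
    for prev_rx, rx in zip(received, received[1:]):
        if rx != prev_rx:
            # Transition on the line means no transition in the data.
            decoded.append(decoded[-1])
        else:
            # No transition on the line means a transition in the data.
            decoded.append("1" if decoded[-1] == "0" else "0")
    return "".join(decoded)


print(ti_decode(1, "11111110"))
```

Decoding the stream 11111110 with the decision bit set gives back 10101011.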

The encoding and decoding is done on the fly, to reduce performance

losses. For example in the bit stream 10101011, the number of transitions is 6.

This is more than half the buffer depth, which is 4. Thus this

stream is to be modified according to the algorithm described above. The first bit

is transmitted as such, without any change. This is described in Table 3.2. The

encoded data has only one transition and the process is explained in the next

section.

Key: NT – No Transition, T – Transition, NC-No Change

Table 3.2 Sample Coding Process

Bit No.              1   2   3   4   5   6   7   8
Bit stream           1   0   1   0   1   0   1   1
Transition State     NC  T   T   T   T   T   T   NT
Encode state         NC  NT  NT  NT  NT  NT  NT  T
Encoded Bit stream   1   1   1   1   1   1   1   0
Decode state         NC  T   T   T   T   T   T   NT
Decoded bit stream   1   0   1   0   1   0   1   1


One very important observation to be made out of the proposed technique

is that it does not involve any additional bus lines unlike most other techniques

including Bus Invert. This is a major improvement since it does not add to the

complexity of board designs where every line/trace is extensively tested for noise

and shielding, and it does not increase the board resources used.

3.3 Explanation

To elucidate further, a few samples of the above table can be taken and

seen. Comparing bits 1 and 2, it can be seen it has a transition from 1 to 0. So

the transition state of the encoded bitstream’s bits 1 and 2 should be a non-

transition. It will mean making the encoded second bit to be the same as the

encoded first bit. Since the encoded first bit, which is same as the original bit, is 1

the encoded second bit is also made as 1.

Taking another case of bits 7 and 8, there is no transition. Both bits are 1.

For such a case during encoding, it has to be made a transition for sake of

consistency. This will mean the encoded eighth bit has to be the complement of

the encoded seventh bit. Since the encoded seventh bit is 1, the eighth bit is

encoded as 0. The waveforms of the original, encoded and decoded bitstreams

are shown in Figure 3.1.


Figure 3.1 Sample Bitstreams Waveform

Putting all of the bits together, the modified bit stream is 11111110 with

the decision bit signifying 1 to indicate that a transition inversion has been done.

At the receiver, if any bus line is signalled as modified, when there is a

transition, it is made as a non-transition, and when there is no transition, it is

made as a transition. The same process of the encoder is repeated in decoder.

Taking encoded bits 1 and 2, it can be observed that there is no transition.

Both are 1. Since the decision bit indicates that transition inversion has taken

place, the transition state has to be complemented. So the decoded bits 1 and 2

should be such that there is a transition between them. This will require the


complement of the first bit to be taken as the second bit. This generates the

second bit and can be continued till the entire bitstream is decoded.

Doing all this, the decoded bit stream is 10101011 which is the same as

the bitstream that was started with. The number of transitions in the original

bitstream was 6 and after encoding it became 1. This is straightforward to see

since the transitions are only complemented.

In a generic case of N bits, there can be (N-1) transition states. So if the

number of transitions is more than (N-1)/2, then transition inversion has to be

done. If the number of transitions is To, the number of transitions in the encoded

data will be Te = (N-1)-To. The proposed technique has the propensity to reduce

transitions at a slight cost of bandwidth utilization which is due to the

transmission of the extra bit before the actual transmission starts. But the other

limiting factor of delay does not affect the proposed technique unlike other

existing techniques as will be shown in the subsequent chapters. As such, the

hardware for the proposed technique will be located at various parts of the bus

chain. A high level architecture for the proposed technique is discussed in

Chapter 5.

An analysis of the algorithm from a purely theoretical perspective is carried

out in the next chapter. It discusses the efficacy of the proposed technique in

reducing the transitions as well as the performance trade off involved.


Chapter 4

System Analysis

This chapter details a theoretical analysis of the proposed algorithm with

respect to its reduction efficacy, performance trade-off as well as power.

4.1 Reduction Analysis

The theoretical analysis of the efficiency is determined in terms of the

average reduction that can be obtained. The analysis has been done in two

independent ways: A theoretical one and a simulation based one. Both the

results tally with each other and are discussed below.

4.1.1 Statistical Analysis

This analysis is an analytical one. For an N length buffer system, where

the transitions are taken between consecutive bits, a maximum of (N-1)

transitions are possible in one bit stream. Taking the binomial distribution into

account, the number of possibilities of i transitions is $\binom{N-1}{i}$. $T_{org}$ and $T_{mod}$ are taken

to be the number of transitions in the original data patterns and the number of

transitions in the modified data patterns respectively. These entities can be

calculated as follows.

Probability of a bit stream with i transitions, taken over the $2^{N-1}$ possible

patterns of the (N-1) transition states (the value of the first bit does not affect

the transition count):

\[ P(i) = \frac{\binom{N-1}{i}}{2^{N-1}} \]

Total number of transitions:

\[ T_{org} = \sum_{i=0}^{N-1} \binom{N-1}{i} \, i \]

Average number of transitions of original data:

\[ E(t_{org}) = \sum_{i=0}^{N-1} P(i) \, i = \frac{1}{2^{N-1}} \sum_{i=0}^{N-1} \binom{N-1}{i} \, i \]

Transition inversion is done when the number of transitions is more than or

equal to N/2. The number of transitions in the modified data will be (N-1-i) for i

transitions in the original data. The probability P(i) of the original bit stream

having i transitions is the same as before, so

\[ T_{mod} = \sum_{i=0}^{N/2-1} \binom{N-1}{i} \, i + \sum_{i=N/2}^{N-1} \binom{N-1}{i} \, (N-1-i) \]

Average number of transitions of modified data:

\[ E(t_{mod}) = \frac{1}{2^{N-1}} \left[ \sum_{i=0}^{N/2-1} \binom{N-1}{i} \, i + \sum_{i=N/2}^{N-1} \binom{N-1}{i} \, (N-1-i) \right] \]

The reduction efficiency can be given by comparing $E(t_{org})$ and $E(t_{mod})$.

The reduction percentage is given by

\[ R = \frac{E(t_{org}) - E(t_{mod})}{E(t_{org})} \times 100 \]

The metric of average reduction can be calculated by taking the difference

between the number of transitions in the modified data patterns, and unmodified

data patterns. This statistical calculation of the algorithm has been carried out for

word lengths of 8, 16, 32, and 64.
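The sums above can be evaluated directly. The following is a minimal sketch, assuming an even buffer depth and the inversion rule "number of transitions greater than or equal to N/2" stated above; the function name is hypothetical. It is shown here only for depths 8 and 16.

```python
from math import comb


def expected_reduction_percent(n: int) -> float:
    """Reduction percentage R for an n-bit stream, from the sums T_org and T_mod."""
    # T_org: every pattern with i transitions contributes i transitions.
    t_org = sum(comb(n - 1, i) * i for i in range(n))
    # T_mod: patterns with i >= n/2 transitions are inverted to n-1-i transitions.
    t_mod = sum(
        comb(n - 1, i) * (i if 2 * i < n else n - 1 - i)
        for i in range(n)
    )
    # The common normalising factor of E(t_org) and E(t_mod) cancels in the ratio.
    return 100.0 * (t_org - t_mod) / t_org


for depth in (8, 16):
    print(depth, round(expected_reduction_percent(depth), 2))
```

Because R is a ratio of the two averages, it does not depend on how P(i) is normalised.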

4.1.2 Brute Force Analysis

This analysis considers all combinations of the word and determines the

original and modified transitions in the datasets. It is essentially a brute force

approach. The data considered was a uniform distribution of all possible data

patterns that is likely to be transmitted over the bus. For example, considering a

buffer depth of 8, the number of possible data patterns of one bit stream is 256.

The number of transitions in these data patterns was calculated along with the


number of transitions in the data pattern after being modified, using the proposed

algorithm.

This was evaluated by means of simulation wherein runs with multiple

word lengths (buffer depths) were carried out. The simulation ran through all

possible combinations for a given length and determined the decision for all the

combinations. It determined the encoded data, in case transition inversion was

done, and calculated the reduction in transitions in every case. The overall

reduction average was calculated from the total number of transitions in all the

combinations and the total number of transitions in all the encoded data.
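A brute-force run of this kind can be sketched as below. It is not the simulation used in this work, only an illustration under assumptions: the helper names are hypothetical, only the data bits are counted (the decision bit is left out), and inversion is applied when the transition count is at least half the buffer depth.

```python
from itertools import product


def transitions(bits: tuple[int, ...]) -> int:
    """Number of positions where consecutive bits differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def encode(bits: tuple[int, ...]) -> tuple[int, ...]:
    """Transition-inversion encoding of the data bits of one line."""
    if 2 * transitions(bits) < len(bits):
        return bits
    out = [bits[0]]
    for prev, cur in zip(bits, bits[1:]):
        # Transition in the data -> repeat last sent bit; otherwise toggle it.
        out.append(out[-1] if cur != prev else 1 - out[-1])
    return tuple(out)


def brute_force_reduction_percent(depth: int) -> float:
    """Enumerate all 2**depth bitstreams and compare total transition counts."""
    total_org = total_enc = 0
    for bits in product((0, 1), repeat=depth):
        total_org += transitions(bits)
        total_enc += transitions(encode(bits))
    return 100.0 * (total_org - total_enc) / total_org


print(round(brute_force_reduction_percent(8), 2))
```

For a depth of 8 the loop visits the 256 patterns mentioned above; the run time doubles with every extra bit of depth, so exhaustive runs are practical only for small depths.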

4.1.3 Reduction Efficiency

The results obtained by both the methods agree with each other and are

shown in Table 4.1 and Figure 4.1.

Table 4.1 Statistical Percentage Reduction

Word Length                   8 Bits   16 Bits   32 Bits   64 Bits
% Reduction in transitions    31.25    20.95     13.54     9.78

It can be observed that the reduction in transitions itself reduces with

increasing buffer depth. This can also be explained by a simple combinatorial

example.


(Plot of % reduction in transitions against buffer depths of 8, 16, 32 and 64.)

Figure 4.1 Transition Reduction with Change in Buffer Depth

For a data space of some given buffer depth, say N, there can be 2^N

possible values. Of these values, there can be only two elements which have the

maximum number of transitions, whatever N may be. These two possibilities will

be just a sequence of alternating ‘1’s and ‘0’s, one of which will start with ‘1’ and

the other with ‘0’. They alone will give maximum transition reduction. Similarly,

the number of possibilities that will lead to a transition reduction of just 1 will be

maximum in quantity. This will be because the elements that can give a reduction

of 1 will be those that have a transition count of N/2. They will be maximum in

number since they occur in the middle of the binomial distribution. Because of

this, increasing buffer depth can lead to a reduction in transition reduction

efficiency.

This can be mitigated by using smaller buffer depths, which will entail

transmitting more decision bits. Overall, the effect of the decision bit also has to

be taken into account when choosing an appropriate depth.


The effect of the extra decision bit on the overall bandwidth utilization is

explained in the following section.

4.2 Performance Penalty Analysis

The transition inversion algorithm needs an extra bit to be transmitted

before the start of the block of data on all lines. This leads to a decrease in

bandwidth utilization. For a system with buffer depth of 8, 9 bits are transmitted

on one line. This leads to the requirement of a slightly longer time to transfer the

data or a slightly higher frequency to maintain the bandwidth. For the above case,

a frequency increase of 9/8 (or (N+1)/N in general) will be needed to maintain

the same bandwidth utilization. Or an increase of time can be allowed provided

the performance tradeoff is accepted in the system design. But whatever

mitigation is taken into design, the corresponding power consumed by I/O pads

will increase linearly. A performance metric is defined to take into account the

scaling of the frequency and the reduction in transitions, and is calculated as their

product.

Performance Metric, Ps = (frequency/time scaling) * (original reduction

efficiency)

Table 4.2 Performance Metric, Ps variation with buffer depth (bitstream length)

Buffer depth    8 Bits   16 Bits   32 Bits   64 Bits
Ps              27.78    19.71     13.12     9.63


The variation of this parameter with buffer depth is shown in Table 4.2

and Figure 4.2.
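As a quick arithmetic check, the sketch below scales a reduction percentage by N/(N+1), the fraction of transmitted bits that carry data once the decision bit is added. Reading the "frequency/time scaling" factor this way is an assumption made here; the function name is hypothetical.

```python
def performance_metric(reduction_percent: float, depth: int) -> float:
    """Scale the transition reduction by the share of data bits per line, N/(N+1)."""
    return reduction_percent * depth / (depth + 1)


# Reduction value for a buffer depth of 8, taken from Table 4.1.
print(round(performance_metric(31.25, 8), 2))
```

For a buffer depth of 8 this gives 31.25 * 8/9, about 27.78.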

(Plot of scaled % reduction in transitions against buffer depths of 8, 16, 32 and 64.)

Figure 4.2 Effects of Frequency scaling with Buffer Depth

Bus invert leads to a reduction in bandwidth since it imposes a delay in

putting the encoded data on the bus. By following a depth based approach where most

delays are hidden, bandwidth need not be reduced.

4.3 Tradeoff Analysis

The tradeoff in design is between reduction efficiency and performance

penalty. If the reduction efficiency has to be high, the buffer size should be small,

which leads to a higher performance penalty and the extra power consumed on

account of it. A larger buffer size leads to a lower performance

penalty but gives a lower reduction efficiency. This tradeoff has to be taken

into account whenever the technique has to be implemented practically.


4.4 Power Analysis

The overall power reduction consists of the power reduction achieved by

the transition reduction minus the power consumed by the extra circuitry.

Unmodified dynamic power consumed by I/O pads is given by

\[ P_{org} = V_{dd}^{2} \, f \, C_{T} \, \alpha \]

where $V_{dd}$, $f$, $C_{T}$ and $\alpha$ represent the supply (drain) voltage, frequency of operation, line

capacitance and switching activity respectively. If the power dissipation of the extra

circuitry required for the coding process is taken into account then the equation

given above has two extra terms on the right hand side, the encoder and decoder

power dissipation respectively. For the algorithm to show any power reduction

the following relation has to be satisfied.

\[ V_{dd}^{2} \, f \, C_{T} \, [\alpha - \alpha_{mod}] > (P_{E} + P_{D}) \]

Where PE is power dissipated by the encoder, PD is the power dissipated by the

decoder and αmod is the modified switching activity.
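The break-even condition can be checked numerically as in the sketch below. All numeric values are hypothetical placeholders chosen only to exercise the formula; they are not measurements from this work, and the function name is likewise an assumption.

```python
def net_power_saving(vdd: float, freq: float, c_line: float,
                     alpha: float, alpha_mod: float,
                     p_enc: float, p_dec: float) -> float:
    """I/O power saved by the reduced switching activity minus the codec overhead (W)."""
    saved = vdd ** 2 * freq * c_line * (alpha - alpha_mod)
    return saved - (p_enc + p_dec)


# Hypothetical operating point: 3.3 V pads, 100 MHz bus, 10 pF per line,
# switching activity reduced from 0.50 to 0.35, 50 uW each for encoder and decoder.
net = net_power_saving(3.3, 100e6, 10e-12, 0.50, 0.35, 50e-6, 50e-6)
print(net > 0)
```

A positive result means the inequality above is satisfied for that operating point; a negative one means the coding circuitry costs more than it saves.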

4.5 Error Detection Analysis

The proposed transition inversion algorithm’s propensity towards reducing

the number of transitions to less than half the word length can be used for

detecting some errors. A simplified discussion of it is given below.

A preliminary way of doing this can be by determining if the number of

transitions in the received bitstream is more than half the bitstream length. A


counter is placed at the receiver that counts the number of transitions in the

incoming bitstream. If this value is more than half the bitstream length, the

incoming data is incorrect.
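The receiver-side check can be sketched as a simple transition count. This is an illustrative model with a hypothetical function name; it assumes the check is applied to the received data bits of one line, without the decision bit, and uses the "more than half the bitstream length" test described above.

```python
def looks_corrupted(received: str) -> bool:
    """Flag a received bitstream whose transition count exceeds half its length."""
    transitions = sum(a != b for a, b in zip(received, received[1:]))
    return 2 * transitions > len(received)


print(looks_corrupted("11111110"))  # a validly encoded stream, one transition
print(looks_corrupted("10101010"))  # seven transitions, cannot be a valid stream
```

A stream that passes this check is not guaranteed to be error free, since only some error patterns push the transition count above the threshold.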

As a simple and preliminary analysis, the proposed technique is compared

with parity bit technique, since both have similar overhead i.e. addition of one bit

to the bitstream. The parity bit detects all odd bit errors, but misses even bit flips,

whereas, transition inversion can detect a certain percentage of any number of

bit errors.

Table 4.3 Error Detection Analysis

No. of Bit errors    % of errors detected
                     Parity Coding    Proposed Technique
1                    100              31.25
2                    0                44.64
3                    100              52.68
4                    0                55.71
5                    100              52.68
6                    0                44.64
7                    100              31.25
8                    0                0


(Plot of percentage detection against number of bit errors, 1 to 8, for the proposed technique and the parity bit.)

Figure 4.3 Proposed Technique Vs Parity Bit Technique

Error analysis has been done by considering all combinations of the given

bitstream length that are transmitted over the bus. For transition inversion coding

on a buffer depth of 8, all the combinations of bit errors right from one bit error to

8 bit errors have been checked for both the proposed technique and parity bit

technique. The result of this analysis is shown in Table 4.3. The variation in error

detection percentage with increasing number of bit errors is shown in Figure 4.3.

If all the bits are in error, then neither technique can detect the error, as in

the proposed technique if all bits are flipped, the number of transitions remains

the same. Calculation of averages over the entire range of bit errors shows that

the proposed technique and parity bit technique both have the same value of

50.2%. The average is calculated as the ratio of total number of errors detected

to the total number of errors possible on the line. Since the proposed technique

cannot give a definite indication of an error by itself, it can be used as a hint to

upper layers of communication that an error has occurred. This is only proposed


as an added advantage that can be achieved with not much extra hardware at

receiver since decoding anyway will be done.

The proposed technique of transition inversion has been analyzed from an

algorithmic perspective showing its potential in solving the low power problem. In

the next chapter, an optimized implementation of it will be discussed and its

complexity analyzed.


Chapter 5

Implementation and Analysis

This chapter deals with the overall architecture of the system proposed

and the gate level design of the transition counter, encoder, and decoder circuits.

Theoretical complexity of the circuits is also analyzed and compared.

5.1 High Level Architecture

In most block systems, the data buffer is present just before the

transmission part. The core logic of the system places the data inside the buffer

from one side and the transmission happens from the other side. In such systems,

the transmission starts only after a complete block is fully formed at the buffer.

This block data transfer system is itself a form of pipelined system that trades latency for throughput. In any pipelined design, delay buffers are used to split longer delays into shorter ones. There may be an initial delay before the first data emerges, but a consistent throughput is then maintained. Thus bandwidth is not affected even though latency may increase. The proposed implementation uses this pipeline to cut down the delay that forms the bottleneck of most bus coding techniques.

The proposed technique is implemented by pipelining the entire coding system into two separate blocks, namely:

a) Decision Circuit & Transition Counter

b) Encoder

The bit stream is encoded on the fly as the data is put on the bus, as

shown in Figure 5.1.

Figure 5.1 High Level Architecture

The transition counter is placed right at the entry of the data into the

buffer from the system core. This way, the data that is fed into the buffer will be

analyzed for the number of transitions just as the data is being loaded onto the

buffer. The decision of the transition inversion is made depending on the count of

the transitions and is stored in some extra space in the buffer. This counter

output itself can be the decision as will be elaborated in the following sections.

The bit stream is encoded if a transition inversion is needed. This is done on the fly as the data is being put on the bus, since the encoder needs to process only the current and the next bit. It can be implemented with little delay, as shown in the following sections.

In the receiver the decoder has to decode the incoming bit stream and

recover the original data.

5.1.1 Decision Circuit

The decision circuit is built by counting the transitions between consecutive bits in the bitstream. A transition between two bits is found simply by performing an XOR (Exclusive OR) operation between them, which tests whether they differ. The proposed circuit, which uses a single XOR gate between consecutive incoming bits of the bit stream, is shown in Figure 5.2.

The D-FF (D flip-flop) holds and propagates the data on each clock cycle. A D-FF of this kind is already part of any block data loading system, so that same D-FF can be reused for the proposed purpose.

The XOR between the output and input of the D-FF compares consecutive bits. The gate outputs ‘1’ when the two bits are not the same and ‘0’ when they are the same. This translates to a ‘1’ when there is a transition and a ‘0’ when there is no transition.

This transition state can be used to enable a counter that counts the transitions. The counter itself operates at the clock frequency of the data stream. It needs to count only up to half the maximum number of transitions, which is (N-1)/2 for an N-depth buffer, so it needs about log2((N-1)/2) flip flops. Here again the delay is not much of a problem, since it is hidden in the pipeline.
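The behaviour of the decision circuit can be modelled in software. The following is a minimal sketch, not the gate-level design: the function names are hypothetical, the block is assumed to be a non-empty list of 0/1 values, and the decision rule (invert when transitions outnumber non-transitions) is the one stated in the conclusions of this thesis.

```python
def count_transitions(bits):
    """Count transitions by XOR-ing each bit with the one before it.

    `previous` plays the role of the D-FF output and `current` the D-FF input.
    """
    previous = bits[0]
    count = 0
    for current in bits[1:]:
        count += previous ^ current  # XOR output is 1 exactly when the bits differ
        previous = current
    return count


def should_invert(bits):
    """Return True when transitions outnumber non-transitions in the block."""
    max_transitions = len(bits) - 1  # N-1 bit boundaries for an N-depth buffer
    return 2 * count_transitions(bits) > max_transitions


print(should_invert([0, 1, 0, 1, 0, 1, 0, 1]))  # 7 of 7 boundaries toggle
print(should_invert([0, 0, 0, 0, 1, 1, 1, 1]))  # 1 of 7 boundaries toggles
```

The first block would be marked for inversion and the second would be sent unchanged.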

This circuitry can also be implemented with double edge triggered circuits to further reduce power dissipation at the encoder stage. Also, the transition counter works in parallel with buffer loading, so its delay is masked.

Figure 5.2 Decision Circuit (Transition Counter)

5.1.2 Encoder Circuit

The encoder operates on the fly, depending on the transition inversion decision, as shown in Figure 5.3. It operates alongside the usual data transmission path. The encoder itself is implemented in a pipelined fashion, using 2 D-FFs, so that the effect of its delay on bandwidth is reduced. This is made possible by the sequential nature of both the block data and the proposed technique. The pipeline introduces only latency; throughput is maintained because the bits are processed in parallel with the transmission.

This encoder needs to operate only for those cases where a transition

inversion is needed. So this entire circuit can be power gated and be made

operational only when needed as discussed in the prior chapters.

Figure 5.3 Encoder Circuit

The D-FF on the incoming bitstream calculates the transition state just as the decision circuit did during the loading of the block. If the decision was to invert, this transition state is inverted to give the inverted transition state. The inverted transition state is then used to generate the next bit, so that the next bit stands in the inverted transition state relative to the current bit.

To do this, the other D-FF takes as its input the current bit that is put on the bus. The next bit is made the same as the current bit if the transition state for that transition was ‘1’: when the transition state is ‘1’, the inverted transition state is ‘0’, which means the next bit must be the same as the current bit. This follows from the underlying idea of transition inversion. It is easily achieved with an XOR gate, which acts as a controlled inverter. The gate passes one input through as the output if the other input, taken as a control input, is ‘0’. If the control input is ‘1’, the other input is inverted to generate the output bit. This principle is used here to generate the next bit from the inverted transition state.
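The encoding step described above can be sketched in software as follows. This is a behavioural model only, with hypothetical names; it assumes that the first bit of the block is transmitted unchanged and that `invert` is the 0/1 decision produced by the decision circuit.

```python
def encode_block(bits, invert):
    """Transition-inversion encode one block of 0/1 values.

    bits   -- original data bits for one bus line
    invert -- 1 to invert every transition state, 0 to pass the block through
    """
    if not bits:
        return []
    encoded = [bits[0]]  # assumption: the first bit is sent as-is
    for previous, current in zip(bits, bits[1:]):
        transition = previous ^ current          # first D-FF plus XOR
        coded_transition = transition ^ invert   # XOR as a controlled inverter
        # Second D-FF holds the last bit on the bus; toggle it only when the
        # coded transition state is 1.
        encoded.append(encoded[-1] ^ coded_transition)
    return encoded


print(encode_block([0, 1, 0, 1, 0, 1, 0, 1], 1))  # prints [0, 0, 0, 0, 0, 0, 0, 0]
print(encode_block([1, 0, 0, 1], 1))              # prints [1, 1, 0, 0]
```

With `invert` set to 0 the function returns the input unchanged, which corresponds to the case where the encoder can be bypassed or power gated.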

5.1.3 Decoder Circuit

The decoder is essentially the same as the encoder circuit. The decoder

performs XOR between consecutive bits to determine transition state and uses it

to perform a controlled inversion on the received bit if required to recover the

data. The decoder is shown in Figure 5.4.
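A behavioural sketch of the decoder is given below. The names are hypothetical, and two assumptions are made: the first bit of the block arrives unchanged, and the receiver already knows the 0/1 inversion decision for the block (the thesis signals it as an extra word).

```python
def decode_block(received, invert):
    """Recover the original bits of one block from the received bits.

    received -- 0/1 values seen on the bus line
    invert   -- 1 if the transmitter inverted the transition states, else 0
    """
    if not received:
        return []
    decoded = [received[0]]  # assumption: the first bit was sent as-is
    for previous, current in zip(received, received[1:]):
        coded_transition = previous ^ current    # transition state on the bus
        transition = coded_transition ^ invert   # undo the inversion if needed
        decoded.append(decoded[-1] ^ transition)
    return decoded


print(decode_block([0, 0, 0, 0, 0, 0, 0, 0], 1))  # prints [0, 1, 0, 1, 0, 1, 0, 1]
print(decode_block([1, 1, 0, 0], 1))              # prints [1, 0, 0, 1]
```

The loop body has the same structure as an encoder built from one XOR for the transition state and one XOR as a controlled inverter, which is consistent with the statement that the decoder is essentially the same circuit.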

Figure 5.4 Decoder Circuit

5.2 Complexity Analysis

The time and space complexities have been analyzed and compared with those of the bus invert method. The main components of both systems are the decision circuit, the encoder and the decoder.

5.2.1 Decision Circuit

With increase in bus width, the space complexity of the bus invert’s transition determiner increases linearly (O(N)), because it involves XOR gates on all the bits. The transition determiner of the transition inversion technique is just one XOR gate whatever the buffer depth (the parameter comparable to bus width) may be. So it scales as O(1) for all buffer depths.

Time complexity of both of them is constant O(1). But the time complexity

of transition inversion will not affect the system due to pipelining. So the delay

due to that is hidden.

The transition counter of the bus invert needs circuitry to count the parallel transition vector up to its maximum value, which scales as O(log2 N) in both space and time. Transition inversion needs to count sequentially only up to half the maximum number of transitions, (N-1). Its space and time complexity therefore comes to O(log2((N-1)/2)), which is of the same asymptotic order as bus invert but about one counter bit smaller.
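To make the counter size concrete, the expression log2((N-1)/2) can be evaluated for the buffer depths used later in the thesis. The sketch below is only illustrative: rounding up to a whole number of flip flops is an assumption not stated in the text, and the function name is hypothetical.

```python
from math import ceil, log2


def counter_flip_flops(buffer_depth):
    """Flip flops needed to count up to half the maximum number of transitions.

    Assumption: the value log2((N-1)/2) is rounded up to a whole flip flop.
    """
    half_max_transitions = (buffer_depth - 1) / 2
    return ceil(log2(half_max_transitions))


for depth in (8, 16, 32, 64):
    print(depth, counter_flip_flops(depth))
# prints:
# 8 2
# 16 3
# 32 4
# 64 5
```

Under this assumption each doubling of the buffer depth adds a single flip flop to the counter.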

But the time complexity of the transition inversion’s counter will not affect

the delay due to the pipelined operation of the proposed technique. This is a

major improvement over bus invert as well as most other techniques since the

chief limiting factor of delay can be reduced.

5.2.2 Encoder/Decoder Circuit

With increase in bus width, the bus invert encoder/decoder circuit

complexity increases linearly (O(N)) since it needs that many XOR gates. For the

proposed technique, the circuit complexity is constant (O(1)) since it needs only

that one set of D-FFs and XOR gates to achieve the same result for any buffer

depth. Time complexity is constant (O(1)) in both bus invert and the proposed

technique.

An optimized implementation of the proposed technique has been

discussed. It makes use of pipelining to mask the effect of the extra circuitry on

delay. Its ability to reduce transitions was also discussed. The next chapter will

deal with a comparison of the technique to Bus Invert over benchmarks.

Chapter 6

Experimental Analysis

This chapter deals with comparisons of the simulation analysis of the

proposed technique and bus invert.

6.1 Benchmark Simulation

For experimental analysis, the algorithm was applied on random image

data and SPEC2000 benchmark binaries. The SPEC2000 benchmark is used for

the purpose of simulating data that is executable in nature. Memory traces of

such benchmark binaries are generally used in most other comparisons. The

image data is to show data that is not executable in nature and which will not

involve any memory access beyond the size of the image.

6.1.1 Random Image Data

Two analyses were performed. The first involved a limited analysis which

took a limited set of configurations. This was done to show the individual results.

A second analysis which is more comprehensive is also discussed.

For the first, preliminary analysis of the algorithm, seven images were taken and their RGB values were run through the proposed algorithm and bus invert. The images were a mix of smooth and detailed features, chosen so that there would be varying levels of variance within the images. This run simulates a transfer of image data on a bus. The individual results, for a buffer depth of 8 and a bus width of 8, are tabulated in Table 6.1. These do not include the power dissipated by the encoder and decoder circuitry. In a system, image data is transferred through the type of busses that the proposed technique targets, which are mostly off-chip block data transfer systems.

Table 6.1 A comparison of transition reduction for Bus Invert and the Proposed Technique for Images

Image #   Original no. of   Bus Invert Coding            Proposed technique
          transitions       transitions   % reduction    transitions   % reduction

1         120160            86296         28.18          72212         39.9

2         127770            94776         25.82          85454         33.11

3         74666             61746         17.3           53502         28.34

4         165678            119578        27.83          119908        27.62

5         111909            81645         27.04          70978         36.58

6         66189             49251         25.59          46769         29.34

7         159620            121466        23.9           114163        28.48
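The percentage columns of Table 6.1 follow directly from the transition counts. As a check, the sketch below recomputes them; the transition counts are copied from the table, while the function and variable names are hypothetical.

```python
# (image, original transitions, bus invert transitions, proposed transitions)
rows = [
    (1, 120160, 86296, 72212),
    (2, 127770, 94776, 85454),
    (3, 74666, 61746, 53502),
    (4, 165678, 119578, 119908),
    (5, 111909, 81645, 70978),
    (6, 66189, 49251, 46769),
    (7, 159620, 121466, 114163),
]


def reduction_percent(original, coded):
    """Percentage of transitions removed by a coding scheme."""
    return 100.0 * (original - coded) / original


for image, original, bus_invert, proposed in rows:
    print(image,
          round(reduction_percent(original, bus_invert), 2),
          round(reduction_percent(original, proposed), 2))

# Number of images on which the proposed technique leaves fewer transitions.
proposed_wins = sum(1 for _, _, bus_invert, proposed in rows if proposed < bus_invert)
print(proposed_wins)  # prints 6
```

The recomputed values agree with the table to within 0.01, the small differences being due to rounding. Image 4 is the one case where bus invert leaves slightly fewer transitions.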

It can be seen from the table that some individual images give much better transition reduction than others, which reflects the varied nature of the data taken. The table also shows that, with image data as input, transition inversion reduces more transitions than bus invert on six of the seven images; for image 4 the two techniques are nearly equal, with bus invert marginally ahead.

A more comprehensive analysis was also done with another set of ten images, again consisting of smooth and detailed features. The analysis was done for various configurations of bus width and buffer depth. Buffer depth took the values 8, 16, 32 and 64, and bus width took the values 8, 16, 32 and 64 bits. Because of the volume of data involved, the reductions for all the images have been averaged. The results of the analysis are shown in Figure 6.1. Each data point in the plot is the average of the reductions over all the images for that configuration.

[Figure 6.1 plot, titled "Transition Reduction Efficiency": % reduction (axis 0 to 35) against bus width (8, 16, 32, 64); series: Transition Inversion at buffer depths 8, 16, 32 and 64, and one merged curve for buffer depths 8, 16, 32 and 64 (labelled "Bus Width" in the legend)]

Figure 6.1: Comparison of Transition Inversion to Bus Invert for images

The plots for bus invert at the different buffer depths coincide and hence are shown as one single plot. The pattern that can be seen in the plots reiterates what has been discussed before. The results for bus invert clearly show that its efficiency takes a hit when bus width is increased, while it is not affected by changes in buffer depth.

For transition inversion, the converse happens. The efficiency is

independent of bus widths while reducing with increasing buffer depths. This

makes it suitable for the wider busses that are becoming the norm these days.

6.1.2 SPEC2000 Benchmark

SPEC2000 benchmark binaries [49] were run with the proposed technique and compared with bus invert and Gray coding. These benchmark binaries are typically used to simulate bus activity, as in the literature discussed before. Memory traces of the binaries are taken and run through the simulated bus system for analysis. They serve as a model of bus activity.

A memory trace records how the binary accesses memory locations, which can hold both its own binary and plain data. These traces are useful for modelling a bus system.

The 26 binary traces were run with the buffer depth varied across 8, 16, 32 and 64 and the bus width varied across 8, 16, 32 and 64 bits. All combinations of buffer depth and bus width were taken.

The averages across all the binaries for a given combination of bus width and buffer depth were taken and plotted. The results for both techniques are shown in Figure 6.2. Each data point represents the average reduction over all the binaries for that buffer depth and bus width.

[Figure 6.2 plot, titled "Transition Reduction Efficiency": % reduction (axis 0.00 to 35.00) against bus width (8, 16, 32, 64); series: Transition Inversion at buffer depths 8, 16, 32 and 64, and Bus Invert for buffer depths 8, 16, 32 and 64 as one merged curve]

Figure 6.2: Comparison of Transition Inversion to Bus Invert for SPEC2000

It can be observed that buffer depth makes no difference to bus invert. With increase in buffer depth, the transition reduction decreases for the proposed technique. A similar observation can be made for bus invert when bus width is increased.

Most notably, the change in bus width has slightly affected the reduction efficiency of transition inversion. Although in principle it should remain constant with respect to bus width, concatenating or splitting data to attain different bus widths can have unexpected consequences. This can be mitigated by designing the system carefully, taking into account the typical data widths and the instruction set of the processor that will be used on that bus.

With bus widths increasing in present VLSI systems, the proposed technique will compare increasingly well. This is the main differentiating factor from existing techniques, whose efficiency degrades as bus width grows. Delay is also reduced to a large extent, as will be discussed in the following sections. An increase in buffer depth leads to a smaller reduction. This can be offset by splitting the block into sub-blocks of smaller depth. Depending on the system, a compromise between power reduction and bandwidth utilization can be found.

[Figure 6.3 plot: transition reduction percentage (axis -50 to 20) against bus width (8, 16, 32, 64); series: Buffer Depth 8, Buffer Depth 16, Buffer Depth 32, Buffer Depth 64]

Figure 6.3: Transition reduction in Gray Coding

The benchmark files were also run using the Gray coding technique. Gray coding was included in the comparison because it is another popular technique, although it has the disadvantage of being an N-bit input to N-bit output conversion. The results are shown in Figure 6.3. There is not much reduction in transitions, and some of the data points show an increase in the number of transitions. Gray code does not give much reduction because it is an N-bit to N-bit conversion: whatever data combinations are present in the input set are present in the output set as well, under a one-to-one mapping. So it is not possible to get much benefit out of Gray code.
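The one-to-one nature of Gray coding can be illustrated with the standard binary-reflected Gray code. The sketch below is not the simulation used for Figure 6.3; the 8-bit width and the names are assumptions chosen for illustration.

```python
def binary_to_gray(value):
    """Binary-reflected Gray code of a non-negative integer."""
    return value ^ (value >> 1)


WIDTH = 8  # assumed word width for the illustration
codes = [binary_to_gray(v) for v in range(2 ** WIDTH)]

# The mapping is a permutation: every N-bit word appears exactly once as output.
is_one_to_one = sorted(codes) == list(range(2 ** WIDTH))
print(is_one_to_one)  # prints True

# Consecutive inputs differ in exactly one output bit, which helps sequential
# values such as addresses but gives no guarantee for arbitrary data.
single_bit_steps = all(
    bin(codes[i] ^ codes[i + 1]).count("1") == 1 for i in range(len(codes) - 1)
)
print(single_bit_steps)  # prints True
```

Because the output set is just a reordering of the input set, data with no sequential structure sees roughly the same mix of words before and after coding.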

6.2 Overall Delay Analysis

The proposed system and bus invert were designed in Verilog RTL (Register Transfer Level) and analyzed with Synopsys synthesis tools. A bus operating at 100 MHz was assumed, with its I/O voltage levels at 3.3 V. The internal circuitry was modeled on the 180 nm process technology from TSMC (Taiwan Semiconductor Manufacturing Company). The circuitry was simulated by feeding the SPEC2000 benchmark trace files as input for a buffer depth of 8.

The proposed technique does not involve circuitry of multiple stages thus

leading to less delay. The delay performance of the proposed technique and bus

invert in terms of propagation speed is simulated and compared in Table 6.2.

Table 6.2 Comparison of encoder delay in Bus Invert and Transition Inversion

Technique   Proposed Technique   Bus Invert

Delay       1.2 ns               3.3 ns

The table shows that the delay of Transition Inversion is much less than that of Bus Invert. The delay of Transition Inversion is only about 36% of that of Bus Invert, a reduction of about 64%.
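The relation between the two delays in Table 6.2 can be worked out directly. In the sketch below the delay values are taken from the table and the variable names are hypothetical.

```python
proposed_delay_ns = 1.2    # Table 6.2, proposed technique
bus_invert_delay_ns = 3.3  # Table 6.2, bus invert

ratio = proposed_delay_ns / bus_invert_delay_ns
reduction = 1.0 - ratio

print(f"ratio {ratio:.1%}, reduction {reduction:.1%}")  # prints ratio 36.4%, reduction 63.6%
```

The figure of about 64% therefore corresponds to the reduction in delay, with the encoder delay itself being about 36% of the bus invert delay.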

For calculating the delay due to the proposed technique, only the encoder is considered. The decision circuit is not included, as it is part of the buffer loading delay and does not contribute to encoding. The decision circuit delay was found to be 0.2 ns, since it involves only the XOR gate. The delays of the decision circuit flip flops are masked by the sequential loading, which gives a pipelined approach.

For the bus invert technique, the decision circuit delay is also taken into account, because encoding has to be complete before the next data word arrives. Thus, before the next data arrives, the Hamming distance must have been counted and the data encoded.
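For comparison, the per-word decision that bus invert must complete within one bus cycle can be sketched as follows. This is a simplified model of bus-invert coding as described by Stan and Burleson [13]: it ignores the transition on the invert line itself, and the function name and integer representation of the bus are assumptions made for illustration.

```python
def bus_invert_encode(previous_bus, data, width):
    """Return (word to drive on the bus, invert line value) for one data word.

    previous_bus -- value currently on the bus lines
    data         -- next data word to send
    width        -- bus width in bits
    """
    mask = (1 << width) - 1
    # Hamming distance: number of lines that would toggle if data were sent as-is.
    hamming = bin((previous_bus ^ data) & mask).count("1")
    if 2 * hamming > width:
        return (~data) & mask, 1  # send the complement, assert the invert line
    return data & mask, 0


print(bus_invert_encode(0x00, 0xFE, 8))  # prints (1, 1)
print(bus_invert_encode(0x00, 0x0F, 8))  # prints (15, 0)
```

The XOR across all lines, the count and the comparison all sit on the path between one data word and the next, which is why their delay cannot be hidden in the way the sequential transition counter's delay is.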

Overall, the encoding delay of the proposed technique is considerably lower than that of bus invert.

The delay is also constant for increasing buffer depths since the encoder

does not vary with either buffer depth or bus width. Bus invert suffers in this

scalability issue since the delay increases when bus width is increased making it

unsuitable for wide busses. This makes transition inversion more suitable for the

high frequency, wider busses that are the norm today.

6.3 Overall Power Analysis

The power consumed by the circuitry, measured by simulation, is shown in Table 6.3. Assuming the parameters stated above, the power dissipated before coding was found to be 44.89 mW.

Table 6.3 Power dissipation of encoding/decoding circuitry for transition inversion

Circuitry        Decision circuit   Encoder   Decoder

Power consumed   28.7 µW            28.9 µW   28.6 µW

The reduction in power consumption is linearly dependent on the reduction in activity factor. An 18.2% reduction in activity for a buffer depth of 8 leads to a power reduction of 8.17 mW. The total power consumed by the extra circuitry is 86.2 µW, leading to a net power reduction of 8.08 mW, which corresponds to a 17.99% reduction in power.
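The arithmetic behind these figures can be retraced as below. The sketch assumes that 44.89 mW is the power before coding and that the saving scales linearly with the 18.2% activity reduction, as the text states; the variable names are hypothetical.

```python
baseline_power_mw = 44.89        # assumed to be the power before coding
activity_reduction = 0.182       # 18.2% fewer transitions at buffer depth 8
overhead_uw = 28.7 + 28.9 + 28.6 # decision circuit + encoder + decoder (Table 6.3)

gross_saving_mw = baseline_power_mw * activity_reduction
net_saving_mw = gross_saving_mw - overhead_uw / 1000.0  # convert µW to mW
net_percent = 100.0 * net_saving_mw / baseline_power_mw

print(f"gross {gross_saving_mw:.2f} mW, net {net_saving_mw:.2f} mW, {net_percent:.2f}%")
```

The gross and net savings come out at 8.17 mW and 8.08 mW. The net percentage computed from unrounded intermediates is about 18.0%, and the quoted 17.99% is obtained when the rounded value of 8.08 mW is divided by 44.89 mW.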

Chapter 7

Conclusions

7.1 Derivative Work

Derivative works from the proposed technique have been proposed to extend it to synchronous serial busses, to a high level design for processor caches, and to clock recovery in asynchronous bus systems. This shows that the proposed technique can serve purposes other than reducing power, and it lays the base for future work.

7.1.1 Serial Busses

The proposed technique can also be applied to a synchronous serial bus.

A serial bus by itself has the data flowing in a sequential fashion thus enabling

the application of transition inversion. Serial busses typically move a data

element bit by bit whatever the word length might be. This gives an opportunity to

use transition inversion to reduce power.

This too involves calculating the transition count, making a decision and

encoding. A block diagram of a pipelined approach is shown in Figure 7.1 which

shows two approaches to counting the transitions. One is a parallel way which

suffers from delay issues. The serial counting approach adds a buffer to load the

serial data and count the transitions. This increases latency but maintains the

throughput.

Figure 7.1 Serial Coding High Level Architecture

7.1.2 Usage to Caches

The technique can be easily applied to processor caches since caches

make use of block data transfer. The data generally gets transferred in terms of

cache lines. As the cache (L1) does not use a buffer, encoding data on the fly is

not possible, without drastic reduction in performance. Also the absence of a

buffer in the cache means that data when modified will invalidate the inversion

decision bits determined by the primary memory for the given block, when it was

loaded onto the cache.

The above drawbacks can be removed by noting that the processor core modifies the data only in the L1 cache and not at higher cache levels. The L2 cache can therefore be used as a buffer to perform encoding, and the process is modified as follows.

• The memory consists of both the encoder and decoder circuits.

• The L1 cache has only the decoder circuit.

• The L2 cache has the encoder circuit for data coming from the L1

cache.

In case of an on-chip L2 cache, only the encoded data is sent on the bus.

Raw data is never sent on the bus. In case of an off-chip L2 cache, the data from

the L1 cache to the L2 cache will be raw, and no power saving modification is

done to this data. Power saving is still achieved as the modified data is sent in

the other 3 of the 4 possible transmission paths. A high level block diagram of an

architecture is shown in Figure 7.2.

Figure 7.2 Cache Architecture for Transition Inversion

7.1.3 Usage to Clock Recovery

This technique can also be applied to asynchronous serial buses. Generally in asynchronous bus systems, the clock signal is not sent separately. The transmitter and receiver operate at the same clock frequency, and only the phase is synchronized by clock recovery mechanisms. This requires the data to have a larger number of transitions: the receiver adjusts its clock phase to match the transmission by looking at the time of each transition. In these busses a preamble may also be sent, containing only a bitstream of alternating highs and lows. Such busses are used to move data outside a system, typically over longer ranges.

The technique of transition inversion can be applied to clock recovery by simply inverting the decision condition. Clock recovery is used in asynchronous communication, where the clock phase is recovered from the data stream itself, so the data stream needs to have a larger number of transitions. Existing techniques such as 8b/10b are predominantly used in such communication systems, for example Ethernet. In 8b/10b, 8 bits of data are encoded into 10 bits such that the resultant bitstream has more transitions. This works by selecting a subset of the total combination space that has more transitions on average. Here only 256 (2^8) vectors are used out of a total of 1024 (2^10) vectors, and the 10 bit vector selected to represent any given 8 bit vector has more transitions on average. 4b/5b and 8b/10b are examples of this type of clock recovery mechanism. They generally operate by means of a look up table, with little runtime logic involved. 8b/10b operates by splitting the 8 bits into groups of 5 bits and 3 bits and encoding them into 6 bits and 4 bits respectively. The 5b/6b and 3b/4b conversions use a look up table, which occupies a lot of space. Though this reduces delay, it incurs a large space overhead for all those entries.

The technique of transition inversion can easily be applied to increase the number of transitions, just by inverting the transition states when the number of transitions is less than half the bit stream length. This is the inverse of the technique used for low power.
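A sketch of this inverted decision is shown below. It is only an illustration under stated assumptions: the block is a non-empty list of 0/1 values, the first bit is sent unchanged, "half the bit stream length" is taken as half of the N-1 possible transitions, and the function names are hypothetical.

```python
def count_transitions(bits):
    """Number of positions where consecutive bits differ."""
    return sum(a ^ b for a, b in zip(bits, bits[1:]))


def encode_for_clock_recovery(bits):
    """Invert the transition states when the block has too few transitions.

    Returns the encoded block and the 0/1 inversion flag.
    """
    max_transitions = len(bits) - 1
    invert = 1 if 2 * count_transitions(bits) < max_transitions else 0
    encoded = [bits[0]]  # assumption: the first bit is sent as-is
    for previous, current in zip(bits, bits[1:]):
        # Same XOR structure as the low power encoder; only the decision differs.
        encoded.append(encoded[-1] ^ (previous ^ current) ^ invert)
    return encoded, invert


print(encode_for_clock_recovery([0, 0, 0, 0, 0, 0, 0, 0]))
# prints ([0, 1, 0, 1, 0, 1, 0, 1], 1)
```

A block with no transitions at all is turned into an alternating pattern, which gives the receiver an edge in every bit period.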

This needs only one bit of overhead, compared to the two bits of 8b/10b. The transition inversion technique also needs only a simple circuit, and so takes up less space than 8b/10b.

7.2 Contributions

The major contributions in this thesis are as follows:

1. A novel technique is proposed in this work for power reduction on busses, which makes use of the buffered nature of modern busses.

2. An optimized circuit design has been produced for it using a pipelined approach. Complexity analysis is done on the circuit design.

3. Analysis of the algorithm and the circuit has been done to test their efficacy.

4. Comparison with other techniques has been done.

5. Applications have been proposed as usage scenarios.

7.3 Comparison

Transition Inversion can be compared to bus invert in various ways as

shown in the work. Transition inversion’s efficiency reduces with increase in

buffer depth but remains independent of bus width. Bus invert’s efficiency

reduces with increase in bus width but remains independent of buffer depth. The

increasing bus widths of the modern busses make bus invert not a practical

choice.

One limitation of the proposed technique is that it is applicable only to

buffered busses while bus invert is applicable to almost every type of bus. But

this does not place much limitation on the practical utility since most busses tend

to be buffered in nature.

7.4 Conclusions

In this work an encoding technique has been presented that reduces

power dissipated on off-chip data buses for block data transfer. The technique

involves inverting the transition states on every line of the bus if the transitions

exceed the number of non-transitions. The inversion reduces the number of

transition states which signal a transition.

The modification status is signaled as an extra word, thus avoiding the use

of an extra line. An optimized circuit was designed which makes use of pipelining

thus reducing the effect of the extra circuitry on delay. This pipelining is made

possible by the sequential nature of block data transfer as well as the proposed

technique.

The important parameter of delay, which limits the bandwidth, is reduced by about 64% relative to bus invert (1.2 ns against 3.3 ns), making transition inversion more suitable for practical applications. The encoder circuit is also of constant size, which removes the issues one faces when scaling up.

The average reduction obtained in terms of transitions is 18.2% for a buffer depth of 8, while the net power reduction after the power of the extra circuitry is taken into account is 17.99%. This is achieved without using an extra bus line, thus saving on design space. The compromise is in bandwidth utilization, which can be adjusted by choosing a proper block length.

Bibliography

[1] Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, Accepted for publication in special issue of Journal of Low Power Electronics, October 2010.

[2] Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, 23rd International Conference on VLSI Design, January 2010.

[3] Joint Winner of Intel Research Challenge (also known as Intel Scholar Program) 2008-2009. http://www.intel.com/cd/corporate/education/APAC/ENG/in/news/news43/419015.htm

[4] Abinesh R., Bharghava R., M.B. Srinivas, “Transition Inversion Based Low Power Data Coding Scheme for Synchronous Serial Communication”, 2009 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 103-108, 2009.

[5] The New York Times Technology section, April 1, 2008. Light and Cheap, Netbooks Are Poised to Reshape PC Industry.

[6] http://lesswatts.org/ Intel sponsored community project for software based low power development.

[7] http://www.intel.com/consumer/products/style/netbook.htm Netbook vs. Laptop and Entry Level Desktops.

[8] Netbook design considerations by Texas Instruments. http://focus.ti.com/docs/solution/folders/print/581.html

[9] Nano-cmos scaling problems and implications. Nano-CMOS Circuit and Physical Design, Ban P. Wong, Anurag Mittal, Yu Cao, and Greg Starr, John Wiley & Sons Inc.

[10] J. M. Rabaey. Digital Integrated Circuits. Prentice Hall, 1996.

[11] http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface Open industry standard for power management.

[12] Dhiman, G. and Rosing, T. S. 2007. Dynamic voltage frequency scaling for multi-tasking systems using online learning. In Proceedings of the 2007 International Symposium on Low Power Electronics and Design (Portland, OR, USA, August 27 - 29, 2007). ISLPED '07. ACM, New York, NY, 207-212.

[13] M. R. Stan, W. P. Burleson. Bus-Invert Coding for Low Power I/O, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3, No. 1, pp. 49-58, March 1995.

[14] L. Benini, G. De Micheli, E. Macii, D. Sciuto, C. Silvano. Asymptotic Zero-Transition Activity Encoding for Address Buses in Low-Power Microprocessor-Based Systems, IEEE 7th Great Lakes Symposium on VLSI, Urbana, IL, pp. 77-82, Mar. 1997.

[15] E. Musoll, T. Lang, and J. Cortadella. Working-Zone Encoding for reducing the energy in microprocessor address buses. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 4, Dec 1998.

[16] W. Fornaciari, M. Polentarutti, D. Sciuto, and C. Silvano, “Power Optimization of System-Level Address Buses Based on Software Profiling,” CODES, pp. 29-33, 2000.

[17] P. Panda, N. Dutt, “Reducing Address Bus Transitions for Low Power Memory Mapping”, European Design and Test Conference, pp. 63-67, March 1996.

[18] E. Musoll, T. Lang, and J. Cortadella, “Exploiting the locality of memory references to reduce the address bus energy”, Proceedings of International Symposium on Low Power Electronics and Design, pp. 202-207, Monterey CA, August 1997.

[19] http://www.green500.org/lists.php Green 500 list.

[20] Sushant Sharma, Chung-Hsing Hsu, and Wu-chun Feng, “Making a case for a green 500 list”, 2nd IEEE IPDPS Workshop on High-Performance, Power-Aware Computing, April 2006.

[21] Martin, T. L., Siewiorek, D. P., Smailagic, A., Bosworth, M., Ettus, M., and Warren, J. 2003. A case study of a system-level approach to power-aware computing. ACM Trans. Embed. Comput. Syst. 2, 3 (Aug. 2003), 255-276.

[22] Mircea R. Stan, Kevin Skadron, "Guest Editors' Introduction: Power-Aware Computing," IEEE Computer, vol. 36, no. 12, pp. 35-38, Dec. 2003, doi:10.1109/MC.2003.1250876

[23] Khargharia, B., Hariri, S., and Yousif, M. S. 2008. Autonomic power and performance management for computing systems. Cluster Computing Vol. 11, No. 2 (Jun. 2008), 167-181.

[24] http://www.nvidia.com/object/optimus_technology.html Nvidia Optimus Technology Homepage.

[25] http://hothardware.com/Articles/NVIDIA-Optimus-Mobile-Technology-Preview/ Preview and explanation of Nvidia Optimus.

[26] http://news.lenovo.com/article_display.cfm?article_id=1301 One PC, Two Devices: Lenovo Reveals the Industry’s First Hybrid Notebook.

[27] http://ces.cnet.com/8301-31045_1-10427615-269.html Hands on with Lenovo's CES showstoppers: U1 Hybrid, Skylight, and S10-3t up close.

[28] http://www.engadget.com/2010/01/05/lenovo-ideapad-u1-hybrid-hands-on-and-impressions/ Lenovo IdeaPad U1 Hybrid hands-on and impressions.

[29] Tomas Akenine-Moller, Jacob Strom, “Graphics Processing Units for Handhelds”, Proceedings of the IEEE Vol. 96, No. 5, pp. 779-789, May 2008. Invited Paper.

[30] Jun Yang, Rajiv Gupta, Chuanjun Zhang. Frequent value encoding for low power data buses. ACM Trans. Design Autom. Electr. Syst. 9(3): 354-384 (2004).

[31] C. Su, C. Tsui, and A. Despain. Saving power in the control path of embedded processors, IEEE Design and Test of Computers, 11(4):24–30, 1994.

[32] Wei-Chung Cheng, Massoud Pedram. Memory Bus Encoding for Low Power: A Tutorial. ISQED 2001: 199-204.

[33] Giuseppe Ascia, Vincenzo Catania, Maurizio Palesi, "A Genetic Bus Encoding Technique for Power Optimization of Embedded Systems", J.J. Chico and E. Macii (Eds.) PATMOS 2003, LNCS 2799, pp. 21–30, 2003. Springer-Verlag Berlin Heidelberg 2003.

[34] C. Su, C. Tsui, and A. Despain, “Saving power in the control path of embedded processors”, IEEE Design and Test of Computers, 11(4):24–30, 1994.

[35] L. Benini, G. D. Micheli, E. Macii, D. Sciuto, and C. Silvano, “Address bus encoding techniques for system-level power optimization”, In IEEE Design Automation and Test Conference in Europe, pages 861–866, Paris, France, Feb. 1998.

[36] L. Benini, G. D. Micheli, E. Macii, M. Poncino, and S. Quer. Power optimization of core-based systems by address bus encoding, IEEE Transactions on Very Large Scale Integration, 6(4), Dec. 1998.

[37] Saneei M, Afzali-Kusha A, Navabi Z. Serial Bus Encoding For Low Power Applications, International Symposium on System-On-Chip, pp. 1-4, November 2006.

[38] Kangmin Lee, Se-Joong Lee, Hoi-Jun Yoo. SILENT: serialized low energy transmission coding for on-chip interconnection networks. ICCAD 2004: 448-451.

[39] Stephen H. Hall, Garrett W Hall, James A McCall. High Speed Digital System Design - A Handbook of Interconnect Theory and Design Practices. Wiley Interscience. pp 57-61.

[40] Edited by Christian Piguet, Low-Power Processors and Systems on Chips, CRC Press.

[41] Sun-Mo (Steve) Kang, Elements of Low Power Design for Integrated Systems. ISLPED 2003.

[42] Srinivas Devadas, Sharad Malik, A Survey of Optimization Techniques Targeting Low Power VLSI Circuits. 32nd ACM/IEEE DAC 1995.

[43] Keith Buchanan. The evolution of interconnect technology for silicon integrated circuitry. GaAs MANTECH 2002.

[44] S. M. Sze, Kwok Kwok Ng. Physics of semiconductor devices. Wiley Interscience. pp 149-150.

[45] Microsoft Architecture Journal Vol. 18 Theme: Green Computing. http://msdn.microsoft.com/en-us/architecture/bb410935.aspx

[46] Samuelson & Nordhaus, Microeconomics, 17th ed. page 110. McGraw Hill 2001.

[47] http://www.pcauthority.com.au/Feature/173700,pc-building-intels-turbo-boost-vs-amds-turbo-core.aspx PC Building: Intel's Turbo Boost vs AMD's Turbo Core.

[48] http://flac.sourceforge.net/comparison.html FLAC comparison report to other lossless codecs.

[49] http://www.spec.org/cpu/ Standard Performance Evaluation Corporation.

[50] http://news.cnet.com/8301-11128_3-20004378-54.html Can green tech operate under Moore's Law?
