Transition Inversion based Low Power Coding for Buffered Bus Systems
By
Abinesh R
200742006
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
Master of Science (by Research) in
VLSI & Embedded Systems
Centre for VLSI & Embedded Systems Technologies International Institute of Information Technology
Hyderabad, India May 2010
Copyright © 2010 Abinesh R All Rights Reserved
Dedicated to my parents.
INTERNATIONAL INSTITUTE OF INFORMATION TECHNOLOGY Hyderabad, India
CERTIFICATE
It is certified that the work contained in this thesis, titled “Transition Inversion based
Low Power Coding for Buffered Bus Systems” by Abinesh R (200742006) submitted
in partial fulfilment for the award of the degree of Master of Science (by Research) in
VLSI & Embedded Systems, has been carried out under our supervision and it is not
submitted elsewhere for a degree.
Date: __________    Advisor: _____________
Dr. Suresh Purini, Asst. Professor, IIIT, Hyderabad
Date: __________    Advisor: _____________
Prof. Govindarajulu, Professor, IIIT, Hyderabad
Acknowledgements
I owe my deepest gratitude to my advisors, Dr. Suresh Purini and Professor
Govindarajulu whose encouragement, guidance and support enabled me to
accomplish this work.
I also thank Prof. M. Satyam for his feedback on various aspects of my work. I
also thank Prof. M.B.Srinivas for his feedback on the work. I am exceptionally
thankful to Bharghava for his valuable help and feedback on my work. I also
thank Mr. Deepak Tanna of Intel India for his comments on the practical aspects
of the work. I would like to thank all my friends and people in CVEST lab for the
terrific company during my study.
Finally, I want to thank my family for their unconditional love. Their constant
encouragement and their faith in me have always given me the strength to try to
achieve more and to be a better person.
Abstract
The field of electronics has undergone tremendous changes in recent
times. Innovation has proceeded on multiple fronts, among
which the growth of portable computing devices has led to a new wave of
miniaturization and low power development.
The rise of high bandwidth internal devices such as GPUs (Graphics Processing
Units) and multicore processors has led to the growth of high frequency busses
such as PCI (Peripheral Component Interconnect) and the Northbridge/Southbridge
FSB (Front Side Bus). The increasing use of multimedia streaming and
external storage devices has also led to a massive increase in memory-intensive
applications. The need for a simpler protocol has made most of these busses block
data transfer systems, with end uses such as caches, DMA (Direct Memory
Access) and video data transfer all tending towards block data transfer through
some form of buffer. I/O pads driving external busses drive large capacitances
and operate at a higher voltage, so the power they dissipate
forms a good part of the overall power consumption of a system.
In this thesis, a novel technique, Transition Inversion, is proposed to reduce the
power consumption of buffered busses. This work outlines a data coding
protocol by which signal transitions can be reduced for block data transfers over
busses, such as DMA and cache line transfers. Block data transfers generally occur
through data buffers. The prior knowledge of the data to be transmitted, available
once it is stored in the buffer, is exploited in serial fashion to reduce transitions
on every bus line.
The technique considers a buffer of data to be a collection of bitstreams
running in parallel over multiple lines. The transitions are counted in a bit serial
fashion and used to determine whether the transitions in any given bitstream
have to be inverted. In this way the data running sequentially on any given line is
encoded so that it has a reduced number of transitions.
The technique is implemented in an optimized fashion using pipelining so
that it can be used in practical systems with only a slight compromise in
performance. This is achieved by calculating the decision as the data is being
loaded into the buffer and doing the encoding on the fly. This is one aspect
that is lacking in most existing algorithms, as they are not amenable to low
delay implementation. The critical parameter of delay, which limits bandwidth, is
reduced by 64% with the proposed technique and the pipelined implementation.
Also, the proposed technique and implementation do not require any
extra bus lines, unlike most existing techniques. This
avoids a rise in PCB (Printed Circuit Board) fabrication costs.
Theoretical analysis of transition inversion showed that its reduction in
transitions is independent of the bus width. Most existing
techniques suffer with increasing bus widths, requiring mitigation measures that
themselves increase wiring requirements.
Also, the hardware complexity of transition inversion scales more slowly than
that of a comparable technique, which makes it suitable for the increasing data
widths in modern busses.
The analyses showed that the technique is suitable for modern bus
architectures in terms of delay, space complexity and the extra overhead power
consumed. It was seen that the bandwidth need not be limited and that a good
amount of power saving was still obtained even with the overhead included.
Contents
List of Tables
List of Figures
List of Relevant Publications
Chapter 1 Introduction
  1.1 Low Power Systems
    1.1.1 System Architecture Level
    1.1.2 Circuit Level
  1.2 Computer Bus Systems
    1.2.1 Practical Considerations
  1.3 Transition Inversion
  1.4 Contributions
  1.5 Organization of Thesis
Chapter 2 Related Work
  2.1 General Busses
  2.2 Special Busses
  2.3 Bus Invert
Chapter 3 Transition Inversion
  3.1 System Scenario
    3.1.1 Mutual Capacitance
  3.2 Algorithm
  3.3 Explanation
Chapter 4 System Analysis
  4.1 Reduction Analysis
    4.1.1 Statistical Analysis
    4.1.2 Brute Force Analysis
    4.1.3 Reduction Efficiency
  4.2 Performance Penalty Analysis
  4.3 Tradeoff Analysis
  4.4 Power Analysis
  4.5 Error Detection Analysis
Chapter 5 Implementation and Analysis
  5.1 High Level Architecture
    5.1.1 Decision Circuit
    5.1.2 Encoder Circuit
    5.1.3 Decoder Circuit
  5.2 Complexity Analysis
    5.2.1 Decision Circuit
    5.2.2 Encoder/Decoder Circuit
Chapter 6 Experimental Analysis
  6.1 Benchmark Simulation
    6.1.1 Random Image Data
    6.1.2 SPEC2000 Benchmark
  6.2 Overall Delay Analysis
  6.3 Overall Power Analysis
Chapter 7 Conclusions
  7.1 Derivative Work
    7.1.1 Serial Busses
    7.1.2 Usage to Caches
    7.1.3 Usage to Clock Recovery
  7.2 Contributions
  7.3 Comparison
  7.4 Conclusions
Bibliography
List of Tables
Table 2.1 Bus Invert Decision Circuit
Table 2.2 Bus Invert Decision Encoder
Table 3.1 Block Arrangement
Table 3.2 Sample Coding Process
Table 4.1 Statistical Percentage Reduction
Table 4.2 Performance Metric, Ps variation with buffer depth (bitstream length)
Table 4.3 Error Detection Analysis
Table 6.1 A comparison of transition reduction for Bus invert and the Proposed Technique for Images
Table 6.2 Comparison of encoder delay in Bus Invert and Transition Inversion
Table 6.3 Power dissipation of extra circuitry for transition inversion
List of Figures
Figure 1.1 System Block Diagram
Figure 1.2 Component Block Diagram
Figure 2.1 Location of Bus Invert in Bus Chain
Figure 2.2 Bus Invert Block Diagram
Figure 3.1 Sample Bitstreams Waveform
Figure 4.1 Transition Reduction with Change in Buffer Depth
Figure 4.2 Effects of Frequency scaling with Buffer Depth
Figure 4.3 Proposed Technique Vs Parity Bit Technique
Figure 5.1 High Level Architecture
Figure 5.2 Decision Circuit (Transition Counter)
Figure 5.3 Encoder Circuit
Figure 5.4 Decoder Circuit
Figure 6.1 Comparison of Transition Inversion to Bus Invert for Images
Figure 6.2 Comparison of Transition Inversion to Bus Invert for SPEC2000
Figure 6.3 Transition reduction in Gray Coding
Figure 7.1 Serial Coding High Level Architecture
Figure 7.2 Cache Architecture for Transition Inversion
List of Relevant Publications
• Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, selected for publication in a special issue of the Journal of Low Power Electronics, to appear in October 2010.
• Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, 23rd International Conference on VLSI Design, January 2010.
• Joint Winner of Intel Research Challenge (also known as Intel Scholar Program) 2008-2009, http://www.intel.com/cd/corporate/education/APAC/ENG/in/news/news43/419015.htm
• Abinesh R., Bharghava R., M.B. Srinivas, “Transition Inversion Based Low Power Data Coding Scheme for Synchronous Serial Communication”, 2009 IEEE Computer Society Annual Symposium on VLSI (ISVLSI), pp. 103-108, 2009.
• Bharghava R., Abinesh R., Suresh Purini, Govindarajulu Regeti, “Inexact Decision Circuits: An Application to Hamming Weight Threshold Voting”, selected for publication in a special issue of the Journal of Low Power Electronics, to appear in October 2010.
• Bharghava R., Abinesh R., Suresh Purini, Govindarajulu Regeti, “Inexact Decision Circuits: An Application to Hamming Weight Threshold Voting”, 23rd International Conference on VLSI Design, January 2010.
Chapter 1
Introduction
Green computing has become an important phenomenon in system design in
recent times. While high speed was considered a major thrust area for research a
few years back, designing highly efficient systems has become the norm of the
day [45][21][22][23]. Green computing or Green IT is being used as an umbrella
term for every type of system design, including mechanical, electronic and
software design. In electronics, the effort to wring the last bit of efficiency out
of the resources is concentrated mostly on designing low power systems
[40][41][42]. Growth in this direction is also being driven in no small measure by
the explosive growth in portable computing devices. These battery driven devices
have made power consumption an important parameter in system design rather than
something that is optimized as an afterthought.
1.1 Low Power Systems
Low power design, from a system perspective, happens at all levels of the
digital electronic system stack, from the lowermost device level
design to the topmost software design. At the intermediate levels, too,
a lot of effort is being expended to make systems run at low power
while keeping the compromise in performance to a minimum. The increasing density
of integrated circuits as postulated by Moore’s law [50] makes it even more
important to have low power systems, since the power supply for such a densely
integrated circuit may not keep pace in size with the miniaturization of the
electronic components. Hence research is being carried out at all levels of the
system stack. A system can consist of multiple components. They can be broadly
classified, and a communication framework designed between them, as shown in
Figure 1.1.
Figure 1.1 System Block Diagram
Each component in a system needs to communicate with the others through
some form of communication bus mechanism, which is part of the
component itself. The control core can be a CPU (Central Processing Unit),
FPGA (Field Programmable Gate Array), microcontroller etc. The bus
mechanism itself is standardized into multiple busses, each of which services
specific parts of the system. A block diagram of a component from a
communication perspective is shown in Figure 1.2. It splits a component into the
component core, a medium access layer (MAC) and a physical layer. The
component core is specific to the component, be it CPU, memory etc. The
system architecture of each component is the high level representation of that
component, while the circuit design is the exact design of that component.
Figure 1.2 Component Block Diagram
The proposed work deals exclusively with the physical layer of such a bus
mechanism. The other layers of the stack, including the MAC, are left untouched
in the proposed technique. Both the system architecture and the circuit design of
the physical layer are dealt with in the proposed technique.
1.1.1 System Architecture Level
System architecture level work on processor design,
operating systems and other higher levels of the stack has already borne fruit,
with many innovations appearing in the market. The recession
of 2007-2010 also contributed, since it drove innovation towards making
systems more efficient so that costs could be kept low. One notable phenomenon
to become popular out of the recession induced changes was the netbook
taking market share away from laptops [5]. These computing devices were
targeted at a lower price point than the then expensively priced laptops. They
were built on highly power efficient processor architectures (Intel Atom, VIA
Nano) and specially tuned versions of commercial operating systems (Windows
XP with a netbook setting, Linux with features such as the tickless kernel
[6]). Like any design tradeoff, these devices traded off performance
for high efficiency [7]. As a result, these devices sported longer battery life
than any laptop with a similar battery pack. Power efficiency came with
the added incentive of a low price [8]. Rather than blindly using MHz
(megahertz) or MIPS (million instructions per second) as a metric for qualification,
performance per watt (MHz/Watt or MIPS/Watt) has become the new metric for
comparison. This is exemplified by the new list of the top 500 environmentally
efficient supercomputers, which serves as an add-on to the well known top 500
supercomputer list [19][20].
1.1.2 Circuit Level
Innovation goes into all levels of the stack. At the lower levels,
advances in device physics have contributed to decreasing power. But this
may not last long, since many limiting factors come into the picture as the
transistor feature size is reduced. They range from lithographic inaccuracies to
process variations [9].
In another direction of research, power management protocols have been
developed incorporating various features that take into account circuit related
innovations. Some of the important features are voltage and clock scaling
[10][12]. In these techniques, the clock frequency and core voltage of the
processor are changed depending on the load. If the system is running fewer
processes, it is scaled down to a lower voltage and clock
frequency. This is done in a tightly coupled fashion between the CPU,
motherboard and OS (Operating System). ACPI (Advanced Configuration and
Power Interface) is one such widely used protocol that supports multiple types of
states such as processor, performance, global and device states [11]. Power
management techniques are playing a wider role with the advent of multi-core
processors. Multi-core processors, having multiple mostly similar cores, can be
power managed based on load by switching cores on and off. Notable technologies
to move out of research into the mainstream in recent times are Intel’s
TurboBoost and AMD’s TurboCore [47].
Power management is also developing in the direction of having multiple
circuits for the same functionality with different power/performance parameters
and switching between them depending on the power saving required. One
good example is the Optimus technology from Nvidia, which combines a high
performance Nvidia GPU and a low power Intel IGP (Integrated Graphics
Processor). This technology seamlessly switches between the two chips depending
on the performance required [24][25].
Another example is the Lenovo IdeaPad U1. This notebook contains two
separate computer systems in one casing. The LCD display itself contains an
ARM processor based system that can be detached and used separately as a
tablet. When the display is attached to the notebook chassis, it becomes an Intel
processor based system. The two run separate OSs (Operating Systems) with
separate memories [26][27][28]. Also, as per the law of diminishing returns, once
a high performance state is achieved, any small improvement in the output
generally requires a large change in input [46]. Most existing low power
techniques work on the premise of tuning various parameters to attain low
power rather than designing for low power.
In the following body of work, a novel method of designing off-chip
communication buses is proposed and the results are explained to show its efficacy
in reducing power consumption.
1.2 Computer Bus Systems
A typical computer system consists of various components including the
control core (CPU/FPGA etc.), communication buses and various peripheral I/O
(Input/Output) mechanisms. Busses constitute an important resource for
addressing and data transfer in the implementation of most electronic systems. They
are used throughout the system, from the basic mechanism by which CPUs address
memory right up to the high bandwidth buses needed for graphics applications.
These off-chip busses consume dynamic power, $p_{org}$, which is given by
$p_{org} = \alpha f C_T V_{dd}^2$
where $V_{dd}$, $f$, $C_T$ and $\alpha$ represent the supply voltage, frequency of
operation, line capacitance and switching activity respectively. Switching activity
measures how often the value of the data on a line changes. Dynamic power is
consumed whenever there is a change of value on the line. Any transition, either
from 1 → 0 or 0 → 1, will consume this dynamic power.
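As a rough illustration of the relation above, the following Python sketch estimates the switching activity of one line from a sample bitstream and evaluates the dynamic power expression. The function names, the sample bitstream and the electrical values (100 MHz, 10 pF, 3.3 V) are hypothetical and chosen only for illustration; the switching activity is assumed here to be the fraction of bit periods in which the line changes value.

```python
def switching_activity(bits):
    """Fraction of bit periods in which the line changes value (assumed definition)."""
    if len(bits) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(bits, bits[1:]))
    return changes / (len(bits) - 1)


def dynamic_power(alpha, freq_hz, cap_farads, vdd_volts):
    """Evaluate p_org = alpha * f * C_T * Vdd^2 for a single bus line."""
    return alpha * freq_hz * cap_farads * vdd_volts ** 2


# Hypothetical bitstream observed on one bus line.
line = [0, 1, 1, 0, 1, 0, 0, 1]
alpha = switching_activity(line)

# Hypothetical operating point: 100 MHz, 10 pF line capacitance, 3.3 V I/O.
p_org = dynamic_power(alpha, 100e6, 10e-12, 3.3)
print(f"alpha = {alpha:.3f}, p_org = {p_org * 1e3:.3f} mW")
```

Since the power is linear in the switching activity, any coding scheme that halves the number of transitions on a line would, under this model, halve the dynamic power dissipated on that line.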
Recent advances in computing applications such as graphics and scientific computing
have raised data transfer requirements to such high levels that bus
interfaces are constantly being ratcheted up to higher performance points with
respect to bandwidth, usability etc. These applications are highly memory
intensive rather than just CPU intensive. They need enormous amounts of
data to be transferred for computation, which has increased the bandwidth
requirements of off-chip busses. This in turn entails higher frequencies, which
lead to higher power consumption.
Reducing this off-chip bus power consumption has become one of the key
issues in low power system design. The fact that the power consumed in bus
accesses accounts for a significant fraction of the total power consumed in VLSI
(Very Large Scale Integration) systems has been independently
established by many researchers [13][14][15]. This is because the
self-capacitance of busses is quite large in comparison to the capacitance of other
data-path units such as those within a CPU. The capacitance tends to be higher for
an off-chip bus than for on-chip interconnects since the traces are longer. Also, the
busses are operated at higher voltage levels than a CPU. The reduction in
operating voltage achieved in CPUs cannot be replicated in busses, since these
are external to the chip and noise margins generally prevent any further reduction in
off-chip bus voltages.
There are essentially two ways to reduce power consumption in busses.
The first involves minimizing bus accesses, by either reducing the number of
data-path units connected to large busses [14] or reducing the number of
READ/WRITE bus accesses to large memory units through algorithm
transformations [16]. The second way is to
reduce bus transition activity. In this regard many researchers have studied
reduction of bus transition activity by resorting to coding, similar to
error-correcting codes [17][18]. This approach has been effective, but the delay
caused by the encoder/decoder limits the maximum bandwidth at which the bus can
operate. The extra circuitry causes a drop in performance, rendering it
unsuitable for most bus systems. Moreover, the power consumed by the encoder and
the decoder has to be less than the power saved as a result of activity reduction
on the bus. These constraints, which are imposed on the encoder/decoder logic,
limit the space of possible encoding solutions and have prevented
these techniques from being used in practical systems.
1.2.1 Practical Considerations
Most bus encoding techniques involve overhead in terms of space complexity,
delay and their own operational power. The delay of the circuit is the time the
circuit takes to encode the data. This limits the bandwidth of the system, since
the circuit should be able to encode a data element before the next one arrives.
The overhead power incurred by the extra circuitry also reduces the
effectiveness of the technique. This is particularly true for on-chip interconnects,
where the interconnect voltage, frequency and capacitance are comparable to
those of the circuit itself. Off-chip busses stand a better chance with such
techniques, since the I/O voltage and capacitance are much higher than those of
the internal circuitry. Even so, most techniques have overhead in terms of delay
that severely restricts their bandwidth. For these reasons, the bus systems that
are in popular use at present do not employ any form of low power bus coding.
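The two practical constraints described above can be written down as a small Python sketch. It assumes a non-pipelined encoder sitting directly in the data path, so that one data element must be encoded within one bus clock period; the function names and the numbers used (a 4 ns encoder delay, 20 mW saved on the bus, 5 mW spent in the encoder and decoder) are hypothetical and are not measurements from this work.

```python
def max_bus_frequency_hz(encoder_delay_s):
    """Highest bus clock at which each data element is encoded before the next arrives.

    Assumes a non-pipelined encoder in the data path (one element per clock period).
    """
    return 1.0 / encoder_delay_s


def net_power_saving_w(bus_power_saved_w, codec_power_w):
    """Power saved on the bus minus the power spent in the encoder and decoder."""
    return bus_power_saved_w - codec_power_w


# Hypothetical figures for illustration only.
print(max_bus_frequency_hz(4e-9) / 1e6, "MHz")       # bandwidth ceiling set by the encoder
print(net_power_saving_w(20e-3, 5e-3) * 1e3, "mW")   # coding is worthwhile only if positive
```

A coding scheme is attractive only when the frequency ceiling is above the required bus frequency and the net saving is positive.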
1.3 Transition Inversion
Transition inversion is the proposed novel technique for
reducing power consumption in buffered bus systems [1][2][3][4]. One aspect of
buffered bus systems is that they can literally see the data that is going to be
put on the bus in the future. This is possible because a given block is transmitted
only after it is completely loaded into the buffer. In the proposed technique,
transitions along a line are manipulated to reduce them with little delay. The
sequential nature of the technique makes it suitable for a pipelined approach, thus
avoiding bandwidth limitations.
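The idea can be pictured with a short Python sketch. It is only one plausible reading of the description so far and not the algorithm of Chapter 3: it assumes that a buffered block is viewed as one bitstream per bus line, that the transitions of a bitstream are inverted when more than half of the bit periods contain a transition, and that inverting means every bit period that had a transition now has none and vice versa. The function names are hypothetical, the initial state of the line before the block is ignored, and how the decision reaches the receiver is left out.

```python
def line_bitstreams(buffer_words, width):
    """View a buffered block as one bitstream per bus line (line 0 = least significant bit)."""
    return [[(word >> line) & 1 for word in buffer_words] for line in range(width)]


def count_transitions(bits):
    """Number of 0->1 and 1->0 changes along one line."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def invert_transitions(bits):
    """Keep the first bit; turn every transition into a non-transition and vice versa."""
    if not bits:
        return []
    out = [bits[0]]
    for prev, cur in zip(bits, bits[1:]):
        had_transition = prev != cur
        out.append(out[-1] if had_transition else 1 - out[-1])
    return out


def encode_line(bits):
    """Invert the transitions when they occupy more than half of the bit periods (assumed rule)."""
    periods = len(bits) - 1
    if count_transitions(bits) > periods / 2:
        return invert_transitions(bits), True
    return list(bits), False


# A worst-case line toggles on every clock; after encoding it does not toggle at all.
encoded, inverted = encode_line([0, 1, 0, 1, 0, 1, 0, 1])
print(encoded, inverted, count_transitions(encoded))
```

Under these assumptions the inversion is its own inverse, so a decoder that knows the decision can recover the data by applying `invert_transitions` once more.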
The one limitation of the proposed technique is that it can be used only
with buffered busses, but this is not much of a practical limitation since most
modern busses tend to be buffered in design. The proposed technique of transition
inversion mitigates most of the impractical aspects of existing comparable
techniques. Various analyses supporting this point are presented in the following
sections.
1.4 Contributions
The major contributions in this thesis are as follows:
1. A novel technique for power reduction in busses is proposed that
makes use of the buffered nature of modern busses.
2. An optimized circuit design for the technique has been developed using a
pipelined approach. Complexity analysis is done on the circuit design.
3. The algorithm and the circuit have been analysed to test their
efficacy.
4. A comparison with other techniques has been made.
5. Applications have been proposed as usage scenarios.
1.5 Organization of Thesis
A motivation for the work has been given in the preceding sections. The rest of
the thesis is organized as follows:
Chapter 2 reviews some of the current work done in this field.
Chapter 3 explains the proposed technique of Transition Inversion.
Chapter 4 presents an analysis of the technique from an algorithmic perspective.
Chapter 5 describes the pipelined circuit implementation of the technique and
analyses it.
Chapter 6 analyses the experimental results of the designed circuit by using
various benchmarks and comparing with other techniques.
Chapter 7 describes applications that have been derived out of the technique and
concludes the work.
Chapter 2
Related Work
From a data characteristic perspective, existing techniques for reducing
transitions have been developed both for general data and for data that has some
special nature. General data is data that flows on a general bus and does not
have any pattern. Special data has a specific pattern. Examples of
special data are address bus traffic, audio/video data etc. Most research has gone
into special busses, more specifically address busses, since they are much simpler
to handle and the hardware overhead is small compared to general busses.
The simplicity is in terms of the delay incurred, which
has generally been shown to be less for special busses than for general busses.
2.1 General Busses
General data can contain anything that runs on a general purpose bus like
PCI (Peripheral Component Interconnect), Northbridge/Southbridge FSB etc.
They are used as general purpose busses to interconnect various components in
a system. These busses can carry application binary data, user content etc. They
can also carry graphics/audio data, but no special attention is given to it and it is
treated as general purpose data.
Of general purpose busses, research has gone into techniques that
depend on the data that is known only at runtime. One of the most often cited
encoding methods is the Bus Invert method [13]. Bus-invert selects between the
original and the inverted pattern in a way that minimizes the switching activity on
the bus. The resulting patterns together with an extra bus line (to notify whether
the original data or its complement has been sent) are signalled over the bus.
The proposed technique, also being designed for a general bus, will be
compared with bus invert in all respects in the following body of work. Bus invert
is discussed in detail in the following sections.
Other encoding techniques include Gray coding [38], which takes the
serial XOR between data words to reduce transitions. This follows from the fact
that Gray code has long been used for reducing transitions. Frequent Value
Encoding (FVE) is another technique [30] that results in a significant reduction in
transitions, but it has not been considered here due to the excessive run time
overhead involved. It maintains a codebook of frequently used values
and encodes them using code words with fewer transitions. It needs
significant memory area for the codebook, and the overhead circuitry itself limits
bandwidth.
The proposed technique is for general busses and is compared with bus
invert and Gray coding in the upcoming chapters.
2.2 Special Busses
Special data are those data that are recognized by the system itself as
serving some very specific purpose, typically without software intervention.
Examples can include the HDA (High Definition Audio) audio chipsets found in
most modern machines. These chipsets by themselves have no functionality to
encode/decode/convert audio. Other Codec (Coder Decoder) chips are
connected to them by a special purpose bus that is specifically designed for
audio transfer. This bus is also defined as a standard so that HDA chipsets and
Codec chipsets can be used interchangeably. Whatever data is transferred
on this bus between the HDA chip and the Codec chip is inherently audio
data. This special nature of the bus has been used in research to design low
overhead compression techniques, which lead to less use of the bus and
thus lower power. One example is FLAC (Free Lossless Audio Codec), which
pushes most of the processing onto the encoder, keeping the decoder simple. With
the decoding happening on-chip, which operates at a lower voltage than offchip
busses, power is reduced if the bus is used less. Since most audio usage is
playback, that is decoding, this also leads to less processing and thus
lower power [48].
Another example of such a special bus that is a bit more complex is the
texture memory transfer in GPUs. GPUs make use of special texture memory to
store textures which will get warped onto 3D (3 Dimensional) models to generate
realistic looking 3D graphics. The bus that interconnects a GPU to such a
memory is also specific in nature. Here too, techniques have been proposed in the
literature to encode data [29]. They involve coding the image data such that it is
compressed and thus less data is put on the bus. This of course involves
computation to extract the data at the GPU, but the savings made on an offchip
bus tend to be greater than the overhead of the internal computation.
In the domain of address busses, Musoll et al. proposed the working zone
method [15]. Their method takes advantage of the fact that data accesses tend to
remain in a small set of working zones. This technique sends only the offset of
the location being addressed with respect to the previous addressed location
along with information about the current working zone. This too limits
bandwidth, since a search of the set of working zones has to be performed
before an address can be sent. Another popular technique is Asymptotic
Zero-Transition Encoding [14], which relies on the fact that most addresses tend
to be consecutive. The receiver can therefore predict the address itself and be
ready with data. Only exceptions to a previously agreed protocol need to be
transmitted by the sender.
Another technique to reduce transitions is Gray coding for
addresses. Most of the existing techniques make use of the repeating patterns on
address busses to reduce address bus transitions [33][34][35][36]. Gray coding
works by taking a bit serial XOR between address words, which generally tend to
be consecutive, giving rise to data that can be encoded with fewer
bits.
There is no existing literature on bus coding methodologies for block data
transfer other than Serial Bus Invert [37], which encodes blocks of data rather
than individual data words. Serial Bus Invert is similar to Bus Invert, the difference
being that the decision bits are transmitted as a word at the end of the block data
transfer. The technique proposed in this work is compared with Bus Invert and Gray
coding [38][32].
2.3 Bus Invert
Bus Invert works by counting the number of transitions, which involves
XORing the present and previous data. If the number of transitions is more
than half the bus width, the inverted data is transmitted; otherwise the original
data is transmitted. A separate line is also added to the bus to carry the decision.
This is an overhead to the design of the system, requiring extra circuitry as
well as traces. The decision bit signifies whether the data on the bus is
the original data or its complement.
Figure 2.1 Location of Bus Invert in Bus Chain
The Bus Invert logic has to be added to an existing bus interface at the
point where the external interface begins. It can be placed just before the level
converter, which converts the chip internal voltage levels to external voltage
levels. The logic takes its input from the bus core system, processes it
and feeds the result to the level converter. A block diagram of the location of
the Bus Invert logic in a bus chain is shown in Figure 2.1. The Bus Invert algorithm
is explained below:
________________________________________________________________
Algorithm 1: Bus Invert
________________________________________________________________
1. Count the transitions between the data on the bus and the next data
   that is to be put on bus
2. if transitions count < half of the bus width
3.     Assign next data to bus
4. else
5.     Invert the next data and assign the complement to bus
________________________________________________________________
A block diagram of the Bus Invert system is shown in Figure 2.2. A
sample coding process for a sample data is shown in Table 2.1.
Figure 2.2 Bus Invert Block Diagram
Table 2.1 Bus Invert Decision Circuit
Bit No.                                            1  2  3  4  5  6  7  8
Current Data on bus                                1  0  1  0  1  0  1  1
Next Data to be put on Bus                         0  1  1  1  0  1  0  1
XOR of present and next data (Transition Vector)   1  1  0  1  1  1  1  0
In the given example the number of transitions is 6 which is more than half
the bus width, 4. So the data is inverted and then sent. The decision is sent on a
separate line. An XOR between the current data and the next data that is put on
the bus shows that the transitions are reduced to 2 which is also given by (N-t)
where N is the bus width and t is the original number of transitions. The encoding
process is shown in Table 2.2.
Table 2.2 Bus Invert Decision Encoder
Bit No.                        1  2  3  4  5  6  7  8
Next Data to be put on Bus     0  1  1  1  0  1  0  1
Next Data that is put on Bus   1  0  0  0  1  0  1  0
Current Data on bus            1  0  1  0  1  0  1  1
XOR of current and next data   0  0  1  0  0  0  0  1
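The worked example of Tables 2.1 and 2.2 can be reproduced with a short sketch. This is only an illustrative software model of the Bus Invert decision, not the circuit itself; the function name `bus_invert` and the list-of-bits representation are assumptions made here, and the transitions on the extra decision line are not counted.

```python
def bus_invert(current_bus, next_data):
    """Return (word_to_drive, invert_flag) for one Bus Invert step.

    Both arguments are equal-length lists of 0/1 bits.
    """
    width = len(next_data)
    # Transition vector: 1 wherever a line would toggle.
    transitions = sum(c ^ n for c, n in zip(current_bus, next_data))
    if transitions > width / 2:
        # Driving the complement toggles only (width - transitions) lines.
        return [1 - b for b in next_data], 1
    return list(next_data), 0


current = [1, 0, 1, 0, 1, 0, 1, 1]   # Current Data on bus (Table 2.1)
nxt = [0, 1, 1, 1, 0, 1, 0, 1]       # Next Data to be put on Bus
sent, invert = bus_invert(current, nxt)
remaining = sum(c ^ s for c, s in zip(current, sent))
print(sent, invert, remaining)
```

For the data of Table 2.1 the sketch selects the inverted word and leaves 2 transitions, in agreement with (N-t) = 8 - 6.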
The whole operation involves a chain of full adders to count the transitions
and then another XOR on the data that has to be sent. All these
operations have to be done before the next data is put on the bus. The
chain of full adders operating on the output of the array of XOR gates contributes to
the delay in taking a decision. This delay is the parameter that limits the
maximum bandwidth of the system. Beyond this, the encoder delay also has to be
taken into account, which involves a parallel XOR to perform controlled inversion.
This entire set of operations has to be over by the time the next data arrives,
leading to a restriction on bandwidth.
Existing work in the field of low power bus coding has been discussed, with
special attention paid to Bus Invert. As seen, the major limiting factor for most of
these techniques is delay and the resulting limit on bandwidth.
To mitigate these issues, the technique of Transition Inversion is proposed and
discussed in the next chapter.
Chapter 3
Transition Inversion
This chapter details the proposed algorithm to be used in inverting the
transition states to achieve power reduction.
3.1 System Scenario
The algorithm is proposed specifically for offchip block data transfer
busses. In block data transfer, data is generally loaded onto a buffer and then put
on the bus. Each line in the bus is a serial line that will transmit one particular bit
position of all data words that are put on the bus. A typical block can be as shown
in Table 3.1. The buffer will usually be able to hold more than one
block of data, but transmission still happens with a granularity of one block.
Table 3.1 Block Arrangement
Buffer data   Bit Pattern
Data 1        D1,8  D1,7  D1,6  D1,5  D1,4  D1,3  D1,2  D1,1
Data 2        D2,8  D2,7  D2,6  D2,5  D2,4  D2,3  D2,2  D2,1
Data 3        D3,8  D3,7  D3,6  D3,5  D3,4  D3,3  D3,2  D3,1
Data 4        D4,8  D4,7  D4,6  D4,5  D4,4  D4,3  D4,2  D4,1
Data 5        D5,8  D5,7  D5,6  D5,5  D5,4  D5,3  D5,2  D5,1
Data 6        D6,8  D6,7  D6,6  D6,5  D6,4  D6,3  D6,2  D6,1
Data 7        D7,8  D7,7  D7,6  D7,5  D7,4  D7,3  D7,2  D7,1
Data 8        D8,8  D8,7  D8,6  D8,5  D8,4  D8,3  D8,2  D8,1
The bits taken in the vertical direction form a bitstream that travels in a line.
Taking the bus into account, it is a collection of bitstreams running in parallel. The
columns in the table represent the lines of the bus. The rows represent each data
element of the buffer. When transmitting, the bits from each element travel in
parallel across the lines. In the perspective of a line, bits of all data elements of
one position are transmitted sequentially. This forms bitstreams on all lines
composed of corresponding bits of all data elements.
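The relation between buffer words and per-line bitstreams can be sketched as a simple transposition. The helper `line_bitstreams` and the sample block below are hypothetical and only illustrate the arrangement of Table 3.1, assuming each data word is held as an integer.

```python
def line_bitstreams(words, width=8):
    """Split a block of data words into one bitstream per bus line.

    words: integers in buffer order (Data 1 first).
    Returns a list indexed by bit position (0 = least significant bit);
    entry k is the sequence of bits that line k carries.
    """
    return [[(word >> k) & 1 for word in words] for k in range(width)]


block = [0b10101011, 0b01110101, 0b00000001]  # hypothetical 3-word block
streams = line_bitstreams(block)
print(streams[0])  # bits carried by the least significant line
print(streams[7])  # bits carried by the most significant line
```

Each inner list is the bitstream on which the transition count of the proposed technique would be taken.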
The algorithm is developed for offchip bus systems for a variety of reasons.
One reason, as mentioned previously, is that the saving that can be achieved on
an offchip bus can be much higher than on an onchip interconnect. Designing
around delay restrictions is also somewhat easier, since the bus frequency
generally tends to be lower than the CPU frequency. In addition, the voltage and
capacitance of an offchip bus tend to be higher than those of internal circuitry.
Thus the extra power consumed by the overhead circuitry will not lead to a
significant reduction in overall power saving, as will be shown in the following
chapters.
3.1.1 Mutual Capacitance
Another factor that is important in bus design is mutual
capacitance, which can lead to cross talk and its own power dissipation. Cross talk
is the interference of one line with the neighbouring lines. If there is considerable
mutual capacitance between lines, high frequency signals can easily leak into
the neighbouring lines and corrupt the data.
With offchip busses, self capacitance plays a major role compared to that
of mutual capacitance. This is because mutual capacitance falls off exponentially
with increase in trace spacing. The trace spacing of an offchip bus is generally
much higher compared to that of an onchip bus. Thus effects of mutual
capacitance can be ignored for most calculations done on offchip busses [39].
3.2 Algorithm
Before transmission, the number of transitions on a line is counted. This is
just counting the transitions of the bitstream on that line. It can be done by a
simple XOR gate between consecutive bits and counting the number of ‘1’s. If the
number of transitions is more than half the number of data words, the transition
states between the bits are inverted: each transition is made a non-transition
and vice versa. If not, the bit stream is transmitted as such. When
transition inversion is needed, the scheme operates by observing the transition
state between any 2 consecutive bits and setting the encoded second bit to be the
same as the previous encoded bit if there is a transition. If there is no transition,
the encoded second bit is set to the complement of the previous encoded bit. The
decision bit signifying transition inversion is
transmitted before transmitting the encoded data. Also, the first bit of the
bitstream is transmitted as such. This has to be done on all lines. Since the data
is sent as a block, the extra bit on each line signals the decision for its respective
bit stream. The algorithms for the encoder and decoder are discussed below.
________________________________________________________________
Algorithm 2: Transition Inversion Encoder
________________________________________________________________
1. Count the transitions between the bits of the bitstream of a line as it is
   being loaded into buffer
2. if transitions count < half of the buffer depth
3.     Transmit decision bit signifying no inversion
4.     Assign the unmodified bitstream to the bus line
5. else
6.     Transmit decision bit signifying inversion
7.     Transmit first bit of bit stream
8.     for the rest of the bits in bitstream
9.         if present data bit ≠ next data bit
10.            Assign previous transmitted bit as next transmitted bit
11.        else
12.            Assign complement of previous transmitted bit as next transmitted bit
13.        end if
14.    end for
15. end if
________________________________________________________________
________________________________________________________________
Algorithm 3: Transition Inversion Decoder
________________________________________________________________
1. if decision bit signifies inversion
2.     Take the first bit as first decoded bit
3.     for the rest of the bits in the incoming bitstream
4.         if next received bit ≠ previous received bit
5.             Assign previous decoded bit as next decoded bit
6.         else
7.             Assign complement of previous decoded bit as next decoded bit
8.         end if
9.     end for
10. else
11.     Take the first bit as first decoded bit
12.     for the rest of the bits in the incoming bitstream
13.         Assign incoming bit as decoded bit
14.     end for
15. end if
________________________________________________________________
The transitions in the bit stream transmitted on a line can be reduced by
the aforementioned scheme. Each line is processed independently of the others. If
there is a need for transition inversion, then the following steps are followed to
obtain the encoded data. Let the data bit that is to be transmitted next be $b_d$ and
the previous data bit be $b_{dp}$. The previous transmitted bit is $b_{tp}$. The next
transmitted bit will be
\[ b_t = \begin{cases} b_{tp} & \text{if } b_d \neq b_{dp} \\ \overline{b_{tp}} & \text{if } b_d = b_{dp} \end{cases} \]
In the receiver the reverse logic needs to be applied. When the bit stream has
been signaled as modified, then the following steps are followed to decode data.
The previous and current received bits are taken to be $b_{rp}$ and $b_r$
respectively. The previous decoded bit is taken to be $b_{dp}$. The current
decoded bit will be
\[ b_d = \begin{cases} b_{dp} & \text{if } b_{rp} \neq b_r \\ \overline{b_{dp}} & \text{if } b_{rp} = b_r \end{cases} \]
The encoding and decoding are done on the fly to reduce performance
losses. For example, in the bit stream 10101011 the number of transitions is 6.
This is more than half the maximum number of transitions, which is 7 for an 8 bit
stream, giving a threshold of 3.5. Thus this
stream is to be modified according to the algorithm described above. The first bit
is transmitted as such, without any change. This is described in Table 3.2. The
encoded data has only one transition, and the process is explained in the next
section.
Key: NT – No Transition, T – Transition, NC – No Change
Table 3.2 Sample Coding Process
Bit No.              1   2   3   4   5   6   7   8
Bit stream           1   0   1   0   1   0   1   1
Transition State     NC  T   T   T   T   T   T   NT
Encode state         NC  NT  NT  NT  NT  NT  NT  T
Encoded Bit stream   1   1   1   1   1   1   1   0
Decode state         NC  T   T   T   T   T   T   NT
Decoded bit stream   1   0   1   0   1   0   1   1
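The encoder and decoder rules can be checked against Table 3.2 with a minimal software sketch. The function names `ti_encode` and `ti_decode` are illustrative assumptions, as is the choice of inverting when the transition count exceeds (N-1)/2; the decision bit is returned separately instead of being prepended to the stream.

```python
def count_transitions(bits):
    """Number of positions where consecutive bits differ."""
    return sum(a != b for a, b in zip(bits, bits[1:]))


def ti_encode(bits):
    """Return (decision_bit, encoded_bits) for one line's bitstream."""
    if count_transitions(bits) <= (len(bits) - 1) / 2:
        return 0, list(bits)                 # sent unmodified
    encoded = [bits[0]]                      # first bit is sent as such
    for prev, cur in zip(bits, bits[1:]):
        if cur != prev:
            encoded.append(encoded[-1])      # transition -> non-transition
        else:
            encoded.append(1 - encoded[-1])  # non-transition -> transition
    return 1, encoded


def ti_decode(decision, received):
    """Recover the original bitstream from the received one."""
    if not decision:
        return list(received)
    decoded = [received[0]]
    for prev, cur in zip(received, received[1:]):
        if cur != prev:
            decoded.append(decoded[-1])
        else:
            decoded.append(1 - decoded[-1])
    return decoded


stream = [1, 0, 1, 0, 1, 0, 1, 1]
decision, encoded = ti_encode(stream)
print(decision, encoded, ti_decode(decision, encoded))
```

The decoder applies the same transition-complementing rule as the encoder, which is why a single routine could serve both ends in a software model.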
One very important observation about the proposed technique
is that it does not involve any additional bus lines, unlike most other techniques
including Bus Invert. This is a major improvement, since it does not add to the
complexity of board designs, where every line/trace is extensively tested for noise
and shielding, and it does not increase the board resources used.
3.3 Explanation
To elucidate further, a few samples from the above table can be
examined. Comparing bits 1 and 2, it can be seen that there is a transition from 1 to 0.
So the transition state of the encoded bitstream’s bits 1 and 2 should be a
non-transition. This means making the encoded second bit the same as the
encoded first bit. Since the encoded first bit, which is the same as the original bit, is 1,
the encoded second bit is also made 1.
Taking another case of bits 7 and 8, there is no transition. Both bits are 1.
For such a case, during encoding it has to be made a transition for the sake of
consistency. This means the encoded eighth bit has to be the complement of
the encoded seventh bit. Since the encoded seventh bit is 1, the eighth bit is
encoded as 0. The waveforms of the original, encoded and decoded bitstreams
are shown in Figure 3.1.
Figure 3.1 Sample Bitstreams Waveform
Putting all of the bits together, the modified bit stream is 11111110, with
the decision bit set to 1 to indicate that a transition inversion has been done.
At the receiver, if any bus line is signalled as modified, each
transition is made a non-transition and each non-transition is
made a transition. The same process as in the encoder is repeated in the decoder.
Taking encoded bits 1 and 2, it can be observed that there is no transition.
Both are 1. Since the decision bit indicates that transition inversion has taken
place, the transition state has to be complemented. So the decoded bits 1 and 2
should be such that there is a transition between them. This requires the
complement of the first bit to be taken as the second bit. This generates the
second bit, and the process can be continued until the entire bitstream is decoded.
Doing all this, the decoded bit stream is 10101011, which is the same as
the bitstream that was started with. The number of transitions in the original
bitstream was 6 and after encoding it became 1. This is straightforward to see,
since each of the 7 transition states is simply complemented.
In a generic case of N bits, there can be (N-1) transition states. So if the
number of transitions is more than (N-1)/2, then transition inversion has to be
done. If the number of transitions is To, the number of transitions in the encoded
data will be Te = (N-1)-To. The proposed technique has the propensity to reduce
transitions at a slight cost of bandwidth utilization which is due to the
transmission of the extra bit before the actual transmission starts. But the other
limiting factor of delay does not affect the proposed technique unlike other
existing techniques as will be shown in the subsequent chapters. As such, the
hardware for the proposed technique will be located at various parts of the bus
chain. A high level architecture for the proposed technique is discussed in
Chapter 5.
An analysis of the algorithm from a purely theoretical perspective is carried
out in the next chapter. It discusses the efficacy of the proposed technique in
reducing the transitions as well as the performance trade off involved.
Chapter 4
System Analysis
This chapter details a theoretical analysis of the proposed algorithm with
respect to its reduction efficacy, performance trade-off as well as power.
4.1 Reduction Analysis
The efficiency is assessed in terms of the
average reduction that can be obtained. The analysis has been done in two
independent ways: an analytical one and a simulation based one. The
results of both tally with each other and are discussed below.
4.1.1 Statistical Analysis
This analysis is an analytical one. For a buffer of depth N, where
the transitions are taken between consecutive bits, a maximum of (N-1)
transitions is possible in one bit stream. Following the binomial distribution,
the number of transition patterns with ‘i’ transitions is $\binom{N-1}{i}$, out of
$2^{N-1}$ transition patterns in total. $T_{org}$ and $T_{mod}$ are taken
to be the number of transitions in the original data patterns and the number of
transitions in the modified data patterns respectively. These quantities can be
calculated as follows:
Probability of a bit stream with ‘i’ transitions:
\[ P(i) = \frac{\binom{N-1}{i}}{2^{N-1}} \]
Total number of transitions:
\[ T_{org} = \sum_{i=0}^{N-1} i \binom{N-1}{i} \]
Average number of transitions of original data:
\[ E(t_{org}) = \sum_{i=0}^{N-1} i \, P(i) = \frac{1}{2^{N-1}} \sum_{i=0}^{N-1} i \binom{N-1}{i} \]
Transition inversion is done when the number of transitions is more than or
equal to N/2. The number of transitions in the modified data will be (N-1-i) for ‘i’
transitions in the original data. The probability P(i) of a bit stream with ‘i’
original transitions is as given above.
\[ T_{mod} = \sum_{i=0}^{N/2-1} i \binom{N-1}{i} + \sum_{i=N/2}^{N-1} (N-1-i) \binom{N-1}{i} \]
Average number of transitions of modified data:
\[ E(t_{mod}) = \frac{1}{2^{N-1}} \left[ \sum_{i=0}^{N/2-1} i \binom{N-1}{i} + \sum_{i=N/2}^{N-1} (N-1-i) \binom{N-1}{i} \right] \]
The reduction efficiency is obtained by comparing $E(t_{org})$ and $E(t_{mod})$.
The reduction percentage is given by
\[ R = \frac{E(t_{org}) - E(t_{mod})}{E(t_{org})} \times 100 \]
The metric of average reduction can be calculated by taking the difference
between the number of transitions in the modified data patterns, and unmodified
data patterns. This statistical calculation of the algorithm has been carried out for
word lengths of 8, 16, 32, and 64.
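The statistical calculation can be sketched numerically as follows. This is an illustrative reading of the analysis, assuming equally likely transition patterns and inversion whenever the transition count exceeds (N-1)/2; the function names are hypothetical.

```python
from math import comb


def expected_transitions(depth):
    """Return (E_org, E_mod) for a bitstream of the given depth."""
    m = depth - 1                 # number of transition positions
    patterns = 2 ** m             # equally likely transition patterns
    e_org = sum(i * comb(m, i) for i in range(m + 1)) / patterns
    # Inversion maps i transitions to (m - i) whenever that is smaller.
    e_mod = sum(min(i, m - i) * comb(m, i) for i in range(m + 1)) / patterns
    return e_org, e_mod


def reduction_percent(depth):
    e_org, e_mod = expected_transitions(depth)
    return 100 * (e_org - e_mod) / e_org


for depth in (8, 16):
    print(depth, round(reduction_percent(depth), 2))
```

Under these assumptions, depths of 8 and 16 give reductions of 31.25% and 20.95% respectively.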
4.1.2 Brute Force Analysis
This analysis considers all combinations of the word and determines the
original and modified transitions in the datasets. It is essentially a brute force
approach. The data considered was a uniform distribution of all possible data
patterns that are likely to be transmitted over the bus. For example, for a
buffer depth of 8, the number of possible data patterns of one bit stream is 256.
The number of transitions in these data patterns was calculated, along with the
number of transitions in each data pattern after being modified using the proposed
algorithm.
This was evaluated by means of simulation, wherein runs with multiple
word lengths (buffer depths) were carried out. The simulation ran through all
possible combinations for a given length and determined the decision for each
combination. It determined the encoded data in case transition inversion was
done, and calculated the reduction in transitions in every case. The overall
average reduction was calculated from the total number of transitions in all the
combinations and the total number of transitions in all the encoded data.
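A brute force run of this kind might look like the sketch below. It is not the simulation used in this work; it is a minimal re-creation that assumes every bit pattern is equally likely and that inversion is applied when the transition count exceeds (N-1)/2.

```python
from itertools import product


def transitions(bits):
    return sum(a != b for a, b in zip(bits, bits[1:]))


def encode(bits):
    """Transition inversion of one bitstream (decision bit not included)."""
    if transitions(bits) <= (len(bits) - 1) / 2:
        return list(bits)
    out = [bits[0]]
    for prev, cur in zip(bits, bits[1:]):
        out.append(out[-1] if cur != prev else 1 - out[-1])
    return out


def brute_force_reduction(depth):
    """Percentage reduction over all 2**depth bit patterns."""
    total_org = total_enc = 0
    for bits in product((0, 1), repeat=depth):
        total_org += transitions(bits)
        total_enc += transitions(encode(bits))
    return 100 * (total_org - total_enc) / total_org


print(brute_force_reduction(8))
```

The enumeration grows as 2^N, so this approach is practical only for small depths.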
4.1.3 Reduction Efficiency
The results obtained by both the methods agree with each other and are
shown in Table 4.1 and Figure 4.1.
Table 4.1 Statistical Percentage Reduction
Word Length                  8 bits   16 bits   32 bits   64 bits
% Reduction in transitions   31.25    20.95     13.54     9.78
It can be observed that the reduction in transitions itself reduces with
increasing buffer depth. This can also be explained by a simple combinatorial
example.
[Plot: % Reduction in Transitions versus Buffer Depth (8, 16, 32, 64)]
Figure 4.1 Transition Reduction with Change in Buffer Depth
For a data space of some given buffer depth, say N, there can be $2^N$
possible values. Of these values, only two have the
maximum number of transitions, whatever N may be. These two
are sequences of alternating ‘1’s and ‘0’s, one starting with ‘1’ and
the other with ‘0’. They alone give the maximum transition reduction. In contrast,
the possibilities that lead to a transition reduction of just 1 are the most
numerous. This is because the elements that give a reduction
of 1 are those that have a transition count of N/2, and these are the most
numerous since they lie in the middle of the binomial distribution. Because of
this, increasing the buffer depth reduces the transition reduction
efficiency.
This can be mitigated by using smaller buffer depths, which will entail
transmitting more decision bits. Overall, the effect of the decision bit also has to
be taken into account when choosing an appropriate depth.
The effect of the extra decision bit on the overall bandwidth utilization is
explained in the following section.
4.2 Performance Penalty Analysis
The transition inversion algorithm needs an extra bit to be transmitted
before the start of the block of data on all lines. This leads to a decrease in
bandwidth utilization. For a system with a buffer depth of 8, 9 bits are transmitted
on one line. This requires either a slightly longer time to transfer the
data or a slightly higher frequency to maintain the bandwidth. For the above case,
a frequency increase by a factor of 9/8 (or $(N+1)/N$ in general) is needed to maintain
the same bandwidth utilization. Alternatively, an increase in transfer time can be
allowed, provided the performance tradeoff is accepted in the system design.
Whichever option is chosen, the corresponding power consumed by the I/O pads
increases linearly. A performance metric is defined to take into account the
scaling of the frequency and the reduction in transitions, and is calculated as their
product.
Performance Metric, Ps = (frequency/time scaling) * (original reduction
efficiency), where the scaling factor is taken as $N/(N+1)$, so that a buffer
depth of 8 gives 31.25 * 8/9 = 27.78.
Table 4.2 Performance Metric, Ps variation with buffer depth (bitstream length)
Buffer depth   8 bits   16 bits   32 bits   64 bits
Ps             27.78    19.71     13.12     9.63
The variation of this parameter with buffer depth is shown in Table 4.2
and Figure 4.2.
[Plot: Scaled % Reduction in Transitions versus Buffer Depth (8, 16, 32, 64)]
Figure 4.2 Effects of Frequency scaling with Buffer Depth
Bus Invert leads to a reduction in bandwidth, since it imposes a delay in
putting the encoded data on the bus. By following a depth based approach in which
most delays are hidden, bandwidth need not be reduced.
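The scaling of the reduction efficiency by the decision bit overhead can be sketched as below. The values of Table 4.2 are consistent with multiplying the reduction by N/(N+1); that reading, and the function name `performance_metric`, are assumptions of this sketch.

```python
def performance_metric(reduction_percent, depth):
    """Scale the reduction by depth/(depth + 1) to account for the
    extra decision bit sent with every block of `depth` bits."""
    return reduction_percent * depth / (depth + 1)


# Reduction percentages as listed in Table 4.1.
table_4_1 = {8: 31.25, 16: 20.95, 32: 13.54, 64: 9.78}
for depth, reduction in table_4_1.items():
    print(depth, round(performance_metric(reduction, depth), 2))
```

Because the inputs are themselves rounded to two decimals, the last digit of a result may differ slightly from the tabulated Ps.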
4.3 Tradeoff Analysis
The design tradeoff is between reduction efficiency and performance
penalty. If the reduction efficiency has to be high, the buffer size should be small,
which leads to a higher performance penalty and the extra power consumed
on account of it. A larger buffer size leads to a lower performance
penalty but gives a lower reduction efficiency. This tradeoff has to be taken
into account whenever the technique is implemented in practice.
4.4 Power Analysis
The overall power reduction consists of the power reduction achieved by
the transition reduction minus the power consumed by the extra circuitry.
The unmodified dynamic power consumed by the I/O pads is given by
\[ P_{org} = V_{dd}^{2} \, C_T \, f \, \alpha \]
where $V_{dd}$, $f$, $C_T$ and $\alpha$ represent the supply (drain) voltage, frequency of
operation, line capacitance and switching activity respectively. If the power
dissipation of the extra circuitry required for the coding process is taken into
account, then the equation given above has two extra terms on the right hand side,
the encoder and decoder power dissipation respectively. For the algorithm to show
any power reduction, the following relation has to be satisfied:
\[ V_{dd}^{2} \, C_T \, f \left[ \alpha - \alpha_{mod} \right] > (P_E + P_D) \]
where $P_E$ is the power dissipated by the encoder, $P_D$ is the power dissipated by the
decoder and $\alpha_{mod}$ is the modified switching activity.
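The break-even condition can be illustrated numerically. All component values below (supply voltage, frequency, capacitance, encoder and decoder power) are hypothetical placeholders chosen only to show how the relation is evaluated; they are not measurements from this work.

```python
def dynamic_power(vdd, freq, cap, alpha):
    """Dynamic I/O power: Vdd^2 * C_T * f * alpha (in watts)."""
    return vdd ** 2 * cap * freq * alpha


def net_saving(vdd, freq, cap, alpha, alpha_mod, p_enc, p_dec):
    """Power saved by coding minus the encoder and decoder overhead."""
    saved = dynamic_power(vdd, freq, cap, alpha - alpha_mod)
    return saved - (p_enc + p_dec)


# Hypothetical line: 3.3 V, 100 MHz, 10 pF, switching activity 0.5,
# reduced by 31.25% through coding, with 50 uW each for encoder/decoder.
alpha = 0.5
alpha_mod = alpha * (1 - 0.3125)
saving = net_saving(3.3, 100e6, 10e-12, alpha, alpha_mod, 50e-6, 50e-6)
print(saving > 0)
```

A positive result means the relation above is satisfied for the assumed values; a negative one means the coding circuitry costs more than it saves.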
4.5 Error Detection Analysis
The proposed transition inversion algorithm’s propensity towards reducing
the number of transitions to less than half the word length can be used for
detecting some errors. A simplified discussion of this is given below.
A preliminary way of doing this is to determine whether the number of
transitions in the received bitstream is more than half the bitstream length. A
counter placed at the receiver counts the number of transitions in the
incoming bitstream. If this value is more than half the bitstream length, the
incoming data is incorrect.
As a simple and preliminary analysis, the proposed technique is compared
with parity bit technique, since both have similar overhead i.e. addition of one bit
to the bitstream. The parity bit detects all odd bit errors, but misses even bit flips,
whereas, transition inversion can detect a certain percentage of any number of
bit errors.
Table 4.3 Error Detection Analysis (% of errors detected)
No. of Bit errors   Parity Coding   Proposed Technique
1                   100             31.25
2                   0               44.64
3                   100             52.68
4                   0               55.71
5                   100             52.68
6                   0               44.64
7                   100             31.25
8                   0               0
[Plot: Percentage Detection versus Number of Bit Errors (1 to 8), for the Proposed Technique and Parity Bit]
Figure 4.3 Proposed Technique Vs Parity Bit Technique
Error analysis has been done by considering all combinations of the given
bitstream length that are transmitted over the bus. For transition inversion coding
on a buffer depth of 8, all the combinations of bit errors right from one bit error to
8 bit errors have been checked for both the proposed technique and parity bit
technique. The result of this analysis is shown in Table 4.3. The variation in error
detection percentage with increasing number of bit errors is shown in Figure 4.3.
If all the bits are in error, then neither technique can detect the error. In
the proposed technique, if all bits are flipped, the number of transitions remains
the same. Averaging over the entire range of bit errors shows that
the proposed technique and the parity bit technique both have the same value of
50.2%. The average is calculated as the ratio of the total number of errors detected
to the total number of errors possible on the line. Since the proposed technique
cannot give a definite indication of an error by itself, it can be used as a hint to
upper layers of communication that an error has occurred. This is only proposed
as an added advantage that can be achieved with little extra hardware at the
receiver, since decoding will be done anyway.
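The exhaustive check described above can be sketched as follows. This is a reconstruction under stated assumptions: every encoded stream (at most (N-1)/2 transitions) is taken as equally likely, errors on the decision bit are ignored, and an error is flagged when the received stream has more than (N-1)/2 transitions.

```python
from itertools import combinations, product


def transitions(bits):
    return sum(a != b for a, b in zip(bits, bits[1:]))


def detection_percent(depth, n_errors):
    """Percentage of n_errors-bit error patterns flagged at the receiver."""
    limit = (depth - 1) / 2
    # After encoding, every transmitted stream has at most `limit` transitions.
    codewords = [w for w in product((0, 1), repeat=depth)
                 if transitions(w) <= limit]
    detected = total = 0
    for word in codewords:
        for positions in combinations(range(depth), n_errors):
            received = list(word)
            for p in positions:
                received[p] ^= 1          # inject a bit error
            total += 1
            if transitions(received) > limit:
                detected += 1
    return 100 * detected / total


for k in (1, 2, 8):
    print(k, round(detection_percent(8, k), 2))
```

For a depth of 8 this model gives 31.25% for single bit errors, 44.64% for two bit errors and 0% when all eight bits are flipped.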
The proposed technique of transition inversion has been analyzed from an
algorithmic perspective showing its potential in solving the low power problem. In
the next chapter, an optimized implementation of it will be discussed and its
complexity analyzed.
Chapter 5
Implementation and Analysis
This chapter deals with the overall architecture of the system proposed
and the gate level design of the transition counter, encoder, and decoder circuits.
Theoretical complexity of the circuits is also analyzed and compared.
5.1 High Level Architecture
In most block systems, the data buffer is present just before the
transmission part. The core logic of the system places the data inside the buffer
from one side and the transmission happens from the other side. In such systems,
the transmission starts only after a complete block is fully formed at the buffer.
This block data transfer system is by itself a form of pipelined system
that trades latency for throughput. In any pipelined design, delay buffers are
used to split longer delays into shorter delays. There might be an initial delay to
get the first data out, but a consistent throughput is maintained. Thus
bandwidth is not affected even though latency might increase. This pipeline
is what is used in the proposed implementation to cut down on the delay
that forms the bottleneck of most bus coding techniques.
The proposed technique is implemented by pipelining the entire coding
system into two separate blocks, namely:
a) Decision Circuit & Transition Counter
b) Encoder
The bit stream is encoded on the fly as the data is put on the bus, as
shown in Figure 5.1.
Figure 5.1 High Level Architecture
The transition counter is placed right at the entry of the data into the
buffer from the system core. This way, the data that is fed into the buffer will be
analyzed for the number of transitions just as the data is being loaded onto the
buffer. The decision of the transition inversion is made depending on the count of
the transitions and is stored in some extra space in the buffer. This counter
output itself can be the decision as will be elaborated in the following sections.
The bit stream is encoded only if a transition inversion is needed. The
encoding is done on the fly as the data is being put on the bus, which is
possible because the encoder needs to process only the current and the next bit.
It can therefore be implemented with little delay, as shown in the following
sections.
In the receiver the decoder has to decode the incoming bit stream and
recover the original data.
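
The two-stage flow described above can be sketched in software. The Python model below is only an illustration under stated assumptions, not the RTL of this work: the function name `encode_block`, the representation of a block as a list of integer words, the choice to send the first bit of each line unmodified, and the packing of the per-line decisions into a single extra word are all assumptions made for the sketch.

```python
def encode_block(words, width):
    """Transition-invert a buffered block, one bus line at a time.

    words : list of ints, one per buffer entry (buffer depth = len(words))
    width : number of bus lines (hypothetical parameter name)
    Returns (encoded_words, decision_word); bit k of decision_word is 1
    when line k was transition-inverted.
    """
    depth = len(words)
    if depth == 0:
        return [], 0
    encoded = [0] * depth
    decision_word = 0
    for line in range(width):
        bits = [(w >> line) & 1 for w in words]
        # Stage 1 (while the buffer is being loaded): find and count transitions.
        states = [bits[i] ^ bits[i - 1] for i in range(1, depth)]
        invert = sum(states) > (depth - 1) / 2
        if invert:
            decision_word |= 1 << line
            states = [s ^ 1 for s in states]
        # Stage 2 (while the block is being sent): rebuild the line bit by bit.
        out = bits[0]  # assumption: the first bit of the block is sent as is
        encoded[0] |= out << line
        for i, state in enumerate(states, start=1):
            out ^= state
            encoded[i] |= out << line
    return encoded, decision_word
```

For example, a two-line block whose lines toggle on every word, `encode_block([0b01, 0b10, 0b01, 0b10], 2)`, comes out as four identical words with both decision bits set.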
5.1.1 Decision Circuit
The decision circuit is built by counting the transitions between
consecutive bits in the bitstream. A transition between two bits is detected
simply by performing an XOR (Exclusive OR) between them. The proposed circuit,
which places a single XOR gate between consecutive incoming bits of the bit
stream, is shown in Figure 5.2.
The D-FF (D-Flip Flop) holds and propagates the data on each clock cycle.
This D-FF will already be part of any block data loading system, so the
existing D-FF can be reused for the proposed purpose.
The XOR between the output and the input of the D-FF checks whether
consecutive bits differ. The gate outputs ‘1’ when the two bits are not the same
and ‘0’ when they are the same. This translates to a ‘1’ when there is a
transition and a ‘0’ when there is no transition.
This transition state is used as the enable signal of a counter, which
operates at the clock frequency of the data stream and therefore counts the
number of transitions. The counter needs to count only up to half the maximum
number of transitions, that is (N-1)/2 for a buffer of depth N, so it needs
about log2((N-1)/2) flip flops. Here again, the delay is not much of a problem
since it is hidden in the pipeline.
This circuitry can also be implemented with double edge triggered circuits
to further reduce power dissipation at the encoder stage. Also, the transition
counter works in parallel with buffer loading, so its delay is masked.
Figure 5.2 Decision Circuit (Transition Counter)
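
The behaviour of the decision circuit can be modelled in a few lines of Python. This is a behavioural sketch only: the names `count_transitions` and `should_invert` are hypothetical, the D-FF is assumed to hold the first bit of the block before counting starts, and the decision rule (invert when transitions outnumber non-transitions) is taken from the description of the technique.

```python
def count_transitions(bits):
    """Model of the transition counter: a D-FF holds the previous bit, an
    XOR compares it with the incoming bit, and the XOR output enables the
    counter."""
    if not bits:
        return 0
    previous = bits[0]      # assumption: D-FF already holds the first bit
    count = 0
    for bit in bits[1:]:
        transition = previous ^ bit   # 1 when consecutive bits differ
        count += transition           # counter advances only when enabled
        previous = bit                # D-FF captures the current bit
    return count


def should_invert(bits):
    """Invert when transitions outnumber non-transitions on this line."""
    return count_transitions(bits) > (len(bits) - 1) / 2
```

For the 8-bit sequence 0,1,0,1,1,0,1,0 the model counts 6 transitions out of a possible 7, so the line would be marked for inversion.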
5.1.2 Encoder Circuit
The encoder operates on the fly depending on the transition inversion
decision as shown in Figure 5.3. It operates alongside the usual data
transmission part. The encoder itself is implemented in a pipelined fashion,
using 2 D-FFs, so that the effect of its delay on bandwidth is reduced. This is
made possible by the sequential nature of the block data and of the proposed
technique. The pipeline introduces only a latency; throughput is maintained
because the bits are processed in parallel with the transmission.
The encoder needs to operate only in those cases where a transition
inversion is needed. So the entire circuit can be power gated and made
operational only when needed, as discussed in the prior chapters.
Figure 5.3 Encoder Circuit
The D-FF on the incoming bitstream is used to calculate the transition
state, just as in the decision circuit during the loading of the block. Once the
transition state is known, it is inverted if the decision was to invert the
transitions. This inverted transition state is then used to generate the next
bit such that the next bit stands in the inverted transition state relative to
the current bit.
To do this, the other D-FF takes as its input the current bit that is put
on the bus. If the original transition state was ‘1’, the inverted transition
state is ‘0’, which means the next bit on the bus must be the same as the
current bit on the bus. If the original transition state was ‘0’, the inverted
state is ‘1’ and the next bit must differ from the current bit. This follows
directly from the underlying idea of transition inversion. It is easily achieved
with an XOR gate acting as a controlled inverter: the gate passes one input
through unchanged when the other input, taken as the control input, is ‘0’, and
inverts it when the control input is ‘1’. This principle is used here to
generate the next bit from the inverted transition state.
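
The encoding rule for a single bus line can be sketched as follows. This is a behavioural Python model, not the circuit of Figure 5.3; the name `encode_line` is hypothetical and the first bit of the block is assumed to be sent unchanged, since there is no earlier bit in the block to compare it with.

```python
def encode_line(bits, invert):
    """Encode one line's bit sequence, optionally inverting every
    transition state."""
    if not bits:
        return []
    encoded = [bits[0]]          # assumption: first bit goes out unchanged
    for i in range(1, len(bits)):
        state = bits[i] ^ bits[i - 1]          # transition state of raw data
        control = state ^ 1 if invert else state
        # XOR as controlled inverter: control 0 repeats the bit currently
        # on the bus, control 1 flips it.
        encoded.append(encoded[-1] ^ control)
    return encoded
```

With `invert=False` the sequence passes through unchanged. With `invert=True`, the sequence 0,1,0,1,1,0,1,0 (6 transitions) becomes 0,0,0,0,1,1,1,1 (1 transition).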
5.1.3 Decoder Circuit
The decoder is essentially the same as the encoder circuit. The decoder
performs XOR between consecutive bits to determine transition state and uses it
to perform a controlled inversion on the received bit if required to recover the
data. The decoder is shown in Figure 5.4.
Figure 5.4 Decoder Circuit
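
The decoding step mirrors the encoder and can be sketched the same way. As before this is only a behavioural Python model with a hypothetical function name, and it assumes that the first bit of each line was transmitted unchanged and that the receiver knows the inversion decision for the line.

```python
def decode_line(received, inverted):
    """Recover the original bits of one line from the received bits."""
    if not received:
        return []
    decoded = [received[0]]      # assumption: first bit was sent unchanged
    for i in range(1, len(received)):
        state = received[i] ^ received[i - 1]   # transition seen on the bus
        original_state = state ^ 1 if inverted else state
        # Controlled inversion: flip the previous recovered bit only when
        # the original data had a transition at this position.
        decoded.append(decoded[-1] ^ original_state)
    return decoded
```

Feeding it the received sequence 0,0,0,0,1,1,1,1 with the inversion flag set returns 0,1,0,1,1,0,1,0.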
5.2 Complexity Analysis
The time and space complexities of the proposed circuits have been analyzed
and compared with those of the bus invert method. The main components of both
systems are the decision circuit, encoder and decoder.
5.2.1 Decision Circuit
With increase in bus width, the space complexity of bus invert’s transition
determiner increases linearly (O(N)), because it involves XOR gates on all the
bits. The transition determiner of the transition inversion technique is just
one XOR gate whatever the buffer depth, a parameter comparable to bus width,
may be. So it scales as O(1) for all buffer depths.
The time complexity of both is constant, O(1). In addition, the delay of the
transition inversion determiner does not affect the system, since it is hidden
by pipelining.
The transition counter of bus invert needs circuitry to count the parallel
transition vector up to its maximum value, which leads to a scaling of
O(log2 N) for both space and time complexities. Transition inversion needs to
count sequentially only up to half the maximum number of transitions, (N-1).
This scales better than bus invert in space and time complexity, coming to
O(log2((N-1)/2)).
Moreover, the time complexity of the transition inversion counter does not
affect the delay, owing to the pipelined operation of the proposed technique.
This is a major improvement over bus invert as well as most other techniques,
since the chief limiting factor of delay can be reduced.
5.2.2 Encoder/Decoder Circuit
With increase in bus width, the bus invert encoder/decoder circuit
complexity increases linearly (O(N)) since it needs that many XOR gates. For the
proposed technique, the circuit complexity is constant (O(1)) since it needs only
that one set of D-FFs and XOR gates to achieve the same result for any buffer
depth. Time complexity is constant (O(1)) in both bus invert and the proposed
technique.
An optimized implementation of the proposed technique has been
discussed. It makes use of pipelining to mask the effect of the extra circuitry on
delay. Its ability to reduce transitions was also discussed. The next chapter will
deal with a comparison of the technique to Bus Invert over benchmarks.
Chapter 6
Experimental Analysis
This chapter deals with comparisons of the simulation analysis of the
proposed technique and bus invert.
6.1 Benchmark Simulation
For experimental analysis, the algorithm was applied on random image
data and SPEC2000 benchmark binaries. The SPEC2000 benchmark is used for
the purpose of simulating data that is executable in nature. Memory traces of
such benchmark binaries are generally used in most other comparisons. The
image data is to show data that is not executable in nature and which will not
involve any memory access beyond the size of the image.
6.1.1 Random Image Data
Two analyses were performed. The first was a limited analysis over a small
set of configurations, done to show the individual results. A second, more
comprehensive analysis is also discussed.
For the first, preliminary analysis of the algorithm, seven images were
taken and their RGB values were run through the proposed algorithm and bus
invert. The images were a mix of smooth and detailed features, chosen so that
there are varying levels of variance within the images. This run simulates a
transfer of image data on a bus. The run was performed assuming a buffer depth
of 8 and a bus width of 8, and the individual results are tabulated in Table
6.1. These do not include the power dissipated by the encoder and decoder
circuitry. In a system, image data is transferred through the type of busses
that the proposed technique targets: mostly off-chip, block data transfer
systems.
Table 6.1 A comparison of transition reduction for Bus invert and the Proposed Technique for Images
Image #  Original no. of  Bus Invert Coding         Proposed technique
         transitions      transitions  % reduction  transitions  % reduction
1        120160           86296        28.18        72212        39.9
2        127770           94776        25.82        85454        33.11
3        74666            61746        17.3         53502        28.34
4        165678           119578       27.83        119908       27.62
5        111909           81645        27.04        70978        36.58
6        66189            49251        25.59        46769        29.34
7        159620           121466       23.9         114163       28.48
It can be seen from the table that some individual images produce much better
transition reduction than others, which reflects the varied nature of the data
taken. It is also clear from the table that transition inversion achieves a
larger reduction than bus invert for six of the seven images when image data is
taken as input. For image 4 the two techniques are nearly equal, with bus invert
marginally ahead (27.83% against 27.62%).
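
The percentage columns of Table 6.1 follow directly from the transition counts, and the short Python check below recomputes them. The transition counts are copied from the table; the helper names `reduction_pct` and `images_where_proposed_wins` are hypothetical and introduced only for this check.

```python
# (original, bus invert, proposed) transition counts copied from Table 6.1
TABLE_6_1 = [
    (120160, 86296, 72212),
    (127770, 94776, 85454),
    (74666, 61746, 53502),
    (165678, 119578, 119908),
    (111909, 81645, 70978),
    (66189, 49251, 46769),
    (159620, 121466, 114163),
]


def reduction_pct(original, coded):
    """Percentage of transitions removed by a coding scheme."""
    return 100.0 * (original - coded) / original


def images_where_proposed_wins(rows):
    """Image numbers (1-based) where the proposed technique leaves fewer
    transitions than bus invert."""
    return [i for i, (_, bus_inv, proposed) in enumerate(rows, start=1)
            if proposed < bus_inv]


for number, (orig, bus_inv, proposed) in enumerate(TABLE_6_1, start=1):
    print(number,
          round(reduction_pct(orig, bus_inv), 2),
          round(reduction_pct(orig, proposed), 2))
```

The recomputed values agree with the tabulated percentages to within about 0.01, the small differences being consistent with truncation rather than rounding in the table.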
A more comprehensive analysis was also done with a set of another ten
images that also consisted of smooth and detailed features. The analysis was
done for various configurations of bus width and buffer depth. Buffer depth was
varied over the values 8, 16, 32 and 64, and bus width over 8, 16, 32 and 64
bits. Due to the voluminous data involved, the reductions for all the images
have been averaged to show the results. The results of the analysis are shown in
Figure 6.1. Each data point in the plot is the average of the reductions over
all the images for that configuration.
[Plot: Transition Reduction Efficiency. X-axis: Bus Width (8, 16, 32, 64). Y-axis: % Reduction (0 to 35). Series: Transition Inversion at buffer depths 8, 16, 32 and 64; Bus Invert at buffer depths 8, 16, 32, 64 (single curve).]
Figure 6.1: Comparison of Transition Inversion to Bus Invert for images
The plots for bus invert for varying buffer depths coincide and hence are
shown as one single plot. The pattern seen in the plots reiterates what has been
discussed before. The results for bus invert clearly show that its efficiency
takes a hit when bus width is increased, while it is not affected by changes in
buffer depth.
For transition inversion, the converse happens. The efficiency is
independent of bus width while it reduces with increasing buffer depth. This
makes it suitable for the wider busses that are becoming the norm these days.
6.1.2 SPEC2000 Benchmark
SPEC2000 benchmark binaries [49] were run with the proposed technique
and compared with bus invert and Gray coding. These benchmark binaries are
typically used to simulate bus activity, as in the other literature discussed
before. Memory traces of the binaries are taken and run through the simulated
bus system to model the activity of the bus.
A memory trace records how the binary accesses memory locations, which
can include both its own binary data as well as ordinary data. These traces are
useful for modeling a bus system.
The 26 binary traces were run with buffer depths of 8, 16, 32 and 64 and
bus widths of 8, 16, 32 and 64 bits. All combinations of buffer depth and bus
width were covered.
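
A sweep of this kind can be sketched in Python as shown below. This is not the simulator used for the results reported here: it uses pseudo-random words as placeholder data instead of SPEC2000 traces, it ignores the transition between consecutive blocks and the cost of sending the decision word, and all function and parameter names are hypothetical.

```python
import random


def line_transitions(words, line):
    """Number of transitions on one bus line across a block of words."""
    bits = [(w >> line) & 1 for w in words]
    return sum(a ^ b for a, b in zip(bits, bits[1:]))


def block_reduction(words, width):
    """Return (raw, coded) transition counts for one buffered block."""
    depth = len(words)
    raw = coded = 0
    for line in range(width):
        t = line_transitions(words, line)
        raw += t
        # Inverting all transition states turns t transitions into
        # (depth - 1 - t), so the better of the two is kept.
        coded += min(t, depth - 1 - t)
    return raw, coded


def sweep(n_words=4096, widths=(8, 16, 32, 64), depths=(8, 16, 32, 64), seed=0):
    """Percentage reduction for every (bus width, buffer depth) pair."""
    rng = random.Random(seed)
    results = {}
    for width in widths:
        # Placeholder data: a real study would load a memory trace here.
        trace = [rng.getrandbits(width) for _ in range(n_words)]
        for depth in depths:
            raw = coded = 0
            for start in range(0, n_words - depth + 1, depth):
                r, c = block_reduction(trace[start:start + depth], width)
                raw += r
                coded += c
            results[(width, depth)] = 100.0 * (raw - coded) / raw if raw else 0.0
    return results
```

Calling `sweep()` returns a dictionary keyed by `(width, depth)`, which can be averaged or plotted in the same way as the figures in this chapter.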
The averages for a given combination of bus width and buffer depth
across all the binaries were taken and have been plotted. The results for both
the techniques are shown in Figure 6.2. Each data point represents the average
reduction over all the binaries for that given buffer depth and bus width.
[Plot: Transition Reduction Efficiency. X-axis: Bus Width (8, 16, 32, 64). Y-axis: % Reduction (0.00 to 35.00). Series: Transition Inversion at buffer depths 8, 16, 32 and 64; Bus Invert at buffer depths 8, 16, 32, 64 (single curve).]
Figure 6.2: Comparison of Transition Inversion to Bus Invert for SPEC2000
It can be observed that the buffer depth does not affect bus invert. With
increasing buffer depth, the transition reduction decreases for the proposed
technique. A similar decrease is seen for bus invert when bus width is
increased.
Most notably, the change in bus width has affected the reduction efficiency
of transition inversion slightly. Even though it should in principle remain
constant with respect to bus width, concatenating or splitting the data to
attain different bus widths can have some unexpected consequences. This can be
mitigated by carefully designing the system, taking into account the typical
data widths and the instruction set of the processor that will be used on that
bus.
With increasing bus widths in present VLSI systems, the proposed
technique will compare more favourably. This is the main differentiating factor
from existing techniques, whose efficiency degrades as the bus widens. Delay is
also reduced to a large extent, as will be discussed in the following sections.
An increase in buffer depth leads to a smaller reduction. This can be offset by
splitting the block into sub-blocks of smaller depths. Depending on the system,
a compromise between power reduction and bandwidth utilization can be found.
[Plot: X-axis: Bus Width (8, 16, 32, 64). Y-axis: Transition reduction Percentage (-50 to 20). Series: Buffer Depth 8, Buffer Depth 16, Buffer Depth 32, Buffer Depth 64.]
Figure 6.3: Transition reduction in Gray Coding
The benchmark files were also run using the Gray coding technique. Gray
coding was included in the comparison since it is another popular technique,
though it has the disadvantage of being an N-bit input to N-bit output
conversion. The results are shown in Figure 6.3. It can be seen that there is
not much reduction in transitions, with some of the data points showing an
increase in the number of transitions. Gray code does not show much reduction
because it is an N-bit to N-bit conversion: whatever data combinations are
present in the input set are exactly the ones present in the output set, with a
one-to-one mapping. So it is not possible to get much benefit out of Gray code.
6.2 Overall Delay Analysis
The proposed system and bus invert were designed in Verilog RTL
(Register Transfer Level) and analyzed with Synopsys synthesis tools. A bus
operating at 100 MHz was assumed, with its I/O voltage level at 3.3 V. The
internal circuitry was modeled on the 180 nm process technology from TSMC
(Taiwan Semiconductor Manufacturing Company). The circuitry was simulated by
feeding the SPEC2000 benchmark trace files as input for a buffer depth of 8.
The proposed technique does not involve circuitry of multiple stages, thus
leading to less delay. The delay performance of the proposed technique and bus
invert, in terms of propagation delay, is compared in Table 6.2.
Table 6.2 Comparison of encoder delay in Bus Invert and Transition Inversion
Technique  Proposed Technique  Bus Invert
Delay      1.2 ns              3.3 ns
The table shows that the delay of Transition Inversion is much less than
that of Bus Invert: at 1.2 ns against 3.3 ns, it is only about 36% of the Bus
Invert delay, a reduction of about 64%.
For calculating the delay due to the proposed technique, only the encoder is
considered. The decision circuit is not included, as it is part of the buffer
loading delay and does not contribute to encoding. The decision circuit delay
was found to be 0.2 ns since it involves only the XOR gate. The delays of the
decision circuit's flip flops are masked by the sequential loading, thus giving
a pipelined approach.
For the bus invert technique, the decision circuit delay is also taken into
account because encoding has to be complete before the next data word arrives.
Thus, before the next data word arrives, the Hamming distance must have been
counted and the data encoded.
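
For reference, one step of bus invert coding as described in [13] can be sketched as below, which shows why the decision sits on the critical path: the Hamming distance to the value already on the bus must be known before the word can be driven. The function and parameter names are hypothetical, and the sketch does not model the transition of the invert line itself.

```python
def bus_invert_step(previous_bus, word, width):
    """One bus-invert step.

    previous_bus : value currently driven on the data lines
    word         : next data word to send
    width        : number of data lines
    Returns (bus_value, invert_line).
    """
    mask = (1 << width) - 1
    # Hamming distance between what is on the bus and what comes next.
    hamming = bin((previous_bus ^ word) & mask).count("1")
    if hamming > width / 2:
        return (~word) & mask, 1   # send the complement, raise invert line
    return word & mask, 0
```

For example, with an 8-bit bus holding all zeros, sending 0b11111110 would toggle seven lines, so the complement 0b00000001 is driven instead with the invert line set.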
Overall, the encoding delay of the proposed technique is considerably
lower than that of bus invert.
The delay is also constant for increasing buffer depths, since the encoder
does not vary with either buffer depth or bus width. Bus invert suffers on this
scalability issue since its delay increases when bus width is increased, making
it less suitable for wide busses. This makes transition inversion more suitable
for the high frequency, wider busses that are the norm today.
6.3 Overall Power Analysis
The power consumed by the circuitry, measured by simulation, is shown in
Table 6.3. Assuming the parameters stated above, the baseline power, to which
the reduction below is applied, was found to be 44.89 mW.
Table 6.3 Power dissipation of encoding/decoding circuitry for transition inversion
Circuitry       Decision circuit  Encoder  Decoder
Power consumed  28.7 µW           28.9 µW  28.6 µW
The reduction in power consumption is linearly dependent on the reduction
in activity factor. A reduction of 18.2% in activity for buffer depth 8 leads
to a reduction of power by 8.17 mW. The total power consumed by the extra
circuitry is 86.2 µW, leading to a net power reduction of 8.08 mW, which
corresponds to a 17.99% reduction in power.
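
The arithmetic behind these figures can be reproduced with a few lines of Python. The numbers are those quoted in this section and in Table 6.3; the function name `net_power_saving` and its parameter names are hypothetical.

```python
def net_power_saving(baseline_mw, activity_reduction, overhead_uw):
    """Return (gross saving, overhead, net saving, net percentage),
    with powers in mW."""
    saved_mw = baseline_mw * activity_reduction   # linear in activity factor
    overhead_mw = sum(overhead_uw) / 1000.0       # convert microwatts to mW
    net_mw = saved_mw - overhead_mw
    return saved_mw, overhead_mw, net_mw, 100.0 * net_mw / baseline_mw


saved, overhead, net, pct = net_power_saving(
    baseline_mw=44.89,               # power stated in Section 6.3
    activity_reduction=0.182,        # 18.2% for buffer depth 8
    overhead_uw=[28.7, 28.9, 28.6],  # decision circuit, encoder, decoder
)
print(round(saved, 2), round(overhead * 1000, 1), round(net, 2), round(pct, 2))
```

The gross saving comes to 8.17 mW, the overhead to 86.2 µW and the net saving to 8.08 mW. Carried through without intermediate rounding, the net percentage comes out at about 18.0%, in line with the 17.99% obtained from the rounded 8.08 mW figure.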
Chapter 7
Conclusions
7.1 Derivative Work
Derivative works of the proposed technique have been proposed for
extending it to synchronous serial busses, for a high level design for
processor caches, and for clock recovery in asynchronous bus systems. This shows
that the proposed technique can be used not only for reducing power but also
for other purposes, and lays the base for future work.
7.1.1 Serial Busses
The proposed technique can also be applied to a synchronous serial bus.
A serial bus by itself has the data flowing in a sequential fashion thus enabling
the application of transition inversion. Serial busses typically move a data
element bit by bit whatever the word length might be. This gives an opportunity to
use transition inversion to reduce power.
This too involves calculating the transition count, making a decision and
encoding. A block diagram of a pipelined approach is shown in Figure 7.1, which
shows two approaches to counting the transitions. One is a parallel approach,
which suffers from delay issues. The other, a serial counting approach, adds a
buffer to load the serial data and count the transitions. This increases latency
but maintains the throughput.
Figure 7.1 Serial Coding High Level Architecture
7.1.2 Usage to Caches
The technique can be applied to processor caches since caches make use of
block data transfer. The data generally gets transferred in units of cache
lines. As the L1 cache does not use a buffer, encoding data on the fly is not
possible without a drastic reduction in performance. Also, the absence of a
buffer in the cache means that data, when modified, will invalidate the
inversion decision bits determined by the primary memory for the given block
when it was loaded onto the cache.
The above drawbacks can be removed by taking into consideration that the
processor core modifies the data only in the L1 cache and not at higher cache
levels. Thus the L2 cache can be used as a buffer to perform encoding. The
process is then modified as follows.
• The memory consists of both the encoder and decoder circuits.
• The L1 cache has only the decoder circuit.
• The L2 cache has the encoder circuit for data coming from the L1
cache.
In case of an on-chip L2 cache, only the encoded data is sent on the bus.
Raw data is never sent on the bus. In case of an off-chip L2 cache, the data from
the L1 cache to the L2 cache will be raw, and no power saving modification is
done to this data. Power saving is still achieved as the modified data is sent in
the other 3 of the 4 possible transmission paths. A high level block diagram of an
architecture is shown in Figure 7.2.
Figure 7.2 Cache Architecture for Transition Inversion
7.1.3 Usage to Clock Recovery
This technique can also be applied to asynchronous serial buses.
Generally in asynchronous bus systems, the clock signal is not sent separately.
The transmitter and receiver operate at the same clock frequency and only the
phase is synchronized by clock recovery mechanisms. This requires the data to
have a larger number of transitions. The receiver, by looking at the time of any
transition, adjusts its clock phase to match the transmission. In these busses,
a preamble containing only a bitstream of alternating highs and lows might also
be sent. These mechanisms are used in busses that move data outside a system,
typically longer range bus systems.
The technique of transition inversion can be applied to clock recovery by
simply inverting the decision condition. Clock recovery is applied to
asynchronous communication wherein the clock phase is recovered from the
data stream itself, so here the data stream needs to have more transitions.
There are existing techniques like 8b/10b which are predominantly used in such
communication systems, for example Ethernet. In this technique 8 bits of data
are encoded into 10 bits such that there are more transitions in the resultant
bitstream. This works by selecting a subset of the total combination space that
has more transitions on average. Here only 256 (2^8) vectors are used out of a
total of 1024 (2^10) vectors. The 10 bit vector selected to represent any given
8 bit vector has more transitions on average. 4b/5b and 8b/10b are examples of
this type of clock recovery mechanism. They generally operate by means of a look
up table with not much runtime logic involved. 8b/10b operates by splitting the
8 bits into groups of 5 bits and 3 bits and encoding them into 6 bits and 4 bits
respectively. The 5b/6b and 3b/4b conversions are done with look up tables,
which occupy a lot of space. Though this reduces delay, it incurs a large space
overhead for all those entries.
The technique of transition inversion can be easily applied to increase the
number of transitions, just by inverting the transition states when the number
of transitions is less than half the bit stream length. This is the inverse of
the technique used for low power.
This needs only one bit of overhead compared to the two bits of 8b/10b. The
transition inversion technique also needs only a simple circuit, thus taking up
less space than 8b/10b.
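
The inverted decision condition can be sketched in Python as follows. This is an illustrative model only: the function name is hypothetical, the first bit is assumed to be sent unchanged, and the threshold is taken as half the number of bit-to-bit positions, (n-1)/2 for an n-bit stream, as the mirror image of the low power rule.

```python
def encode_for_clock_recovery(bits):
    """Invert the transition states when the stream has too FEW transitions.

    Returns (encoded_bits, inverted_flag); the flag is the one bit of
    overhead the receiver needs in order to decode.
    """
    if not bits:
        return [], False
    states = [a ^ b for a, b in zip(bits, bits[1:])]
    # Assumed threshold: fewer transitions than non-transitions.
    invert = sum(states) < (len(bits) - 1) / 2
    if invert:
        states = [s ^ 1 for s in states]
    encoded = [bits[0]]          # assumption: first bit goes out unchanged
    for state in states:
        encoded.append(encoded[-1] ^ state)
    return encoded, invert
```

A run of eight zeros, which has no transitions at all, is encoded as the alternating pattern 0,1,0,1,0,1,0,1 with the flag set, while a stream that already alternates is left untouched.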
7.2 Contributions
The major contributions in this thesis are as follows:
1. A novel technique is proposed in this work for power reduction in
busses that makes use of the buffered nature of modern busses.
2. An optimized circuit design has been developed for the same using a
pipelined approach. Complexity analysis has been done on the circuit design.
3. Analysis of the algorithm and the circuit has been done to test its
efficacy.
4. Comparison with other techniques has been done.
5. Applications have been proposed as usage scenarios.
7.3 Comparison
Transition Inversion can be compared to bus invert in various ways, as
shown in this work. Transition inversion’s efficiency reduces with increase in
buffer depth but remains independent of bus width. Bus invert’s efficiency
reduces with increase in bus width but remains independent of buffer depth. The
increasing bus widths of modern busses make bus invert a less practical
choice.
One limitation of the proposed technique is that it is applicable only to
buffered busses while bus invert is applicable to almost every type of bus. But
this does not place much limitation on the practical utility since most busses tend
to be buffered in nature.
7.4 Conclusions
In this work an encoding technique has been presented that reduces the
power dissipated on off-chip data buses for block data transfer. The technique
involves inverting the transition states on every line of the bus if the
transitions exceed the number of non-transitions. The inversion reduces the
number of transition states that signal a transition.
The modification status is signaled as an extra word, thus avoiding the use
of an extra line. An optimized circuit was designed which makes use of
pipelining, thus reducing the effect of the extra circuitry on delay. This
pipelining is made possible by the sequential nature of block data transfer as
well as of the proposed technique.
The important parameter of delay, which limits the bandwidth, is reduced by
about 64% relative to bus invert (1.2 ns against 3.3 ns), making transition
inversion more suitable for practical applications. Also, the encoder circuit
stays constant in size, removing the issues one faces when scaling up.
The average reduction obtained in terms of transitions is 18.2% for a
buffer depth of 8, while the net power reduction after the power of the extra
circuitry is taken into account is 17.99%. This is achieved without using an
extra bus line, thus saving on design space. The compromise is in bandwidth
utilization, which can be adjusted by choosing a proper block length.
Bibliography
[1] Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, Accepted for publication in special issue of Journal of Low Power Electronics, October 2010.
[2] Abinesh R., Bharghava R., Suresh Purini, Govindarajulu Regeti, “Transition Inversion based Low Power data coding scheme for Buffered Data Transfer”, 23rd International Conference on VLSI Design, January 2010.
[3] Joint Winner of Intel Research Challenge (also known as Intel Scholar Program) 2008-2009 http://www.intel.com/cd/corporate/education/APAC/ENG/in/news/news43/419015.htm
[4] Abinesh R., Bharghava R., M.B. Srinivas, “Transition Inversion Based Low Power Data Coding Scheme for Synchronous Serial Communication”, isvlsi, pp.103-108, 2009 IEEE Computer Society Annual Symposium on VLSI, 2009.
[5] The New York Times Technology section April 1, 2008. Light and Cheap, Netbooks Are Poised to Reshape PC Industry.
[6] http://lesswatts.org/ Intel sponsored community project for software based low power development.
[7] http://www.intel.com/consumer/products/style/netbook.htm Netbook vs. Laptop and Entry Level Desktops.
[8] Netbook design considerations by Texas Instruments. http://focus.ti.com/docs/solution/folders/print/581.html
[9] Nano-cmos scaling problems and implications. Nano-CMOS Circuit and Physical Design, Ban P. Wong, Anurag Mittal, Yu Cao, and Greg Starr, John Wiley & Sons Inc.
[10] J. M. Rabaey. Digital Integrated Circuits. Prentice Hall, 1996.
[11] http://en.wikipedia.org/wiki/Advanced_Configuration_and_Power_Interface Open industry standard for power management
[12] Dhiman, G. and Rosing, T. S. 2007. Dynamic voltage frequency scaling for multi-tasking systems using online learning. In Proceedings of the 2007 international Symposium on Low Power Electronics and Design (Portland, OR, USA, August 27 - 29, 2007). ISLPED '07. ACM, New York, NY, 207-212.
[13] M. R. Stan, W. P. Burleson. Bus-Invert Coding for Low Power I/O, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, Vol. 3, No. 1, pp. 49-58, March 1995.
[14] L. Benini, G. De Micheli, E. Macii, D. Sciuto, C. Silvano. Asymptotic Zero-Transition Activity Encoding for Address Buses in Low-Power Microprocessor-Based Systems, IEEE 7th Great Lakes Symposium on VLSI, Urbana, IL, pp. 77-82, Mar. 1997.
[15] E. Musoll, T. Lang, and J. Cortadella. Working-Zone Encoding for reducing the energy in microprocessor address buses. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 6, no. 4, Dec 1998.
[16] W. Fornaciari, M. Polentarutti, D.Sciuto, and C. Silvano, “Power Optimization of System-Level Address Buses Based on Software Profiling,” CODES, pp. 29-33, 2000.
[17] P. Panda, N. Dutt, “Reducing Address Bus Transitions for Low Power Memory Mapping”, European Design and Test Conference, pp. 63-67, March 1996.
[18] E. Musoll, T. Lang, and J. Cortadella, “Exploiting the locality of memory references to reduce the address bus energy”, Proceedings of International Symposium on Low Power Electronics and Design, pp. 202-207, Monterey CA, August 1997.
[19] http://www.green500.org/lists.php Green 500 list
[20] Sushant Sharma, Chung-Hsing Hsu, and Wu-chun Feng, “Making a case for a green 500 list”, 2nd IEEE IPDPS Workshop on High-Performance, Power-Aware Computing, April 2006.
[21] Martin, T. L., Siewiorek, D. P., Smailagic, A., Bosworth, M., Ettus, M., and Warren, J. 2003. A case study of a system-level approach to power-aware computing. ACM Trans. Embed. Comput. Syst. 2, 3 (Aug. 2003), 255-276.
[22] Mircea R. Stan, Kevin Skadron, "Guest Editors' Introduction: Power-Aware Computing," IEEE Computer, vol. 36, no. 12, pp. 35-38, Dec. 2003, doi:10.1109/MC.2003.1250876
[23] Khargharia, B., Hariri, S., and Yousif, M. S. 2008. Autonomic power and performance management for computing systems. Cluster Computing Vol.11, No.2 (Jun. 2008), 167-181.
[24] http://www.nvidia.com/object/optimus_technology.html Nvidia Optimus Technology Homepage.
[25] http://hothardware.com/Articles/NVIDIA-Optimus-Mobile-Technology-Preview/ Preview and explanation of Nvidia Optimus.
[26] http://news.lenovo.com/article_display.cfm?article_id=1301 One PC, Two Devices: Lenovo Reveals the Industry’s First Hybrid Notebook.
[27] http://ces.cnet.com/8301-31045_1-10427615-269.html Hands on with Lenovo's CES showstoppers: U1 Hybrid, Skylight, and S10-3t up close.
[28] http://www.engadget.com/2010/01/05/lenovo-ideapad-u1-hybrid-hands-on-and-impressions/ Lenovo IdeaPad U1 Hybrid hands-on and impressions.
[29] Tomas Akenine-Moller, Jacob Strom, “Graphics Processing Units for Handhelds”, Proceedings of the IEEE Vol. 96, No. 5, pp. 779-789 May 2008. Invited Paper.
[30] Jun Yang, Rajiv Gupta, Chuanjun Zhang. Frequent value encoding for low power data buses. ACM Trans. Design Autom. Electr. Syst. 9(3): 354-384 (2004).
[31] C. Su, C. Tsui, and A. Despain. Saving power in the control path of embedded processors, IEEE Design and Test of computers, 11(4):24–30, 1994.
[32] Wei-Chung Cheng, Massoud Pedram. Memory Bus Encoding for Low Power: A Tutorial. ISQED 2001: 199-204.
[33] Giuseppe Ascia, Vincenzo Catania, Maurizio Palesi, "A Genetic Bus Encoding Technique for Power Optimization of Embedded Systems", J.J. Chico and E. Macii (Eds.) PATMOS 2003, LNCS 2799, pp. 21–30, 2003. Springer-Verlag Berlin Heidelberg 2003.
[34] C. Su, C. Tsui, and A. Despain, “Saving power in the control path of embedded processors”, IEEE Design and Test of computers, 11(4):24–30, 1994.
[35] L. Benini, G. D. Micheli, E. Macii, D. Sciuto, and C. Silvano, “Address bus encoding techniques for system-level power optimization”, In IEEE Design Automation and Test Conference in Europe, pages 861–866, Paris, France, Feb. 1998.
[36] L. Benini, G. D. Micheli, E. Macii, M. Poncino, and S. Quer. Power optimization of core-based systems by address bus encoding, IEEE Transactions on Very Large Scale Integration, 6(4), Dec. 1998.
[37] Saneei M, Afzali-Kusha A, Navabi Z. Serial Bus Encoding For Low Power Applications, International Symposium on System-On-Chip, pp. 1-4, November 2006.
[38] Kangmin Lee, Se-Joong Lee, Hoi-Jun Yoo. SILENT: serialized low energy transmission coding for on-chip interconnection networks. ICCAD 2004: 448-451.
[39] Stephen H. Hall, Garrett W Hall, James A McCall. High Speed Digital System Design- A Handbook of Interconnect Theory and Design Practices. Wiley Interscience. pp 57-61.
[40] Edited by Christian Piguet, Low-Power Processors and Systems on Chips, CRC Press.
[41] Sun-Mo(Steve) Kang, Elements of Low Power Design for Integrated Systems. ISLPED 2003.
[42] Srinivas Devadas, Sharad Malik, A Survey of Optimization Techniques Targeting Low Power VLSI Circuits. 32nd ACM/IEEE DAC 1995.
[43] Keith Buchanan. The evolution of interconnect technology for silicon integrated circuitry. GaAs MANTECH 2002.
[44] S. M. Sze, Kwok Kwok Ng. Physics of semiconductor devices. Wiley Interscience. pp 149-150.
[45] Microsoft Architecture Journal Vol. 18 Theme: Green Computing http://msdn.microsoft.com/en-us/architecture/bb410935.aspx
[46] Samuelson & Nordhaus, Microeconomics, 17th ed. page 110. McGraw Hill 2001.
[47] http://www.pcauthority.com.au/Feature/173700,pc-building-intels-turbo-boost-vs-amds-turbo-core.aspx PC Building: Intel's Turbo Boost vs AMD's Turbo Core.
[48] http://flac.sourceforge.net/comparison.html FLAC comparison report to other lossless codecs.
[49] http://www.spec.org/cpu/ Standard Performance Evaluation Corporation.
[50] http://news.cnet.com/8301-11128_3-20004378-54.html Can green tech operate under Moore's Law?