Low Power Techniques in Graphics Processing Units


8/2/2019 Low Power Techniques in Graphics Processing Units

    Low Power Techniques in Graphics Processing Units

    Deepak Verma

    [email protected]

    Department of Computer Engineering

    Syracuse University

    Syracuse, NY 13210

    Abstract

    Power can be minimized at the system, architecture, algorithm, micro-architecture, gate, or circuit level. Here we discuss the levels at which power minimization has been carried out historically, as well as the research currently under way to reduce power at each of these levels.

    Introduction

    Most of the power consumption in a computer occurs in the Graphics Processing Unit, so here we study a few of the areas where research is being done in this field.

    In section 1, the advancements made in the past to reduce chip size and cost are explained, whereas the other sections cover present research. Improvements at the algorithmic level are discussed in section 2, along with new technologies such as the low power video processor, high quality motion estimation, and real-time DVB-S2 Low-Density Parity-Check (LDPC) decoding on GPUs. Development at the architecture level is explained with the example of low power interconnects for SIMD computers in section 3. The last level of minimization, i.e. the hardware level, with the techniques of hardware-efficient belief propagation and an area optimized low power 1-bit full-adder, is covered in section 4.

    History

    1. Implementation of Low Power One-Chip MUSE Video Processor

    In the past, the emphasis was on logic minimization to reduce the power of the chip and thereby further decrease its cost. The low power design allowed the chip to be mounted in inexpensive plastic packages; chips consuming more than 1.5 W had to be mounted in expensive ceramic packages. Here we deal with circuit reduction: in previous chip sets, the 160 word * 234 bit RAM consisted of two parts, since each area of RAM existed on a separate chip. The one-chip implementation results in a reduced size of 480 word * 65 bit. This memory consists of the same 6-transistor memory cell as the previous single-port SRAM devices, but the individual cells are accessed through shift registers.


    By changing from address decoders with an address buffer and address-transition detector to a shift register, the memory circuit size was reduced. We estimated the size of the memory and its peripherals. Since the access speed of the single-port RAM is not high enough for access at 16 MHz, a 1-to-3 serial-to-parallel converter and a 3-to-1 parallel-to-serial converter are used, requiring three 160 word * 65 bit RAM blocks. The single-port RAM is larger than the sequential access memory.

    Memory speed was increased from approximately 5 MHz to 16 MHz. Data output from the previous single-port SRAM blocks had to propagate through (1) the address counter, (2) the address buffer, (3) the pre-decoder, (4) the main decoder, (5) the memory cell, and (6) the amplifier. In the dedicated memory block, data only propagate through (1) the shift register, (2) the memory cell, and (3) the amplifier. Thus the access speed increases.

    Power is reduced by lowering the operating voltage from 3.7 V to 3.3 V and by reducing the chip area through circuit reduction.

    Power: P = C * V^2 * f

    where C = capacitance (proportional to chip area), V = operating voltage, and f = frequency.
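    As a sketch of how the formula above translates into the reported savings, the snippet below compares relative dynamic power at the two supply voltages mentioned (3.7 V and 3.3 V); the capacitance and frequency arguments are placeholders, since they cancel in the ratio.

    ```python
    def dynamic_power(c_farads, v_volts, f_hz):
        """Dynamic CMOS power: P = C * V^2 * f."""
        return c_farads * v_volts**2 * f_hz

    # Relative saving from lowering the supply from 3.7 V to 3.3 V
    # (capacitance and frequency cancel in the ratio).
    ratio = dynamic_power(1.0, 3.3, 1.0) / dynamic_power(1.0, 3.7, 1.0)
    print(f"power at 3.3 V is {ratio:.0%} of power at 3.7 V")  # ~80%
    ```

    Because power depends on the square of the voltage, even this modest 0.4 V reduction cuts dynamic power by roughly a fifth before any circuit-area savings are counted.
    
    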


    Present

    In this section we are going to discuss the various levels at which research is being done to lower power.

    2. Algorithmic level

    2.1. Low Power Video Processor

    Multiple power saving methods were applied to a video processor for color digital video and still cameras. Architectural level methods failed to save significant power; however, changing the algorithm to work on pixel differences yielded 3-15% power reduction in typical cases. The designer is often constrained by system level specifications that cannot be changed, thus prohibiting low power redesign at that level. What remains is for the designer to use best judgment at the architectural levels (RTL and behavioral) and at the algorithmic level.

    Power Saving Methods That Were Rejected

    Two power reduction methods were investigated and rejected:

    Asynchronous Design: Three factors are typically expected to reduce power: the clock network is eliminated, each module receives inputs only when it needs to compute, and dynamic voltage scaling may be employed. This method was shown to save up to 80% of total power during periods of low activity, when the processor may be slowed down. The bundled data methodology with delay lines and full handshake interconnect was employed, but it was found that the extra power required by the delay lines and the handshake circuits far exceeded the power saved by eliminating the clock. This was due to the very low frequency of the clock (13.5 MHz, the video input/output rate).

    Bus Switching Reduction: This is possible by selecting between sending a value or its complement. Hamming distance logic on the sender side determines which of the value or its complement incurs less switching, compared to the previous value that is dynamically stored on the bus. Analysis shows that, for the average conditions in this video processor, the bus load must exceed 1 pF before this method shows any benefit. Thus, it is inapplicable to this small processor.
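    The bus-invert scheme described above can be sketched in a few lines. This is an illustrative model, not the processor's circuit: an 8-bit bus is assumed, and the sender picks whichever of the value or its complement flips fewer wires relative to what the bus currently holds.

    ```python
    BUS_WIDTH = 8

    def hamming(a, b):
        """Number of bus wires that would toggle between words a and b."""
        return bin(a ^ b).count("1")

    def bus_invert(prev_bus, value):
        """Return (word_driven_on_bus, invert_flag) minimizing transitions."""
        comp = value ^ 0xFF  # bitwise complement on the 8-bit bus
        if hamming(prev_bus, comp) < hamming(prev_bus, value):
            return comp, 1   # flag tells the receiver to re-invert
        return value, 0

    prev = 0b11111111
    word, inv = bus_invert(prev, 0b00000001)  # raw value would flip 7 wires
    print(bin(word), inv)  # drives the complement, flipping only 1 wire
    ```

    The extra invert-flag wire and the Hamming logic are exactly the overhead that, per the analysis above, only pays off once the bus load exceeds about 1 pF.
    
    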

    The Winner: Algorithmic Transformation

    We have taken advantage of the facts that video pixels are often spatially correlated and that most of the processing algorithm is linear. Thus, we resorted to computing the difference of every two successive pixels and converting the linear section of the algorithm to work on those differences. The differences are mostly zero or 1-2 bit numbers, and the logic exploits this.
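    A minimal sketch of this difference transform follows; the pixel row and the gain of 2 are made-up examples, not values from the processor. It shows why correlated pixels give mostly tiny differences, and how linearity lets the computation run on the differences and be re-accumulated afterwards.

    ```python
    pixels = [118, 118, 119, 117, 118, 200, 201]  # made-up row; 200 is an "edge"

    # First element kept as-is, then successive differences.
    diffs = [pixels[0]] + [b - a for a, b in zip(pixels, pixels[1:])]
    print(diffs)  # [118, 0, 1, -2, 1, 82, 1] -- small except at the edge

    # A linear operation (here a gain of 2) can be applied to the differences
    # and re-accumulated to recover the result on the original pixels:
    scaled = []
    acc = 0
    for d in diffs:
        acc += 2 * d          # linearity: scale the difference, not the pixel
        scaled.append(acc)
    assert scaled == [2 * p for p in pixels]
    ```

    The large difference at the simulated edge (82) is where, as described below, the original circuitry has to take over.
    
    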


    This observation is obviously false near edges in the image. Due to rounding errors, the difference algorithm performs poorly after any sharp image gradient. We have retained the original circuitry and employ it each time an edge is encountered. Once the gradient has subsided and relatively stationary pixel levels have been reestablished, the difference algorithm is turned back on and the original algorithm is shut off.

    The new combined original/difference algorithm has been designed to create output that deviates by no more than a single digital value from the original (a single LSB error), and simulations have verified this on all our test images.

    Five images were simulated on the new circuit, yielding different proportions of pixel differences (see table below). Images A-C exhibit typical edge content, result in fewer than 50% of pixels that could be processed as differences, and give up to 15% power saving. Image D has few edges, and the flat image E contains but one value. Both exhibit a very high ratio of pixel differences. However, they require less power in the baseline processor, so there is no saving in either case; actually, there is some loss, due to the additional overhead of the more complex architecture. Ignoring these extreme cases, the new architecture is useful for power saving on more typical images.

    Image   Baseline current (mA)   Pixel differences (%)   Reduced current (mA)   Power saving (%)
    A       7.4                     5                       6.3                    15
    B       5.9                     9                       5.4                    8
    C       5.7                     37                      5.3                    7
    D       4.2                     66                      4.3                    -3
    E       0.8                     98                      0.9                    -12

    2.2. High Quality Motion Estimation

    Motion estimation is the most computationally expensive part of the video encoding process. The chip implements a new high performance motion estimation algorithm based on a modified genetic search strategy, which can be enhanced by tracking motion vectors through successive frames.

    For digital TV formats like the main level/main profile of MPEG-2, the requirements for both processing power and IO bandwidth are extremely high when excellent encoding quality is required, as is the case in studio environments. Therefore, one main focus of the VLSI activities was to provide a solution for a front-end high quality motion estimator which satisfies these constraints and implements some coding optimizations; this VLSI aims to replace complex, FPGA-based prototype hardware.

    The basic search approach is based on a genetic algorithm. It consists of six basic steps: initialization, i.e. finding a starting set of chromosomes, each corresponding to one motion vector; evaluation of the fitness of the chromosomes, i.e. calculation of the MAE criterion of the corresponding motion vectors; selection of the fittest chromosomes, i.e. of the set of motion vectors with the lowest MAE; crossover of the chromosomes, i.e. producing new motion vectors (children) from the selected set of motion vectors (parents); mutation of the children, i.e. randomly changing the new vectors according to a defined probability; and iteration of the creation of new populations (i.e. repeating steps 2 to 5) until a defined convergence has been reached. This scheme has been used as a basis to develop the new high performance, low complexity motion estimation algorithm of the IMAGE chip.
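    The six steps above can be sketched as a toy loop. This is not the IMAGE chip's implementation: the fitness function below is a stand-in (distance to a hypothetical true vector rather than a real sum of absolute pixel differences), and the population size, search range, and mutation step are illustrative choices.

    ```python
    import random

    random.seed(0)
    TRUE_MV = (5, -3)  # hypothetical best vector; stands in for block matching

    def mae(mv):
        # Placeholder fitness: distance to the true vector instead of a
        # real mean-absolute-error block match.
        return abs(mv[0] - TRUE_MV[0]) + abs(mv[1] - TRUE_MV[1])

    # Initialization: a random starting set of chromosomes (motion vectors).
    population = [(random.randint(-16, 16), random.randint(-16, 16))
                  for _ in range(9)]

    for _ in range(20):                       # iterate generations
        population.sort(key=mae)              # evaluation
        fittest = population[:4]              # selection
        children = []
        while len(children) < 5:
            p1, p2 = random.sample(fittest, 2)
            child = ((p1[0] + p2[0]) // 2, (p1[1] + p2[1]) // 2)  # crossover: parent mean
            child = (child[0] + random.randint(-1, 1),            # mutation: small random vector
                     child[1] + random.randint(-1, 1))
            children.append(child)
        population = fittest + children

    best = min(population, key=mae)
    print(best, mae(best))
    ```

    Keeping the fittest set across generations, as described next, guarantees the final vector is the best of all matchings performed, not merely the best of the last population.
    
    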

    Each chromosome is represented directly by the motion vector with its two components concatenated. The spatial correlation of the motion vectors in the frame is exploited: instead of the search window center, the best motion vector of the previous macro-block in the same slice is used as the initialization of the search. A set of the 9 fittest chromosomes (best motion vectors so far) is always kept, to ensure that the final motion vector is the one with the lowest MAE of all performed matchings and not only the best result of the last population. The mean of the two parent motion vectors is used as the crossover operation, and mutation is performed by adding small random vectors to the generated children.

    Furthermore, in many cases the motion vector fields of consecutive frames are highly correlated. This characteristic can be exploited to significantly improve the results of the algorithm by applying a vector tracing technique: at the initialization phase, the best vectors of the nine surrounding macro-blocks in the reference frame are added (with appropriate scaling for the B frames). These adaptations result in the Vector-traced Modified Genetic Search Algorithm (VT-MGS).

    For very complex sequences (like basketball), the prediction quality for scenes with very fast motion can be further enhanced by applying a 2-phase vector tracing scheme. In the first phase, the sequence is treated like a low delay coding sequence, and non-MPEG-2-conformant predictions in display order are performed with the only goal of calculating very exact tracing vectors by adding up partial results of this estimation (e.g. to form the tracing vector P->1, the non-MPEG-2 motion vectors B1->1, B2->B1, and P->B2 are added up and stored). In the second phase, MPEG-2-conformant motion estimation is done using the pre-calculated initialization vectors. This 2-phase motion estimation, referred to as VT-MGS2, of course implies a significant increase in processing power (by 50%) due to the additional first phase.

    2.3. Real-time DVB-S2 Low-Density Parity-Check (LDPC) Decoding for GPUs

    Until recently, the computational power required to decode large code words in real time was not available. Although LDPC decoding solutions have been proposed for multi-core platforms, they mainly address short and regular codes. In this paper, LDPC decoders based on Graphics Processing Units (GPUs) are proposed for the first time for the computationally demanding case of the irregular LDPC codes adopted in Digital Video Broadcasting - Satellite 2 (DVB-S2).

    The irregular nature of these LDPC codes can impose memory access constraints, and this, associated with the large code size, creates challenges which are difficult to overcome. Also, the scheduling mechanism imposes important restrictions on attempts to parallelize the algorithm. Thread-level and data-level parallelism can be conveniently exploited, together with the use of fast local memories, to harness the computational efficiency of these GPU-based signal processing algorithms.

    The algorithms developed support multi-codeword decoding and are scalable to future GPU generations, which are expected to have a higher number of cores. It is shown that real-time DVB-S2 LDPC decoding with throughputs above 90 Mbps is possible on ubiquitous GPU computing platforms.


    The LDPC codes adopted in DVB-S2 have a periodic nature, which allows the exploitation of suitable representations of the data structures to attenuate their computational requirements. The properties of DVB-S2 codes are exploited for the GPU parallel architectures in this paper. The parity-check matrix H has the form shown below:

    H(N-K) x N = [ A(N-K) x K | B(N-K) x (N-K) ] =

        [ a(0,0)     . . .  a(0,K-1)      | 1 0 0 . . . 0 0 0 ]
        [ a(1,0)     . . .  a(1,K-1)      | 1 1 0 . . . 0 0 0 ]
        [ a(2,0)     . . .  a(2,K-1)      | 0 1 1 . . . 0 0 0 ]
        [   .        . . .     .          | . . . . . . . . . ]
        [ a(N-K-2,0) . . .  a(N-K-2,K-1)  | 0 0 0 . . . 1 1 0 ]
        [ a(N-K-1,0) . . .  a(N-K-1,K-1)  | 0 0 0 . . . 0 1 1 ]

    where A is sparse and B is a staircase lower triangular matrix. The periodicity constraints imposed on the pseudo-random generation of A allow a significant reduction in the storage requirements without any loss of code performance.

    The Min-Sum algorithm was adopted in this work to perform the decoding of these computationally intensive long LDPC codes.
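    As a sketch of the Min-Sum algorithm's core step (a generic textbook form, not the paper's GPU kernel), the check-node update below sends each connected bit node a message whose sign is the product of the signs, and whose magnitude is the minimum magnitude, over the other incoming messages.

    ```python
    def check_node_update(incoming):
        """Min-Sum check-node update.

        incoming: log-likelihood-ratio messages from the bit nodes connected
        to one check node. Returns one outgoing message per bit node,
        computed from the *other* incoming messages.
        """
        out = []
        for i in range(len(incoming)):
            others = incoming[:i] + incoming[i + 1:]
            sign = 1
            for m in others:
                sign *= 1 if m >= 0 else -1
            out.append(sign * min(abs(m) for m in others))
        return out

    print(check_node_update([2.5, -1.0, 4.0]))  # [-1.0, 2.5, -1.0]
    ```

    Replacing the sum-product kernel with this min/sign computation is what makes Min-Sum cheap enough for the long DVB-S2 codes.
    
    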

    PROPOSED PARALLEL ALGORITHM FOR MANY-CORE GPU

    A computing system with a GPU consists of a host, typically a Central Processing Unit (CPU), that is used for programming and controlling the operation of the GPU. The GPU is a massively parallel processing engine that can speed up processing by simultaneously performing the same operation on distinct data distributed across many arithmetic processing units.

    GPUs are programmable, and one of the most widely used programming models is the NVIDIA Compute Unified Device Architecture (CUDA). The execution of a kernel on a GPU is distributed across a grid of thread blocks with adjustable size.

    [Figure: Parallel multithreaded LDPC decoder processing - (a) kernels 1 and 2 run on the GPU using one thread per node of the Tanner graph; (b) BNd, BNf, BNk are BNs connected to CN0, and threads are grouped and processed in blocks; (c) the GPU many-core architecture.]


    The algorithm developed attempts to exploit two major capabilities of GPUs: the massive use of thread and data parallelism, and the minimization of memory accesses, which often degrade performance in multi-core systems.

    Multithread-based processing: In order to extract full thread-level parallelism from the GPU, the proposed DVB-S2 LDPC decoder exploits a thread-per-node approach (thread-per-row and thread-per-column based processing). The figure illustrates this strategy with 16 threads per block (represented by tc0..tc15) being processed in parallel inside block 0 for the check node processing indicated in kernel 1 of Algorithm 1. A similar approach is applied to the remaining threads tc16..tcN-K of kernel 1, which are grouped and executed in other blocks of the grid. Also, in kernel 2, threads tB0..tBN-1 perform the equivalent parallel bit node processing. The efficiency of this parallelism is achieved by adopting a flooding schedule strategy that eliminates data dependencies in the exchange of messages between BNs and CNs. Additionally, to fully exploit the massive processing power of the GPU, the algorithm performs multi-codeword decoding, decoding 16 code words in parallel. Moreover, this solution uses 8 bits to represent data, which compares favorably with existing state-of-the-art VLSI DVB-S2 LDPC decoders that typically use 5 or 6 bits.

    Coalesced accesses to data structures: In a GPU, parallel accesses to the slow global memory can kill performance and should be avoided whenever possible. To optimize this type of operation, data is contiguously aligned in memory, which allows coalescence to take effect and lets several threads access corresponding data simultaneously. Nevertheless, modern GPU hardware can be more efficient at dealing with out-of-order memory accesses and related issues.

    This is an efficient parallel algorithm for the massive decoding of DVB-S2 LDPC codes on GPUs, and high throughputs can be achieved for real-time applications, with values surpassing 90 Mbps.

    3. Architecture level

    3.1. Low Power Interconnects for SIMD Computers

    A limit on the SIMD width of different architectures (for 3D graphics, high definition video, image processing, and wireless communications) is the scalability of the interconnect network between the processing elements, in terms of both area and power. The XRAM, a low power, high performance matrix-style crossbar, is used instead of a conventional crossbar. One of the most power efficient ways to utilize transistor area is to integrate multiple processing elements (PEs) within a die. This is represented in many architectures by the increased number of single instruction multiple data (SIMD) lanes in processors, and by the shift from multi-core to many-core architectures. Network-on-Chip architectures show that the crossbar itself consumes between 30% and almost 50% of the total interconnect power. Another critical problem is that existing circuit topologies in traditional interconnects do not scale well, because of the complexity of the control wires and the control signal generation


    logic, which directly affects the delay and power consumption.

    One circuit technique that helps solve the control complexity problem is to embed the interconnect control within the cross points of a matrix-style crossbar using SRAM cells. This differs from the traditional technique, where interconnections are set by an external controller. Other circuit techniques, such as using the same output bus wires to program the cross-point control, help reduce the number of control wires needed within the XRAM. Finally, borrowing low voltage swing techniques currently used in SRAM arrays improves performance and lowers the energy used in driving the wires of the XRAM. Though these techniques help solve the performance and scaling problems of traditional interconnects, one drawback is flexibility: the XRAM is only able to store a certain number of swizzle configurations at a given time. A case study shows that the XRAM achieves 1.4x performance and consumes 2.5x less power in a color-space conversion algorithm.

    XRAM fundamentals

    The input buses run horizontally while the output buses run vertically, creating an array of cross points. Each cross point contains a 6T SRAM bit cell, whose state determines whether or not input data is passed onto the output bus at that cross point. Along a column, only one bit cell is programmed to store a logic high and create a connection to an input.
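    A behavioral sketch of this cross-point array (a software model, not the circuit) makes the one-hot-per-column rule concrete: the configuration is a matrix of SRAM bits, and each output column is driven by the single input whose bit is set.

    ```python
    N_IN, N_OUT = 4, 4  # illustrative crossbar dimensions

    def route(config, inputs):
        """config[i][j] is the cross-point bit connecting input i to output j."""
        outputs = []
        for j in range(N_OUT):
            drivers = [inputs[i] for i in range(N_IN) if config[i][j]]
            # The structural invariant from the text: exactly one bit cell
            # per column is programmed high.
            assert len(drivers) == 1, "exactly one bit cell set per column"
            outputs.append(drivers[0])
        return outputs

    # A swizzle configuration that reverses the four input buses:
    config = [[0, 0, 0, 1],
              [0, 0, 1, 0],
              [0, 1, 0, 0],
              [1, 0, 0, 0]]
    print(route(config, ["in0", "in1", "in2", "in3"]))
    # ['in3', 'in2', 'in1', 'in0']
    ```

    Caching several such config matrices at the cross points is what lets the XRAM switch between frequently used permutations without reprogramming, as described below.
    
    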

    Matrix-type crossbars incur a huge area overhead because of the quadratically increasing number of control signals required to set the connectivity at the cross points. To mitigate this, XRAM uses techniques similar to those employed in SRAM arrays. In an SRAM array, the same bit line is used to read as well as write a bit cell. Until the XRAM is programmed, the output buses do not carry any useful data; hence, they can be used to configure the SRAM cells at the cross points without affecting functionality. Along a channel (output bus), each SRAM cell is connected to a unique bit line of the bus. This allows multiple SRAM cells (as many as there are bit lines in the channel) to be programmed simultaneously. XRAM re-uses the output channels for programming, improving silicon utilization to 45%. To further improve silicon utilization, multiple SRAM cells can be embedded at each cross point to cache more than one shuffle configuration. Many applications, especially in the signal processing domain, use only a small number of permutations over and over again. By caching the most frequently used patterns, XRAM reduces power and latency by eliminating the need to configure and reprogram the XRAM for those patterns.

    Compared to conventionally implemented crossbars, the area scales linearly with the product of the input and output ports while consuming almost 50% less energy.


    Compared to conventional MUX based implementations, the XRAM improves performance by 1.4x and achieves 1.5-2.5x lower power for applications such as color-space conversion.

    4. Hardware level

    4.1. Hardware-Efficient Belief Propagation

    Loopy belief propagation (BP) is an effective solution for assigning labels to the nodes of a graphical model such as the Markov random field (MRF), but it requires high memory, bandwidth, and computational costs. Furthermore, the iterative, pixel-wise, and sequential operations of BP make the computation difficult to parallelize. Two techniques are proposed to address these issues. BP algorithms generally require a great amount of memory for storing the messages, typically on the order of tens to hundreds of times larger than the input data. Besides, since each message is processed hundreds of times, the saving and loading of messages consumes considerable bandwidth. Therefore, although BP may work on high-end platforms such as desktops, it is poorly suited to low-cost, power-limited devices. Because the messages are sequentially updated and each message is constructed via a sequential procedure, it is difficult to utilize hardware parallelism to accelerate BP.

    Tile-based BP addresses these issues. It splits the Markov random field (MRF) into many tiles and only stores the messages crossing between neighboring tiles. The memory and bandwidth required by this technique are only a fraction of those of the ordinary BP algorithms, but the quality of the results, as tested on the publicly available Middlebury MRF benchmarks, is comparable to other efficient algorithms.

    Tile-Based BP

    Let us first consider the process of generating the outgoing messages of a node p: we need the four incoming messages toward p, the data costs of p, and the smoothness cost between p and q. Beyond these, we do not need the data of nodes far away from p. This property is exploited in a bipartite graph, where the nodes are split into two sets so that every edge connects two nodes of different sets. To generate messages from the first set to the second one, we only need the messages from the second set to the first one. Therefore, only half of the messages are stored. An interesting question is: what data are required to generate the messages toward p? The answer is the data costs of p's neighbors, and the messages sent from the neighbors of those neighbors, as shown in Fig. (b). Again, we do not have to access variables outside this group of nodes.


    This rule can easily be extended. To generate the messages from the shaded nodes in Fig. (b) to p's neighbors, we only need the messages from their neighbors. Therefore, if we have the messages from the boundary of a region, we can sequentially generate the messages inward. After we reach the region center, we can then sequentially generate the outward messages. The only required inputs are the data costs and smoothness costs of this region and the messages of the boundary nodes. This concept can be extended to multiple regions and iterations. For example, we can split the nodes of the MRF into two sets, as shown in the figure: one set N1 contains the nodes in a 3-by-3 tile, and the other set N2 contains all other nodes. When we perform BP in N2, without knowing the messages in N1 (dotted edges in the figure), we only need the messages coming from N1 to drive the propagation (outward arrows in the figure). All the messages inside the tile are irrelevant to the message passing outside the tile because they are never used in its evaluation. The only messages that must always reside in memory are those sent from one subset to the other. When we perform message passing in one subset, the messages in the other one can be removed without affecting the operation. Given enough computation, the approximation quality is good enough to drive the propagation to converge.
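    A single BP message update (here in generic min-sum form, with made-up costs and a Potts smoothness term, not the paper's exact formulation) illustrates why this works: the message from p to q depends only on p's data cost, the smoothness cost, and the messages into p from its other neighbors, so a tile can regenerate all of its interior messages from boundary messages alone.

    ```python
    L = 3  # number of labels (hypothetical)

    def message(data_p, other_msgs, smooth):
        """m_{p->q}(l_q) = min over l_p of
        D_p(l_p) + V(l_p, l_q) + sum of messages into p from neighbors != q."""
        return [min(data_p[lp] + smooth(lp, lq) + sum(m[lp] for m in other_msgs)
                    for lp in range(L))
                for lq in range(L)]

    def potts(lp, lq, lam=2.0):
        # Potts model smoothness: zero cost if labels agree, else a penalty.
        return 0.0 if lp == lq else lam

    data_p = [0.0, 3.0, 5.0]                          # made-up data costs at p
    other_msgs = [[0.5, 0.0, 1.0], [0.0, 1.0, 0.0]]   # from p's other neighbors
    print(message(data_p, other_msgs, potts))         # [0.5, 2.5, 2.5]
    ```

    Note that nothing outside p's immediate neighborhood appears in the update, which is the locality property the tile-based scheme exploits.
    
    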

    We can see that as the algorithm iterates (Ti = iterations), the energy keeps decreasing. This technique greatly reduces the memory, bandwidth, and computational costs of BP and enables parallel processing. With these two techniques, BP becomes more suitable for low-cost and power-limited consumer electronics. The applicability of the proposed techniques has been demonstrated for various applications in the Middlebury MRF benchmark, and a VLSI circuit and a GPU program for stereo matching based on the proposed techniques have been developed. These techniques can also be applied to other parallel platforms.

    4.2. Area Optimized Low Power 1-Bit Full-Adder

    A low power 1-bit full adder (FA) with 10 transistors is proposed. When this full adder is used in the implementation of an ALU, the power and area are reduced by more than 70% compared to the conventional design and by 30% compared to transmission gates, so the design yields an area efficient, low power ALU. The design does not compromise on speed, as the delay of the full adder, and thus the overall delay, is minimized. The leakage power of the design is also reduced by designing the full adder with fewer power-supply-to-ground connections. A conventional CMOS full adder consists of 28 transistors, but here the full adder is designed with only 10 transistors, which occupies much less area and consumes much less power.

    The table shows the carry status of the full adder. If both A and B are 1, then a carry is generated, because summing A and B makes the output SUM 0 and CARRY 1. If both A and B are 0, then summing A and B gives 0, and any previous carry is added to this SUM, making the CARRY bit 0; this in effect deletes the carry.
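    The carry behavior described above follows from the standard full-adder equations, which can be checked against the full truth table with a quick behavioral model (this models the logic only, not the 10-transistor circuit):

    ```python
    def full_adder(a, b, cin):
        """Standard 1-bit full adder: SUM = A xor B xor Cin,
        CARRY = A.B + Cin.(A xor B)."""
        s = a ^ b ^ cin
        carry = (a & b) | (cin & (a ^ b))
        return s, carry

    # Exhaustively verify all 8 input combinations against integer addition.
    for a in (0, 1):
        for b in (0, 1):
            for cin in (0, 1):
                s, c = full_adder(a, b, cin)
                assert a + b + cin == (c << 1) | s

    print(full_adder(1, 1, 0))  # A=B=1 generates a carry: (0, 1)
    print(full_adder(0, 0, 1))  # A=B=0 deletes the carry: (1, 0)
    ```
    
    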

    Static complementary CMOS adder using 28 transistors.

    Fourteen transistor (14T) full adder with transmission gates.

    10 Transistor Full Adder Design (10TFA)

    The proposed 10TFA also takes the three inputs A, B, and Cin, where Cin represents the carry input to the first stage. The outputs are SUM and CARRY. The full adder circuit uses a 0.18 µm CMOS process technology, which provides transistors with three characteristics: high speed, low voltage, and low leakage. As the main target of this design is to minimize power, the transistors are selected accordingly. The typical supply voltage for this process is 1.8 V. The 10-transistor 1-bit full adder is designed at the transistor level using the 0.18 µm CMOS process. Based on simulation, the 10-transistor 1-bit full adder consumes 6.2995 µW of power, whereas a conventional full adder consumes 16.675 µW, a power saving of 62.2%.

    The total power consumption of a CMOS circuit includes dynamic power consumption, static power consumption, and short circuit power consumption. The last two are neglected due to their low contribution. The dominant factor is the dynamic power, given by P = CL * f * Vdd^2.

    The instantaneous power P(t) drawn from the power supply is proportional to the supply current Idd(t) and the supply voltage Vdd(t):

    P(t) = Idd(t) * Vdd(t)

    The energy consumed over a time interval T is the integral of the instantaneous power:

    E = integral over T of Idd(t) * Vdd(t) dt


    S.No.   Design               Cell                                   Energy (pJ)   Power (µW)
    1       CMOS                 2 x 1 MUX                              0.9215        4.6075
    2                            4 x 1 MUX                              3.0245        15.123
    3                            Logical MUX                            8.5212        42.606
                                 Conventional full adder                3.3342        16.675
    4       Transmission gates   2 x 1 MUX with transmission gates      0.3625        1.6079
    5                            4 x 1 MUX with transmission gates      0.8525        4.2625
    6                            Logical MUX with transmission gates    7.4285        37.145
    7                            10-transistor full adder               1.2599        6.2995

    Design        ALU with CMOS gates   ALU with CMOS gates &        ALU with transmission gates &
                                        10-transistor full adder     10-transistor full adder
    Energy (pJ)   840.82                351.95                       239.52
    Power (µW)    4204.5                1759.5                       1197.5

    The leakage power is also very low, as the number of power-supply-to-ground connections is greatly reduced. The power consumption of the 16-bit ALU with the 10-transistor full adder is observed to be 1197.5 µW.

    References

    [1] Tetsuo Aoki, "Implementation of Low Power One-Chip MUSE Video Processor."

    [2] Uzi Zangi and Ran Ginosar, "A Low Power Video Processor."

    [3] Friederich Mombers, "IMAGE: A Low Cost, Low Power Video Processor."

    [4] Gabriel Falcao, Joao Andrade, Vitor Silva, and Leonel Sousa, "Real-Time DVB-S2 LDPC Decoding on Many-Core GPU Accelerators."

    [5] Mark Woh, "Low Power Interconnects for SIMD Computers."

    [6] Chia-Kai Liang, Chao-Chung Cheng, and Yen-Chieh Lai, "Hardware-Efficient Belief Propagation."


    [7] T. Esther Rani, "Area Optimized Low Power Arithmetic and Logic Unit."

    [8] http://ieeexplore.ieee.org.libezproxy2.syr.edu/search/advsearch.jsp