
  • 8/6/2019 ASYNC2010 - A7 - Basit Riaz Sheikh - An Operand-Optimized Asynchronous IEEE 754 Double-Precision Floating-Point Adder

    An Operand-Optimized Asynchronous IEEE 754

    Double-Precision Floating-Point Adder

    Basit Riaz Sheikh and Rajit Manohar

    Computer Systems Laboratory

    Cornell University


    Motivation

    Fast floating-point computations are important in science and engineering

    A 1 PetaFLOP supercomputer consumes 5-10 megawatts

    Data centers are relocating closer to hydroelectric dams

    Motivation - Baseline FPA - Operand Optimizations - Evaluation


    IEEE Double Precision Format

    $(-1)^{s} \times 1.\mathit{significand} \times 2^{\mathit{exponent} - 1023}$

    Format for most (but not all) cases

    Vast range of inputs adds to hardware complexity

    Addition/Subtraction most frequent floating-point operation

    Field:        S | Exponent | Significand

    Width (bits): 1 | 11       | 52
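
    The field layout above can be checked against a real double. The following is a minimal Python sketch: the helper names `unpack_double` and `value_of_normal` are illustrative and not from the slides, and the reconstruction formula is applied only to normalized numbers, since it does not cover zeros, denormals, infinities, or NaNs.

```python
import struct

def unpack_double(x: float) -> tuple[int, int, int]:
    """Split an IEEE 754 binary64 value into (sign, exponent, significand) fields."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    significand = bits & ((1 << 52) - 1)   # 52 stored fraction bits
    return sign, exponent, significand

def value_of_normal(sign: int, exponent: int, significand: int) -> float:
    """Apply (-1)^s * 1.significand * 2^(exponent - 1023) to a normalized number."""
    assert 0 < exponent < 0x7FF, "formula does not cover zero/denormal/Inf/NaN"
    return (-1) ** sign * (1 + significand / 2 ** 52) * 2.0 ** (exponent - 1023)

s, e, f = unpack_double(-6.5)
print(s, e, hex(f))
print(value_of_normal(s, e, f))
```

    The assertion in `value_of_normal` marks the cases where the simple formula stops applying, which is part of what makes the hardware front end complex.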


    Baseline FPA Architecture

    [Figure: baseline FPA datapath. Two inputs (S, Exponent, Significand) -> Front-End/Unpack -> Right Shift Val, NaN/Inf/Denormal, Swap -> Right Alignment Shift -> Significand Addition/Subtraction (8-bit KS adder blocks with 0/1 selection and carry logic) with Leading One Prediction and Decoding (LOP/LOD) -> LEFT PIPE (Left Normalize Shifter, LOP Correction Shift) or RIGHT PIPE (1-Bit Left/Right Shift, Rounding & Incrementer) -> Left or Right Select -> Pack -> output (S, Exponent, Significand)]

    Right alignment example: 1.0101 x 2^5 + 1.1101 x 2^2 = 1.0101 x 2^5 + 0.0011101 x 2^5 = 1.1000101 x 2^5

    Left normalization example: 1.1101 x 2^10 - 1.1100 x 2^10 = 0.0001 x 2^10 = 1.0000 x 2^6
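
    The two worked examples on this slide (right alignment before an addition, left normalization after a cancelling subtraction) can be reproduced with integer significands. This is a minimal sketch under stated assumptions: the function names are illustrative, operands are positive with the larger exponent first, and the sum is widened to keep every shifted-out bit, whereas a real datapath would keep a fixed width plus rounding bits.

```python
def to_binary(sig: int, frac_bits: int) -> str:
    """Render an integer significand with `frac_bits` fraction bits as 'i.ffff'."""
    digits = format(sig, f"0{frac_bits + 1}b")
    return digits[:-frac_bits] + "." + digits[-frac_bits:]

def align_and_add(sig_a: int, exp_a: int, sig_b: int, exp_b: int, frac_bits: int):
    """Add sig_a * 2^exp_a and sig_b * 2^exp_b, assuming exp_a >= exp_b."""
    diff = exp_a - exp_b
    # Widening the larger operand is equivalent to right-shifting the smaller one
    # by `diff` without losing any bits.
    total = (sig_a << diff) + sig_b
    return total, exp_a, frac_bits + diff

def normalize_left(sig: int, exp: int, frac_bits: int):
    """Shift left until the leading one sits in the integer position."""
    shift = (frac_bits + 1) - sig.bit_length()
    return sig << shift, exp - shift

# 1.0101 x 2^5 + 1.1101 x 2^2
total, exp, frac = align_and_add(0b10101, 5, 0b11101, 2, frac_bits=4)
print(to_binary(total, frac), "x 2^", exp)

# 1.1101 x 2^10 - 1.1100 x 2^10
sig, exp = normalize_left(0b11101 - 0b11100, 10, frac_bits=4)
print(to_binary(sig, 4), "x 2^", exp)
```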


    Outline

    Motivation

    Baseline FPA Architecture

    Baseline FPA Evaluation

        Energy breakdown

    Operand-dependent optimizations

        Interleaved Adder/Incrementer

        Conditional Left/Right Pipelining

        Two-way Right Align Shift

        Zero-input Bypass Path

        LOP/LOD minimization

        Control simplification for conditional bit inversion logic


    Evaluation of the Optimized FPA


    Evaluation of Baseline FPA

    First fully implemented asynchronous high-performance FPA

    QDI highly pipelined circuits

    Standard transistor sizing techniques

    Transistor-level simulation with conservative wire loads

    Functional correctness verified with gate-level simulation tool

    65nm bulk CMOS process

    Typical-Typical process corner

    25C operating temperature

    Gate and sub-threshold leakage included in power values

    Evaluation of Baseline FPA

    State-of-the-art synchronous FPA (65nm SOI) by Quinnell et al. consumed 177pJ/op @ 666MHz

    At 1V, baseline FPA has 3.2X higher throughput and consumes 2.6X less energy/op

    Reference: "Floating-Point Fused Multiply-Add Architectures", Quinnell, Swartzlander, Lemonds
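
    The two ratios can be turned into rough absolute figures for the baseline FPA. The sketch below is plain arithmetic on the numbers quoted on this slide. The results are only as precise as the rounded 3.2X and 2.6X ratios, they assume "2.6X less" means division by 2.6, and they are not values stated in the slides.

```python
# Figures quoted on the slide for the synchronous reference design.
SYNC_FREQ_MHZ = 666.0
SYNC_ENERGY_PJ = 177.0

# Ratios quoted for the asynchronous baseline FPA at 1V.
THROUGHPUT_RATIO = 3.2
ENERGY_RATIO = 2.6

# Implied (approximate) baseline figures; these are derived, not quoted.
baseline_freq_ghz = SYNC_FREQ_MHZ * THROUGHPUT_RATIO / 1000.0
baseline_energy_pj = SYNC_ENERGY_PJ / ENERGY_RATIO

print(f"~{baseline_freq_ghz:.2f} GHz, ~{baseline_energy_pj:.1f} pJ/op")
```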


    Baseline FPA Energy Breakdown

    [Figure: energy breakdown of the baseline FPA datapath by block (Front-End/Unpack; Right Shift Val, NaN/Inf/Denormal, Swap; Right Alignment Shift; Significand Adder/Subtractor; LOP/LOD; LEFT PIPE; RIGHT PIPE; Left or Right Select; Pack). Percentage labels in the figure: 16%, 15%, 20%, 7%, 13%, 12%, 3%, 3%]


    Profiling Input Patterns

    Most synchronous FPAs include complex circuitry to attain constant latency and constant throughput for the best, average, and worst-case input patterns alike

    But how often does the worst case happen?

    Intel's Pin toolkit was used to understand input stream patterns

    A set of 10 diverse floating-point applications from the SPEC2006 and PARSEC benchmark suites

    10 billion input operands were profiled for each application and then used for statistical analysis


    Operand Dependent Optimizations

    [Figure: FPA datapath block diagram]



    Radix-4 Adder Carry-Chain

    In the worst case, the carry propagates through all bits

    Synchronous FPAs use expensive tree adder topologies to ensure constant low latency and high throughput

    90% of the time, the maximum carry-chain length is limited to 7 radix-4 positions
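
    The carry-chain statistic can be made concrete with a small measurement routine. This is a hedged sketch, not the authors' profiling code: the function name, the 56-bit default width, and the exact definition of chain length (the generating digit plus every following digit that propagates the same carry) are assumptions made for illustration.

```python
def max_carry_chain_radix4(a: int, b: int, width: int = 56) -> int:
    """Longest run of radix-4 digit positions that a single carry travels through."""
    longest = 0
    current = 0  # positions covered so far by the carry arriving at this digit
    for i in range(0, width, 2):
        digit_sum = ((a >> i) & 0b11) + ((b >> i) & 0b11)
        if digit_sum >= 4:
            # Generate: this digit produces a carry regardless of its carry-in.
            current = 1
        elif digit_sum == 3 and current > 0:
            # Propagate: the incoming carry passes through this digit.
            current += 1
        else:
            # Kill, or a propagate digit with no incoming carry.
            current = 0
        longest = max(longest, current)
    return longest

print(max_carry_chain_radix4(0xFFFF, 0x0001, width=16))
```

    Running such a routine over a profiled operand stream would give the kind of distribution the slide summarizes; a ripple adder's delay follows this length, while a tree adder pays for the worst case on every operation.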


    Interleaved Asynchronous Adder

    Interleaved adder consumes 79% less energy/op at 1.4% higher throughput

    35% fewer transistors than baseline adder topology

    Negligible difference between best case and average case throughput!

    [Figure: interleaved adder built from Interleave Send, ripple adder blocks, and Interleave Merge]
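
    The interleaving idea can be modelled functionally: successive operand pairs are dealt out to separate ripple adders and the results are collected in the same order, so one long carry chain does not stall the operations behind it. The sketch below is a behavioral model only. The use of two adders, the 56-bit width, and all names are assumptions, and timing is not modelled.

```python
def ripple_add_radix4(a: int, b: int, width: int = 56) -> int:
    """Radix-4 ripple-carry addition, one 2-bit digit at a time (carry-out dropped)."""
    carry = 0
    total = 0
    for i in range(0, width, 2):
        digit_sum = ((a >> i) & 0b11) + ((b >> i) & 0b11) + carry
        total |= (digit_sum & 0b11) << i
        carry = digit_sum >> 2
    return total

def interleaved_add(operand_pairs, num_adders: int = 2, width: int = 56):
    """Round-robin 'send' to several ripple adders, then 'merge' in the same order."""
    queues = [[] for _ in range(num_adders)]
    # Interleave Send: token i goes to adder i mod num_adders.
    for i, (a, b) in enumerate(operand_pairs):
        queues[i % num_adders].append(ripple_add_radix4(a, b, width))
    # Interleave Merge: read the adders back in the same round-robin order.
    return [queues[i % num_adders].pop(0) for i in range(len(operand_pairs))]

print(interleaved_add([(1, 2), (3, 3), (0xFFFF, 1), (10, 20)]))
```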


    Interleaved Incrementer Logic

    Baseline FPA uses a carry-select incrementer for the 53-bit significand

    Parallel tree logic for increment carry computation

    For over 90% of operations that use the incrementer, carry length < 4 radix-4 positions

    Replace the carry-select incrementer with an interleaved incrementer

    Two radix-4 ripple incrementers


    Conditional Left/Right Pipelines

    [Figure: FPA datapath block diagram showing the LEFT PIPE (Left Normalize Shifter, LOP Correction Shift) and RIGHT PIPE (1-Bit Left/Right Shift, Rounding & Incrementer) ahead of Left or Right Select and Pack]

    18% power savings for 80%+ of input patterns

    Condition: exponent difference > 1 or addition



    Conditional Left/Right Pipelines

    [Figure: FPA datapath block diagram showing the LEFT PIPE (Left Normalize Shifter, LOP Correction Shift) and RIGHT PIPE (1-Bit Left/Right Shift, Rounding & Incrementer) ahead of Left or Right Select and Pack]

    13% power savings

    Condition: exponent difference < 2 and effective subtraction

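
    The two conditions on these slides are complementary, so each operation needs only one of the two pipes. The sketch below classifies an operation from its operands. It assumes the conventional two-path arrangement in which a close-exponent effective subtraction needs the left (normalization) pipe and every other case needs the right (rounding) pipe; that mapping, the function names, and the use of raw biased exponents (denormals are not treated specially) are assumptions, not details given in the slides.

```python
import struct

def sign_and_exponent(x: float) -> tuple[int, int]:
    """Return the sign bit and the 11-bit biased exponent of a binary64 value."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF

def select_pipe(a: float, b: float, subtract: bool = False) -> str:
    """Pick the pipe an operation needs; the other pipe can stay idle."""
    sign_a, exp_a = sign_and_exponent(a)
    sign_b, exp_b = sign_and_exponent(b)
    # Effective subtraction: operand signs differ after applying the opcode.
    effective_sub = (sign_a ^ sign_b ^ int(subtract)) == 1
    exp_diff = abs(exp_a - exp_b)
    if exp_diff < 2 and effective_sub:
        return "left"   # possible massive cancellation, needs left normalization
    return "right"      # exponent difference > 1, or effective addition

print(select_pipe(1.5, -1.25), select_pipe(1024.0, -1.0), select_pipe(1.0, 1.0))
```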


    Alignment Shift Optimization

    [Figure: FPA datapath block diagram with a 16% label]



    Alignment Shift Pattern

    Benchmark applications exhibit a common property

    A significant proportion of right-align shift values fall between 0 and 3 inclusive

    Is the logarithmic align shifter an energy-efficient choice?


    Two-Way Right-Align Shift

    Shifted significand skips the remaining two shift pipelines

    Minimal logic for shifted-out bits when the shift is between 0 and 3
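
    A behavioral sketch of the two-way shifter follows. It assumes a three-stage shifter in which the first stage handles the two low bits of the shift amount (shifts of 0 to 3) and the remaining two stages handle the higher bits. The slide confirms only that short shifts skip two stages, so the stage split, the names, and the sticky-bit handling are illustrative assumptions.

```python
def two_way_align_shift(significand: int, shift: int) -> tuple[int, int, int]:
    """Right-shift `significand`; return (result, sticky bit, shifter stages used)."""
    assert 0 <= shift < 64, "sketch covers 6-bit shift amounts only"
    stage_masks = (0b000011, 0b001100, 0b110000)  # assumed split: by 0-3, then 4s, then 16s
    sticky = 0
    stages_used = 0
    for mask in stage_masks:
        amount = shift & mask
        # Sticky bit: OR of all bits shifted out so far.
        if significand & ((1 << amount) - 1):
            sticky = 1
        significand >>= amount
        stages_used += 1
        if shift <= 3:
            break  # short path: the result bypasses the remaining two stages
    return significand, sticky, stages_used

print(two_way_align_shift(0b1101, 2))
print(two_way_align_shift(1 << 52, 20))
```

    On the short path at most three bits are shifted out, which is why the logic for the shifted-out bits can stay small.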


    Zero-Input Operands

    Few applications have a significant proportion of zero operands

    Apps involving sparse-matrix manipulations, such as Deal and Soplex

    Zero-input operands use the full FPA datapath in most synchronous FPAs


    Zero-input Bypass Path

    [Figure: FPA datapath block diagram with added WCHB control and WCHB data bypass path stages]

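
    Functionally, the bypass amounts to forwarding the non-zero operand when the other operand is zero. The sketch below models only that decision. The function name is illustrative, and sending the both-zero case through the full datapath is an assumption made to keep the sign handling of zeros out of the example.

```python
def add_with_zero_bypass(a: float, b: float) -> tuple[float, str]:
    """Return (sum, path taken), forwarding the other operand when exactly one is zero."""
    a_is_zero = (a == 0.0)  # true for +0.0 and -0.0, false for NaN
    b_is_zero = (b == 0.0)
    if a_is_zero != b_is_zero:
        # Exactly one operand is zero: x + 0 = x, so no alignment,
        # addition, normalization, or rounding is needed.
        return (b if a_is_zero else a), "bypass"
    return a + b, "full"

print(add_with_zero_bypass(0.0, 3.5))
print(add_with_zero_bypass(1.0, 2.0))
```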


    Control Slack: Design Space Exploration

    8 WCHB control pipelines to avoid throughput hit for non-zero inputs

    Some zero-input patterns still take a big throughput hit


    Control Slack: Design Space Exploration

    Addition of two WCHB stages for sign, exponent, and significand bits

    For Mix-pattern sequence, throughput increases by 7.5% to 2 GHz

    For Mix-flip sequence, throughput increases by 49.8% to 1.95 GHz


    Evaluating Optimized FPA

    Combines all data-dependent optimization techniques

    56.7% reduction in energy/op while maintaining average throughput

    52 GFLOPS/Watt at 1.3 GHz throughput

    19% reduction in leakage power
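
    The efficiency figure can be converted into energy per operation and power, since operations per second per watt is the same quantity as operations per joule. The sketch below is plain arithmetic on the two quoted numbers. It assumes one floating-point operation per addition and that the 1.3 GHz throughput means 1.3 billion additions per second; the derived values are not stated in the slides.

```python
GFLOPS_PER_WATT = 52.0   # quoted on the slide
THROUGHPUT_GHZ = 1.3     # quoted on the slide, taken as 1.3e9 additions per second

# FLOPS per watt equals operations per joule, so energy/op is its reciprocal.
energy_per_op_pj = 1e12 / (GFLOPS_PER_WATT * 1e9)

# Power at the quoted throughput: (ops per second) / (ops per joule).
power_mw = THROUGHPUT_GHZ / GFLOPS_PER_WATT * 1e3

print(f"~{energy_per_op_pj:.1f} pJ/op, ~{power_mw:.1f} mW")
```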


    Operand-Dependent Behavior

    Latency is also highly operand-dependent

    32.8% reduction in latency for zero-input operands

    3.5% latency reduction for align shifts of 0 to 3


    Summary

    Efficient floating-point computations are critical

    Synchronous FPAs are non-optimal for average-case inputs

    First transistor-level asynchronous high-performance FPA

    Detailed energy consumption breakdown of the FPA datapath

    Profiling of floating-point operands in diverse application benchmarks

    Operand-dependent optimizations reduced energy/op by 56.7% while preserving throughput


    Fine-grain Pipelining

    Quasi-delay-insensitive (QDI) circuits

    Highly pipelined (30 stages) to maximize throughput

    Pre-charge enable half-buffer (PCEHB) for data computation

    Weak-condition half-buffer (WCHB) for buffer and copy
