async2010 - a7 - basit riaz sheikh - an operand-optimized a synchronous ieee 754 double-precision...

8/6/2019 ASYNC2010 - A7 - Basit Riaz Sheikh - An Operand-Optimized A Synchronous IEEE 754 Double-Precision Floating-Point Adder


An Operand-Optimized Asynchronous IEEE 754

Double-Precision Floating-Point Adder

Basit Riaz Sheikh and Rajit Manohar

Computer Systems Laboratory

Cornell University



Motivation

Fast floating-point computations are important in science and engineering

Motivation - Baseline FPA - Operand Optimizations - Evaluation

1 PetaFLOP supercomputer consumes 5-10 megawatts

Data centers relocating closer to hydroelectric dams



IEEE Double Precision Format

$(-1)^{s} \times 1.\mathrm{significand} \times 2^{\mathrm{exponent} - 1023}$

Format for most (but not all) cases

Vast range of inputs adds to hardware complexity

Addition/Subtraction most frequent floating-point operation

Field:        S | Exponent | Significand
Width (bits): 1 | 11       | 52
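The field layout above can be illustrated with a minimal Python sketch that splits a double into its three fields and rebuilds a normal value from them. The function names `unpack_double` and `value_of_normal` are hypothetical, chosen for this illustration only; zero, denormals, infinities, and NaNs (the "not all" cases) are deliberately not handled.

```python
import struct


def unpack_double(x: float):
    """Split an IEEE 754 double into (sign, biased exponent, significand field)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]  # raw 64-bit pattern
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    significand = bits & ((1 << 52) - 1)   # 52 stored fraction bits
    return sign, exponent, significand


def value_of_normal(sign, exponent, significand):
    """Rebuild the value of a normal number from its fields (hidden leading 1)."""
    return (-1.0) ** sign * (1 + significand / 2**52) * 2.0 ** (exponent - 1023)
```

For example, `unpack_double(1.0)` yields a sign of 0, a biased exponent of 1023, and an all-zero significand field.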




Baseline FPA Architecture

[Figure: baseline FPA datapath. Two input operands (S, Exponent, Significand) pass through the stages listed below.]

Front-End/Unpack: Right Shift Val, NaN/Inf/Denormal, Swap

Right Alignment Shift

Alignment example: $1.0101 \times 2^{5} + 1.1101 \times 2^{2} = 1.0101 \times 2^{5} + 0.0011101 \times 2^{5} = 1.1000101 \times 2^{5}$

Significand Addition/Subtraction: 8-bit KS Adder blocks (0/1 select) with Carry Logic

Cancellation example: $1.1101 \times 2^{10} - 1.1100 \times 2^{10} = 0.0001 \times 2^{10} = 1.0000 \times 2^{6}$

Leading One Prediction and Decoding (LOP/LOD)

RIGHT PIPE: 1-Bit Left/Right Shift, Rounding & Incrementer

LEFT PIPE: Left Normalize Shifter, LOP Correction Shift

Left or Right Select

Pack
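The two worked examples on this slide (alignment before addition, and left normalization after cancellation) can be reproduced with a small integer model. This is a sketch under stated assumptions, not the circuit described in the talk: significands are held as Python integers with a shared number of fraction bits, the helper names `align_and_add` and `normalize_left` are hypothetical, and rounding is omitted.

```python
def align_and_add(sig_a, exp_a, sig_b, exp_b):
    """Add two significands held as integers that share one fraction width.

    Returns (sum, exponent, extra_fraction_bits). The larger-exponent operand
    is scaled up instead of truncating the smaller one, so no bits are lost.
    """
    if exp_a < exp_b:  # swap so that operand a has the larger exponent
        sig_a, exp_a, sig_b, exp_b = sig_b, exp_b, sig_a, exp_a
    shift = exp_a - exp_b
    return (sig_a << shift) + sig_b, exp_a, shift


def normalize_left(sig, exp, frac_bits):
    """Shift left until the leading one sits just above frac_bits fraction bits."""
    if sig == 0:
        return 0, exp
    shift = frac_bits - (sig.bit_length() - 1)
    if shift > 0:
        sig <<= shift
        exp -= shift
    return sig, exp
```

With four fraction bits, `align_and_add(0b10101, 5, 0b11101, 2)` gives the integer `0b11000101` with three extra fraction bits at exponent 5, matching the first example, and `normalize_left(0b11101 - 0b11100, 10, 4)` gives `0b10000` at exponent 6, matching the second.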




Outline

Motivation

Baseline FPA Architecture

Baseline FPA Evaluation

Energy breakdown

Operand-dependent optimizations

Interleaved Adder/Incrementer

Conditional Left/Right Pipelining

Two-way Right Align Shift

Zero-input Bypass Path

LOP/LOD minimization

Control simplification for conditional bit inversion logic

Evaluation of the Optimized FPA




Evaluation of Baseline FPA

First fully implemented asynchronous high-performance FPA

QDI highly pipelined circuits

Standard transistor sizing techniques

Transistor-level simulation with conservative wire loads

Functional correctness verified with gate-level simulation tool

65nm bulk CMOS process

Typical-Typical process corner

25C operating temperature

Gate and sub-threshold leakage included in power values




"Floating-Point Fused Multiply-Add Architectures", Quinnell, Swartzlander, Lemonds

Evaluation of Baseline FPA

State-of-the-art synchronous FPA (65nm SOI) by Quinnell et al. consumed 177 pJ/op @ 666 MHz

At 1 V, baseline FPA has 3.2X higher throughput and consumes 2.6X less energy/op




Baseline FPA Energy Breakdown

[Figure: baseline FPA datapath (Front-End/Unpack, Significand Adder/Subtractor, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack) annotated with energy shares of 16%, 15%, 20%, 7%, 13%, 12%, 3%, and 3%]



Profiling Input Patterns

Most synchronous FPAs include complex circuitry to attain constant latency and constant throughput for the best, average, and worst-case input patterns alike

But how often does the worst case happen?

Intel's Pin toolkit was used to understand input stream patterns

A set of 10 diverse floating-point applications from the SPEC2006 and PARSEC benchmark suites

10 billion input operands were profiled for each application and then used for statistical analysis
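As a rough illustration of the kind of statistics such profiling can collect, the sketch below tallies a few operand properties over a list of addition operand pairs. It is not the Pin-based tooling used in the talk: `profile_operands` and its counter names are hypothetical, the input is assumed to be an already-captured list of finite `(a, b)` pairs for additions, and only properties that later slides rely on are counted.

```python
import math
from collections import Counter


def profile_operands(pairs):
    """Tally simple operand statistics over (a, b) pairs of an addition stream."""
    stats = Counter()
    for a, b in pairs:
        stats["total"] += 1
        if a == 0.0 or b == 0.0:
            stats["zero_operand"] += 1
            continue
        # math.frexp returns (mantissa, exponent); the exponent difference
        # equals the right-align shift needed before the significand add.
        shift = abs(math.frexp(a)[1] - math.frexp(b)[1])
        if shift <= 3:
            stats["align_shift_0_to_3"] += 1
        if (a < 0) != (b < 0):  # opposite signs: effective subtraction
            stats["effective_subtraction"] += 1
    return stats
```

Dividing each counter by `stats["total"]` gives the fraction of operations that fall into each category.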




Operand Dependent Optimizations

[Figure: FPA datapath: Front-End/Unpack, Significand Adder/Subtractor, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack]




Radix-4 Adder Carry-Chain

In the worst case, carry propagates through all bits

Synchronous FPAs use expensive tree adder topologies to ensure constant low latency and high throughput

90% of the time, maximum carry-chain length is limited to 7 radix-4 positions
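To make "carry-chain length in radix-4 positions" concrete, the sketch below measures the longest chain for one addition. The definition used here is an assumption made for illustration: a chain starts at a digit position that generates a carry (digit sum of at least 4) and grows by one for each following position that propagates it (digit sum exactly 3). The function name and the default of 27 digits (enough for a 53-bit significand) are likewise hypothetical.

```python
def max_carry_chain_radix4(a, b, digits=27):
    """Longest carry chain, in radix-4 digit positions, when adding a and b."""
    longest = run = 0
    for i in range(digits):
        t = ((a >> (2 * i)) & 3) + ((b >> (2 * i)) & 3)
        if t >= 4:
            run = 1          # generate: a new chain starts here
        elif t == 3 and run > 0:
            run += 1         # propagate: the incoming carry ripples onward
        else:
            run = 0          # kill, or propagate position with no carry-in
        longest = max(longest, run)
    return longest
```

For instance, adding 1 to `0xFFF` generates a carry in the lowest digit and propagates it through five more digits, so the function returns 6.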




Interleaved Asynchronous Adder

Interleaved Adder consumes 79% less energy/op at 1.4% higher throughput

35% fewer transistors than baseline adder topology

Negligible difference between best case and average case throughput!

[Figure: Interleave Send stage feeding Ripple Adder blocks, whose outputs are recombined by an Interleave Merge stage]
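A functional sketch of the send/merge idea follows. It only models ordering, not timing or energy: operand pairs are dealt round-robin to ripple adders and the results are read back in the same order. The lane count of two, the function names, and the digit-serial radix-4 ripple model are assumptions made for this illustration.

```python
def ripple_add_radix4(a, b, digits=27):
    """Digit-serial radix-4 ripple-carry addition of two non-negative integers."""
    result, carry = 0, 0
    for i in range(digits):
        s = ((a >> (2 * i)) & 3) + ((b >> (2 * i)) & 3) + carry
        result |= (s & 3) << (2 * i)
        carry = s >> 2
    return result | (carry << (2 * digits))  # carry out of the top digit


def interleaved_add(pairs, lanes=2):
    """Send operand pairs round-robin to `lanes` adders, then merge in order."""
    queues = [[] for _ in range(lanes)]
    for i, (a, b) in enumerate(pairs):       # interleave send
        queues[i % lanes].append(ripple_add_radix4(a, b))
    # interleave merge: read the lanes back in the same round-robin order
    return [queues[i % lanes][i // lanes] for i in range(len(pairs))]
```

Because consecutive operations land on different lanes, a long carry chain in one addition need not stall the next one in this model, which is one way to read the slide's remark about best-case and average-case throughput.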



Interleaved Incrementer Logic

Baseline FPA uses Carry-Select Incrementer for 53-bit significand

Parallel tree logic for increment carry computation

For over 90% of operations that use the incrementer, carry length < 4 radix-4 positions

Replace carry-select incrementer with an interleaved incrementer

Two radix-4 ripple incrementers
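The short carry lengths quoted above can be seen in a small model of a radix-4 ripple incrementer: the carry only travels through trailing digits equal to 3. This is an illustrative sketch, with a hypothetical function name and a default of 27 digits assumed for a 53-bit significand.

```python
def ripple_increment_radix4(x, digits=27):
    """Increment x digit by digit; return (x + 1, carry length in radix-4 digits)."""
    carry_len = 0
    for i in range(digits):
        if (x >> (2 * i)) & 3 == 3:
            x &= ~(3 << (2 * i))   # digit wraps to 0 and the carry moves on
            carry_len += 1
        else:
            return x + (1 << (2 * i)), carry_len  # carry absorbed here
    return x + (1 << (2 * digits)), carry_len     # carry out of the top digit
```

Incrementing `0b1111` ripples through two digits and returns `(16, 2)`, while incrementing 4 touches only the lowest digit and returns `(5, 0)`.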




Conditional Left/Right Pipelines

[Figure: FPA datapath: Front-End/Unpack, Significand Adder/Subtractor, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack]

18% Power Savings

For 80%+ of input patterns: Exponent Difference > 1 or Addition





Conditional Left/Right Pipelines

[Figure: FPA datapath: Front-End/Unpack, Significand Adder/Subtractor, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack]

13% Power Savings

Exponent Difference < 2 and Effective Subtraction
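The two conditions on these slides are complementary, so a single predicate can decide which pipe an operation needs. The sketch below assumes, following the conventional dual-path adder organization, that the LEFT pipe (with the left normalize shifter) serves the close-exponent effective-subtraction case and the RIGHT pipe serves everything else; that mapping, the function name, and the argument names are assumptions for illustration.

```python
def select_pipe(exp_a, exp_b, sign_a, sign_b, is_subtract_op):
    """Pick the pipe for one operation from exponents, signs, and the opcode."""
    # Effective subtraction: signs differ on an add, or match on a subtract.
    effective_sub = (sign_a != sign_b) != bool(is_subtract_op)
    exp_diff = abs(exp_a - exp_b)
    if exp_diff < 2 and effective_sub:
        return "LEFT"    # possible massive cancellation
    return "RIGHT"       # exponent difference > 1, or effective addition
```

Only one of the two pipes is needed per operation in this model, which is consistent with the power savings the slides report for each case.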





Alignment Shift Optimization

[Figure: FPA datapath (Front-End/Unpack, Significand Adder/Subtractor, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack) annotated with 16%]



Alignment Shift Pattern

Benchmark applications exhibit a common property

A significant proportion of right align shift values range from 0 to 3 inclusive

Is the logarithmic align shifter an energy-efficient choice?




Two-Way Right-Align Shift

Shifted significand skips the remaining two shift pipelines

Minimal logic for shifted-out bits with shifts from 0 to 3
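A behavioral sketch of the two-way idea: shifts of 0 to 3 finish after the first shift stage, while larger shifts continue through two more stages. The three-stage split by shift-amount bit pairs (masks `0x03`, `0x0C`, `0x30`) is an assumption made for this illustration, as are the function name and the use of a single sticky flag for shifted-out bits.

```python
def right_align_two_way(sig, shift):
    """Right-shift sig by shift (0..63); return (shifted, sticky, stages_used)."""
    sticky = False
    stages_used = 0
    for mask in (0x03, 0x0C, 0x30):          # shift by 0-3, then 0/4/8/12, then 0/16/32/48
        amount = shift & mask
        sticky |= (sig & ((1 << amount) - 1)) != 0  # any nonzero bit shifted out
        sig >>= amount
        stages_used += 1
        if shift <= 3:
            break                             # short path: skip the remaining two stages
    return sig, sticky, stages_used
```

A shift of 3 uses one stage, whereas a shift of 18 is applied as 2 in the first stage and 16 in the third, using all three.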




Zero-Input Operands

Few applications have a significant proportion of zero operands

Apps involving sparse-matrix manipulations such as Deal and Soplex

Zero-input operands use the full FPA datapath in most synchronous FPAs




Zero-input Bypass Path

[Figure: FPA datapath (Front-End/Unpack, Significand Adder/Subtractor, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack) with a WCHB Control stage and a data bypass path built from WCHB stages]
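Functionally, the bypass amounts to forwarding the non-zero operand when the other input is zero. The sketch below shows that behavior only; the function name and the returned path label are hypothetical, and native Python addition stands in for the full datapath.

```python
def add_with_zero_bypass(a, b):
    """Return (sum, path); path is 'bypass' when an operand is zero."""
    a_zero, b_zero = a == 0.0, b == 0.0
    if a_zero and b_zero:
        return a + b, "bypass"    # only the sign of zero needs resolving
    if a_zero:
        return b, "bypass"        # forward the other operand unchanged
    if b_zero:
        return a, "bypass"
    return a + b, "datapath"      # stand-in for the full FPA datapath
```

For example, `add_with_zero_bypass(0.0, 3.5)` returns `(3.5, "bypass")` without touching the adder model.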





Control Slack: Design Space Exploration

8 WCHB control pipelines to avoid throughput hit for non-zero inputs

Some zero-input patterns still take a big throughput hit




Control Slack: Design Space Exploration

Addition of two WCHB stages for sign, exponent, and significand bits

For Mix-pattern sequence, throughput increases by 7.5% to 2 GHz

For Mix-flip sequence, throughput increases by 49.8% to 1.95 GHz




Evaluating Optimized FPA

Combines all data-dependent optimization techniques

56.7% reduction in energy/op while maintaining average throughput

52 GFLOPS/Watt at 1.3 GHz throughput

19% reduction in leakage power
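The efficiency figure can be converted into energy per operation and power with simple arithmetic, since GFLOPS/Watt is the reciprocal of joules per operation. The sketch below assumes one floating-point operation per addition and reads 1.3 GHz as 1.3 billion additions per second; the function names are hypothetical.

```python
def picojoules_per_op(gflops_per_watt):
    """Energy per operation in pJ implied by an efficiency in GFLOPS/Watt."""
    return 1e12 / (gflops_per_watt * 1e9)


def power_watts(ops_per_second, gflops_per_watt):
    """Power in watts at a given operation rate and efficiency."""
    return ops_per_second / (gflops_per_watt * 1e9)
```

Under these assumptions, 52 GFLOPS/Watt corresponds to roughly 19.2 pJ/op, and at 1.3 GHz to about 25 mW.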




Operand-Dependent Behavior

Latency is also highly operand-dependent

32.8% reduction in latency for zero-input operands

3.5% latency reduction for align shifts of 0 to 3




Summary

Efficient floating-point computations critical

Synchronous FPAs non-optimum for average-case inputs

First transistor-level Asynchronous High-Performance FPA

Detailed energy consumption breakdown of FPA datapath

Profiling floating-point operands in diverse application benchmarks

Operand-dependent optimizations reduced energy/op by 56.7% while preserving throughput




Fine-grain Pipelining

Quasi-delay-insensitive (QDI) circuits

Highly pipelined (30 stages) to maximize throughput

Pre-charge enable half-buffer (PCEHB) for data computation

Weak-condition half-buffer (WCHB) for buffer and copy

