-
8/6/2019 ASYNC2010 - A7 - Basit Riaz Sheikh - An Operand-Optimized Asynchronous IEEE 754 Double-Precision Floating-Point Adder
An Operand-Optimized Asynchronous IEEE 754
Double-Precision Floating-Point Adder
Basit Riaz Sheikh and Rajit Manohar
Computer Systems Laboratory
Cornell University
-
Motivation
Fast floating-point computations are important in science and engineering
A 1 PetaFLOP supercomputer consumes 5-10 megawatts
Data centers are relocating closer to hydroelectric dams
Motivation - Baseline FPA - Operand Optimizations - Evaluation
-
IEEE Double Precision Format
$(-1)^{s} \times 1.\mathrm{significand} \times 2^{\mathrm{exponent} - 1023}$
Format for most (but not all) cases
Vast range of inputs adds to hardware complexity
Addition/Subtraction most frequent floating-point operation
Field: S | Exponent | Significand
Bits:  1 | 11       | 52
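The field layout above can be checked on any machine whose floats are IEEE 754 binary64. The following is a minimal sketch; the helper names `unpack_double` and `normal_value` are illustrative and not from the talk. Zero is one of the "not all" cases: its exponent field is 0, so the formula does not apply.

```python
import struct

def unpack_double(x):
    """Split a Python float (IEEE 754 binary64) into (sign, biased exponent, fraction)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    sign = bits >> 63                      # 1 bit
    exponent = (bits >> 52) & 0x7FF        # 11 bits, biased by 1023
    fraction = bits & ((1 << 52) - 1)      # 52 stored significand bits
    return sign, exponent, fraction

def normal_value(sign, exponent, fraction):
    """Evaluate (-1)^s * 1.significand * 2^(exponent-1023); valid for normal numbers only."""
    return (-1) ** sign * (1 + fraction / 2**52) * 2.0 ** (exponent - 1023)

print(unpack_double(-2.5))                 # sign, exponent, fraction fields of -2.5
print(normal_value(*unpack_double(-2.5)))  # reconstructs the value from the fields
```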
-
Baseline FPA Architecture
[Figure: baseline FPA datapath. Two operands (S, Exponent, Significand) enter Front-End/Unpack (Right Shift Val, NaN/Inf/Denormal, Swap), followed by Right Alignment Shift, Significand Addition/Subtraction (8-bit KS Adder blocks with 0/1 selects and Carry Logic), and Leading One Prediction and Decoding (LOP/LOD). The datapath then splits into a RIGHT PIPE (1-Bit Left/Right Shift, Rounding & Incrementer) and a LEFT PIPE (Left Normalize Shifter, LOP Correction Shift), followed by Left or Right Select and Pack (S, Exponent, Significand).]
Right alignment shift example:
$1.0101 \times 2^{5} + 1.1101 \times 2^{2}$
$= 1.0101 \times 2^{5} + 0.0011101 \times 2^{5}$
$= 1.1000101 \times 2^{5}$
Left normalize shift example:
$1.1101 \times 2^{10} - 1.1100 \times 2^{10}$
$= 0.0001 \times 2^{10}$
$= 1.0000 \times 2^{6}$
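The two worked examples on this slide (right alignment before addition, left normalization after a cancelling subtraction) can be reproduced with exact integer arithmetic. This is only a sketch under stated assumptions: significands are integers with `frac_bits` fractional bits, the first operand has the larger exponent, the result is positive, and no rounding is modelled. The function names are hypothetical.

```python
def add_aligned(sig_a, exp_a, sig_b, exp_b, frac_bits, subtract=False):
    """Combine sig_a*2^exp_a and sig_b*2^exp_b (exp_a >= exp_b) without losing bits."""
    d = exp_a - exp_b                 # right-alignment shift amount for operand b
    assert d >= 0
    wide_a = sig_a << d               # widen a instead of dropping b's shifted-out bits
    result = wide_a - sig_b if subtract else wide_a + sig_b
    return result, exp_a, frac_bits + d

def normalize(sig, exp, frac_bits):
    """Return an equivalent (sig, exp, frac_bits) whose leading one sits at bit frac_bits."""
    assert sig > 0
    msb = sig.bit_length() - 1
    if msb >= frac_bits:              # carry out: reinterpret the binary point
        return sig, exp + (msb - frac_bits), msb
    shift = frac_bits - msb           # leading zeros to remove (left normalize)
    return sig << shift, exp - shift, frac_bits

# 1.0101 x 2^5 + 1.1101 x 2^2
print(normalize(*add_aligned(0b10101, 5, 0b11101, 2, 4)))
# 1.1101 x 2^10 - 1.1100 x 2^10
print(normalize(*add_aligned(0b11101, 10, 0b11100, 10, 4, subtract=True)))
```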
-
Outline
Motivation
Baseline FPA Architecture
Baseline FPA Evaluation
  Energy breakdown
Operand-dependent optimizations
  Interleaved Adder/Incrementer
  Conditional Left/Right Pipelining
  Two-way Right Align Shift
  Zero-input Bypass Path
  LOP/LOD minimization
  Control simplification for conditional bit inversion logic
Evaluation of the Optimized FPA
-
Evaluation of Baseline FPA
First fully implemented asynchronous high-performance FPA
QDI highly pipelined circuits
Standard transistor sizing techniques
Transistor-level simulation with conservative wire loads
Functional correctness verified with gate-level simulation tool
65nm bulk CMOS process
Typical-Typical process corner
25C operating temperature
Gate and sub-threshold leakage included in power values
-
Evaluation of Baseline FPA
State-of-the-art synchronous FPA (65nm SOI) by Quinnell et al. consumed 177 pJ/op @ 666 MHz
At 1 V, the baseline FPA has 3.2X higher throughput and consumes 2.6X less energy/op
Reference: Quinnell, Swartzlander, Lemonds, "Floating-Point Fused Multiply-Add Architectures"
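As a rough sanity check, the two ratios can be turned into absolute numbers for the baseline FPA. The values below are derived arithmetically from the figures on this slide and are not quoted by the talk; treat them as approximations.

```python
# Figures quoted on the slide for the synchronous reference design
SYNC_ENERGY_PJ_PER_OP = 177.0
SYNC_THROUGHPUT_MHZ = 666.0

# Ratios quoted for the baseline asynchronous FPA at 1 V
THROUGHPUT_RATIO = 3.2
ENERGY_RATIO = 2.6

# Derived estimates (assumption: the ratios apply directly to the quoted figures)
baseline_throughput_mhz = SYNC_THROUGHPUT_MHZ * THROUGHPUT_RATIO
baseline_energy_pj = SYNC_ENERGY_PJ_PER_OP / ENERGY_RATIO

print(f"baseline throughput ~ {baseline_throughput_mhz / 1000:.2f} GHz")
print(f"baseline energy     ~ {baseline_energy_pj:.1f} pJ/op")
```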
-
Baseline FPA Energy Breakdown
[Figure: baseline FPA datapath (Front-End/Unpack, Right Alignment Shift, Significand Adder/Subtractor, LOP/LOD, LEFT PIPE with Left Normalize Shifter and LOP Correction Shift, RIGHT PIPE with 1-Bit Left/Right Shift and Rounding & Incrementer, Left or Right Select, Pack) annotated with per-block energy shares: 16%, 15%, 20%, 7%, 13%, 12%, 3%, 3%]
-
Profiling Input Patterns
Most synchronous FPAs include complex circuitry to attain constant latency and constant throughput for the best, average, and worst-case input patterns alike
But how often does the worst case happen?
Intel's PIN toolkit was used to understand input stream patterns
A set of 10 diverse floating-point applications from the SPEC2006 and PARSEC benchmark suites
10 billion input operands were profiled for each application and then used for statistical analysis
-
Operand Dependent Optimizations
[Figure: FPA datapath block diagram (Front-End/Unpack, Right Alignment Shift, Significand Adder/Subtractor, LOP/LOD, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack)]
-
Radix-4 Adder Carry-Chain
In the worst case, the carry propagates through all bits
Synchronous FPAs use expensive tree adder topologies to ensure constant low latency and high throughput
90% of the time, the maximum carry-chain length is limited to 7 radix-4 positions
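One way to measure the statistic above on a pair of operands is sketched here. The definition of chain length (a generating radix-4 digit plus the consecutive propagating digits its carry ripples through) and the default 56-bit width are assumptions made for illustration; the talk does not spell out its exact metric.

```python
def max_carry_chain_radix4(a, b, width_bits=56):
    """Longest carry chain, in radix-4 digit positions, when adding a and b."""
    digits = (width_bits + 1) // 2
    longest = 0
    current = 0                          # length of the chain currently rippling
    for i in range(digits):
        da = (a >> (2 * i)) & 0b11
        db = (b >> (2 * i)) & 0b11
        s = da + db
        if s >= 4:                       # generate: a new chain starts here
            current = 1
        elif s == 3 and current > 0:     # propagate: the incoming carry ripples on
            current += 1
        else:                            # kill, or propagate with no incoming carry
            current = 0
        longest = max(longest, current)
    return longest

# Worst case for an 8-bit operand: the carry ripples through all 4 radix-4 digits
print(max_carry_chain_radix4(0xFF, 0x01, width_bits=8))
```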
-
Interleaved Asynchronous Adder
Interleaved adder consumes 79% less energy/op at 1.4% higher throughput
35% fewer transistors than the baseline adder topology
Negligible difference between best-case and average-case throughput!
[Figure: Interleave Send and Interleave Merge stages around Ripple Adder blocks]
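Why interleaving hides an occasional long carry chain can be illustrated with a toy timing model. Everything here is an assumption made for illustration (abstract time units, one operation issued per unit, strict alternation between lanes, in-order merge); it is not the circuit's actual handshake timing.

```python
def finish_times(latencies, lanes=2, issue_interval=1):
    """In-order completion times when operations alternate across `lanes` adders."""
    lane_free = [0] * lanes              # time at which each lane becomes idle
    prev_out = 0
    out = []
    for i, latency in enumerate(latencies):
        lane = i % lanes                 # interleave send: strict alternation
        start = max(i * issue_interval, lane_free[lane])
        done = start + latency           # data-dependent ripple latency
        lane_free[lane] = done
        prev_out = max(prev_out, done)   # interleave merge: results leave in order
        out.append(prev_out)
    return out

# One slow operation (latency 4) among fast ones
print(finish_times([1, 4, 1, 1], lanes=1))
print(finish_times([1, 4, 1, 1], lanes=2))
```

In this model the second lane lets the operations behind the slow one start early, so the last result leaves sooner than with a single ripple adder.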
-
Interleaved Incrementer Logic
Baseline FPA uses a carry-select incrementer for the 53-bit significand
Parallel tree logic for increment carry computation
For over 90% of operations that use the incrementer, the carry length is less than 4 radix-4 positions
Replace the carry-select incrementer with an interleaved incrementer
Two radix-4 ripple incrementers
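For an increment, the carry length is simply the number of low-order radix-4 digits that are all ones. A minimal sketch, assuming the carry length is counted in whole radix-4 digits starting from the least significant one (the function name is hypothetical):

```python
def increment_carry_length(x, width_bits=53):
    """Number of radix-4 digits the +1 carry ripples through before being absorbed."""
    digits = (width_bits + 1) // 2
    length = 0
    for i in range(digits):
        if (x >> (2 * i)) & 0b11 == 0b11:   # digit is 3: the carry passes through
            length += 1
        else:                               # any other digit absorbs the carry
            break
    return length

print(increment_carry_length(0b1111))       # two low digits are 3
```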
-
Conditional Left/Right Pipelines
[Figure: FPA datapath block diagram (Front-End/Unpack, Right Alignment Shift, Significand Adder/Subtractor, LOP/LOD, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack)]
18% power savings
For 80%+ of input patterns: exponent difference > 1 or addition
-
Conditional Left/Right Pipelines
[Figure: FPA datapath block diagram (Front-End/Unpack, Right Alignment Shift, Significand Adder/Subtractor, LOP/LOD, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack)]
13% power savings
Exponent difference < 2 and effective subtraction
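The two conditions on the Conditional Left/Right Pipelines slides are complementary, so the choice of pipe can be written as a single predicate. The sketch below is one reading of those conditions; mapping "exponent difference < 2 and effective subtraction" to the LEFT PIPE (the only case where many leading zeros can appear) is an interpretation, and the function and argument names are hypothetical.

```python
def select_pipe(exp_a, exp_b, sign_a, sign_b, is_subtract_op):
    """Return which normalization pipe an operation needs: 'LEFT' or 'RIGHT'."""
    # Effective subtraction: the magnitudes are subtracted once signs are accounted for
    effective_subtraction = (sign_a != sign_b) != is_subtract_op
    exp_diff = abs(exp_a - exp_b)
    if exp_diff < 2 and effective_subtraction:
        return "LEFT"    # large cancellation possible: full left normalize shift
    return "RIGHT"       # at most a 1-bit shift, then rounding

print(select_pipe(1030, 1030, 0, 0, True))    # x - y with equal exponents
print(select_pipe(1030, 1025, 0, 1, False))   # effective subtraction, far exponents
```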
-
Alignment Shift Optimization
[Figure: FPA datapath block diagram (Front-End/Unpack, Right Alignment Shift, Significand Adder/Subtractor, LOP/LOD, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack), annotated with 16%]
-
Alignment Shift Pattern
Benchmark applications exhibit a common property
A significant proportion of right-align shift values range from 0 to 3 inclusive
Is the logarithmic align shifter an energy-efficient choice?
-
Two-Way Right-Align Shift
Shifted significand skips the remaining two shift pipelines
Minimal logic for shifted-out bits with shifts from 0 to 3
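The idea of skipping the later shift stages for small shift amounts can be sketched as follows. The three-stage split (shift by 0..3, then by multiples of 4, then by multiples of 16) is an assumption consistent with "the remaining two shift pipelines", not a detail confirmed by the talk; sticky-bit handling is omitted.

```python
def two_way_right_align(sig, shift):
    """Right-shift sig by shift (0..63); return (result, number of shift stages used)."""
    assert 0 <= shift < 64
    out = sig >> (shift & 0b000011)          # stage 1: shift by 0, 1, 2 or 3
    if shift <= 3:
        return out, 1                        # short path: skip the other two stages
    out >>= shift & 0b001100                 # stage 2: shift by 0, 4, 8 or 12
    out >>= shift & 0b110000                 # stage 3: shift by 0, 16, 32 or 48
    return out, 3

print(two_way_right_align(0b1101000, 3))     # small shift takes the short path
print(two_way_right_align(1 << 20, 18))      # large shift uses all three stages
```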
-
Zero-Input Operands
A few applications have a significant proportion of zero operands
Apps involving sparse-matrix manipulations such as Deal and Soplex
Zero-input operands use the full FPA datapath in most synchronous FPAs
-
Zero-input Bypass Path
[Figure: FPA datapath block diagram (Front-End/Unpack, Right Alignment Shift, Significand Adder/Subtractor, LOP/LOD, LEFT PIPE, RIGHT PIPE, Left or Right Select, Pack) with an added WCHB Control path and a WCHB data bypass path running from the front end to Pack]
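Functionally, the bypass amounts to forwarding the non-zero operand when the other one is zero. A behavioural sketch, with the assumption that the case of two zero operands goes through the full path so the sign of the zero result is handled there (the talk does not say how that case is treated):

```python
def add_with_zero_bypass(a, b):
    """Return (sum, path), where path is 'bypass' or 'full'."""
    a_zero = (a == 0.0)
    b_zero = (b == 0.0)
    if a_zero and not b_zero:
        return b, "bypass"       # 0 + b = b, no alignment, add or normalize needed
    if b_zero and not a_zero:
        return a, "bypass"       # a + 0 = a
    return a + b, "full"         # regular datapath, stood in for by Python's float add

print(add_with_zero_bypass(0.0, 2.5))
print(add_with_zero_bypass(1.5, 2.5))
```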
-
Control Slack: Design Space Exploration
8 WCHB control pipelines to avoid throughput hit for non-zero inputs
Some zero-input patterns still take a big throughput hit
-
Control Slack: Design Space Exploration
Addition of two WCHB stages for sign, exponent, and significand bits
For Mix-pattern sequence, throughput increases by 7.5% to 2 GHz
For Mix-flip sequence, throughput increases by 49.8% to 1.95 GHz
-
Evaluating Optimized FPA
Combines all data-dependent optimization techniques
56.7% reduction in energy/op while maintaining average throughput
52 GFLOPS/Watt at 1.3 GHz throughput
19% reduction in leakage power
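The efficiency figure can be converted into energy per operation and average power. This is plain unit conversion, assuming each floating-point operation counted in GFLOPS is one addition; the derived numbers are not quoted in the talk.

```python
GFLOPS_PER_WATT = 52.0
THROUGHPUT_GHZ = 1.3

# 1 W spent on 52e9 operations per second gives joules per operation
energy_per_op_pj = 1e12 / (GFLOPS_PER_WATT * 1e9)

# Power drawn when sustaining 1.3e9 operations per second
power_mw = (THROUGHPUT_GHZ * 1e9) * energy_per_op_pj * 1e-12 * 1e3

print(f"{energy_per_op_pj:.2f} pJ/op, {power_mw:.1f} mW at {THROUGHPUT_GHZ} GHz")
```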
-
Operand-Dependent Behavior
Latency is also highly operand-dependent
32.8% reduction in latency for zero-input operands
3.5% latency reduction for align shifts of 0 to 3
-
Summary
Efficient floating-point computations are critical
Synchronous FPAs are non-optimal for average-case inputs
First transistor-level asynchronous high-performance FPA
Detailed energy consumption breakdown of the FPA datapath
Profiling floating-point operands in diverse application benchmarks
Operand-dependent optimizations reduced energy/op by 56.7% while preserving throughput
-
Fine-grain Pipelining
Quasi-delay-insensitive (QDI) circuits
Highly pipelined (30 stages) to maximize throughput
Pre-charge enable half-buffer (PCEHB) for data computation
Weak-condition half-buffer (WCHB) for buffer and copy