27/09/10 Floating Point Unit 1
An energy-efficient combined floating point and integer ALU for reconfigurable multi-core architectures
A literature study by
Tom Bruintjes
01/10/10
Assignment
Design or modify a Floating Point Unit so that it can also be used as an Integer Unit, and determine its cost in terms of Area and Energy efficiency.
Requirements
- Floating Point addition and multiplication & Integer addition and multiplication
- Pipeline should be shallow (preferably no more than 2-stages)
- Low area costs
- Low power consumption
Motivation
Multicore architecture
- MPSoC
- Tile Processor
Heterogeneous but no Floating Point
- Too expensive (area, energy)
- Fixed Point alternative
- Software Emulation
Tilera TILE-Gx100 (100 cores but no floating point)
Motivation (2)
What if we did add a FPU?
- High performance FP ops
- A lot of hardware needed
- Complex datapath → High latency (low frequency)
→ Deep pipeline
- A lot of area wasted if FP is idle
Motivation (3)
Idea: Add FP core and make it compatible with Integer operation so that Integer ops can be offloaded to the FP core when it is idle.
The shared core should be deployable in an embedded system (MPSoC), hence the low area and power consumption requirements.
Few pipeline stages to keep the compiler manageable.
Floating Point - History
Need for FP recognized early
The First FPU: Konrad Zuse’s Z1 (1938)
- 22-bit floating-point numbers
- storage of 64 numbers
- sliding metal parts memory
Floating Point – History (2)
In the beginning floating-point hardware was typically an optional feature
- “scientific computers”
- extremely expensive
Then FP became available in the form of (“math”) Co-processors
- Intel x87 (486 vs )
- Weitek
Mid ’90s: most GPPs are equipped with FP units
Current situation: FP also in small processors
Why Floating Point
Unsigned/Signed
(…,-2,-1),0,1,2,3,…
[0000,0001,0010,0011]
- What about rational numbers or very large/small numbers?
Fixed Point
0.11, 1.22, 2.33,…
[00.11, 01.10, 10.11]
Limited range and precision
- Solution: Floating Point (scientific) notation
- 1.220 x 10^5 (12.20 x 10^4 or 122.0 x 10^3, hence floating point)
Floating Point representation/terminology
Floating Point representation
- Sign S
- Significand M (not Mantissa!)
- Exponent E (biased)
- Base (implicit)
Binary representation
[sign | exponent | significand] = [1 | 00001111 | 10101010101010101010101]
Decimal example: 6.02 * 10^23 (significand (mantissa) 6.02, base (radix) 10, exponent 23)
Binary Floating Point storage (issues)
Normalization
- Prevent redundancy: 0.122 * 10^5 vs 1.22 * 10^4
- Normalization means that the first digit is never a zero
- For binary numbers this means MSB is always 1 → “hidden bit”
Single, Double or Quad precision
- 32 bits: single (23-bit significand & 8-bit exponent)
- 64 bits: double (52-bit significand & 11-bit exponent)
Base is implicit
- 2, 10 or 16 are common
Special cases? (NaN, 0, ∞)
The road to getting standardized
Many ways to represent a FP number
- Significand (sign-magnitude, two’s complement, BCD)
- Exponent (biased, two’s complement)
- Special numbers
Unorganized start
- Every company used their own format
- IBM, DEC, Cray
Highly incompatible
- 2 * 1.0 on machine A gives a different result than on machine B
- Situation even worse for exceptions (e.g., underflow and overflow)
IBM System/360 & Cray-1
IBM highlights
- Sign magnitude & biased exponent
- Base-16 numeral system (more efficient/less accurate)
Cray-1 highlights
- Sign magnitude & biased exponent
- Very high precision (64-bit single precision)
IEEE-754
Standardized FP since 1985 (updated in 2008)
Arithmetic formats - binary and decimal Floating Point data (+special cases)
Operations - arithmetic and operations applied to arithmetic formats
Rounding algorithms - rounding routines for arithmetic and conversion
Exceptions handling - exceptional conditions
Format (binary or decimal)
- Sign-magnitude significand & biased exponent
- base-2 or base-10
- N = (-1)^S * (1.M) * 2^(E-127) (binary single precision, bias 127)
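The formula can be checked with a small sketch. It assumes a normalized binary single-precision value (special cases such as 0, NaN and ∞ are ignored), and the function names `decode_single` and `reference` are made up for this illustration. The last line decodes the bit pattern shown earlier under "Binary representation".

```python
import struct

def decode_single(bits: int) -> float:
    """Decode a 32-bit pattern as a normalized single: (-1)^S * 1.M * 2^(E-127)."""
    sign = (bits >> 31) & 0x1           # S: 1 bit
    exponent = (bits >> 23) & 0xFF      # E: 8 bits, biased by 127
    fraction = bits & 0x7FFFFF          # M: 23 stored bits, hidden leading 1 not stored
    significand = 1.0 + fraction / 2**23
    return (-1.0) ** sign * significand * 2.0 ** (exponent - 127)

def reference(bits: int) -> float:
    """Let the struct module interpret the same 32 bits as a float."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

# Bit pattern from the representation slide: sign 1, exponent 15, 23 fraction bits.
example = int("1" + "00001111" + "10101010101010101010101", 2)
print(decode_single(example) == reference(example))
```

The comparison only holds for normalized numbers; an exponent field of 0 or 255 needs the special-case handling that this sketch leaves out.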
IEEE-754 (2)
Operations
- Minimum set: Add, Sub, Mul, Div, Rem, Rnd to Int, Comp
- Recommended set: Log,…
Rounding modes
- Round to nearest, ties to even
- Round to zero
- Round up
- Round down
Exceptions
- Invalid operation
- Division by zero
- Overflow
- Underflow
Rounding
Almost never exact FP representation
[1.11110] * 2^5 (62d)
[1.11111] * 2^5 (63d)
Rounding is required
IEEE-754 rounding modes:
- Round to nearest (ties to even)
- Round to zero
- Round up
- Round down
Rounding (to nearest) algorithm based on 3 extra LSBs (guard bits)
- 0xx: round down | 100: tie, round to even | 101 to 111: round up
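A minimal sketch of the round-to-nearest, ties-to-even decision on the three extra bits, assuming the significand is held as a non-negative integer. The function name `round_nearest_even` is hypothetical, and a carry out of the significand after rounding up is left to the caller.

```python
def round_nearest_even(value: int, extra_bits: int = 3) -> int:
    """Drop the low `extra_bits` of `value`, rounding to nearest, ties to even."""
    kept = value >> extra_bits                    # bits that survive
    dropped = value & ((1 << extra_bits) - 1)     # guard bits being removed
    half = 1 << (extra_bits - 1)                  # the pattern 100
    if dropped > half:                            # 101..111: round up
        return kept + 1
    if dropped == half:                           # exactly halfway: make the result even
        return kept + (kept & 1)
    return kept                                   # 0xx: round down

print(round_nearest_even(0b1011_100))  # tie with odd kept part, rounds up to 12
print(round_nearest_even(0b1010_100))  # tie with even kept part, stays at 10
```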
Floating Point arithmetic
More complex than Integer
Lots of shifting results and overhead due to exceptional cases
Addition
2.01 * 10^12
1.33 * 10^11 +
1. Check for zeros.
2. Align significands so exponents match (guard bits): right shift!
3. Add/Subtract significands.
4. Normalize and Round the result
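The four steps can be sketched in software. This is a toy model under stated assumptions, not a description of any particular FPU: operands are positive, a number is a pair `(exponent, significand)` with value `significand * 2**(exponent - 23)`, the significand is a normalized 24-bit integer, and overflow, underflow and subtraction are ignored. The name `fp_add` is made up for this illustration.

```python
SIG_BITS = 24   # significand width including the hidden bit (single precision)
GUARD = 3       # extra low-order bits kept during alignment

def fp_add(a, b):
    """Add two positive toy floats given as (exponent, significand)."""
    (ea, sa), (eb, sb) = a, b
    # 1. Check for zeros.
    if sa == 0:
        return b
    if sb == 0:
        return a
    # 2. Align: right-shift the operand with the smaller exponent.
    if ea < eb:
        (ea, sa), (eb, sb) = (eb, sb), (ea, sa)
    shift = ea - eb
    sa <<= GUARD
    sb <<= GUARD
    sticky = 1 if sb & ((1 << shift) - 1) else 0   # remember bits shifted out
    sb = (sb >> shift) | sticky
    # 3. Add significands.
    total = sa + sb
    exponent = ea
    # 4. Normalize (at most one right shift for same-sign addition) ...
    if total >> (SIG_BITS + GUARD):
        total = (total >> 1) | (total & 1)
        exponent += 1
    # ... and round to nearest, ties to even, using the guard bits.
    kept, dropped = total >> GUARD, total & ((1 << GUARD) - 1)
    half = 1 << (GUARD - 1)
    if dropped > half or (dropped == half and kept & 1):
        kept += 1
        if kept >> SIG_BITS:          # rounding overflowed the significand
            kept >>= 1
            exponent += 1
    return exponent, kept

# 1.5 + 0.5 = 2.0, i.e. significand 0x800000 with exponent 1
print(fp_add((0, 0xC00000), (-1, 0x800000)))
```

Subtraction would additionally need a leading zero detector and a left shift during normalization, which is why those components show up in FPU datapaths.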
Floating Point Arithmetic (2)
Multiplication
1. Checking for zeros.
2. Multiplying significands
3. Adding exponents (correct for double bias)
4. Normalizing & Rounding the result
Division
1. Checking for zeros.
2. Divide significands
3. Subtract exponents (the two biases cancel, so add the bias back)
4. Normalizing & Rounding the result
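A sketch of the multiplication steps, assuming single-precision field widths (bias 127, 24-bit significand with the hidden bit made explicit) and ignoring overflow, underflow and special values. The tuple layout `(sign, biased_exponent, significand)` and the name `fp_mul` are choices made for this illustration only.

```python
BIAS = 127

def fp_mul(a, b):
    """Multiply two toy floats given as (sign, biased_exponent, 24-bit significand)."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    sign = sa ^ sb
    # 1. Check for zeros.
    if ma == 0 or mb == 0:
        return (sign, 0, 0)
    # 2. Multiply significands: 24 x 24 bits gives a 48-bit product in [1, 4).
    product = ma * mb
    # 3. Add exponents; both carry the bias, so one copy is subtracted.
    exponent = ea + eb - BIAS
    # 4. Normalize: a product of 2 or more needs one extra right shift.
    if product >> 47:
        exponent += 1
        shift = 24
    else:
        shift = 23
    kept = product >> shift
    dropped = product & ((1 << shift) - 1)
    half = 1 << (shift - 1)
    # Round to nearest, ties to even.
    if dropped > half or (dropped == half and kept & 1):
        kept += 1
        if kept >> 24:                # rounding overflowed the significand
            kept >>= 1
            exponent += 1
    return (sign, exponent, kept)

# 1.5 * 1.5 = 2.25 = 1.125 * 2^1
print(fp_mul((0, 127, 0xC00000), (0, 127, 0xC00000)))
```

Division follows the same outline, except that subtracting two biased exponents removes the bias entirely, so it has to be added back.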
Floating Point Architecture
Architecture is a combination of HW, SW, Format, Exceptions, …
Focus on hardware (datapath) of a Floating Point Unit
- Multiplier
- Adder/Subtracter
- (Divider)
- Shifters
- Comparators
- Leading Zero Detection
- Incrementers
How are components connected, what techniques are used and how does that influence the efficiency of the FPU?
- Latency (parallelism)
- Throughput (ILP, pipeline stages)
- Area & Power (clock gating)
Highlighted Architectures
UltraSparc T2
Itanium
Cell
UltraSparc T2
UltraSparc T2 was released in 2007 by Sun
Features
- Multicore (since 2008 SMP capable) microprocessor
- Eight cores with 8 threads each = 64 concurrent threads
- Up to 1.6GHz
- Two Integer ALUs per core
- One FPU per core
- “Open” design
Applications
- Only servers produced by Sun
UltraSparc T2 Floating Point
Eight cores, each with an FPU
- Single and Double precision IEEE
Conventional FPU design
- Dedicated datapath for each instruction
UltraSparc characteristics
- Pipeline for addition/multiplication
  - 6 stages, 1 instruction per cycle → shared
- Combinatorial division datapath
- Area and power efficient
  - clock gating
  - reduced switching
Itanium
Intel and HP combined efforts to revolutionize computer architecture in ’98
- Complete overhaul of the legacy x86 architecture based on instruction level parallelism
- RISC replaced by VLIW
- Large registers
First Itanium appeared in 2001, the latest model (Tukwila) is from February 2010
Tukwila features
- 2-4 Cores per CPU
- Up to 1.73GHz
- Four Integer ALUs per core
- Two FPUs per core
Itanium
Very powerful, very big
- Two full IEEE double precision FP units
- Leader in SPECfp
- Single and double precision + custom formats
Architecture
- Unfortunately (too) many details are undisclosed
- So why look at Itanium at all? Because what has been disclosed is interesting:
  - Fused Multiply-Add
Fused Multiply Add
FMA architecture: fused multiply and add instructions
- (A*C)+B vs A*C and A+B
FMA advantages
- Atomic MAC operations (~double performance)
- Only one rounding error
Expensive?
- Multiplication: Wallace Tree of CSAs
- Partial addition product: 3:2 CSA
- Full adder for conversion from carry-save (CS) format
- Leading Zero Detection/Anticipation
- Shifters for alignment and post-normalization
No: end-around-carry principle
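The "only one rounding error" advantage can be made visible with a small experiment. The sketch below emulates single precision by rounding Python doubles through the `struct` module; the helper names are hypothetical, and the double-precision intermediate only stands in for a wide internal datapath because it happens to be exact for the inputs chosen here.

```python
import struct

def to_single(x: float) -> float:
    """Round a double to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def mul_then_add(a: float, c: float, b: float) -> float:
    """Separate multiply and add: the product is rounded before the addition."""
    return to_single(to_single(a * c) + b)

def fused_mul_add(a: float, c: float, b: float) -> float:
    """Fused multiply-add: a single rounding at the very end."""
    return to_single(a * c + b)

a = c = 1.0 + 2.0**-12          # exactly representable in single precision
b = -(1.0 + 2.0**-11)           # cancels the leading part of a * c

print(mul_then_add(a, c, b))    # the 2^-24 term was rounded away before the add
print(fused_mul_add(a, c, b))   # the 2^-24 term survives
```

Here `a * c` equals 1 + 2^-11 + 2^-24, which needs 25 significand bits, so rounding the product first discards exactly the part that the subtraction would have exposed.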
End-around carry multiplication
Carry-save adder vs Full adder
- CSA chain
- CSA tree
- Add one more CSA before conversion
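To illustrate why carry-save adders are attractive inside a multiplier, here is a behavioral sketch of a 3:2 CSA and of a CSA chain that accumulates partial products, with a single carry-propagating addition at the end. It models plain unsigned multiplication only; the end-around carry technique itself is not modeled, and the function names are made up for this example.

```python
def carry_save_add(x: int, y: int, z: int):
    """3:2 compressor: reduce three operands to a sum word and a carry word.

    No carry ripples between bit positions, so the delay does not depend
    on the operand width. Invariant: sum_word + carry_word == x + y + z.
    """
    sum_word = x ^ y ^ z
    carry_word = ((x & y) | (x & z) | (y & z)) << 1
    return sum_word, carry_word

def multiply_with_csa(a: int, b: int, width: int = 8) -> int:
    """Unsigned multiply using a chain of CSAs over the partial products."""
    sum_word, carry_word = 0, 0
    for i in range(width):
        if (b >> i) & 1:
            partial_product = a << i
            sum_word, carry_word = carry_save_add(sum_word, carry_word, partial_product)
    # Conversion out of carry-save format: the only carry-propagating addition.
    return sum_word + carry_word

print(multiply_with_csa(13, 11))
```

A CSA tree (for example a Wallace tree) applies the same compressor in parallel layers instead of sequentially, which shortens the critical path.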
Fused Multiply-Add (2)
FP ops based on Fused Multiply-Add architecture
FMA: fma.[pc].[sf] f1 = f3, f4, f2 → f1 = (f3 * f4) + f2
ADD: fadd.[pc].[sf] f1 = f3, f2 → multiplier operand is the register hardwired to +1.0
MUL: fmul.[pc].[sf] f1 = f3, f4 → addend operand is the register hardwired to +0.0
- Not as efficient as single add and multiply instructions
Division and Square Root
- Division and Square Root can be implemented in Software
- Lookup table for initial estimate (1/a and 1/√a)
- Newton-Raphson approximation (1 approximation and 13 FMA instructions on the Itanium)
- Intel FPU bug! ($475,000,000)
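Software division on an FMA unit can be sketched as a Newton-Raphson iteration on the reciprocal. This is not the Itanium instruction sequence: the linear initial estimate below merely stands in for the lookup table, the iteration count is chosen by hand, only positive inputs are handled, and the names `reciprocal` and `divide` are hypothetical.

```python
import math

def reciprocal(a: float, iterations: int = 4) -> float:
    """Approximate 1/a for a > 0 using only multiply-add shaped steps."""
    mantissa, exponent = math.frexp(a)          # a = mantissa * 2**exponent, 0.5 <= mantissa < 1
    # Initial estimate (stand-in for a lookup table): linear fit of 1/mantissa.
    x = math.ldexp(48.0 / 17.0 - 32.0 / 17.0 * mantissa, -exponent)
    for _ in range(iterations):
        error = 1.0 - a * x                     # FMA shape: -(a * x) + 1
        x = x + x * error                       # FMA shape:  (x * error) + x
    return x

def divide(n: float, d: float) -> float:
    """n / d as a multiplication by the approximated reciprocal."""
    return n * reciprocal(d)

print(divide(22.0, 7.0))
```

Each iteration roughly squares the relative error, which is why a good initial estimate and a handful of FMA steps are enough. A correctly rounded quotient needs additional correction steps that this sketch omits.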
Cell
Combined efforts from Sony, Toshiba and IBM
- Sony: Architecture & Applications
- IBM: SOI process technology
- Toshiba: Manufacturing
- Development started 2000, 400 people, $400M
- First Cell in 2006
Applications
- Playstation 3
- Blu-ray
- HDTVs
- High performance computing
Features
- 9 cores (PPC and SPE) for Integer and FP
- 3.2GHz
- All SIMD instructions
Cell (2)
1 PPC and 8 SPEs
- PPC for compatibility
- SPEs for performance
1 FPU per SPE
- 4 single precision cores per FPU
- 1 double precision core per FPU
Why separate?
- Performance requirements for SP Float too high for a double precision unit
Single Precision FP in the Cell
Single precision
- Full FMA unit
- Similar approach as Itanium
- DIV/SQRT/Convert/… in software
Aggressive optimization
- Denormal numbers forced to zero
- NaN/∞ treated as normal number
- Only round to zero
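Two of these simplifications, round to zero and flushing denormals to zero, can be emulated in a few lines. The sketch assumes single-precision bit layouts and finite inputs within single-precision range; it illustrates the behavior only and is not the Cell implementation. All function names are made up.

```python
import struct

def to_bits(x: float) -> int:
    """Bit pattern of x rounded to the nearest single-precision value."""
    return struct.unpack("<I", struct.pack("<f", x))[0]

def from_bits(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits))[0]

def round_toward_zero_single(x: float) -> float:
    """Convert a double to single with truncation and flush-to-zero."""
    bits = to_bits(x)
    if abs(from_bits(bits)) > abs(x):
        bits -= 1                      # nearest rounding went away from zero: step back one ulp
    if (bits >> 23) & 0xFF == 0:       # exponent field 0: zero or denormal
        bits &= 0x80000000             # keep only the sign, giving a signed zero
    return from_bits(bits)

print(round_toward_zero_single(0.1))       # slightly below 0.1 instead of slightly above
print(round_toward_zero_single(1e-40))     # a denormal in single precision, flushed to 0.0
```

Truncation removes the rounding incrementer from the datapath, and flushing denormals removes the extra normalization cases, at the cost of accuracy near zero.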
Shared Integer/FP ALUs
Have FPUs been used for Integer operations in the past?
- Yes, in fact the UltraSparc T2 and Cell already do so
- Cell: converts Integers into some format that can be processed by the SP FPU
- UltraSparc: Maps Integer multiplication, addition and division directly on the respective FP hardware, however not the full MAC capabilities…
Issues
- Overhead due to FP specific hardware
- Priorities
- Starvation
Approach
Design FPU
- Implement single precision core and drop most of the stuff that makes FP so expensive, much like the Cell processor
- Widen the design to make it compatible with 32-bit Integer operands
Add integer capability
- Add switches and control in the design to support Integer operands
- …without affecting FP performance
Optimization
- Optimize the design for efficiency
- Area/Power
Measure Performance, Area and Power Consumption
- 65 or 90nm
Approach – Floating Point Unit
Formatting
- Close to IEEE format (Not GPP but don’t make it too obscure, e.g. Itanium)
  - Sign magnitude
  - Biased exponent
  - Base-2
- Single Precision (double is excessive)
- Initially ignore special cases
Architecture
- Fused-Multiply-Add unit only + compares
  - A la Cell: Shifter, Tree Multiplier, CSA, Full adder
- Initially three pipeline stages
  1) Align/Multiply
  2) Add/Prepare normalization
  3) Post-normalize
- Reduce to two stages if possible
Approach – Floating Point Unit (2)
IEEE-754 compatibility
- Format (not all the special cases)
- Arithmetic (next slide)
- Rounding modes
  - Round to zero
  - Round to nearest
  - Round up
  - Round down
Exceptions and special cases
- Denormalized numbers
- NaN, Infinity (to be determined)
- Exceptions (underflow, overflow, etc.)
Approach – Floating Point Unit (3)
FP Arithmetic
- Multiplication → Fused Multiply-Add
- Addition → Fused Multiply-Add
- Division → Software
- Square Root → Software
- Conversion → Software
- Compare
Approach – Integer Unit
32-bit signed Integer ALU
- Preferably two’s complement (most common representation)
- Single precision maps nicely to 2x32bit registers
Arithmetic mapping
- Addition → Full adder
- Multiplication → Wallace Tree
- MAC
- Shift → Aligner
Reconfiguring
- Initially no bypassing (drain pipeline before reconfiguring)
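One reason the integer mapping is plausible: the low 32 bits of a two's complement product and sum are identical to those of the unsigned computation, so an unsigned multiplier array and adder can serve signed integers as long as only the low word is kept. The behavioral sketch below illustrates that property under this assumption; it is not the proposed design, and the names `to_signed32` and `int_mac` are hypothetical.

```python
MASK32 = 0xFFFFFFFF

def to_signed32(x: int) -> int:
    """Interpret the low 32 bits of x as a two's complement integer."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def int_mac(a: int, b: int, c: int) -> int:
    """a * b + c on 32-bit two's complement operands, wrapping on overflow."""
    product = (a & MASK32) * (b & MASK32)   # unsigned multiplier (a Wallace tree in hardware)
    total = product + (c & MASK32)          # final adder
    return to_signed32(total)               # keep the low 32 bits only

print(int_mac(-3, 7, 5))
```

With the addend set to 0 this gives a plain multiply, and with one multiplier operand set to 1 a plain add, mirroring how FMA hardware covers separate add and multiply operations.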
Proposed architecture
32-bit Input registers
- FP: 32-bit significand & 32-bit exponent
- Integer: 32-bit signed
3-Stage pipeline
- Stage 1: Aligner for FP or Barrel shifter, 32x32 Multiplier
- Stage 2: Full Adder and Leading Zero Det.
- Stage 3: Normalization and Rounding
2-stage pipeline?
- Merge stage 2 and 3
Testing/Benchmarking
After functional testing, implementation in 65 or 90nm
Measure area and power usage
- Benchmark to be determined
Questions
Whatever the question, lead is the answer.