MOS High Performance Arithmetic – arith23.gforge.inria.fr/slides/horowitz.pdf
TRANSCRIPT
Arithmetic Is Important
2
Then
Now (Tegra K1)
What Is Hard?
9999999 + 1
3
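The 9999999 + 1 example shows why addition is hard: a single carry out of the lowest digit can ripple through every position, so worst-case latency grows with operand width. A decimal sketch of that ripple (the helper name is mine, not from the talk):

```python
def ripple_add(digits_a, digits_b, base=10):
    """Add two equal-length digit lists (least-significant digit first),
    tracking how far the carry ripples."""
    result, carry, ripple_len = [], 0, 0
    for i, (a, b) in enumerate(zip(digits_a, digits_b)):
        s = a + b + carry
        carry = s // base
        if carry:
            ripple_len = i + 1  # carry still propagating at this position
        result.append(s % base)
    if carry:
        result.append(carry)
    return result, ripple_len

# 9999999 + 1: the single carry out of the low digit ripples through all seven 9s
digits, length = ripple_add([9] * 7, [1, 0, 0, 0, 0, 0, 0])
# digits == [0, 0, 0, 0, 0, 0, 0, 1], length == 7
```

The same dependency chain exists in binary hardware, which is what the adder structures later in the talk attack.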
Proof:
4
2nd Gen
1st Gen
And Getting The Data You Need
But we didn’t notice this until much later …
To notice this problem
• We need many advances in technology
5
3rd Gen – Relays (Z3 1940)
6
http://history-computer.com/ModernComputer/Relays/Zuse.html
A few adds/sec
4th Gen – Tubes (Eniac 1945)
7
5000 Adds/sec
5th Gen – Transistors (TX-0, Transistor 1 1953)
8
TX-0
Modern Era – IC, 1961
Image from State of the Art © Stan Augarten
9
Moore’s Law
Number of components on IC doubles every year
• Later modified to doubling every 18 to 24 months
From Electronics, Volume 38, Number 8, April 19, 1965
10
ECL Computers
11
Microprocessor – MOS Processor (1974)
12
4004
nMOS 1978
13
8086
CMOS 1985 – To Present
14
80386
CMOS (Arithmetic) Design
What makes a good design?
15
Try to Balance 4 Parameters
Area
Performance
Power
Design Time
16
The Good News …
By the time CMOS came along
• There had been a lot of work on arithmetic
• Booth coding
• Wallace trees
• Ling coding
• Tree adders
• Manchester carry chains
• SRT division
• …
17
The Bad News
The best logical design depends on technology
Remember the carry dependency?
• For relays a Manchester carry chain is the best
• All the delay is in changing the relay state
18
[Diagram: Manchester carry chain with propagate/generate inputs P0/G0 through P2/G2]
More Bad News
The metrics you are optimizing work in opposition
19
[Plot: energy vs. performance trade-off curve]
Just to Make Life More Complicated
Your metrics change w/ technology scaling
20
Must Use Technology Independent Metrics
Performance• In terms of a FO4 delay
21
[Plot: fanout-of-4 inverter gate delay (pS) vs. technology Ldrawn (um), measured at TT, 90% Vdd, 125C; the FO4 delay tracks roughly 500 * Ldrawn across generations]
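Using the plot's rule of thumb (FO4 delay of roughly 500 ps per um of Ldrawn at that corner), quoting a circuit's delay in FO4 units gives a number that stays roughly stable across process generations. A sketch with illustrative values:

```python
def fo4_delay_ps(l_drawn_um):
    """Rule of thumb from the plot: FO4 inverter delay ~= 500 ps per um of Ldrawn."""
    return 500.0 * l_drawn_um

def delay_in_fo4(delay_ps, l_drawn_um):
    """Express an absolute delay as a technology-independent multiple of FO4."""
    return delay_ps / fo4_delay_ps(l_drawn_um)

# a 2 ns adder in a 1.0 um process is a 4 FO4 design;
# the same 4 FO4 design ported to 0.25 um would take about 4 * 125 ps = 500 ps
adder_fo4 = delay_in_fo4(2000.0, 1.0)  # -> 4.0
```

The point of the metric: "4 FO4" describes the logic depth of the design itself, independent of which process it is fabricated in.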
Area and Energy
Area
• Measure linear dimensions in “features”
Energy
• CMOS energy is C · V²
• Normalize by the C · V² of a minimum-size gate
22
23
Dennard’s Scaling
The triple play:
• Get more gates: density ∝ 1/L²
• Gates get faster: delay ∝ CV/I ∝ L
• Energy per switch falls: CV² ∝ L³
Dennard, JSSC, pp. 256-268, Oct. 1974
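The three relations can be evaluated directly; here s is the linear shrink factor (with Vdd scaled along with L, as Dennard assumed), and the code just restates the slide's proportionalities:

```python
def dennard_scaling(s):
    """Classic Dennard scaling for a linear shrink by factor s (< 1),
    with supply voltage scaled in proportion to L."""
    density = 1 / s**2   # gates per unit area ~ 1/L^2
    speedup = 1 / s      # gate delay ~ CV/I ~ L, so speed improves by 1/s
    energy = s**3        # energy per switch ~ C*V^2 ~ L^3
    return density, speedup, energy

# one classic 0.7x generation: ~2x the gates, ~1.4x the speed,
# and only ~0.34x the energy per switch
d, f, e = dennard_scaling(0.7)
```

This is why scaling was a "triple play": every generation delivered more, faster, cheaper-to-switch gates simultaneously, until voltage scaling stopped.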
Three Eras of CMOS Arithmetic Design
Getting Going
• Area constrained
Party time!
• Performance constrained
The hangover
• Power constrained
24
GETTING STARTED
25
Life in the 80’s
Just learning how to design complex chips
• Chips had 100K transistors
• Almost no CAD tools
• Worried about getting the design done
• And getting all the functions to fit on chip
Getting the design to fit was job one
• Getting it to go fast was job two
26
Main Effects on Arithmetic Circuits
Merged Function Blocks
• ALU
27
[Diagram: ALU implemented as a lookup table with inputs A, B, P and output F]
Precharge Logic
28
Pseudo-nMOS vs. CMOS Pre-Charge
[Diagram: a pseudo-nMOS gate burns static current through its always-on pMOS load; a precharge (domino) gate replaces it with a dual pMOS network driven by precharge/evaluate clocks]
• Non-overlapping precharge and evaluate phases are good, but not always possible
Carry Chains, and Carry Skip Adders
29
[Diagram: 4-bit carry-skip group: carries C0 through C3 ripple through the four Carry cells, while P0*P1*P2*P3 lets Cin skip directly to Cout through a mux; each sum bit is an XOR with the incoming carry]
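A behavioral model of the carry-skip idea (the width, group size, and names are illustrative; this captures function, not circuit timing, where the win is that the skip path avoids waiting for the group's internal ripple):

```python
def carry_skip_add(a, b, width=16, group=4):
    """Behavioral carry-skip adder: each group of `group` bits ripples locally,
    but the group's carry-out takes a fast bypass path (the AND of all the
    group's propagate signals) whenever every bit in the group propagates."""
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    result, carry = 0, 0
    for start in range(0, width, group):
        bits = range(start, min(start + group, width))
        group_propagates = all(p[i] for i in bits)  # the P0*P1*P2*P3 signal
        cin = carry
        for i in bits:  # local ripple still produces the sum bits
            result |= (p[i] ^ cin) << i
            cin = g[i] | (p[i] & cin)
        # when the whole group propagates, carry-out equals carry-in,
        # so the carry can skip ahead without waiting for the ripple
        carry = carry if group_propagates else cin
    return result, carry

# worst-case carry: 0xFFFF + 1 skips between groups instead of rippling end to end
# carry_skip_add(0xFFFF, 1) -> (0, 1)
```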
Iterating Structures
Main processor
• Just use instructions (micro-code) and ALU
Co-processor
• Also used iterating structures
• But built these structures for multiply or division
• Often asynchronous
30
MIPS R3010 Multiplier
Clocked by internal oscillator, not external clock
31
[Diagram: iterative multiplier datapath built from CSA stages]
A Self-Timed Pipeline
[Diagram: four-stage self-timed pipeline: dual-rail inputs (in+/in-) and outputs (out+/out-), with a C-element per stage generating local clk and precharge signals; all stages start in precharge, with completion signals at 1]
32
A Self-Timed Pipeline
Data enters at the far left and the NOR gate flips
• This activates the C-element
33
A Self-Timed Pipeline
First logic block goes into evaluate
34
A Self-Timed Pipeline
35
A Self-Timed Pipeline
Second block goes into evaluate
• Primary inputs are deasserted, flipping the first NOR gate
36
Division - SRT
Ted Williams
• Completely self-timed
37
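Radix-2 SRT relies on a redundant quotient-digit set {-1, 0, +1}, so each digit can be selected from a low-precision estimate of the partial remainder rather than a full-width compare. The sketch below is my own fixed-point framing, not Williams' circuit: the divisor is normalized to [1/2, 1) and the shifted remainder is compared against the constants ±1/2:

```python
def srt_divide(dividend, divisor, frac_bits=16, k=16):
    """Radix-2 SRT division sketch. The divisor is a fixed-point integer with
    k fractional bits, normalized to [1/2, 1). Quotient digits come from the
    redundant set {-1, 0, +1}, chosen by comparing the shifted partial
    remainder against the constants +/- 1/2 (never against the divisor itself,
    which is what makes the selection logic fast in hardware)."""
    one, half = 1 << k, 1 << (k - 1)
    assert half <= divisor < one and 0 <= dividend < divisor
    r, q = dividend, 0
    for _ in range(frac_bits):
        r <<= 1
        if r >= half:          # digit +1: subtract the divisor
            q = (q << 1) + 1
            r -= divisor
        elif r < -half:        # digit -1: add the divisor back
            q = (q << 1) - 1
            r += divisor
        else:                  # digit 0: no add/subtract this step
            q <<= 1
    if r < 0:                  # final correction for a negative remainder
        q -= 1
        r += divisor
    return q                   # ~ (dividend / divisor) * 2**frac_bits
```

Because the digits are redundant, a sloppy (and therefore fast) digit choice is fine: any error is absorbed by later digits, and the final correction step cleans up the sign of the remainder.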
PARTY TIME!
38
Performance, Performance, Performance
Scaling provided
• Enough transistors
• Low energy, and fast gates
Goal was to find the fastest structures
• Lots of dual rail domino logic
• Started to build full array/trees
• Many of the trees were regular (4:2 adder) for designer sanity
39
Ling Adder Implementation
Sam Naffziger (HP, 1996) presented a 64b adder
• 7 FO4 delay (< 1 nS): pretty darn fast
• 0.5 µm CMOS
From VLSI lecture notes in early 2000’s
40
Kogge Stone Adders
41
[Diagram: 64-bit Kogge-Stone prefix tree: inputs cin and (g0, t0) through (g62, t62) are combined through H4/I4 and H16/I16 levels into the final H64 group signals]
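The prefix network can be modeled in software. This sketch uses the textbook (generate, propagate) prefix operator rather than the Ling-style H/I signals on the slide, so it illustrates the Kogge-Stone tree structure, not Naffziger's circuit:

```python
def kogge_stone_add(a, b, width=64):
    """Bit-level model of a Kogge-Stone parallel-prefix adder: log2(width)
    levels, each combining every (generate, propagate) pair with the pair
    `dist` positions below it, doubling the prefix span per level."""
    g = [((a >> i) & 1) & ((b >> i) & 1) for i in range(width)]
    p = [((a >> i) & 1) ^ ((b >> i) & 1) for i in range(width)]
    G, P = g[:], p[:]
    dist = 1
    while dist < width:
        newG, newP = G[:], P[:]
        for i in range(dist, width):
            newG[i] = G[i] | (P[i] & G[i - dist])  # the prefix operator
            newP[i] = P[i] & P[i - dist]
        G, P, dist = newG, newP, dist * 2
    # after the last level, G[i] is the carry out of bits 0..i
    s, carry_in = 0, 0
    for i in range(width):
        s |= (p[i] ^ carry_in) << i
        carry_in = G[i]
    return s
```

Every carry is available after log2(64) = 6 combining levels, which is why these trees dominated the performance era despite their heavy wiring.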
Alignment Shifter
Build full shifter
42
Even Fuse Multiplier and Adder Together
IBM Power 6 FMA
• 5 GHz 7-stage in 65nm
• Dependent unrounded results forwarded making dependent latency 6 cycles instead of 7
• (6,6,7) design
43
Life Was Good, For a While
44
http://cpudb.stanford.edu/
THE HANGOVER
45
But You Have to Pay Eventually
46
http://cpudb.stanford.edu/
The Power Limit
47
http://cpudb.stanford.edu/
[Plot: power density in Watts/mm² over time]
48
Power Increased Because We Were Greedy
[Plot: CPU power over time, annotated “10x too large” and “Clever”]
http://cpudb.stanford.edu/
This Power Problem Is Not Going Away: P = C · Vdd² · f
49
http://cpudb.stanford.edu/
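To see the quadratic Vdd leverage in P = C · Vdd² · f, a quick evaluation with illustrative numbers (the capacitance and frequency values are made up; only the formula is from the slide):

```python
def dynamic_power(c_farads, vdd_volts, freq_hz):
    """Dynamic switching power from the slide: P = C * Vdd^2 * f."""
    return c_farads * vdd_volts**2 * freq_hz

# illustrative: 1 nF of switched capacitance at 3 GHz
p_high = dynamic_power(1e-9, 1.2, 3e9)  # 4.32 W at 1.2 V
p_low = dynamic_power(1e-9, 0.8, 3e9)   # 1.92 W at 0.8 V, same frequency
# the quadratic term: (0.8 / 1.2)^2 ~ 0.44x the power just from lowering Vdd
```

Once Vdd stopped scaling with L, this quadratic lever was gone, and frequency or capacitance had to give instead.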
Think About It
50
32 bit CMOS Adder Design Space
51
[Plot: 32-bit adder design space, energy (pJ) vs. delay (units of 100 ps), comparing dual rail Sklansky Ling, static Sklansky, and domino Sklansky Ling with 2-bit sum select]
Performance Metrics
Normally we think of the delay of a unit
• But that only matters if there is a dependent op
Many applications have many non-dependent ops
• These are throughput-based systems
• Adding units improves performance
52
The Rise of Multi-Core Processors
http://cpudb.stanford.edu/
53
The Stagnation of Multi-Core Processors
http://cpudb.stanford.edu/
54
Throughput Based Designs
For applications with abundant parallelism
• Leveraging parallelism helps energy efficiency
But when do you stop?
• Lower performance is almost always lower energy
Minimum energy designs:
• Sea of very slow processors
• Meters of silicon area
What to optimize?
55
Optimize Energy/Op vs. Area/Throughput
56
Floating Point Optimization: 180nm – ITRS 10nm
57
In This Space the Details Matter
Implementation of Booth Mux
• More important than whether Booth 2, or Booth 3
How you wire the CSA array
• Is more important than the type of counter
Most fancy adder tricks
• Produce worse designs
58
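For reference, the Booth-2 recoding itself (the slide's point being that the mux implementing these digits matters more than the choice of radix); a functional sketch for a two's-complement multiplier:

```python
def booth2_recode(multiplier, width=16):
    """Radix-4 (Booth-2) recoding: overlapping 3-bit windows b[i+1] b[i] b[i-1]
    yield one digit in {-2, -1, 0, 1, 2} per two multiplier bits, so a
    width-bit multiply needs only width/2 partial products."""
    x = multiplier & ((1 << width) - 1)
    digits, b_prev = [], 0  # b[-1] = 0
    for i in range(0, width, 2):
        b0 = (x >> i) & 1
        b1 = (x >> (i + 1)) & 1
        digits.append(b_prev + b0 - 2 * b1)  # digit = b[i-1] + b[i] - 2*b[i+1]
        b_prev = b1
    return digits  # least-significant digit first

def booth2_value(digits):
    """Reconstruct the signed multiplier value from its radix-4 digits."""
    return sum(d * 4**k for k, d in enumerate(digits))
```

Each digit selects 0, ±x, or ±2x of the multiplicand, all cheap to form with a shift and a conditional invert, which is exactly what the Booth mux on the slide implements.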
Built an FP Generator in 2013
59
https://sites.google.com/a/stanford.edu/fpgen/home
FMA Output
60
CMA vs. FMA
61
For Latency
For Throughput
Have A Shiny Ball, Now What?
62
Today FP Units are Not the Problem
[Die photo: 8 cores surrounded by L1/reg/TLB, L2, and L3 caches]
63
Rough Energy Numbers (45nm)
Integer
• Add: 8 bit 0.03 pJ; 32 bit 0.1 pJ
• Mult: 8 bit 0.2 pJ; 32 bit 3 pJ
FP
• FAdd: 16 bit 0.4 pJ; 32 bit 0.9 pJ
• FMult: 16 bit 1 pJ; 32 bit 4 pJ
Memory
• Cache (64 bit access): 8 KB 10 pJ; 32 KB 20 pJ; 1 MB 100 pJ
• DRAM: 1.3-2.6 nJ
Instruction Energy Breakdown (~70 pJ per instruction)
• I-Cache access: 25 pJ
• Register file access: 6 pJ
• Control, plus the Add itself
64
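Plugging in the slide's rough 45nm numbers makes the point concrete: the arithmetic is a rounding error next to the cost of issuing the instruction that requests it.

```python
# per-operation energies from the slide (45 nm, picojoules)
INSTRUCTION_TOTAL_PJ = 70.0   # full instruction, fetch through execute
ICACHE_PJ = 25.0              # I-cache access
REGFILE_PJ = 6.0              # register file access
INT_ADD_32_PJ = 0.1           # the 32-bit add itself

# the add is a tiny fraction of the instruction that issues it
overhead_ratio = INSTRUCTION_TOTAL_PJ / INT_ADD_32_PJ            # 700x
fetch_fraction = (ICACHE_PJ + REGFILE_PJ) / INSTRUCTION_TOTAL_PJ  # ~0.44
```

This ratio is why the talk turns from FP units to memory and instruction overhead: making the adder better cannot fix a 700x overhead.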
What Is Going On Here?
[Plot: energy efficiency (MOPS/mW, log scale from 0.1 to 10000) for CPUs, GP DSPs, and dedicated hardware; CPUs+GPUs sit roughly 1000x below dedicated designs]
65
The Truth: It’s More About the Algorithm than the Hardware
[Diagram: GPU-friendly algorithms shown as a small subset of all algorithms]
66
Highly Local Computation Model
67
Highly Local Computation Model
68
Highly Local Computation Model
69
Compose These Cores into a Pipeline
Program in space, not time
• Makes building programmable hardware more difficult
70
User code
Cool images
Great, But Can A User Program It?
Frankencamera 4
71
Goals
Have user code in an image-friendly language
• Language should facilitate writing image/vision processing
Analyze/compile the language for different targets
• CPU / GPU / FPGA
Create not just the hardware bit file
• But also the hardware drivers and application level API
72
How: Constructors to Encode Domain Knowledge
Encapsulate domain knowledge in the system
Build constructor from lower level constructors
Clean interfaces are critical
Reuse both constructor and most of the configuration file
73
Halide Language
Language for creating fast image processing apps
Separate algorithm from schedule
Target CPU and GPU
74
What Halide Does For You
Tiled
Fused
Vectorized
Multithreaded
11x faster
• And not readable
75
Architecture Template:Stencil Functions and Line Buffers
Stencil functions consume sliding windows of data
• Huge locality
To capture this locality you need to buffer a few lines
• The line buffer is the hardware buffer block
76
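The line-buffer structure can be sketched behaviorally. This Python model (names are mine, not from the template) streams the image through a 3-row buffer and slides a 3x3 stencil across it, which is the locality the hardware exploits: only three rows ever live on chip.

```python
from collections import deque

def stencil_3x3(rows, kernel):
    """Line-buffered 3x3 stencil: stream the image in row by row, keep only
    the last three rows (the 'line buffer'), and slide a 3x3 window across
    them, yielding one output row per new input row once the buffer fills."""
    buf = deque(maxlen=3)  # the line buffer: at most 3 rows on chip
    for row in rows:
        buf.append(row)
        if len(buf) == 3:
            yield [
                sum(buf[dy][x + dx] * kernel[dy][dx]
                    for dy in range(3) for dx in range(3))
                for x in range(len(row) - 2)
            ]

# a 4x4 image of ones through an all-ones kernel: every window sums to 9
out = list(stencil_3x3([[1] * 4 for _ in range(4)], [[1] * 3] * 3))
# out == [[9, 9], [9, 9]]
```

In hardware the deque becomes a small SRAM holding two full rows plus the incoming one, so bandwidth to the stencil is enormous while off-chip traffic is one pixel in, one pixel out.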
Design Flow
77
Performance Results
Performance compared to Nvidia TK1
78
Energy Results
79
Conclusions
Designing the best arithmetic unit depends on:
• Technology and constraints
• Finding the right metrics is critical
Details matter
• Must assess performance/area/energy of your idea
• Generators (procedural knowledge) are a good approach to do this
Key to performance scaling in the future is the memory
• Need applications with high locality
80