reiner hartenstein, university of kaiserslautern, · pdf filereiner hartenstein, university of...
TRANSCRIPT
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
Reconfigurable HPC
Reconfigurable HPC
part 2
Data-Stream-based Computing
Reiner Hartenstein
TU Kaiserslautern
May 14, 2004 , TU Tallinn, Estonia
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
2
Preface
Response to the Computenik Shock:
the completely wrong roadmap to HPC
blinders to ignore the impact of morphware
... continue to bang their heads against the memory wall
instead of
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
3
Crusty Computing Sciences
[David Padua, John Hennessy]
98.5% vN-only
this monopoly is the problem
went morphware
dead
went morphware
exhausted
memory wall
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
4
Exhausted: Dead Supercomputer Society
•ACRI •Alliant •American Supercomputer
•Ametek •Applied Dynamics •Astronautics •BBN •CDC •Convex •Cray Computer •Cray Research •Culler-Harris •Culler Scientific •Cydrome •Dana/Ardent/ Stellar/Stardent
•DAPP •Denelcor •Elexsi •ETA Systems •Evans and Sutherland •Computer •Floating Point Systems •Galaxy YH-1 •Goodyear Aerospace MPP •Gould NPL •Guiltech •ICL •Intel Scientific Computers •International Parallel Machines
•Kendall Square Research •Key Computer Laboratories
[Gordon Bell, keynote at ISCA 2000].
•MasPar •Meiko •Multiflow •Myrias •Numerix •Prisma •Tera •Thinking Machines •Saxpy •Scientific Computer •Systems (SCS) •Soviet Supercomputers •Supertek •Supercomputer Systems •Suprenum •Vitesse Electronics
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
5
HPC going configware
http://fpl.org
International Conference on Field-Programmable Logic
and Applications (FPL)
Aug. 20 – Sept 1, 2004, Antwerp, Belgium
288 submissions ! accel. µProc.
... going into every type of application
they all work on high performance © 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
6
>> Machine vs. Anti Machine <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Final Remarks http://www.uni-kl.de
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
7
Reconfigurable Computing: a second programming domain
Migration of programming to the structural domain
The opportunity to introduce the structural domain to programmers ...
The structural domain has become RAM-based
... to bridge the gap by clever abstraction mechanisms using a simple new machine paradigm
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
8
machine vs. anti machine
progra
m
counter
DPU CPU
RAM memory
(r)DPA
without sequencer
asMA: auto-sequencing Memory Array
........
asM
asM
asM
asM
asM
(r)DPA
........
data stream machine (anti machine)
data
counter
memory bank
asM
asM: auto-sequencing Memory
instruction stream machine (von Neumann etc.)
matter vs. anti matter
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
9
Data-stream-based Computing
The Anti Universe of Computing
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
10
The anti universe
•Paul Dirac predicted a complete anti universe consisting of antimatter
•“There are regions in the universe, which consist of antimatter .....
•We are not aware, that there is a new area in computing sciences , which consists of antimatter of computing
• .... But there are asymmetries”
•Reconfigurable Computing is made from this antimatter: data-stream-based computing
•when a particle hits its antiparticle, both are converted into energy: Annihilation
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
11
anti particles
• 1956: anti neutron created on Bevatron
• 1928: Paul Dirac: „there should be an anti electron having positive charge“ (Nobel price 1933)
• 1932: Carl David Anderson detected this „positron“ in cosmic radiation (Nobel price 1936)
• 1955 Owen Chamberlain et al. create anti proton on Bevatron
• 1954: new accelerators: cyclotron, like Berkeley‘s Bevatron
• 1965: creation of a deuterium anti nucleus at CERN
hydrogen anti hydrogen
• 1995: hydrogen anti atom created at CERN – by forcing positron and anti proton to merge by very low energy.
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
12
Matter & Antimatter: Atom and Anti Atom
The World of Matter -
machine paradigm: the Atom
Anti Matter -
machine paradigm: Anti Atom
+ + -
- - +
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
13
Matter & Antimatter of Informatics : Machine and Anti Machine
+
CPU
- 1936 1st electronic computer (Konrad Zuse)
Machine paradigm: „von Neumann“
1946 v. N. machine paradigm
1971 1st microprocessor (Ted Hoff)
1979 „data streams“ (systolic array: Kung / Leiserson)
- DPU
+
Anti Machine paradigm
1990 anti machine paradigm published
1995 rDPA / DPSS (supersystolic: Rainer Kress)
novel
compilation
techniques
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
14
- DPU
(r)DPU
+
CPU
instruction stream
Matter vs. antimatter: CPU vs. DPU
-
progra
m
counter
DPU CPU
without sequencer
+
dat
a st
ream
(r)DPA
dat
a st
ream
s
+
+
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
15
heavy anti atoms: DPA = DPU array
- DPA
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU
- DPU -
DPA
+
+
+
+
+
+
+
+
+
cohere
nt d
ata
stre
ams
spin
ning
aro
und
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
16
Terminology: DPU versus CPU ...
• DPU: data path unit • DPA: DPU array • GA: gate array • rDPU: reconfigurable DPU • rDPA: reconfigurable DPA • rGA: reconfigurable GA
• DPU is no CPU: there is nothing central - like in a DPA
DPU DPU
DPU instruction sequencer
CPU
DPA (r)
(r)
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
17
Parallelism by Concurrency
+ -
+
- -
+
- +
+
-
- +
- +
independent instruction streams difficult ...
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
18
Annihilation?
- +
-
+ -
+
avoidable by careful
methodology
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
19
CS education .....
software person
procedural
structural
hardware person
Configware / Software Co-Design? Hardware / Software Co-Design?
Annihilation avoidable by CS curricular revision
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
20
>> Terminology <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Final Remarks http://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
21
„Re-configurable Hardware“ ??
„Re-configurable Hardware“ ??
this „Hardware“ is not hard !
We need a concise terminology: a consensus is on the way
it‘s Morphware
Terminology has been highly confusing
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
22
Configware and Flowware ... ... are the sources for programming morphware.
Software is the source for programming traditional hardwired processors (instruction-stream-driven: von Neumann machine paradigm and its derivatives)
For Configware and Flowware we prefer the anti machine paradigm – conterpart of von Neumann
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
23
de facto Duality of RAM-based platforms
traditional new
RAM-based platform CPU morphware (FPGA, rDPA ..)
„running“ on it: software configware
machine paradigm von Neumann etc.: instruction-stream-based
anti machine: data-stream-based
2nd paradigm
We now have 2 types of programmable platforms
hardware viewed as frozen configware:
Just earlier binding
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
24
Terminology: Digital System Platforms clearly distinguished
platform source running on it
machine paradigm
hardware (not running on it)
none
morphware
fine grain rGA (FPGA) configware
coarse grain
rDPU, rDPA reconfigurable data stream processor
flowware & configware anti machine
data stream processor (hardwired) flowware
instruction stream processor software von Neumann machine
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
25
Terminology: Digital System Platforms clearly distinguished
platform source running on it
machine paradigm
hardware (not programmable)
none
morphware
fine grain rGA (FPGA) configware
coarse grain
rDPU, rDPA reconfigurable data stream processor
flowware & configware anti machine
data stream processor (hardwired) flowware
instruction stream processor software von Neumann machine
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
26
Importance of binding time
Configuration: like a kind of pre-packed frozen-in „super instruction fetch“
load time
compile time
Reconfigurable Computing
not all switching is done by Configware
run time Microprocessor Parallel Computer
time of “instruction
fetch” read new
instruction read new instruction
c 0 1
for a pipe network
1
0
c
1
0
c
configure datapaths
fabrication time
Full custom oder ASIC
fabricate a
datapath
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
27
Software vs Flowware and Configware
Programming source for instruction-stream-based computing (von Neumann etc.):
The programming source for data-stream-based computing operations (the anti machine paradigm):
Programming sources for Reconfigurable Computing (morphware):
Software
Flowware
Flowware and Configware
Sources for Embedded Systems:
Flowware, Configware & Software
µProc.
compile
rec.anti machine
compile
rec.anti machine µProc.
partit. compiler
hdw.anti machine
d.schedule
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
28
control-procedural vs. data-procedural
The structural domain is primarily data-stream-based:
..... mostly not yet modelled that way: most flowware is hidden by its indirect
instruction-stream-based implementation
Flowware converts „procedural vs. structural“ into „control-procedural vs. data-procedural“ ...
Flowware
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
29
Flowware Languages
MoPL: fully supporting the anti machine paradigm – the counterpart of the von Neumann paradigm
general purpose:
Brook: for modern graphics hardware
Streams-C: defines 1-D streams; generates VHDL (LANL Open-source C compiler targets FPGAs)
DSP-C: allows to describe key features of DSPs www.dsp-c.org
(Data Streaming Language examples)
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
30
domain procedural structural
computing in ... time only* space and time
program source software*
hardwired reconfigurable
currently emerging
(hardware +) software**
(hardware +) flowware
configware + flowware
„instruction“ fetch at runtime
before fabrication at loading time
data „fetch“ at run time **) software „simulates“ flowware
algorithms variable
resources variable
reconfigurable:
http://hartenstein.de
programming: procedural vs. structural
algorithms fixed
resources fixed
fully hardwired: not programmable
*) only one source needed
algorithms variable
resources fixed
CPU:
embedded systems: data-stream-based instruction-stream-based
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
31
Glossary
DPU data path unit rDPU reconfigurable DPU DPA data path array (DPU array) rDPA reconfigurable DPA ISP instruction set processor AM anti machine AMP data stream processor* rAMP reconfigurable AMP
*) no “dataflow machine”
platform category
source „running“ on platform
machine paradigm
hardware (not programmable) none
ISP** software von Neumann
• morphware configware FPGA: none
data stream processor (AMP*)
flowware
anti machine reconfigurable
AMP (rAMP)
flowware &
configware
digital system platforms:
morphware use granularity (path width) (re)configurable blocks
reconfigurable logic • fine grain (FPGA) (~1 bit) CLBs
reconfigurable computing coarse grain (e.g. 32 bits) rDPUs (e.g. ALU-like)
multi granular: by slice bundling rDPU slices (e.g. 4 bits)
categories of morphware:
approaching consensus
**) Von Neumann etc. *) data stream processor © 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
32
data streams*: not new
1980: data streams (Kung, Leiserson: systolic arrays)
1989: data-stream-based Xputer architecture
1990: rDPU (Rabaey)
1994: Flowware Language MoPL (Becker et al.)
1995: super systolic array (rDPA) + DPSS tool (Kress)
1996+: Streams-C language, SCCC (Los Alamos), SCORE, ASPRC, Bee (UC Berkeley), DSP-C, Brook, ...
1996: configware / software partitioning compiler (Becker)
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
33
URLs: Software vs Flowware and Configware
µProc.
compile
rec.anti machine
compile
rec.anti machine µProc.
partit. compiler
hdw.anti machine
d.schedule
http://data-streams.org/
http://anti-machine.org/
http://kressarray.de/
http://flowware.net/
http://configware.org/
http://morphware.net/
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
34
>> Embedded Computing <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Final Remarks http://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
35
Fine-grain Morphware: Drawbacks
• FPGA Architectures – SRAM-based Look-up Tables (LUTs)
– Problems:
• Routing: reduces Performance
• Bad Ratio: active / passive Elements
reconfigurable Interconnect (Switching Boxes)
Configurable Logic
Block (CLB)
Source: R. Hartenstein
LUT
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
36
Reconfigurability Overhead
S S
S S
resources needed for reconfigurability
partly for configuration code storage
L
L L
L L
L
L L L
area used by application
“hidden RAM” not shown
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
37
Throughput vs. Efficiency
1000
100
10
1
0.1
0.01
0.001 2 1 0.5 0.25 0.13 0.1 0,07
MOPS / mW
µ feature size
S S
S S
resources needed for
reconfigurability
L
L L
L L
L
L L L
area used by application
1 Bit CLB
T. Claasen et al.: ISSCC 1999
Wiring by abutment: 32 Bit example
*) R. Hartenstein: ISIS 1997
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
38
One more argument for coarse grain
100
1000
10
1
0.1
0.01
0.001 2 1 0.5 0.25 0.13 0.1 0,07
MOPS / mW
µ feature size
T. Claasen et al.: ISSCC 1999
Wiring by abutment: a 32 Bit KressArray example
if coarse grain cells are full custom and
mesh-connected, and 2nd level interconnect
ressources layouted over the cells
*) R. Hartenstein: ISIS 1997
the array is almost as
area-efficient as hardwired
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
39
It’s a Paradigm Shift !
• Using FPGAs (fine grain reconfigurable) just mainly has been classical Logic Synthesis on a “strange hardware” platform
• Coarse Grain Reconfigurable Arrays (rDPAs) (Reconfigurable Computing), however, mean a really fundamental Paradigm Shift
• This is still ignored by CS and EE Curricula and almost all R&D scenes
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
40
rDPA (coarse grain) vs. FPGA (fine grain)
roughly: area efficiency
(transistors/chip, orders of magnitude)
hardwired 4
FPGA 2 µProc 0
rDPA 4
roughly: performance (MOPS/mW,
orders of magnitude)
hardwired 3
FPGA 2
µProc 0
rDPA 3
DSP 1
Status: ~1998
commodity commodity
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
41
Mega-rGAs 10 000 000
1 000 000
100 000
10 000
1 000
1984 1986 1988 1990 1992 1994 1996 1998 2000 2002 2004
planned
Virtex II
XC 40250XV
Virtex
XC 4085XL
100
System gates per rGA chip
Jahr
[Xilinx Data]
200
500
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
42
• FPGA Fabric-based on Virtex-II Architecture
Source: Ivo Bolsens, Xilinx
On Chip Memory
Controller
Power PC Core
Embeded RAM
Rocket IO
entire system on a single chip
all you need on board all you need on board
• Xilinx Virtex-II Pro FPGA Architecture
• PowerPC 405 RISC CPU (PPC405) cores
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
43
Mask & NRE cost [ST microelectronics]
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
44
>> coarse-grained Platforms <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Final Remarks http://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
45
computing in space
Computing in space and time
data
streams
y 1 0 ( )
y 2 0 ( )
y 3 0 ( )
-
-
-
y 1
y 2
y 3
-
-
- x
1
x 2
x 3
-
- -
computing in time
a 12
a 11 a
21
a 32
a 31
a 23
a 33
a 22
a 13
placement
systolic arrays etc.
and other transformations migration by re-timing
this dichotomy is completely ignored
by our CS curricula
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
46
DPA
x x x
x x x
x x x
|
| |
x x
x
x
x
x
x x
x
- -
-
input data streams
x x
x
x
x
x
x x
x
- -
-
-
-
-
-
-
-
-
-
-
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
| output data streams
time
port #
time
time
port # time
port #
Flowware defines: ... which data item at which time at which port
Flowware programs data streams
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
47
Mathematicians X-ing
Mathematicians like to wax rhapsodic about the elegance, beauty and depth of their proofs.
John Horgan: „The End of Science Revisited“
Systolic
Synthesis
Mathematicians like the beauty and elegance of Systolic Arrays. Due to a lack of depth in understanding, their efforts yielded poor synthesis algorithms.
Reiner Hartenstein
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
48
2
Generalized Stream-based Computing System heterogenous Array of rDPUs (reconf. data path units)
Scheduler
Mapper
expression tree
DPU architectures
y
+ *
x
a
1
simultaneous
placement
& routing
3
+
+ +
+
*
*
* sh
* sh
sh sh
xf
xf
-
- data
streams
4
The same mapper for both: Reconfigurable, or hardwired
Kress DPSS [1995]
Configware
Compiler
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
49
Supersystolic Array Principles
• take systolic array principles
• replace classical synthesis by simulated annealing
• yields the supersystolic array
• a generalization of the systolic array
• no more restricted to regular data dependencies
• now reconfigurability makes sense: use morphware
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
50
flowware history
time
port #
time
DPA
x x x
x x x
x x x
|
| |
x x
x
x
x
x
x x
x
- -
-
input data streams
x x
x
x
x
x
x x
x
- -
-
-
-
-
-
-
-
-
-
-
x x x
x x x
x x x
|
|
|
|
|
|
|
|
|
|
|
| output data streams
time
port # time
port #
1980: data streams
(Kung, Leiserson) 1995: super systolic
rDPA (Kress) 1996+: SCCC (LANL),
SCORE, ASPRC, Bee (UCB), ...
(tutorials and courses available on all this)
flowware history:
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
51
computing paradigms and methodologies
1946: machine paradigm (von Neumann)
1980: data streams (Kung, Leiserson)
1989: anti machine paradigm
1990: rDPU (Rabaey)
1994: anti machine high level programming language
1995: super systolic array (rDPA)
1996+: SCCC (LANL), SCORE, ASPRC, Bee (UCB), ...
1997+: discipline of distributed memory architecture
1997: configware / software partitioning compiler
flow
war
e*
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
52
Super Pipe Networks
pipeline properties array applications
shape resources
mapping scheduling
(data stream formation)
systolic array
regular data
dependencies only
linear only
uniform only
linear projection or algebraic synthesis
super-systolic rDPA
no restrictions simulated
annealing or P&R algorithm
(e.g. force-directed) scheduling algorithm
*
*) KressArray [1995]
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
53
Programming Language Paradigms
language category Von Neumann Languages Anti Machine Languages
both deterministic procedural sequencing: traceable, checkpointable
operation sequence driven by:
read next instruction, goto (instr. addr.),
jump (to instr. addr.), instr. loop, loop nesting
no parallel loops, escapes, instruction stream branching
read next data item, goto (data addr.),
jump (to data addr.), data loop, loop nesting, parallel loops, escapes, data stream branching
state register program counter data counter(s)
address computation
massive memory cycle overhead overhead avoided
Instruction fetch memory cycle overhead overhead avoided
parallel memory bank access interleaving only no restrictions
language features control flow + data manipulation
data streams only (no data manipulation)
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
54
Similar Programming Language Paradigms
language category Computer Languages Xputer Languages
both deterministic procedural sequencing: traceable, checkpointable
sequencingdriven by:
read next instruction, goto (instruction addr.), jump (to instruction addr.), instruction loop, instruction loop nesting no parallel loops, instruction loop escapes, instruction stream branching
read next data object, goto (data addr.), jump (to data addr.), data loop, data loop nesting, parallel data loops, data loop escapes, data stream branching
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
55
Machine Paradigms
machine category Computer (the Machine:
“v. Neumann”) The Anti Machine
driven by: Instruction streams data streams (no “dataflow”)
engine principles instruction sequencing sequencing data streams
state register single program counter (multiple) data counter(s)
Communication path set-up .
at run time at load time
resource DPU (e.g. single ALU) DPU or DPA (DPU array) etc. data path
operation sequential parallel pipe network etc.
( “instruction fetch” )
also hardwired implementations* *) e g. Bee project Prof. Broderson
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
56
rDPU not used used for routing only operator and routing port location markerLegend: backbus connect
array size: 10 x 16 = 160 rDPUs
mapping algorithms efficently onto rDPA
rout thru only
not used backbus connect
SNN filter on KressArray
by the way: example of scalability / relocatability by EDA support
also FPGA scalability (avoid routing congestion) by EDA solution
rDPA
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
57
rDPU not used used for routing only operator and routing port location markerLegend: backbus connect
route-thru-only rDPU
3 vert. NNports, 32 bit
http://kressarray.de
Xplorer Plot: SNN Filter Example
+ [13]
2 hor. NNports, 32 bit
operator
result
operand
operand
route thru
backbus connect
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
58
>>> distributed memory
(application-specific)
distributed memory
The new discipline of
[] Herz et al.: proc. IEEE ICECS 2002
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
59
Synthesizable distributed memory architecture...
Memory (data memory)
memory bank
memory bank
memory bank
memory bank
memory bank
...
...
Scheduler
for a Stream-based Soft Machine
rDPA “instructions”
Compiler
Sequencers (data stream
generator)
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
60
http://kressarray.de
Efficient Memory Communication should be directly supported by the Mapper Tools
sequencers
memory ports
application
not used
Legend: Optimized Parallel Memory Controller
Synthesizable Distributed Memory
An example by Nageldinger’s KressArray Xplorer
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
61
Coarse Grain Architectures
style project first
publ.
source architecture granularity fabrics mapping intended target application
DP-FPGA 1994 [4] 2-D array 1 & 4 bit multi-granular Inhomog. routing channels switchbox routing regular datapaths
KressArray 1995 [5,11] 2-D mesh family: sel. pathwidth multiple NN & bus segments (co-)compilation (adaptable)
Colt 1996 [12] 2-D array 1 & 16 bit inhomogenous run time reconfiguration highly dynamic reconfig.
Matrix 1996 [15] 2-D mesh 8 bit, multi-granular 8NN, length 4 & global lines multi-length general purpose
RAW 1997 [17] 2-D mesh 8 bit, multi-granular 8NN switched connections switchbox rout experimental
Garp 1997 [16] 2-D mesh 2 bit global & semi-global lines heuristic routing loop acceleration
REMARC 1998 [18] 2-D mesh 16 bit NN & full length buses (info not available) multimedia
MorphoSys 1999 [19] 2-D mesh 16 bit NN, length 2 & 3 global lines manual P&R (not disclosed)
CHESS 1999 [20] hexagon 4 bit, multi-granular 8NN and buses JHDL compilation multimedia
DReAM 2000 [21] 2-D array 8 &16 bit NN, segmented buses co-compilation next generation wireless
CS2000 family 2000 [23] 2-D array 16 & 32 bit inhomogenous array (not disclosed) communication
MECA family 2000 [24] 2-D array multi-granular (not disclosed) (not disclosed) tele- & datacommunication
CALISTO 2000 [25] 2-D array 16 bit multi-granular (not disclosed) (not disclosed) tele- & datacommunication
mesh
FIPSOC 2000 [26] 2-D array 4 bit multi-granular (not disclosed) (not disclosed) tele- & datacommunication
RaPID 1996 [27] 1-D array 16 bit segmented buses channel routing pipelining linear
PipeRench 1998 [29] 1-D array 128 bit (sophisticated) scheduling pipelining
PADDI 1990 [30] crossbar 16 bit central crossbar routing DSP
PADDI-2 1993 [32] crossbar 16 bit multiple crossbar routing DSP and others Cross bar
Pleiades 1997 [33] mesh+crossbar multi-granular multiple segmented crossbar switchbox routing multimedia
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
62
Primarily Mesh-based ….
market project bits granularity source
KressArray variable U. Kaiserslautern
Garp 2 UC Berkeley
CHESS 4 Hewlett Packard
Matrix
RAW8 M.I.T.
Colt 1 & 16 Virginia Tech
DReAM 8 &16 TU Darmstadt
REMARC Stanford
research
MorphoSys UC Irvine
CALISTO Slicon Spice
MECA family
16
Malleable
CS2000 family 16 & 32 Chameleon Systems
FIPSOC 16 & analog SIDSA
commercial
XPP XPU128 32 PACT Corp.
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
63
UC Berkeley (Jan Rabaey)
market project bits granularity source
PADDI
PADDI-2research
Pleiades
16 UC Berkeley
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
64
commercial rDPA example:
PACT XPP - XPU128
XPP128 rDPA
• Evaluation Board available, and • XDS Development Tool with Simulator
buses not
shown
rDPU
CF
G
PAE
core
ALU CtrlALU
CF
GC
FG
PAE
core
CF
GC
FG
PAE
core
PAE
core
ALU CtrlALUALU CtrlALU
CF
GC
FG
CF
GC
FG
• Full 32 or 24 Bit Design working silicon • 2 Configuration Hierarchies
© PACT AG, Munich http://pactcorp.com
rDPA
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
65
XPP64A: Platform Development Board - SDR Board In Debug Phase -> XPP64A Chips from STMicro Fab
- Assembly & Test / Available March 2003
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
66
PACT Corp
• Xtreme Processor Platform (XPP) family of IP cores, high-speed data-stream-capable, scalable, reconfigurable clusters of arrays of 32-bit DPUs with embedded memories, and high-speed I/O ports -
• Application development support software featuring a flow graph-style algorithm mapping language - to minimize training requirements.
• XPP's fabrics, featuring automatic DataFlow synchronization and flagged Event Network to dynamically configure the execution flow,
• Supports dynamic RTR: hierarchical configuration managers free the designer from chip-level details and ensure that configurations are independently loaded in exactly the intended order.
• Automatic event-based task swapping along with data streams: released resources automatically reconfigured immediately
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
67
Sequential Processor Model
Conventional processors use the sequential model:
Each operation takes one clock cycle.
Multiple operations are computed consecutively.
Register Operation 1
Time
Operation 2
Operation 3
Operation 4
Operation 5
© 2003, PACT AG
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
68
A New Parallel Processor Paradigm
Multiple computations are configured as code sections onto
a two dimensional array.
Time
x
y
Data Buffer
© 2003, PACT AG
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
69
Parallel Processor Model
Multiple code sections are computed sequentially.
Time
y
x Operation 2
Section 1
Section 2
Section 3
© 2003, PACT AG
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
70
Dataflow Performance
Traditional Microprocessor XPP Architecture
SHIFT
Instruction
Memory and cache
Configuration
Memory and cache
MULT
One operation
per cycle
Many
operations
per cycle FFT
Viterbi
Basic machine operations
performed on
single words
Complex Functions
performed on
data streams
ADD
Register
One word
ALU Buffer
Stream
of words
Array of ALUs
Filter
© 2003, PACT AG
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
71
instruction stream-based Compilation Principles
scheduler
parser
source text
library
link/load instruction call placement
1-D memory space
execution order by location
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
72
Datastream-based Compilation Principles
library
data stream assembly
scheduler
mapper placement & routing
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
73
>> Dual Machine Paradigms <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Final Remarks http://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
74
3rd machine model became mainstream
1957
1967
1977
1987
1997
2007
computer age (PC age)
accel.
design
µProc.
compile
(Makimtos wave)
mainframe age
main frame
compile
instruction- stream-based
DPA r r µProc.
morphware age
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
75
benefit from RAM-based & 2nd paradigm
RAM-based platform needed for:
• flexibility, programmability • avoiding the need of specific silicon
simple 2nd machine paradigm needed as a common model:
• to avoid the need of circuit expertize • needed to to educate zillions of programmers
what programming source language ?
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
76
McKinsey Curve: dynamics of R&D disciplines
maturity of a discipline
year
fundmental issues
consolidation
saturation: limitations met
new discipline on top of it by ....
... by innovation
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
77
EDA Industry Revolutions
1978
Transistor entry: Applicon, Calma, CV ...
1992
Synthesis: Cadence, Synopsys ... 1985
Schematics entry: Daisy, Mentor, Valid ...
courtesy [Keutzer / Newton]
EDA industry paradigm switching every 7 years
1999 HLLs, (Co-) Compilation
Data-Stream-based DPU arrays
2006 coming closer to programmers‘ mind set
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
78
How to achieve acceptance
No hardware description languages
Courses tailored for students not being hardware-savvy
Tools usable by users not being hardware designers
EDA tools based on term rewriting [Arvind] [Mauricio Ayala]
[Courtesy Richard Newton]
Your name here: your proposals
how to hide the ugliness from the user [Herman Schmit]
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
79
configware compiler
anti machine
configware compiler
source „program“
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
80
anti machine
Configware Compilation
data
counter
memory bank
asM
asM
asM
asM
asM
asM
........
(r)DPA configware compiler
source „program“
mapper
Placement & routing
data scheduler
flowware code
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
81
symbiosis of machine models
µProc.
configware compiler
software compiler
anti machine
source „program“
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
82
symbiosis of machine models
DPA r r µProc.
partitioning compiler
source „program“
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
83
Software / Configware Co-Compilation
Analyzer / Profiler
SW code
SW compiler
para d igm “vN" machine
CW Code
CW compiler
anti machine paradigm
Partitioner
Resource Parameters
supporting different platforms
Juergen Becker’s CoDe-X, 1996
High level PL source
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
84
Loop Transformation Examples
loop 1-8 body body endloop
loop 1-8 body endloop
loop 9-16 body endloop
fork
join
strip mining
loop 1-4 trigger endloop
loop 1-2 trigger endloop
loop 1-8 trigger endloop
reconf.array: host: loop 1-16 body endloop
sequential processes: resource parameter driven Co-Compilation
loop unrolling
Reiner Hartenstein, University of Kaiserslautern, Germany http://hartenstein.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
85
>> final remarks <<
• Machine vs. Anti Machine
• Terminology
• Morphware Platforms
• coarse-grained Platforms
• Dual Machine Paradigms
• Final Remarks http://www.uni-kl.de
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
86
Jürgen Becker
Dissertation
Jürgen Becker:
• ... Automatically partitioning Co-compiler
• (configware / software co-compilation)
• Resource-parameter-driven retargettable
• Profiler-driven optimization
• Accepts HLL „ALE-X“ (extended C subset)
• (subset: pointers not supported)
Professor at Univ. Karlsruhe
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
87
Rainer Kress
Dissertation
Rainer Kress:
• ... on mapping applications onto his* KessArray
• DPSS datapath synthesis system
• Including a data scheduler
• (data stream scheduler)
• Generalization of the Systolic Array
• (KressArray is a super systolic array)
• 32 bit design via Eurochip support
infineon technologies, Munich
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
88
More Presentations / Literature
• http://hartenstein.de/keynotes.html
• http://hartenstein.de/publications.html
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
89
Antimatter Search ?
Antimatter Search
in EE & CS we do not need to search
© 2004, [email protected] http://hartenstein.de
TU Kaiserslautern
90
END