(Keynote) (From HPC to) New Horizons of Very High Performance Computing (VHPC): Hurdles and Chances

Reiner Hartenstein

TU Kaiserslautern

Rhodes Island, Greece, April 25-26, 2006

© 2006, [email protected] http://hartenstein.de

Reconfigurable Supercomputing (VHPC) going commercial

Cray XD1
Silicon Graphics RASC
… and other vendors

… it's a paradigm shift!


The Pervasiveness of RC

[Figure: two charts of the # of hits by Google for searches of the form "FPGA and ….", one for the ECE-savvy scene and one for the Math/SW-savvy scene. Counts, first chart: 162,000; 127,000; 158,000; 113,000; 171,000; 194,000. Counts, second chart: 1,620,000; 915,000; 398,000; 272,000; 647,000; 1,490,000.]

unqualified for RC?


world-wide a mass movement
reminds me of the mass migration of lemmings

Methodology?
terminology chaos
not really a sense of direction
an urgent need to get organized


>> Outline <<

• Reconfigurable Computing Paradox
• The Supercomputing Paradox
• We are using the wrong model
• Coarse-grained Reconfigurable Devices
• Super Pentium for Desktop Supercomputer

http://www.uni-kl.de


The Reconfigurable Computing Paradox

poor FPGA technology:
very poor effective integration density
"very power-hungry" [Rick Kornfeld*]
lower clock frequencies, and more expensive

poor tools:
very poor application development support
languages and tools unacceptable for software people
most hardware experts (86%**) hate their tools

poor education:
RC education: extremely poor, or none
ignored by CS curricula
… teach like for a 50 year old mainframe …

However, brilliant results everywhere

what paradox?

*) personal communication **) DeHon '98


Education?

Computing Curricula 2004 fully ignores Reconfigurable Computing
Joint Task Force for
FPGA & synonyms: 0 hits
not even here
(Google: 10 million hits)


Completed?

Computing Curricula v. 2005: no changes other than "… FPGA, etc." (not really mentioning that it's missing)
Task force activity completed? Next task force in 2020 or later?


Tools?

End of this week: brainstorming session at DARPA
(urgently needed – overdue!)


fine-grained RC: 1st DeHon's Law [1996: Ph.D., MIT]

Technology: immense area inefficiency

[Figure: density in transistors / microchip (log scale, 10^0 to 10^9) over the years 1980 to 2010. Curves: microprocessor (Gordon Moore curve), FPGA physical, FPGA routed, FPGA logical. The gaps between the curves are labeled wiring overhead, routing congestion, and reconfigurability overhead; overall overhead: >> 10 000.]


published speed-up factors

http://xputers.informatik.uni-kl.de/faq-pages/fqa.html

[Figure: relative performance (log scale, 10^0 to 10^9) over the years 1980 to 2010, from the 8080 to the Pentium 4. Curve annotations: Microprocessor, Memory, 7%/yr, 50%/yr, x1.25 / yr (Moore), FPGA x 2/yr. Published speed-up factors are plotted by application area, as listed below.]

DSP and wireless: Reed-Solomon Decoding 2400; Viterbi Decoding 400; FFT 100; MAC 1000
Image processing, Pattern matching, Multimedia: real-time face detection 6000; video-rate stereo vision 900; pattern recognition 730; SPIHT wavelet-based image compression 457
Bioinformatics: Smith-Waterman pattern matching 288; BLAST 52; protein identification 40; molecular dynamics simulation 88
Astrophysics: GRAPE 20
crypto 1000
Los Alamos traffic simulation 47
2-D FIR filter [TU-KL] 39.4
Lee Routing (by TU-KL) 160
pre-FPGA era (MoM Xputer architecture): Grid-based DRC, no FPGA, DPLA on MoM by TU-KL: 2000; Grid-based DRC ("fair comparison"): 15000


pre-FPGA era: Why DPLA* was so good

Large arrays of canonical boolean expressions
PLA layout similar to RAM / ROM layout
Close to Moore because of small overhead (wiring, programmability, routing)
Mid-'80s: first very tiny FPGAs available
GAG (Generic Address Generator) to avoid address computation overhead

*) designed by TU-KL, fabricated by E.I.S. German multi university project

ASM: Auto-Sequencing Memory
ASM means: no instruction streams needed for address computation
Generalization of DMA
[M. Herz et al.: ICECS 2003, Dubrovnik]


(anti-von-Neumann machine paradigm)

Data Counter instead of Program Counter
Generalization of the DMA

[Figure: ASM (Auto-Sequencing Memory), built from data counter, GAG, and RAM]

GAG & enabling technology: published 1989 [by TU-KL]
Survey paper: [M. Herz et al.*: IEEE ICECS 2003, Dubrovnik]
patented by TI** 1995
Storage scheme optimization methodology, etc.
ASM means: no instruction streams needed for address computation

*) IMEC & TU-KL
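
The data counter idea can be illustrated with a small sketch. The Python generator below is only an illustration under stated assumptions: the name `gag_scan`, its parameters, and the row-major block scan are hypothetical and not taken from the GAG publications. It shows how a few parameters, set once at configuration time, can yield a whole address sequence without fetching any instruction for address computation.

```python
from typing import Iterator


def gag_scan(base: int, row_stride: int, rows: int, cols: int) -> Iterator[int]:
    """Hypothetical generic address generator (GAG) for a 2-D block scan.

    The four parameters play the role of configuration registers that are
    set once; afterwards the data counter steps on its own.
    """
    for r in range(rows):                  # slow dimension of the scan
        row_start = base + r * row_stride
        for c in range(cols):              # fast dimension of the scan
            yield row_start + c            # next value of the data counter


# Assumed memory layout: an 8-word-wide image stored row by row.
memory = list(range(100, 164))             # 64 words holding 100..163

# Stream a 2 x 3 window whose top-left corner sits at address 9.
addresses = list(gag_scan(base=9, row_stride=8, rows=2, cols=3))
stream = [memory[a] for a in addresses]
print(addresses)
print(stream)
```

In this sketch the loop nest stands in for hardware counters; on a real ASM the equivalent stepping would happen without any program running on a CPU.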


Thousands or Millions of $ for free

Application migration [from supercomputer] results not only in massive speed-ups: electricity bills are reduced by an order of magnitude and even more, which you may get for free … up to millions of dollars per year

(also a matter of national energy policy)

[Figure labels: Google, Amsterdam, NY]


Reconfigurable Scientific Computing

How do software types program the FPGAs? Hiring a good student from the EE Dept.?
Because of missing RC education: far away from optimum solutions? Much higher speed-up achievable?
1 or 2 more orders of magnitude? 100,000? 1,000,000?


By education: better speed-up factors?

[Figure: the chart of published speed-up factors again (relative performance, 1980 to 2010), overlaid with the question: tools & edu available?]


The Supercomputing Paradox

promising technology:
Growing listed Teraflops
Increasing number of processors running in parallel
COTS processor decreasing cost

poor results:
Often limited sustained Teraflops
Almost stalled application implementation progress
Very high total cost of the Tera(?)flops
Scientists waiting for affordable compute capacity

The Law of More: programmer productivity shrinking with growing number of processors


Why traditional supercomputing / HPC failed

instruction-stream-based: memory-cycle-hungry
the wrong way, how the data are moved around
because of the wrong multi-core interconnect architecture
extremely unbalanced

[Figure: CPU, stolen from Bob Colwell]


Earth Simulator

5120 Processors, 5000 pins each
ES 20: TFLOPS
Crossbar weight: 220 t, 3000 km of thick cable
moving data around inside the Earth Simulator


Bringing together data and processor

Moving data to the processor: moving the grand piano by Software


coarse-grained RC: Hartenstein's Law [1996: ISIS, Austin, TX]

area efficiency very close to Moore's law

[Figure: density in transistors / microchip (log scale, 10^0 to 10^9) over the years 1980 to 2010. Curves: Gordon Moore curve, rDPA physical, rDPA logical (e.g. KressArray family). The FPGA routed curve and its >> 10 000 gap are shown for comparison.]


higher speed-up factors by coarse-grained?

[Figure: the chart of published speed-up factors again (relative performance, 1980 to 2010), overlaid with the question: Coarse-grained arrays?]


Coarse grain is about computing, not logic

SNN filter on KressArray (mainly a pipe network) [Ulrich Nageldinger]
array size: 10 x 16 = 160 rDPUs
rDPU: reconfigurable Data Path Unit, e.g. 32 bits wide
no CPU

Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect


SW-to-CW migration example (coarse-grained)

[Figure: array of rDPUs with the operator + and the output S]


Compare it to software solution on CPU

S = R + (if C then A else B endif);

on a very simple CPU, C = 1 (columns: memory cycles, nanoseconds)

if C then read A: read instruction; instruction decoding; read operand*; operate & register transfers
if not C then read B: read instruction
add & store: read instruction; operate & register transfers; store result
total

[Figure: data path with inputs A, B, R, and C (=1), an adder (+), and output S; clock 200 MHz]


hypothetical branching example to illustrate software-to-configware migration

simple conservative CPU example, C = 1

step                              memory cycles   nanoseconds
if C then read A
  read instruction                1               100
  instruction decoding
  read operand*                   1               100
  operate & reg. transfers
if not C then read B
  read instruction                1               100
add & store
  read instruction                1               100
  operate & reg. transfers
  store result                    1               100
total                             5               500

*) if no intermediate storage in register file

[Figure: data path with inputs A, B, R, and C (=1), an adder (+), and output S; clock 200 MHz (5 nanosec)]

no memory cycles: speed-up factor = 100
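
The arithmetic behind the speed-up factor can be retraced in a few lines. This is a sketch using only the figures on the slide (5 memory cycles of 100 ns each on the CPU, a 200 MHz clock for the data path); the variable names and the assumption that the data path delivers one result per clock period are illustrative.

```python
# Figures taken from the slide; the names are illustrative.
MEMORY_CYCLE_NS = 100      # one memory cycle on the simple CPU
MEMORY_CYCLES = 5          # 3 instruction fetches + 1 operand read + 1 store
CLOCK_MHZ = 200            # clock of the configware data path

cpu_time_ns = MEMORY_CYCLES * MEMORY_CYCLE_NS
datapath_time_ns = 1000 / CLOCK_MHZ        # one clock period in nanoseconds
speed_up = cpu_time_ns / datapath_time_ns


def datapath(a: int, b: int, r: int, c: bool) -> int:
    """S = R + (if C then A else B): a multiplexer feeding an adder."""
    return r + (a if c else b)


print(cpu_time_ns, datapath_time_ns, speed_up)
print(datapath(a=2, b=3, r=10, c=True))
```

The function `datapath` only mirrors the expression from the slide; in configware the multiplexer and the adder would be placed into the data stream instead of being executed as instructions.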


Why the speed-up? What's the difference?

moving the locality of operation into the route of the data stream by P&R
instead of moving data by instruction streams


The wrong mind set ....

"but you can't implement decisions!"

[Figure: section of a very large pipe network [Ulrich Nageldinger] containing the decision: inputs A, B, R, and C (=1) and an adder (+). Legend: rDPU not used; used for routing only (route-through only); operator and routing; port location marker; backbus connect]

not knowing this solution: symptom of the hardware / software chasm and the configware / software chasm

We need Reconfigurable Computing Education


The new paradigm: how the data are traveling

[Figure: DPU, DPU, DPU connected as a pipe]

pipeline, or chaining
super systolic array
no, not by instruction execution
not transport-triggered: old hat (vN Move Processor, instruction-driven) [Jack Lipovski, EUROMICRO, Nice, 1975]

P&R: move locality of operation, not data!


Data streams

H. T. Kung paradigm (systolic array)

[Figure: a DPA (pipe network) with an input data stream and output data streams, each drawn as a diagram over the axes time and port #]

"data streams" define: ... which data item at which time at which port

implemented by distributed memory: ASM (Auto-Sequencing Memory), built from data counter, GAG, and RAM
50 & more on-chip ASM are feasible
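
What such a data stream definition looks like can be sketched as a table from (time, port) to data item. The example below is a minimal illustration, assuming the textbook systolic input skew in which row i of a matrix enters through port i, delayed by i time steps; the function name `skewed_schedule` and the skew rule are assumptions, not taken from the slides.

```python
from typing import Dict, List, Tuple


def skewed_schedule(matrix: List[List[int]]) -> Dict[Tuple[int, int], int]:
    """Map (time, port) -> data item for a row-per-port, skewed stream.

    Assumption for illustration: row i enters through port i and is
    delayed by i time steps.
    """
    schedule: Dict[Tuple[int, int], int] = {}
    for port, row in enumerate(matrix):
        for j, item in enumerate(row):
            schedule[(port + j, port)] = item
    return schedule


x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
sched = skewed_schedule(x)

# Print the streams as a time / port diagram; '-' marks an idle slot.
n_ports = len(x)
n_steps = max(t for t, _ in sched) + 1
for t in range(n_steps):
    print(t, " ".join(str(sched.get((t, p), "-")) for p in range(n_ports)))
```

Each ASM at the array boundary would then only need to deliver its own column of this table, one item per time step.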


The Generalization of the Systolic Array

systolic array: only for applications with regular data dependencies
remedy? discard algebraic synthesis methods
[R. Kress]: use optimization algorithms, e.g.: simulated annealing
Achievement: also non-linear and non-uniform pipes, and even wilder pipe structures possible
reconfigurability makes sense

Kress-Kung paradigm: super systolic array
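
How an optimization algorithm can replace algebraic synthesis may be sketched with a toy placement by simulated annealing. This is not the KressArray mapper: the cost function (total Manhattan wire length), the linear cooling schedule, the operator names, and all function names are assumptions chosen for a short, runnable illustration.

```python
import math
import random
from typing import Dict, List, Optional, Tuple

Pos = Tuple[int, int]
Net = Tuple[str, str]


def wire_length(place: Dict[str, Pos], nets: List[Net]) -> int:
    """Total Manhattan distance over all producer / consumer pairs."""
    return sum(abs(place[a][0] - place[b][0]) + abs(place[a][1] - place[b][1])
               for a, b in nets)


def placement(grid: List[Optional[str]], cols: int) -> Dict[str, Pos]:
    """Turn the flat cell list into operator -> (row, column)."""
    return {op: divmod(i, cols) for i, op in enumerate(grid) if op is not None}


def anneal(ops: List[str], nets: List[Net], rows: int, cols: int,
           steps: int = 5000, t0: float = 5.0, seed: int = 0):
    rng = random.Random(seed)
    # One entry per array cell; None marks an unused cell.
    grid: List[Optional[str]] = ops + [None] * (rows * cols - len(ops))
    rng.shuffle(grid)                               # random initial placement
    cost = initial = wire_length(placement(grid, cols), nets)
    best, best_cost = list(grid), cost
    for k in range(steps):
        temp = t0 * (1 - k / steps) + 1e-9          # linear cooling schedule
        i, j = rng.randrange(len(grid)), rng.randrange(len(grid))
        grid[i], grid[j] = grid[j], grid[i]         # propose: swap two cells
        new_cost = wire_length(placement(grid, cols), nets)
        accept = (new_cost <= cost or
                  rng.random() < math.exp((cost - new_cost) / temp))
        if accept:
            cost = new_cost
            if cost < best_cost:
                best, best_cost = list(grid), cost
        else:
            grid[i], grid[j] = grid[j], grid[i]     # reject: undo the swap
    return placement(best, cols), initial, best_cost


# Hypothetical six-stage pipe placed on a 3 x 3 array.
ops = ["in", "mul", "add", "shift", "clip", "out"]
nets = list(zip(ops, ops[1:]))
place, initial_cost, final_cost = anneal(ops, nets, rows=3, cols=3)
print(initial_cost, "->", final_cost)
```

Because the search only evaluates a cost function, nothing in it requires the pipe to be linear or uniform, which is the point made on the slide.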






Here is the common model

instruction-stream-based: software code, CPU
data-stream-based: configware code, accelerator reconfigurable, accelerator hardwired

it's not von Neumann
the vN monopoly in our curricula is severely harmful
the tail is wagging the dog
we need dual paradigm education


A potential Pentium successor

Discard most caches
have 64* cores, 0.5 - 1 GHz
with clever interconnect for:
concurrent processes, and for multithreading,
and, for Kung-Kress pipe network

The Desk-top Supercomputer!

*) CPU mode / DPU mode capability


"Super Pentium" configuration example

[Figure: array of cores, some configured as rDPU and some as CPU]


e.g.: ~ 8 x 8 rDPA: all feasible under 500 MHz

World TV & game console & multi media center

[Figure: rDPA (SMeXPP) connected to Games, Music, Videos, Camera, Baseband-Processor, Radio-Interface, Audio-Interface, SD/MMC Cards, LCD DISPLAY]

• Variable resolutions and refresh rates
• Variable scan mode characteristics
• Noise Reduction and Artifact Removal
• High performance requirements
• Variable file encoding formats
• Variable content security formats
• Variable Displays
• Luminance processing
• Detail enhancement
• Color processing
• Sharpness Enhancement
• Shadow Enhancement
• Differentiation
• Programmable de-interlacing heuristics
• Frame rate detection and conversion
• Motion detection & estimation & compensation
• Different standards (MPEG2/4, H.264)
• A single device handles all modes

http://pactcorp.com


Dual Paradigm Application Development

high level language
software/configware co-compiler
instruction-stream-based: software code, CPU
data-stream-based: configware code, accelerator hardwired


Software / Configware Co-Compilation

Juergen Becker's CoDe-X, 1996

C language source
Partitioner
Resource Parameters: supporting different platforms
SW compiler: CPU
CW compiler: Placement & Routing (Move the Locality of Operation)


Bringing together data and processor

Move the stool by Configware
Place the location of execution into the data pipe


>> Conclusions <<

• Conclusions

http://www.uni-kl.de


Conclusions (1): Hurdles

Obstacles are:
unbelievably disastrous tools market:
unbelievably ignorant curricula:
enabling technologies available, partly decades old, but not used
transdisciplinary models neither available nor taught in CS, nor elsewhere
fragmentation into application-domain-specific cultures and trick boxes
… teach like for a 50 year old mainframe …


Conclusions (2): Future Work

The CS discipline must recognize and accept its strategic role and its responsibility toward all its application disciplines: embedded and scientific computing.
The monopoly of the von-Neumann-based mind set in CS education:
heavily stalls progress in R&D, not only in HPC
causes high cost in R&D, not only in supercomputing
The von-Neumann-only mind set in CS urgently needs to go: adopt the dual paradigm common model
CS graduates are not qualified for our job market



Conclusions (3): Chances

New horizons: chances are brilliant


thank you

END

Backup:



Co-Compiler Enabling Technology

is available from academia

only a small team needed for commercial re-implementation

on the road map to the Personal Supercomputer


Compilation: Software vs. Configware

source languages: C, FORTRAN, MATLAB

Software Engineering: source program, software compiler, software code
Configware Engineering: source "program", configware compiler
mapper (placement & routing): configware code
data scheduler: flowware code


Nick Tredennick's Paradigm Shifts explain the differences

Software Engineering (CPU):
resources: fixed
algorithm: variable (software)
1 programming source needed

Configware Engineering:
resources: variable (configware)
algorithm: variable (flowware)
2 programming sources needed


Co-Compilation

C, FORTRAN, MATLAB
Software / Configware Co-Compiler
automatic SW / CW partitioner (simulated annealing)
software compiler: software code
configware compiler
mapper (simulated annealing): configware code
data scheduler: flowware code


Co-Compiler for Hardwired Kress/Kung Machine [e.g. Brodersen]

source
Software / Flowware Co-Compiler
automatic SW / CW partitioner
software compiler: software code
flowware compiler
data scheduler: flowware code


The first archetype machine model: "von Neumann"

mainframe, CPU
compile or assemble: procedural personalization
personalization: RAM-based
instruction-stream-based mind set
simple basic Machine Paradigm: the Software Industry's Secret of Success


The 2nd archetype machine model: "Kress-Kung"

compile: structural personalization
personalization: RAM-based
data-stream-based mind set
simple basic Machine Paradigm: the Configware Industry's Secret of Success



„Saves more than $10,000 in electricity bills per year (7¢ / kWh) - .... per 64-processor 19" rack“ [Herb Riley, R. Associates]
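
The quoted saving can be converted into energy and power with a short calculation. Only the $10,000 per year, the 7¢ / kWh, and the 64 processors come from the quote; continuous operation over 8760 hours per year is an assumption, and the results are lower bounds because the quote says "more than".

```python
SAVINGS_USD_PER_YEAR = 10_000     # "more than $10,000" per rack (from the quote)
PRICE_USD_PER_KWH = 0.07          # 7 cents per kWh (from the quote)
PROCESSORS_PER_RACK = 64          # from the quote
HOURS_PER_YEAR = 24 * 365         # assumption: the rack runs continuously

energy_kwh = SAVINGS_USD_PER_YEAR / PRICE_USD_PER_KWH
average_power_kw = energy_kwh / HOURS_PER_YEAR
per_processor_w = average_power_kw * 1000 / PROCESSORS_PER_RACK

print(f"{energy_kwh:,.0f} kWh saved per rack and year")
print(f"{average_power_kw:.1f} kW average power saved per rack")
print(f"{per_processor_w:.0f} W saved per processor")
```

Under these assumptions the quote corresponds to roughly 143,000 kWh per year, or an average of about 16 kW per rack.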


modern FPGA bestsellers:

The new model is reality: FPGA fabrics, together with several µprocessors, many memory banks, and other IP cores, on the same COTS microchip


DSP platform FPGA [courtesy Xilinx Corp.]

500 MHz Flexible Soft Logic Architecture, 200K Logic Cells
500 MHz Programmable DSP Execution Units
0.6-11.1 Gbps Serial Transceivers
500 MHz PowerPC™ Processors (680 DMIPS) with Auxiliary Processor Unit
1 Gbps Differential I/O
500 MHz multi-port Distributed 10 Mb SRAM
500 MHz DCM Digital Clock Management
