Compilers and Applications

Kathy Yelick

Dave Judd, Ronny Krashinsky, Randi Thomas, Samson Kwok, Simon Yau, Kar Ming Tang, Adam Janin, Thinh Nguyen

Computer Science Division

UC Berkeley

Compiling for VIRAM

• Long-term success of DIS technology depends on simple programming model, i.e., a compiler

• Needs to handle significant class of applications

– IRAM: multimedia, graphics, speech and image processing

– ISTORE: databases, signal processing, other DIS benchmarks

• Needs to utilize hardware features for performance

– IRAM: vectorization

– ISTORE: scalability of shared-nothing programming model

IRAM Compilers

• IRAM/Cray vectorizing compiler [Judd]

– Production compiler

• Used on the T90, C90, as well as the T3D and T3E

• Being ported (by SGI/Cray) to the SV2 architecture

– Has C, C++, and Fortran front-ends (focus on C)

– Extensive vectorization capability

• outer loop vectorization, scatter/gather, short loops, …

– VIRAM port is under way

• IRAM/VSUIF vectorizing compiler [Krashinsky]

– Based on VSUIF from Corinna Lee’s group at Toronto, which is based on MachineSUIF from Mike Smith’s group at Harvard, which is based on the SUIF compiler from Monica Lam’s group at Stanford

– This is a “research” compiler, not intended for compiling large complex applications

– It has been working since 5/99.

IRAM/Cray Compiler Status

• MIPS backend developed this year

– Validated using a commercial test suite for code generation

– Generated code run through vas assembler

• Vector backend recently started

– Testing with vsim under way this week

• Leveraging from Cray

– Automatic vectorization

– Basic instruction scheduling framework

[Figure: IRAM/Cray compiler structure. Frontends: C, C++, Fortran. Middle: Vectorizer, PDGCS. Code Generators: C90, IRAM.]

ISTORE Compiler

• Titanium language is an extension of Java

– tc is the Titanium compiler

• Recent progress:

– improved portability of generated code and the compiler itself, including port to Cray parallel machines

– additions to generate annotations on C code to improve fine-grained parallelism (on Tera MTA) and vectorization

• New benchmarking efforts

– database primitives: sorting, hash-join and index-nested-loop join

– 3D FFT and linear solvers (LU)

[Figure: tc compiler structure. Labels: Java, Titanium, Optimizer, Code Gen, C + comm, C compiler (cc); targets: ISTORE, T3E, etc.]

Applications

• Hand-written kernels for single-chip VIRAM

– focus on multimedia kernels, see IRAM hardware talk

• Compiled programs for single-chip VIRAM

– 2 examples from IRAM/VSUIF: decryption and mvm

– most effort devoted to IRAM/Cray compiler

• Performance benchmarks for ISTORE

– 3D FFT

– Others

• SAM benchmarks for ISTORE

Automatic Vectorization

• Vectorizing compilers very successful on scientific applications

– not entirely automatic, especially for C/C++

– good tools for training users

• Multimedia applications

– have shorter vector lengths

– can sometimes exploit outer loop vectorization for longer vectors

– this often leads to non-unit strides

– tree traversals (e.g., in image compression) could be written as scatter/gather (breadth-first), although automating this is far from solved
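
The trade-off between short inner loops and strided outer loops can be sketched in C. This is only an illustration under stated assumptions: the row-sum kernel, the function names, and the row length of 8 are hypothetical and do not come from the talk.

```c
#include <stddef.h>

#define ROW_LEN 8  /* assumed short inner trip count, e.g. one 8-pixel row */

/* Inner-loop form: the vectorizable loop (over j) has only ROW_LEN
 * iterations, so vectors are short, but accesses are unit stride. */
void row_sums_inner(const short *img, size_t rows, int *sums)
{
    for (size_t i = 0; i < rows; i++) {
        int s = 0;
        for (size_t j = 0; j < ROW_LEN; j++)
            s += img[i * ROW_LEN + j];
        sums[i] = s;
    }
}

/* Outer-loop form: the loops are interchanged so the long loop (over i)
 * is innermost. Vector length is now `rows`, but img is read with a
 * stride of ROW_LEN elements. */
void row_sums_outer(const short *img, size_t rows, int *sums)
{
    for (size_t i = 0; i < rows; i++)
        sums[i] = 0;
    for (size_t j = 0; j < ROW_LEN; j++)
        for (size_t i = 0; i < rows; i++)
            sums[i] += img[i * ROW_LEN + j];
}
```

Both functions compute the same sums; which one vectorizes better depends on how well the target handles strided loads relative to short vectors.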

IRAM/VSUIF Decryption (IDEA)

• IDEA Decryption operates on 16-bit ints

• Compiled with IRAM/VSUIF (with unrolling by hand)

• Note scalability of both #lanes and data width

[Figure: IDEA decryption performance in GOP/s (axis 0 to 8) versus vpw (16, 32, 64), with one series each for 2, 4, and 8 lanes.]
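
For context on why IDEA maps onto 16-bit vector elements, here is a minimal sketch of its characteristic operation, multiplication modulo 2^16 + 1 with 0 standing for 2^16. This is the textbook definition of the cipher's multiply, not the kernel that was compiled for the talk; the names `idea_mul` and `idea_mul_blocks` are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* IDEA multiplication: a * b mod (2^16 + 1), where the 16-bit value 0
 * represents 2^16. */
uint16_t idea_mul(uint16_t a, uint16_t b)
{
    uint32_t p = (uint32_t)a * b;
    if (p != 0) {
        uint16_t lo = (uint16_t)p;
        uint16_t hi = (uint16_t)(p >> 16);
        /* 2^16 is congruent to -1, so p is congruent to lo - hi;
         * add 1 when the subtraction wraps around 2^16. */
        return (uint16_t)(lo - hi + (lo < hi));
    }
    /* At least one operand is 0, i.e. 2^16, which is congruent to -1. */
    return (uint16_t)(1 - a - b);
}

/* Element-wise loop over independent 16-bit words: each iteration is
 * independent, which is what lets the work spread across lanes. */
void idea_mul_blocks(const uint16_t *x, uint16_t key, uint16_t *out, size_t n)
{
    for (size_t i = 0; i < n; i++)
        out[i] = idea_mul(x[i], key);
}
```

The conditional in `idea_mul` is the part a vectorizing compiler would have to turn into masked or select operations.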

VIRAM/VSUIF Matrix/Vector Multiply

• VIRAM/VSUIF does reasonably well on long loops

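[Figure: bar chart of Mflop/s (axis 0 to 1200) for mvm and vmm across several code variants. The variant labels are garbled in the source; they appear to include dot, saxpy, and padded versions.]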

• 256x256 single-precision matrix

• Compare to 1600 Mflop/s (peak without multadd)

• Note BLAS-2 (little reuse)

• ~350 on Power3 and EV6

• Problems specific to VSUIF

– hand strip-mining results in short loops

– reductions

– no multadd support

3D FFT on ISTORE

• Performance of large 3D FFTs depends on 2 factors

– speed of 1D FFT on a single node (next slide)

– network bandwidth for “transposing” data

– 1.3 Tflop FFT possible w/ 1K IRAM nodes and 0.5 TB/s bw

1D FFT on IRAM

[Figure: 1D FFT time in microseconds (axis 0 to 250) versus size (128, 256, 512, 1024 points in FFT). Series: Naive without autoincrement; Naive with autoincrement; Vhalfup/dn, 2 LD/ST, opt. for 1; Vhalfup/dn, 2 LD/ST, opt. for 2; Vhalfup/dn, 1 LD/ST, opt. for 1.]

TigerSHARC DSP (Analog Devices), 32 bits: 41 us

IRAM, 32 bits: 37 us

TMS320C6000 DSP (Texas Instruments), 32 bits: 124 us

DSP56002 DSP (Motorola), 24 bits: 908 us

• FFT study on IRAM [Randi Thomas]

– hand-coded and scheduled

– use of ISA features to make in-register FFTs fast (128 point)

– bit-reversal time not included; will also use ISA support
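
For readers unfamiliar with the bit-reversal step that the timings exclude, here is a minimal scalar sketch in C. It assumes a radix-2 FFT with separate real and imaginary arrays; the function names are hypothetical, and it does not use or model the ISA support mentioned above.

```c
#include <stddef.h>

/* Reverse the low `bits` bits of i, e.g. with bits = 3: 0b001 -> 0b100. */
size_t bit_reverse(size_t i, unsigned bits)
{
    size_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (i & 1);
        i >>= 1;
    }
    return r;
}

/* In-place bit-reversal permutation of n = 2^bits complex points. */
void bit_reverse_permute(float *re, float *im, unsigned bits)
{
    size_t n = (size_t)1 << bits;
    for (size_t i = 0; i < n; i++) {
        size_t j = bit_reverse(i, bits);
        if (i < j) {  /* swap each pair exactly once */
            float t = re[i]; re[i] = re[j]; re[j] = t;
            t = im[i]; im[i] = im[j]; im[j] = t;
        }
    }
}
```

The permutation is pure data movement with an irregular access pattern, which is why it is often handled separately from the butterfly computation.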

Other ISTORE Applications

• Working on several performance applications for ISTORE

– Database primitives: sorts, joins, scans, etc. [Kar Ming Tang]

– RT_STAP

• QR Decomposition vectorizes easily, partially complete in IRAM/VSUIF

– Conjugate Gradient [Samson Kwok]

• Dominated by sparse matrix-vector multiply

• Current performance: 500/250 Mflops (single/double) on VIRAM

• Compare to 10s of Mflops on most RISC machines

– Dense linear algebra [Simon Yau]

– Considering other DIS benchmarks, such as MoM
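
The sparse matrix-vector multiply that dominates Conjugate Gradient can be sketched as below. The compressed sparse row (CSR) layout is an assumption made for illustration; the talk does not say which storage format was used, and the name `spmv_csr` is hypothetical.

```c
#include <stddef.h>

/* y = A*x for A in compressed sparse row (CSR) form.
 * row_ptr has nrows + 1 entries; col_idx and val hold the nonzeros
 * of each row consecutively. */
void spmv_csr(size_t nrows, const size_t *row_ptr, const size_t *col_idx,
              const double *val, const double *x, double *y)
{
    for (size_t i = 0; i < nrows; i++) {
        double s = 0.0;
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; k++)
            s += val[k] * x[col_idx[k]];  /* indexed load of x: a gather */
        y[i] = s;
    }
}
```

The indexed load of `x` is the access pattern that maps to a gather on a vector unit.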

Conclusions

• Significant compiler progress:

– Cray collaboration key [Dave Judd, UCB @ Eagan]

– Good tech transfer model

– Vector code gen and instruction scheduling next steps

• Even VSUIF version indicates reasonable performance

– Commercial-quality compiler will allow non-toy applications, e.g., Speech

• Benchmarks

– Have been used to help with final ISA design

– Simulated results validate performance claims

– Models show real advantage to Intelligence in Memory (and Disk)

– Machines scale, with a simpler programming and optimization model than conventional multiprocessors
