Compilers and Applications
Kathy Yelick
Dave Judd, Ronny Krashinsky, Randi Thomas, Samson Kwok, Simon Yau, Kar Ming Tang,
Adam Janin, Thinh Nguyen
Computer Science Division
UC Berkeley
Compiling for VIRAM
• Long-term success of DIS technology depends on simple programming model, i.e., a compiler
• Needs to handle significant class of applications
  – IRAM: multimedia, graphics, speech and image processing
  – ISTORE: databases, signal processing, other DIS benchmarks
• Needs to utilize hardware features for performance
  – IRAM: vectorization
  – ISTORE: scalability of shared-nothing programming model
IRAM Compilers
• IRAM/Cray vectorizing compiler [Judd]
  – Production compiler
    • Used on the T90, C90, as well as the T3D and T3E
    • Being ported (by SGI/Cray) to the SV2 architecture
  – Has C, C++, and Fortran front-ends (focus on C)
  – Extensive vectorization capability
    • outer loop vectorization, scatter/gather, short loops, …
  – VIRAM port is under way
• IRAM/VSUIF vectorizing compiler [Krashinsky]
  – Based on VSUIF from Corinna Lee’s group at Toronto, which is based on MachineSUIF from Mike Smith’s group at Harvard, which is based on the SUIF compiler from Monica Lam’s group at Stanford
  – This is a “research” compiler, not intended for compiling large complex applications
  – It has been working since 5/99.
IRAM/Cray Compiler Status
• MIPS backend developed this year
  – Validated using a commercial test suite for code generation
  – Generated code run through vas assembler
• Vector backend recently started
  – Testing with vsim under way this week
• Leveraging from Cray
  – Automatic vectorization
  – Basic instruction scheduling framework
[Diagram: compiler structure. Components: Frontends (C, Fortran, C++), Vectorizer, PDGCS, Code Generators (IRAM, C90)]
ISTORE Compiler
• Titanium language is an extension of Java
  – tc is the Titanium compiler
• Recent progress:
  – improved portability of generated code and the compiler itself, including port to Cray parallel machines
  – additions to generate annotations on C code to improve fine-grained parallelism (on Tera MTA) and vectorization
• New benchmarking efforts
  – database primitives: sorting, hash-join and index-nested-loop join
  – 3D FFT and linear solvers (LU)
[Diagram: Titanium compilation flow. Labels: Java, Titanium, tc, Optimizer, Code Gen, C + comm, C compiler (cc), ISTORE, t3e]
Applications
• Hand-written kernels for single-chip VIRAM
  – focus on multimedia kernels, see IRAM hardware talk
• Compiled programs for single-chip VIRAM
  – 2 examples from IRAM/VSUIF: decryption and mvm
  – most effort devoted to IRAM/Cray compiler
• Performance benchmarks for ISTORE
  – 3D FFT
  – Others
• SAM benchmarks for ISTORE
Automatic Vectorization
• Vectorizing compilers very successful on scientific applications
  – not entirely automatic, especially for C/C++
  – good tools for training users
• Multimedia applications
  – have shorter vector lengths
  – can sometimes exploit outer loop vectorization for longer vectors
  – this often leads to non-unit strides
  – tree traversals could be written as scatter/gather (breadth-first), e.g., image compression
    • although automating this is far from solved
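
One way to read the scatter/gather remark is that a breadth-first traversal can process a whole tree level as one vector, using indexed (gather) loads to fetch node data and child indices. The sketch below is illustrative only: it is written in Python rather than the C the compilers target, and the array-based tree layout (`value`, `left`, `right`, with -1 for a missing child) and the function names are assumptions, not anything from the slides.

```python
def gather(array, indices):
    """Indexed load: fetch one element of `array` per entry of `indices`."""
    return [array[i] for i in indices]


def level_sums(value, left, right, root=0):
    """Sum node values level by level, treating each level as one vector.

    The tree is stored as parallel arrays (a hypothetical layout):
    left[i] and right[i] hold child indices of node i, or -1 if absent.
    """
    sums = []
    frontier = [root]
    while frontier:
        # Gather the values of every node on this level, then reduce.
        sums.append(sum(gather(value, frontier)))
        # Gather all child indices, then compress away the missing ones.
        children = gather(left, frontier) + gather(right, frontier)
        frontier = [c for c in children if c != -1]
    return sums
```

Each `gather` call stands in for a single indexed vector load; the filtering step corresponds to a vector compress. Deciding automatically that a pointer-chasing loop can be restructured this way is the unsolved part the slide refers to.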
IRAM/VSUIF Decryption (IDEA)
• IDEA Decryption operates on 16-bit ints
• Compiled with IRAM/VSUIF (with unrolling by hand)
• Note scalability of both #lanes and data width
[Chart: GOP/s (axis 0 to 8) by vpw (16, 32, 64) and # lanes (2, 4, 8)]
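
For readers unfamiliar with IDEA, the 16-bit operation that makes it a good fit for narrow vector elements is multiplication modulo 2^16 + 1, in which the 16-bit value 0 stands for 2^16. A minimal Python sketch follows. The function names are hypothetical, and applying one subkey across many independent 16-bit words is an assumption about how the kernel was vectorized, not something the slide states.

```python
def idea_mul(a, b):
    """IDEA multiplication modulo 2**16 + 1 on 16-bit operands.

    By IDEA's convention the 16-bit value 0 represents 2**16.
    """
    a = a or 0x10000
    b = b or 0x10000
    # A result of 2**16 is mapped back to the 16-bit value 0 by the mask.
    return ((a * b) % 0x10001) & 0xFFFF


def idea_mul_vector(words, subkey):
    """Apply the same subkey to many independent 16-bit words.

    Every element is independent, so each could occupy its own
    16-bit slot in a vector register (illustrative assumption).
    """
    return [idea_mul(w, subkey) for w in words]
```

Because each element needs only 16 bits, narrowing the element width lets more elements fit in each lane, which is consistent with the slide's note about scaling with both the number of lanes and the data width.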
VIRAM/VSUIF Matrix/Vector Multiply
• VIRAM/VSUIF does reasonably well on long loops
[Chart: Mflop/s (axis 0 to 1200) for mvm and vmm; variants: dot, saxpy, padded saxpy, strip, hand opt]
• 256x256 single matrix
• Compare to 1600 Mflop/s (peak without multadd)
• Note BLAS-2 (little reuse)
• ~350 on Power3 and EV6
• Problems specific to VSUIF
  – hand strip-mining results in short loops
  – reductions
  – no multadd support
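
The dot and saxpy labels on this slide presumably refer to the two standard loop orders for a matrix-vector product; the slide does not define them, so the sketch below uses the textbook meanings. It is written in Python for illustration, and `strip_mine` with its `mvl` (maximum vector length) parameter is a hypothetical helper showing what hand strip-mining does to loop bounds.

```python
def mvm_dot(A, x):
    """y = A x with a dot product per row: the inner loop ends in a reduction."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def mvm_saxpy(A, x):
    """y = A x as a sum of scaled columns: y += x[j] * A[:, j]."""
    n_rows = len(A)
    y = [0.0] * n_rows
    for j, x_j in enumerate(x):
        # The inner loop has no reduction, but it walks down a column,
        # which is a non-unit stride when A is stored row-major.
        for i in range(n_rows):
            y[i] += x_j * A[i][j]
    return y


def strip_mine(n, mvl):
    """Split n iterations into (start, end) strips of at most mvl elements."""
    return [(start, min(start + mvl, n)) for start in range(0, n, mvl)]
```

The dot form needs reduction support, and the saxpy form is one multiply and one add per element, so both relate to the VSUIF limitations listed above (reductions, no multadd).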
3D FFT on ISTORE
• Performance of large 3D FFTs depends on 2 factors
  – speed of 1D FFT on a single node (next slide)
  – network bandwidth for “transposing” data
  – 1.3 Tflop FFT possible w/ 1K IRAM nodes and 0.5 TB/s bw
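
The two factors can be combined into a rough cost estimate: compute time from the flop count, plus transpose time from the volume of data crossing the network. The Python sketch below is a back-of-envelope model under several assumptions that are not from the slides: the conventional 5 N log2 N flop count for a complex FFT, 8 bytes per point (single-precision complex), two full transposes, and a hypothetical 1024^3 problem size. Only the 1.3 Tflop/s and 0.5 TB/s figures come from the slide.

```python
import math


def fft3d_time_estimate(n, total_flops_per_s, total_bytes_per_s,
                        bytes_per_point=8, transposes=2):
    """Rough (compute_seconds, transpose_seconds) for an n x n x n complex FFT."""
    points = n ** 3
    # Conventional flop count for a complex FFT of `points` points (assumption).
    flops = 5 * points * math.log2(points)
    compute = flops / total_flops_per_s
    # Each transpose is assumed to move every point across the network once.
    comm = transposes * points * bytes_per_point / total_bytes_per_s
    return compute, comm


# Slide figures: 1.3 Tflop/s aggregate compute and 0.5 TB/s network bandwidth.
compute_s, comm_s = fft3d_time_estimate(1024, 1.3e12, 0.5e12)
```

Changing `transposes` or `bytes_per_point` shifts the balance between the two terms, so the model is useful only for seeing which factor dominates under a given set of assumptions.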
1D FFT on IRAM
[Chart: time (microseconds, axis 0 to 250) versus size (#points in FFT: 128, 256, 512, 1024). Series: Naive without autoincrement; Naive with autoincrement; Vhalfup/dn, 2 LD/ST, opt. for 1; Vhalfup/dn, 2 LD/ST, opt. for 2; Vhalfup/dn, 1 LD/ST, opt. for 1]
TigerSHARC DSP (Analog Devices), 32-bit: 41 us
IRAM, 32-bit: 37 us
TMS320C6000 DSP (Texas Instruments), 32-bit: 124 us
DSP56002 DSP (Motorola), 24-bit: 908 us
• FFT study on IRAM [Randi Thomas]
  – hand-coded and scheduled
  – use of ISA features to make in-register FFTs fast (128 point)
  – bit-reversal time not included; will also use ISA support
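
For context on the step excluded from the timings: radix-2 FFTs consume or produce data in bit-reversed index order, so a separate reordering pass is usually needed. The Python sketch below shows the generic permutation only; it is not the ISA-supported version the slide anticipates, and the function names are hypothetical.

```python
def bit_reverse(i, bits):
    """Reverse the low `bits` bits of the integer i."""
    r = 0
    for _ in range(bits):
        r = (r << 1) | (i & 1)
        i >>= 1
    return r


def bit_reversal_permute(data):
    """Return `data` reordered so element i comes from index bit_reverse(i)."""
    n = len(data)
    bits = n.bit_length() - 1
    if n == 0 or (1 << bits) != n:
        raise ValueError("length must be a power of two")
    return [data[bit_reverse(i, bits)] for i in range(n)]
```

The permutation is a pure gather with an irregular index pattern, so its cost depends on how well the hardware handles indexed loads.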
Other ISTORE Applications
• Working on several performance applications for ISTORE
  – Database primitives: sorts, joins, scans, etc. [Kar Ming Tang]
  – RT_STAP
    • QR Decomposition vectorizes easily, partially complete in IRAM/VSUIF
  – Conjugate Gradient [Samson Kwok]
    • Dominated by sparse matrix-vector multiply
    • Current performance: 500/250 Mflop/s (single/double) on VIRAM
    • Compare to 10s of Mflop/s on most RISC machines
  – Dense linear algebra [Simon Yau]
  – Considering other DIS benchmarks, such as MoM
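
The sparse matrix-vector multiply that dominates Conjugate Gradient can be sketched as follows. The slides do not say which storage format was used; compressed sparse row (CSR) is assumed here because it is common, and the Python function and argument names are hypothetical.

```python
def spmv_csr(row_ptr, col_idx, vals, x):
    """y = A x for a matrix A stored in CSR form (format is an assumption).

    row_ptr[r]:row_ptr[r + 1] delimits the nonzeros of row r;
    col_idx and vals hold their column indices and values.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        lo, hi = row_ptr[r], row_ptr[r + 1]
        # Indexed (gather) load of the x entries this row needs.
        gathered = [x[c] for c in col_idx[lo:hi]]
        # Elementwise multiply followed by a reduction.
        y.append(sum(v * g for v, g in zip(vals[lo:hi], gathered)))
    return y
```

The gather through `col_idx` is the irregular access that tends to hurt cache-based machines, while a vector unit can issue it as a single indexed load.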
Conclusions
• Significant compiler progress:
  – Cray collaboration key [Dave Judd, UCB @ Eagan]
  – Good tech transfer model
  – Vector code gen and instruction scheduling next steps
• Even VSUIF version indicates reasonable performance
  – Commercial-quality compiler will allow non-toy applications, e.g., Speech
• Benchmarks
  – Have been used to help with final ISA design
  – Simulated results validate performance claims
  – Models show real advantage to Intelligence in Memory (and Disk)
  – Machines scale, with a simpler programming and optimization model than conventional multiprocessors