Carnegie Mellon
Meet Stevephanie, the ComputerSpiral: Automatic Library GenerationSpiral: Automatic Library Generation
Franz Franchetti
Electrical and Computer EngineeringElectrical and Computer EngineeringCarnegie Mellon University
UW MSR Institute 2008, August 6th: “The Concurrency Challenge”
Carnegie Mellon
The Current Architecture Space2010 and later
Larrabee2008
Virtex 5FPGA+ 4 CPUsSun Niagara
32 threads
Larrabee
multicore
Cell BE8+1 cores
before 2000
IBM C l 64
Xtreme DATA Opteron + FPGA
CPU platforms
Core2 DuoCore2 Extreme
SGI RASC Itanium + FPGA
N idi GPU
IBM Cyclops6480 cores
Nvidia GPUs128 processors ATI/AMD merger
CPU+GPU fusionClearSpeed96 cores
programmability
Carnegie Mellon
Example: DFT on “Nice” Quadcore
24
26
Discrete Fourier Transform (single precision): 2 x Core2 Extreme 3 GHzPerformance [Gflop/s]
18
20
22
24
12
14
16
30x
best code(computer-generated)
6
8
10
0
2
4
16 32 64 128 256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144
Numerical Recipes (ANSI C)
Problem size
What’s going on?
Carnegie Mellon
DFT Plot: Analysis
Discrete Fourier Transform (DFT) on 2 x Core 2 Duo 3 GHzGflop/s
25
30
15
20
Multiple threads: 2x
5
10
SIMD vector instructions: 3x
0
5
16 32 64 128 256 512 1,024 2,048 4,096 8,192 16,384 32,768 65,536 131,072 262,144input size
Memory hierarchy: 5x
input size
High performance is hard even on “simple” architectures
Carnegie Mellon
Current Solution
Legions of programmers implement and optimize the same functionality for every platform and whenever asame functionality for every platform and whenever a new platform comes out.
Carnegie Mellon
Better: Automatic Performance Tuning
Automate (parts of) the implementation or optimization
Research effortsLinear algebra: Phipac/ATLAS, LAPACK, Sparsity/Bebop/OSKI, FlameTensor computations (TCE)PDE/fi i l F iPDE/finite elements: FenicsAdaptive sortingFourier transform: FFTW, UHFFT Linear transforms: Spiral…
Proceedings of the IEEE special issue, Feb. 2005
Promising new area but more work neededIn particular for parallelism …
Carnegie Mellon
SpiralpLibrary generator for linear transforms (DFT, DCT, DWT, filters, ….) and recently more …
Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog, GPU
Research Goal: “Teach” computers to write fast librariesComplete automation of implementation and optimizationConquer the “high” algorithm level for automationq g g
When a new platform comes out: Regenerate a retuned library
When a new platform paradigm comes out (e.g., CPU+GPU):Update the tool rather than rewriting the library
Intel is using Spiral to generate parts of their MKL and IPP libraries
Carnegie Mellon
Vision Behind Spiral
Current Future
Numerical problem
effo
rt
Numerical problem
algorithm selection
hum
an e
implementationC program
mat
edalgorithm selection
implementation
compilation
utom
ated
auto
compilation
p
Computing platform
au
Computing platform
C code a singularity: Compiler hasno access to high level information
Challenge: conquer the high abstraction level for complete automation
Carnegie Mellon
Main Idea: Joint Mathematical Abstraction
Model: common abstraction= spaces of matching formulas
νabstraction abstraction
νpμ
rewritingdefines
picksearch
architecturespace
algorithmspace
Architectural parameter:V l h
Kernel: bl ioptimizationVector length,
#processors, …problem size, algorithm choice
optimization
Carnegie Mellon
How Spiral WorksProblem specification (transform)
Spiral:
Algorithm Generation
Algorithm Optimization
controlsSpiral: Complete automation of the implementation and optimization task
Implementation
algorithm
arch
controls
p
Basic idea:Declarative representation
Code Optimization
C il ti
C code
Sea
of algorithms
Rewriting systems to d i i Compilation
Compiler Optimizationsperformance
S i l
generate and optimize algorithms
Fast executableSpiral
Carnegie Mellon
Generating Not Finding ParallelismGenerating, Not Finding Parallelism
Shared Memory (SMP, CMP)
Vector SIMD (SSE, VMX, Double FPU…)
Message Passing (Clusters)Message Passing (Clusters)
DMA+Streaming (Cell BE)
G hi P (GPU )
Formulas
Graphics Processors (GPUs)
Pipelining+Sreaming (FPGAs)
Code
HW/SW partitioning (CPU+FPGA)
One methodology optimizes for all types of parallelism
Carnegie Mellon
Program Generation in Spiral (Sketched)Transformuser specified
Optimization at allabstraction levels
Fast algorithmin SPLmany choices
parallelizationvectorization
many choices
∑‐SPL: loop optimizationsoptimizations
C Code: Iteration of this process to search for the fastest
constant foldingscheduling
But that’s not all ………
Carnegie Mellon
Behind The Scenes: Domain‐Specific Rewriting
Algorithms
“Teaches” Spiral about existing algorithm knowledge (~200 journal papers)(~200 journal papers)
“Teaches” Spiral about hardware properties
Programt f ti
“Teaches” Spiral about program transformations
transformations
Hardwarepropertiesproperties
Carnegie Mellon
Formulas Encode Program PropertiesFormulas Encode Program Properties
Embarrassingly parallel, load balanced computationAAAAA
Processor 0
Processor 1
Processor 2
Processor 3A
x y
False sharing
x y
Span space of false‐sharing free programs for empirical tuning
Carnegie Mellon
Going Beyond TransformsGoing Beyond Transforms
Transform = linear operator with one vector input and one vector outputlinear operator with one vector input and one vector output
linear
Key ideas: Generalize to (possibly nonlinear) operators with several inputs andGeneralize to (possibly nonlinear) operators with several inputs and several outputsGeneralize SPL (including tensor product) to OL (operator language)Generalize rewriting systems for parallelizationsGeneralize rewriting systems for parallelizations
Carnegie Mellon
Expressing Kernels as Operator FormulasViterbi DecoderMatrix‐Matrix Multiplication
convolutionalencoder
convolutionalencoder
ViterbidecoderViterbidecoder
010001 11 10 00 01 10 01 11 00 01000111 10 01 01 10 10 11 00
= ×= ×
×
JPEG 2000 (Wavelet, EBCOT) Synthetic Aperture Radar (SAR)interpolation 2D iFFT
matched filtering
preprocessingJPEG 2000 Compression
DWTDWT quantizationquantizationentropy coding
DWTDWT quantizationquantization(EBCOT + MQ)
Carnegie Mellon
General‐Size LibraryGeneral Size LibraryInput:
Transform:Transform:
Algorithms:
Vectorization: 2‐way SSE
Threading: Yes
Output:
Hi h P f Lib
Library Generator Library Generator Optimized library (10,000 lines of C++)
For general input size (not collection of fixed sizes)
High‐Performance Library( )
Vectorized
Multithreaded
With runtime adaptation mechanismp
Performance competitive with hand‐written code
Carnegie Mellon
Generating General‐Size Libraries
Spiral Library Generator
void compute(Y, X, …) { void dft2 (…) {
Recursion Base casesfftw_plan_dft_1d(…) {
Target Infrastructurep ( , , ) {
for(i=0..k-1) c1->compute(…) for(j=0..m-1) c2->compute(…)
}
y[0]=x[0]+x[1]y[1]=x[0]-x[1]
}
hand->buf=fftw_malloc(…);
…}
Spiral generates whole library: FFTW equivalent and more
Carnegie Mellon
Benchmark: DFT on “Nice” QuadcoreDFT (single precision): on 3 GHz 2 x Core 2 Extremeperformance [Gflop/s]
30
Spiral 5 0 SPMD
25
Spiral 5.0 SPMDSpiral 5.0 sequentialIntel IPP 5.0FFTW 3.2 alpha SMPFFTW 3.2 alpha sequential 4 processors
15
202 processors
5
10 data L1$ resident
data L2$ resident
0
5
256 512 1024 2048 4096 8192 16384 32768 65536 131072 262144
i t iinput size
4-way vectorized + up to 4-threaded + adapted to the memory hierarchy
Carnegie Mellon
Summary
Libraries must be well‐tunedNightmare: locality multicore special instructionsNightmare: locality, multicore, special instructions,…
Automating the tuning process is desirableAutomating the tuning process is desirableComputer‐generated code can beat hand‐tuned code
High‐level abstraction is key to performance portabilityThat does not imply “easy to program”
Research challenge: Automate more domainsProgram generators must become part of the ecosystem
Carnegie Mellon
A k l d S i lAcknowledgement: Spiral www.spiral.net
Joint work withYevgen Voronenko
… and the Spiral team (only part shown)
Yevgen VoronenkoFrédéric de MesmayMarkus Püschel
This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, Mercury Inc., and Intel