Carnegie Mellon

Meet Stevephanie, the Computer
Spiral: Automatic Library Generation

Franz Franchetti

Electrical and Computer Engineering
Carnegie Mellon University

UW MSR Institute 2008, August 6th: “The Concurrency Challenge”

The Current Architecture Space

[Figure: platforms placed along a time axis (before 2000, 2008, 2010 and later) against programmability; region label: multicore]

CPU platforms: Core2 Duo, Core2 Extreme
Sun Niagara: 32 threads
Cell BE: 8+1 cores
IBM Cyclops64: 80 cores
ClearSpeed: 96 cores
Nvidia GPUs: 128 processors
ATI/AMD merger: CPU+GPU fusion
Larrabee
Virtex 5: FPGA + 4 CPUs
Xtreme DATA: Opteron + FPGA
SGI RASC: Itanium + FPGA

Example: DFT on “Nice” Quadcore

[Plot: Discrete Fourier Transform (single precision) on 2 x Core2 Extreme 3 GHz. Performance [Gflop/s] (0 to 26) versus problem size (16 to 262144). Curves: best code (computer-generated) and Numerical Recipes (ANSI C). Annotated gap: 30x.]

What’s going on?

DFT Plot: Analysis

[Plot: Discrete Fourier Transform (DFT) on 2 x Core 2 Duo 3 GHz. Gflop/s (0 to 30) versus input size (16 to 262,144). The gap between the curves is attributed to three factors:]

Memory hierarchy: 5x
SIMD vector instructions: 3x
Multiple threads: 2x

High performance is hard even on “simple” architectures

Current Solution

Legions of programmers implement and optimize the same functionality for every platform and whenever a new platform comes out.

Better: Automatic Performance Tuning

Automate (parts of) the implementation or optimization

Research efforts
Linear algebra: Phipac/ATLAS, LAPACK, Sparsity/Bebop/OSKI, Flame
Tensor computations (TCE)
PDE/finite elements: Fenics
Adaptive sorting
Fourier transform: FFTW, UHFFT
Linear transforms: Spiral
…

Proceedings of the IEEE special issue, Feb. 2005

Promising new area but more work needed
In particular for parallelism …

Spiral

Library generator for linear transforms (DFT, DCT, DWT, filters, …) and recently more …

Wide range of platforms supported: scalar, fixed point, vector, parallel, Verilog, GPU

Research Goal: “Teach” computers to write fast libraries
Complete automation of implementation and optimization
Conquer the “high” algorithm level for automation

When a new platform comes out: Regenerate a retuned library

When a new platform paradigm comes out (e.g., CPU+GPU): Update the tool rather than rewriting the library

Intel is using Spiral to generate parts of their MKL and IPP libraries

Vision Behind Spiral

[Figure: two flows from numerical problem to computing platform.
Current: algorithm selection and implementation are human effort and produce a C program; only compilation is automated.
Future: algorithm selection, implementation, and compilation are all automated.]

C code a singularity: Compiler has no access to high level information

Challenge: conquer the high abstraction level for complete automation

Main Idea: Joint Mathematical Abstraction

Model: common abstraction = spaces of matching formulas

[Figure: the architecture space (parameter ν) and the algorithm space (parameter μ) are each mapped by abstraction into a common space of formulas. Rewriting defines that space; search picks from it; optimization follows.]

Architectural parameter: vector length, #processors, …
Kernel: problem size, algorithm choice

How Spiral Works

[Figure: Spiral pipeline. Problem specification (transform) → Algorithm Generation and Algorithm Optimization → algorithm → Implementation and Code Optimization → C code → Compilation and Compiler Optimizations → Fast executable. Search observes performance and controls the algorithm and implementation stages.]

Spiral: Complete automation of the implementation and optimization task

Basic idea:
Declarative representation of algorithms
Rewriting systems to generate and optimize algorithms

Generating, Not Finding Parallelism

Shared Memory (SMP, CMP)
Vector SIMD (SSE, VMX, Double FPU…)
Message Passing (Clusters)
DMA+Streaming (Cell BE)
Graphics Processors (GPUs)
Pipelining+Streaming (FPGAs)
HW/SW partitioning (CPU+FPGA)

[Figure: formulas are mapped to code for each of these targets.]

One methodology optimizes for all types of parallelism

Program Generation in Spiral (Sketched)

Transform (user specified)
→ Fast algorithm in SPL (many choices)
→ ∑‐SPL (parallelization, vectorization, loop optimizations)
→ C Code (constant folding, scheduling, …)

Optimization at all abstraction levels
Iteration of this process to search for the fastest

But that’s not all …
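
To give a feel for the "many choices" at the algorithm level, the sketch below counts how many distinct recursive breakdowns exist for a DFT of size 2^k. It is a simplified illustration, not Spiral's rule set: it assumes only one rule (a Cooley-Tukey style split of 2^k into 2^i times 2^(k-i)) and a single base case of size 2. The function names `count_rule_trees` and `enumerate_rule_trees` are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_rule_trees(k: int) -> int:
    """Count recursive breakdowns of a size-2^k DFT down to size-2 base cases.

    Assumption: the only rule is the split 2^k = 2^i * 2^(k-i), 1 <= i < k,
    and both factors are broken down independently.
    """
    if k == 1:
        return 1  # size 2 is the base case, nothing left to choose
    return sum(count_rule_trees(i) * count_rule_trees(k - i) for i in range(1, k))


def enumerate_rule_trees(k: int):
    """Yield every breakdown as a nested tuple; a leaf is the base-case size 2."""
    if k == 1:
        yield 2
        return
    for i in range(1, k):
        for left in enumerate_rule_trees(i):
            for right in enumerate_rule_trees(k - i):
                yield (left, right)


if __name__ == "__main__":
    for k in range(1, 11):
        print(f"DFT size 2^{k}: {count_rule_trees(k)} breakdowns")
    for tree in enumerate_rule_trees(4):
        print(tree)
```

Even with one rule the count grows quickly with k, which is why a search over candidates, each timed on the target machine, is used to pick among them rather than a fixed choice.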

Behind The Scenes: Domain‐Specific Rewriting

Algorithms
“Teaches” Spiral about existing algorithm knowledge (~200 journal papers)

Hardware properties
“Teaches” Spiral about hardware properties

Program transformations
“Teaches” Spiral about program transformations

Formulas Encode Program Properties

Embarrassingly parallel, load balanced computation
[Figure: block-diagonal matrix mapping x to y, with one block A per processor (Processor 0 to Processor 3)]

False sharing
[Figure: mapping from x to y illustrating false sharing]

Span space of false‐sharing free programs for empirical tuning
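
The block-diagonal picture can be sketched in code. The formula itself did not survive extraction, so this example assumes it is the tensor product I_p ⊗ A (p copies of A on the diagonal), which is the usual reading in the Spiral literature. The helper names `matvec`, `kron_identity`, and `apply_blockwise` are hypothetical. The sketch checks that applying A to p contiguous chunks of x gives the same result as multiplying by the full matrix, which is what makes the computation embarrassingly parallel and load balanced.

```python
def matvec(A, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


def kron_identity(p, A):
    """Dense matrix of I_p (x) A: p copies of A along the diagonal."""
    m, n = len(A), len(A[0])
    M = [[0] * (p * n) for _ in range(p * m)]
    for b in range(p):
        for r in range(m):
            for c in range(n):
                M[b * m + r][b * n + c] = A[r][c]
    return M


def apply_blockwise(p, A, x):
    """Apply A to each of the p contiguous chunks of x.

    Every loop iteration reads and writes its own chunk only, so each
    iteration could run on its own processor with equal work.
    """
    n = len(A[0])
    y = []
    for proc in range(p):
        chunk = x[proc * n:(proc + 1) * n]
        y.extend(matvec(A, chunk))
    return y


if __name__ == "__main__":
    A = [[1, 1], [1, -1]]          # size-2 butterfly used as the block
    x = list(range(8))
    print(apply_blockwise(4, A, x))  # [1, -1, 5, -1, 9, -1, 13, -1]
    print(apply_blockwise(4, A, x) == matvec(kron_identity(4, A), x))
```

Because each processor touches one contiguous chunk, choosing the chunk size as a multiple of the cache line length is one way such a formula can also rule out false sharing; that tuning choice is not shown here.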

Going Beyond Transforms

Transform = linear operator with one vector input and one vector output

Key ideas:
Generalize to (possibly nonlinear) operators with several inputs and several outputs
Generalize SPL (including tensor product) to OL (operator language)
Generalize rewriting systems for parallelizations

Expressing Kernels as Operator Formulas

[Figure, four panels:]
Viterbi Decoder: bit stream through a convolutional encoder, then a Viterbi decoder
Matrix‐Matrix Multiplication: matrix = matrix × matrix
JPEG 2000 (Wavelet, EBCOT): DWT, quantization, entropy coding (EBCOT + MQ)
Synthetic Aperture Radar (SAR): preprocessing, matched filtering, interpolation, 2D iFFT

General‐Size Library

Input:
Transform: [formula]
Algorithms: [formulas]
Vectorization: 2‐way SSE
Threading: Yes

Library Generator

Output: High‐Performance Library
Optimized library (10,000 lines of C++)
For general input size (not collection of fixed sizes)
Vectorized
Multithreaded
With runtime adaptation mechanism
Performance competitive with hand‐written code

Generating General‐Size Libraries

Spiral Library Generator

Recursion
void compute(Y, X, …) {
  for(i=0..k-1) c1->compute(…)
  for(j=0..m-1) c2->compute(…)
}

Base cases
void dft2 (…) {
  y[0]=x[0]+x[1]
  y[1]=x[0]-x[1]
}

Target Infrastructure
fftw_plan_dft_1d(…) {
  hand->buf=fftw_malloc(…);
  …
}

Spiral generates whole library: FFTW equivalent and more
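
The recursion plus base-case structure can be sketched as a small runnable example. This is a minimal illustration under stated assumptions, not the generated library: it uses a Cooley-Tukey split of size n = k·m, the size-2 butterfly from the slide as the only unrolled base case, and a naive DFT as a fallback for prime sizes. The names `dft`, `dft_naive`, and `smallest_factor` are hypothetical; `dft2` mirrors the slide.

```python
import cmath


def dft_naive(x):
    """Direct O(n^2) DFT, used as reference and as fallback for prime sizes."""
    n = len(x)
    return [sum(x[j] * cmath.exp(-2j * cmath.pi * j * q / n) for j in range(n))
            for q in range(n)]


def dft2(x):
    """Base case from the slide: size-2 butterfly."""
    return [x[0] + x[1], x[0] - x[1]]


def smallest_factor(n):
    """Smallest factor of n greater than 1 (n itself if n is prime)."""
    for k in range(2, int(n ** 0.5) + 1):
        if n % k == 0:
            return k
    return n


def dft(x):
    """Recursive DFT of general size n = k * m."""
    n = len(x)
    if n == 1:
        return list(x)
    if n == 2:
        return dft2(x)
    k = smallest_factor(n)
    if k == n:
        return dft_naive(x)  # prime size: no split available in this sketch
    m = n // k
    # First loop: k child transforms of size m on strided inputs.
    inner = [dft(x[i::k]) for i in range(k)]
    # Second loop: m child transforms of size k, after twiddle scaling.
    y = [0j] * n
    for j in range(m):
        t = [inner[i][j] * cmath.exp(-2j * cmath.pi * i * j / n) for i in range(k)]
        col = dft(t)
        for q in range(k):
            y[j + m * q] = col[q]
    return y


if __name__ == "__main__":
    x = [complex(j, 2 - j) for j in range(12)]
    err = max(abs(a - b) for a, b in zip(dft(x), dft_naive(x)))
    print(f"max deviation from naive DFT: {err:.2e}")
```

The two loops correspond to the `c1->compute` and `c2->compute` loops in the recursion fragment; a generated library would additionally choose the split at runtime and use vectorized, multithreaded base cases.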

Benchmark: DFT on “Nice” Quadcore

[Plot: DFT (single precision) on 3 GHz 2 x Core 2 Extreme. Performance [Gflop/s] (0 to 30) versus input size (256 to 262144). Curves: Spiral 5.0 SPMD, Spiral 5.0 sequential, Intel IPP 5.0, FFTW 3.2 alpha SMP, FFTW 3.2 alpha sequential. Annotations: data L1$ resident, data L2$ resident, 2 processors, 4 processors.]

4-way vectorized + up to 4-threaded + adapted to the memory hierarchy

Summary

Libraries must be well‐tuned
Nightmare: locality, multicore, special instructions, …

Automating the tuning process is desirable
Computer‐generated code can beat hand‐tuned code

High‐level abstraction is key to performance portability
That does not imply “easy to program”

Research challenge: Automate more domains
Program generators must become part of the ecosystem

Acknowledgement: Spiral www.spiral.net

Joint work with
Yevgen Voronenko
Frédéric de Mesmay
Markus Püschel
… and the Spiral team (only part shown)

This work was supported by DARPA DESA program, NSF-NGS/ITR, NSF-ACR, Mercury Inc., and Intel
