Lost in Abstraction
New York Times, Thursday January 27, 1910
Not everyone is such a fan of abstraction:
SIMD and SIMT Code Generation for Visual Effects using Indexed Dependence Metadata
Paul Kelly
Group Leader, Software Performance Optimisation
Department of Computing
Imperial College London
Joint work with Jay Cornwall, Lee Howes, Anton Lokhmotov, Tony Field (Imperial) and Phil Parsonage and Bruno Nicoletti (The Foundry)
The Moore School Lectures
The first ever computer architecture conference
July 8th to August 31st 1946, at the Moore School of Electrical Engineering, University of Pennsylvania
Organised by Eckert and others in the summer he left academia (over an intellectual property dispute)
A defining moment in the history of computing
To have been there….
http://www.computerhistory.org/collections/accession/102657895
J Presper Eckert (1919-1995)
Co-inventor of, and chief engineer on, the ENIAC, arguably the first stored-program computer (first operational Feb 14th 1946)
27 tonnes, 150 kW, 5,000 cycles/sec
Picture shows the mercury-delay-line memory device of BINAC, the first stored-program computer in the US, and the world's first commercial digital computer (Eckert-Mauchly Computer Corp, 1949)
See also http://www.digital60.org/birth/themooreschool/lectures.html#l45
ENIAC was designed to be set up manually by plugging arithmetic units together
You could plug together quite complex configurations
Parallel – with multiple units working at the same time
The “big idea”: stored-program mode
Plug the units together to build a machine that fetches instructions from memory – and executes them
So any calculation could be set up completely automatically – just choose the right sequence of instructions
ENIAC: “setting up the machine”
http://www.columbia.edu/acis/history/eniac.html
The “von Neumann bottleneck”
The price to pay:
Stored-program mode was serial – one instruction at a time
How can we have our cake – and eat it:
Flexibility and ease of programming
Performance of parallelism
John Backus
www.post-gazette.com/pg/07080/771123-96.stm
John von Neumann
Wikipedia, http://www.lanl.gov/history/atomicbomb/images/NeumannL.GIF
The research challenge
But “It has been shown over and over again…” that this results in a system too complicated to use
How can we get the speed and efficiency without suffering the complexity?
What have we learned since 1946?
The research challenge
What have we learned since 1946?
Compilers and out-of-order processors can extract some instruction-level parallelism
Explicit parallel programming in MPI, OpenMP and VHDL is a flourishing industry – it can be made to work
SQL, TBB, Cilk, Ct (all functional…), many more speculative proposals
No attractive general-purpose solution
The research challenge
What have we learned since 1946?
Program generation…?
Case study: Visual Effects
• The Foundry is a London company building visual effects plug-ins for the movie/TV industry (http://www.thefoundry.co.uk/)
• Core competence: image processing algorithms
• Core value: large body of C++ code based on a library of image-based primitives
Opportunity 1: Competitive advantage from exploitation of whatever platform the customer may have – SSE, multicore, vendor libraries, GPUs
Opportunity 2: Redesign of The Foundry’s Image Processing Primitives Library
Risk: Premature optimisation delays delivery; performance hacking reduces the value of the core codebase
Case study: Visual Effects
The brief: Recommend an architecture for The Foundry’s library
That supports mapping onto diverse upcoming hardware
Single source code, from which high-performance code can be generated for many different classes of architecture
Visual effects in movie post-production
Nuke compositing tool (http://www.thefoundry.co.uk)
Visual effects plugins (Foundry and others) appear as nodes in the node graph
We aim to optimise individual effects for multicore CPUs, GPUs, etc.
In the future: tunnel optimisations across node boundaries at runtime.
(c) Heribert Raab, Softmachine. All rights reserved. Images courtesy of The Foundry
Visual effects: degrain example
Image degraining effect – a complete Foundry plug-in
Random texturing noise introduced by photographic film is removed without compromising the clarity of the picture, either through analysis or by matching against a database of known film grain patterns
Based on an undecimated wavelet transform
Up to several seconds per frame
Visual effects: degrain example
The recursive wavelet-based degraining visual effect in C++
Visual primitives are chained together via image temporaries to form a DAG
DAG construction is captured through delayed evaluation.
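Delayed-evaluation DAG capture can be sketched minimally as follows. All names here (`Node`, `source`, `op`, `depth`) are invented for illustration, not The Foundry's actual API: applying a primitive records a graph node instead of computing anything, so the whole composition is available to the compiler at once.

```cpp
#include <algorithm>
#include <initializer_list>
#include <memory>
#include <string>
#include <vector>

// A captured DAG node: the primitive's name plus its input images.
struct Node {
    std::string op;
    std::vector<std::shared_ptr<Node>> inputs;
};
using Handle = std::shared_ptr<Node>;

// An input image: a leaf node.
Handle source(const std::string& name) {
    return std::make_shared<Node>(Node{name, {}});
}

// "Applying" a primitive just records a node; evaluation is delayed.
Handle op(const std::string& name, std::initializer_list<Handle> in) {
    return std::make_shared<Node>(Node{name, in});
}

// Example graph query: longest producer chain in the captured DAG.
size_t depth(const Handle& n) {
    size_t d = 0;
    for (const auto& i : n->inputs) d = std::max(d, depth(i));
    return d + 1;
}
```

A chain like `op("add", {img, op("dwt", {img})})` builds a three-level DAG that a scheduler can later fuse and order as a whole.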
Indexed functor
A functor represents a function over an image
The kernel accesses the image via indexers
Indexers carry metadata that characterises the kernel’s data access pattern
One-dimensional discrete wavelet transform, as an indexed functor
Compilable with a standard C++ compiler
Operates in either the horizontal or vertical axis
The input indexer operates on RGB components separately
The input indexer accesses ±radius elements in one (the axis) dimension
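The indexed-functor idea can be sketched roughly like this; every name below (`Indexer1D`, `SmoothFunctor`, `apply`) is hypothetical, not The Foundry's real interface. The kernel only ever touches the image through the indexer, and the declared `radius` is the dependence metadata the source-to-source compiler would analyse.

```cpp
#include <vector>

// Hypothetical 1-D indexer: mediates all image access and carries the
// access-pattern metadata (here, the +/- radius the kernel may read).
struct Indexer1D {
    const std::vector<float>& img;
    int radius;  // metadata: in the real system this drives dependence analysis
    float operator()(int i, int off) const { return img[i + off]; }
};

// A 3-tap smoothing kernel written against the indexer, not the raw image.
struct SmoothFunctor {
    static constexpr int radius = 1;  // declared access pattern
    float operator()(const Indexer1D& in, int i) const {
        return (in(i, -1) + in(i, 0) + in(i, +1)) / 3.0f;
    }
};

// Driver loop: interior points only; boundaries keep their input values.
std::vector<float> apply(const std::vector<float>& img) {
    SmoothFunctor f;
    Indexer1D in{img, SmoothFunctor::radius};
    std::vector<float> out(img);
    for (int i = 1; i + 1 < static_cast<int>(img.size()); ++i) out[i] = f(in, i);
    return out;
}
```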
Software architecture
Use of indexed functors is optimised using a source-to-source compiler (based on ROSE, www.rosecompiler.org)
(Flow diagram) Stages: DAG capture; source code analysis; indexed functor kernels; functor composition DAG for the visual effect; indexed functor dependence metadata; DAG scheduling; polyhedral representation of the composite iteration space; schedule transformation – loop fusion; array contraction and scratchpad staging; SIMD/SIMT code generation; code generation; vendor compiler
Two generic targets
SIMD multicore CPU (×8 x86 cores, each with 4-lane SIMD; per-core caches; 4GB commodity DRAM):
Lots of cache per thread
Lower DRAM bandwidth

SIMT manycore GPU (×24 cores, each 32-lane SIMT with 32× SMT; scratchpad memory; 1GB highly-interleaved DRAM):
Very, very little cache per thread
Very small scratchpad RAM shared by blocks of threads
Higher DRAM bandwidth
Goal: single source code, high-performance code for multiple manycore architectures
Proof-of-concept: two targets
Very different, and needing very different optimisations
Fusing image filter loops
The key optimisation is loop fusion
A little tricky…for example:
“Stencil” loops are not directly fusable
for (i=1; i<N; i++) V[i] = (U[i-1] + U[i+1])/2
for (i=1; i<N; i++) W[i] = (V[i-1] + V[i+1])/2
Fusing image filter loops
We make them fusable by shifting:
V[1] = (U[0] + U[2])/2
for (i=2; i<N; i++) {
  V[i]   = (U[i-1] + U[i+1])/2
  W[i-1] = (V[i-2] + V[i])/2
}
W[N-1] = (V[N-2] + V[N])/2
The middle loop is fusable
We get lots of little edge bits
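The shift-and-fuse transformation can be checked mechanically. This standalone sketch (not the compiler's actual output) assumes U has N+1 elements and that unwritten boundary entries of V read as zero; under those assumptions the fused loop reproduces the two-loop result exactly.

```cpp
#include <vector>

// Original pair of stencil loops: V from U, then W from V.
std::vector<float> unfused(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    std::vector<float> V(N + 1, 0.0f), W(N + 1, 0.0f);
    for (int i = 1; i < N; ++i) V[i] = (U[i - 1] + U[i + 1]) / 2;
    for (int i = 1; i < N; ++i) W[i] = (V[i - 1] + V[i + 1]) / 2;
    return W;
}

// Shifted and fused: W lags V by one iteration, plus the edge bits.
std::vector<float> fused(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    std::vector<float> V(N + 1, 0.0f), W(N + 1, 0.0f);
    V[1] = (U[0] + U[2]) / 2;                  // prologue edge bit
    for (int i = 2; i < N; ++i) {              // the fusable middle loop
        V[i]     = (U[i - 1] + U[i + 1]) / 2;
        W[i - 1] = (V[i - 2] + V[i]) / 2;
    }
    W[N - 1] = (V[N - 2] + V[N]) / 2;          // epilogue edge bit
    return W;
}
```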
(Diagram: per-node shift factors such as (0,2) and (2,2), accumulating through the DAG)
We walk the dataflow graph and calculate the shift factor (in each dimension) required to enable fusion
Shift factors accumulate at each layer of the DAG
We build this shift factor into the execution schedule
Calculating shift factors
Wavelet-based degraining consists of 37 whole-image loop nests
Image size is smaller in later steps due to boundaries
Loop fusion leads to code explosion
Naively fusing these loops flattens the whole computation into one traversal
Some fragmentation, as not every loop body is applied at every point
Loop fusion leads to code explosion
For correctness, loops must be shifted before being collapsed
Much more fragmentation – one traversal, but a loop nest for each fragment
Loop fusion leads to code explosion
Array contraction
The benefit of loop fusion comes from array contraction – eliminating intermediate arrays:
V[1%4] = (U[0] + U[2])/2
for (i=2; i<N; i++) {
  V[i%4] = (U[i-1] + U[i+1])/2
  W[i-1] = (V[(i-2)%4] + V[i%4])/2
}
W[N-1] = (V[(N-2)%4] + V[N%4])/2
We need the last two Vs
We need 3 V locations, quicker to round up to four
Four-element contracted array, used as circular buffer
Occupies small chunk of cache, avoids trashing rest of cache
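A runnable sketch of the contraction, covering interior points only: the boundary "edge bits" would get separate loops in real generated code, so this comparison stops one point short of each edge. The circular buffer keeps the three live V values, rounded up to four slots so the modulo is cheap.

```cpp
#include <vector>

// Reference: full intermediate array V, interior W points only.
std::vector<float> reference(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    std::vector<float> V(N + 1, 0.0f), W(N + 1, 0.0f);
    for (int i = 1; i < N; ++i) V[i] = (U[i - 1] + U[i + 1]) / 2;
    for (int i = 1; i < N - 1; ++i) W[i] = (V[i - 1] + V[i + 1]) / 2;
    return W;
}

// Contracted: V lives in a 4-slot circular buffer instead of a full array.
std::vector<float> contracted(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    float V[4] = {0, 0, 0, 0};              // 3 live values, rounded up to 4
    std::vector<float> W(N + 1, 0.0f);
    V[1 % 4] = (U[0] + U[2]) / 2;
    for (int i = 2; i < N; ++i) {
        V[i % 4] = (U[i - 1] + U[i + 1]) / 2;
        W[i - 1] = (V[(i - 2) % 4] + V[i % 4]) / 2;  // slots i-2 and i still live
    }
    return W;
}
```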
The SIMD target…
Code generation for SIMD:
Aggressive loop fusion and array contraction
Using the CLooG code generator to generate the loop fragments
Vectorisation and scalar promotion
Correctness guaranteed by dependence metadata
If-conversion
Generate code to use masks to track conditionals
Memory access realignment:
In SIMD architectures where contiguous, aligned loads/stores are faster, placement of intermediate data is guided by metadata to make this so
Contracted load/store rescheduling:
Filters require mis-aligned SIMD loads
After contraction, these can straddle the end of the circular buffer – we need them to wrap around
We use a double-buffer trick…
Vector access to contracted arrays
Stores are made to two arrays, one shifted by 180° around the circular buffer, so data is not lost
Loads choose a safe array to read from
Filters require mis-aligned SIMD loads
After contraction, these can straddle the end of the circular buffer
SIMT – code generation for nVidia’s CUDA
Constant/shared memory staging
Where data needed by adjacent threads overlaps, we generate code to stage image sub-blocks in scratchpad memory
Maximising parallelism
Moving-average filters are common in VFX, and involve a loop-carried dependence
We catch this case with a special “eMoving” index type
We create enough threads to fill the machine, while efficiently computing a moving average within each thread
Coordinated coalesced memory access
We shift a kernel’s iteration space, if necessary, to arrange a thread-to-data mapping that satisfies the alignment requirements for high-bandwidth, coalesced access to global memory
We introduce transposes to achieve coalescing in horizontal moving-average filters
Choosing optimal scheduling parameters
Resource management and scheduling parameters are derived from indexed functor metadata, and used to select the optimal mapping of threads onto processors.
SIMT optimisations: staging
Shared memory staging
In a row-wise filter, each thread accesses data that overlaps with its neighbours
Wasteful to fetch from global memory
We generate code that coordinates fetching data into scratchpad memory
SIMT: maximising parallelism
Computes moving average along a column
Need more threads than columns
Split columns into chunks, re-initialise sum at each chunk
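The chunking idea can be sketched for a width-w moving sum over a single column; function names are invented, and the outer loop stands in for independent threads. The warm-up loop is the redundant per-chunk re-initialisation of the running sum that buys the extra parallelism.

```cpp
#include <algorithm>
#include <vector>

// Sequential version: one running sum carried down the whole column.
// out[i] is the sum of the last w elements ending at i.
std::vector<float> movingSumSeq(const std::vector<float>& col, int w) {
    std::vector<float> out(col.size(), 0.0f);
    float sum = 0.0f;
    for (int i = 0; i < static_cast<int>(col.size()); ++i) {
        sum += col[i];
        if (i >= w) sum -= col[i - w];
        out[i] = sum;
    }
    return out;
}

// Chunked version: each chunk (one "thread") re-initialises its own sum,
// redundantly re-reading up to w elements, then proceeds independently.
std::vector<float> movingSumChunked(const std::vector<float>& col,
                                    int w, int chunk) {
    int n = static_cast<int>(col.size());
    std::vector<float> out(n, 0.0f);
    for (int start = 0; start < n; start += chunk) {
        float sum = 0.0f;                               // re-initialised here
        for (int j = std::max(0, start - w); j < start; ++j)
            sum += col[j];                              // redundant warm-up
        for (int i = start; i < std::min(start + chunk, n); ++i) {
            sum += col[i];
            if (i >= w) sum -= col[i - w];
            out[i] = sum;
        }
    }
    return out;
}
```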
SIMT: coalesced access
In horizontal moving average, we want threads to run along rows
Adjacent threads access different rows – no spatial locality: no coalescing
Each thread occupies one of the 32 SIMD “lanes” which are issued together – called a “warp”
Here the threads in a warp are accessing different rows
Warp
Coalescing: transposition options
Several options:
Whole-image transpose
Transpose into global memory
Often one transpose is good for a sequence of filters

Transposed staging
Transposed block in shared (scratchpad) memory
Scratchpad is too small for this at present

Redundant vertical sweep
Execute the initialiser at every point
The functor is then fully parallel
Redundant additional work
Transpose Process Transpose
Performance results
Degrain: Performance results
All systems ran 64-bit Ubuntu Linux 8 with the Intel C/C++ Compiler 11.0, CUDA Toolkit 2.1 and 180-series NVIDIA graphics drivers. We used ICC flags “-O3 -xHost -no-prec-div -ansi-alias -restrict” and NVCC flag “-O3”.
GPU timings do not include host/device data transfers.
Images were stock photos, cropped or repeated to a set of industry-standard frame sizes, powers of two and prime numbers
In this example, CPU can beat a GPU
Because loop fusion eliminates DRAM bottleneck
Future work: loop fusion for the GPU!
Tesla C1060 (nVidia): 30 SMs, CC 1.3
GTX 260 (nVidia): 24 SMs, CC 1.3
8800 GTX (nVidia): 16 SMs, CC 1.0
Phenom 9650 (AMD): 4 cores
Xeon E5420 (Intel): 8 cores, two sockets, two Core2Duos per socket
C2D E6600 (Intel): 2-core Core2Duo
Diffusion filtering
In this example, GPUs always win
Loop fusion is not possible
So GPU DRAM bandwidth gives overwhelming advantage
8 cores are no better than 4 cores since bandwidth-limited
Loop fusion and SSE
Without loop fusion, SSE is of limited value – memory is the bottleneck
8-core Intel Xeon has less DRAM and L2 bandwidth per core, so benefits more from fusion
Older nVidia hardware was very sensitive to alignment of global memory accesses – not a problem with GTX260 and C1060
Staging and transposition are crucial for diffusion filtering
Degrain on CPUs - multicore scaling
Without fusion, multicore CPUs are almost useless
Conclusions
Domain-specific “active” library encapsulates specialist performance expertise
Separates higher-level long-term codebase from implementation details
Each new platform requires new performance tuning effort
Need assurance that future performance challenges can be met within the framework
So domain-specialists will be doing the performance tuning
Our challenge is to support them
Specific technical challenges
Generalise the indexed functors concept
AEcute access/execute descriptors
Automate and guide the search for optimal combinations of optimisations
Robustness…
Static/dynamic checking of dependence metadata
Test generation for optimisations
We have a specification… can we verify the optimisations statically?
What happens when you combine different active libraries?
Conclusions
Our ambitions for this work:
Proof-of-concept for a cross-platform accelerated computer vision library
“OpenGL for images”
Proof-of-concept for “active libraries”
Target other application domains
Proof-of-concept for indexed dependence metadata
OpenCL with automatic generation of data movement/scratchpad code
What we plan to do next:
Develop into commercial tools
Extend beyond pure image operations
E.g. extracting 3D, SLAM
Develop the indexed dependence metadata concept
AEcute: Access/Execute descriptors
Computational science applications
Finite-element, unstructured mesh, h-adaptive, p-adaptive…
Loop fusion for GPUs?
Making more effective use of the (coherent?) texture cache in GPUs
Power-performance tradeoffs