Lost in Abstraction
New York Times, Thursday January 27, 1910
Not everyone is such a fan of abstraction:
SIMD and SIMT Code Generation for Visual Effects using Indexed Dependence Metadata
Paul Kelly
Group Leader, Software Performance Optimisation
Department of Computing
Imperial College London
Joint work with Jay Cornwall, Lee Howes, Anton Lokhmotov, Tony Field (Imperial) and Phil Parsonage and Bruno Nicoletti (The Foundry)
The Moore School Lectures
The first ever computer architecture conference
July 8th to August 31st 1946, at the Moore School of Electrical Engineering, University of Pennsylvania
Organised by Eckert and others in the summer he left academia (over an intellectual property dispute)
A defining moment in the history of computing
To have been there….
http://www.computerhistory.org/collections/accession/102657895
J Presper Eckert (1919-1995)
Co-inventor of, and chief engineer on, the ENIAC, arguably the first stored-program computer (first operational Feb 14th 1946)
27 tonnes, 150 kW, 5,000 cycles/sec
Picture shows the mercury-delay-line memory device of BINAC, the first stored-program computer in the US, and the world's first commercial digital computer (Eckert-Mauchly Computer Corp, 1949)
See also http://www.digital60.org/birth/themooreschool/lectures.html#l45
ENIAC was designed to be set up manually by plugging arithmetic units together
You could plug together quite complex configurations
Parallel – with multiple units working at the same time
The “big idea”: stored-program mode
Plug the units together to build a machine that fetches instructions from memory – and executes them
So any calculation could be set up completely automatically – just choose the right sequence of instructions
ENIAC: “setting up the machine”
http://www.columbia.edu/acis/history/eniac.html
The “von Neumann bottleneck”
The price to pay:
Stored-program mode was serial – one instruction at a time
How can we have our cake – and eat it:
Flexibility and ease of programming
Performance of parallelism
John Backus
www.post-gazette.com/pg/07080/771123-96.stm
John von Neumann
Wikipedia, http://www.lanl.gov/history/atomicbomb/images/NeumannL.GIF
The research challenge
But “It has been shown over and over again…” that this results in a system too complicated to use
How can we get the speed and efficiency without suffering the complexity?
What have we learned since 1946?
The research challenge
What have we learned since 1946?
Compilers and out-of-order processors can extract some instruction-level parallelism
Explicit parallel programming in MPI, OpenMP and VHDL is a flourishing industry – it can be made to work
SQL, TBB, Cilk, Ct (all functional…), many more speculative proposals
No attractive general-purpose solution
The research challenge
What have we learned since 1946?
Program generation…?
Case study: Visual Effects
• The Foundry is a London company building visual effects plug-ins for the movie/TV industry (http://www.thefoundry.co.uk/)
• Core competence: image processing algorithms
• Core value: large body of C++ code based on a library of image-based primitives
Opportunity 1: Competitive advantage from exploitation of whatever platform the customer may have – SSE, multicore, vendor libraries, GPUs
Opportunity 2: Redesign of The Foundry’s Image Processing Primitives Library
Risk: Premature optimisation delays delivery; performance hacking reduces the value of the core codebase
Case study: Visual Effects
The brief: Recommend an architecture for The Foundry’s library
That supports mapping onto diverse upcoming hardware
Single source code, from which high-performance code can be generated for many different classes of architecture
Visual effects in movie post-production
Nuke compositing tool (http://www.thefoundry.co.uk)
Visual effects plugins (Foundry and others) appear as nodes in the node graph
We aim to optimise individual effects for multicore CPUs, GPUs, etc.
In the future: tunnel optimisations across node boundaries at runtime.
(c) Heribert Raab, Softmachine. All rights reserved. Images courtesy of The Foundry
Visual effects: degrain example
Image degraining effect – a complete Foundry plug-in
Random texturing noise introduced by photographic film is removed without compromising the clarity of the picture, either through analysis or by matching against a database of known film grain patterns
Based on an undecimated wavelet transform
Up to several seconds per frame
Visual effects: degrain example
The recursive wavelet-based degraining visual effect in C++
Visual primitives are chained together via image temporaries to form a DAG
DAG construction is captured through delayed evaluation.
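Delayed-evaluation DAG capture can be sketched minimally as follows. All names here (`Node`, `source`, `op`, `depth`) are invented for illustration, not The Foundry's actual API: applying a primitive records a graph node instead of computing anything, so the whole composition is available to the compiler at once.

```cpp
#include <algorithm>
#include <initializer_list>
#include <memory>
#include <string>
#include <vector>

// A captured DAG node: the primitive's name plus its input images.
struct Node {
    std::string op;
    std::vector<std::shared_ptr<Node>> inputs;
};
using Handle = std::shared_ptr<Node>;

// An input image: a leaf node.
Handle source(const std::string& name) {
    return std::make_shared<Node>(Node{name, {}});
}

// "Applying" a primitive just records a node; evaluation is delayed.
Handle op(const std::string& name, std::initializer_list<Handle> in) {
    return std::make_shared<Node>(Node{name, in});
}

// Example graph query: longest producer chain in the captured DAG.
size_t depth(const Handle& n) {
    size_t d = 0;
    for (const auto& i : n->inputs) d = std::max(d, depth(i));
    return d + 1;
}
```

A chain like `op("add", {img, op("dwt", {img})})` builds a three-level DAG that a scheduler can later fuse and order as a whole.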
Indexed functor
A functor represents a function over an image
The kernel accesses the image via indexers
Indexers carry metadata that characterises the kernel’s data access pattern
One-dimensional discrete wavelet transform, as an indexed functor
Compilable with a standard C++ compiler
Operates in either the horizontal or vertical axis
The input indexer operates on RGB components separately
The input indexer accesses ±radius elements in one (the axis) dimension
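The indexed-functor idea can be sketched roughly like this; every name below (`Indexer1D`, `SmoothFunctor`, `apply`) is hypothetical, not The Foundry's real interface. The kernel only ever touches the image through the indexer, and the declared `radius` is the dependence metadata the source-to-source compiler would analyse.

```cpp
#include <vector>

// Hypothetical 1-D indexer: mediates all image access and carries the
// access-pattern metadata (here, the +/- radius the kernel may read).
struct Indexer1D {
    const std::vector<float>& img;
    int radius;  // metadata: in the real system this drives dependence analysis
    float operator()(int i, int off) const { return img[i + off]; }
};

// A 3-tap smoothing kernel written against the indexer, not the raw image.
struct SmoothFunctor {
    static constexpr int radius = 1;  // declared access pattern
    float operator()(const Indexer1D& in, int i) const {
        return (in(i, -1) + in(i, 0) + in(i, +1)) / 3.0f;
    }
};

// Driver loop: interior points only; boundaries keep their input values.
std::vector<float> apply(const std::vector<float>& img) {
    SmoothFunctor f;
    Indexer1D in{img, SmoothFunctor::radius};
    std::vector<float> out(img);
    for (int i = 1; i + 1 < static_cast<int>(img.size()); ++i) out[i] = f(in, i);
    return out;
}
```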
Software architecture
Use of indexed functors is optimised using a source-to-source compiler (based on ROSE, www.rosecompiler.org)
(Flow diagram) Stages: DAG capture; source code analysis; indexed functor kernels; functor composition DAG for the visual effect; indexed functor dependence metadata; DAG scheduling; polyhedral representation of the composite iteration space; schedule transformation – loop fusion; array contraction and scratchpad staging; SIMD/SIMT code generation; code generation; vendor compiler
Two generic targets
SIMD multicore CPU (×8 x86 cores, each with 4-lane SIMD; per-core caches; 4GB commodity DRAM):
Lots of cache per thread
Lower DRAM bandwidth

SIMT manycore GPU (×24 cores, each 32-lane SIMT with 32× SMT; scratchpad memory; 1GB highly-interleaved DRAM):
Very, very little cache per thread
Very small scratchpad RAM shared by blocks of threads
Higher DRAM bandwidth
Goal: single source code, high-performance code for multiple manycore architectures
Proof-of-concept: two targets
Very different, and needing very different optimisations
Fusing image filter loops
The key optimisation is loop fusion
A little tricky…for example:
“Stencil” loops are not directly fusable
for (i=1; i<N; i++) V[i] = (U[i-1] + U[i+1])/2
for (i=1; i<N; i++) W[i] = (V[i-1] + V[i+1])/2
Fusing image filter loops
We make them fusable by shifting:
V[1] = (U[0] + U[2])/2
for (i=2; i<N; i++) {
  V[i]   = (U[i-1] + U[i+1])/2
  W[i-1] = (V[i-2] + V[i])/2
}
W[N-1] = (V[N-2] + V[N])/2
The middle loop is fusable
We get lots of little edge bits
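The shift-and-fuse transformation can be checked mechanically. This standalone sketch (not the compiler's actual output) assumes U has N+1 elements and that unwritten boundary entries of V read as zero; under those assumptions the fused loop reproduces the two-loop result exactly.

```cpp
#include <vector>

// Original pair of stencil loops: V from U, then W from V.
std::vector<float> unfused(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    std::vector<float> V(N + 1, 0.0f), W(N + 1, 0.0f);
    for (int i = 1; i < N; ++i) V[i] = (U[i - 1] + U[i + 1]) / 2;
    for (int i = 1; i < N; ++i) W[i] = (V[i - 1] + V[i + 1]) / 2;
    return W;
}

// Shifted and fused: W lags V by one iteration, plus the edge bits.
std::vector<float> fused(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    std::vector<float> V(N + 1, 0.0f), W(N + 1, 0.0f);
    V[1] = (U[0] + U[2]) / 2;                  // prologue edge bit
    for (int i = 2; i < N; ++i) {              // the fusable middle loop
        V[i]     = (U[i - 1] + U[i + 1]) / 2;
        W[i - 1] = (V[i - 2] + V[i]) / 2;
    }
    W[N - 1] = (V[N - 2] + V[N]) / 2;          // epilogue edge bit
    return W;
}
```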
(Diagram: per-node shift factors such as (0,2) and (2,2), accumulating through the DAG)
We walk the dataflow graph and calculate the shift factor (in each dimension) required to enable fusion
Shift factors accumulate at each layer of the DAG
We build this shift factor into the execution schedule
Calculating shift factors
Wavelet-based degraining consists of 37 whole-image loop nests
Image size is smaller in later steps due to boundaries
Loop fusion leads to code explosion
Naively fusing these loops flattens the whole computation into one traversal
Some fragmentation, as not every loop body is applied at every point
Loop fusion leads to code explosion
For correctness, loops must be shifted before being collapsed
Much more fragmentation – one traversal, but a loop nest for each fragment
Loop fusion leads to code explosion
Array contraction
The benefit of loop fusion comes from array contraction – eliminating intermediate arrays:
V[1%4] = (U[0] + U[2])/2
for (i=2; i<N; i++) {
  V[i%4] = (U[i-1] + U[i+1])/2
  W[i-1] = (V[(i-2)%4] + V[i%4])/2
}
W[N-1] = (V[(N-2)%4] + V[N%4])/2
We need the last two Vs
We need 3 V locations, quicker to round up to four
Four-element contracted array, used as circular buffer
Occupies small chunk of cache, avoids trashing rest of cache
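A runnable sketch of the contraction, covering interior points only: the boundary "edge bits" would get separate loops in real generated code, so this comparison stops one point short of each edge. The circular buffer keeps the three live V values, rounded up to four slots so the modulo is cheap.

```cpp
#include <vector>

// Reference: full intermediate array V, interior W points only.
std::vector<float> reference(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    std::vector<float> V(N + 1, 0.0f), W(N + 1, 0.0f);
    for (int i = 1; i < N; ++i) V[i] = (U[i - 1] + U[i + 1]) / 2;
    for (int i = 1; i < N - 1; ++i) W[i] = (V[i - 1] + V[i + 1]) / 2;
    return W;
}

// Contracted: V lives in a 4-slot circular buffer instead of a full array.
std::vector<float> contracted(const std::vector<float>& U) {
    int N = static_cast<int>(U.size()) - 1;
    float V[4] = {0, 0, 0, 0};              // 3 live values, rounded up to 4
    std::vector<float> W(N + 1, 0.0f);
    V[1 % 4] = (U[0] + U[2]) / 2;
    for (int i = 2; i < N; ++i) {
        V[i % 4] = (U[i - 1] + U[i + 1]) / 2;
        W[i - 1] = (V[(i - 2) % 4] + V[i % 4]) / 2;  // slots i-2 and i still live
    }
    return W;
}
```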
The SIMD target…
Code generation for SIMD:
Aggressive loop fusion and array contraction
Using the CLooG code generator to generate the loop fragments
Vectorisation and scalar promotion
Correctness guaranteed by dependence metadata
If-conversion
Generate code to use masks to track conditionals
Memory access realignment:
In SIMD architectures where contiguous, aligned loads/stores are faster, placement of intermediate data is guided by metadata to make this so
Contracted load/store rescheduling:
Filters require mis-aligned SIMD loads
After contraction, these can straddle the end of the circular buffer – we need them to wrap around
We use a double-buffer trick…
Vector access to contracted arrays
Stores are made to two arrays, one shifted by 180° around the circular buffer, so data is not lost
Loads choose a safe array to read from
Filters require mis-aligned SIMD loads
After contraction, these can straddle the end of the circular buffer
SIMT – code generation for nVidia’s CUDA
Constant/shared memory staging
Where data needed by adjacent threads overlaps, we generate code to stage image sub-blocks in scratchpad memory
Maximising parallelism
Moving-average filters are common in VFX, and involve a loop-carried dependence
We catch this case with a special “eMoving” index type
We create enough threads to fill the machine, while efficiently computing a moving average within each thread
Coordinated coalesced memory access
We shift a kernel’s iteration space, if necessary, to arrange a thread-to-data mapping that satisfies the alignment requirements for high-bandwidth, coalesced access to global memory
We introduce transposes to achieve coalescing in horizontal moving-average filters
Choosing optimal scheduling parameters
Resource management and scheduling parameters are derived from indexed functor metadata, and used to select the optimal mapping of threads onto processors.
SIMT optimisations: staging
Shared memory staging
In a row-wise filter, each thread accesses data that overlaps with its neighbours
Wasteful to fetch from global memory
We generate code that coordinates fetching data into scratchpad memory
SIMT: maximising parallelism
Computes moving average along a column
Need more threads than columns
Split columns into chunks, re-initialise sum at each chunk
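The chunking idea can be sketched for a width-w moving sum over a single column; function names are invented, and the outer loop stands in for independent threads. The warm-up loop is the redundant per-chunk re-initialisation of the running sum that buys the extra parallelism.

```cpp
#include <algorithm>
#include <vector>

// Sequential version: one running sum carried down the whole column.
// out[i] is the sum of the last w elements ending at i.
std::vector<float> movingSumSeq(const std::vector<float>& col, int w) {
    std::vector<float> out(col.size(), 0.0f);
    float sum = 0.0f;
    for (int i = 0; i < static_cast<int>(col.size()); ++i) {
        sum += col[i];
        if (i >= w) sum -= col[i - w];
        out[i] = sum;
    }
    return out;
}

// Chunked version: each chunk (one "thread") re-initialises its own sum,
// redundantly re-reading up to w elements, then proceeds independently.
std::vector<float> movingSumChunked(const std::vector<float>& col,
                                    int w, int chunk) {
    int n = static_cast<int>(col.size());
    std::vector<float> out(n, 0.0f);
    for (int start = 0; start < n; start += chunk) {
        float sum = 0.0f;                               // re-initialised here
        for (int j = std::max(0, start - w); j < start; ++j)
            sum += col[j];                              // redundant warm-up
        for (int i = start; i < std::min(start + chunk, n); ++i) {
            sum += col[i];
            if (i >= w) sum -= col[i - w];
            out[i] = sum;
        }
    }
    return out;
}
```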
SIMT: coalesced access
In horizontal moving average, we want threads to run along rows
Adjacent threads access different rows – no spatial locality: no coalescing
Each thread occupies one of the 32 SIMD “lanes” which are issued together – called a “warp”
Here the threads in a warp are accessing different rows
Warp
Coalescing: transposition options
Several options:
Whole-image transpose
Transpose into global memory
Often one transpose is good for a sequence of filters

Transposed staging
Transposed block in shared (scratchpad) memory
Scratchpad is too small for this at present

Redundant vertical sweep
Execute the initialiser at every point
The functor is then fully parallel
Redundant additional work
Transpose Process Transpose
Performance results
Degrain: Performance results
All systems ran 64-bit Ubuntu Linux 8 with the Intel C/C++ Compiler 11.0, CUDA Toolkit 2.1 and 180-series NVIDIA graphics drivers. We used ICC flags “-O3 -xHost -no-prec-div -ansi-alias -restrict” and NVCC flag “-O3”.
GPU timings do not include host/device data transfers.
Images were stock photos, cropped or repeated to a set of industry-standard frame sizes, powers of two and prime numbers
In this example, CPU can beat a GPU
Because loop fusion eliminates DRAM bottleneck
Future work: loop fusion for the GPU!
Tesla C1060 (nVidia): 30 SMs, CC 1.3
GTX 260 (nVidia): 24 SMs, CC 1.3
8800 GTX (nVidia): 16 SMs, CC 1.0
Phenom 9650 (AMD): 4 cores
Xeon E5420 (Intel): 8 cores, two sockets, two Core2Duos per socket
C2D E6600 (Intel): 2-core Core2Duo
Diffusion filtering
In this example, GPUs always win
Loop fusion is not possible
So GPU DRAM bandwidth gives overwhelming advantage
8 cores are no better than 4 cores since bandwidth-limited
Loop fusion and SSE
Without loop fusion, SSE is of limited value – memory is the bottleneck
8-core Intel Xeon has less DRAM and L2 bandwidth per core, so benefits more from fusion
Older nVidia hardware was very sensitive to alignment of global memory accesses – not a problem with GTX260 and C1060
Staging and transposition are crucial for diffusion filtering
Degrain on CPUs - multicore scaling
Without fusion, multicore CPUs are almost useless
Conclusions
Domain-specific “active” library encapsulates specialist performance expertise
Separates higher-level long-term codebase from implementation details
Each new platform requires new performance tuning effort
Need assurance that future performance challenges can be met within the framework
So domain-specialists will be doing the performance tuning
Our challenge is to support them
Specific technical challenges
Generalise the indexed functors concept
AEcute access/execute descriptors
Automate and guide the search for optimal combinations of optimisations
Robustness…
Static/dynamic checking of dependence metadata
Test generation for optimisations
We have a specification… can we verify the optimisations statically?
What happens when you combine different active libraries?
Conclusions
Our ambitions for this work:
Proof-of-concept for a cross-platform accelerated computer vision library
“OpenGL for images”
Proof-of-concept for “active libraries”
Target other application domains
Proof-of-concept for indexed dependence metadata
OpenCL with automatic generation of data movement/scratchpad code
What we plan to do next:
Develop into commercial tools
Extend beyond pure image operations
E.g. extracting 3D, SLAM
Develop the indexed dependence metadata concept
AEcute: Access/Execute descriptors
Computational science applications
Finite-element, unstructured mesh, h-adaptive, p-adaptive…
Loop fusion for GPUs?
Making more effective use of the (coherent?) texture cache in GPUs
Power-performance tradeoffs