TRANSCRIPT
The end of programming
PetaQCD collaboration
1
End of Parallelism
• The era when we were in control of what gets executed where and how is pretty much over
• No modern compiler can generate efficient code for modern architectures
• Because they are SPMD machines of various widths
• So companies have to do something else
2
Warps
• To abstract and hide how programs are executed, NVIDIA came up with the thread warp: a bunch of threads executed at the same time on one part of the hardware
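The lockstep idea behind a warp can be sketched in plain C. Everything here is illustrative (a four-lane "warp" and two made-up branch paths; real NVIDIA warps are 32 threads), but it shows the key point: under divergence, both sides of a branch are evaluated and a per-lane mask selects the result.

```c
#define WARP 4  /* illustrative warp width; real NVIDIA warps are 32 threads */

/* Sketch of lockstep SIMT execution: every lane evaluates both sides
 * of the branch, and a per-lane mask picks the result per lane. */
void warp_step(const int in[WARP], int out[WARP]) {
    int mask[WARP], taken[WARP], not_taken[WARP];
    for (int lane = 0; lane < WARP; ++lane)
        mask[lane] = (in[lane] % 2 == 0);          /* branch condition */
    for (int lane = 0; lane < WARP; ++lane)
        taken[lane] = in[lane] / 2;                /* "even" path, all lanes */
    for (int lane = 0; lane < WARP; ++lane)
        not_taken[lane] = 3 * in[lane] + 1;        /* "odd" path, all lanes */
    for (int lane = 0; lane < WARP; ++lane)
        out[lane] = mask[lane] ? taken[lane] : not_taken[lane];
}
```

Note that both paths cost time even if only one lane takes them, which is exactly why divergent control flow is expensive on this execution model.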
3
ISPC
• Intel noticed that people don’t quite like CUDA
• Yet it knew that no compiler can figure out on its own where and how to parallelize
• So it abandoned the idea of having its own classic compiler
• And conceived a syntax that would appear user-friendly
• Wrote an LLVM (Low-Level Virtual Machine) frontend and a few backends
• Just like NVIDIA did.
4
Gangs
• So what does the brave new code look like?
    export void simple(uniform float vin[], uniform float vout[],
                       uniform int count) {
        foreach (index = 0 ... count) {
            float v = vin[index];
            if (v < 3.)
                v = v * v;
            else
                v = sqrt(v);
            vout[index] = v;
        }
    }

    float vin[16], vout[16];
    for (int i = 0; i < 16; ++i)
        vin[i] = i;
    simple(vin, vout, 16);
5
Gangs
• So instead of Thread Warps we have Gangs
• A bunch of program "instances" mapped onto SIMD lanes
• Typically up to twice the width of the SIMD unit
• And variables can be uniform (shared) or varying (one per instance) across the gang
• Atomic operations should be used for shared data inside a gang
• No threads, no context switching, just an army of marching ants
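The uniform/varying split can be emulated in plain C: a `uniform` value is one scalar shared by the whole gang, a `varying` value has one slot per program instance. The gang width and all names below are our illustrative assumptions, not ISPC internals:

```c
#define GANG 8  /* illustrative gang width; ISPC picks it per target ISA */

/* A varying value: one slot per program instance in the gang. */
typedef struct { float lane[GANG]; } varying_float;

/* A uniform scale times a varying input, reduced to a uniform sum.
 * Each gang instance contributes its own lane; the accumulator is
 * shared, which is the kind of access that needs care in real ISPC. */
float gang_scaled_sum(float uniform_scale, const varying_float *v) {
    float sum = 0.f;                 /* uniform accumulator */
    for (int i = 0; i < GANG; ++i)   /* one step per instance */
        sum += uniform_scale * v->lane[i];
    return sum;
}
```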
6
So what?
• Intel's ISPC is a poor man's CUDA
• Flow control is much simpler (everything proceeds uniformly)
• Obviously at the cost of performance
• And it is designed to make users believe things are simpler
• While all it really does is map code to program instances
• Which then get mapped to low-level functions
• So in principle ISPC can be used for any architecture
7
So we care?
• Yes we do.
• First, Intel confirmed that classic compilers are dead
• Second, it has shown us what a proper backend for AVX and SSE should look like (the code is open source):

    define <16 x double> @__gather_base_offsets64_double(i8 * %ptr,
            i32 %scale, <16 x i64> %offsets, <16 x i32> %mask32)
            nounwind readonly alwaysinline {
      ...
      %v1 = call <4 x double> @llvm.x86.avx2.gather.q.pd.256(
              <4 x double> undef, i8 * %ptr, <4 x i64> %offsets_1,
              <4 x double> %vecmask_1, i8 %scale8)
      ...
      assemble_4s(double, v, v1, v2, v3, v4)
      ret <16 x double> %v
    }
8
So what?
• Third, we get a glimpse of how Intel optimises. It doesn't.
• It just parses the code into an AST (abstract syntax tree) and walks it:

    ASTNode *
    WalkAST(ASTNode *node, ASTPreCallBackFunc preFunc,
            ASTPostCallBackFunc postFunc, void *data) {
        if (node == NULL)
            return node;

        // Call the callback function
        if (preFunc != NULL) {
            if (preFunc(node, data) == false)
                // The function asked us to not continue recursively, so stop.
                return node;
        }
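The pre/post-callback walk pattern shown above can be sketched on a toy node type. This `Node`/`WalkTree` pair is our simplification (ispc's real `ASTNode` has arbitrary children), but the control flow mirrors the snippet:

```c
#include <stddef.h>
#include <stdbool.h>

/* Toy binary node standing in for ispc's ASTNode. */
typedef struct Node { int value; struct Node *left, *right; } Node;

typedef bool (*PreFunc)(Node *, void *);   /* return false to stop recursing */
typedef void (*PostFunc)(Node *, void *);

Node *WalkTree(Node *node, PreFunc preFunc, PostFunc postFunc, void *data) {
    if (node == NULL)
        return node;
    /* Pre-order callback: it may veto recursion into this subtree. */
    if (preFunc != NULL && preFunc(node, data) == false)
        return node;
    WalkTree(node->left, preFunc, postFunc, data);
    WalkTree(node->right, preFunc, postFunc, data);
    /* Post-order callback after the children have been visited. */
    if (postFunc != NULL)
        postFunc(node, data);
    return node;
}
```

The same shape (visit, maybe veto, recurse, finish) is what ispc applies to every statement and expression before emitting IR.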
9
And...
• Generates IR
• And maps it to SIMD backends
    void
    AST::GenerateIR() {
        for (unsigned int i = 0; i < functions.size(); ++i)
            functions[i]->GenerateIR();
    }
10
So we have …
• An unstable, quickly developing utility
• Which may or may not be useful beyond AVX (no MIC support yet)
• Which is about as obscure as CUDA (but with source!)
• and not GPL.
11
Short term
• ISPC may be a valuable tool for investigating the efficiency of LLVM-based compilation for Intel architectures
• One may try to generate gang-compatible code
• Because it will obviously be superior to icc
• Which fails to vectorize properly
12
Medium term
• And we finally have access to the L1 prefetcher!
• NT (non-temporal) means the data will be discarded after use
    uniform int32 array[...];
    for (uniform int i = 0; i < count; ++i) {
        // do computation with array[i]
        prefetch_l1(&array[i + 32]);
    }

    void prefetch_{l1,l2,l3,nt}(void * uniform ptr)
    void prefetch_{l1,l2,l3,nt}(void * varying ptr)
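For comparison, the same software-prefetch pattern can be written in plain C with the GCC/Clang `__builtin_prefetch` builtin; the function name and the 32-element prefetch distance below are illustrative choices, not ISPC's:

```c
/* Stream over an array, prefetching 32 elements ahead into cache.
 * __builtin_prefetch(ptr, 0, 3): read access, high temporal locality
 * (roughly "keep it in L1"). Prefetching is a hint; results are
 * identical with or without it, only latency changes. */
long sum_with_prefetch(const int *array, int count) {
    long sum = 0;
    for (int i = 0; i < count; ++i) {
        if (i + 32 < count)
            __builtin_prefetch(&array[i + 32], /*rw=*/0, /*locality=*/3);
        sum += array[i];
    }
    return sum;
}
```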
13
Long term
• It is likely that in moving from QiralIR to ISPC to IntelIR we are losing information
• And certainly adding overhead
• So having our own WalkAST mapping to intel LLVM backends should be way better
• And we can also modify them for IBM SIMD
(remember, IBM also wants same model, but with different images instead of instances within one image)
• And maybe CUDA just as well
14
Conclusions
• All modern architectures are damn GPUs
• Divide variables into unique and uniform
• Give up control of what is executed when and how, to a varying degree
• Have either implicit or explicit barriers/sync
• Have gangs or thread warps as bundles of uniformly executing threads
• And LLVM seems to be the choice to deal with them.
15