TRANSCRIPT
The end of programming
PetaQCD collaboration
1
End of Parallelism
• The era when we were in control of what gets executed where and how is pretty much over
• No modern compiler can generate efficient code for modern architectures
• Because they are SPMD machines of various widths
• So companies have to do something else
2
Warps
• To abstract and hide how programs are executed, NVIDIA came up with the thread warp: a bunch of threads executed at the same time on one part of the hardware
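The lockstep idea behind a warp can be sketched in plain C. Everything here is illustrative (a four-lane "warp" and two made-up branch paths; real NVIDIA warps are 32 threads), but it shows the key point: under divergence, both sides of a branch are evaluated and a per-lane mask selects the result.

```c
#define WARP 4  /* illustrative warp width; real NVIDIA warps are 32 threads */

/* Sketch of lockstep SIMT execution: every lane evaluates both sides
 * of the branch, and a per-lane mask picks the result per lane. */
void warp_step(const int in[WARP], int out[WARP]) {
    int mask[WARP], taken[WARP], not_taken[WARP];
    for (int lane = 0; lane < WARP; ++lane)
        mask[lane] = (in[lane] % 2 == 0);          /* branch condition */
    for (int lane = 0; lane < WARP; ++lane)
        taken[lane] = in[lane] / 2;                /* "even" path, all lanes */
    for (int lane = 0; lane < WARP; ++lane)
        not_taken[lane] = 3 * in[lane] + 1;        /* "odd" path, all lanes */
    for (int lane = 0; lane < WARP; ++lane)
        out[lane] = mask[lane] ? taken[lane] : not_taken[lane];
}
```

Note that both paths cost time even if only one lane takes them, which is exactly why divergent control flow is expensive on this execution model.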
3
ISPC
• Intel noticed that people don’t quite like CUDA
• Yet it knew that no compiler can figure out on its own where and how to parallelize
• So it abandoned the idea of having its own classic compiler
• And conceived a syntax that would appear user-friendly
• Wrote an LLVM (Low-Level Virtual Machine) frontend and a few backends
• Just like NVIDIA did.
4
Gangs
• So what does the brave new code look like?
    export void simple(uniform float vin[], uniform float vout[],
                       uniform int count) {
        foreach (index = 0 ... count) {
            float v = vin[index];
            if (v < 3.)
                v = v * v;
            else
                v = sqrt(v);
            vout[index] = v;
        }
    }

    float vin[16], vout[16];
    for (int i = 0; i < 16; ++i)
        vin[i] = i;
    simple(vin, vout, 16);
5
Gangs
• So instead of Thread Warps we have Gangs
• A bunch of program "instances" mapped onto SIMD lanes
• Typically up to twice the width of the SIMD unit
• And variables can be uniform (shared) or varying (one per instance) across the gang
• Atomic operations should be used for shared data inside a gang
• No threads, no context switching, just an army of marching ants
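The uniform/varying split can be emulated in plain C: a `uniform` value is one scalar shared by the whole gang, a `varying` value has one slot per program instance. The gang width and all names below are our illustrative assumptions, not ISPC internals:

```c
#define GANG 8  /* illustrative gang width; ISPC picks it per target ISA */

/* A varying value: one slot per program instance in the gang. */
typedef struct { float lane[GANG]; } varying_float;

/* A uniform scale times a varying input, reduced to a uniform sum.
 * Each gang instance contributes its own lane; the accumulator is
 * shared, which is the kind of access that needs care in real ISPC. */
float gang_scaled_sum(float uniform_scale, const varying_float *v) {
    float sum = 0.f;                 /* uniform accumulator */
    for (int i = 0; i < GANG; ++i)   /* one step per instance */
        sum += uniform_scale * v->lane[i];
    return sum;
}
```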
6
So what?
• Intel's ISPC is a poor man's CUDA
• Flow control is much simpler (everything proceeds uniformly)
• Obviously at the cost of performance
• And it is designed to make users believe things are simpler
• While all it really does is map code to program instances
• Which then get mapped to low-level functions
• So in principle ISPC can be used for any architecture
7
So we care?
• Yes we do.
• First, Intel confirmed that classic compilers are dead
• Second, it has shown us what a proper backend for AVX and SSE should look like (the code is open source):

    define <16 x double> @__gather_base_offsets64_double(i8 * %ptr,
            i32 %scale, <16 x i64> %offsets, <16 x i32> %mask32)
            nounwind readonly alwaysinline {
      ...
      %v1 = call <4 x double> @llvm.x86.avx2.gather.q.pd.256(
              <4 x double> undef, i8 * %ptr, <4 x i64> %offsets_1,
              <4 x double> %vecmask_1, i8 %scale8)
      ...
      assemble_4s(double, v, v1, v2, v3, v4)
      ret <16 x double> %v
    }
8
So what?
• Third, we get a glimpse of how Intel optimises. It doesn't.
• It just parses the code into an AST (abstract syntax tree) and walks it:

    ASTNode *
    WalkAST(ASTNode *node, ASTPreCallBackFunc preFunc,
            ASTPostCallBackFunc postFunc, void *data) {
        if (node == NULL)
            return node;

        // Call the callback function
        if (preFunc != NULL) {
            if (preFunc(node, data) == false)
                // The function asked us to not continue recursively, so stop.
                return node;
        }
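The pre/post-callback walk pattern shown above can be sketched on a toy node type. This `Node`/`WalkTree` pair is our simplification (ispc's real `ASTNode` has arbitrary children), but the control flow mirrors the snippet:

```c
#include <stddef.h>
#include <stdbool.h>

/* Toy binary node standing in for ispc's ASTNode. */
typedef struct Node { int value; struct Node *left, *right; } Node;

typedef bool (*PreFunc)(Node *, void *);   /* return false to stop recursing */
typedef void (*PostFunc)(Node *, void *);

Node *WalkTree(Node *node, PreFunc preFunc, PostFunc postFunc, void *data) {
    if (node == NULL)
        return node;
    /* Pre-order callback: it may veto recursion into this subtree. */
    if (preFunc != NULL && preFunc(node, data) == false)
        return node;
    WalkTree(node->left, preFunc, postFunc, data);
    WalkTree(node->right, preFunc, postFunc, data);
    /* Post-order callback after the children have been visited. */
    if (postFunc != NULL)
        postFunc(node, data);
    return node;
}
```

The same shape (visit, maybe veto, recurse, finish) is what ispc applies to every statement and expression before emitting IR.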
9
And...
• Generates IR
• And maps it to SIMD backends
    void
    AST::GenerateIR() {
        for (unsigned int i = 0; i < functions.size(); ++i)
            functions[i]->GenerateIR();
    }
10
So we have …
• An unstable, quickly developing utility
• Which may or may not be useful beyond AVX (no MIC support yet)
• Which is about as obscure as CUDA (but with source!)
• and not GPL.
11
Short term
• ISPC may be a valuable tool for investigating the efficiency of LLVM-based compilation for Intel architectures
• One may try to generate gang-compatible code
• Because it will obviously be superior to icc
• Which fails to vectorize properly
12
Medium term
• And we finally have access to the L1 prefetcher!
• NT (non-temporal) means the data will be discarded after use
    uniform int32 array[...];
    for (uniform int i = 0; i < count; ++i) {
        // do computation with array[i]
        prefetch_l1(&array[i + 32]);
    }

    void prefetch_{l1,l2,l3,nt}(void * uniform ptr)
    void prefetch_{l1,l2,l3,nt}(void * varying ptr)
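For comparison, the same software-prefetch pattern can be written in plain C with the GCC/Clang `__builtin_prefetch` builtin; the function name and the 32-element prefetch distance below are illustrative choices, not ISPC's:

```c
/* Stream over an array, prefetching 32 elements ahead into cache.
 * __builtin_prefetch(ptr, 0, 3): read access, high temporal locality
 * (roughly "keep it in L1"). Prefetching is a hint; results are
 * identical with or without it, only latency changes. */
long sum_with_prefetch(const int *array, int count) {
    long sum = 0;
    for (int i = 0; i < count; ++i) {
        if (i + 32 < count)
            __builtin_prefetch(&array[i + 32], /*rw=*/0, /*locality=*/3);
        sum += array[i];
    }
    return sum;
}
```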
13
Long term
• It is likely that in moving from QiralIR to ISPC to IntelIR we are losing information
• And certainly adding overhead
• So having our own WalkAST mapping to intel LLVM backends should be way better
• And we can also modify them for IBM SIMD
(remember, IBM also wants same model, but with different images instead of instances within one image)
• And maybe CUDA just as well
14
Conclusions
• All modern architectures are damn GPUs
• Divide variables into unique and uniform
• Give up control of what is executed when and how, to a varying degree
• Have either implicit or explicit barriers/sync
• Have gangs or thread warps as bundles of uniformly executing threads
• And LLVM seems to be the choice to deal with them.
15