
Page 1

The end of programming

PetaQCD collaboration

Page 2

End of Parallelith

• The era when we were in control of what gets executed where and how is pretty much over

• No modern compiler can generate efficient code for modern architectures

• Because they are SPMD machines of various widths

• So companies have to do something else

Page 3

Warps

• To abstract and hide the execution of programs, NVIDIA came up with the Thread Warp: a bunch of threads executed at the same time on some part of the hardware

Page 4

ISPC

• Intel noticed that people don’t quite like CUDA

• Yet realized that no compiler can figure out on its own where and how to parallelize

• So it abandoned the idea of having its own compiler

• And conceived a syntax which would appear user-friendly

• Wrote an LLVM (Low Level Virtual Machine) frontend and a few backends

• Just like NVIDIA did.

Page 5

Gangs

• So what does brave new code look like?

export void simple(uniform float vin[], uniform float vout[],
                   uniform int count) {
    foreach (index = 0 ... count) {
        float v = vin[index];
        if (v < 3.)
            v = v * v;
        else
            v = sqrt(v);
        vout[index] = v;
    }
}

float vin[16], vout[16];
for (int i = 0; i < 16; ++i)
    vin[i] = i;

simple(vin, vout, 16);
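As a usage note (ours, not from the deck): the kernel is typically compiled separately, e.g. ispc simple.ispc -o simple.o -h simple.h; the -h flag emits a C/C++ header declaring simple(), so the scalar driver above compiles and links like ordinary code.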

Page 6

Gangs

• So instead of Thread Warps we have Gangs

• A bunch of program « instances » mapped onto SIMD lanes

• Typically the SIMD width or twice it

• And variables can be shared (uniform) or per-instance (varying) within a gang

• Atomic operations should be used for data shared inside a gang (see the sketch below)

• No threads, no context switching, just an army of marching ants
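A minimal sketch of the split (function and variable names are ours, not from the deck): a uniform variable has one copy shared by the whole gang, a varying one has a copy per program instance, and the per-instance values can be collapsed with a cross-instance reduction such as reduce_add() from the ISPC standard library.

// uniform = one copy per gang; varying (the default) = one copy
// per program instance.
export uniform float gang_sum(uniform float data[], uniform int n) {
    uniform float scale = 0.5;   // uniform: shared by the whole gang
    float partial = 0.;          // varying: one accumulator per instance
    foreach (i = 0 ... n)
        partial += data[i] * scale;
    // Collapse the per-instance partial sums into one uniform value.
    return reduce_add(partial);
}

When instances must update genuinely shared state instead, the stdlib atomic_* routines mentioned above are the tool; for a plain reduction, reduce_add() avoids them altogether.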

Page 7

So what?

• Intel’s ISPC is a poor man’s CUDA

• Flow control is much simpler (things go uniformly)

• Obviously at the cost of performance

• And it is designed to deceive users into thinking it is simpler

• While all it does is map code to instances

• Which then get mapped to low-level functions

• So in principle ISPC can be used for any architecture

Page 8

So, do we care?

• Yes we do.

• First, Intel confirmed that classic compilers are dead

• Second, it has shown us what a proper backend for AVX and SSE should look like (open-source code):

define <16 x double> @__gather_base_offsets64_double(i8 * %ptr,
        i32 %scale, <16 x i64> %offsets, <16 x i32> %mask32)
        nounwind readonly alwaysinline {
    ...
    %v1 = call <4 x double> @llvm.x86.avx2.gather.q.pd.256(
        <4 x double> undef, i8 * %ptr, <4 x i64> %offsets_1,
        <4 x double> %vecmask_1, i8 %scale8)
    ...
    assemble_4s(double, v, v1, v2, v3, v4)
    ret <16 x double> %v
}
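For illustration, a sketch (ours, not from the deck) of the kind of ISPC source that ends up in such a builtin: an indirect load through a varying index is exactly what the backend lowers to a gather.

// A varying index into an array forces a gather, i.e. a call to a
// __gather_* builtin like the one above.
export uniform double gather_sum(uniform double a[], uniform int idx[],
                                 uniform int n) {
    double partial = 0.;
    foreach (i = 0 ... n)
        partial += a[idx[i]];    // indirect load => gather
    return reduce_add(partial);
}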

Page 9

So what?

• Third, we have a glimpse at how Intel optimises. It doesn’t.

• It just parses the code into an AST (abstract syntax tree, the frontend’s basic program representation) and walks it:

ASTNode *
WalkAST(ASTNode *node, ASTPreCallBackFunc preFunc,
        ASTPostCallBackFunc postFunc, void *data) {
    if (node == NULL)
        return node;

    // Call the callback function
    if (preFunc != NULL) {
        if (preFunc(node, data) == false)
            // The function asked us to not continue recursively, so stop.
            return node;
    }
    ...

Page 10

And...

• Generates IR

• And maps it to SIMD backends

void
AST::GenerateIR() {
    for (unsigned int i = 0; i < functions.size(); ++i)
        functions[i]->GenerateIR();
}

Page 11

So we have …

• An unstable, quickly developing utility

• Which may or may not be actually useful beyond AVX (no MIC support yet)

• Which is about as obscure as CUDA (but with source!)

• And not GPL.

Page 12

Short term

• ISPC may be a valuable tool to investigate the efficiency of an LLVM-based compiler for Intel architectures

• One may try to generate gang-compatible code (see the sketch below)

• Because it will obviously be superior to icc

• Which fails to vectorize properly
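For concreteness, a hypothetical kernel (names ours) showing what "gang-compatible code" can look like: a straight-line foreach loop that ispc maps onto SIMD lanes directly, with none of the legality analysis that icc's auto-vectorizer has to get right first.

// Trivially gang-compatible code: one program instance per element.
export void axpy(uniform float a, uniform float x[],
                 uniform float y[], uniform int n) {
    foreach (i = 0 ... n)
        y[i] = a * x[i] + y[i];
}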

Page 13

Medium term

• And we finally have access to the L1 prefetcher!

• NT (non-temporal) means the data will be discarded after use

uniform int32 array[...];
for (uniform int i = 0; i < count; ++i) {
    // do computation with array[i]
    prefetch_l1(&array[i + 32]);
}

void prefetch_{l1,l2,l3,nt}(void * uniform ptr)
void prefetch_{l1,l2,l3,nt}(void * varying ptr)

Page 14

Long term

• It is likely that in moving from QiralIR to ISPC to IntelIR we are losing information

• And certainly adding overhead

• So having our own WalkAST mapping to Intel’s LLVM backends should be way better

• And we can also modify them for IBM SIMD

(remember, IBM also wants the same model, but with distinct images instead of instances within one image)

• And maybe CUDA just as well

Page 15

Conclusions

• All modern architectures are damn GPUs

• Divide variables into unique (varying) and uniform

• Give up control of what is executed when and how, to a varying degree

• Have either implicit or explicit barriers/sync

• Have gangs/thread warps as a sort of uniform threads

• And LLVM seems to be the tool of choice to deal with them.
