
Computer Architecture for Medical Applications

Pipelining & Single Instruction Multiple Data – two driving factors of single-core performance

Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center

Dietmar Fey, Department for Computer Science

30 April 2013


A different view on computer architecture


From high level code to macro-/microcode execution


  sum = 0.d0
  do i=1, N
    sum = sum + A(i)
  enddo
  …

[Figure: the compiler translates the loop into machine code. Annotations: A(i) (incl. LOAD); sum held in register xmm1; i (loop counter); N; ADD adds the 1st argument to the 2nd argument and stores the result in the 2nd argument – performed by the ADD execution unit.]

How does high level code interact with execution units?

Many hardware execution units:
• LOAD (STORE): move operands from L1 cache to register (from register to memory)
• Floating Point (FP) MULTIPLY and ADD
• Various integer units

Execution units may work in parallel → “superscalar” processor

Two important concepts at the hardware level: pipelining + SIMD

  sum = 0.d0
  do i=1, N
    sum = sum + A(i)
  enddo
  …


Microprocessors – Pipelining


Introduction: Moore’s law

1965: G. Moore claimed that the number of transistors on a processor chip doubles every 12–24 months.

Intel Nehalem EX: 2.3 billion transistors; NVIDIA Fermi: 3 billion

Introduction: Moore’s law – faster cycles and beyond

[Figure: Intel x86 clock speed by year, logarithmic frequency axis from 0.1 MHz to 10000 MHz]

• Moore’s law: transistors are getting smaller → run them faster
• Faster clock speed → reduce complexity of instruction execution → pipelining of instructions

Increasing transistor count and clock speed allows / requires architectural changes:
• Pipelining
• Superscalarity
• SIMD / vector ops
• Multi-core / threading
• Complex on-chip caches


Pipelining of arithmetic/functional units

• Idea:
  – Split a complex instruction into several simple / fast steps (stages)
  – Each step takes the same amount of time, e.g. a single cycle
  – Execute different steps of different instructions at the same time (in parallel)

• Allows for shorter cycle times (simpler logic circuits), e.g.:
  – a floating point multiplication takes 5 cycles, but
  – the processor can work on 5 different multiplications simultaneously
  – one result per cycle once the pipeline is full

• Drawbacks:
  – The pipeline must be filled → startup times (#instructions >> pipeline stages)
  – Efficient use of pipelines requires a large number of independent instructions → instruction level parallelism
  – Requires complex instruction scheduling by the compiler/hardware → software pipelining / out-of-order execution

• Pipelining is widely used in modern computer architectures


Interlude: Possible stages for Multiply

Real numbers can be represented as mantissa and exponent in a “normalized” representation, e.g. s * 0.m * 10^e with
• sign s = {-1, 1}
• mantissa m, which does not contain 0 in the leading digit
• exponent e, some positive or negative integer

Multiply two real numbers r1 * r2 = r3, with r1 = s1 * 0.m1 * 10^e1 and r2 = s2 * 0.m2 * 10^e2:

  s1 * 0.m1 * 10^e1 * s2 * 0.m2 * 10^e2
    = (s1*s2) * (0.m1*0.m2) * 10^(e1+e2)

Normalize the result: s3 * 0.m3 * 10^e3
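To make the stage decomposition concrete, here is a minimal C sketch of the five steps on a toy sign/mantissa/exponent representation (an illustration of the arithmetic above, not of real binary FP hardware):

  #include <stdio.h>
  #include <math.h>

  /* Toy normalized representation r = s * 0.m * 10^e, with s in {-1,1}
     and 0.1 <= m < 1. Stage 1 ("separate mantissa/exponent") is implicit
     in the struct layout. */
  typedef struct { int s; double m; int e; } Real;

  Real fp_mult(Real a, Real b) {
      Real r;
      r.m = a.m * b.m;          /* stage 2: multiply mantissas */
      r.e = a.e + b.e;          /* stage 3: add exponents */
      while (r.m < 0.1) {       /* stage 4: normalize result */
          r.m *= 10.0; r.e -= 1;
      }
      r.s = a.s * b.s;          /* stage 5: insert sign */
      return r;
  }

  int main(void) {
      Real r1 = { 1, 0.5,  1};  /*  5.0 */
      Real r2 = {-1, 0.25, 1};  /* -2.5 */
      Real r3 = fp_mult(r1, r2);
      printf("%g\n", r3.s * r3.m * pow(10.0, r3.e)); /* prints -12.5 */
      return 0;
  }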


5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

[Pipeline diagram, cycles 1…N+4: each cycle a new operand pair B(i), C(i) enters stage 1 while earlier pairs advance through the stages “separate mantissa/exponent”, “multiply mantissas”, “add exponents”, “normalize result”, “insert sign”. A(1) leaves the pipe in cycle 5, A(2) in cycle 6, …, A(N) in cycle N+4.]

First result is available after 5 cycles (= latency of the pipeline)!
After that, one result is completed in each cycle.


5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

Wind-up/-down phases: Empty pipeline stages


Pipelining: Speed-Up and Throughput

• Assume a general m-stage pipe, i.e. pipeline depth m. Speed-up of pipelined vs. non-pipelined execution at the same clock speed:

  Tseq / Tpipe = (m*N) / (N+m) ≈ m  for large N (N >> m)

• Throughput of pipelined execution (= average results per cycle) when executing N instructions in a pipeline with m stages:

  N / Tpipe(N) = N / (N+m) = 1 / (1 + m/N)

• Throughput for large N: N / Tpipe(N) → 1

• Number of independent operations (NC) required to achieve a throughput of Tp results per cycle:

  Tp = 1 / (1 + m/NC)  ⇔  NC = Tp * m / (1 - Tp)

  e.g. Tp = 0.5 → NC = m


Throughput as function of pipeline stages

[Figure: throughput 1/(1+m/N) versus the number N of independent operations, plotted for several pipeline depths m; reaching 90% pipeline efficiency requires ever more independent operations as m grows.]
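A quick sketch to reproduce the numbers behind this figure, using the NC formula from the previous slide (90% is an assumed target):

  #include <stdio.h>

  /* Independent operations N_C needed to reach throughput Tp on an
     m-stage pipeline: N_C = Tp*m/(1-Tp) (formula from previous slide). */
  int main(void) {
      const double tp = 0.9;                /* 90% pipeline efficiency */
      for (int m = 5; m <= 30; m += 5) {
          double nc = tp * m / (1.0 - tp);  /* = 9*m for Tp = 0.9 */
          printf("m = %2d stages: N_C = %.0f independent operations\n", m, nc);
      }
      return 0;
  }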


Software pipelining

• Example:

Fortran code:
  do i=1,N
    a(i) = a(i) * c
  end do

Simple pseudo code:
  loop: load a[i]
        mult a[i] = c, a[i]
        store a[i]
        branch.loop

Latencies:
  load a[i]           load operand to register (4 cycles)
  mult a[i] = c,a[i]  multiply a(i) with c (2 cycles); a[i], c in registers
  store a[i]          write back result from register to mem./cache (2 cycles)
  branch.loop         increase loop counter as long as i less equal N (0 cycles)

Optimized pseudo code:
  loop: load a[i+6]
        mult a[i+2] = c, a[i+2]
        store a[i]
        branch.loop

Assumption: instructions block execution if operands are not available.


Software pipelining

a[i] = a[i]*c; N = 12

Naive instruction issue: each iteration runs to completion before the next one starts – load a[1] (cycles 1–4), mult a[1]=c,a[1] (cycles 5–6), store a[1] (cycles 7–8), then load a[2], … → T = 96 cycles.

Optimized instruction issue (software pipelined): a prolog fills the pipeline (load a[1] … load a[5], then the first mult), the kernel issues load a[i+6], mult a[i+2], store a[i] together every cycle, and an epilog drains the remaining mults and stores → T = 19 cycles.


Efficient use of Pipelining

• Software pipelining can be done by the compiler, but efficient reordering of the instructions requires deep insight into the application (data dependencies) and the processor (latencies of the functional units)

• Re-ordering of instructions can also be done at runtime by out-of-order (OOO) execution

• (Potential) dependencies within the loop body may prevent efficient software pipelining or OOO execution, e.g.:

Dependency:
  do i=2,N
    a(i) = a(i-1) * c
  end do

No dependency:
  do i=1,N
    a(i) = a(i) * c
  end do

Pseudo-dependency:
  do i=1,N-1
    a(i) = a(i+1) * c
  end do

Pipelining: Data dependencies

a[i] = a[i-1]*c; N = 12

Naive instruction issue: load a[1], mult a[2]=c,a[1], store a[2], load a[2], mult a[3]=c,a[2], store a[3], … strictly in sequence → T = 96 cycles.

Optimized instruction issue: after a short prolog (load a[1], mult a[2]=c,a[1]) the kernel can only issue one mult every two cycles, each store overlapping the next mult: mult a[3]=c,a[2] / store a[2]; mult a[4]=c,a[3] / store a[3]; … → T = 26 cycles.

The length of the MULT pipeline determines the throughput!

Fill the pipeline with independent recursive streams…

Intel Sandy Bridge (desktop): 4 cores; 3.5 GHz; SMT
MULT pipeline depth: 5 stages → 1 Flop / 5 cycles for a recursive update
→ run 5 independent updates on a single core!

Thread 0:
  do i=1,N
    A(i)=A(i-1)*s
    B(i)=B(i-1)*s
    C(i)=C(i-1)*s
    D(i)=D(i-1)*s
    E(i)=E(i-1)*s
  enddo

[Figure: the MULT pipe holds one update from each stream at the same time – B(2)*s, A(2)*s, E(1)*s, D(1)*s, C(1)*s in its five stages.]


Pipelining: Beyond multiplication

• Typical number of pipeline stages: 2–5 for the hardware pipelines of modern CPUs
• x86 processors (AMD, Intel): 1 MULT & 1 ADD unit per processor core
• No hardware for div / sqrt / exp / sin … → expensive instructions
• Other instructions are also pipelined, e.g. LOAD operand to register (4 cycles)

• “FP costs” in cycles per instruction for the Intel Core2 architecture:

  Operation        | y=a+y (y=a*y) | y=a/y | y=sqrt(y) | y=sin(y)
  Latency          | 3 (5)         | 32    | 29        | >100
  Throughput       | 1 (1)         | 31    | 28        | >100
  Cycles/Operation | 0.5*          | 15.5* | 14*       | >100


Pipelining: Potential problems (1)

• Hidden data dependencies:

  – C/C++ allows “pointer aliasing”, e.g. A = &C[-1]; B = &C[-2] → C[i] = C[i-1] + C[i-2] → dependency!
  – The compiler cannot resolve potential pointer aliasing conflicts on its own!
  – If no pointer aliasing occurs, tell the compiler, e.g.
    • use the –fno-alias switch for the Intel compiler
    • pass arguments as (double *restrict A, …) (C99 standard only)

  void scale_shift(double *A, double *B, double *C, int n) {
    for (int i=0; i<n; ++i)
      C[i] = A[i] + B[i];
  }
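A sketch of the restrict variant suggested above (C99; the qualifier asserts to the compiler that the three arrays do not overlap, removing the assumed aliasing hazard):

  void scale_shift(double *restrict A, double *restrict B,
                   double *restrict C, int n) {
      /* with restrict the compiler may software-pipeline / reorder freely */
      for (int i = 0; i < n; ++i)
          C[i] = A[i] + B[i];
  }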


Pipelining: Potential problems (2)

• Simple subroutine/function calls within a loop

→ Inline subroutines! (can be done by the compiler…)

  do i=1, N
    call elementprod(A(i), B(i), psum)
    C(i)=psum
  enddo
  …
  function elementprod(a, b, psum)
  …
  psum=a*b

becomes

  do i=1, N
    psum=A(i)*B(i)
    C(i)=psum
  enddo
  …

Pipelining: Potential problems (3a)

Can we use pipelining, or does this cost us 8*3 cycles (assuming a 3-stage ADD pipeline)?


Pipelining: Potential problems (3b)

More general – “reduction operations”?

  sum=0.d0
  do i=1, N
    sum=sum + A(i)
  enddo
  …

Benchmark: run the above kernel (assembly annotations: A(i) incl. LOAD; sum in register xmm1; i loop counter; N; ADD adds the 1st argument to the 2nd argument and stores the result in the 2nd argument) with N = 32, 64, 128, …, 4096 on a processor with
• 3.5 GHz clock speed → ClSp = 3500 Mcycle/s
• 1 pipelined ADD unit (latency 3 cycles)
• 1 pipelined LOAD unit (latency 4 cycles)

Expectation: 1 cycle per iteration (after 7 iterations)


Pipelining: Potential problems (4)

Expected performance: Throughput * ClockSpeed

Throughput: N/T(N) = N / (L+N)
Assumption: L is the total latency of one iteration, and one result per cycle is delivered after pipeline startup → total runtime: L+N cycles.

Total latency: L = 4 cycles (LOAD) + 3 cycles (ADD) = 7 cycles

Performance for N iterations: 3500 Mcycle/s * (N / (L+N)) iterations/cycle

Maximum performance (N → ∞): 3500 Mcycle/s * 1 iteration/cycle = 3500 Miterations/s

[Figure: the LOAD pipeline works on A(i), A(i-1), A(i-2), A(i-3) while the ADD pipeline works on A(i-4), A(i-5), A(i-6).]


Pipelining: Potential problems (5)

Why?

  sum=0.d0
  do i=1, N
    sum=sum + A(i)
  enddo
  …

Dependency on sum → the next instruction needs to wait for the completion of the previous one → only 1 out of 3 ADD stages is active → 3 cycles per iteration.

[Figure: s = s+A(1) must leave the ADD pipe before s = s+A(2) can enter; A(2), A(3), … wait.]

Throughput here: N/T(N) = N / (L+3*N)


Pipelining: Potential problems (6)

Increase pipeline utilization by “loop unrolling”:

  sum1=0.d0
  sum2=0.d0
  do i=1, N, 2
    sum1=sum1+A(i)
    sum2=sum2+A(i+1)
  enddo
  sum=sum1+sum2

“2-way Modulo Variable Expansion” (N is even)

→ 2 out of 3 pipeline stages can be filled
→ 2 results every 3 cycles → 1.5 cycles/iteration


Pipelining: Potential problems (7)

4-way Modulo Variable Expansion (MVE) gets the best performance (in principle 3-way should do as well): the sum is split up into 4 independent partial sums.

The compiler can do that – if it is allowed to: computer floating point arithmetic is not associative! If you require a binary exact result (-fp-model strict), the compiler is not allowed to apply this transformation.

L = (7+3*3) cycles (prev. slide)

“4-way MVE” with remainder loop:

  Nr=4*(N/4)
  sum1=0.d0
  sum2=0.d0
  sum3=0.d0
  sum4=0.d0
  do i=1, Nr, 4
    sum1=sum1+A(i)
    sum2=sum2+A(i+1)
    sum3=sum3+A(i+2)
    sum4=sum4+A(i+3)
  enddo
  do i=Nr+1, N       ! remainder loop
    sum1=sum1+A(i)
  enddo
  sum=sum1+sum2+sum3+sum4
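For reference, a C sketch of the same 4-way MVE transformation (hand-written here purely for illustration; as stated above, the compiler performs it when FP reordering is permitted):

  double sum_mve4(const double *A, int n) {
      double s1 = 0.0, s2 = 0.0, s3 = 0.0, s4 = 0.0;
      int nr = 4 * (n / 4);
      for (int i = 0; i < nr; i += 4) {  /* 4 independent partial sums */
          s1 += A[i];
          s2 += A[i+1];
          s3 += A[i+2];
          s4 += A[i+3];
      }
      for (int i = nr; i < n; ++i)       /* remainder loop */
          s1 += A[i];
      return s1 + s2 + s3 + s4;          /* relies on relaxed FP associativity */
  }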


Pipelining: The Instruction pipeline

• Besides the arithmetic/functional units, instruction execution itself is also pipelined, e.g. one instruction performs at least 3 steps:

  Fetch instruction from L1I → Decode instruction → Execute instruction

Hardware pipelining on the processor (all units can run concurrently): in cycle 1 instruction 1 is fetched; in cycle 2 instruction 1 is decoded while instruction 2 is fetched; in cycle 3 instruction 1 executes, instruction 2 is decoded and instruction 3 is fetched; and so on.

→ Branches can stall this pipeline! (speculative execution, predication)
→ Each unit is pipelined itself (cf. Execute = the multiply pipeline above)


Pipelining: The Instruction pipeline

• Problem: unpredictable branches to other instructions

Assume the result of instruction 1 determines the next instruction!

[Figure: instruction 1 moves through fetch/decode/execute, but the fetch of the following instructions cannot proceed until the result of instruction 1 is known – the pipeline stalls.]


Microprocessors – Superscalar


Superscalar Processors

• Superscalar processors provide additional hardware (i.e. transistors) to execute multiple instructions per cycle!

• Parallel hardware components / pipelines are available to
  – fetch / decode / issue multiple instructions per cycle (typically 3–6 per cycle)
  – perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
  – load (store) multiple operands (results) from (to) cache per cycle (typically one load AND one store per cycle)
  – perform multiple floating point instructions per cycle (typically 2 floating point instructions per cycle, e.g. 1 MULT + 1 ADD)

• On superscalar RISC processors, out-of-order (OOO) execution hardware is available to optimize the usage of the parallel hardware


Superscalar Processors – Instruction Level Parallelism

Multiple units enable the use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly.

Issuing m concurrent instructions per cycle: m-way superscalar. Modern processors are 3- to 6-way superscalar and can perform 2 or 4 floating point operations per cycle.

[Figure: 4-way superscalar – four fetch/decode/execute pipelines run side by side, so instructions 1–4 are fetched in one cycle, instructions 5–8 in the next, and so on.]


Superscalar Processors – ILP in action

  sum1=0.d0   ! reg. R11
  sum2=0.d0   ! reg. R12
  sum3=0.d0   ! reg. R13
  do i=1, N, 3
    sum1=sum1+A(i)
    sum2=sum2+A(i+1)
    sum3=sum3+A(i+2)
  enddo
  sum=sum1+sum2+sum3

“3-way Modulo Variable Expansion” (N is a multiple of 3)

[Figure: register set – the LOAD unit fills R0…R3 with A(i), A(i-1), A(i-2), A(i-3) while the ADD unit computes R11=R11+R4, R12=R12+R5, R13=R13+R6. The complete register management is not shown (R4 contains A(i-4)).]

2-way superscalar execution: 1 LOAD instruction + 1 ADD instruction completed per cycle.

Often cited metrics for superscalar processors:
• Instructions Per Cycle: IPC = 2 above
• Cycles Per Instruction: CPI = 0.5 above


Superscalar processor – Intel Nehalem design

Decode & issue a max. of 4 instructions per cycle: IPC = 4, i.e. min. CPI = 0.25 cycles/instruction

Parallel units:
• FP ADD & FP MULT (work in parallel)
• LOAD + STORE (work in parallel)

Max. FP performance: 1 ADD + 1 MULT instruction per cycle
• Max. performance: A(i) = r0 + r1 * B(i)
• 1/2 of max. FP performance: A(i) = r1 * B(i)
• 1/3 of max. FP performance: A(i) = A(i) + B(i) * C(i)

Microprocessors – Single Instruction Multiple Data (SIMD) processing

Basic idea: apply the same instruction to multiple operands in parallel

SIMD-processing – Basics

Single Instruction Multiple Data (SIMD) instructions allow the concurrent execution of the same operation on “wide” registers.

x86_64 SIMD instruction sets:
• SSE: register width = 128 bit → 2 double (4 single) precision FP operands
• AVX: register width = 256 bit → 4 double (8 single) precision FP operands
• “Scalar” (non-SIMD) execution: 1 single/double operand, i.e. only the lower 64 bit (32 bit) of the register are used

Integer operands:
• SSE can be configured very flexibly: 1 x 128 bit, …, 16 x 8 bit
• AVX: no support for using the 256 bit register width for integer operations

SIMD execution = vector execution: if the compiler has vectorized a loop, SIMD instructions are used.


SIMD-processing – Basics

Example: adding two registers holding double precision floating point operands, using 256 bit registers (AVX).

Scalar execution: R2 ← ADD[R0,R1] uses only the lower 64 bit of each register: C[0] = A[0] + B[0].

SIMD execution: R2 ← V64ADD[R0,R1] adds all four 64 bit slots of the 256 bit registers at once: C[0..3] = A[0..3] + B[0..3] with a single instruction.

If 128 bit SIMD instructions (SSE) are executed, only half of the register width is used.
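The same V64ADD picture can be written down with compiler intrinsics; a minimal sketch (assumes an AVX-capable CPU and gcc/icc with -mavx/-xAVX; the array names follow the figure):

  #include <stdio.h>
  #include <immintrin.h>

  int main(void) {
      /* 32-byte alignment so the aligned 256-bit loads/stores are legal */
      double A[4] __attribute__((aligned(32))) = {1, 2, 3, 4};
      double B[4] __attribute__((aligned(32))) = {10, 20, 30, 40};
      double C[4] __attribute__((aligned(32)));

      __m256d r0 = _mm256_load_pd(A);     /* 4 dp operands in one register */
      __m256d r1 = _mm256_load_pd(B);
      __m256d r2 = _mm256_add_pd(r0, r1); /* the "V64ADD [R0,R1] -> R2" step */
      _mm256_store_pd(C, r2);

      printf("%g %g %g %g\n", C[0], C[1], C[2], C[3]); /* 11 22 33 44 */
      return 0;
  }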


SIMD-processing – Basics

Steps (done by the compiler) for “SIMD processing”:

Original loop:
  for(int i=0; i<n; i++)
    C[i] = A[i] + B[i];

Step 1 – “loop unrolling” (4-way):
  for(int i=0; i<n; i+=4) {
    C[i]   = A[i]   + B[i];
    C[i+1] = A[i+1] + B[i+1];
    C[i+2] = A[i+2] + B[i+2];
    C[i+3] = A[i+3] + B[i+3];
  }
  //remainder loop handling

Step 2 – “pseudo-assembler”:
  LABEL1:
    VLOAD R0 ← A[i]        // load 256 bits starting at the address of A[i] into register R0
    VLOAD R1 ← B[i]
    V64ADD[R0,R1] → R2     // add the corresponding 64 bit entries of R0 and R1, store the 4 results in R2
    VSTORE R2 → C[i]       // store R2 (256 bits) to the address starting at C[i]
    i ← i+4
    i<(n-4)? JMP LABEL1
  //remainder loop handling


SIMD-processing – Basics

No SIMD processing for loops with data dependencies:

  for(int i=0; i<n; i++)
    A[i] = A[i-1]*s;

“Pointer aliasing” may prevent the compiler from SIMD processing:

  void scale_shift(double *A, double *B, double *C, int n) {
    for(int i=0; i<n; ++i)
      C[i] = A[i] + B[i];
  }

C/C++ allows that A = &C[-1] and B = &C[-2] → C[i] = C[i-1] + C[i-2]: dependency → no SIMD processing.

If no pointer aliasing occurs, tell the compiler, e.g. use the –fno-alias switch for the Intel compiler → SIMD processing.


SIMD-processing – Basics

SIMD processing of a vector sum:

  double s=0.0;
  for(int i=0; i<n; i++)
    s = s + A[i];

The data dependency on s must be resolved for SIMD processing (assume AVX). The compiler does the transformation (Modulo Variable Expansion) – if the programmer allows it to do so (e.g. use –O3 instead of –O1):

  s0=0.0; s1=0.0; s2=0.0; s3=0.0;
  for(int i=0; i<n; i+=4) {
    s0 = s0 + A[i];
    s1 = s1 + A[i+1];
    s2 = s2 + A[i+2];
    s3 = s3 + A[i+3];
  }
  //remainder
  s = s0+s1+s2+s3;

In registers: R0 ← (0.d0, 0.d0, 0.d0, 0.d0); the loop accumulates with V64ADD(R0,R1) → R0; a final “horizontal” ADD sums up the 4 64 bit entries of R0.


SIMD-processing: What about pipelining?!

One accumulator register (“vertical add”: R0=R0+A(1:4), R0=R0+A(5:8), R0=R0+A(9:12), …):

  R0 ← (0.0,0.0,0.0,0.0)
  do i=1, N, 4
    VLOAD A(i:i+3) → R1
    V64ADD(R0,R1) → R0
  enddo
  sum ← HorizontalADD(R0)     ! “horizontal add”

Need another MVE step to fill the pipeline stages:

  R0 ← (0.0,0.0,0.0,0.0)
  R1 ← (0.0,0.0,0.0,0.0)
  R2 ← (0.0,0.0,0.0,0.0)
  do i=1, N, 12
    VLOAD A(i:i+3)    → R3
    VLOAD A(i+4:i+7)  → R4
    VLOAD A(i+8:i+11) → R5
    V64ADD(R0,R3) → R0
    V64ADD(R1,R4) → R1
    V64ADD(R2,R5) → R2
  enddo
  …
  V64ADD(R0,R1) → R0
  V64ADD(R0,R2) → R0
  sum ← HorizontalADD(R0)
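A hedged intrinsics sketch of this three-accumulator variant (assumes AVX, n a multiple of 12, and A 32-byte aligned; the horizontal add is done in scalar code after a final store):

  #include <immintrin.h>

  double vsum(const double *A, int n) {   /* n % 12 == 0, A 32-byte aligned */
      __m256d r0 = _mm256_setzero_pd();
      __m256d r1 = _mm256_setzero_pd();
      __m256d r2 = _mm256_setzero_pd();
      for (int i = 0; i < n; i += 12) {   /* three vertical adds fill the ADD pipe */
          r0 = _mm256_add_pd(r0, _mm256_load_pd(A + i));
          r1 = _mm256_add_pd(r1, _mm256_load_pd(A + i + 4));
          r2 = _mm256_add_pd(r2, _mm256_load_pd(A + i + 8));
      }
      r0 = _mm256_add_pd(_mm256_add_pd(r0, r1), r2);
      double t[4];
      _mm256_storeu_pd(t, r0);            /* "horizontal add" in scalar code */
      return t[0] + t[1] + t[2] + t[3];
  }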


SIMD-processing: What about pipelining?!

[Figure: vector sum performance vs. loop length for different unrolling factors of the vectorized code (double precision).]

• 1 AVX iteration performs 4 successive scalar iterations
• Performance: 4x higher than the “scalar” version
• Start-up phase is much longer…


Compiler generated AVX code (loop body)

Baseline version (“scalar”): no pipelining, no SIMD → 3 cycles / iteration

Compiler generated “AVX version” (-O3 –xAVX):
• SIMD processing: vaddpd %ymm8 → 4 dp operands (4-way unrolling)
• Pipelining: 8-way MVE of the SIMD code
→ 0.25 cycles / iteration; 32-way unrolling in total

SIMD processing – vector sum (double precision) – 1 core

SIMD has the most impact if data is close to the core – the other bottlenecks stay the same!

[Figure: performance vs. location of the input data (A[]) in the memory hierarchy, with the peak line shown.]

• Scalar: code execution in the core is slower than any data transfer
• Plain: no SIMD but 4-way MVE
• AVX/SIMD: full benefit only if data is in the L1 cache


Data parallel SIMD processing

• Requires independent vector-like operations (“data parallel”)
• The compiler is required to generate “vectorized” code → check the compiler output
• Check for the use of “packed SIMD” instructions at runtime (likwid) or in the assembly code
• Packed SIMD may impose alignment constraints, e.g. 16-byte alignment for efficient load/store on Intel Core2 architectures
• Check also for SIMD LOAD / STORE instructions
• Use of packed SIMD instructions reduces the overall number of instructions (typical theoretical max. of 4 instructions / cycle)
→ SIMD code may improve performance but lower the measured IPC – fewer instructions do the same work!


Data parallel SIMD processing: Boosting performance

• Putting it all together: modern x86_64 based Intel / AMD processor

  – One FP MULTIPLY and one FP ADD pipeline can run in parallel, each with a throughput of one FP instruction/cycle (FMA units on AMD Interlagos)
    → maximum of 2 FP instructions/cycle

  – Each pipeline operates on 128 (256) bit registers for packed SSE (AVX) instructions → 2 (4) double precision FP operations per SSE (AVX) instruction
    → 4 (8) FP operations / cycle (1 MULT & 1 ADD on 2 (4) operands)

Peak performance of a 3 GHz CPU (core), e.g. AVX double precision: 3 GHz * 2 instructions/cycle * 4 ops/instruction = 24 GFlop/s:
• SSE: 12 GFlop/s or AVX: 24 GFlop/s (double precision)
• SSE: 24 GFlop/s or AVX: 48 GFlop/s (single precision)

BUT for “scalar” code: 6 GFlop/s (double and single precision)!


Maximum Floating Point (FP) Performance:

  Pcore = F * S * n

  F = FP instructions per cycle: 2 (1 MULT and 1 ADD)
  S = FP ops / instruction: 4 (dp) / 8 (sp) (256 bit SIMD registers – “AVX”)
  n = clock speed: ~2.5 GHz

  → P = 20 GF/s (dp) / 40 GF/s (sp)

Scalar (non-SIMD) execution: S = 1 FP op/instruction (dp / sp) → P = 5 GF/s (dp / sp)

There is no single driving force for single core performance!


SIMD registers: floating point (FP) data and beyond

Possible data types in an SSE register (AVX only widens the floating point types; not to scale):

  Integer:        16 x 8 bit | 8 x 16 bit | 4 x 32 bit | 2 x 64 bit | 1 x 128 bit
  Floating point: 4 x 32 bit | 2 x 64 bit (SSE, 128 bit); 8 x 32 bit | 4 x 64 bit (AVX, 256 bit)


Rules for vectorizable loops / SIMD processing

1. Countable
2. Single entry and single exit
3. Straight line code
4. No function calls (exception: intrinsic math functions)

Better performance with:
5. Simple inner loops with unit stride
6. Minimize indirect addressing
7. Align data structures (SSE: 16 bytes, AVX: 32 bytes) – see the allocation sketch below
8. In C, use the restrict keyword for pointers to rule out aliasing

Obstacles for vectorization:
• Non-contiguous memory access
• Data dependencies
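For rule 7, a small sketch of requesting AVX-friendly alignment in C (aligned_alloc is C11; posix_memalign would be the POSIX alternative):

  #include <stdlib.h>

  /* Allocate n doubles on a 32-byte boundary (AVX); aligned_alloc requires
     the size to be a multiple of the alignment, hence the round-up. */
  double *alloc_avx(size_t n) {
      size_t bytes = ((n * sizeof(double) + 31) / 32) * 32;
      return aligned_alloc(32, bytes);   /* release with free() as usual */
  }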


How to leverage vectorization / SIMD

• The compiler does it for you (aliasing, alignment, language)
• Source code directives (pragmas) to ease the compiler’s job
• Alternative programming models for compute kernels (OpenCL, ispc)
• Intrinsics (restricted to C/C++)
• Implement directly in assembler

Complexity and efficiency increase from top to bottom of this list.


Vectorization and the Intel compiler

• The Intel compiler will try to use SIMD instructions when enabled to do so: “poor man’s vector computing”
• The compiler can emit messages about vectorized loops (not by default):
    plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.
• Use the option -vec_report3 to get full compiler output about which loops were vectorized, which were not, and why (data dependencies!)
• Some obstructions will prevent the compiler from applying vectorization even if it is possible (e.g. –fp-model strict for the vector sum, or pointer aliasing)
• You can use source code directives to provide more information to the compiler


Vectorization compiler options

• The compiler will vectorize starting with –O2
• To enable specific SIMD extensions, use the –x option, e.g. -xSSE2 to vectorize for SSE2-capable machines
  Available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX
• -xAVX on Sandy Bridge processors
• Recommended option: -xHost optimizes for the architecture you compile on
• Compiling for AMD Opteron: use plain –O3, as the -x options may involve CPU type checks at runtime!


Vectorization source code directives (pragmas)

Fine-grained control of loop vectorization: use the !DEC$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive.

• #pragma vector always – vectorize even if it seems inefficient (hint!)
• #pragma novector – do not vectorize, even if possible
• #pragma vector nontemporal – use NT stores when allowed (i.e. alignment conditions are met)
• #pragma vector aligned – specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!)


User mandated vectorization (pragmas)

• Starting with Intel Compiler 12.0 the simd pragma is available: #pragma simd enforces vectorization where the other pragmas fail
• Prerequisites:
  – Countable loop
  – Innermost loop
  – Must conform to the for-loop style of OpenMP worksharing constructs
• There are additional clauses: reduction, vectorlength, private – refer to the compiler manual for further details
• NOTE: with #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

  #pragma simd reduction(+:x)
  for (int i=0; i<n; i++) {
    x = x + A[i];
  }


Basic approach to check the instruction code

• Get the assembler code (Intel compiler):
    icc -S -O3 -xHost triad.c -o triad.s
• Disassemble the executable:
    objdump -d ./cacheBench | less

Things to check for:
• Is the code vectorized? Search for the pd/ps suffix: mulpd, addpd, vaddpd, vmulpd
• Is the data loaded with 16-byte moves? movapd, movaps, vmovupd
• For memory-bound code: search for nontemporal stores: movntpd, movntps

The x86 ISA is documented in:
• Intel Software Development Manual (SDM) 2A and 2B
• AMD64 Architecture Programmer's Manual Vol. 1–5


Some basics of the x86-64 ISA

• 16 general purpose registers (64 bit): rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8-r15; these alias with the eight 32 bit registers eax, ebx, ecx, edx, esi, edi, esp, ebp
• Floating point SIMD registers: xmm0-xmm15, SSE (128 bit), which alias with the 256 bit AVX registers ymm0-ymm15
• SIMD instruction names are composed of:
  – AVX (VEX) prefix: v
  – Operation: mul, add, mov
  – Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)
  – Data range: packed (p), scalar (s)
  – Data type: single (s), double (d)
  e.g. vmovupd = VEX-prefixed move of unaligned packed double operands


Some basic single core optimizations – warnings first

“Premature optimization is the root of all evil.” – Donald E. Knuth

“Parallel performance is easy, single node/core performance is difficult.” – Bill Gropp


Single core: Common sense optimizations (1)

Do less work! Reducing the work to be done is never a bad idea.

  logical :: flag
  flag = .false.
  do i=1,N
    if (complex_func(A(i)) < THRESHOLD) then  ! check if at least
      flag = .true.                           ! one value is below the threshold
    endif
  enddo

Better – leave the loop as soon as the condition is met:

  logical :: flag
  flag = .false.
  do i=1,N
    if (complex_func(A(i)) < THRESHOLD) then  ! check if at least
      flag = .true.                           ! one value is below the threshold,
      EXIT                                    ! then EXIT the do-loop
    endif
  enddo
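The same early-exit idea in C, with break (complex_func and THRESHOLD are stand-ins from the slide, defined here only to keep the sketch self-contained):

  #define THRESHOLD 1.0e-3

  static double complex_func(double x) { return x * x; }  /* toy stand-in */

  int any_match(const double *A, int n) {
      for (int i = 0; i < n; ++i) {
          if (complex_func(A[i]) < THRESHOLD)
              return 1;   /* "EXIT": stop as soon as one element matches */
      }
      return 0;
  }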


Single core: Common sense optimizations (2)

Avoid expensive operations!
• FP MULT & FP ADD are the fastest way to compute
• Avoid DIV / SQRT / SIN / COS / TAN, … → table lookup (most useful if data is close to the CPU!)
• Avoid a one-to-one implementation of the algorithm, e.g. for A = A + B**2 (A, B float):

    A = A + B**2.0   (V1)
    A = A + B * B    (V2)

  (V1) is not a good idea, since B**2.0 → exp{2.0*ln(B)}:
  1. Computing exp & ln is very expensive
  2. What about B < 0?!


Single core: Common sense optimizations (3)

Shrink the working set! Working on small data sets reduces the data transfer volume and increases the probability of cache hits.

Analyze whether appropriate data types are used, e.g. when 4 different particle species have to be distinguished:

  integer spec(1:N)      ! spec(i) = {0,1,2,3}: sizeof(spec(1:N)) = 4*N byte

OR use a 1-byte integer datatype:

  integer*1 spec(1:N)    ! sizeof(spec(1:N)) = N byte

OR use 2 bits for each species (see the packing sketch below):

  integer*1 spec(1:N/4)  ! sizeof(spec(1:N/4)) = N/4 byte

Strongly depends on the application!
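A sketch of the 2-bit variant in C (hypothetical helpers; four species IDs {0,1,2,3} packed per byte):

  #include <stdint.h>

  int get_spec(const uint8_t *spec, long i) {     /* read species of particle i */
      return (spec[i / 4] >> (2 * (i % 4))) & 0x3;
  }

  void set_spec(uint8_t *spec, long i, int v) {   /* store species of particle i */
      int shift = 2 * (i % 4);
      spec[i / 4] = (uint8_t)((spec[i / 4] & ~(0x3 << shift)) | ((v & 0x3) << shift));
  }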


Single core: Common sense optimizations (4a)

Eliminate common subexpressions! This often reduces the MFLOP/s rate but improves the runtime.

In principle the compiler should do the job, but do not rely on it:
• Associativity rules may prevent the compiler from doing so
• The compiler may not recognize the subexpression (limited scope)


Single core: Common sense optimizations (4b)

Replace expensive functions by table lookup:

  do iter = …
    …
    do i=1,n
      …
      edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
      BF = 0.5d0*(1.d0+tanh(edelz))
      …
    enddo
    …
  enddo

The entries iL, …, iN are -1, 0 or 1, so edelz takes only the values -6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6.

→ Compute all 13 potential values of tanh(edelz) beforehand and store them in a table with 13 entries: tanh_table(-6:6).


Single core: Common sense optimizations (4c)

Replace expensive functions by table lookup:

  do i=-6,6
    tanh_table(i) = tanh(dble(i))
  enddo
  do iter = …
    …
    do i=1,n
      …
      edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
      BF = 0.5d0*(1.d0 + tanh_table(edelz))
      …
    enddo
    …
  enddo


Single core: Common sense optimizations (5)

Avoid branches! Support the compiler in understanding and optimizing your code.

Such code changes may enable vectorization, SIMD & other optimizations. BTW: software pipelining also becomes much easier, and no branch prediction is required. A minimal sketch follows below.
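A minimal C sketch of the idea: express the branch as a select in the loop body, which the compiler can turn into a masked/blend SIMD operation (threshold clipping chosen here as an assumed example):

  void clip_above(double *A, int n, double t) {
      for (int i = 0; i < n; ++i) {
          /* select instead of "if (A[i] > t) A[i] = t;" -> vectorizable (min/blend) */
          A[i] = (A[i] > t) ? t : A[i];
      }
  }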
