Computer Architecture for Medical Applications: Pipelining & Single Instruction Multiple Data – two driving factors of single core performance. Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center; Dietmar Fey, Department for Computer Science. 30. April 2013. CAMA 2013 - D. Fey and G. Wellein


TRANSCRIPT

Page 1:

Computer Architecture for Medical Applications

Pipelining & Single Instruction Multiple Data – two driving factors of single core performance

Gerhard Wellein, Department for Computer Science and Erlangen Regional Computing Center

Dietmar Fey, Department for Computer Science

30. April 2013

Page 2:

A different view on computer architecture


Page 3:

From high level code to macro-/microcode execution


sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

[Diagram labels: Compiler (high-level code → assembly), Execution (assembly → ADD execution unit); A(i) (incl. LD) and sum held in register xmm1; i is the loop counter, bounded by N; ADD adds the 1st argument to the 2nd argument and stores the result in the 2nd argument.]

Page 4:

How does high level code interact with execution units?

Many hardware execution units:
• LOAD (STORE) operands from L1 cache (register) to register (memory)
• Floating Point (FP) MULTIPLY and ADD
• Various integer units

Execution units may work in parallel → “superscalar” processor

Two important concepts at hardware level: Pipelining + SIMD


sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

Page 5:

Microprocessors – Pipelining


Page 6:

Introduction: Moore’s law

1965: G. Moore predicted that the number of transistors on a processor chip doubles every 12-24 months.

Intel Nehalem EX: 2.3 billion transistors; NVIDIA Fermi: 3 billion

Page 7:

[Figure: clock frequency [MHz] by year, 0.1 … 10000 on a logarithmic scale.]

Introduction: Moore’s law → faster cycles and beyond

• Moore’s law → transistors are getting smaller → run them faster

• Faster clock speed → reduce complexity of instruction execution → pipelining of instructions

Intel x86 clock speed

Increasing transistor count and clock speed allows / requires architectural changes:
• Pipelining
• Superscalarity
• SIMD / vector ops
• Multi-core / threading
• Complex on-chip caches

Page 8:

Pipelining of arithmetic/functional units

• Idea:
– Split complex instruction into several simple / fast steps (stages)
– Each step takes the same amount of time, e.g. a single cycle
– Execute different steps of different instructions at the same time (in parallel)

• Allows for shorter cycle times (simpler logic circuits), e.g.:
– floating point multiplication takes 5 cycles, but
– the processor can work on 5 different multiplications simultaneously
– one result per cycle after the pipeline is full

• Drawbacks:
– Pipeline must be filled → startup times (#instructions >> pipeline stages)
– Efficient use of pipelines requires a large number of independent instructions → instruction level parallelism
– Requires complex instruction scheduling by compiler/hardware → software pipelining / out-of-order execution

• Pipelining is widely used in modern computer architectures

Page 9:

Interlude: Possible stages for a Multiply unit

Real numbers can be represented as mantissa and exponent in a “normalized” representation, e.g. s * 0.m * 10^e with

Sign s = {-1, 1}
Mantissa m, which does not contain 0 in the leading digit
Exponent e, some positive or negative integer

Multiply two real numbers r1*r2 = r3, with r1 = s1 * 0.m1 * 10^e1 and r2 = s2 * 0.m2 * 10^e2:

s1 * 0.m1 * 10^e1 * s2 * 0.m2 * 10^e2
= (s1*s2) * (0.m1*0.m2) * 10^(e1+e2)

Normalize result: s3 * 0.m3 * 10^e3
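The five stages can be sketched in C. This is an illustrative sketch, not code from the course: it uses the binary normalized form m * 2^e that the C library's frexp/ldexp provide rather than the slide's decimal 0.m * 10^e, and the function name staged_mult is made up.

```c
#include <math.h>

/* Sketch: the five multiply stages of the slide, in binary normalized
   form m * 2^e (frexp returns m in [0.5,1) with the sign attached). */
double staged_mult(double r1, double r2) {
    int e1, e2;
    double m1 = frexp(r1, &e1);   /* stage 1: separate mantissa / exponent */
    double m2 = frexp(r2, &e2);
    double m3 = m1 * m2;          /* stage 2: multiply mantissas */
    int e3 = e1 + e2;             /* stage 3: add exponents */
    int adj;
    m3 = frexp(m3, &adj);         /* stage 4: normalize result */
    e3 += adj;
    return ldexp(m3, e3);         /* stage 5: sign travels with the mantissa */
}
```

In a hardware pipeline each of these five lines would occupy its own stage, so five independent multiplications can be in flight at once.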


Page 10:

5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

[Figure: pipeline timing diagram for the 5 stages (Separate Mant./Exp. → Mult. Mantissa → Add Exponents → Normalize Result → Insert Sign). In cycle 1 the operands B(1),C(1) enter stage 1; in each following cycle a new pair B(i),C(i) enters while the earlier pairs advance one stage; the first result A(1) appears after 5 cycles, and the last result A(N) in cycle N+4.]

First result is available after 5 cycles (= latency of the pipeline)! After that, one instruction is completed in each cycle.

Page 11:

5-stage Multiplication-Pipeline: A(i)=B(i)*C(i) ; i=1,...,N

Wind-up/-down phases: Empty pipeline stages

Page 12:

Pipelining: Speed-Up and Throughput

• Assume a general m-stage pipe, i.e. pipeline depth m. Speed-up of pipelined vs. non-pipelined execution at the same clock speed:

  Tseq / Tpipe = (m*N) / (N+m) ≈ m   for large N (N >> m)

• Throughput of pipelined execution (= average results per cycle) when executing N instructions in a pipeline with m stages:

  N / Tpipe(N) = N / (N+m) = 1 / (1 + m/N)

• Throughput for large N: N / Tpipe(N) ≈ 1

• Number of independent operations (Nc) required to achieve a throughput of Tp results per cycle:

  Tp = 1 / (1 + m/Nc)  →  Nc = Tp * m / (1 - Tp)

  e.g. Tp = 0.5 → Nc = m

Page 13:

Throughput as function of pipeline stages

[Figure: throughput 1/(1 + m/N) as a function of the number of independent instructions N, for several pipeline depths m; reaching 90% pipeline efficiency requires N ≈ 9m.]

Page 14:

Software pipelining

• Example:

Fortran code:
do i=1,N
  a(i) = a(i) * c
end do

Simple pseudo code:
loop: load a[i]
      mult a[i] = c, a[i]
      store a[i]
      branch.loop

Latencies:
load a[i]             Load operand to register (4 cycles)
mult a[i] = c,a[i]    Multiply a(i) with c (2 cycles); a[i], c in registers
store a[i]            Write back result from register to mem./cache (2 cycles)
branch.loop           Increase loop counter as long as i less equal N (0 cycles)

Optimized pseudo code:
loop: load a[i+6]
      mult a[i+2] = c, a[i+2]
      store a[i]
      branch.loop

Assumption: instructions block execution if operands are not available.

Page 15:

Software pipelining

a[i]=a[i]*c; N=12

Naive instruction issue: each iteration runs load a[i] (4 cycles), mult a[i]=c,a[i] (2 cycles) and store a[i] (2 cycles) to completion before the next load starts → T = 96 cycles.

Optimized (software-pipelined) instruction issue: a prolog of loads (load a[1] … load a[5]) fills the pipeline; in the kernel each cycle issues one load, one mult and one store (e.g. load a[8], mult a[4]=c,a[4], store a[2]); an epilog drains the remaining mults and stores → T = 19 cycles.

Page 16:

Efficient use of Pipelining

• Software pipelining can be done by the compiler, but efficient reordering of the instructions requires deep insight into the application (data dependencies) and the processor (latencies of the functional units)

• Re-ordering of instructions can also be done at runtime by out-of-order (OOO) execution

• (Potential) dependencies within loop body may prevent efficient software pipelining or OOO execution, e.g.:

Dependency:
do i=2,N
  a(i) = a(i-1) * c
end do

No dependency:
do i=1,N
  a(i) = a(i) * c
end do

Pseudo-dependency:
do i=1,N-1
  a(i) = a(i+1) * c
end do
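The three variants, transcribed to C for illustration (the function names are ours): only the first carries a true loop dependency, so the compiler may software-pipeline or reorder the other two.

```c
/* a(i) = a(i-1)*c : true loop-carried dependency, must run serially */
void true_dep(double *a, double c, int n) {
    for (int i = 1; i < n; ++i) a[i] = a[i-1] * c;
}

/* a(i) = a(i)*c : no dependency, freely pipelined / reordered */
void no_dep(double *a, double c, int n) {
    for (int i = 0; i < n; ++i) a[i] = a[i] * c;
}

/* a(i) = a(i+1)*c : pseudo-dependency, reads ahead of all writes */
void pseudo_dep(double *a, double c, int n) {
    for (int i = 0; i < n - 1; ++i) a[i] = a[i+1] * c;
}
```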

Page 17:

Pipelining: Data dependencies

Page 18:

Pipelining: Data dependencies

Page 19:

Pipelining: Data dependencies

a[i]=a[i-1]*c; N=12

Naive instruction issue: load a[1]; mult a[2]=c,a[1]; store a[2]; load a[2]; … each iteration completes before the next starts → T = 96 cycles.

Optimized instruction issue: after a prolog (load a[1], mult a[2]=c,a[1]) the kernel can issue only one mult every 2 cycles, each store following its mult (mult a[3]=c,a[2] / store a[2], mult a[4]=c,a[3] / store a[3], …), because every mult must wait for the previous result → T = 26 cycles.

Length of MULT pipeline determines throughput

Page 20:

Fill pipeline with independent recursive streams..

Intel Sandy Bridge (desktop): 4 cores; 3.5 GHz; SMT. MULT pipeline depth: 5 stages → 1 F / 5 cycles for a recursive update

5 independent updates on a single core!

[Figure: MULT pipeline with one update from each stream in flight: E(1)*s, D(1)*s, C(1)*s in the later stages, B(2)*s and A(2)*s entering.]

Thread 0:
do i=1,N
  A(i)=A(i-1)*s
  B(i)=B(i-1)*s
  C(i)=C(i-1)*s
  D(i)=D(i-1)*s
  E(i)=E(i-1)*s
enddo


Page 21:

Pipelining: Beyond multiplication

• Typical number of pipeline stages: 2-5 for the hardware pipelines on modern CPUs.

• x86 processors (AMD, Intel): 1 MULT & 1 ADD unit per processor core
• No hardware for div / sqrt / exp / sin … → expensive instructions

• “FP costs” in cycles per instruction for Intel Core2 architecture

• Other instructions are also pipelined, e.g. LOAD operand to register (4 cycles)

Operation           y=a+y (y=a*y)   y=a/y   y=sqrt(y)   y=sin(y)
Latency [cycles]    3 (5)           32      29          >100
Throughput [cycles] 1 (1)           31      28          >100
Cycles/operation    0.5*            15.5*   14*         >100

Page 22:

Pipelining: Potential problems (1)

• Hidden data dependencies:

– C/C++ allows “pointer aliasing”, i.e. A → &C[-1]; B → &C[-2] → C[i] = C[i-1] + C[i-2] → dependency!

– The compiler cannot resolve potential pointer aliasing conflicts on its own!

– If no “pointer aliasing” is used, tell the compiler, e.g.
  • use the -fno-alias switch of the Intel compiler
  • pass arguments as (double *restrict A, …) (C99 standard only)

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}
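A sketch of the restrict variant mentioned above (C99): the restrict qualifier tells the compiler the three arrays cannot alias. The name scale_shift_restrict is ours, not from the slides.

```c
/* Same loop, but with restrict-qualified pointers (C99): the compiler may
   assume A, B, C do not overlap, which enables software pipelining / SIMD. */
void scale_shift_restrict(double *restrict A, double *restrict B,
                          double *restrict C, int n) {
    for (int i = 0; i < n; ++i)
        C[i] = A[i] + B[i];
}
```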

Page 23:

Pipelining: Potential problems (2)

• Simple subroutine/function calls within a loop → inline subroutines! (can be done by the compiler…)

do i=1, N
  call elementprod(A(i), B(i), psum)
  C(i)=psum
enddo
…
function elementprod(a, b, psum)
…
psum=a*b

After inlining:
do i=1, N
  psum=A(i)*B(i)
  C(i)=psum
enddo
…

Page 24:

Pipelining: Potential problems (3a)


Can we use pipelining, or does this cost us 8*3 cycles (assuming a 3-stage ADD pipeline)?

Page 25:

Pipelining: Potential problems (3b)

More general – “reduction operations”?

Benchmark: run the above assembly language kernel with N = 32, 64, 128, …, 4096 on a processor with
  3.5 GHz clock speed → ClSp = 3500 Mcycle/s
  1 pipelined ADD unit (latency 3 cycles)
  1 pipelined LOAD unit (latency 4 cycles)


sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

[Diagram labels: A(i) (incl. LD) and sum in register xmm1; i is the loop counter, bounded by N; ADD adds the 1st argument to the 2nd argument and stores the result in the 2nd argument. Expected: 1 cycle per iteration (after 7 iterations).]

Page 26:

Pipelining: Potential problems (4)

Expected performance: Throughput * Clock speed

Throughput: N/T(N) = N/(L+N)
Assumption: L is the total latency of one iteration, and one result per cycle is delivered after pipeline startup → total runtime: L+N cycles

Total latency: L = 4 cycles (LOAD) + 3 cycles (ADD) = 7 cycles

Performance for N iterations: 3500 Mcycle/s * (N / (L+N)) iterations/cycle

Maximum performance (N → ∞): 3500 Mcycle/s * 1 iteration/cycle = 3500 Miterations/s


[Figure: LOAD pipeline holding A(i), A(i-1), A(i-2), A(i-3) feeding the ADD pipeline holding A(i-4), A(i-5), A(i-6).]

Page 27:

Pipelining: Potential problems (5)


Why?

[Figure: ADD pipeline with only one stage occupied: s = s+A(1) must complete before A(2) can enter, then s = s+A(2), …]

Dependency on sum → the next instruction needs to wait for completion of the previous one → only 1 out of 3 stages active → 3 cycles per iteration

sum=0.d0
do i=1, N
  sum=sum + A(i)
enddo
…

Throughput here: N/T(N) = N/(L+3*N)

Page 28:

Pipelining: Potential problems (6)

Increase pipeline utilization by “loop unrolling”


sum1=0.d0
sum2=0.d0
do i=1, N, 2
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
enddo
sum=sum1+sum2

“2-way Modulo Variable Expansion” (N is even)

2 out of 3 pipeline stages can be filled → 2 results every 3 cycles → 1.5 cycles/iteration

Page 29:

Pipelining: Potential problems (7)

4-way Modulo Variable Expansion (MVE) gives the best performance (in principle 3-way should do as well).

The sum is split up into 4 independent partial sums.

The compiler can do that, if it is allowed to do so…

Computer floating point arithmetic is not associative! If you require a binary-exact result (-fp-model strict), the compiler is not allowed to do this transformation.

L=(7+3*3) cycles (prev. slide)


Nr=4*(N/4)
sum1=0.d0
sum2=0.d0
sum3=0.d0
sum4=0.d0
do i=1, Nr, 4
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
  sum4=sum4+A(i+3)
enddo
do i=Nr+1, N   ! remainder loop
  sum1=sum1+A(i)
enddo
sum=sum1+sum2+sum3+sum4

“4-way MVE” with remainder loop
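The same 4-way MVE transformation can be written in C as follows (illustrative sketch; the function name sum_mve4 is ours):

```c
/* 4-way modulo variable expansion of the reduction: four independent
   partial sums keep the ADD pipeline filled; a remainder loop catches
   the leftover elements when n is not a multiple of 4. */
double sum_mve4(const double *A, int n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    int nr = 4 * (n / 4);
    for (int i = 0; i < nr; i += 4) {
        s0 += A[i];
        s1 += A[i + 1];
        s2 += A[i + 2];
        s3 += A[i + 3];
    }
    for (int i = nr; i < n; ++i)     /* remainder loop */
        s0 += A[i];
    return s0 + s1 + s2 + s3;
}
```

Note that this changes the summation order, which is exactly why a compiler may only do it when binary-exact results are not required.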

Page 30:

Pipelining: The Instruction pipeline

• Besides arithmetic/functional units, instruction execution itself is pipelined as well, e.g. one instruction performs at least 3 steps:

Fetch instruction from L1I → Decode instruction → Execute instruction

[Figure: hardware pipelining on the processor (all units can run concurrently): in each cycle, instruction 1 is executed while instruction 2 is decoded and instruction 3 is fetched from L1I; one new instruction enters the pipe per cycle.]

… Branches can stall this pipeline! (Speculative Execution, Predication) Each Unit is pipelined itself (cf. Execute=Multiply Pipeline)


Page 31:

Pipelining: The Instruction pipeline

• Problem: Unpredictable branches to other instructions

[Figure: the same fetch/decode/execute pipeline, but instruction 4 cannot be fetched before instruction 1 has executed.]

Assume: Result determines next instruction!

Page 32:

Microprocessors – Superscalar


Page 33:

Superscalar Processors

• Superscalar processors provide additional hardware (i.e. transistors) to execute multiple instructions per cycle!

• Parallel hardware components / pipelines are available to
– fetch / decode / issue multiple instructions per cycle (typically 3-6 per cycle)
– perform multiple integer / address calculations per cycle (e.g. 6 integer units on Itanium2)
– load (store) multiple operands (results) from (to) cache per cycle (typically one load AND one store per cycle)
– perform multiple floating point instructions per cycle (typically 2 floating point instructions per cycle, e.g. 1 MULT + 1 ADD)

• On superscalar RISC processors, out-of-order (OOO) execution hardware is available to optimize the usage of the parallel hardware

Page 34:

Superscalar Processors – Instruction Level Parallelism

Multiple units enable the use of Instruction Level Parallelism (ILP): the instruction stream is “parallelized” on the fly.

Issuing m concurrent instructions per cycle: m-way superscalar. Modern processors are 3- to 6-way superscalar and can perform 2 or 4 floating point operations per cycle.

[Figure: 4-way superscalar issue. Four copies of the fetch → decode → execute pipeline run side by side, so in each cycle four instructions are fetched from L1I, four are decoded and four are executed (instructions 1, 5, 9, 13, … start in consecutive cycles).]

Page 35:

Superscalar Processors – ILP in action

Complex register management not shown (R4 contains A(i-4)). 2-way superscalar: 1 LOAD instruction + 1 ADD instruction completed per cycle.

Often-cited metrics for superscalar processors:
Instructions Per Cycle: IPC = 2 above
Cycles Per Instruction: CPI = 0.5 above


[Figure: LOAD pipeline (A(i) → R0, A(i-1) → R1, A(i-2) → R2, A(i-3) → R3) and ADD pipeline (R11=R11+R4, R12=R12+R5, R13=R13+R6) working on the register set in parallel.]

sum1=0.d0 ! reg. R11
sum2=0.d0 ! reg. R12
sum3=0.d0 ! reg. R13
do i=1, N, 3
  sum1=sum1+A(i)
  sum2=sum2+A(i+1)
  sum3=sum3+A(i+2)
enddo
sum=sum1+sum2+sum3

“3-way Modulo Variable Expansion” (N is a multiple of 3)

Page 36:

Superscalar processor – Intel Nehalem design

Decode & issue a max. of 4 instructions per cycle: IPC = 4, min. CPI = 0.25 cycles/instruction

Parallel units: FP ADD & FP MULT (work in parallel); LOAD + STORE (work in parallel)

Max. FP performance: 1 ADD + 1 MULT instruction per cycle

Max. performance:          A(i) = r0 + r1 * B(i)
½ of max. FP performance:  A(i) = r1 * B(i)
1/3 of max. FP performance: A(i) = A(i) + B(i) * C(i)

Page 37:

Microprocessors – Single Instruction Multiple Data (SIMD) processing

Basic idea: apply the same instruction to multiple operands in parallel


Page 38:

SIMD-processing – Basics

Single Instruction Multiple Data (SIMD) instructions allow the concurrent execution of the same operation on “wide” registers.

x86_64 SIMD instruction sets:
SSE: register width = 128 bit → 2 double (4 single) precision FP operands
AVX: register width = 256 bit → 4 double (8 single) precision FP operands
“Scalar” (non-SIMD) execution: 1 single/double operand, i.e. only the lower 64 bit (32 bit) of the registers are used.

Integer operands:
SSE can be configured very flexibly: 1 x 128 bit, …, 16 x 8 bit
AVX: no support for using the 256 bit register width for integer operations

SIMD execution → vector execution. If the compiler has vectorized a loop → SIMD instructions are used.

Page 39:

SIMD-processing – Basics

Example: adding two registers holding double precision floating point operands using 256-bit registers (AVX).

If 128-bit SIMD instructions (SSE) are executed, only half of the register width is used.

[Figure: AVX addition of two 256-bit registers: the four 64-bit entries A[0..3] in R0 are added lane-wise to B[0..3] in R1, giving C[0..3] in R2; the scalar case uses only the lowest 64 bit of each register.]

Scalar execution: R2 ← ADD [R0, R1]
SIMD execution: V64ADD [R0, R1] → R2

Page 40:

SIMD-processing – Basics

Steps (done by the compiler) for “SIMD-processing”


for(int i=0; i<n; i++)
  C[i]=A[i]+B[i];

“Loop unrolling”:
for(int i=0; i<n; i+=4){
  C[i]  =A[i]  +B[i];
  C[i+1]=A[i+1]+B[i+1];
  C[i+2]=A[i+2]+B[i+2];
  C[i+3]=A[i+3]+B[i+3];
}
//remainder loop handling

“Pseudo-assembler”:
LABEL1:
  VLOAD R0 ← A[i]       ; load 256 bits starting at the address of A[i] to R0
  VLOAD R1 ← B[i]
  V64ADD[R0,R1] → R2    ; add the corresponding 64-bit entries of R0 and R1, 4 results to R2
  VSTORE R2 → C[i]      ; store R2 (256 bit) to the address starting at C[i]
  i ← i+4
  i<(n-4)? JMP LABEL1
//remainder loop handling
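The first transformation step (4-way unrolling plus a remainder loop) as plain, compilable C. This is a sketch with our function name vadd_unrolled; the width 4 matches one AVX register of doubles, and an auto-vectorizing compiler would map each unrolled group to the pseudo-assembler sequence above.

```c
/* 4-way unrolled C[i] = A[i] + B[i]: one trip of the main loop corresponds
   to one 256-bit AVX vector of 4 doubles; the tail loop handles n % 4. */
void vadd_unrolled(const double *A, const double *B, double *C, int n) {
    int i;
    for (i = 0; i + 3 < n; i += 4) {
        C[i]     = A[i]     + B[i];
        C[i + 1] = A[i + 1] + B[i + 1];
        C[i + 2] = A[i + 2] + B[i + 2];
        C[i + 3] = A[i + 3] + B[i + 3];
    }
    for (; i < n; ++i)               /* remainder loop handling */
        C[i] = A[i] + B[i];
}
```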

Page 41:

SIMD-processing – Basics

No SIMD processing for loops with data dependencies:

for(int i=0; i<n; i++)
  A[i] = A[i-1]*s;

“Pointer aliasing” may prevent the compiler from SIMD processing:

void scale_shift(double *A, double *B, double *C, int n) {
  for(int i=0; i<n; ++i)
    C[i] = A[i] + B[i];
}

C/C++ allows that A → &C[-1] and B → &C[-2] → C[i] = C[i-1] + C[i-2]: dependency → no SIMD processing. If no pointer aliasing is used, tell the compiler, e.g. use the -fno-alias switch of the Intel compiler → SIMD processing.

Page 42:

SIMD-processing – Basics

SIMD processing of a vector sum:

double s=0.0;
for(int i=0; i<n; i++)
  s = s + A[i];

The data dependency on s must be resolved for SIMD processing (assume AVX). The compiler does the transformation (Modulo Variable Expansion), if the programmer allows it to do so (e.g. use -O3 instead of -O1):

s0=0.0; s1=0.0; s2=0.0; s3=0.0;
for(int i=0; i<n; i+=4){
  s0 = s0 + A[i];
  s1 = s1 + A[i+1];
  s2 = s2 + A[i+2];
  s3 = s3 + A[i+3];
}
//remainder
s=s0+s1+s2+s3;

SIMD form: R0 ← (0.d0, 0.d0, 0.d0, 0.d0); in the loop …V64ADD(R0,R1) → R0…; finally a “horizontal” ADD sums up the 4 64-bit entries of R0.

Page 43:

SIMD-processing: What about pipelining?!

Vectorized sum – “vertical add” in the loop, “horizontal add” at the end:

R0 ← (0.0,0.0,0.0,0.0)
do i=1, N, 4
  VLOAD A(i:i+3) → R1
  V64ADD(R0,R1) → R0
enddo
sum ← HorizontalADD(R0)

Need to do another MVE step to fill the pipeline stages:

R0 ← (0.0,0.0,0.0,0.0)
R1 ← (0.0,0.0,0.0,0.0)
R2 ← (0.0,0.0,0.0,0.0)
do i=1, N, 12
  LOAD A(i:i+3)    → R3
  LOAD A(i+4:i+7)  → R4
  LOAD A(i+8:i+11) → R5
  V64ADD(R0,R3) → R0
  V64ADD(R1,R4) → R1
  V64ADD(R2,R5) → R2
enddo
…
V64ADD(R0,R1) → R0
V64ADD(R0,R2) → R0
sum ← HorizontalADD(R0)

Page 44:

SIMD-processing: What about pipelining?!


Unrolling factor of vectorized code: 1 AVX iteration performs 4 successive i-iterations.

Performance: 4x higher than the “scalar” version; the start-up phase is much longer…

(Double precision)

Page 45:

Compiler generated AVX code (loop body)


Baseline version (“scalar”): no pipelining, no SIMD → 3 cycles / iteration

Compiler-generated “AVX version” (-O3 -xAVX):
SIMD processing: vaddpd %ymm8 → 4 dp operands (4-way unrolling)
Pipelining: 8-way MVE of the SIMD code
→ 0.25 cycles / iteration (32-way unrolling in total)

Page 46:

SIMD processing – Vector sum (double precision) – 1 core

SIMD has most impact if data is close to the core – other bottlenecks stay the same!

[Figure: performance vs. location of the input data (A[]) in the memory hierarchy. Scalar: code execution in the core is slower than any data transfer. Plain: no SIMD but 4-way MVE. AVX/SIMD: full benefit (close to peak) only if data is in the L1 cache.]

Page 47:

Data parallel SIMD processing

• Requires independent vector-like operations (“data parallel”)

• The compiler is required to generate “vectorized” code → check the compiler output

• Check for the use of “packed SIMD” instructions at runtime (likwid) or in the assembly code

• Packed SIMD may impose alignment constraints, e.g. 16-byte alignment for efficient load/store on Intel Core2 architectures

• Check also for SIMD LOAD / STORE instructions

• Use of packed SIMD instructions reduces the overall number of instructions (typical theoretical max. of 4 instructions / cycle) → SIMD code may improve performance while reducing the instruction count (and thus the measured IPC)!

Page 48:

Data parallel SIMD processing: Boosting performance

• Putting it all together: modern x86_64 based Intel / AMD processor

– One FP MULTIPLY and one FP ADD pipeline can run in parallel, each with a throughput of one FP instruction/cycle (FMA units on AMD Interlagos) → maximum of 2 FP instructions/cycle

– Each pipeline operates on 128 (256) bit registers for packed SSE (AVX) instructions → 2 (4) double precision FP operations per SSE (AVX) instruction → 4 (8) FP operations / cycle (1 MULT & 1 ADD on 2 (4) operands)

Peak performance of a 3 GHz CPU (core):
SSE: 12 GFlop/s or AVX: 24 GFlop/s (double precision)
SSE: 24 GFlop/s or AVX: 48 GFlop/s (single precision)

BUT for “scalar” code: 6 GFlop/s (double and single precision)!

Page 49:

Maximum Floating Point (FP) Performance:

Pcore = F * S * n

F FP instructions per cycle: 2 (1 MULT and 1 ADD)

S FP ops / instruction: 4 (dp) / 8 (sp) (256 Bit SIMD registers – “AVX”)

n Clock speed: ~2.5 GHz

P = 20 GF/s (dp) / 40 GF/s (sp)

There is no single driving force for single core performance!

Scalar (non-SIMD) execution

S = 1 FP op/instruction (dp / sp)

P = 5 GF/s (dp / sp)

Page 50

SIMD registers: floating point (FP) data and beyond

Possible data types in an SSE (128 bit) register; AVX widens only the floating point types (figure not to scale):

integer: 16x 8 bit, 8x 16 bit, 4x 32 bit, 2x 64 bit, or 1x 128 bit
floating point (SSE): 4x 32 bit or 2x 64 bit
floating point (AVX, 256 bit): 8x 32 bit or 4x 64 bit

Page 51

Rules for vectorizable loops / SIMD processing

1. Countable
2. Single entry and single exit
3. Straight line code
4. No function calls (exception: intrinsic math functions)

Better performance with:
5. Simple inner loops with unit stride
6. Minimize indirect addressing
7. Align data structures (SSE: 16 bytes, AVX: 32 bytes)
8. In C, use the restrict keyword for pointers to rule out aliasing

Obstacles for vectorization:
- Non-contiguous memory access
- Data dependencies

Page 52

How to leverage vectorization / SIMD

The compiler does it for you (aliasing, alignment, language)

Source code directives (pragmas) to ease the compiler’s job

Alternative programming models for compute kernels (OpenCL, ispc)

Intrinsics (restricted to C/C++)

Implement directly in assembler

Complexity and efficiency increase from top to bottom of this list.

Page 53

Vectorization and the Intel compiler

Intel compiler will try to use SIMD instructions when enabled to do so

“Poor man’s vector computing”. The compiler can emit messages about vectorized loops (not enabled by default):

plain.c(11): (col. 9) remark: LOOP WAS VECTORIZED.

Use the option -vec_report3 to get full compiler output about which loops were vectorized, which were not, and why (data dependencies!)

Some obstructions will prevent the compiler from applying vectorization even if it is possible (e.g. –fp-model strict for vector sum or pointer aliasing)

You can use source code directives to provide more information to the compiler

Page 54

Vectorization compiler options

The compiler will vectorize starting with -O2.

To enable specific SIMD extensions use the -x option, e.g. -xSSE2 to vectorize for SSE2-capable machines.
Available SIMD extensions: SSE2, SSE3, SSSE3, SSE4.1, SSE4.2, AVX

-xAVX on Sandy Bridge processors

Recommended option: -xHost will optimize for the architecture you compile on

Compiling for AMD Opteron: use plain -O3, as the -x options may involve CPU type checks at runtime!

Page 55

Vectorization source code directives (pragmas)

Fine-grained control of loop vectorization. Use the !DEC$ (Fortran) or #pragma (C/C++) sentinel to start a compiler directive.

#pragma vector always
  vectorize even if it seems inefficient (a hint!)

#pragma novector
  do not vectorize even if possible

#pragma vector nontemporal
  use NT stores when allowed (i.e. alignment conditions are met)

#pragma vector aligned
  specifies that all array accesses are aligned to 16-byte boundaries (DANGEROUS! You must not lie about this!)

Page 56

User mandated vectorization (pragmas)

Starting with Intel Compiler 12.0 the simd pragma is available. #pragma simd enforces vectorization where the other pragmas fail. Prerequisites:

- Countable loop
- Innermost loop
- Must conform to the for-loop style of OpenMP worksharing constructs

There are additional clauses: reduction, vectorlength, private. Refer to the compiler manual for further details.

NOTE: With #pragma simd the compiler may generate incorrect code if the loop violates the vectorization rules!

#pragma simd reduction(+:x)
for (int i=0; i<n; i++) {
  x = x + A[i];
}

Page 57

Basic approach to check the instruction code

Get the assembler code (Intel compiler):
icc -S -O3 -xHost triad.c -o triad.s

Disassemble the executable:
objdump -d ./cacheBench | less

Things to check for:
- Is the code vectorized? Search for the pd/ps suffix: mulpd, addpd, vaddpd, vmulpd
- Is the data loaded with 16-byte moves? movapd, movaps, vmovupd
- For memory-bound code: search for nontemporal stores: movntpd, movntps

The x86 ISA is documented in:
- Intel Software Development Manual (SDM) 2A and 2B
- AMD64 Architecture Programmer's Manual Vol. 1-5

Page 58

Some basics of the x86-64 ISA

16 general purpose registers (64 bit): rax, rbx, rcx, rdx, rsi, rdi, rsp, rbp, r8-r15
Their lower halves alias the eight 32-bit registers: eax, ebx, ecx, edx, esi, edi, esp, ebp

Floating point SIMD registers:
xmm0-xmm15: SSE (128 bit), aliased with the lower half of the 256-bit registers
ymm0-ymm15: AVX (256 bit)

SIMD instructions are distinguished by:
- AVX (VEX) prefix: v
- Operation: mul, add, mov
- Modifier: nontemporal (nt), unaligned (u), aligned (a), high (h)
- Data range: packed (p), scalar (s)
- Data type: single (s), double (d)
Example: vmovupd = v + mov + u + pd, an AVX unaligned move of packed doubles.

Page 59

Some basic single core optimizations – warnings first

“Premature optimization is the root of all evil.” (Donald E. Knuth)

“Parallel performance is easy, single node/core performance is difficult.” (Bill Gropp)

Page 60

Single core: Common sense optimizations (1)

Do less work! Reducing the work to be done is never a bad idea!

Without early exit:

logical :: flag
flag = .false.
do i=1,N
  ! check if at least one is true
  if (complex_func(A(i)) < THRESHOLD) then
    flag = .true.
  endif
enddo

With early exit (does less work):

logical :: flag
flag = .false.
do i=1,N
  ! check if at least one is true and EXIT the do-loop
  if (complex_func(A(i)) < THRESHOLD) then
    flag = .true.
    EXIT
  endif
enddo

Page 61

Single core: Common sense optimizations (2)

Avoid expensive operations!
- FP MULT & FP ADD are the fastest way to compute
- Avoid DIV / SQRT / SIN / COS / TAN, … → table lookup
- Avoid a one-to-one implementation of the algorithm, e.g. A = A + B**2 (A, B float):

(V1) A = A + B**2.0
(V2) A = A + B * B

(V1) is not a good idea: B**2.0 is evaluated as exp(2.0*ln(B))
1. Computing exp & ln is very expensive
2. B < 0 ?!

Most useful if data is close to CPU!

Page 62

Single core: Common sense optimizations (3)

Shrink the working set! Working on small data sets reduces the data transfer volume and increases the probability of cache hits.

Analyze whether appropriate data types are used, e.g. if 4 different particle species have to be distinguished:

integer spec(1:N) with spec(i) = {0,1,2,3}: sizeof(spec(1:N)) = 4*N byte

OR use a 1-byte integer datatype:
integer*1 spec(1:N): sizeof(spec(1:N)) = N byte

OR use 2 bits for each species:
integer*1 spec(1:N/4): sizeof(spec(1:N)) = N/4 byte

Strongly depends on application!

Page 63

Single core: Common sense optimizations (4a)

Elimination of common subexpressions! This often reduces the MFLOP/s rate but improves the runtime.

In principle the compiler should do the job, but do not rely on it!
- Associativity rules may prevent the compiler from doing so
- The compiler may not recognize the subexpression (limited scope)

Page 64

Single core: Common sense optimizations (4b)

Replace expensive functions by table lookup


do iter = ...
  ...
  do i=1,n
    ...
    edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
    BF = 0.5d0*(1.d0+tanh(edelz))
    ...
  enddo
  ...
enddo

Entries of iL, iR, iU, iO, iS, iN: -1, 0, 1

edelz = -6,-5,-4,-3,-2,-1,0,1,2,3,4,5,6

Compute all 13 potential values of tanh(edelz) beforehand and store them in a table with 13 entries: tanh_table(-6:6)

Page 65

Single core: Common sense optimizations (4c)

Replace expensive functions by table lookup


do i=-6,6
  tanh_table(i) = tanh(dble(i))
enddo
do iter = ...
  ...
  do i=1,n
    ...
    edelz = iL(i)+iR(i)+iU(i)+iO(i)+iS(i)+iN(i)
    BF = 0.5d0*(1.d0 + tanh_table(edelz))
    ...
  enddo
  ...
enddo

Page 66

Single core: Common sense optimizations (5)

Avoid branches! Help the compiler to understand and optimize your code!

A code change may enable vectorization, SIMD and other optimizations

By the way, software pipelining also becomes much easier, and no branch prediction is required.