Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Samuel Larsen and Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp

© 2000 MIT


TRANSCRIPT

Page 1: (title slide)

Page 2:

Overview

• Problem statement
• New paradigm for parallelism: SLP
• SLP extraction algorithm
• Results
• SLP vs. ILP and vector parallelism
• Conclusions
• Future work

Page 3:

Multimedia Extensions

• Additions to all major ISAs
• SIMD operations

Instruction Set   Architecture   SIMD Width   Floating Point
AltiVec           PowerPC        128          yes
MMX/SSE           Intel          64/128       yes
3DNow!            AMD            64           yes
VIS               Sun            64           no
MAX2              HP             64           no
MVI               Alpha          64           no
MDMX              MIPS V         64           yes


Pages 4-6:

Using Multimedia Extensions

• Library calls and inline assembly
  – Difficult to program
  – Not portable

• Different extensions to the same ISA
  – MMX and SSE
  – SSE vs. 3DNow!

• Need automatic compilation


Pages 7-8:

Vector Compilation

• Pros:
  – Successful for vector computers
  – Large body of research

• Cons:
  – Involved transformations
  – Targets loop nests


Pages 9-10:

Superword Level Parallelism (SLP)

• Small amount of parallelism
  – Typically 2- to 8-way

• Exists within basic blocks
• Uncovered with a simple analysis

• Independent, isomorphic operations
  – A new paradigm

Page 11:

1. Independent ALU Ops

R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

becomes a single packed operation:

[R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]

Page 12:

2. Adjacent Memory References

R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

becomes one wide load and one packed add:

[R G B] = [R G B] + X[i:i+2]

Page 13:

3. Vectorizable Loops

for (i=0; i<100; i+=1)
    A[i+0] = A[i+0] + B[i+0]

Page 14:

3. Vectorizable Loops

Unrolled by 4:

for (i=0; i<100; i+=4) {
    A[i+0] = A[i+0] + B[i+0]
    A[i+1] = A[i+1] + B[i+1]
    A[i+2] = A[i+2] + B[i+2]
    A[i+3] = A[i+3] + B[i+3]
}

Packed:

for (i=0; i<100; i+=4)
    A[i:i+3] = A[i:i+3] + B[i:i+3]

Page 15:

4. Partially Vectorizable Loops

for (i=0; i<16; i+=1) {
    L = A[i+0] - B[i+0]
    D = D + abs(L)
}

Page 16:

4. Partially Vectorizable Loops

Unrolled by 2:

for (i=0; i<16; i+=2) {
    L = A[i+0] - B[i+0]
    D = D + abs(L)
    L = A[i+1] - B[i+1]
    D = D + abs(L)
}

Partially packed:

for (i=0; i<16; i+=2) {
    [L0 L1] = A[i:i+1] - B[i:i+1]
    D = D + abs(L0)
    D = D + abs(L1)
}


Pages 17-18:

Exploiting SLP with SIMD Execution

• Benefit:
  – Multiple ALU ops → one SIMD op
  – Multiple ld/st ops → one wide memory op

• Cost:
  – Packing and unpacking
  – Reshuffling within a register


Pages 19-21:

Packing/Unpacking Costs

• Packing source operands
• Unpacking destination operands

A = f()
B = g()
C = A + 2
D = B + 3
E = C / 5
F = D * 7

[A B] must be packed before the SIMD add, [C D] = [A B] + [2 3], and [C D] must be unpacked for the scalar uses in E and F.


Pages 22-23:

Optimizing Program Performance

• To achieve the best speedup:
  – Maximize parallelization
  – Minimize packing/unpacking

• Many packing possibilities
  – Worst case: n ops → n! configurations
  – Different cost/benefit for each choice


Pages 24-25:

Observation 1: Packing Costs Can Be Amortized

• Use packed result operands:

  A = B + C
  D = E + F
  G = A - H
  I = D - J

• Share packed source operands:

  A = B + C
  D = E + F
  G = B + H
  I = E + J


Pages 26-27:

Observation 2: Adjacent Memory is Key

• Large potential performance gains
  – Eliminate ld/st instructions
  – Reduce memory bandwidth

• Few packing possibilities
  – Only one ordering exploits pre-packing


Pages 28-29:

SLP Extraction Algorithm

• Identify adjacent memory references

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]


Pages 30-31:

SLP Extraction Algorithm

• Follow def-use chains

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[H J] = [C D] - [A B]


Pages 32-33:

SLP Extraction Algorithm

• Follow use-def chains

A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[C D] = [E F] * [3 5]
[H J] = [C D] - [A B]


Page 35:

SLP Compiler Results

• SLP compiler implemented in SUIF
• Tested on two benchmark suites:
  – SPEC95fp
  – Multimedia kernels

• Performance measured three ways:
  – SLP availability
  – Compared to vector parallelism
  – Speedup on AltiVec

Page 36:

SLP Availability

[Bar chart: % of dynamic SUIF instructions eliminated (0–100%), at 128-bit and 1024-bit superword widths, for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp, FIR, IIR, VMM, MMM, and YUV.]

Page 37:

SLP vs. Vector Parallelism

[Bar chart comparing SLP with vector parallelism (0–100%) for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, and fpppp.]

Page 38:

Speedup on AltiVec

[Bar chart: AltiVec speedups on a 1.0–2.0 scale for swim, tomcatv, FIR, IIR, VMM, and MMM; YUV reaches 6.7, off the scale.]


Pages 39-40:

SLP vs. Vector Parallelism

• Extracted with a simple analysis
  – SLP is fine-grained → basic blocks

• Superset of vector parallelism
  – Unrolling transforms vector parallelism into SLP
  – Handles partially vectorizable loops

Page 41:

SLP vs. Vector Parallelism

[Diagram: SLP groups operations within a single basic block.]

Page 42:

SLP vs. Vector Parallelism

[Diagram: vector parallelism groups the same operation across loop iterations.]


Pages 43-45:

SLP vs. ILP

• Subset of instruction-level parallelism

• SIMD hardware is simpler
  – Lack of heavily ported register files

• SIMD instructions are more compact
  – Reduces instruction fetch bandwidth


Pages 46-48:

SLP and ILP

• SLP & ILP can be exploited together
  – Many architectures can already do this

• SLP & ILP may compete
  – Occurs when parallelism is scarce

• Unroll the loop more times
  – When ILP is due to loop-level parallelism


Pages 49-52:

Conclusions

• Multimedia architectures abundant
  – Need automatic compilation

• SLP is the right paradigm
  – 20% non-vectorizable in SPEC95fp

• SLP extraction successful
  – Simple, local analysis
  – Provides speedups from 1.24 to 6.70

• Found SLP in general-purpose codes


Pages 53-54:

Future Work

• SLP analysis beyond basic blocks
  – Packing maintained across blocks
  – Loop-invariant packing
  – Fill unused slots with speculative ops

• SLP architectures
  – Emphasis on SIMD
  – Better packing/unpacking

Page 55: (closing title slide)