Exploiting Superword Level Parallelism with Multimedia Instruction Sets (presentation transcript)
© 2000 MIT
Exploiting Superword Level Parallelism with Multimedia Instruction Sets

Samuel Larsen and Saman Amarasinghe
Laboratory for Computer Science, Massachusetts Institute of Technology
{slarsen,saman}@lcs.mit.edu
www.cag.lcs.mit.edu/slp
Overview
• Problem statement
• New paradigm for parallelism: SLP
• SLP extraction algorithm
• Results
• SLP vs. ILP and vector parallelism
• Conclusions
• Future work
Multimedia Extensions
• Additions to all major ISAs
• SIMD operations

| Instruction Set | Architecture | SIMD Width (bits) | Floating Point |
|-----------------|--------------|-------------------|----------------|
| AltiVec         | PowerPC      | 128               | yes            |
| MMX/SSE         | Intel        | 64/128            | yes            |
| 3DNow!          | AMD          | 64                | yes            |
| VIS             | Sun          | 64                | no             |
| MAX2            | HP           | 64                | no             |
| MVI             | Alpha        | 64                | no             |
| MDMX            | MIPS V       | 64                | yes            |
Using Multimedia Extensions
• Library calls and inline assembly
  – Difficult to program
  – Not portable
• Different extensions to the same ISA
  – MMX and SSE
  – SSE vs. 3DNow!
• Need automatic compilation
Vector Compilation
• Pros:
  – Successful for vector computers
  – Large body of research
• Cons:
  – Involved transformations
  – Targets loop nests
Superword Level Parallelism (SLP)
• Small amount of parallelism
  – Typically 2- to 8-way
• Exists within basic blocks
• Uncovered with a simple analysis
• Independent isomorphic operations
  – A new paradigm
1. Independent ALU Ops
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

[R G B] = [R G B] + [XR XG XB] * [1.08327 1.89234 1.29835]
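The three updates are independent and isomorphic, which is exactly what makes them packable. As a rough illustration (the input values and the list-based "registers" are invented for this sketch), the packed form can be modeled in Python:

```python
# Scalar form: three independent, isomorphic updates.
R, G, B = 10.0, 20.0, 30.0
XR, XG, XB = 1.0, 2.0, 3.0
R = R + XR * 1.08327
G = G + XG * 1.89234
B = B + XB * 1.29835

# Superword form: one packed update over [R G B].
rgb = [10.0, 20.0, 30.0]
x = [1.0, 2.0, 3.0]
w = [1.08327, 1.89234, 1.29835]
rgb = [c + xc * wc for c, xc, wc in zip(rgb, x, w)]

# Both forms perform the same floating-point operations.
assert rgb == [R, G, B]
```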
2. Adjacent Memory References
R = R + X[i+0]
G = G + X[i+1]
B = B + X[i+2]

[R G B] = [R G B] + X[i:i+2]
3. Vectorizable Loops

for (i=0; i<100; i+=1)
  A[i+0] = A[i+0] + B[i+0]
3. Vectorizable Loops
for (i=0; i<100; i+=4)
  A[i+0] = A[i+0] + B[i+0]
  A[i+1] = A[i+1] + B[i+1]
  A[i+2] = A[i+2] + B[i+2]
  A[i+3] = A[i+3] + B[i+3]

for (i=0; i<100; i+=4)
  A[i:i+3] = A[i:i+3] + B[i:i+3]
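The equivalence the transformation relies on can be checked directly. A small Python sketch (slices stand in for superwords; note the slide's inclusive A[i:i+3] corresponds to Python's exclusive A[i:i+4]):

```python
A = list(range(100))
B = [2 * v for v in range(100)]

# Scalar loop, one element per iteration.
A_scalar = A[:]
for i in range(100):
    A_scalar[i] = A_scalar[i] + B[i]

# Superword loop: four elements per iteration, one "packed" add each.
A_simd = A[:]
for i in range(0, 100, 4):
    A_simd[i:i+4] = [a + b for a, b in zip(A_simd[i:i+4], B[i:i+4])]

assert A_simd == A_scalar
```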
4. Partially Vectorizable Loops
for (i=0; i<16; i+=1)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
4. Partially Vectorizable Loops
for (i=0; i<16; i+=2)
  L = A[i+0] - B[i+0]
  D = D + abs(L)
  L = A[i+1] - B[i+1]
  D = D + abs(L)

for (i=0; i<16; i+=2)
  [L0 L1] = A[i:i+1] - B[i:i+1]
  D = D + abs(L0)
  D = D + abs(L1)
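Only the subtraction packs; the accumulation into D is a sequential reduction and stays scalar. A Python sketch of both versions, with made-up input data:

```python
A = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9, 3]
B = [2, 7, 1, 8, 2, 8, 1, 8, 2, 8, 4, 5, 9, 0, 4, 5]

# Fully scalar version.
D_scalar = 0
for i in range(16):
    L = A[i] - B[i]
    D_scalar += abs(L)

# Partially vectorized: one packed subtract, then scalar reduction.
D_simd = 0
for i in range(0, 16, 2):
    L0, L1 = A[i] - B[i], A[i+1] - B[i+1]  # models [L0 L1] = A[i:i+1] - B[i:i+1]
    D_simd += abs(L0)
    D_simd += abs(L1)

assert D_simd == D_scalar
```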
Exploiting SLP with SIMD Execution
• Benefit:
  – Multiple ALU ops → One SIMD op
  – Multiple ld/st ops → One wide mem op
• Cost:
  – Packing and unpacking
  – Reshuffling within a register
Packing/Unpacking Costs

• Packing source operands: A and B must first be packed into [A B]
• Unpacking destination operands: C and D must be unpacked for their scalar uses

A = f()
B = g()
C = A + 2      →   [C D] = [A B] + [2 3]
D = B + 3
E = C / 5
F = D * 7
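A Python sketch of that overhead: the pack and unpack steps bracket the single SIMD add (f and g are placeholder scalar producers invented for the example):

```python
def f():  # placeholder scalar producer
    return 4

def g():  # placeholder scalar producer
    return 10

A = f()
B = g()

packed = [A, B]                                   # pack sources (extra op)
packed = [p + k for p, k in zip(packed, [2, 3])]  # [C D] = [A B] + [2 3]
C, D = packed                                     # unpack destinations (extra op)

E = C / 5                                         # scalar uses force the unpack
F = D * 7
assert (C, D, E, F) == (6, 13, 1.2, 91)
```

The two bracketing steps are pure overhead; they only pay off if the SIMD op replaces enough scalar work.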
Optimizing Program Performance
• To achieve the best speedup:
  – Maximize parallelization
  – Minimize packing/unpacking
• Many packing possibilities
  – Worst case: n ops → n! configurations
  – Different cost/benefit for each choice
Observation 1: Packing Costs can be Amortized

• Use packed result operands:

A = B + C        G = A - H
D = E + F        I = D - J

• Share packed source operands:

A = B + C        G = B + H
D = E + F        I = E + J
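Both cases reuse a pack that already exists instead of repacking. A Python sketch with hypothetical scalar values:

```python
B, C, E, F, H, J = 1, 2, 3, 4, 5, 6

# Use packed result operands: [A D] comes out of the add already packed,
# so the subtract consumes it with no repacking.
AD = [B + C, E + F]                      # [A D] = [B E] + [C F]
GI = [AD[0] - H, AD[1] - J]              # [G I] = [A D] - [H J]

# Share packed source operands: pack [B E] once, use it in two packed adds.
BE = [B, E]
AD2 = [b + c for b, c in zip(BE, [C, F])]   # [A D] = [B E] + [C F]
GI2 = [b + h for b, h in zip(BE, [H, J])]   # [G I] = [B E] + [H J]

assert AD == AD2 == [3, 7]
assert GI == [-2, 1] and GI2 == [6, 9]
```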
Observation 2: Adjacent Memory is Key

• Large potential performance gains
  – Eliminate ld/st instructions
  – Reduce memory bandwidth
• Few packing possibilities
  – Only one ordering exploits pre-packing
SLP Extraction Algorithm
• Identify adjacent memory references
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
SLP Extraction Algorithm
• Follow def-use chains
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[H J] = [C D] - [A B]
SLP Extraction Algorithm
• Follow use-def chains
A = X[i+0]
C = E * 3
B = X[i+1]
H = C - A
D = F * 5
J = D - B

[A B] = X[i:i+1]
[C D] = [E F] * [3 5]
[H J] = [C D] - [A B]
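The walkthrough above can be sketched as a toy pack-growing pass: seed packs from adjacent loads, then extend them along def-use and use-def chains. This is an illustrative simplification, not the SUIF implementation: it omits independence checks, alignment, and the cost model.

```python
# Statements as dest: (op, src1, src2); loads carry (base, offset).
stmts = {
    "A": ("load", "X", 0),
    "C": ("*", "E", 3),
    "B": ("load", "X", 1),
    "H": ("-", "C", "A"),
    "D": ("*", "F", 5),
    "J": ("-", "D", "B"),
}

# Step 1: seed packs with loads from adjacent addresses.
loads = [(d, s) for d, s in stmts.items() if s[0] == "load"]
packs = [(d1, d2) for d1, s1 in loads for d2, s2 in loads
         if s1[1] == s2[1] and s2[2] == s1[2] + 1]   # [("A", "B")]

# Step 2: follow def-use chains: pair isomorphic statements that use
# the packed values in the same operand position.
for a, b in list(packs):
    for d1, s1 in stmts.items():
        for d2, s2 in stmts.items():
            if (d1 != d2 and s1[0] == s2[0] != "load"
                    and a in s1[1:] and b in s2[1:]
                    and s1[1:].index(a) == s2[1:].index(b)
                    and (d1, d2) not in packs):
                packs.append((d1, d2))

# Step 3: follow use-def chains: pack the statements that define the
# first operands (the second operands, A and B, are already packed).
for a, b in list(packs):
    s1, s2 = stmts[a], stmts[b]
    pair = (s1[1], s2[1])
    if (s1[0] == s2[0] != "load" and pair[0] in stmts
            and pair[1] in stmts and pair not in packs):
        packs.append(pair)

assert packs == [("A", "B"), ("H", "J"), ("C", "D")]
```

Each discovered pair corresponds to one SIMD statement in the packed code above.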
SLP Compiler Results
• SLP compiler implemented in SUIF
• Tested on two benchmark suites
  – SPEC95fp
  – Multimedia kernels
• Performance measured three ways:
  – SLP availability
  – Compared to vector parallelism
  – Speedup on AltiVec
SLP Availability
(Bar chart: percentage of dynamic SUIF instructions eliminated, 0–100, for swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, fpppp, FIR, IIR, VMM, MMM, and YUV, shown for 128-bit and 1024-bit superwords.)
SLP vs. Vector Parallelism
(Bar chart: 0–100% of parallelism captured, comparing SLP and vector compilation on swim, tomcatv, mgrid, su2cor, wave5, apsi, hydro2d, turb3d, applu, and fpppp.)
Speedup on AltiVec
(Bar chart: speedup on AltiVec, scale 1.0–2.0, for swim, tomcatv, FIR, IIR, VMM, MMM, and YUV; one bar, labeled 6.7, exceeds the chart scale.)
SLP vs. Vector Parallelism
• Extracted with a simple analysis
  – SLP is fine grained → basic blocks
• Superset of vector parallelism
  – Unrolling transforms vector parallelism into SLP
  – Handles partially vectorizable loops
SLP vs. Vector Parallelism
(Diagram: SLP groups are drawn from statements within a single basic block.)
(Diagram: vector parallelism spans corresponding statements across loop iterations.)
SLP vs. ILP
• Subset of instruction level parallelism
• SIMD hardware is simpler
  – Lack of heavily ported register files
• SIMD instructions are more compact
  – Reduces instruction fetch bandwidth
SLP and ILP
• SLP & ILP can be exploited together
  – Many architectures can already do this
• SLP & ILP may compete
  – Occurs when parallelism is scarce
• Unroll the loop more times
  – When ILP is due to loop-level parallelism
![Page 49: Exploiting Superword Level Parallelism with Multimedia Instruction Sets](https://reader036.vdocument.in/reader036/viewer/2022062408/56813ba9550346895da4d9fa/html5/thumbnails/49.jpg)
© 2000 MIT
Conclusions
• Multimedia architectures abundant– Need automatic compilation
![Page 50: Exploiting Superword Level Parallelism with Multimedia Instruction Sets](https://reader036.vdocument.in/reader036/viewer/2022062408/56813ba9550346895da4d9fa/html5/thumbnails/50.jpg)
© 2000 MIT
Conclusions
• Multimedia architectures abundant– Need automatic compilation
• SLP is the right paradigm– 20% non-vectorizable in SPEC95fp
![Page 51: Exploiting Superword Level Parallelism with Multimedia Instruction Sets](https://reader036.vdocument.in/reader036/viewer/2022062408/56813ba9550346895da4d9fa/html5/thumbnails/51.jpg)
© 2000 MIT
Conclusions
• Multimedia architectures abundant– Need automatic compilation
• SLP is the right paradigm– 20% non-vectorizable in SPEC95fp
• SLP extraction successful– Simple, local analysis– Provides speedups from 1.24 – 6.70
![Page 52: Exploiting Superword Level Parallelism with Multimedia Instruction Sets](https://reader036.vdocument.in/reader036/viewer/2022062408/56813ba9550346895da4d9fa/html5/thumbnails/52.jpg)
© 2000 MIT
Conclusions
• Multimedia architectures abundant
  – Need automatic compilation
• SLP is the right paradigm
  – 20% non-vectorizable in SPEC95fp
• SLP extraction successful
  – Simple, local analysis
  – Provides speedups from 1.24 to 6.70
• Found SLP in general-purpose codes
Future Work
• SLP analysis beyond basic blocks
  – Packing maintained across blocks
  – Loop-invariant packing
  – Fill unused slots with speculative ops
• SLP architectures
  – Emphasis on SIMD
  – Better packing/unpacking
![Page 55: Exploiting Superword Level Parallelism with Multimedia Instruction Sets](https://reader036.vdocument.in/reader036/viewer/2022062408/56813ba9550346895da4d9fa/html5/thumbnails/55.jpg)
© 2000 MIT
Exploiting Superword Level Parallelism with Multimedia
Instruction Sets
Samuel LarsenSaman Amarasinghe
Laboratory for Computer ScienceMassachusetts Institute of Technology
{slarsen,saman}@lcs.mit.eduwww.cag.lcs.mit.edu/slp