University of Michigan Electrical Engineering and Computer Science MacroSS: Macro-SIMDization of Streaming Applications Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke* * Advanced Computer Arch. Lab., University of Michigan † Nvidia Corp. ‡ IBM T.J. Watson Research Center

Post on 15-Jan-2016


Page 1:

MacroSS: Macro-SIMDization of Streaming Applications

Amir Hormati*, Yoonseo Choi‡, Mark Woh*, Manjunath Kudlur†, Rodric Rabbah‡, Trevor Mudge*, Scott Mahlke*

* Advanced Computer Arch. Lab., University of Michigan
† Nvidia Corp.
‡ IBM T.J. Watson Research Center

Page 2:

Importance of SIMD

• Energy and area efficient way to exploit data-level parallelism

• Performance in multimedia and communication apps

• Ubiquitous in modern processors
  – Intel: SSE, Larrabee
  – IBM: Altivec, Cell SPE
  – ARM: Neon

[Figure: three blocks, each labeled Control Unit, Functional Units, and Cache]

Page 3:

Stream Computing

• Prevalent in embedded, desktop and server systems

• Many optimizations for mapping and scheduling applications to parallel architectures

• Retargetability is a big plus in streaming languages

• Task, pipeline, and data-level parallelism are mapped into core-level parallelism

• Data-level parallelism on SIMD engines is not utilized

Page 4:

Traditional Vectorization on Streaming Applications

[Figure: bar chart of speedup (x), axis from 0 to 3.5, for ICC + Auto Vectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and the average]

Page 5:

Why are SIMD engines under-utilized?

• Finding data-level parallelism suitable for SIMD engines

• Proper data-alignment

• Complicated compiler optimization and transformations

• Wide variety of SIMD standards

Page 6:

In this work…

• Macro-level SIMDization techniques for streaming languages.

• MacroSS compiler for StreamIt language

• Hardware-based buffer optimizations for packing/unpacking operations

• Evaluation of MacroSS on Intel Core i7

Page 7:

StreamIt

• Main constructs:
  – Filter: encapsulates computation
    • Stateful
    • Stateless
  – Pipeline: expresses pipeline parallelism
  – Splitjoin: expresses task/data-level parallelism

• Exposes different types of parallelism

• Scheduling and rate-matching are needed

[Figure: example stream graph with labeled pipeline, filter, and splitjoin constructs]
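The rate-matching step mentioned above can be sketched for a simple linear pipeline. This is a minimal illustration, not the StreamIt scheduler: the function name `repetitions` and the tuple layout are assumptions made for this example, and the rates are those of actors D and E used later in the talk.

```python
from fractions import Fraction
from math import lcm


def repetitions(rates):
    """Return the minimal repetition count per actor for a linear pipeline.

    `rates` is a list of (name, pop, push) tuples in pipeline order. The
    first actor's pop and the last actor's push do not affect the result.
    """
    # Balance each edge: reps[i] * push[i] == reps[i + 1] * pop[i + 1].
    ratios = [Fraction(1)]
    for (_, _, push), (_, pop, _) in zip(rates, rates[1:]):
        ratios.append(ratios[-1] * push / pop)
    # Scale the fractional ratios to the smallest integer solution.
    scale = lcm(*(r.denominator for r in ratios))
    return {name: int(r * scale) for (name, _, _), r in zip(rates, ratios)}


# D pushes 2 items per firing, E pops 3 items per firing.
schedule = repetitions([("D", 2, 2), ("E", 3, 4)])
print(schedule)
```

With these rates the minimal steady state fires D three times and E twice. The counts 12 and 8 shown next to D and E in later slides are this schedule scaled by four.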

Page 8:

Macro SIMDization

• SIMDization at graph level

• Tunes the graph based on the target system
  – SIMD standards
  – Wide/Narrow SIMD

• Actor SIMDization:
  – Single-Actor
  – Vertical
  – Horizontal

Page 9:

Single-Actor SIMDization Overview

[Figure: execution timelines for actor E, with E (8) as the scalar actor and E v as its vectorized form, shown in four panels: Serial Execution, Execution Reordering, Ideal Vectorization, Realistic Vectorization]

Page 10:

Single Actor SIMDization

E (8)

    x0 = pop();
    x1 = pop();
    x2 = pop();
    result[0] = x1 * cos(x0) + x2;
    result[1] = x0 * cos(x1) + x2;
    result[2] = x1 * sin(x0) + x2;
    result[3] = x0 * sin(x1) + x2;
    for (i : 0 to 3)
        push(result[i]);

EV (1)

    x0_v.{3} = peek(9);
    x0_v.{2} = peek(6);
    x0_v.{1} = peek(3);
    x0_v.{0} = pop();

    x1_v.{3} = peek(9);
    x1_v.{2} = peek(6);
    x1_v.{1} = peek(3);
    x1_v.{0} = pop();

    x2_v.{3} = peek(9);
    x2_v.{2} = peek(6);
    x2_v.{1} = peek(3);
    x2_v.{0} = pop();

    result_v[0] = x1_v * cos(x0_v) + x2_v;
    result_v[1] = x0_v * cos(x1_v) + x2_v;
    result_v[2] = x1_v * sin(x0_v) + x2_v;
    result_v[3] = x0_v * sin(x1_v) + x2_v;

    for (i : 0 to 3) {
        rpush(result_v[i].{3}, 12);
        rpush(result_v[i].{2}, 8);
        rpush(result_v[i].{1}, 4);
        push(result_v[i].{0});
    }

• Only stateless actors

• Scalar buffer accesses

• Strided pushes and pops

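The strided peek/pop and rpush/push pattern on this slide can be checked with a small simulation. The sketch below is illustrative only: the `Tape` class and helper names are hypothetical, vector lanes are emulated with Python lists, and the final adjustment of the read and write pointers after each vectorized firing is an assumption, since the slide does not show it.

```python
from math import cos, sin

W, POP, PUSH = 4, 3, 4  # SIMD width and actor E's pop/push rates


class Tape:
    """Scalar buffer with a read head and a write tail (hypothetical helper)."""

    def __init__(self, data=()):
        self.data = list(data)
        self.head = 0               # next element to read
        self.tail = len(self.data)  # next slot to write

    def _store(self, index, value):
        if index >= len(self.data):
            self.data.extend([None] * (index + 1 - len(self.data)))
        self.data[index] = value

    def pop(self):
        value = self.data[self.head]
        self.head += 1
        return value

    def peek(self, offset):
        return self.data[self.head + offset]

    def push(self, value):
        self._store(self.tail, value)
        self.tail += 1

    def rpush(self, value, offset):
        # Write `offset` slots past the tail without advancing it.
        self._store(self.tail + offset, value)


def body(x0, x1, x2):
    """Actor E's computation on one set of inputs."""
    return [x1 * cos(x0) + x2, x0 * cos(x1) + x2,
            x1 * sin(x0) + x2, x0 * sin(x1) + x2]


def fire_scalar(src, dst):
    for value in body(src.pop(), src.pop(), src.pop()):
        dst.push(value)


def fire_vector(src, dst):
    inputs = []
    for _ in range(POP):
        upper = [src.peek(POP * k) for k in range(1, W)]  # lanes 1..W-1
        inputs.append([src.pop()] + upper)                # lane 0 via pop
    src.head += POP * (W - 1)  # assumption: skip items read by lanes 1..W-1
    per_lane = [body(*xs) for xs in zip(*inputs)]  # per_lane[lane][i]
    for i in range(PUSH):
        for k in range(W - 1, 0, -1):
            dst.rpush(per_lane[k][i], PUSH * k)
        dst.push(per_lane[0][i])
    dst.tail += PUSH * (W - 1)  # assumption: skip slots filled by rpush


samples = [0.1 * n for n in range(POP * 8)]

scalar_in, scalar_out = Tape(samples), Tape()
for _ in range(8):
    fire_scalar(scalar_in, scalar_out)

vector_in, vector_out = Tape(samples), Tape()
for _ in range(8 // W):
    fire_vector(vector_in, vector_out)

print(scalar_out.data == vector_out.data)
```

Eight scalar firings and two vectorized firings consume the same 24 inputs and write the same 32 outputs in the same positions, which is why the surrounding actors can keep using scalar buffers.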

Page 11:

Why Scalar Buffers?

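[Figure: D (pop=2, push=2, repeated 12 times) feeds E (pop=3, push=4, repeated 8 times) through a buffer of 128-bit vectors. A vectorized D writes the rows (0, 2, 4, 6), (1, 3, 5, 7) in its 1st execution, (8, 10, 12, 14), (9, 11, 13, 15) in its 2nd, and (16, 18, 20, 22), (17, 19, 21, 23) in its 3rd. A vectorized E expects the rows (0, 3, 6, 9), (1, 4, 7, 10), (2, 5, 8, 11) in its 1st execution and (12, 15, 18, 21), (13, 16, 19, 22), (14, 17, 20, 23) in its 2nd. The two layouts do not match, whereas a scalar buffer holds elements 0 to 23 in order.]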
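The layout mismatch in this figure follows directly from the actors' rates. As a minimal sketch (the helper `vector_rows` is a hypothetical name, and a SIMD width of four 32-bit lanes is assumed from the 128-bit label), the rows written by a vectorized D and the rows expected by a vectorized E can be enumerated as:

```python
def vector_rows(rate, width, total):
    """Rows of a vector buffer as accessed by an actor vectorized `width` ways.

    Each vectorized firing touches `rate` vectors. Lane k of row r holds the
    scalar-stream element `base + k * rate + r`.
    """
    rows = []
    for base in range(0, total, rate * width):
        for r in range(rate):
            rows.append([base + k * rate + r for k in range(width)])
    return rows


written_by_d = vector_rows(rate=2, width=4, total=24)  # D pushes 2 per firing
read_by_e = vector_rows(rate=3, width=4, total=24)     # E pops 3 per firing

for row in written_by_d:
    print("D writes", row)
for row in read_by_e:
    print("E reads ", row)
```

Both actors cover the same 24 elements, but no row written by D equals a row read by E, so vector loads on E's side would pick up the wrong elements unless the data is repacked.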

Page 12:

Vertical SIMDization

[Figure: D (pop=2, push=2, repeated 12 times) and E (pop=3, push=4, repeated 8 times) are fused into one actor, 3D 2E (pop=6, push=8, repeated 4 times). Each fused execution groups three D firings with two E firings: D0 D1 D2 with E0 E1, D3 D4 D5 with E2 E3, D6 D7 D8 with E4 E5, and D9 D10 D11 with E6 E7.]
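The rates of the fused actor can be derived from the repetition counts. The sketch below is an illustration under stated assumptions, not the MacroSS algorithm: `Actor` and `fuse_vertically` are hypothetical names, and the fusion is chosen so that the fused actor repeats exactly as many times as the SIMD width (four here).

```python
from typing import NamedTuple


class Actor(NamedTuple):
    name: str
    pop: int
    push: int
    reps: int  # firings per steady-state schedule


def fuse_vertically(producer: Actor, consumer: Actor, simd_width: int) -> Actor:
    """Fuse a producer/consumer pair so the result repeats `simd_width` times."""
    if producer.reps % simd_width or consumer.reps % simd_width:
        raise ValueError("repetition counts must be multiples of the SIMD width")
    p_inner = producer.reps // simd_width  # producer firings per fused firing
    c_inner = consumer.reps // simd_width  # consumer firings per fused firing
    # Everything the inner producers push must be popped by the inner consumers.
    if p_inner * producer.push != c_inner * consumer.pop:
        raise ValueError("rates are not balanced")
    return Actor(
        name=f"{p_inner}{producer.name} {c_inner}{consumer.name}",
        pop=p_inner * producer.pop,
        push=c_inner * consumer.push,
        reps=simd_width,
    )


fused = fuse_vertically(Actor("D", 2, 2, 12), Actor("E", 3, 4, 8), simd_width=4)
print(fused)
```

For D and E this reproduces the actor on the slide: three D firings and two E firings per execution, with pop=6 and push=8. The six items passed from D to E inside one fused execution never cross a vector lane, so the internal buffer needs no repacking.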

Page 13:

Horizontal SIMDization

• Find isomorphic actors in split/join structures

• The isomorphic actors are merged into one vectorized actor

• Actors can be either stateful or stateless.

[Figure: Source feeds a Splitter that fans out to n parallel branches with actors A1, B1, C1 through An, Bn, Cn; a Joiner merges the branches into Sink]
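Stateful actors can be merged horizontally because each branch's state stays in its own vector lane. The following sketch illustrates this with hypothetical classes (`Accumulator`, `HorizontalAccumulator`) and a round-robin splitter and joiner with weight 1; lanes are emulated with Python lists rather than real SIMD registers.

```python
class Accumulator:
    """Stateful scalar actor: scales its input and adds it to a running total."""

    def __init__(self, gain):
        self.gain = gain
        self.total = 0.0

    def fire(self, x):
        self.total += self.gain * x
        return self.total


class HorizontalAccumulator:
    """Isomorphic branches merged into one actor; lane k is branch k."""

    def __init__(self, gains):
        self.gain = list(gains)               # per-lane constants
        self.total = [0.0] * len(self.gain)   # per-lane state

    def fire(self, xs):
        self.total = [t + g * x for t, g, x in zip(self.total, self.gain, xs)]
        return list(self.total)


def run_splitjoin(branches, stream):
    """Round-robin splitter and joiner around separate scalar branches."""
    n, out = len(branches), []
    for i in range(0, len(stream), n):
        for actor, x in zip(branches, stream[i:i + n]):
            out.append(actor.fire(x))
    return out


def run_horizontal(actor, stream):
    """The same splitjoin with the branches merged into one vector actor."""
    n, out = len(actor.gain), []
    for i in range(0, len(stream), n):
        out.extend(actor.fire(stream[i:i + n]))
    return out


gains = [1.0, 2.0, 3.0, 4.0]
stream = [float(v) for v in range(16)]
scalar_result = run_splitjoin([Accumulator(g) for g in gains], stream)
merged_result = run_horizontal(HorizontalAccumulator(gains), stream)
print(scalar_result == merged_result)
```

The branches differ only in their constants and state, which is what makes them isomorphic: one instruction stream serves all four lanes.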

Page 14:

[Figure: worked example on a stream graph containing A (pop=n, push=8), Splitter (4, 4, 4, 4), four branches with B0 to B3 (pop=12, push=3) and C0 to C3 (pop=1, push=1), Joiner (1, 1, 1, 1), D (pop=2, push=2), E (pop=3, push=4), F (pop=4, push=1), G (pop=2, push=8), and H (pop=8, push=n). The graph is annotated with the regions handled by Horizontal SIMDization, Vertical SIMDization, and Single-Actor SIMDization. The transformed graph shows HSplitter (4), HJoiner (1), and the fused actor 3D 2E (pop=6, push=8).]

Page 15:

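Streaming Address Generation

[Figure: D (push=3, repeated 8 times) feeds E (pop=2, repeated 12 times), shown once with a scalar buffer and once with a vector buffer. The scalar buffer holds elements 0 to 23 in order, four per row. The vector buffer holds the rows (0, 3, 6, 9), (1, 4, 7, 10), (2, 5, 8, 11), (12, 15, 18, 21), (13, 16, 19, 22), (14, 17, 20, 23).]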

• Area overhead less than 1% on Core i7.

• Critical path: two 16-bit adds and one 64-bit add.
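The slides do not give the address computation itself, so the following is only a sketch of one mapping that is consistent with the vector buffer layout in the figure. The function name `vector_buffer_address` and its parameters are assumptions for illustration; `rate` is the vectorized producer's push rate and `width` is the number of SIMD lanes.

```python
def vector_buffer_address(index, rate, width):
    """Address, in a vector buffer, of scalar-stream element `index`.

    The buffer is assumed to be written by a producer vectorized `width`
    ways that pushes `rate` vectors per firing.
    """
    block = rate * width                    # elements per vectorized firing
    group, within = divmod(index, block)    # which firing, offset inside it
    lane, row = divmod(within, rate)        # which lane, which pushed vector
    return group * block + row * width + lane


RATE, WIDTH, TOTAL = 3, 4, 24  # D pushes 3, four lanes, 24 elements

# Lay the stream out the way a vectorized D would write it.
memory = [None] * TOTAL
for element in range(TOTAL):
    memory[vector_buffer_address(element, RATE, WIDTH)] = element

# A scalar consumer that translates every access reads the stream in order.
in_order = [memory[vector_buffer_address(i, RATE, WIDTH)] for i in range(TOTAL)]
print(memory[:4], memory[12:16], in_order == list(range(TOTAL)))
```

Computing this translation in hardware is what lets a scalar consumer such as E read a vector buffer directly, without separate packing and unpacking instructions.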

Page 16:

Traditional vs. Macro SIMDization

                                                  Traditional SIMDization   Macro-SIMDization
Applicability                                     Any                       Streaming
Adjust the schedule                               No                        Yes
Tune streaming graph                              No                        Yes
Identify isomorphic actors                        No                        Yes
Easily retargetable                               No                        Yes
Complexity of optimizations and transformations   High                      Low

Page 17:

Experimental Setup

[Figure: compilation flow from Streaming Program through Frontend Compiler, Backend Compiler, C Code, and Host Compiler to Intel Core i7]

• Frontend: MIT StreamIt compiler

• Backend: MacroSS

• Host compiler: ICC 11.1 compiles the generated C/C++ code

• Target: Core i7 with SSE4

Page 18:

Macro-SIMDization vs. Traditional

[Figure: bar chart of speedup (x), axis from 0 to 3.5, comparing ICC + Auto Vectorize, ICC + Macro SIMD, and ICC + Macro SIMD + Autovectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and the average]

Page 19:

Benefits of SAGU

[Figure: bar chart of % improvement, axis from 0 to 25, on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and the average]

Page 20:

Conclusion

• Streaming is prevalent in all computing domains.

• Applying traditional SIMDization on streaming applications fails to utilize SIMD engines.

• Macro-SIMDization is done at a higher level.

• MacroSS outperforms traditional SIMDization techniques by 54%.

Page 21:

Questions and Comments

Page 22:

Macro-SIMDization vs. Traditional

[Figure: bar chart of speedup (x), axis from 0 to 4, comparing GCC + Auto Vectorize, GCC + Macro SIMD, and GCC + Macro SIMD + Autovectorize on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and the average]

Page 23:

SAGU Implementation

• Area overhead less than 1% on Core i7.

• Critical path: two 16-bit adds and one 64-bit add.

• Minor ISA modifications are needed.

Page 24:

SIMD + Multi-core Scheduling

• How to schedule for a heterogeneous SIMD system?

• SIMDization reduces memory/bus traffic

• Exploit SIMD parallelism before core-level parallelism.

• Is this the best we can do?

Page 25:

Multicore + Macro-SIMDization

[Figure: bar chart of speedup (x), axis from 0 to 4.5, comparing 2 Cores, 4 Cores, 2 Cores + Macro SIMD, and 4 Cores + Macro SIMD on AudioBeam, BeamFormer, DCT, FFT, FM Radio, Matrix Multiply, Matrix Multiply Block, Bitonic Sort, FilterBank, MP3 Decoder, and the average]