TRANSCRIPT
-
Ehsan Totoni, Babak Behzad, Swapnil Ghike, Josep Torrellas
-
• Increasing number of transistors on chip
  – Power and energy limited
  – Single-thread performance limited
  – => parallelism
• Many options: heavy multicore, manycore, “SIMD-like” GPUs, low-power designs
  – Intel’s SCC represents manycore
-
• Various trade-offs
  – Performance, power, and energy
  – Programmability
  – Different for each application
• Each architecture has advantages
  – No single best solution
  – (But manycore is a good option)
-
• Benchmark different parallel applications
  – On all architectures
• Analyze the results, learn about each architecture
• Where does each architecture stand?
  – In the various trade-offs
• What could be better for the future?
  – Reasonable in all metrics: power, performance…
  – For most applications
-
• Jacobi: nearest-neighbor communication (see the sketch after this list)
  – A standard floating-point benchmark
• NAMD: molecular dynamics
  – A dynamic scientific application
• NQueens: state-space search
  – Integer intensive
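To make the nearest-neighbor pattern concrete, here is a minimal serial sketch of one Jacobi sweep (hypothetical function and array names, not the benchmark's actual code; the measured versions are MPI/Charm++ codes that partition this grid and exchange only boundary rows with neighbors):

```c
#include <stddef.h>

/* One five-point Jacobi sweep over an (n+2) x (n+2) grid with a halo
 * of boundary cells. Each interior point becomes the average of its
 * four nearest neighbors, which is why a distributed version needs
 * only nearest-neighbor (halo) communication. */
void jacobi_sweep(size_t n, double src[n + 2][n + 2],
                  double dst[n + 2][n + 2])
{
    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= n; j++)
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] +
                                src[i][j - 1] + src[i][j + 1]);
}
```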
-
• CG: Conjugate Gradient, from the NAS Parallel Benchmarks (NPB)
  – A lot of fine-grain communication (see the sketch after this list)
• Integer Sort: represents commercial workloads
  – Integer intensive, heavy communication
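Much of CG's fine-grain communication comes from the global dot products inside each iteration. A minimal MPI sketch of that pattern (hypothetical function name, not NPB's actual code):

```c
#include <mpi.h>

/* Distributed dot product as used inside each CG iteration: every
 * rank computes a local partial sum, then a tiny one-double
 * allreduce combines them. Such frequent, fine-grain collectives
 * are what stress the interconnect. */
double dot_product(const double *x, const double *y, int n_local,
                   MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```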
-
• All codes reasonably optimized, but not extensively!
  – Programmer productivity
• MPI and Charm++
• CUDA and OpenCL codes for the GPU
  – Reasonably optimized, same complexity as the CPU ones
  – A fair apples-to-apples comparison
-
• Multi-core: Intel Core i7
  – 4 cores, 8 threads, 2.8 GHz
• Low-power: Intel Atom D
  – 2 cores, 4 threads, 1.8 GHz
• “SIMD-like” GPU: Nvidia ION2
  – 16 CUDA cores, low power, 475 MHz
• Manycore: Intel’s SCC
  – 48 cores, 800 MHz
-
• “Single-Chip Cloud Computer” (SCC)
  – Just a name; not only for the cloud
  – An experimental chip for research on manycore
• Slower than production chips, but we can analyze it
  – 48 Pentium cores, mesh interconnect
• Not cache coherent; runs message-passing programs
-
• We ported the Charm infrastructure to SCC
  – To run Charm++ and MPI parallel codes
  – Charm++ is message-passing parallel objects
• No conceptual difficulty
  – SCC is similar to a small cluster (see the sketch after this list)
• Just the headaches of an experimental chip
  – Old compiler, OS, and libraries…
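Since SCC programs like a small cluster, the ported codes use ordinary explicit message passing. A minimal MPI sketch of that style (illustrative only; run with at least two ranks):

```c
#include <mpi.h>
#include <stdio.h>

/* Point-to-point exchange: rank 0 sends one value to rank 1. On a
 * non-cache-coherent chip like SCC, all sharing between cores goes
 * through explicit messages like this, exactly as on a cluster. */
int main(int argc, char **argv)
{
    int rank;
    double v = 42.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&v, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&v, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %g\n", v);
    }
    MPI_Finalize();
    return 0;
}
```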
-
Apps can use the parallelism in the architecture.
[Figure: speedup (1 to 5) vs. number of threads (1 to 8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort.]
-
[Figure: power (W) vs. number of threads (1 to 8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort; power grows from 85 W at 1 thread to about 150 W at 8 threads.]
-
The speedup offsets the power increase.
[Figure: normalized energy vs. number of threads (1 to 8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort.]
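Why the speedup offsets the power increase: writing P_n for chip power at n threads, t_n for runtime, and S_n = t_1/t_n for speedup, the energy normalized to the one-thread run is

\[
E_n = \frac{P_n\, t_n}{P_1\, t_1} = \frac{P_n/P_1}{S_n}.
\]

With the Core i7 readings above, P_8/P_1 ≈ 150/85 ≈ 1.76, so any app that speeds up by more than about 1.76x at 8 threads also saves energy. When power is near constant, as on the Atom in the following slides, E_n ≈ 1/S_n.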
-
[Figure: speedup (1 to 3.5) vs. number of threads (1 to 4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort.]
-
[Figure: power (W) vs. number of threads (1 to 4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort; power stays nearly flat, between 29 W and 31 W.]
-
Speedup and near-constant power.
[Figure: normalized energy vs. number of threads (1 to 4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort.]
-
[Figure: bars for speedup, power (W), and relative energy on the ION2 GPU for Jacobi, NAMD, NQueens, and Sort; left axis speedup and power (0 to 40), right axis relative energy (0 to 2).]
-
No change in source code!
[Figure: speedup (0 to 35) vs. number of cores (1 to 48) on SCC for NAMD, Jacobi, NQueens, CG, and Sort; all apps scale except CG (fine-grain collectives).]
-
Different power per application (communication-bound apps stall their processors).
[Figure: power (W) vs. number of cores (1 to 48) on SCC for NAMD, Jacobi, NQueens, CG, and Sort; full-chip power is annotated at about 40 W, 60 W, and 85 W depending on the application.]
-
Application scalability is needed to offset the power increase! (Not the case for CG.)
[Figure: normalized energy (0 to 22) vs. number of cores (1 to 48) on SCC for NAMD, Jacobi, NQueens, CG, and Sort.]
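The same relation explains the SCC curves. Assuming the chip draws roughly 40 W at the low end and up to 85 W at 48 cores, per the previous slide's annotations,

\[
E_{48} = \frac{P_{48}/P_1}{S_{48}} \approx \frac{85/40}{S_{48}} \approx \frac{2.1}{S_{48}},
\]

so a speedup of a little over 2 on 48 cores is enough to break even on energy; CG, which barely scales, is the exception.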
-
[Figure: relative speed (0 to 30) for Jacobi, NAMD, NQueens, CG, and Sort on the Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]
• The GPU is good for some apps: regular, highly parallel ones.
• Multicore is good for most apps, even dynamic, irregular, less parallel ones.
• SCC beats multicore on NQueens, and it is only an experimental chip!
• Low-power is really slow!
-
[Figure: power (W, 0 to 180) for Jacobi, NAMD, NQueens, CG, and Sort on the Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]
• The GPU and the Atom have low consumption.
• Multicore has high consumption: high frequency, fancy features.
• SCC has moderate power: more parallelism, lower frequency.
• Power is nearly the same across different apps.
-
[Figure: relative energy (0 to 5) for Jacobi, NAMD, NQueens, CG, and Sort on the Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]
• Multicore is energy efficient! Its high speed offsets its power.
• The low-power design is not energy efficient! Its low speed offsets the power savings.
• SCC is OK, despite its low speed.
-
• Multicore is still effective for many apps
  – Dynamic, irregular, heavy communication
• The GPU is exceptional for some apps
  – Not for all apps
  – Programmability and portability issues
• The low-power design is just low power
  – Slow! So not energy efficient
-
• SCC: a good option
  – Low power, and can run legacy code fast
• A balanced design point
  – Lower power than multicore, faster than the low-power design
  – None of the GPU’s programmability and portability issues
• Experimental and slow, but it can be improved
-
• Sequential performance, floating point
  – Apps show good scaling on more SCC cores
  – Easy with more CMOS transistors
  – The interconnect also has to improve along with it (still easy)
• A global collectives network
  – Like the BlueGene supercomputers; low cost
  – To support fine-grain global communications
  – As in CG