
Page 1

Ehsan Totoni, Babak Behzad, Swapnil Ghike, Josep Torrellas
Illinois-Intel Parallelism Center

Page 2

• Increasing number of transistors on chip
  – Power and energy limited
  – Single-thread performance limited
  – => parallelism

• Many options: heavy multicore, manycore, "SIMD-like" GPUs, low-power designs
  – Intel's SCC represents manycore

Page 3

• Various trade-offs
  – Performance, power, and energy
  – Programmability
  – Different for each application

• Each architecture has advantages
  – No single best solution
  – (But manycore is a good option)

Page 4

• Benchmark different parallel applications
  – On all architectures

• Analyze results, learn about the architectures
• Where does each architecture stand?
  – In the various trade-offs

• What would probably be better for the future?
  – Reasonable in all metrics: power, performance, …
  – For most applications

Page 5

• Jacobi: nearest-neighbor communication (see the sketch after this list)
  – Standard floating-point benchmark

• NAMD: molecular dynamics
  – Dynamic scientific application

• NQueens: state-space search
  – Integer intensive
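
A minimal sketch of the Jacobi pattern referenced above, assuming a 1-D MPI decomposition with ghost cells; the grid size, iteration count, and variable names are illustrative and are not taken from the benchmark codes used in this study.

    // Illustrative 1-D Jacobi sweep with nearest-neighbor halo exchange (MPI).
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv) {
      MPI_Init(&argc, &argv);
      int rank, nprocs;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

      const int n = 1024;                                     // local cells per rank (made up)
      std::vector<double> cur(n + 2, 0.0), next(n + 2, 0.0);  // +2 ghost cells
      int up   = (rank == 0)          ? MPI_PROC_NULL : rank - 1;
      int down = (rank == nprocs - 1) ? MPI_PROC_NULL : rank + 1;

      for (int iter = 0; iter < 100; ++iter) {
        // Nearest-neighbor communication: exchange boundary cells with the two neighbors.
        MPI_Sendrecv(&cur[1], 1, MPI_DOUBLE, up, 0,
                     &cur[n + 1], 1, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cur[n], 1, MPI_DOUBLE, down, 1,
                     &cur[0], 1, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // Floating-point stencil update over the interior cells.
        for (int i = 1; i <= n; ++i)
          next[i] = 0.5 * (cur[i - 1] + cur[i + 1]);
        cur.swap(next);
      }
      MPI_Finalize();
      return 0;
    }

The per-iteration communication is just the two boundary exchanges, which is what "nearest-neighbor communication" refers to above.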

Page 6

• CG: Conjugate Gradient, from the NAS Parallel Benchmarks (NPB)
  – A lot of fine-grain communication (see the sketch after this list)

• Integer Sort: represents commercial workloads
  – Integer intensive, heavy communication
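
A hedged sketch of why CG generates a lot of fine-grain communication: each iteration computes global dot products, which in a message-passing formulation become tiny allreduces (one double per call). The function below is schematic and is not the NPB CG implementation.

    // Schematic global dot product as used inside a CG iteration (MPI).
    // Vector names and their distribution are placeholders, not NPB code.
    #include <mpi.h>
    #include <vector>

    double global_dot(const std::vector<double>& a, const std::vector<double>& b) {
      double local = 0.0, global = 0.0;
      for (std::size_t i = 0; i < a.size(); ++i)
        local += a[i] * b[i];
      // Fine-grain collective: every rank contributes a single double,
      // a few times per CG iteration (step size, residual norm, ...).
      MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
      return global;
    }

Because each call moves almost no data, the latency of the collective dominates; this is the traffic that the global collectives network proposed at the end of the talk would help.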

Page 7

• All codes reasonably optimized, not extensively!
  – Programmer productivity

• MPI and Charm++
• CUDA and OpenCL codes for the GPU
  – Reasonably optimized, same complexity as the CPU codes

  – A fair apples-to-apples comparison

Page 8

• Multicore: Intel Core i7
  – 4 cores, 8 threads, 2.8 GHz

• Low-power: Intel Atom D
  – 2 cores, 4 threads, 1.8 GHz

• "SIMD-like" GPU: Nvidia ION2
  – 16 CUDA cores, low power, 475 MHz

• Manycore: Intel's SCC
  – 48 cores, 800 MHz

Page 9

• "Single-Chip Cloud Computer" (SCC)
  – Just a name, not only for the cloud
  – Experimental chip for research on manycore

• Slower than production chips, but we can still analyze it
  – 48 Pentium cores, mesh interconnect

• Not cache coherent; runs message-passing programs

Page 10

• We ported the Charm infrastructure to the SCC
  – To run Charm++ and MPI parallel codes
  – Charm++ is message-passing parallel objects (minimal sketch after this list)

• No conceptual difficulty
  – The SCC is similar to a small cluster

• Just the headaches of an experimental chip
  – Old compiler, OS, and libraries…
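
For context, a minimal sketch of the Charm++ model (parallel objects that communicate only by sending messages to entry methods). The module name and output are made up for illustration; this is not one of the ported benchmarks.

    // hello.ci (interface file, illustrative):
    //   mainmodule hello {
    //     mainchare Main {
    //       entry Main(CkArgMsg* m);
    //     };
    //   };

    // hello.C -- a chare is a C++ object whose work is triggered by incoming
    // messages, so no shared memory is assumed; that is why the same runtime
    // maps naturally onto the non-cache-coherent SCC.
    #include "hello.decl.h"

    class Main : public CBase_Main {
     public:
      Main(CkArgMsg* m) {
        CkPrintf("Running on %d processing elements\n", CkNumPes());
        CkExit();  // end the parallel run
      }
    };

    #include "hello.def.h"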

Page 11

Apps can use the parallelism in the architecture

[Figure: speedup vs. number of threads (1-8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort.]

Page 12

[Figure: power (W) vs. number of threads (1-8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort; power is annotated between about 85 W and 150 W.]

Page 13

Speedup offsets the power increase (a short derivation follows the figure note)

[Figure: normalized energy vs. number of threads (1-8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort.]
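
The claim can be checked with a short derivation (notation added here, not from the slides). With n threads, energy is power times execution time, so

    E_n = P_n \, T_n
    \quad\Longrightarrow\quad
    \frac{E_n}{E_1} = \frac{P_n}{P_1}\cdot\frac{T_n}{T_1} = \frac{P_n/P_1}{S_n},
    \qquad S_n = \frac{T_1}{T_n}.

Normalized energy drops below 1 exactly when the speedup exceeds the relative power increase; on the Core i7, power grows by roughly 150/85 ≈ 1.8x across 1-8 threads, so an app that speeds up by more than about 1.8x ends up using less energy.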

Page 14

[Figure: speedup vs. number of threads (1-4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort.]

Page 15

[Figure: power (W) vs. number of threads (1-4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort; power stays between about 29 W and 31 W.]

Page 16

Speedup and near-constant power

[Figure: normalized energy vs. number of threads (1-4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort.]

Page 17

[Figure: speedup, power (W), and relative energy on the ION2 GPU for Jacobi, NAMD, NQueens, and Sort.]

Page 18

[Figure: speedup vs. number of cores (up to 48) on the SCC for NAMD, Jacobi, NQueens, CG, and Sort.]

Apps scale well, except CG (fine-grain collectives)
No change in the source code!

Page 19

Different power across apps (communication-bound apps stall the processors)

[Figure: power (W) vs. number of cores (up to 48) on the SCC for NAMD, Jacobi, NQueens, CG, and Sort; annotations mark about 40 W, 60 W, and 85 W.]

Page 20

App scalability is needed to offset the power increase! (not the case for CG)

[Figure: normalized energy vs. number of cores (up to 48) on the SCC for NAMD, Jacobi, NQueens, CG, and Sort.]

Page 21

[Figure: relative speed on Jacobi, NAMD, NQueens, CG, and Sort for Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]

GPU is good for some apps: regular, highly parallel ones
Multicore is good for most apps, even dynamic, irregular, less parallel ones
SCC beats the multicore on NQueens, despite being an experimental chip!
Low-power is really slow!

Page 22

[Figure: power (W) on Jacobi, NAMD, NQueens, CG, and Sort for Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]

GPU and Atom have low consumption
Multicore has high consumption: high frequency, fancy features
SCC has moderate power: more parallelism, lower frequency
Power is nearly the same for the different apps

Page 23

[Figure: relative energy on Jacobi, NAMD, NQueens, CG, and Sort for Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]

Multicore is energy efficient! High speed offsets its power
The low-power design is not energy efficient! Low speed offsets its power savings
SCC is OK, despite its low speed

Page 24

• Multicore is still effective for many apps
  – Dynamic, irregular, heavy communication

• GPU is exceptional for some apps
  – Not for all apps
  – Programmability and portability issues

• The low-power design is just low power
  – Slow! So not energy efficient

Page 25

• SCC: a good option
  – Low power, and can run legacy code fast

• A balanced design point
  – Lower power than the multicore, faster than the low-power design

  – None of the programmability and portability issues of the GPU

• Experimental and slow, but it can be improved

Page 26

• Sequential performance, floating point
  – Apps show good scaling on more SCC cores
  – Easy with more CMOS transistors
  – The interconnect also has to improve with it (still easy)

• A global collectives network
  – Like the BlueGene supercomputers, low cost
  – To support fine-grain global communication

    – As in CG
