TRANSCRIPT
-
Ehsan Totoni, Babak Behzad, Swapnil Ghike, Josep Torrellas
-
• Increasing number of transistors on chip
  – Power and energy limited
  – Single-thread performance limited
  – => parallelism
• Many options: heavy multicore, manycore, “SIMD-like” GPUs, low-power designs
  – Intel’s SCC represents manycore
-
• Various trade-offs
  – Performance, power, and energy
  – Programmability
  – Different for each application
• Each architecture has advantages
  – No single best solution
  – (But manycore is a good option)
-
• Benchmark different parallel applications
  – On all architectures
• Analyze the results, learn about each architecture
• Where does each architecture stand?
  – In the various trade-offs
• What could be better for the future?
  – Reasonable in all metrics: power, performance…
  – For most applications
-
• Jacobi: nearest-neighbor communication (see the sketch after this list)
  – A standard floating-point benchmark
• NAMD: molecular dynamics
  – A dynamic scientific application
• NQueens: state-space search
  – Integer intensive
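To make the nearest-neighbor pattern concrete, here is a minimal serial sketch of one Jacobi sweep (hypothetical function and array names, not the benchmark's actual code; the measured versions are MPI/Charm++ codes that partition this grid and exchange only boundary rows with neighbors):

```c
#include <stddef.h>

/* One five-point Jacobi sweep over an (n+2) x (n+2) grid with a halo
 * of boundary cells. Each interior point becomes the average of its
 * four nearest neighbors, which is why a distributed version needs
 * only nearest-neighbor (halo) communication. */
void jacobi_sweep(size_t n, double src[n + 2][n + 2],
                  double dst[n + 2][n + 2])
{
    for (size_t i = 1; i <= n; i++)
        for (size_t j = 1; j <= n; j++)
            dst[i][j] = 0.25 * (src[i - 1][j] + src[i + 1][j] +
                                src[i][j - 1] + src[i][j + 1]);
}
```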
-
• CG: Conjugate Gradient, from the NAS Parallel Benchmarks (NPB)
  – A lot of fine-grain communication (see the sketch after this list)
• Integer Sort: represents commercial workloads
  – Integer intensive, heavy communication
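Much of CG's fine-grain communication comes from the global dot products inside each iteration. A minimal MPI sketch of that pattern (hypothetical function name, not NPB's actual code):

```c
#include <mpi.h>

/* Distributed dot product as used inside each CG iteration: every
 * rank computes a local partial sum, then a tiny one-double
 * allreduce combines them. Such frequent, fine-grain collectives
 * are what stress the interconnect. */
double dot_product(const double *x, const double *y, int n_local,
                   MPI_Comm comm)
{
    double local = 0.0, global = 0.0;
    for (int i = 0; i < n_local; i++)
        local += x[i] * y[i];
    MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, comm);
    return global;
}
```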
-
• All codes reasonably optimized, but not extensively!
  – Programmer productivity
• MPI and Charm++
• CUDA and OpenCL codes for the GPU
  – Reasonably optimized, same complexity as the CPU ones
  – A fair apples-to-apples comparison
-
• Multi-core: Intel Core i7
  – 4 cores, 8 threads, 2.8 GHz
• Low-power: Intel Atom D
  – 2 cores, 4 threads, 1.8 GHz
• “SIMD-like” GPU: Nvidia ION2
  – 16 CUDA cores, low power, 475 MHz
• Manycore: Intel’s SCC
  – 48 cores, 800 MHz
-
• “Single-Chip Cloud Computer” (SCC)
  – Just a name; not only for the cloud
  – An experimental chip for research on manycore
• Slower than production chips, but we can analyze it
  – 48 Pentium cores, mesh interconnect
• Not cache coherent; runs message-passing programs
-
• We ported the Charm infrastructure to SCC
  – To run Charm++ and MPI parallel codes
  – Charm++ is message-passing parallel objects
• No conceptual difficulty
  – SCC is similar to a small cluster (see the sketch after this list)
• Just the headaches of an experimental chip
  – Old compiler, OS, and libraries…
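Since SCC programs like a small cluster, the ported codes use ordinary explicit message passing. A minimal MPI sketch of that style (illustrative only; run with at least two ranks):

```c
#include <mpi.h>
#include <stdio.h>

/* Point-to-point exchange: rank 0 sends one value to rank 1. On a
 * non-cache-coherent chip like SCC, all sharing between cores goes
 * through explicit messages like this, exactly as on a cluster. */
int main(int argc, char **argv)
{
    int rank;
    double v = 42.0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    if (rank == 0)
        MPI_Send(&v, 1, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
    else if (rank == 1) {
        MPI_Recv(&v, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("rank 1 received %g\n", v);
    }
    MPI_Finalize();
    return 0;
}
```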
-
Apps can use the parallelism in the architecture.
[Figure: speedup (1 to 5) vs. number of threads (1 to 8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort.]
-
[Figure: power (W) vs. number of threads (1 to 8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort; power grows from 85 W at 1 thread to about 150 W at 8 threads.]
-
The speedup offsets the power increase.
[Figure: normalized energy vs. number of threads (1 to 8) on the Core i7 for NAMD, Jacobi, NQueens, CG, and Sort.]
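Why the speedup offsets the power increase: writing P_n for chip power at n threads, t_n for runtime, and S_n = t_1/t_n for speedup, the energy normalized to the one-thread run is

\[
E_n = \frac{P_n\, t_n}{P_1\, t_1} = \frac{P_n/P_1}{S_n}.
\]

With the Core i7 readings above, P_8/P_1 ≈ 150/85 ≈ 1.76, so any app that speeds up by more than about 1.76x at 8 threads also saves energy. When power is near constant, as on the Atom in the following slides, E_n ≈ 1/S_n.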
-
[Figure: speedup (1 to 3.5) vs. number of threads (1 to 4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort.]
-
[Figure: power (W) vs. number of threads (1 to 4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort; power stays nearly flat, between 29 W and 31 W.]
-
Speedup and near-constant power.
[Figure: normalized energy vs. number of threads (1 to 4) on the Atom for NAMD, Jacobi, NQueens, CG, and Sort.]
-
[Figure: bars for speedup, power (W), and relative energy on the ION2 GPU for Jacobi, NAMD, NQueens, and Sort; left axis speedup and power (0 to 40), right axis relative energy (0 to 2).]
-
No change in source code!
[Figure: speedup (0 to 35) vs. number of cores (1 to 48) on SCC for NAMD, Jacobi, NQueens, CG, and Sort; all apps scale except CG (fine-grain collectives).]
-
Different power per application (communication-bound apps stall their processors).
[Figure: power (W) vs. number of cores (1 to 48) on SCC for NAMD, Jacobi, NQueens, CG, and Sort; full-chip power is annotated at about 40 W, 60 W, and 85 W depending on the application.]
-
Application scalability is needed to offset the power increase! (Not the case for CG.)
[Figure: normalized energy (0 to 22) vs. number of cores (1 to 48) on SCC for NAMD, Jacobi, NQueens, CG, and Sort.]
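The same relation explains the SCC curves. Assuming the chip draws roughly 40 W at the low end and up to 85 W at 48 cores, per the previous slide's annotations,

\[
E_{48} = \frac{P_{48}/P_1}{S_{48}} \approx \frac{85/40}{S_{48}} \approx \frac{2.1}{S_{48}},
\]

so a speedup of a little over 2 on 48 cores is enough to break even on energy; CG, which barely scales, is the exception.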
-
[Figure: relative speed (0 to 30) for Jacobi, NAMD, NQueens, CG, and Sort on the Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]
• The GPU is good for some apps: regular, highly parallel ones.
• Multicore is good for most apps, even dynamic, irregular, less parallel ones.
• SCC beats multicore on NQueens, and it is only an experimental chip!
• Low-power is really slow!
-
[Figure: power (W, 0 to 180) for Jacobi, NAMD, NQueens, CG, and Sort on the Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]
• The GPU and the Atom have low consumption.
• Multicore has high consumption: high frequency, fancy features.
• SCC has moderate power: more parallelism, lower frequency.
• Power is nearly the same across different apps.
-
[Figure: relative energy (0 to 5) for Jacobi, NAMD, NQueens, CG, and Sort on the Atom (4 threads), Core i7 (8 threads), SCC (48 threads), and ION2 (16 threads).]
• Multicore is energy efficient! Its high speed offsets its power.
• The low-power design is not energy efficient! Its low speed offsets the power savings.
• SCC is OK, despite its low speed.
-
• Multicore is still effective for many apps
  – Dynamic, irregular, heavy communication
• The GPU is exceptional for some apps
  – Not for all apps
  – Programmability and portability issues
• The low-power design is just low power
  – Slow! So not energy efficient
-
• SCC: a good option
  – Low power, and can run legacy code fast
• A balanced design point
  – Lower power than multicore, faster than the low-power design
  – None of the GPU’s programmability and portability issues
• Experimental and slow, but it can be improved
-
• Sequential performance, floating point
  – Apps show good scaling on more SCC cores
  – Easy with more CMOS transistors
  – The interconnect also has to improve along with it (still easy)
• A global collectives network
  – Like the BlueGene supercomputers; low cost
  – To support fine-grain global communications
  – As in CG