Memory Intensive Benchmarks: IRAM vs. Cache Based Machines
Parry Husbands (LBNL)
Brian Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak Biswas (NASA Ames)
P. Husbands, IPDPS 2002

Motivation
Observation: Current cache-based supercomputers perform at a small fraction of peak for memory-intensive problems (particularly irregular ones)
– E.g. optimized sparse matrix-vector multiplication runs at ~20% of peak on a 1.5 GHz P4
– Even worse when parallel efficiency is considered
– Overall ~10% across application benchmarks
Is memory bandwidth the problem?
– Performance is directly related to how well the memory system performs
– But the “gap” between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)

Solutions?
Better software
– ATLAS, FFTW, Sparsity, PHiPAC
Power and packaging are important too!
– New buildings and infrastructure needed for many recent/planned installations
Alternative architectures
– One idea: tighter integration of processor and memory
– BlueGene/L (~25 cycles to main memory)
– VIRAM: uses PIM technology in an attempt to take advantage of the large on-chip bandwidth available in DRAM

VIRAM Overview
[Figure: chip floorplan, 14.5 mm × 20.0 mm]
MIPS core (200 MHz)
Main memory system
– 13 MB of on-chip DRAM
– Large on-chip bandwidth: 6.4 GBytes/s peak to vector unit
Vector unit
– Energy-efficient way to express fine-grained parallelism and exploit bandwidth
Typical power consumption: 2.0 W
Peak vector performance
– 1.6/3.2/6.4 Gops
– 1.6 Gflops (single-precision)
Fabrication by IBM
– Tape-out in O(1 month)
Our results use a simulator with Cray’s vcc compiler
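
As a quick sanity check on the figures quoted for VIRAM, the peak numbers can be converted to per-cycle and per-watt terms. This is only arithmetic on the values listed above; dividing peak rates by the typical 2.0 W power figure is an assumption made here for illustration, not a measurement from the talk.

```python
# Figures quoted on the VIRAM overview slide, in integer units.
CLOCK_MHZ = 200                   # MIPS core clock
PEAK_BW_MB_PER_S = 6400           # 6.4 GBytes/s peak to the vector unit
PEAK_MOPS = (1600, 3200, 6400)    # 1.6/3.2/6.4 Gops
TYPICAL_POWER_W = 2.0

# Bytes the memory system can deliver to the vector unit each cycle.
bytes_per_cycle = PEAK_BW_MB_PER_S // CLOCK_MHZ

# Peak vector operations per cycle for each quoted Gops figure.
ops_per_cycle = [mops // CLOCK_MHZ for mops in PEAK_MOPS]

# Peak MOPS divided by typical power (assumed pairing, see the text above).
peak_mops_per_watt = [mops / TYPICAL_POWER_W for mops in PEAK_MOPS]

print(bytes_per_cycle, ops_per_cycle, peak_mops_per_watt)
```

So the quoted peak bandwidth corresponds to 32 bytes per cycle at 200 MHz, and the quoted peak rates to 8, 16, and 32 operations per cycle.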

Our Task
Evaluate use of processor-in-memory (PIM) chips as a building block for high-performance machines
– For now, focus on serial performance
Benchmark VIRAM on scientific computing kernels
– Originally for multimedia applications
– Can we use on-chip DRAM for vector processing vs. the conventional SRAM? (DRAM is denser)
Isolate performance-limiting features of architectures
– More than just memory bandwidth

Benchmarks Considered
Transitive-closure (small & large data set)
NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
– Fetch-and-increment a stream of “random” addresses
Sparse matrix-vector product
– Order 10000, #nonzeros 177820
Computing a histogram
– Different algorithms investigated: 64-elements sorting kernel; privatization; retry
2D unstructured mesh adaptation

| | Transitive | GUPS | SPMV | Histogram | Mesh |
|---|---|---|---|---|---|
| Ops/step | 2 | 1 | 2 | 1 | N/A |
| Mem/step | 2 ld 1 st | 2 ld 2 st | 3 ld | 2 ld 1 st | N/A |

The Results
[Figure: bar chart of MOPS (log scale, 1 to 1000) for Transitive, GUPS, SPMV, Hist, and Mesh on VIRAM 200MHz, R10K 180MHz, P-III 600MHz, P4 1.5GHz, Sparc 333MHz, and EV6 466MHz]
Comparable performance with lower clock rate

Power Efficiency
Large power/performance advantage for VIRAM from
– PIM technology
– Data parallel execution model
[Figure: bar chart of MOPS/Watt (log scale, 0.1 to 1000) for Transitive, GUPS, SPMV, Hist, and Mesh on VIRAM, R10K, P-III, P4, Sparc, and EV6]

Ops/Cycle
[Figure: bar chart of Ops/Cycle (0 to 5) for Transitive, GUPS, SPMV, Hist, and Mesh on IRAM, SGI R10K, Mobile P-III, P4, Sun Ultra-II, and Alpha EV6]

GUPS
1 op, 2 loads, 1 store per step
Mix of indexed and unit stride operations
Address generation is key here (only 4 per cycle on VIRAM)
[Figure: bar chart of MOPS (0 to 350) on VIRAM, R10K, P-III, P4, Sparc, and EV6 for 64-bit, 32-bit, 16-bit, and 8-bit data]
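
To make the per-step counts concrete, here is a minimal scalar sketch of a GUPS-style update loop. It is not the benchmark code used in the talk; the function name `gups_updates` and the tiny table size are hypothetical, chosen only to show where the one operation, two loads, and one store come from.

```python
import random

def gups_updates(table, indices):
    """Fetch-and-increment the table entries named by a stream of indices."""
    for i in range(len(indices)):
        j = indices[i]            # load 1: unit-stride read of the index stream
        table[j] = table[j] + 1   # load 2 (indexed), 1 op (add), 1 store (indexed)
    return table

# Hypothetical small problem: 1000 pseudo-random updates into 16 entries.
rng = random.Random(0)
table = [0] * 16
indices = [rng.randrange(len(table)) for _ in range(1000)]
gups_updates(table, indices)
print(sum(table))
```

The indexed load and the indexed store each need one generated address per element, which is why the rate of address generation matters as much as raw bandwidth for this kernel.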

Histogram
1 op, 2 loads, 1 store per step
Like GUPS, but duplicates restrict available parallelism and make it more difficult to vectorize
Sort method performs best on VIRAM on real data
Competitive when histogram doesn’t fit in cache
[Figure: bar chart of MOPS (0 to 250) for 7-bit, 11-bit, and 15-bit input]
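
The duplicate problem can be illustrated with a small simulation. The sketch below is not the authors' implementation: the vector length `VL = 4` and all function names are hypothetical, and the sort variant sorts the whole input rather than using a 64-element sorting kernel. It shows a naive gather/add/scatter losing updates, and two of the approaches named earlier (privatization and sorting) avoiding that; the retry approach is not shown.

```python
VL = 4  # hypothetical vector length, for illustration only

def hist_naive_vector(keys, nbins):
    """Gather, add 1, scatter one vector at a time: wrong when a vector has duplicates."""
    hist = [0] * nbins
    for start in range(0, len(keys), VL):
        chunk = keys[start:start + VL]
        gathered = [hist[k] for k in chunk]   # indexed load of current counts
        updated = [v + 1 for v in gathered]   # vector add
        for k, v in zip(chunk, updated):      # indexed store: duplicate keys
            hist[k] = v                       # overwrite each other's update
    return hist

def hist_privatized(keys, nbins):
    """One private histogram per lane, so lanes never collide; reduce at the end."""
    private = [[0] * nbins for _ in range(VL)]
    for start in range(0, len(keys), VL):
        for lane, k in enumerate(keys[start:start + VL]):
            private[lane][k] += 1
    return [sum(p[b] for p in private) for b in range(nbins)]

def hist_sorted(keys, nbins):
    """Sort first, then write the length of each run of equal keys once."""
    hist = [0] * nbins
    run_key, run_len = None, 0
    for k in sorted(keys):
        if k != run_key:
            if run_key is not None:
                hist[run_key] = run_len
            run_key, run_len = k, 0
        run_len += 1
    if run_key is not None:
        hist[run_key] = run_len
    return hist

keys = [2, 2, 0, 2, 1, 1, 3, 0]
print(hist_naive_vector(keys, 4), hist_privatized(keys, 4), hist_sorted(keys, 4))
```

Privatization trades memory (one histogram copy per lane) for independence between lanes, while sorting removes duplicates from each scatter at the cost of the sort itself.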

Which Problems are Limited by Bandwidth?
What is the bottleneck in each case?
– Transitive and GUPS are limited by bandwidth (near 6.4GB/s peak)
– SPMV and Mesh are limited by address generation, bank conflicts, and parallelism
– For Histogram, lack of parallelism, not memory bandwidth
[Figure: memory bandwidth in MB/s (0 to 6000) and computation rate in MOPS (0 to 1000) for Transitive, GUPS, SPMV, Hist, and Mesh]
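
One rough way to reason about a bandwidth limit is to bound the computation rate by how many steps the peak bandwidth can feed. The helper below is a back-of-the-envelope sketch, not the analysis from the talk: it assumes every load or store moves exactly one element of a given width and that the 6.4 GB/s peak is sustained. The function name and the element widths in the example calls are assumptions.

```python
PEAK_BW_MB_PER_S = 6400  # 6.4 GB/s peak quoted for VIRAM

def bandwidth_bound_mops(ops_per_step, mem_ops_per_step, bytes_per_element):
    """Upper bound on MOPS when each memory operation moves one element
    and the memory system sustains its peak bandwidth."""
    bytes_per_step = mem_ops_per_step * bytes_per_element
    msteps_per_s = PEAK_BW_MB_PER_S / bytes_per_step  # millions of steps per second
    return msteps_per_s * ops_per_step

# 2 ops and 3 memory operations per step, assuming 4-byte elements.
print(bandwidth_bound_mops(2, 3, 4))
# 1 op and 3 memory operations per step, assuming 8-byte elements.
print(bandwidth_bound_mops(1, 3, 8))
```

A kernel whose measured rate sits close to such a bound is plausibly bandwidth-limited; one that falls well short is being held back by something else, such as address generation or limited parallelism.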

Summary and Future Directions
Performance advantage
– Large on applications limited only by bandwidth
– More address generators/sub-banks would help irregular performance
Performance/Power advantage
– Over both low power and high performance processors
– Both PIM and data parallelism are key
Performance advantage for VIRAM depends on application
– Need fine-grained parallelism to utilize on-chip bandwidth
Future steps
– Validate our work on real chip!
– Extend to multi-PIM systems
– Explore system balance issues: other memory organizations (banks, bandwidth vs. size of memory); # of vector units; network performance vs. on-chip memory

The Competition

| | SPARC IIi | MIPS R10K | P III | P 4 | Alpha EV6 |
|---|---|---|---|---|---|
| Make | Sun Ultra 10 | Origin 2000 | Intel Mobile | Dell | Compaq DS10 |
| Clock | 333MHz | 180MHz | 600MHz | 1.5GHz | 466MHz |
| L1 | 16+16KB | 32+32KB | 32KB | 12+8KB | 64+64KB |
| L2 | 2MB | 1MB | 256KB | 256KB | 2MB |
| Mem | 256MB | 1GB | 128MB | 1GB | 512MB |

Transitive Closure (Floyd-Warshall)
2 ops, 2 loads, 1 store per step
Good for vector processors: abundant, regular parallelism and unit stride
[Figure: bar chart of MOPS (0 to 1000) on VIRAM, R10K, P-III, P4, Sparc, and EV6 for graphs of 100, 200, 300, 400, and 500 vertices]
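
The per-step counts match the inner loop of a Floyd-Warshall style closure. The sketch below assumes a boolean reachability formulation over a 0/1 adjacency matrix; the talk does not say which variant was benchmarked, so treat the function name and data layout as illustrative.

```python
def transitive_closure(adj):
    """Return the reachability matrix of a directed graph given as a 0/1 matrix."""
    n = len(adj)
    reach = [row[:] for row in adj]
    for k in range(n):
        row_k = reach[k]
        for i in range(n):
            row_i = reach[i]
            via_k = row_i[k]        # invariant across the inner j loop
            for j in range(n):      # unit-stride sweep over one row
                # 2 loads (row_i[j], row_k[j]), 2 ops (and, or), 1 store
                row_i[j] = row_i[j] | (via_k & row_k[j])
    return reach

# Chain 0 -> 1 -> 2: vertex 0 should also reach vertex 2.
print(transitive_closure([[0, 1, 0], [0, 0, 1], [0, 0, 0]]))
```

Every iteration of the inner loop is independent and walks two rows contiguously, which is the abundant, regular, unit-stride parallelism that suits a vector unit.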

SPMV
2 ops, 3 loads per step
Mix of indexed and unit stride operations
Good performance for ELLPACK, but only when we have the same number of non-zeros per row
[Figure: bar chart of MFLOPS (0 to 600) on VIRAM, R10K, P-III, P4, Sparc, and EV6 for CRS, CRS Banded, ELLPACK, ELLPACK (eff), and Seg. Sum]
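
A minimal sketch of two of the storage formats named in the chart may help show where the loads come from and why ELLPACK is sensitive to row lengths. These are textbook formulations with hypothetical function names, not the tuned kernels measured in the talk; the segmented-sum and banded variants are not shown.

```python
def spmv_crs(row_ptr, col_idx, vals, x):
    """y = A*x with A in compressed row storage (CRS)."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            # 3 loads: vals[k] and col_idx[k] (unit stride), x[col_idx[k]] (indexed)
            # 2 ops: multiply and add
            acc += vals[k] * x[col_idx[k]]
        y[i] = acc
    return y

def crs_to_ellpack(row_ptr, col_idx, vals):
    """Pad every row to the longest row; slot s holds each row's s-th entry."""
    n = len(row_ptr) - 1
    width = max(row_ptr[i + 1] - row_ptr[i] for i in range(n))
    ell_cols = [[0] * n for _ in range(width)]
    ell_vals = [[0.0] * n for _ in range(width)]
    for i in range(n):
        for slot, k in enumerate(range(row_ptr[i], row_ptr[i + 1])):
            ell_cols[slot][i] = col_idx[k]
            ell_vals[slot][i] = vals[k]
    return ell_cols, ell_vals

def spmv_ellpack(ell_cols, ell_vals, x, n_rows):
    """y = A*x with A in ELLPACK form; the inner loop runs over all rows."""
    y = [0.0] * n_rows
    for cols, vs in zip(ell_cols, ell_vals):
        for i in range(n_rows):     # long, regular loop; padding adds 0.0 * x[0]
            y[i] += vs[i] * x[cols[i]]
    return y

# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 5, 6]] times a vector of ones.
row_ptr, col_idx = [0, 2, 3, 6], [0, 2, 1, 0, 1, 2]
vals, x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0], [1.0, 1.0, 1.0]
ell_cols, ell_vals = crs_to_ellpack(row_ptr, col_idx, vals)
print(spmv_crs(row_ptr, col_idx, vals, x), spmv_ellpack(ell_cols, ell_vals, x, 3))
```

In this example ELLPACK stores 9 slots for 6 non-zeros, so a third of its multiply-adds are spent on padding. With equal row lengths there is no padding, which is consistent with the observation that ELLPACK performs well only in that case.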

Mesh Adaptation
Single level of refinement of mesh with 4802 triangular elements, 2500 vertices, and 7301 edges
Extensive reorganization required to take advantage of vectorization
Many indexed memory operations (limited again by address generation)
[Figure: bar chart of MFLOPS (0 to 600) on VIRAM, R10K, P-III, P4, Sparc, and EV6]