
Memory Intensive Benchmarks: IRAM vs. Cache Based Machines

Parry Husbands (LBNL)

Brian Gaeke, Xiaoye Li, Leonid Oliker, Katherine Yelick (UCB/LBNL), Rupak Biswas (NASA Ames)

P. Husbands, IPDPS 2002

Motivation

Observation: Current cache-based supercomputers perform at a small fraction of peak for memory intensive problems (particularly irregular ones)
  – E.g. optimized sparse matrix-vector multiplication runs at ~20% of peak on a 1.5GHz P4
  – Even worse when parallel efficiency is considered
  – Overall ~10% across application benchmarks

Is memory bandwidth the problem?
  – Performance is directly related to how well the memory system performs
  – But the "gap" between processor performance and DRAM access times continues to grow (60%/yr vs. 7%/yr)


Solutions?

Better software
  – ATLAS, FFTW, Sparsity, PHiPAC

Power and packaging are important too!
  – New buildings and infrastructure needed for many recent/planned installations

Alternative architectures
  – One idea: tighter integration of processor and memory
      – BlueGene/L (~25 cycles to main memory)
      – VIRAM: uses PIM technology in an attempt to take advantage of the large on-chip bandwidth available in DRAM


VIRAM Overview

[Figure: VIRAM die, 14.5 mm x 20.0 mm]

MIPS core (200 MHz)

Main memory system
  – 13 MB of on-chip DRAM
  – Large on-chip bandwidth: 6.4 GBytes/s peak to vector unit

Vector unit
  – Energy efficient way to express fine-grained parallelism and exploit bandwidth

Typical power consumption: 2.0 W

Peak vector performance
  – 1.6/3.2/6.4 Gops
  – 1.6 Gflops (single-precision)

Fabrication by IBM
  – Tape-out in O(1 month)

Our results use a simulator with Cray's vcc compiler
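The peak figures on this slide can be restated per clock cycle, which makes the balance between arithmetic and memory easier to see. The sketch below is only arithmetic on the quoted numbers; the mapping of 1.6/3.2/6.4 Gops to 64/32/16-bit elements is an assumption, since the slide does not state it.

```python
# Per-cycle view of the VIRAM peak numbers quoted on the slide.
CLOCK_MHZ = 200          # vector unit clock (from the slide)
PEAK_BW_MB_S = 6400      # 6.4 GBytes/s peak to the vector unit (from the slide)

# Element width in bits -> peak Mops. The width mapping is an ASSUMPTION.
PEAK_MOPS = {64: 1600, 32: 3200, 16: 6400}

# MB/s divided by MHz gives bytes per cycle; Mops divided by MHz gives ops per cycle.
bytes_per_cycle = PEAK_BW_MB_S / CLOCK_MHZ
ops_per_cycle = {bits: mops / CLOCK_MHZ for bits, mops in PEAK_MOPS.items()}

# Number of 64-bit words the memory system can deliver per cycle at peak.
words_per_cycle_64 = bytes_per_cycle / 8

if __name__ == "__main__":
    print("bytes/cycle:", bytes_per_cycle)
    print("ops/cycle by element width:", ops_per_cycle)
    print("64-bit words/cycle:", words_per_cycle_64)
```

Under these assumptions the memory system delivers fewer 64-bit words per cycle than the vector unit can consume at peak, so a kernel that performs only one operation per word loaded cannot reach peak arithmetic rate.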


Our Task

Evaluate use of processor-in-memory (PIM) chips as a building block for high performance machines
  – For now, focus on serial performance

Benchmark VIRAM on scientific computing kernels
  – VIRAM was originally designed for multimedia applications

Can we use on-chip DRAM for vector processing instead of the conventional SRAM? (DRAM is denser)

Isolate performance limiting features of architectures
  – More than just memory bandwidth


Benchmarks Considered

Transitive-closure (small & large data set)

NSA Giga-Updates Per Second (GUPS, 16-bit & 64-bit)
  – Fetch-and-increment a stream of "random" addresses

Sparse matrix-vector product
  – Order 10000, #nonzeros 177820

Computing a histogram
  – Different algorithms investigated: 64-element sorting kernel; privatization; retry

2D unstructured mesh adaptation

            Transitive   GUPS         SPMV   Histogram    Mesh
Ops/step    2            1            2      1            N/A
Mem/step    2 ld 1 st    2 ld 2 st    3 ld   2 ld 1 st    N/A


The Results

[Figure: MOPS (log scale, 1 to 1000) on Transitive, GUPS, SPMV, Hist, and Mesh for VIRAM 200MHz, R10K 180MHz, P-III 600MHz, P4 1.5GHz, Sparc 333MHz, and EV6 466MHz]

Comparable performance with lower clock rate


Power Efficiency

Large power/performance advantage for VIRAM from
  – PIM technology
  – Data parallel execution model

[Figure: MOPS/Watt (log scale, 0.1 to 1000) on Transitive, GUPS, SPMV, Hist, and Mesh for VIRAM, R10K, P-III, P4, Sparc, and EV6]


Ops/Cycle

[Figure: Ops/Cycle (axis 0 to 5) for IRAM, SGI R10K, Mobile P-III, P4, Sun Ultra-II, and Alpha EV6]
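Ops/Cycle is simply the measured operation rate normalized by clock frequency, which removes the clock-rate advantage of the faster machines. A minimal sketch of that normalization follows; the clock rates are the ones listed in the deck, while the function name and the MOPS values passed to it are hypothetical placeholders, not measured results.

```python
# Clock rates in MHz as listed in the deck.
CLOCK_MHZ = {
    "VIRAM": 200,
    "R10K": 180,
    "P-III": 600,
    "P4": 1500,
    "Sparc": 333,
    "EV6": 466,
}

def ops_per_cycle(machine, mops):
    """Normalize an operation rate (millions of ops/s) by the clock in MHz."""
    return mops / CLOCK_MHZ[machine]

def clock_ratio(machine, reference="VIRAM"):
    """How many times faster `machine` is clocked than `reference`."""
    return CLOCK_MHZ[machine] / CLOCK_MHZ[reference]

if __name__ == "__main__":
    # Placeholder input: the same MOPS on two machines gives very different Ops/Cycle.
    for name in ("VIRAM", "P4"):
        print(name, ops_per_cycle(name, 300.0))
```

For equal MOPS, the ratio of Ops/Cycle between two machines is the inverse of their clock ratio, which is why a 200 MHz part can look much stronger on this metric than on raw MOPS.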


GUPS

1 op, 2 loads, 1 store per step

Mix of indexed and unit stride operations

Address generation key here (only 4 per cycle on VIRAM)

[Figure: GUPS MOPS (axis 0 to 350) on VIRAM, R10K, P-III, P4, Sparc, and EV6 for 64-bit, 32-bit, 16-bit, and 8-bit data]
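The GUPS step described on this slide (fetch-and-increment at a stream of "random" addresses) can be sketched in a few lines. This is an illustrative scalar version, not the benchmark code used for the measurements; the function names, the seed, and the use of Python's `random` module to build the address stream are assumptions.

```python
import random

def make_indices(n_updates, table_size, seed=0):
    """Build a reproducible stream of pseudo-random table addresses."""
    rng = random.Random(seed)
    return [rng.randrange(table_size) for _ in range(n_updates)]

def gups(table, indices):
    """Apply one fetch-and-increment per address in the stream.

    Per step: a unit-stride load of the next address, an indexed load of
    the table entry, one add, and an indexed store back to the table.
    """
    for idx in indices:
        table[idx] += 1
    return table

if __name__ == "__main__":
    table = [0] * 1024
    gups(table, make_indices(10_000, len(table)))
    print("total updates recorded:", sum(table))
```

The indexed load and store are what make address generation the limiting resource: each step needs two generated addresses into the table, and nothing about the stream lets the hardware predict them.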


Histogram

1 op, 2 loads, 1 store per step

Like GUPS, but duplicates restrict available parallelism and make it more difficult to vectorize

Sort method performs best on VIRAM on real data

Competitive when histogram doesn't fit in cache

[Figure: Histogram MOPS (axis 0 to 250) for 7-bit, 11-bit, and 15-bit input]
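Two of the strategies named in the deck for handling duplicate bins, privatization and sorting, can be sketched as follows. These are generic scalar illustrations of the ideas, not the VIRAM implementations; `n_lanes` and `chunk` are hypothetical parameters, and the sorted variant is only one plausible reading of the "64-element sorting kernel".

```python
def histogram_privatized(data, n_bins, n_lanes=4):
    """Each lane updates its own private histogram, so two elements
    processed together can never collide on a bin; the private copies
    are summed at the end."""
    private = [[0] * n_bins for _ in range(n_lanes)]
    for i, value in enumerate(data):
        private[i % n_lanes][value] += 1
    return [sum(p[b] for p in private) for b in range(n_bins)]

def histogram_sorted(data, n_bins, chunk=64):
    """Sort each chunk so duplicates become adjacent, then add each run
    length to its bin with a single update."""
    hist = [0] * n_bins
    for start in range(0, len(data), chunk):
        block = sorted(data[start:start + chunk])
        i = 0
        while i < len(block):
            j = i
            while j < len(block) and block[j] == block[i]:
                j += 1
            hist[block[i]] += j - i   # one update per distinct value
            i = j
    return hist

if __name__ == "__main__":
    sample = [1, 1, 3, 0, 3, 3, 2]
    print(histogram_privatized(sample, 4))
    print(histogram_sorted(sample, 4, chunk=3))
```

Privatization trades memory (one histogram per lane) for conflict-free updates, while the sorted variant spends extra work per chunk to guarantee that each bin is touched at most once per chunk.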


Which Problems are Limited by Bandwidth?

What is the bottleneck in each case?
  – Transitive and GUPS are limited by bandwidth (near 6.4GB/s peak)
  – SPMV and Mesh are limited by address generation, bank conflicts, and parallelism
  – Histogram is limited by lack of parallelism, not memory bandwidth

[Figure: memory bandwidth (MB/s, axis 0 to 6000) and computation rate (MOPS, axis 0 to 1000)]
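The relation between computation rate and memory traffic can be made concrete with the per-step counts from the "Benchmarks Considered" table. The sketch below converts a MOPS figure into implied memory bandwidth and, in the other direction, gives the MOPS ceiling set by the 6.4 GB/s peak. The element size in bytes and any MOPS value passed in are assumptions of the example, not data from the slides.

```python
# (ops, loads, stores) per step, taken from the "Benchmarks Considered" table.
# The GUPS counts follow that table; the GUPS slide itself lists 2 loads, 1 store.
KERNELS = {
    "Transitive": (2, 2, 1),
    "GUPS": (1, 2, 2),
    "SPMV": (2, 3, 0),
    "Histogram": (1, 2, 1),
}

PEAK_MB_S = 6400  # 6.4 GB/s peak on-chip bandwidth quoted for VIRAM

def implied_bandwidth_mb_s(kernel, mops, bytes_per_word):
    """Memory traffic (MB/s) implied by running `kernel` at `mops`."""
    ops, loads, stores = KERNELS[kernel]
    msteps = mops / ops                      # millions of steps per second
    return msteps * (loads + stores) * bytes_per_word

def max_mops_at_peak(kernel, bytes_per_word, peak_mb_s=PEAK_MB_S):
    """Upper bound on MOPS if memory bandwidth were the only limit."""
    ops, loads, stores = KERNELS[kernel]
    return peak_mb_s / ((loads + stores) * bytes_per_word) * ops

if __name__ == "__main__":
    # Hypothetical input: 400 MOPS with 8-byte elements.
    print(implied_bandwidth_mb_s("Transitive", 400, 8))
    print(max_mops_at_peak("GUPS", 8))
```

A kernel whose implied bandwidth sits close to the peak is bandwidth-bound; one that falls well short of it, as the slide reports for SPMV, Mesh, and Histogram, is being held back by something else.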


Summary and Future Directions

Performance advantage
  – Large on applications limited only by bandwidth
  – More address generators/sub-banks would help irregular performance

Performance/Power advantage
  – Over both low power and high performance processors
  – Both PIM and data parallelism are key

Performance advantage for VIRAM depends on application
  – Need fine-grained parallelism to utilize on-chip bandwidth

Future steps
  – Validate our work on real chip!
  – Extend to multi-PIM systems
  – Explore system balance issues
      – Other memory organizations (banks, bandwidth vs. size of memory)
      – # of vector units
      – Network performance vs. on-chip memory


The Competition

         SPARC IIi       MIPS R10K      P III           P4        Alpha EV6
Make     Sun Ultra 10    Origin 2000    Intel Mobile    Dell      Compaq DS10
Clock    333MHz          180MHz         600MHz          1.5GHz    466MHz
L1       16+16KB         32+32KB        32KB            12+8KB    64+64KB
L2       2MB             1MB            256KB           256KB     2MB
Mem      256MB           1GB            128MB           1GB       512MB


Transitive Closure (Floyd-Warshall)

2 ops, 2 loads, 1 store per step

Good for vector processors:
  – Abundant, regular parallelism and unit stride

[Figure: MOPS (axis 0 to 1000) for graphs of 100, 200, 300, 400, and 500 vertices]
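The access pattern behind the "2 ops, 2 loads, 1 store per step" count can be seen in a plain Floyd-Warshall loop nest. The sketch below uses the boolean (reachability) form with AND and OR as the two operations; whether the benchmark used this form or a min-plus shortest-path variant is not stated on the slide, so treat the choice of operators as an assumption. Both forms have the same memory behavior.

```python
def transitive_closure(adj):
    """Floyd-Warshall transitive closure on a 0/1 adjacency matrix.

    Inner-loop step: load reach[i][j] and reach[k][j], do one AND and
    one OR, and store reach[i][j]. reach[i][k] is loop-invariant in the
    inner loop and can stay in a scalar register.
    """
    n = len(adj)
    reach = [row[:] for row in adj]          # do not modify the input
    for k in range(n):
        row_k = reach[k]
        for i in range(n):
            r_ik = reach[i][k]
            row_i = reach[i]
            for j in range(n):               # unit-stride, no dependences across j
                row_i[j] = row_i[j] | (r_ik & row_k[j])
    return reach

if __name__ == "__main__":
    graph = [
        [0, 1, 0],
        [0, 0, 1],
        [0, 0, 0],
    ]
    for row in transitive_closure(graph):
        print(row)
```

The inner loop walks two rows with unit stride and has no dependence between iterations, which is what makes the kernel a natural fit for long vector operations.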


SPMV

2 ops, 3 loads per step

Mix of indexed and unit stride operations

Good performance for ELLPACK, but only when we have the same number of non-zeros per row

[Figure: MFLOPS (axis 0 to 600) for CRS, CRS Banded, ELLPACK, ELLPACK (eff), and Seg. Sum]
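The difference between the CRS and ELLPACK results comes from which loop is vectorized. A minimal sketch of both formats is given below as plain scalar code; the function names and the padding convention (value 0.0, column 0) are assumptions of this example rather than details from the slides.

```python
def spmv_crs(values, col_idx, row_ptr, x):
    """y = A*x with A in compressed row storage.

    Per step: load a value and its column index (unit stride), load
    x[col] (indexed), then one multiply and one add. The vectorizable
    loop is over the nonzeros of a single row, so it is short.
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for p in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[p] * x[col_idx[p]]
        y.append(acc)
    return y

def spmv_ellpack(ell_values, ell_cols, x):
    """y = A*x with A in ELLPACK form.

    ell_values[k][r] is the k-th stored entry of row r; rows with fewer
    entries are padded with value 0.0 and column 0. The vectorizable
    loop runs over all rows, so it is long, but padded entries are
    wasted work when row lengths differ.
    """
    n_rows = len(ell_values[0])
    y = [0.0] * n_rows
    for vals_k, cols_k in zip(ell_values, ell_cols):
        for r in range(n_rows):
            y[r] += vals_k[r] * x[cols_k[r]]
    return y

if __name__ == "__main__":
    # A = [[1, 0, 2], [0, 3, 0], [4, 0, 5]]
    x = [1.0, 2.0, 3.0]
    print(spmv_crs([1.0, 2.0, 3.0, 4.0, 5.0], [0, 2, 1, 0, 2], [0, 2, 3, 5], x))
    print(spmv_ellpack([[1.0, 3.0, 4.0], [2.0, 0.0, 5.0]],
                       [[0, 1, 0], [2, 0, 2]], x))
```

For the test matrix in the deck (order 10000, 177820 nonzeros) the average row holds about 17.8 nonzeros, so a row-wise CRS loop yields short vectors, whereas the ELLPACK loop spans all 10000 rows at the cost of padding every row to the longest one.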


Mesh Adaptation

Single level of refinement of mesh with 4802 triangular elements, 2500 vertices, and 7301 edges

Extensive reorganization required to take advantage of vectorization

Many indexed memory operations (limited again by address generation)

[Figure: MFLOPS (axis 0 to 600)]
