Impulse Project DARPA Review – July 2000


Page 1: Impulse Project DARPA Review – July 2000

Impulse Adaptable Memory System

Impulse Project

DARPA Review – July 2000

University of Utah

and

University of Massachusetts at Amherst


Technology Trends

Disturbing trends (for a memory architect):
– Memory gap widening (CPUs improving 60%/year, DRAM only 7%)
– Internal CPU parallelism is escalating
– Emerging applications with poor locality (multimedia, databases, …)
– Cache size growing much faster than TLB reach
– Ugly CPIs: Perl and Sites, OSDI 1996

Possible solutions:
– Bigger, deeper cache hierarchies
– Better latency-tolerating CPU features (non-blocking cache, OOO, …)
– Migrate computation to the DRAMs
– Let software control how data is managed (Impulse)


Simple Example

Problem: Sum of diagonal elements of dense matrix

for (i = 0; i < n; i++)
    sum += A[i][i];

Problems:
– Wasted bus bandwidth
– Low cache utilization
– Low cache hit ratio

[Figure: cache, memory bus, memory controller, and physical memory]
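The wasted bandwidth can be put into numbers with a rough sketch. It assumes a row-major matrix of 8-byte doubles that starts on a cache-line boundary and a 128-byte cache line; the line size and the function names are assumptions for illustration, not values from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed for illustration only; the slides do not give a line size. */
#define LINE_BYTES 128

/* Sum the diagonal of an n x n row-major matrix stored in a[]. */
double diag_sum(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i * n + i];              /* stride of (n + 1) elements */
    return sum;
}

/* Count the distinct cache lines the diagonal walk touches,
 * assuming a[] starts on a cache-line boundary. */
size_t diag_lines_touched(size_t n)
{
    size_t lines = 0;
    size_t last = (size_t)-1;             /* no line seen yet */
    for (size_t i = 0; i < n; i++) {
        size_t line = (i * n + i) * sizeof(double) / LINE_BYTES;
        if (line != last) {               /* offsets only grow, so this suffices */
            lines++;
            last = line;
        }
    }
    return lines;
}

/* Fraction of the bytes moved over the bus that the loop actually uses. */
double diag_bus_utilization(size_t n)
{
    assert(n > 0);
    return (double)(n * sizeof(double)) /
           (double)(diag_lines_touched(n) * LINE_BYTES);
}
```

Under these assumptions, a 1024 x 1024 matrix puts every diagonal element on its own cache line, so only 8 of every 128 bytes fetched are used.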


The Impulse Idea

What if software could do the following?

Create diag[*] corresponding to A[*][*]
for (i = 0; i < n; i++)
    sum += diag[i];

Improvements:
– No wasted bus bandwidth
– Better cache utilization
– Higher cache and TLB hit ratios

[Figure: cache, memory bus, memory controller, and physical memory]


How? Add Extra Level of Mapping

– Shadow address: “unused” physical address
– MC maps shadow address to physical address
– Applications configure MC through OS

[Figure: virtual space → MMU/TLB → physical space (real physical space plus shadow address space) → Impulse MC → real physical memory]
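The extra mapping level can be sketched for the diagonal example as a base-stride translation. This is a simplified model, not the Impulse hardware: the descriptor fields and the function name are hypothetical, and the target data is assumed to be physically contiguous so that the page-level translation inside the MC can be left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical descriptor for one base-stride remapping; the field names
 * are illustrative and not taken from the Impulse hardware. */
struct stride_remap {
    uint64_t shadow_base;   /* first shadow ("unused" physical) address */
    uint64_t target_base;   /* physical address of the first element */
    uint64_t stride_bytes;  /* distance between consecutive elements */
    uint64_t elem_bytes;    /* size of one element */
    uint64_t count;         /* number of elements in the dense alias */
};

/* Translate a shadow address to the physical address it aliases. */
uint64_t shadow_to_phys(const struct stride_remap *r, uint64_t shadow_addr)
{
    assert(shadow_addr >= r->shadow_base);
    uint64_t off = shadow_addr - r->shadow_base;
    uint64_t idx = off / r->elem_bytes;       /* which element of diag[] */
    assert(idx < r->count);
    return r->target_base + idx * r->stride_bytes + off % r->elem_bytes;
}
```

For an n x n matrix of 8-byte elements, diag[] would use elem_bytes = 8 and stride_bytes = (n + 1) * 8, so consecutive shadow addresses land one row and one column further along in A.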


Address Translations

[Figure: Conventional system: Virtual Memory → MMU/TLB → Physical Memory. Impulse system: Virtual Memory → MMU/TLB → Shadow Memory → Pseudo-Virtual Memory → Physical Memory, shown for the diagonal example. Translation labels: word-grained, page-grained.]


Impulse Features

Base-stride scatter/gather data
– Walk columns or diagonals efficiently
– Remap matrix tiles to contiguous memory without copying

Indirection vector accesses
– Static vectors (e.g., perform A[index[i]] efficiently)
– Dynamic cacheline assembly

Remap pages
– Create superpages from disjoint base pages
– No-copy page coloring

Aggressive controller-based prefetching
– Prefetch data from DRAMs (sequential and pointer-directed)
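As one illustration of the tile remapping feature, the index arithmetic behind a dense tile alias might look like the sketch below. The function name and parameters are hypothetical; the controller's actual address generation is not described in the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Map element k of a dense tile alias back to its index in the
 * row-major source matrix. The tile's top-left corner is (row0, col0). */
size_t tile_to_matrix_index(size_t k, size_t row0, size_t col0,
                            size_t tile_cols, size_t matrix_cols)
{
    assert(tile_cols > 0 && col0 + tile_cols <= matrix_cols);
    return (row0 + k / tile_cols) * matrix_cols + col0 + k % tile_cols;
}
```

The application would walk k = 0, 1, 2, … through the alias as if the tile were contiguous, while the source matrix stays where it is.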


Exploiting Impulse

Setup:
1. Application asks OS to set up remapping
2. OS allocates free shadow configuration register
   • sets up dense “page table” that points to target data
   • downloads address of this page table to configuration register
3. OS allocates free shadow and virtual address space
   • maps application virtual addresses to shadow physical addresses
   • returns virtual address corresponding to remapped data to app

Use:
1. TLB translation (VA to shadow)
2. Fine-grained remapping (if any)
3. Remapped addresses pass through MC-TLB
4. DRAM scheduler “collects” data
5. Application accesses (dense) remapped data
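Setup step 2 can be modeled as bookkeeping over a small pool of configuration registers. Everything in this sketch is hypothetical: the register count, the struct layout, and the function names are assumptions, and the release function is added only to make the model complete.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SHADOW_REGS 8   /* assumed; the slides give no register count */

/* Hypothetical model of one shadow configuration register. */
struct shadow_reg {
    int      in_use;
    uint64_t page_table_addr;  /* where the dense "page table" lives */
    uint64_t shadow_base;      /* shadow region served by this register */
};

static struct shadow_reg regs[NUM_SHADOW_REGS];

/* Setup step 2: claim a free register and download the page-table address.
 * Returns the register index, or -1 when every register is busy. */
int os_alloc_shadow_reg(uint64_t page_table_addr, uint64_t shadow_base)
{
    for (int i = 0; i < NUM_SHADOW_REGS; i++) {
        if (!regs[i].in_use) {
            regs[i].in_use = 1;
            regs[i].page_table_addr = page_table_addr;
            regs[i].shadow_base = shadow_base;
            return i;
        }
    }
    return -1;
}

/* Release a register once the remapping is no longer needed
 * (not shown on the slide; included so the model is complete). */
void os_free_shadow_reg(int idx)
{
    assert(idx >= 0 && idx < NUM_SHADOW_REGS && regs[idx].in_use);
    regs[idx].in_use = 0;
}
```

Because the pool is finite, a request for a remapping can fail, and the application then has to fall back to the unmapped data structure.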


Architecture Overview

[Figure: block diagram of the Impulse memory controller. Components: Register File, Shadow Engines, MTLBs, DRAM Bank Controllers, Writeback Buffer, Request Queue, Scoreboard, Out Buffer, Prefetch Unit, Shadow Staging Unit. Interface labels: DATA, COH, ADDR, I/O.]


Benchmarks

Fine-grained remapping benchmarks
– Conjugate gradient (core of DARPA vision benchmark)
– Ray tracing

Page-grained remapping benchmarks
– SPEC95 (dynamic superpage promotion)
– Compress (no-copy page coloring)

Prefetching benchmarks
– SPECint95 suite (3-15% performance improvement)
– Synthetic tree microbenchmarks


Conjugate Gradient

[Figure: sparse matrix-vector product A x P => B, with A held in the Row, Column, and Data arrays]

Store logical sparse matrix A using Yale storage scheme
– Data stores non-zero elements (much larger than P)
– Row[i] indicates where the ith row begins in Data
– Column[i] is the column number of Data[i]
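A small instance of the Yale layout, written out as C arrays, may make the three roles concrete. The 3 x 3 matrix below is made up for illustration and is not the one drawn on the slide; csr_get is a hypothetical helper.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative matrix (not the one on the slide):
 *   [ 10  0 20 ]
 *   [  0 30  0 ]
 *   [ 40  0 50 ]
 */
static const double Data[]   = { 10, 20, 30, 40, 50 };  /* non-zeros, row by row */
static const size_t Column[] = {  0,  2,  1,  0,  2 };  /* column of each Data[k] */
static const size_t Row[]    = {  0,  2,  3,  5 };      /* n + 1 row start offsets */

/* Look up A[i][j]; entries that are not stored in Data are zero. */
double csr_get(size_t i, size_t j)
{
    assert(i < 3 && j < 3);
    for (size_t k = Row[i]; k < Row[i + 1]; k++)
        if (Column[k] == j)
            return Data[k];
    return 0.0;
}
```

Row carries one extra entry so that Row[i + 1] marks the end of row i, which is the bound the multiplication loop relies on.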


Optimizing Conjugate Gradient

Original Code

for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * P[Col[j]];
    b[i] = sum;

Issues:
• Data and Col are large streams
• P reusable, but forced out of cache
• Poor L1 cache hit rates
• Interference in L2 cache

Optimized Code

Pi = remap_indirect(P, Col, n, …);
for i = 0 to n-1 do
    sum = 0;
    for j = Row[i] to Row[i+1]-1 do
        sum += Data[j] * Pi[j];
    b[i] = sum;

Issues:
• Indirect access to P[Col[j]] turned into sequential streaming access
• No reuse on P now
• Side effect: eliminate access to Col
• Significant improvement to hit rates (both L1 and TLB)
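The two loops can be compared on a conventional machine by emulating the remapping with a copy. In this sketch gather_indirect is a hypothetical software stand-in for remap_indirect: Impulse would build Pi in the memory controller without copying, whereas the copy here exists only so the code runs anywhere.

```c
#include <assert.h>
#include <stddef.h>

/* Software stand-in for the remapping: Pi[j] aliases P[Col[j]].
 * Impulse would do this in the memory controller without a copy. */
void gather_indirect(double *Pi, const double *P, const size_t *Col, size_t nnz)
{
    assert(Pi != NULL && P != NULL && Col != NULL);
    for (size_t j = 0; j < nnz; j++)
        Pi[j] = P[Col[j]];
}

/* Original loop: one indirect access to P per non-zero element. */
void spmv_indirect(double *b, const double *Data, const size_t *Row,
                   const size_t *Col, const double *P, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = Row[i]; j < Row[i + 1]; j++)
            sum += Data[j] * P[Col[j]];
        b[i] = sum;
    }
}

/* Optimized loop: Data and Pi are both read as sequential streams,
 * and Col is no longer touched. */
void spmv_streamed(double *b, const double *Data, const size_t *Row,
                   const double *Pi, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        double sum = 0.0;
        for (size_t j = Row[i]; j < Row[i + 1]; j++)
            sum += Data[j] * Pi[j];
        b[i] = sum;
    }
}
```

Both loops compute the same b; the streamed version differs only in the memory access pattern it presents to the caches and TLB.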


Conjugate Gradient Results

                 Base     Impulse
Time (cycles)    5.48B    1.77B
L1 hit ratio     63.4%    77.8%
L2 hit ratio     19.7%    15.9%
TLB cycles       10.1M    0.5M
Speedup          ---      3.1X

Significant improvement in effective cache locality


Volume Rendering: Ray Tracing

Problem: Ray traversals are “random” memory accesses

Solution:
– Calculate addresses of rays as “indirection vector”
– Access rays via Impulse-remapped data structure


Volume Rendering Results

               Orig (A)   Impulse (A)   Orig (B)   Impulse (B)
Time           264M       185M          1440M      285M
L1 hit ratio   96.8%      96.6%         86.3%      91.7%
L2 hit ratio   0.8%       0.9%          0.4%       6.2%
TLB cycles     0.30M      0.31M         259M       0.13M
Speedup        --         1.4X          --         6.1X

A: rays follow natural memory layout (X axis)
B: rays perpendicular to natural memory layout (Z axis)


Coarse Grained Remappings

Page-grained remapping

Aggressive use of synthetic superpages
– modified kernel TLB miss handler to detect pages responsible for frequent TLB misses
– create superpage by page-grained remapping on memory controller
– no copying, therefore can be far more aggressive

No-copy page coloring
– Problem: conflicts in the physically-indexed L2 cache
– Normal solution: copy to non-conflicting pages
– Impulse solution: remap to non-conflicting pages
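Whether two pages conflict in a physically-indexed cache follows from their page colors. The sketch below assumes 4 KB pages and a 512 KB, 2-way L2; all three parameters and both function names are assumptions for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* All three parameters are assumptions for illustration. */
#define PAGE_BYTES   4096u
#define CACHE_BYTES  (512u * 1024u)   /* physically-indexed L2 */
#define CACHE_WAYS   2u

/* Number of page colors: how many pages fit in one way of the cache. */
#define NUM_COLORS   (CACHE_BYTES / CACHE_WAYS / PAGE_BYTES)

/* Color of the page containing the (real or shadow) physical address. */
unsigned page_color(uint64_t addr)
{
    return (unsigned)((addr / PAGE_BYTES) % NUM_COLORS);
}

/* Two pages compete for the same cache sets when their colors match. */
int pages_conflict(uint64_t a, uint64_t b)
{
    return page_color(a) == page_color(b);
}
```

Since the cache indexes on whatever address the CPU issues, giving a hot page a shadow address of a different color moves it to other cache sets while the data itself stays in place.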


Shadow-Backed Superpages

[Figure: four virtual pages (0x00004000 to 0x00007000) mapped through four contiguous shadow pages (0x80240000 to 0x80243000) to four disjoint physical pages (0x40138000, 0x06155000, 0x04012000, 0x12011000)]

– SPECint95 improves 5-20%
– MTLB increases effective reach of CPU TLB
– Superpage large and multiple arrays at compile time
  – at allocation time (cheapest) or dynamically
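The second translation step for a shadow-backed superpage can be sketched with the addresses from the figure. Which shadow page maps to which physical frame is an assumption here, since the figure's arrows did not survive; mtlb_translate is a hypothetical name.

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SHIFT  12
#define PAGE_MASK   0xFFFu

/* Shadow superpage base from the slide. */
#define SHADOW_BASE 0x80240000u

/* Physical frames from the slide; the order in which the shadow pages
 * map to them is assumed. */
static const uint32_t mtlb[4] = {
    0x40138000u, 0x06155000u, 0x04012000u, 0x12011000u
};

/* Translate a shadow address inside the 16 KB superpage to a real
 * physical address, as the controller's MTLB would. */
uint32_t mtlb_translate(uint32_t shadow)
{
    uint32_t page = (shadow - SHADOW_BASE) >> PAGE_SHIFT;
    assert(shadow >= SHADOW_BASE && page < 4);
    return mtlb[page] | (shadow & PAGE_MASK);
}
```

The CPU TLB then needs a single 16 KB entry from virtual 0x00004000 to shadow 0x80240000, even though the four backing frames are scattered.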


MMC-Based Prefetching

Idea: Prefetch data off of DRAMs into SRAM on MMC

Misprediction penalties significantly reduced
– conflict misses due to cache capacity limitations
– system bus bandwidth

Exploits “free” DRAM bandwidth at MMC level
– higher aggregate DRAM bandwidth than cache or bus bandwidth

Reduces latency of accesses that hit in prefetch cache


Pointer-based Microbenchmarks

Random walk down tree w/ N children per node
– vary number of children from 1 (linked list) to 3 (trinary tree)

Baseline: compiler-directed prefetching

Impulse: MMC prefetches next nodes in tree (1-ahead)
– allocate nodes in shadow region
– tell MMC what offsets represent pointers

[Figure: tree with a Root node whose children Child1, Child2, ..., ChildN each have children Child1, Child2, ..., ChildN]
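The microbenchmark's data structure and walk might look like the sketch below. The node layout, the payload field, and both function names are assumptions; the slides only state that the MMC is told which offsets within a node hold pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define N_CHILDREN 3   /* the slides vary this from 1 to 3 */

/* Assumed node layout; the benchmark's real layout is not given. */
struct node {
    long payload;
    struct node *child[N_CHILDREN];
};

/* Byte offset of the k-th pointer field: the kind of information the
 * slide says is handed to the MMC. */
size_t child_offset(int k)
{
    assert(k >= 0 && k < N_CHILDREN);
    return offsetof(struct node, child) + (size_t)k * sizeof(struct node *);
}

/* Random walk from the root until a NULL child is chosen;
 * returns the number of nodes visited. */
size_t random_walk(const struct node *root)
{
    size_t visited = 0;
    for (const struct node *p = root; p != NULL;
         p = p->child[rand() % N_CHILDREN])
        visited++;
    return visited;
}
```

Because the next child is picked at random, software cannot know which pointer to prefetch early, whereas a 1-ahead scheme in the MMC can fetch all of a node's children as soon as the node itself is read.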


Pointer Prefetching Results

               P1 (N)   P1 (C)   P1 (I)   P3 (N)   P3 (C)   P3 (I)
Time           100M     99.7M    84.7M    124M     197M     109M
L1 hit ratio   67.5%    98.8%    67.5%    68.2%    97.9%    68.2%
L2 hit ratio   0.4%     0.1%     0.4%     0.4%     0.3%     0.5%
TLB cycles     1.6M     1.2M     1.6M     6.2M     6.2M     6.0M
Speedup        ---      1.0X     1.2X     ---      -0.3X    1.14X

P1(N): singly-linked list, no prefetching
P3(C): triply-linked list, compiler-directed prefetching
P#(I): Impulse MMC-directed prefetching


Prototyping Status

Four-stage prototype strategy
– I: Slow conventional MMC
– II: Fast conventional MMC
– III: Impulse on an FPGA
– IV: Impulse in an ASIC

Current Status:
– Stage I complete (pictured)
– Stage II imminent (final testing)
– Stage III underway (3/01)
– Stage IV next year (12/01)


Summary

Impulse Benefits
– Higher memory bus utilization
– Higher cache utilization
– Turns sparse memory operations into dense ones

Range of optimizations
– Fine-grained data remapping
– Page-grained data remapping
– Memory-based prefetching

Impact
– Performance increase for small increase in cost
– Does not require changes to CPUs, caches, or DRAMs


Questions?

http://www.cs.utah.edu/impulse