TRANSCRIPT
The Memory Gap: to Tolerate or to Reduce?
Jean-Luc Gaudiot
Professor
University of California, Irvine
April 2nd, 2002
Outline
The problem: the Memory Gap • Simultaneous Multithreading • Decoupled Architectures • Memory Technology • Processor-In-Memory
The Memory Latency Problem
Technological trend: memory latency keeps growing relative to microprocessor speed (the gap widens by roughly 40% per year)
Problem: Memory Latency - the conventional memory hierarchy is insufficient:
• Many applications have large data sets that are accessed non-contiguously.
• Some SPEC benchmarks spend more than half of their time stalling [Lebeck and Wood 1994].
Domain: benchmarks with large data sets: symbolic, signal processing, and scientific programs
Some Solutions
Larger caches
– Slow
– Works well only if the working set fits in the cache and there is temporal locality
Hardware prefetching
– Cannot be tailored for each application
– Behavior based on past and present execution-time behavior
Software prefetching
– Ensure overheads of prefetching do not outweigh the benefits -> conservative prefetching
– Adaptive software prefetching is required to change the prefetch distance during run-time (see the sketch below)
– Hard to insert prefetches for irregular access patterns
Multithreading
– Solves the throughput problem, not the memory latency problem
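As a concrete illustration of the prefetch-distance issue above, a minimal C sketch of software prefetching using GCC's __builtin_prefetch; the loop and the fixed distance PF_DIST are illustrative (not from the slides), and adaptive schemes adjust this distance at run time:

    /* Software-prefetched array sum: issue a prefetch PF_DIST iterations
     * ahead of the use. Too small a distance means the data is still in
     * flight (late prefetch); too large risks eviction before use. */
    #define PF_DIST 16   /* illustrative fixed prefetch distance */

    double sum(const double *a, int n)
    {
        double s = 0.0;
        for (int i = 0; i < n; ++i) {
            if (i + PF_DIST < n)
                __builtin_prefetch(&a[i + PF_DIST], 0 /* read */, 1 /* low reuse */);
            s += a[i];
        }
        return s;
    }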
Limitations of Present Solutions
Huge cache
• Slow, and works well only if the working set fits in the cache and there is some kind of locality
Prefetching
• Hardware prefetching
– Cannot be tailored for each application
– Behavior based on past and present execution-time behavior
• Software prefetching
– Ensure overheads of prefetching do not outweigh the benefits
– Hard to insert prefetches for irregular access patterns
SMT
• Enhances utilization and throughput at the thread level
Outline
The problem: the Memory Gap • Simultaneous Multithreading • Decoupled Architectures • Memory Technology • Processor-In-Memory
Simultaneous Multi-Threading (SMT)
Horizontal and vertical sharing
Hardware support of multiple threads
Functional resources shared by multiple threads
Shared caches
Highest utilization with multi-program or parallel workloads
SMT Compared to SS
Superscalar processors execute multiple instructions per cycle
Superscalar functional units sit idle due to I-fetch stalls, conditional branches, and data dependencies
SMT dispatches instructions from multiple instruction streams, allowing efficient execution and latency tolerance
• Vertical sharing (TLP and block multi-threading)
• Horizontal sharing (ILP and simultaneous multiple-thread instruction dispatch)
[Figure: issue-slot diagrams over seven cycles for INT, MEM, and FP units. The superscalar issues from one thread and completes 9 instructions, with many slots lost to stalls; the SMT issues from eight threads (Thread 1 through Thread 8) and completes 20 instructions in the same window.]
CMP Compared to SS
CMP uses thread-level parallelism to increase throughput
CMP has layout efficiency
• More functional units
• Faster clock rate
CMP hardware partitioning limits performance
• Smaller level-1 resources cause increased miss rates
• Execution resources are not available across partitions
[Figure: issue-slot diagrams comparing the superscalar (9 instructions over seven cycles) with a two-processor CMP (CMP-p2), whose two superscalar cores each have INT/MEM and FP units and together complete 13 instructions over nine cycles.]
Wide Issue SS Inefficiencies
Architecture and software limitations
• Limited program ILP => idle functional units
• Increased waste from speculative execution
Technology issues
• Area grows as O(d^3) {d = issue or dispatch width}
• Area grows an additional O(t log2 t) {t = number of SMT threads}
• Increased wire delays (increased area, tighter spacings, thinner oxides, thinner metal)
• Increased memory access delays versus processor clock
• Larger pipeline penalties
Problems solved through:
• CMP - localizes processor resources
• SMT - efficient use of FUs, latency tolerance
• Both CMP and SMT - thread-level parallelism
POSM Configurations
All architectures above have eight threads
Which configuration has the highest performance for an average workload?
Run benchmarks on various configurations and find the optimal performance point
[Figures: block diagrams of the evaluated configurations. Wide-issue SMT: shared fetch, iL1/dL1, TLB, decode/rename, reorder buffer, instruction queues and out-of-order logic, INT and FP units, level-2 cache, external interface. Two-processor POSM and four-processor POSM: each processor has its own iL1, dL1, and TLB, sharing the level-2 cache through an L2 crossbar and external interface. Eight-processor CMP: processors P1 through P8, each with iL1 and dL1, sharing the level-2 cache through an L2 crossbar and external interface.]
Superscalar, SMT, CMP, and POSM Processors
CMP and SMT both have higher throughput than the superscalar
The combination of CMP and SMT has the highest throughput
Experiment results:
[Figure: issue-slot diagrams for the superscalar (9 instructions), the SMT (20 instructions), a two-processor CMP with superscalar cores (CMP-p2, 13 instructions), and a two-processor POSM with SMT cores (POSM-p2, 33 instructions) over comparable cycle windows.]
Equivalent Functional Units
• SMT.p1 has the highest performance through vertical and horizontal sharing
• cmp.p8 shows a linear increase in performance
[Chart: IPC versus number of threads (1 to 8) for smt.p1.f2.t8.d16, posm.p2.f2.t4.d8, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2.]
Equivalent Silicon Area and System Clock Effects
• SMT.p1 throughput is limited
• SMT.p1 and POSM.p2 have equivalent single-thread performance
• POSM.p4 and CMP.p8 have the highest throughput
[Chart: NIPC (normalized IPC) versus number of threads (1 to 8) for smt.p1.f2.t8.d9, posm.p2.f2.t4.d6, posm.p4.f1.t2.d4, and cmp.p8.f1.t1.d2.]
Synthesis
“Comparable silicon resources” are required for processor evaluation
POSM.p4 has 56% more throughput than the wide-issue SMT.p1
Future wide-issue processors are difficult to implement, increasing the POSM advantage
• Smaller technology spacings have higher routing delays due to parasitic resistance and capacitance
• The larger the processor, the larger the O(d^2 t log2 t) and O(d^3 t) impact on area and delays
SMT works well with deep pipelines
The ISA and micro-architecture affect SMT overhead
• A 4-thread x86 SMT would have 1/8th the SMT overhead
• Layout and micro-architecture techniques reduce SMT overhead
Outline
The problem: the Memory Gap • Simultaneous Multithreading • Decoupled Architectures • Memory Technology • Processor-In-Memory
Observation:
• Software prefetching impacts compute performance
• PIMs and RAMBUS offer a high-bandwidth memory system - useful for speculative prefetching
The HiDISC Approach
Approach:
• Add a processor to manage prefetching -> hide overhead
• Compiler explicitly manages the memory hierarchy
• Prefetch distance adapts to the program runtime behavior
Decoupled Architectures
[Figure: four pipelines compared side by side over a shared 2nd-level cache and main memory. MIPS (conventional): a single wide-issue processor with registers and cache. DEAP (decoupled) [Kurian, Hulina, & Coraor '94]: a Computation Processor (CP) and an Access Processor (AP), each with registers and cache. CAPP: a processor paired with a Cache Management Processor (CMP). HiDISC (new decoupled): a CP, an AP, and a CMP, one per level of the memory hierarchy. Issue widths noted in the figure range from 2-issue to 8-issue. PIPE: [Goodman '85]. Other decoupled processors: ACRI, ZS-1, WA.]
What is HiDISC?
A dedicated processor for each level of the memory hierarchy
Explicitly manage each level of the memory hierarchy using instructions generated by the compiler
Hide memory latency by converting data access predictability to data access locality (Just in Time Fetch)
Exploit instruction-level parallelism without extensive scheduling hardware
Zero overhead prefetches for maximal computation throughput
HiDISC
[Figure: the HiDISC organization: a Computation Processor (CP), an Access Processor (AP), and a Cache Management Processor (CMP), each with its own register file, coupled through the Slip Control Queue, Load Data Queue, Store Address Queue, and Store Data Queue; the AP accesses the L1 cache, and the CMP manages the interface to the L2 cache and higher levels.]
Slip Control Queue
The Slip Control Queue (SCQ) adapts dynamically:
• Late prefetches = prefetched data arrived after the load had been issued
• Useful prefetches = prefetched data arrived before the load had been issued

    if (prefetch_buffer_full())
        don't change size of SCQ;
    else if ((2 * late_prefetches) > useful_prefetches)
        increase size of SCQ;
    else
        decrease size of SCQ;
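A minimal C sketch of this heuristic; the variable names, initial size, and unit resize step are illustrative assumptions, not from the original:

    /* Sketch of the SCQ resizing heuristic above. scq_size is the current
     * slip (prefetch) distance; the counters are assumed to be gathered
     * over a sampling interval. */
    static int scq_size = 8;                 /* assumed initial slip distance */

    void adjust_scq(int prefetch_buffer_full,
                    int late_prefetches,     /* data arrived after the load issued  */
                    int useful_prefetches)   /* data arrived before the load issued */
    {
        if (prefetch_buffer_full)
            return;                          /* buffer saturated: leave the distance alone */
        if (2 * late_prefetches > useful_prefetches)
            scq_size++;                      /* too many late prefetches: run further ahead */
        else if (scq_size > 1)
            scq_size--;                      /* mostly useful: pull the slip distance back */
    }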
Decoupling Programs for HiDISC (Discrete Convolution - Inner Loop)

Inner loop of the convolution (source):
    for (j = 0; j < i; ++j)
        y[i] = y[i] + (x[j] * h[i-j-1]);

Computation Processor (CP) code:
    while (not EOD)
        y = y + (x * h);
    send y to SDQ

Access Processor (AP) code:
    for (j = 0; j < i; ++j) {
        load (x[j]);
        load (h[i-j-1]);
        GET_SCQ;
    }
    send (EOD token)
    send address of y[i] to SAQ

Cache Management Processor (CMP) code:
    for (j = 0; j < i; ++j) {
        prefetch (x[j]);
        prefetch (h[i-j-1]);
        PUT_SCQ;
    }

SAQ: Store Address Queue, SDQ: Store Data Queue, SCQ: Slip Control Queue, EOD: End of Data
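For reference, a compilable version of the original inner loop, with comments marking which HiDISC stream each operation feeds (a sketch; the queue operations themselves are architectural primitives, not C):

    /* Discrete convolution inner loop: y, x, h are float arrays, i is the
     * current output index. CP = Computation Processor, AP = Access
     * Processor, CMP = Cache Management Processor. */
    void convolve_inner(float *y, const float *x, const float *h, int i)
    {
        for (int j = 0; j < i; ++j) {
            /* CMP: prefetch x[j] and h[i-j-1] into the cache (PUT_SCQ)        */
            /* AP:  load x[j] and h[i-j-1] into the Load Data Queue (GET_SCQ)  */
            /* CP:  consume the two operands and accumulate                    */
            y[i] = y[i] + x[j] * h[i - j - 1];
        }
        /* AP: send the EOD token and the address of y[i] to the SAQ;
           CP: send the final y value to the SDQ */
    }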
Benchmarks
Each entry gives the benchmark, its source, lines of source code, description, and data set size:
• LLL1: Livermore Loops [45]; 20 lines; 1024-element arrays, 100 iterations; 24 KB
• LLL2: Livermore Loops; 24 lines; 1024-element arrays, 100 iterations; 16 KB
• LLL3: Livermore Loops; 18 lines; 1024-element arrays, 100 iterations; 16 KB
• LLL4: Livermore Loops; 25 lines; 1024-element arrays, 100 iterations; 16 KB
• LLL5: Livermore Loops; 17 lines; 1024-element arrays, 100 iterations; 24 KB
• Tomcatv: SPECfp95 [68]; 190 lines; 33x33-element matrices, 5 iterations; <64 KB
• MXM: NAS kernels [5]; 113 lines; unrolled matrix multiply, 2 iterations; 448 KB
• CHOLSKY: NAS kernels; 156 lines; Cholesky matrix decomposition; 724 KB
• VPENTA: NAS kernels; 199 lines; invert three pentadiagonals simultaneously; 128 KB
• Qsort: Quicksort sorting algorithm [14]; 58 lines; quicksort; 128 KB
Simulation Parameters
• L1 cache size: 4 KB; L2 cache size: 16 KB
• L1 cache associativity: 2; L2 cache associativity: 2
• L1 cache block size: 32 B; L2 cache block size: 32 B
• Memory latency: variable (0-200 cycles); memory contention time: variable
• Victim cache size: 32 entries; prefetch buffer size: 8 entries
• Load queue size: 128; store address queue size: 128
• Store data queue size: 128; total issue width: 8
Simulation Results
[Charts: performance versus main memory latency (0 to 200 cycles) for LLL3, Tomcatv, Cholsky, and Vpenta, comparing MIPS, DEAP, CAPP, and HiDISC.]
VLSI Layout Overhead (I)
Goal: evaluate the cost effectiveness of the HiDISC architecture
Cache has become a major portion of the chip area
Methodology: extrapolate a HiDISC VLSI layout from the MIPS R10000 processor (0.35 µm, 1996)
The space overhead of HiDISC is extrapolated to be 11.3% more than a comparable MIPS processor
The benchmarks should be re-run using these parameters and newer memory architectures
VLSI Layout Overhead (II)
Each entry gives the component, the original MIPS R10K area (0.35 µm), the extrapolated area (0.15 µm), and the HiDISC area (0.15 µm):
• D-Cache (32 KB): 26 mm2; 6.5 mm2; 6.5 mm2
• I-Cache (32 KB): 28 mm2; 7 mm2; 14 mm2
• TLB part: 10 mm2; 2.5 mm2; 2.5 mm2
• External interface unit: 27 mm2; 6.8 mm2; 6.8 mm2
• Instruction fetch unit and BTB: 18 mm2; 4.5 mm2; 13.5 mm2
• Instruction decode section: 21 mm2; 5.3 mm2; 5.3 mm2
• Instruction queue: 28 mm2; 7 mm2; 0 mm2
• Reorder buffer: 17 mm2; 4.3 mm2; 0 mm2
• Integer functional unit: 20 mm2; 5 mm2; 15 mm2
• FP functional units: 24 mm2; 6 mm2; 6 mm2
• Clocking and overhead: 73 mm2; 18.3 mm2; 18.3 mm2
• Total size without L2 cache: 292 mm2; 73.2 mm2; 87.9 mm2
• Total size with on-chip L2 cache: 129.2 mm2 (extrapolation); 143.9 mm2 (HiDISC)
The Flexi-DISC
Fundamental characteristics:
• Inherently highly dynamic at execution time
• Dynamically reconfigurable central computation kernel (CK)
• Multiple levels of caching and processing around the CK, with adjustable prefetching
• Multiple processors on a chip, providing flexible adaptation from multiple to single processors and horizontal sharing of the existing resources
[Figure: concentric organization: the Computation Kernel at the center, surrounded by the Low Level Cache Access Ring and the Memory Interface Ring.]
The Flexi-DISC
Partitioning of the Computation Kernel
• It can be allocated to different portions of one application or to different applications
The CK requires separation of the next ring to feed it with data
The variety of target applications makes the memory accesses unpredictable
Identical processing units for the outer rings
• Highly efficient dynamic partitioning of the resources and their run-time allocation can be achieved
[Figure: the same ring organization (Computation Kernel, Low Level Cache Access Ring, Memory Interface Ring) partitioned among Application 1, Application 2, and Application 3.]
Multiple HiDISC: McDISC
Problem: All extant, large-scale multiprocessors perform poorly when faced with a tightly-coupled parallel program.
Reason: Extant machines have a long latency when communication is needed between nodes. This long latency kills performance when executing tightly-coupled programs. (Note that multi-threading à la Tera does not help when there are dependencies.)
The McDISC solution: Provide the network interface processor (NIP) with a programmable processor to execute not only OS code (e.g. Stanford Flash), but user code, generated by the compiler.
Advantage: The NIP, executing user code, fetches data before it is needed by the node processors, eliminating the network fetch latency most of the time.
Result: Fast execution (speedup) of tightly-coupled parallel programs.
The McDISC System: Memory-Centered Distributed Instruction Set Computer
[Figure: a McDISC node. The compiler splits a program into computation, access, and cache management instructions for the Computation Processor (CP), Access Processor (AP), and Cache Management Processor (CMP), which operate on registers, cache, and main memory. Each node also contains a Disc Processor (DP) with a disc cache and a RAID disc farm, an Adaptive Signal PIM (ASP) and an Adaptive Graphics PIM (AGP) handling sensor data (FLIR, SAR, video, ESS, SES) and output to displays and the network, and a Network Interface Processor (NIP). Nodes are connected through register links to CP neighbors in a 3-D torus of pipelined rings (X, Y, Z). Application stages noted: sensor inputs, situation awareness, dynamic database, understanding, inference analysis, decision process and targeting, and network management.]
Summary
A processor for each level of the memory hierarchy
Adaptive memory hierarchy management
Reduces memory latency for systems with high memory bandwidths (PIMs, RAMBUS)
2x speedup for scientific benchmarks
3x speedup for matrix decomposition/substitution (Cholesky)
7x speedup for matrix multiply (MXM); similar results expected for ATR/SLD
Outline
The problem: the Memory Gap • Simultaneous Multithreading • Decoupled Architectures • Memory Technology • Processor-In-Memory
Memory Technology
New DRAM technologies
• DDR DRAM, SLDRAM, and DRDRAM
• Most new DRAM technologies achieve higher bandwidth
Integrating memory and processor on a single chip (PIM and IRAM)
• Bandwidth and memory access latency improve sharply
New Memory Technologies (Cont.)
Rambus DRAM (RDRAM)
• A memory interleaving system integrated onto a single memory chip
• Four outstanding requests with a pipelined micro-architecture
• Operates at much higher frequencies than SDRAM
Direct Rambus DRAM (DRDRAM)
• Direct control of all row and column resources concurrently with data transfer operations
• Current DRDRAM can achieve 1.6 Gbytes/sec of bandwidth, transferring on both clock edges
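As a back-of-the-envelope check (assuming the standard 16-bit Direct Rambus channel clocked at 400 MHz): 2 bytes x 400 MHz x 2 edges = 1.6 Gbytes/sec.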
Intelligent RAM (IRAM)
Merges processor and memory technology on a single die
All memory accesses remain within a single chip
• Bandwidth can be as high as 100 to 200 Gbytes/sec
• Access latency is less than 20 ns
A good solution for data-intensive streaming applications
Vector IRAM
Cost-effective system
• Incorporates vector processing units and the memory system on a single chip
Beneficial for multimedia applications with critical DSP features
Good energy efficiency
Attractive for future mobile computing processors
Outline
The problem: the Memory Gap • Simultaneous Multithreading • Decoupled Architectures • Memory Technology • Processor-In-Memory
Overview of the System
Proposed DCS (Data-intensive Computing System) Architecture
DCS System (Cont’d)
Programming
• Different from the conventional programming model
• Applications are divided into two separate sections
– Software: executed by the host processor
– Hardware: executed by the CMP
• The programmer must use CMP instructions
CMP
• Several CMPs can be connected to the system bus
• CMP size and configuration vary with the amount and complexity of the job it has to handle
• The size, function, and location of the logic inside the CMP vary to better handle the application
Memory, Coprocessors, I/O
CMP Architecture
CMP (Computational Memory Processor) architecture
• The heart of our work
• Responsible for executing the core operations of data-intensive applications
• Attached to the system bus
• CMP instructions are encapsulated in normal memory operations
• Consists of many ACME (Application-specific Computational Memory Element) cells interconnected through dedicated communication links
CMC (Computing Memory Cluster)
• A small number of ACME cells are put together to form a CMC
• The network connecting the CMCs is separate from the memory decoder
CMP Architecture
CMC Architecture
ACME Architecture
ACME (Application-specific Computational Memory Element) architecture
• ACME memory, configuration cache, CE (Computing Element), FSM
• The CE is the reconfigurable computing unit and consists of many CCs (Computing Cells)
• The FSM governs the overall execution of the ACME
Inside the Computing Elements
Synchronization and Interface
Three different kinds of communication:
• Host processor with the CMP (and eventually with each ACME)
– Done through synchronization variables (specific memory locations) located inside the memory of each ACME cell
– Example: start and end signals for operations, and CMP instructions for each ACME (a minimal host-side sketch follows)
• ACME to ACME
– Two different approaches:
• Host-mediated: simple, but not practical for frequent communication
• Distributed approach: expensive and complex, but efficient
• CMP to CMP
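A minimal host-side sketch of the memory-mapped synchronization described above; the base address, flag offsets, and polling loop are assumptions for illustration only:

    /* Host-processor side of the host <-> ACME handshake: the start and
     * end "signals" are ordinary memory locations inside the ACME's memory.
     * ACME_SYNC_BASE and the flag offsets are hypothetical. */
    #include <stdint.h>

    #define ACME_SYNC_BASE 0x80000000u  /* assumed base of ACME 0's sync area */

    volatile uint32_t *acme_start = (volatile uint32_t *)(ACME_SYNC_BASE + 0x0);
    volatile uint32_t *acme_done  = (volatile uint32_t *)(ACME_SYNC_BASE + 0x4);

    void run_acme_operation(void)
    {
        *acme_done  = 0;         /* clear the completion flag                  */
        *acme_start = 1;         /* an ordinary store doubles as "start"       */
        while (*acme_done == 0)  /* poll until the ACME signals completion     */
            ;                    /* (a real host could block or do other work) */
    }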
Benefits of the Paradigm
All the benefits of being a PIM
• Increased bandwidth and reduced latency
• Faster computation
– Parallel execution among many ACMEs
Effective use of the full memory bandwidth
Efficient co-existence of software and hardware
More parallel execution inside each ACME by configuring its structure to fit the application
Scalability
Implementation of the CMP
Projection of how our CMP could be implemented:
• According to the 2000 edition of the ITRS (International Technology Roadmap for Semiconductors), in year 2008
– A high-end MPU with 1.381 billion transistors will be in production with 0.06 um technology and a 427 mm2 die
– If half of the die is allocated to memory, 8.13 Gbits of storage will be available, leaving 690 million transistors for logic
– There can be 2048 ACME cells, each with 512 Kbytes of memory and 315K transistors for logic, control, and everything else inside the ACME, with the remaining resources (36M transistors) for interconnect
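As a rough check of those numbers: 2048 x 315K is about 645 million transistors of ACME logic, plus about 36 million for interconnect, i.e. roughly 681 million, within the 690-million-transistor logic budget; and 8.13 Gbits / 2048 is about 4 Mbits, i.e. roughly 500 Kbytes of memory per ACME, matching the 512 Kbyte figure.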
Motion Estimation of MPEG
Finding the motion vector for a macro block in a frame
Motion estimation absorbs about 70% of the total execution time of MPEG encoding
A huge number of simple additions, subtractions, and comparisons (a sketch of the kernel follows)
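A minimal C sketch of the block-matching kernel such an ACME would implement: full-search matching with a sum-of-absolute-differences (SAD) cost, for an 8x8 macro block and an 8-pixel displacement as on the later slide. Function and parameter names are illustrative, and frame-boundary handling is omitted:

    #include <stdlib.h>
    #include <limits.h>

    /* Find the motion vector for the 8x8 macro block at (bx, by) in the
     * current frame by exhaustively searching +/-8-pixel displacements in
     * the reference frame. Boundary checks are omitted for brevity. */
    void find_motion_vector(const unsigned char *cur, const unsigned char *ref,
                            int width, int bx, int by, int *mvx, int *mvy)
    {
        int best = INT_MAX;
        for (int dy = -8; dy <= 8; ++dy) {
            for (int dx = -8; dx <= 8; ++dx) {
                int sad = 0;
                for (int y = 0; y < 8; ++y)      /* 8x8 macro block */
                    for (int x = 0; x < 8; ++x)
                        sad += abs(cur[(by + y) * width + bx + x] -
                                   ref[(by + dy + y) * width + bx + dx + x]);
                if (sad < best) {                /* keep the best match */
                    best = sad;
                    *mvx = dx;
                    *mvy = dy;
                }
            }
        }
    }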
Example ME execution
One ACME structure finds the motion vector for one macro block
• Executes in pipelined fashion, reusing the data
Example ME execution
Performance
• For an 8x8 macro block with an 8-pixel displacement
• 276 clock cycles to find the motion vector for one macro block
Performance comparison with other architectures