A Case for Subarray-Level Parallelism (SALP) in DRAM Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu


Post on 24-Feb-2016



TRANSCRIPT

Page 1: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

A Case for Subarray-Level Parallelism (SALP) in DRAM

Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Page 2: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Executive Summary

• Problem: Requests to the same DRAM bank are serialized
• Our Goal: Parallelize requests to the same DRAM bank at a low cost
• Observation: A bank consists of subarrays that occasionally share global structures
• Solution: Increase independence of subarrays to enable parallel operation
• Result: Significantly higher performance and energy-efficiency at low cost (+0.15% area)

Page 3: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Outline

• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Page 4: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Introduction

[Figure: a DRAM chip with four banks. Four requests spread across the four banks, contrasted with four requests that all target one bank. Labels: "Bank conflict!", "4x latency".]
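The 4x latency on this slide follows from simple serialization. The sketch below is a minimal model, assuming each request occupies its bank for one full access and ignoring bus contention. The function name is hypothetical, and the 50 ns default is borrowed from the total access latency quoted on the Logical Bank slide.

```python
from collections import Counter

def completion_time_ns(bank_ids, access_ns=50):
    """Time until the last request finishes.

    Assumes banks work in parallel and each bank serves its own
    requests one after another (illustrative model only).
    """
    if not bank_ids:
        return 0
    per_bank = Counter(bank_ids)            # requests queued at each bank
    return max(per_bank.values()) * access_ns

different_banks = completion_time_ns([0, 1, 2, 3])  # one request per bank
same_bank = completion_time_ns([0, 0, 0, 0])        # four requests, one bank
slowdown = same_bank / different_banks
```

Under these assumptions the ratio depends only on the deepest per-bank queue, which is why adding banks (or, as the deck argues, exposing subarrays) shortens it.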

Page 5: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Bank conflicts degrade performance

Three Problems
1. Requests are serialized
2. Serialization is worse after write requests
3. Thrashing in row-buffer

[Figure: a bank with several rows and one row-buffer receiving three requests. Label: "Thrashing: increases latency".]

Page 6: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Case Study: Timeline

• Case #1. Different Banks
– Served in parallel
• Case #2. Same Bank
– Delayed by: 1. Serialization, 2. Write Penalty, 3. Thrashing Row-Buffer

[Figure: per-bank timelines of Wr and Rd requests for the two cases. Labels: "Served in parallel", "Delayed".]

Page 7: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Our Goal

• Goal: Mitigate the detrimental effects of bank conflicts in a cost-effective manner
• Naïve solution: Add more banks
– Very expensive
• We propose a cost-effective solution

Page 8: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

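Key Observation #1

A DRAM bank is divided into subarrays

• Logical Bank: 32k rows over one row-buffer
– A single row-buffer cannot drive all rows
• Physical Bank: Subarray1 through Subarray64, plus a global row-buffer
– Many local row-buffers, one at each subarray

[Figure: logical bank (rows over a single row-buffer) next to physical bank (subarrays with local row-buffers above a global row-buffer).]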

Page 9: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Key Observation #2

Each subarray is mostly independent…
– except occasionally sharing global structures

[Figure: a bank with Subarray1 through Subarray64, each with a local row-buffer, sharing the global decoder and the global row-buffer.]

Page 10: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Key Idea: Reduce Sharing of Globals

1. Parallel access to subarrays
2. Utilize multiple local row-buffers

[Figure: a bank with local row-buffers, the global decoder, and the global row-buffer.]

Page 11: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Overview of Our Mechanism

Requests to the same bank... but different subarrays:
1. Parallelize
2. Utilize multiple local row-buffers

[Figure: requests arriving at Subarray1 and Subarray64 of one bank, each with a local row-buffer, above the global row-buffer.]

Page 12: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Outline

• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Page 13: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Organization of DRAM System

[Figure: the CPU connects over a bus to a channel of the DRAM system; the channel contains ranks, and each rank contains banks.]

Page 14: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Naïve Solutions to Bank Conflicts

1. More channels: expensive
– Many CPU pins
2. More ranks: low performance
– Large load, low frequency
3. More banks: expensive
– Significantly increases DRAM die area

[Figure: a DRAM system with four channels and four buses; a channel carrying several ranks; a rank with additional banks.]

Page 15: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

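Logical Bank

• States: Precharged and Activated, switched by the ACTIVATE and PRECHARGE commands
• RD/WR commands move data through the row-buffer
• Total latency: 50ns!

[Figure: a logical bank. A decoder takes the address and raises a wordline (VDD); bitlines connect the rows to the row-buffer.]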

Page 16: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

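Physical Bank

• One row-buffer for 32k rows: very long bitlines, hard to drive
• Subarray1 through Subarray64: 512 rows each, short local bitlines, one local row-buffer each
• One global row-buffer

[Figure: a bank drawn as a single array of 32k rows next to the same bank drawn as 64 subarrays of 512 rows.]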
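The numbers on this slide fit together as 32k rows / 512 rows per subarray = 64 subarrays. The sketch below works that out and splits a bank-level row number into a subarray and a local row. It assumes consecutive rows share a subarray (the high-order row bits select it); the deck does not state the actual mapping, and `locate_row` is a hypothetical helper. Indices are zero-based, so Subarray1 on the slide is index 0 here.

```python
ROWS_PER_BANK = 32 * 1024      # "32k rows" on the slide
ROWS_PER_SUBARRAY = 512        # "512 rows" on the slide
NUM_SUBARRAYS = ROWS_PER_BANK // ROWS_PER_SUBARRAY

def locate_row(row):
    """Split a bank-level row number into (subarray, local row).

    Assumes the high-order row bits select the subarray; real
    devices may use a different mapping.
    """
    if not 0 <= row < ROWS_PER_BANK:
        raise ValueError("row out of range")
    return divmod(row, ROWS_PER_SUBARRAY)

first = locate_row(0)
last = locate_row(ROWS_PER_BANK - 1)
```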

Page 17: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Hynix 4Gb DDR3 (23nm), Lim et al., ISSCC’12

[Figure: die photo with the banks labeled; a magnified tile shows subarrays and the subarray decoder.]

Page 18: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Bank: Full Picture

[Figure: a bank with Subarray1 through Subarray64. Each subarray has local bitlines, a local row-buffer, and a subarray decoder. The global decoder, the latch, the global bitlines, and the global row-buffer are shared.]

Page 19: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Outline

• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Page 20: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Problem Statement

Requests to different subarrays: Serialized!

[Figure: two requests targeting different subarrays of one bank, each subarray with a local row-buffer, above the global row-buffer.]

Page 21: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Overview: MASA

MASA (Multitude of Activated Subarrays)

Challenges: Global Structures

[Figure: the global decoder receives two addresses and raises a wordline (VDD) in two subarrays, so both local row-buffers are ACTIVATED; READs are served through the global row-buffer.]

Page 22: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Challenges: Global Structures

1. Global Address Latch
2. Global Bitlines

Page 23: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Challenge #1. Global Address Latch

[Figure: the global decoder with a single address latch in front of two subarrays and their local row-buffers, above the global row-buffer. Labels: addr, VDD, Latch, PRECHARGED, ACTIVATED.]

Page 24: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Solution #1. Subarray Address Latch

[Figure: the same bank with a latch at each subarray, so a wordline (VDD) stays raised in two subarrays at once. Labels: VDD, Latch, ACTIVATED.]

Page 25: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Challenges: Global Structures

1. Global Address Latch
• Problem: Only one raised wordline
• Solution: Subarray Address Latch
2. Global Bitlines

Page 26: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Challenge #2. Global Bitlines

[Figure: two local row-buffers connect through switches to the shared global bitlines and the global row-buffer. On a READ, both drive the global bitlines. Label: "Collision".]

Page 27: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

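Solution #2. Designated-Bit Latch

[Figure: each switch between a local row-buffer and the global bitlines is controlled by a designated-bit latch (D), reached over an added wire; READs pass to the global row-buffer from the designated subarray.]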

Page 28: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Challenges: Global Structures

1. Global Address Latch
• Problem: Only one raised wordline
• Solution: Subarray Address Latch
2. Global Bitlines
• Problem: Collision during access
• Solution: Designated-Bit Latch
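The two solutions above can be pictured as per-subarray state inside a bank. The sketch below is a toy behavioral model, not the authors' circuit: the class and method names are hypothetical, timing is ignored, and `select` stands in for the SA-SEL command mentioned on the Area & Power Overhead slide.

```python
class MasaBank:
    """Toy model of one bank with per-subarray latches (illustrative)."""

    def __init__(self, num_subarrays=8):
        # Subarray address latch: the row whose wordline is raised, or None.
        self.row_latch = [None] * num_subarrays
        # Designated bit: at most one subarray drives the global bitlines.
        self.designated = None

    def activate(self, subarray, row):
        if self.row_latch[subarray] is not None:
            raise RuntimeError("subarray already activated; precharge first")
        self.row_latch[subarray] = row

    def select(self, subarray):
        """Designate one activated subarray for column accesses."""
        if self.row_latch[subarray] is None:
            raise RuntimeError("cannot designate a precharged subarray")
        self.designated = subarray

    def read(self, subarray, row):
        """Return True if the designated subarray holds the requested row."""
        if self.designated != subarray:
            raise RuntimeError("subarray is not designated")
        return self.row_latch[subarray] == row

    def precharge(self, subarray):
        self.row_latch[subarray] = None
        if self.designated == subarray:
            self.designated = None

bank = MasaBank()
bank.activate(0, row=12)
bank.activate(1, row=700)   # a second subarray stays activated as well
bank.select(0)
hit_a = bank.read(0, 12)
bank.select(1)              # switch subarrays without re-activating
hit_b = bank.read(1, 700)
```

Keeping one latch per subarray lets several rows stay open at once, while the single designated index ensures that only one local row-buffer reaches the global bitlines at a time.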

Page 29: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

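MASA: Advantages

• Baseline (Subarray-Oblivious): 1. Serialization, 2. Write Penalty, 3. Thrashing
• MASA: the Wr and Rd requests overlap in time

[Figure: timelines of two Wr and two Rd requests for the baseline and for MASA. Label on the time difference: "Saved".]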

Page 30: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

MASA: Overhead

• DRAM Die Size: Only 0.15% increase
– Subarray Address Latches
– Designated-Bit Latches & Wire
• DRAM Static Energy: Small increase
– 0.56mW for each activated subarray
– But saves dynamic energy
• Controller: Small additional storage
– Keep track of subarray status (< 256B)
– Keep track of new timing constraints

Page 31: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Cheaper Mechanisms

[Figure: comparison of MASA, SALP-2, and SALP-1 by the problems they address (1. Serialization, 2. Wr-Penalty, 3. Thrashing) and by the added hardware (latches, designated bit D).]

Page 32: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Outline

• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Page 33: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Related Works

• Randomized bank index [Rau ISCA’91, Zhang+ MICRO’00, …]
– Use XOR hashing to generate bank index
– Cannot parallelize bank conflicts
• Rank-subsetting [Ware+ ICCD’06, Zheng+ MICRO’08, Ahn+ CAL’09, …]
– Partition rank and data-bus into multiple subsets
– Increases unloaded DRAM latency
• Cached DRAM [Hidaka+ IEEE Micro’90, Hsu+ ISCA’93, …]
– Add SRAM cache inside of DRAM chip
– Increases DRAM die size (+38.8% for 64kB)
• Hierarchical Bank [Yamauchi+ ARVLSI’97]
– Parallelize accesses to subarrays
– Adds complex logic to subarrays
– Does not utilize multiple local row-buffers
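The first related-work entry, XOR hashing of the bank index, can be sketched in a few lines. This is a generic illustration, assuming 8 banks and XOR-ing the bank bits with the low-order row bits; the cited papers differ in which bits they combine, and `xor_bank_index` is a hypothetical name.

```python
BANK_BITS = 3                       # 8 banks (assumed)
BANK_MASK = (1 << BANK_BITS) - 1

def xor_bank_index(row, bank):
    """Permute the bank index by XOR-ing it with low-order row bits."""
    return bank ^ (row & BANK_MASK)

# Rows 0..7 that all map to bank 0 without hashing are spread over 8 banks.
hashed = [xor_bank_index(row, bank=0) for row in range(8)]

# Different (row, bank) pairs can still land on the same bank.
collision = xor_bank_index(0, 0) == xor_bank_index(5, 5)
```

Hashing lowers how often conflicts occur, but two requests that do reach the same bank are still serialized, which is the limitation the slide points out.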

Page 34: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Outline

• Motivation & Key Idea
• Background
• Mechanism
• Related Works
• Results

Page 35: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Methodology

• DRAM Area/Power
– Micron DDR3 SDRAM System-Power Calculator
– DRAM Area/Power Model [Vogelsang, MICRO’10]
– CACTI-D [Thoziyoor+, ISCA’08]
• Simulator
– CPU: Pin-based, in-house x86 simulator
– Memory: Validated cycle-accurate DDR3 DRAM simulator
• Workloads
– 32 Single-core benchmarks
  • SPEC CPU2006, TPC, STREAM, random-access
  • Representative 100 million instructions
– 16 Multi-core workloads
  • Random mix of single-thread benchmarks

Page 36: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Configuration

• System Configuration
– CPU: 5.3GHz, 128 ROB, 8 MSHR
– LLC: 512kB per-core slice
• Memory Configuration
– DDR3-1066
– (default) 1 channel, 1 rank, 8 banks, 8 subarrays-per-bank
– (sensitivity) 1-8 chans, 1-8 ranks, 8-64 banks, 1-128 subarrays
• Mapping & Row-Policy
– (default) Line-interleaved & Closed-row
– (sensitivity) Row-interleaved & Open-row
• DRAM Controller Configuration
– 64-/64-entry read/write queues per-channel
– FR-FCFS, batch scheduling for writes
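FR-FCFS, the scheduling policy listed above, prefers requests that hit an open row and breaks ties by age. The sketch below shows only that selection rule, assuming a simple list as the queue; the `Request` tuple and `frfcfs_pick` are hypothetical names, and write batching and timing constraints are not modeled.

```python
from collections import namedtuple

# arrival: lower means older; bank/row: target of the request.
Request = namedtuple("Request", "arrival bank row")

def frfcfs_pick(queue, open_rows):
    """Pick the next request: row-buffer hits first, then oldest first.

    open_rows maps bank -> currently open row (absent if precharged).
    """
    if not queue:
        return None
    # False sorts before True, so row hits win; arrival breaks ties.
    return min(queue, key=lambda r: (open_rows.get(r.bank) != r.row, r.arrival))

queue = [Request(0, 0, 5), Request(1, 0, 9), Request(2, 1, 3)]
chosen = frfcfs_pick(queue, open_rows={0: 9})   # bank 0 has row 9 open
oldest = frfcfs_pick(queue, open_rows={})       # no open rows: plain FCFS
```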

Page 37: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Single-Core: Instruction Throughput

[Figure: IPC improvement (0% to 80%) of MASA and "Ideal" for hmmer, leslie3d, zeusmp, Gems., sphinx3, scale, add, triad, and gmean. Geometric mean: 17% (MASA), 20% ("Ideal").]

MASA achieves most of the benefit of having more banks (“Ideal”)

Page 38: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

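Single-Core: Instruction Throughput

[Figure: IPC increase (0% to 30%) for SALP-1, SALP-2, MASA, and "Ideal".]

• SALP-1: 7% IPC increase
• SALP-2: 13% IPC increase
• MASA: 17% IPC increase
• "Ideal": 20% IPC increase
• DRAM die area: < 0.15% (SALP-1, SALP-2), 0.15% (MASA), 36.3% ("Ideal")

SALP-1, SALP-2, MASA improve performance at low cost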

Page 39: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Single-Core: Sensitivity to Subarrays

[Figure: IPC improvement of MASA (0% to 30%) versus subarrays-per-bank (1, 2, 4, 8, 16, 32, 64, 128).]

You do not need many subarrays for high performance

Page 40: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Single-Core: Row-Interleaved, Open-Row

[Figure: IPC increase (0% to 20%) for MASA and "Ideal". Bar labels: 15%, 12%.]

MASA’s performance benefit is robust to mapping and page-policy

Page 41: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Single-Core: Row-Interleaved, Open-Row

[Figure: normalized dynamic energy (0.0 to 1.2), Baseline versus MASA. Label: -19%.]

[Figure: row-buffer hit-rate (0% to 100%), Baseline versus MASA. Label: +13%.]

MASA increases energy-efficiency

Page 42: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Other Results/Discussion in Paper

• Multi-core results
• Sensitivity to number of channels & ranks
• DRAM die area overhead of:
– Naively adding more banks
– Naively adding SRAM caches
• Survey of alternative DRAM organizations
– Qualitative comparison

Page 43: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Conclusion

• Problem: Requests to the same DRAM bank are serialized
• Our Goal: Parallelize requests to the same DRAM bank at a low cost
• Observation: A bank consists of subarrays that occasionally share global structures
• MASA: Reduces sharing to enable parallel access and to utilize multiple row-buffers
• Result: Significantly higher performance and energy-efficiency at low cost (+0.15% area)

Page 44: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

A Case for Subarray-Level Parallelism (SALP) in DRAM

Yoongu Kim, Vivek Seshadri, Donghyuk Lee, Jamie Liu, Onur Mutlu

Page 45: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Exposing Subarrays to Controller

• Every DIMM has an SPD (Serial Presence Detect)
– 256-byte EEPROM
– Contains information about DIMM and DRAM devices
– Read by BIOS during system-boot
• SPD reserves 100+ bytes for manufacturer and user
– Sufficient for subarray-related information
  1. Whether SALP-1, SALP-2, MASA are supported
  2. Physical address bit positions for subarray index
  3. Values of timing constraints: tRA, tWA

(Image: JEDEC)
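Item 2 in the list above implies that the controller gathers a few physical address bits into a subarray index. A minimal sketch of that step follows; the bit positions 16 to 18 are a made-up layout for illustration (the deck does not give one), and `subarray_index` is a hypothetical helper.

```python
def subarray_index(phys_addr, bit_positions):
    """Gather the given address bits (least significant first) into an index."""
    index = 0
    for i, pos in enumerate(bit_positions):
        index |= ((phys_addr >> pos) & 1) << i
    return index

# Hypothetical layout: address bits 16..18 select one of 8 subarrays.
SUBARRAY_BITS = (16, 17, 18)
idx = subarray_index(0x5_0000, SUBARRAY_BITS)   # bits 16 and 18 are set
```

Passing the positions as data rather than hard-coding a shift and mask mirrors the idea on the slide: the controller learns the layout from the module instead of assuming it.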

Page 46: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Multi-Core: Memory Scheduling

Configuration: 8-16 cores, 2 chan, 2 ranks-per-chan

[Figure: WS increase (0% to 25%) of Baseline, SALP-1, SALP-2, and MASA under the FRFCFS and TCM schedulers, for an 8-core system and a 16-core system.]

Our mechanisms further improve performance when employed with application-aware schedulers

We believe it can be even greater with subarray-aware schedulers

Page 47: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Number of Subarrays-Per-Bank

• As DRAM chips grow in capacity…
– More rows-per-bank → more subarrays-per-bank
• Not all subarrays may be accessed in parallel
– Faulty rows are remapped to spare rows
– If remapping occurs between two subarrays…
  • They can no longer be accessed in parallel
• Subarray group
– Restrict remapping: only within a group of subarrays
– Each subarray group can be accessed in parallel
– We refer to a subarray group as a “subarray”
  • We assume 8 subarrays-per-bank
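The grouping rule above can be written as a small check. The sketch assumes 64 physical subarrays per bank (the figure used in the background slides) folded into 8 groups of contiguous subarrays; the deck does not say how groups are formed, so the contiguous layout and both function names are assumptions.

```python
PHYSICAL_SUBARRAYS = 64   # assumed, as in the 64-subarray bank shown earlier
GROUPS = 8                # "8 subarrays-per-bank" as seen by the controller
PER_GROUP = PHYSICAL_SUBARRAYS // GROUPS

def group_of(physical_subarray):
    """Map a physical subarray to its group (contiguous grouping assumed)."""
    return physical_subarray // PER_GROUP

def can_overlap(a, b):
    """Accesses may proceed in parallel only across different groups."""
    return group_of(a) != group_of(b)

same_group = can_overlap(0, 7)     # both in group 0
cross_group = can_overlap(7, 8)    # groups 0 and 1
```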

Page 48: A Case for  Subarray -Level Parallelism  (SALP) in DRAM

Area & Power Overhead

• Latches: Per-Subarray Row-Address, Designated-Bit
– Storage: 41 bits per subarray
– Area: 0.15% in die area (assuming 8 subarrays-per-bank)
– Power: 72.2uW (negligible)
• Multiple Activated Subarrays
– Power: 0.56mW static power for each additional activated subarray
  • Small compared to 48mW baseline static power
• SA-SEL Wire/Command
– Area: One extra wire (negligible)
– Power: SA-SEL consumes 49.6% of the power of ACT
• Memory Controller: Tracking the status of subarrays
– Storage: Less than 256 bytes
  • Activated? Which wordline is raised? Designated?
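A rough back-of-the-envelope check of the controller storage and static power figures is sketched below. It assumes the default configuration from the Configuration slide (1 rank, 8 banks, 8 subarrays-per-bank), 32k rows per bank, and one record per subarray holding the three fields listed above. The exact encoding the authors used is not given, so the byte count is an estimate, not their number.

```python
BANKS = 8                   # default configuration in the deck
SUBARRAYS_PER_BANK = 8
ROWS_PER_BANK = 32 * 1024   # "32k rows" from the background slides

# Per-subarray record: activated? (1 bit), raised wordline, designated? (1 bit).
rows_per_subarray = ROWS_PER_BANK // SUBARRAYS_PER_BANK
row_bits = (rows_per_subarray - 1).bit_length()   # bits to name a local row
bits_per_subarray = 1 + row_bits + 1
controller_bytes = BANKS * SUBARRAYS_PER_BANK * bits_per_subarray / 8

# Static power of one additional activated subarray relative to the baseline.
EXTRA_MW_PER_SUBARRAY = 0.56
BASELINE_STATIC_MW = 48.0
per_subarray_fraction = EXTRA_MW_PER_SUBARRAY / BASELINE_STATIC_MW
```

Under these assumptions the estimate stays below the 256-byte bound on the slide; more ranks or channels would scale it proportionally.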