
Vinodh Cuppu and Bruce Jacob, University of Maryland

Concurrency, Latency, or System Overhead: which Has the Largest Impact on Uniprocessor DRAM-System Performance?

Richard Wells

ECE 7810

April 21, 2009

The University of Utah

Reservations

The paper is old: presented at ISCA 2001

Only considers uniprocessor systems

The authors draw some conclusions that, while valid, are focused on their research goals

Papers relating to our group's project are not prevalent in recent years, except one already presented at the architecture reading club


Overview

Investigate DRAM system organization parameters to determine the bottleneck

Determine synergy or antagonism between groups of parameters

Empirically determine the optimal DRAM system configuration


Methodologies to increase system performance

Concurrent transactions

Reduced latency

Reduced system overhead


Previous approaches to reduce memory system overhead

DRAM component approaches:
  Increase bandwidth: the current "tack" taken by the PC industry
  Reduce DRAM latency:
    ESDRAM: an SRAM cache for the full row buffer; allows precharge to begin immediately after an access
    FCRAM: subdivides the internal bank by activating only a portion of each wordline


Previous approaches to reduce memory system overhead (cont.)

FCRAM (cont.): reducing wordline capacitance brings access time down to 30 ns (2001)

MoSys: subdivides storage into a large number of very small banks, reducing the latency of the DRAM core to nearly that of SRAM

VCDRAM: a set-associative SRAM buffer that holds a number of sub-pages


The Jump

DRAM-oriented approaches do reduce application execution time

Because even zero-latency DRAM would not reduce memory system overhead to zero, bus transactions are also considered

Other factors considered:
  Turnaround time
  Queuing delays
  Inefficiencies due to asymmetric read/write requests
  Multiprocessor effects: arbitration and cache coherence would add to overhead


CPU – DRAM Channel

Access reordering (cited the Impulse group here at the U):
  Compacts sparse data into densely packed bus transactions
  Reduces the number of bus transactions
  Possibly reduces the duration of each bus transaction
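A minimal sketch of why compaction helps, assuming a stride-8 access pattern and 8-word bus transactions (the names and numbers are illustrative, not from the paper):

```python
# Impulse-style access reordering: the memory controller gathers
# strided elements into densely packed bus transactions instead of
# fetching a mostly-useless cache line per element.

STRIDE = 8       # only every 8th word is useful (e.g., one struct field)
LINE_WORDS = 8   # words per bus transaction / cache line

def transactions_sparse(n_elements):
    # Each useful word lives in its own cache line, so each one
    # costs a full bus transaction.
    return n_elements

def transactions_dense(n_elements):
    # The controller packs LINE_WORDS useful words into each
    # transaction (ceiling division).
    return -(-n_elements // LINE_WORDS)

print(transactions_sparse(64))  # 64 transactions
print(transactions_dense(64))   # 8 transactions
```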


Increasing concurrency

Different banks on the same channel

Independent channels to different banks

Pipelined requests

Split-transaction bus
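A toy timeline model (my own illustration, not the paper's simulator) of why bank-level concurrency helps: accesses to independent banks overlap, so only the data bursts serialize on the bus. The 70 ns access time is the paper's assumption; the 20 ns burst time assumes a 64-byte burst on a 4-byte, 800 MHz channel.

```python
ACCESS_NS = 70   # latency until a read burst starts (paper's assumption)
BURST_NS = 20    # assumed: 64-byte burst over a 4-byte, 800 MHz bus

def total_time_ns(n_requests, n_banks):
    if n_banks == 1:
        # one bank: every request waits for the previous one entirely
        return n_requests * (ACCESS_NS + BURST_NS)
    # assume enough banks to overlap all accesses; the shared bus
    # still serializes the bursts themselves
    return ACCESS_NS + n_requests * BURST_NS

print(total_time_ns(8, 1))  # 720 ns fully serialized
print(total_time_ns(8, 8))  # 230 ns with bank-level concurrency
```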


Decreasing channel latency

Due to channel contention:
  Back-to-back read requests
  Reads arriving during precharge
  Narrow channels
  Large data burst sizes


Addressing System Overhead

Bus turnaround time

Dead cycles due to asymmetric read/write shapes

Queuing overhead:
  Coalescing queued requests
  Dynamic re-prioritization of requests


Timing Assumptions

10 ns to transmit the address

70 ns until the burst starts on a read

40 ns until a write can start
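A worked example combining these assumptions with the 800 MHz bus speed from the results section; the 64-byte burst and 4-byte bus width are one point in the paper's design space, and treating the 70 ns as starting after address transmission is my assumption:

```python
ADDR_NS = 10.0               # address transmission (slide's assumption)
READ_LAT_NS = 70.0           # time until the read burst starts
CYCLE_NS = 1000.0 / 800.0    # 1.25 ns per bus cycle at 800 MHz

def read_time_ns(burst_bytes, bus_width_bytes):
    beats = burst_bytes // bus_width_bytes   # bus cycles in the burst
    return ADDR_NS + READ_LAT_NS + beats * CYCLE_NS

# 64-byte burst on a 4-byte channel: 10 + 70 + 16 * 1.25 = 100 ns
print(read_time_ns(64, 4))
```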


Split Transaction Bus Assumptions

Overlapping supported for:
  Back-to-back reads
  Back-to-back read/write pairs


Burst Ordering, Coalescing

Critical bursts first, non-critical bursts second, writes last

A read that follows a write to the same address can be coalesced with the buffered write
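A minimal sketch of this policy (my illustration; the class names and structure are not from the paper):

```python
from dataclasses import dataclass

@dataclass
class Request:
    kind: str                # "read" or "write"
    addr: int
    critical: bool = False   # is a stalled load waiting on this burst?

def priority(req: Request) -> int:
    if req.kind == "read":
        return 0 if req.critical else 1   # critical bursts go first
    return 2                              # writes go last

def schedule(queue):
    # stable sort preserves arrival order within each priority class
    return sorted(queue, key=priority)

def coalesce_read(queue, read_addr):
    # a read can be satisfied from a queued write to the same address
    for req in reversed(queue):           # newest matching write wins
        if req.kind == "write" and req.addr == read_addr:
            return req
    return None

q = [Request("write", 0x100), Request("read", 0x200, critical=True),
     Request("read", 0x300)]
print([hex(r.addr) for r in schedule(q)])   # ['0x200', '0x300', '0x100']
print(coalesce_read(q, 0x100) is not None)  # True: read coalesced
```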


Bit Addressing & Page Policy

Bit assignments chosen to exploit page mode and maximize the degree of memory concurrency:
  Most significant bits identify the smallest-scale component in the system
  Least significant bits identify the largest-scale component in the system
  Allows sequential addresses to be striped across channels, maximizing concurrency (see the sketch below)

Close-page auto-precharge policy
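A minimal decoding sketch, assuming 2 channels, 4 banks per channel, and 64-byte bursts; the field widths are illustrative, not the paper's exact mapping:

```python
BURST_BITS = 6     # 64-byte burst offset
CHANNEL_BITS = 1   # 2 channels
BANK_BITS = 2      # 4 banks per channel

def decode(addr):
    offset = addr & ((1 << BURST_BITS) - 1)
    addr >>= BURST_BITS
    channel = addr & ((1 << CHANNEL_BITS) - 1)  # LSBs -> largest-scale unit
    addr >>= CHANNEL_BITS
    bank = addr & ((1 << BANK_BITS) - 1)
    addr >>= BANK_BITS
    row_and_column = addr                       # MSBs -> smallest-scale unit
    return channel, bank, row_and_column, offset

# sequential 64-byte bursts alternate channels first, then banks
for a in range(0, 256, 64):
    channel, bank, _, _ = decode(a)
    print(hex(a), "-> channel", channel, "bank", bank)
```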


Simulation Environment

SimpleScalar (used in 6810)

2 GHz clock

L1 caches: 64KB/64KB, 2-way set associative

L2 cache: unified 1MB, 4-way set associative, 10-cycle access time

Lockup-free caches using miss status holding registers (MSHRs)


Timing Calculations

The CPU-only component is determined by running a second simulation with perfect primary memory (data available on the next cycle); the remaining execution time is attributed to the memory system
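A minimal sketch of that decomposition, with made-up cycle counts standing in for the two simulation runs:

```python
full_cycles = 1_450_000_000     # hypothetical run with the real DRAM model
perfect_cycles = 700_000_000    # hypothetical run with perfect memory
instructions = 500_000_000

cpu_cpi = perfect_cycles / instructions               # processor component
memory_cpi = (full_cycles - perfect_cycles) / instructions
print(f"CPU CPI {cpu_cpi:.2f}, memory-system CPI {memory_cpi:.2f}")
```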


Results – Degrees of Freedom

Bus speed: 800 MHz
Bus width: 1, 2, 4, 8 bytes
Channels: 1, 2, 4
Banks/channel: 1, 2, 4, 8
Queue size: infinite, 0, 1, 2, 8, 16, 32
Turnaround: 0, 1 cycles
R/W shapes: symmetric, asymmetric
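Enumerating that design space (the parameter values are the slide's; the code structure is my illustration):

```python
from itertools import product

bus_widths = [1, 2, 4, 8]           # bytes; bus speed fixed at 800 MHz
channels = [1, 2, 4]
banks_per_channel = [1, 2, 4, 8]
queue_sizes = ["inf", 0, 1, 2, 8, 16, 32]
turnaround_cycles = [0, 1]
rw_shapes = ["symmetric", "asymmetric"]

configs = list(product(bus_widths, channels, banks_per_channel,
                       queue_sizes, turnaround_cycles, rw_shapes))
print(len(configs))   # 4 * 3 * 4 * 7 * 2 * 2 = 1344 configurations
```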


Results – Execution Times

Assumes an infinite request queue

System parameters can lead to widely varying CPI


Results – Turnaround and Banks

Turnaround accounts for only 5% of system-related overhead

Banks/channel accounts for a 1.2x to 2x variation, showing that concurrency is important

Latency accounts for about 50% of CPI


Results – Burst Length vs. BW

Accounts for 10-30% of execution time

Wider channels perform best with larger bursts

Narrow channels perform best with smaller bursts


Results - Concurrency


Results – Concurrency (Cont.)

Increasing the number of banks typically improves performance, though often only modestly

Many narrow channels are risky because the application might not have much inherent concurrency

Best configurations: 1 channel x 4 bytes with 64-byte bursts, 2 channels x 2 bytes with 64-byte bursts, and 1 channel x 4 bytes with 128-byte bursts

Performance varies depending on the concurrency of the benchmark


Results – Concurrency (Cont.)

“We find that, in a uniprocessor setting, concurrency is very important, but it is not more important than latency. . . . However, we find that if, in an attempt to increase support for concurrent transactions, one interleaves very small bursts or fragments the DRAM bus into multiple channels, one does so at the expense of latency, and this expense is too great for the levels of concurrency being produced.”



Results – Request Queue Size

How queuing benefits system performance:
  Sub-blocks of different read requests can be interleaved
  Writes can be buffered until read-burst traffic has died down
  Read and write requests may be coalesced

Applications with significant write activity see more benefit from queuing; bzip has many more writes than gcc

Anomalies are attributed to requests with temporal locality going to the same bank; with a small queue they are delayed


Conclusions

Tuning system-level parameters can improve memory system performance by 40%:
  Bus turnaround: 5-10%
  Banks: 1.2x-2x
  Burst length vs. bandwidth: 10-30%
  Concurrency

Interleaving smaller bursts is not a good idea: the latency cost outweighs the concurrency gained


Our Project

Evaluate the effect of mat array size on the power and latency of DRAM chips

Simulators: Cacti, DRAMSim, Simics

Predicted results:
  Positive:
    Decreased memory latency
    Decreased power profile
    Increased DIMM parallelism
  Negative:
    Decreased row-buffer hit rates
    Decreased memory capacity (for the same chip area)
    Increased cost/bit, an important metric


How project relates to the paper

Both try to decrease memory system bottlenecks, although we have evaluated the bottlenecks differently

Jacob indirectly showed the importance of minimizing DRAM latency

DRAM latency was the largest portion of CPI, so Amdahl's law justifies reducing latency (see the worked example below)

Both solutions could work together synergistically
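A worked Amdahl's-law check, using the roughly 50% latency share reported earlier (the specific speedup numbers are my own illustration):

```python
def amdahl_speedup(fraction, component_speedup):
    # fraction: share of execution time affected (here, DRAM latency)
    return 1.0 / ((1.0 - fraction) + fraction / component_speedup)

# halving DRAM latency when it is 50% of CPI:
print(amdahl_speedup(0.5, 2.0))   # ~1.33x overall
# even eliminating it entirely caps the overall gain at 2x:
print(amdahl_speedup(0.5, 1e9))   # ~2.0x
```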


Additional thoughts

The current path of DRAM innovation has limitations

DRAM chips and DIMMs need to undergo fundamental changes, and this could be one step

Helps power efficiency

Can be balanced with cost effectiveness

Partially addresses the memory gap


Questions

Questions?