
Memory Controller Design Under Cloud Workloads

Mostafa Mahmoud
Electrical and Computer Engineering
University of Toronto
Toronto, ON, Canada
[email protected]

Andreas Moshovos
Electrical and Computer Engineering
University of Toronto
Toronto, ON, Canada
[email protected]

Abstract

This work studies the behavior of state-of-the-art memory controller designs when executing scale-out workloads. It considers memory scheduling techniques, memory page management policies, the number of memory channels, and the address mapping scheme used. Experimental measurements demonstrate: 1) Several recently proposed memory scheduling policies are not a good match for these scale-out workloads. 2) The relatively simple First-Ready First-Come-First-Served (FR-FCFS) policy performs consistently better. 3) For most of the studied workloads, the even simpler First-Come-First-Served (FCFS) scheduling policy is within 1% of FR-FCFS. 4) Increasing the number of memory channels offers negligible performance benefits; for example, performance improves by only 1.7% on average for 4 channels vs. 1 channel. 5) 77%-90% of DRAM row activations are accessed only once before closure. These observations can guide future development and optimization of memory controllers for scale-out workloads.

1. Introduction

Today there is an increasing demand for cloud services such as media streaming, social networks, and search engines. Cloud service providers have been expanding their computing infrastructure, adding more data centers comprising many computing systems. The performance and power consumption of these systems dictate how capable these data centers are. Accordingly, improving the performance and power of these systems can greatly improve overall data center efficiency. The first step in improving these systems is understanding their behavior in order to identify inefficiencies and opportunities for improvement.

The majority of modern server processors are built using the same microarchitecture as the one used for consumer designs, with minor tweaks. There is already evidence suggesting that these architectures are suboptimal and that simpler processors tend to perform better for scale-out workloads [1], [2], [3]. Ferdman et al. performed an extensive study and instrumentation of CloudSuite, a representative set of cloud applications [4], and discovered significant over-provisioning, at the level of microarchitectural structures and the cache and memory hierarchy, in currently widely used server processors. Their study focused on microarchitectural characteristics and identified several inefficiencies, including long instruction-fetch stalls, under-utilized instruction-level parallelism (ILP) and memory-level parallelism (MLP) structures, last-level cache (LLC) ineffectiveness, an over-sized L2 cache, reorder buffer and load-store queue under-utilization, and off-chip bandwidth over-provisioning. Our work builds upon this past study and focuses on the memory controller, a component that has not yet been covered in detail.

Specifically, this work studies the implications of scale-out workload characteristics on memory controller design aspects, including memory scheduling techniques, DRAM page management policies, the number of memory channels, and the indexing scheme used to access memory channels. Over the last decade, researchers and processor vendors have been developing ever more sophisticated on-chip memory controllers that consider a wide range of system status attributes in scheduling decisions [5], [6], [7], [8], [9], [10]. Furthermore, there has been a trend to increase the number of memory channels, with four on-chip channels being available on high-end designs [11], [12]. Increasing the number of memory channels offers more parallelism and allows more memory to be connected to the same processor chip.

This work characterizes the performance of several state-of-the-art memory controller designs when running several modern server workloads. The emphasis is on the CloudSuite benchmarks, which are the best workloads available to us representing modern cloud applications. This study also considers traditional server workloads from SPECweb99, online transaction processing (OLTP), and decision support applications for completeness and comparison purposes. The following memory scheduling algorithms are studied: ATLAS [6], PAR-BS [7], and Reinforcement Learning [10]. Section 2.1 reviews these policies.

The following key conclusions are drawn:

• The First-Ready First-Come-First-Served (FR-FCFS) algorithm [5] outperforms the other state-of-the-art memory scheduling algorithms for all the studied server workload categories. These other techniques were designed for desktop and scientific applications with abundant memory-level parallelism (MLP) or for multi-programmed workloads of heterogeneous memory intensity. The server workloads studied here do not exhibit these characteristics, and thus memory controller design ought to be revisited to better match these workloads.

• For five out of the six studied scale-out workloads, the performance of the simpler First-Come-First-Served (FCFS) memory scheduling algorithm is within 1% of the baseline FR-FCFS. Accordingly, using this simpler policy may be best in some cases.

• Out of all DRAM row activations, 77%-90% receive only a single access before they are forced to close. Leaving a row open until a conflict happens increases memory access latency for subsequent requests, whereas proactively closing a row that will not receive any further hits reduces it (see the worked timing example after this list). Accordingly, smarter page management policies may improve overall memory access latency.

• State-of-the-art page management policies degrade performance by 4% for scale-out workloads but improve performance by 3% for decision support workloads.

• A multi-channel memory controller does not improve performance for scale-out workloads but does benefit decision support workloads. Specifically, decision support workload performance improves on average by 19% on a 4-channel system compared to a single-channel one.
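To make the latency argument above concrete, here is a back-of-the-envelope comparison of the three canonical DRAM access cases using the baseline DDR3-1600 timing parameters from Table 2 (tRP = tRCD = tCAS = 11 cycles); this is an illustrative calculation, not a measurement from the study.

    # Approximate DRAM access latency in DRAM command cycles,
    # using the baseline timings from Table 2 (tRP = tRCD = tCAS = 11).
    tRP, tRCD, tCAS = 11, 11, 11

    row_hit      = tCAS                 # row already open: column access only
    row_closed   = tRCD + tCAS          # bank precharged: activate, then read/write
    row_conflict = tRP + tRCD + tCAS    # wrong row open: precharge, activate, read/write

    print(row_hit, row_closed, row_conflict)   # 11 22 33

A row that receives only a single access therefore exposes the next, unrelated request to the full conflict latency if it is left open, whereas closing it promptly caps that request's cost at the precharged-bank latency.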

Overall, the results of this study emphasize the unique off-chip memory access demands of scale-out workloads and suggest that simpler memory controllers specialized for cloud workloads may be beneficial. A limitation of this study is that it does not directly consider energy and power consumption, focusing primarily on performance. Future work may address this limitation. However, the results of this work are valuable, as the techniques that perform best in the end are also the simplest to implement and hence would also reduce overall energy and power consumption.

The rest of this paper is organized as follows. Section 2 reviews state-of-the-art memory controller designs, including memory scheduling techniques and page management policies. Section 3 presents the experimental methodology. Section 4 reports our findings. Section 5 discusses some of the limitations of this study. Section 6 summarizes related work, and Section 7 concludes.

2. Background

This section reviews the two core memory controller components that this study characterizes: the scheduling algorithm (Section 2.1) and the memory page management policy (Section 2.2). The former decides which memory request to service next, while the latter decides, after servicing a request, whether to keep the row open or precharge it, indirectly affecting the service time of any subsequent requests.

2.1. Memory Scheduling Algorithms

A memory scheduling algorithm (MSA) specifies how the memory controller decides the next request to service given a pool of outstanding requests. The MSA's inputs are, at the very least, the state of the DRAM banks and buses, and a pool of waiting requests. Additional statistics or system attributes may also be used to guide the decision making. MSAs can target different objectives, such as memory throughput or fairness among running threads. The MSA affects the waiting time each request experiences before being serviced. Since main memory is a shared resource, requests from all cores contend, and the MSA affects both overall system and individual thread performance.
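As a frame of reference for the policies reviewed below, the Python sketch below fixes a minimal request representation and scheduler interface; the names (MemRequest, choose_next, open_rows) are our own illustrative choices and do not come from the paper or from any particular simulator.

    from dataclasses import dataclass

    @dataclass
    class MemRequest:
        core_id: int    # issuing core
        bank: int       # target DRAM bank
        row: int        # target row within that bank
        arrival: int    # cycle at which the request reached the controller
        is_write: bool = False

    # In this framing, an MSA is a rule that, given the pool of waiting requests
    # and the per-bank open-row state, picks the request to service next.
    def choose_next(queue: list, open_rows: dict) -> MemRequest:
        raise NotImplementedError   # each policy supplies its own selection rule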

The FCFS scheduling algorithm services requests in the order they arrive at the memory controller. FCFS does not exploit row-buffer locality, as it does not try to reorder requests to increase the row-buffer hit rate. If cores generate memory requests at different rates, cores with a low memory-intensity access stream can suffer from starvation and long memory access latencies. FCFS is the simplest memory scheduling technique, and its hardware overhead and power consumption are the lowest among those we study. We evaluate FCFS_Banks, a variant of FCFS that maintains separate per-bank request queues and thus can exploit bank-level parallelism.

FR-FCFS [5] separates requests into two groups depending on whether they will hit in the currently open row and, within each group, orders requests according to age. It prioritizes hits over misses, and older over younger requests. FR-FCFS's goal is to increase memory throughput. FR-FCFS can lead to starvation and low overall throughput when there is imbalance in the row-buffer locality of the access streams across cores [7], [9], [13].
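Using the illustrative MemRequest representation above, the FCFS and FR-FCFS selection rules can be sketched as follows (a simplified model that ignores DRAM timing constraints, write draining, and bank busy states):

    def fcfs_choose(queue, open_rows):
        """FCFS: strict arrival order, no reordering for row-buffer locality."""
        return min(queue, key=lambda r: r.arrival) if queue else None

    def frfcfs_choose(queue, open_rows):
        """FR-FCFS: prefer requests that hit in a currently open row (first-ready),
        then pick the oldest among the candidates (first-come-first-served)."""
        if not queue:
            return None
        hits = [r for r in queue if open_rows.get(r.bank) == r.row]
        candidates = hits if hits else queue
        return min(candidates, key=lambda r: r.arrival)

FCFS_Banks would apply the FCFS rule independently to each per-bank queue, so a request to an idle bank is not stuck behind an older request to a busy bank.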

Parallelism-Aware Batch Scheduling (PAR-BS) [7] targets maintaining fairness among cores and reducing the average stall time through batching and ranking. Batching groups the oldest N requests from each core into a batch that is prioritized over the remaining requests. After batch formation, the cores within the batch are ranked using a shortest-job-first criterion, where the core with the minimum number of requests to any bank is considered the core with the shortest job. Ranking minimizes the average waiting time across all cores.
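A minimal sketch of the batching and ranking steps described above (the per-batch service order, row-hit handling, and DRAM timing are omitted; n_per_core corresponds to the Batching-Cap of 5 in Table 3):

    from collections import Counter

    def form_batch(queue, n_per_core=5):
        """Group the oldest n_per_core requests of each core into the current batch."""
        batch = []
        for core in {r.core_id for r in queue}:
            oldest = sorted((r for r in queue if r.core_id == core),
                            key=lambda r: r.arrival)[:n_per_core]
            batch.extend(oldest)
        return batch

    def rank_cores(batch):
        """Shortest-job-first ranking: a core's job length is its maximum number
        of batched requests to any single bank; fewer means a higher rank."""
        job_len = {}
        for core in {r.core_id for r in batch}:
            per_bank = Counter(r.bank for r in batch if r.core_id == core)
            job_len[core] = max(per_bank.values())
        return sorted(job_len, key=job_len.get)    # highest-ranked core first

Requests from the batch are served before any others, and within the batch the controller favors the highest-ranked core.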

Adaptive per-Thread Least-Attained-Service memory scheduling (ATLAS) [6] is based on the observation that cores that require less memory service time are more vulnerable to interference from cores that require more. ATLAS divides time into 10M-cycle quanta, tracking the total memory service time used by each core during each quantum as follows: every cycle, the attained service time (ATS) of a core is increased by the number of banks servicing its requests. At the start of the subsequent quantum, the memory controller ranks the cores according to their ATS, ranking those with less ATS higher. This policy exploits bank-level parallelism and reduces average latency across cores. ATLAS prevents starvation by prioritizing requests that have been waiting for more than a threshold of T cycles.
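A sketch of the attained-service bookkeeping and end-of-quantum ranking, with the quantum length and starvation threshold taken from Table 3 (the cross-quantum bias factor α = 0.875 and the coordination across multiple controllers are omitted for brevity):

    QUANTUM  = 10_000_000    # cycles per quantum (Table 3)
    STARVE_T = 50_000        # starvation threshold in cycles (Table 3)

    class AtlasState:
        def __init__(self, num_cores):
            self.ats   = [0] * num_cores          # attained service time per core
            self.rank  = list(range(num_cores))   # core ranking for the current quantum
            self.cycle = 0

        def tick(self, banks_serving_core):
            """banks_serving_core[c] = number of banks servicing core c this cycle."""
            for c, n in enumerate(banks_serving_core):
                self.ats[c] += n
            self.cycle += 1
            if self.cycle % QUANTUM == 0:
                # Least attained service is ranked highest for the next quantum.
                self.rank = sorted(range(len(self.ats)), key=lambda c: self.ats[c])

        def priority(self, req, now):
            starved = (now - req.arrival) > STARVE_T
            # Lower tuples win: starved requests first, then better-ranked cores, then age.
            return (not starved, self.rank.index(req.core_id), req.arrival)

Each cycle the controller would service the schedulable request with the smallest priority tuple.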

The Reinforcement Learning-based memory scheduling algorithm (RL) [10] uses reinforcement learning [14], resulting in a self-optimizing memory scheduler. RL uses a number of attributes to represent the current system state (s), such as the number of reads in the request queue, the number of writes, and the number of load requests. The action (a) that it can perform at a given cycle is precharge, activate, write, read for a load miss, read for a store miss, or no-action. For every possible state-action pair (s, a), RL associates a Q-value that represents the expected future reward if action a is executed in state s. Given the current state, RL's goal is to maximize the future reward by choosing the action with the highest Q-value. RL uses an area- and energy-efficient implementation of this reinforcement learning algorithm. With a preset probability, the algorithm executes a random action instead of the one with the highest Q-value in order to explore new areas of the search space.
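A sketch of the ε-greedy action selection and a generic one-step temporal-difference (Q-learning-style) update, using the learning rate, discount rate, and exploration probability from Table 3; the state features, reward definition, and the hardware Q-value table organization of [10] are all simplified here.

    import random

    ALPHA, GAMMA, EPSILON = 0.1, 0.95, 0.05   # learning rate, discount, exploration (Table 3)
    ACTIONS = ["precharge", "activate", "write", "read_load_miss", "read_store_miss", "nop"]

    Q = {}   # (state, action) -> expected future reward

    def choose_action(state):
        if random.random() < EPSILON:                  # occasionally explore a random action
            return random.choice(ACTIONS)
        return max(ACTIONS, key=lambda a: Q.get((state, a), 0.0))

    def update(state, action, reward, next_state):
        best_next = max(Q.get((next_state, a), 0.0) for a in ACTIONS)
        old = Q.get((state, action), 0.0)
        # Move the Q-value toward the observed reward plus the discounted future estimate.
        Q[(state, action)] = old + ALPHA * (reward + GAMMA * best_next - old)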

The Fair Queuing Memory Scheduler (FQM) [8] is a memory scheduling algorithm based on a fair queuing algorithm from computer networking. FQM's goal is to provide equal memory bandwidth to all cores. In FQM, each DRAM bank keeps a virtual time counter per core that is incremented every time a request from the corresponding core is serviced by that bank. Each bank prioritizes the core with the earliest virtual time, since this is the core that has received the least service from that bank. Later proposals outperformed FQM [7], [9], as it does not consider the long-term intensity of each core's memory access stream and does not attempt to maximize row hits. For this reason, we do not consider it further.

2.2. Page Management Policies

The page management policy decides for how long a memory row, or page, should remain open. One such policy is the open-page management policy (OPM), where a row is closed only if another request forces it to do so. OPM tends to be a good match for single-core systems, whose memory access stream often exhibits good spatial locality. This way, subsequent requests hit in the row-buffer and thus experience lower latency.

The memory controller of a many-core system observes the interleaving of several access streams and as such may not see sufficient spatial locality. In these systems, row hits are less likely, and waiting to close the page only when necessary ends up adding to the latency of the next request. In such a case, it is best to close the page as soon as possible [15]. Accordingly, the close-page management policy (CPM) immediately closes a row after accessing it.

CPM is not free of trade-offs. It suffers excess delays and wasted power when there is some locality in the interleaved access streams. Adaptive page policies, such as open-adaptive (OAPM) and close-adaptive (CAPM), have been proposed to improve upon OPM and CPM. The OAPM policy closes the row only when: 1) there are no more pending requests in the controller's queue that would hit in the open row, and 2) there are pending requests to another row. The CAPM policy closes a row as soon as there are no pending requests that would hit in the same row. Thus, OAPM speculates that near-future requests are likely to be hits.
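The four policies differ only in the condition used to close a row after servicing a request, which the sketch below makes explicit (queue is the controller's pool of pending requests, using the illustrative MemRequest fields from Section 2.1):

    def should_close(policy, bank, open_row, queue):
        """Decide whether to precharge `bank` after servicing a request to `open_row`."""
        pending_hits   = any(r.bank == bank and r.row == open_row for r in queue)
        pending_misses = any(r.bank == bank and r.row != open_row for r in queue)
        if policy == "open":             # OPM: only a conflicting request closes the row
            return False
        if policy == "close":            # CPM: always close immediately
            return True
        if policy == "close_adaptive":   # CAPM: close unless a row hit is already queued
            return not pending_hits
        if policy == "open_adaptive":    # OAPM: additionally require a queued conflict
            return (not pending_hits) and pending_misses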


Neither policy is always best. Accordingly, later work proposes switching between the two on the fly. Awasthi et al. introduced the Access-Based Page Policy (ABPP) for multi-core systems [16]. ABPP assumes that a row will receive the same number of hits as the last time it was activated. The implementation uses per-bank tables recording the most recently accessed rows and the number of hits they received last time. These tables are used to predict how long a row should stay open and are dynamically updated as the program executes. In the absence of a table entry, a row stays open until a conflict forces it to close.
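A sketch of the per-bank prediction table such a policy maintains; the table size and FIFO-style replacement are illustrative assumptions on our part, not parameters taken from [16].

    from collections import OrderedDict

    TABLE_ENTRIES = 16                       # illustrative per-bank table size

    class AbppBankTable:
        def __init__(self):
            self.last_hits = OrderedDict()   # row -> hits observed during its last activation

        def predicted_hits(self, row):
            # No entry: fall back to open-page behavior (stay open until a conflict).
            return self.last_hits.get(row, None)

        def record_closure(self, row, hits_this_activation):
            # Remember how many hits the row received; evict the oldest entry when full.
            if row in self.last_hits:
                del self.last_hits[row]
            elif len(self.last_hits) >= TABLE_ENTRIES:
                self.last_hits.popitem(last=False)
            self.last_hits[row] = hits_this_activation

On a new activation the controller would close the row once it has served the predicted number of hits; RBPP, discussed next, keeps similar information in only a few registers per bank, trading coverage for lower hardware cost.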

Shen et al. proposed the Row-Based Page Policy (RBPP), which has lower hardware overhead than ABPP [15]. RBPP exploits the observation that, over a short period of time, most memory requests target a small number of rows, even in a many-core system with different applications running on each core. RBPP uses a few most-accessed-row registers (MARRs) per bank, recording the number of hits received by recently accessed rows that have received at least one hit.

Timer-based page closure precedes RBPP and ABPP [17], [18] and predicts how long a row should stay open. The various timer-based policies differ in the granularity of the timer; e.g., some maintain a timer per bank while others maintain a global timer. Other approaches used a branch-predictor-like two-level predictor to decide whether to close an open row-buffer [19], [20]. We opted not to include these proposals in our study since they were proposed for single-core systems and would have to be adapted for many-core systems; moreover, RBPP and ABPP outperformed them [15].

3. Methodology

3.1. Workloads

Our study focuses on the scale-out server workloads represented by the CloudSuite benchmark suite [4]. CloudSuite includes Data Serving, MapReduce, SAT Solver, Web Frontend, Web Search, and Media Streaming applications. To evaluate and compare these emerging workloads to traditional server workloads, we extended our experiments to include two traditional workload categories: 1) transactional workloads, including SPECweb99 and TPC-C, where we ran TPC-C on two commercial database management systems; and 2) decision support workloads, represented by three TPC-H queries: Q2, Q6, and Q17. The three queries cover a wide range of select-intensive, join-intensive, and select-join queries.

Table 1: Categorized Workloads and Abbreviations

Category (acronym)          Workload                 Acronym
Scale-out (SCOW)            Data Serving             DS
                            MapReduce                MR
                            SAT Solver               SS
                            Web Frontend             WF
                            Web Search               WS
                            Media Streaming          MS
Transactional (TRSW)        SPECweb99                WSPEC99
                            TPC-C1 (vendor A)        TPC-C1
                            TPC-C2 (vendor B)        TPC-C2
Decision Support (DSPW)     TPC-H Q2                 TPCH-Q2
                            TPC-H Q6                 TPCH-Q6
                            TPC-H Q17                TPCH-Q17

For all workloads, we use the same benchmark configurations used by Ferdman et al. [4]. Table 1 reports the workloads, categories, and acronyms we use in the rest of this paper.

3.2. Simulation

This study used the Virtutech Simics functional simulator to model a 16-core chip multiprocessor (CMP) unless otherwise noted. The GEMS simulator [21] was used to extend Simics with an on-chip network timing model and a Ruby-based full memory hierarchy, including the memory controller and off-chip DRAM timing models. The cores run the SPARC v9 ISA.

We follow the SimFlex multiprocessor sampling technique for our simulation experiments [22]. The samples are taken over a 10-second simulated interval of each application. Each sample ran for six billion user-level instructions, where the first one billion user-level instructions were used to warm up the system, e.g., the caches, memory queues, network buffers, and interconnects. Statistics were collected for the subsequent five billion user-level instructions. TPC-H queries Q2 and Q17 were run to completion, for a total run length of roughly 2.5 billion user-level instructions, one billion of which was used for warm-up. We simulate both user-level and operating-system-level instructions, but we use the user-level instruction count divided by the total simulated cycle count as an indicator of overall system performance, a metric found by Wenisch et al. to accurately represent system throughput [22]. Measured performance metrics include committed user-level instructions per cycle (user IPC), average memory access latency, DRAM row-buffer hit rate, LLC misses per kilo instructions (L2 MPKI), and memory bandwidth utilization. The Web Frontend benchmark uses only 8 cores in the configuration that was available to us.
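Stated as a formula, the performance indicator used throughout Section 4 is

    \text{user IPC} = \frac{\text{committed user-level instructions}}{\text{total simulated cycles (user + OS)}}

which is the metric Wenisch et al. found to accurately represent system throughput [22].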


Table 2: Baseline System Configuration

CMP Organization        16-core Scale-Out Processor pod
Core                    In-order @ 2GHz
L1-I/D caches           32KB each, 64B blocks, 2-way
Shared L2 cache         4MB, unified, 16-way, 64B blocks, 4 banks
Interconnect            16x4 crossbar
Memory Controller       FR-FCFS scheduling, open-adaptive page policy, 1 channel, 11.9GB/s bandwidth, RoRaBaCoCh address mapping
Off-chip DRAM           32-64GB, DDR3-1600 (800MHz), 2 ranks, 8 banks per rank, 8KB row-buffer
tCAS-tRCD-tRP-tRAS      11-11-11-28
tRC-tWR-tWTR-tRTP       39-12-6-6
tRRD-tFAW (in cycles)   5-24


3.3. Baseline System Configuration

The baseline system is a 16-core in-order CMP with only two levels of on-chip caches, based on the scale-out processor design recommendations proposed by Lotfi-Kamran et al. [3]. In that study, pods of 16-32 in-order cores were found to achieve the highest performance density for scale-out workloads. The chip features a modestly sized 4MB LLC to capture the instruction working set and shared OS data. Table 2 details the architectural configuration, and Section 4.3 explains the address mapping abbreviation listed.

4. Results

Section 4.1 compares the memory scheduling techniques in terms of overall system performance, average memory access latency, and row-buffer hit rate. Section 4.2 evaluates how efficiently DRAM page management policies predict when to close a row-buffer and when to keep it open, a study motivated by the observations in Section 4.2.1. Section 4.3 investigates the effectiveness of using multi-channel memory controllers.

4.1. Memory Scheduling Study

We evaluate the baseline FR-FCFS scheduling algorithm [5], [23], [24] as well as the state-of-the-art memory scheduling techniques PAR-BS [7], ATLAS [6], and RL [10]. We also evaluate the simplest algorithm, FCFS_Banks. Results are normalized to the baseline FR-FCFS unless otherwise stated. Table 3 shows the configurations used for PAR-BS, ATLAS, and RL.

Table 3: Scheduling Algorithm Configurations

Algorithm   Parameter                         Value
PAR-BS      Batching-Cap                      5
ATLAS       Quantum                           10M cycles
            α (bias to current quantum)       0.875
            Starvation threshold              50K cycles
RL          # of Q-value tables               32
            Q-value table size                256 Q-values
            α (learning rate)                 0.1
            γ (discount rate)                 0.95
            ε (random action probability)     0.05
            Starvation threshold              10K cycles

Figure 1: User IPC normalized to FR-FCFS. (Bars compare FR-FCFS, FCFS_Banks, PAR-BS, ATLAS, and RL for each workload and category average.)


4.1.1. Performance. Figure 1 shows the user IPC normalized to FR-FCFS. The results show that, under server workloads and especially SCOW applications, FR-FCFS outperforms PAR-BS, ATLAS, and RL. The only exceptions are TPC-C2 and Media Streaming, where the aforementioned policies perform as well as FR-FCFS.

Performance with ATLAS is lower, more so for SCOW, which suffers a 20% average drop in performance compared to FR-FCFS. ATLAS, with its long quantum period of 10 million cycles, causes some cores to be unfairly given lower priority for long periods. For example, in MapReduce and Web Frontend some cores had 50% lower IPC than others, resulting in overall performance degradations of 52% and 21%, respectively. By comparison, the lowest per-core IPC with FR-FCFS is within 85% of the highest per-core IPC for these two applications. A large IPC disparity across cores is also responsible for the 33% performance loss of SPECweb99. On average, ATLAS achieves performance that is 20%, 12%, and 10% lower for SCOW, TRSW, and DSPW, respectively.

RL also performs worse than FR-FCFS, more so for DSPW, where the loss is 10%. DSPW's access patterns tend to be more random than those of conventional desktop and scientific applications and thus challenge the RL exploration process, which introduces overhead when activated frequently.


Figure 2: Row-buffer hit rate. (FR-FCFS, FCFS_Banks, PAR-BS, ATLAS, and RL for each workload and category average.)


The simpler FCFS_Banks performs closer to FR-FCFS than the other policies. FCFS_Banks' average performance is within 6%, 3%, and 4% of FR-FCFS for SCOW, TRSW, and DSPW, respectively. For five out of the six SCOW workloads, FCFS_Banks matches the performance of FR-FCFS. This can be explained by Figure 2, which shows that the row-buffer hit rate changes by only -4%, +1%, and -2% for SCOW, TRSW, and DSPW, respectively, with FCFS_Banks. FR-FCFS' benefit over FCFS_Banks is that it favors row hits. For these workloads, the cores either are not concurrently competing for the same bank or they access the same row in the same bank at the same time. Accordingly, there is little benefit from reordering the requests heading to the same bank. Web Frontend is the exception, as its performance is 37% lower with FCFS_Banks. This workload exhibits a row-buffer hit rate of 55% with FR-FCFS, which drops to 45% with FCFS_Banks, leading to a 17% increase in average memory access latency, as Figure 2 and Figure 3 show. Compared to FR-FCFS, FCFS_Banks does not need to scan all the request queues every cycle searching for a row hit to promote. As a result, it is simpler to implement and would require less energy. However, overall system energy may increase unless performance stays the same.

4.1.2. Memory Access Latency Sensitivity. The average memory access latency in Figure 3 correlates with the changes in user IPC. However, the sensitivity of performance to memory access latency differs across workloads. For example, DSPW is less sensitive to average memory latency than the Web Frontend workload; while DSPW suffers a 15% increase in memory access latency with FCFS_Banks, this translates to only a 4% reduction in IPC. DSPW exhibits some memory-level parallelism, which hides most of the increase in memory access latency under FCFS_Banks.

Figure 3: Average memory access latency normalized to FR-FCFS. (FR-FCFS, FCFS_Banks, PAR-BS, ATLAS, and RL; bars exceeding the plotted range are annotated with their values.)

Figure 4: L2 misses per kilo user instructions (MPKI). (FR-FCFS, FCFS_Banks, PAR-BS, ATLAS, and RL for each workload and category average.)


ATLAS suffers a significant increase in memory access latency, which is more pronounced for SCOW at 2.94x on average and up to 7.78x for MapReduce. This increase in memory latency translates into a 20% performance loss on average. DSPW is also negatively impacted by ATLAS' long-term ranking scheme and RL's exploratory approach, leading to 28% and 37% longer memory access latencies, respectively.

The MPKI measurements in Figure 4 indicate that SCOW and TRSW have relatively low memory intensity, exhibiting average L2 MPKIs of 5 and 8, respectively. DSPW exhibits a higher average MPKI of around 18. This corroborates the different off-chip memory bandwidth demands of scale-out workloads reported in earlier work [4].

4.1.3. Request Queue Length. The memory controller's average read and write queue occupancies (lengths) are shown in Figure 5 and Figure 6, respectively. None of the memory scheduling techniques ever needed more than a 10-entry read queue and a 50-entry write queue. On average, DSPW is more demanding than SCOW, with MapReduce under ATLAS being the exception. MapReduce's behavior is due to the 7.78x increase in memory access latency discussed in Section 4.1.2. The observed queue lengths are far lower than those used in previous work [25], [26], [27]. This may be due to the relatively simple in-order cores used here.


Figure 5: Average read queue length. (FR-FCFS, FCFS_Banks, PAR-BS, ATLAS, and RL for each workload and category average.)

Figure 6: Average write queue length. (FR-FCFS, FCFS_Banks, PAR-BS, ATLAS, and RL for each workload and category average.)


RL exhibits noticeably lower write queue lengths than the rest. The other techniques switch from a read phase to a write phase only when the write queue length is above a certain threshold, in order to reduce how often the data bus direction-switching penalty is incurred. RL considers both reads and writes when it selects the next memory request to serve and bases its decision on the best strategy it has learned so far. This gives the memory controller the freedom to serve write requests whenever it can steal a few cycles between critical memory reads.

Figure 7: Memory bandwidth utilization. (FR-FCFS, FCFS_Banks, PAR-BS, ATLAS, and RL for each workload and category average.)

4.1.4. Off-Chip Bandwidth Utilization. Figure 7 shows the off-chip bandwidth utilization per workload. Utilization under SCOW ranges from 14% to 50% of the available peak bandwidth, with an average of 34%. TRSW exhibits a similar average, while DSPW has higher off-chip access demands, with an average utilization of 54%. The measured average memory bandwidth utilization motivates the study in Section 4.3, which considers the impact of introducing additional memory channels.

4.1.5. Summary of Memory Scheduling Study Results. A technique as simple as FR-FCFS proved best for the workloads studied under the pod-based in-order processor design proposed by Lotfi-Kamran et al. [3]. For the server workloads, and more so for the scale-out workloads, FR-FCFS outperformed the other state-of-the-art memory scheduling techniques. These techniques were designed to outperform FR-FCFS for desktop and parallel applications such as SPEC CPU2006 [28] and SPLASH-2 [29] in environments where fairness among threads or cores is a concern, which is not the case here. Scale-out workloads exhibit relatively low MLP and different access patterns compared to scientific applications. Moreover, the cores used here are relatively simple, as advised by Lotfi-Kamran et al., who found that these workloads do not exhibit ILP as high as SPEC and PARSEC do. Combined, these factors explain the differences in relative performance across the various memory scheduling policies. Even a simple FCFS scheduler that exploits bank-level parallelism is within 6% of FR-FCFS in all cases and matches its performance for most.

The analysis showed that relatively short read and write queues are sufficient for these workloads, primarily due to their relatively low MLP. Finally, off-chip memory bandwidth utilization was shown to be relatively low, suggesting that additional memory channels may not be needed if performance is the only consideration.


Figure 8: Percentage of single-access row-buffer activations under OAPM.


4.2. Page Management Policies Study

4.2.1. Activated Row-Buffer Reuse. The baseline OAPM policy is often preferred for workloads that exhibit significant row-buffer locality. As Figure 2 shows, with this policy the average row-buffer hit rates are relatively low, at 37%, 33%, and 27.5% for SCOW, TRSW, and DSPW, respectively. To reason about the access patterns that might result in such row-buffer hit rates, we studied histograms of how frequently an activated row-buffer is reused before being precharged. As Figure 8 shows, all workloads access 77%-90% of their activated rows only once. Keeping a row open as long as possible will thus often prolong subsequent accesses, which must first wait for the row to be closed. This observation suggests that the CAPM policy might be better. CAPM precharges the row-buffer directly after the column access if no more hits are waiting in the queue. A high percentage of single-access row activations is not necessarily at odds with a high row-buffer hit rate. For example, in Media Streaming 76% of the row activations observe only a single access, while the remaining 24% experience a high number of hits.

Figure 9 shows that CAPM exhibits much lower row-buffer hit rates, below 6% for all workloads. This indicates that a significant fraction of the hits were due to the optimistic OAPM. One would expect this reduction in row-buffer hits to hurt performance and memory access latency. In practice, the performance gained from closing single-access rows early under CAPM compensates for the performance lost to fewer row-buffer hits. Figure 10 shows that, under CAPM, the average memory access latency did not change for SCOW, while it decreased by 4% and 13% for TRSW and DSPW. Performance was reduced by 2.5% for SCOW but improved by 4% for DSPW, as Figure 11 shows. Web Frontend and Media Streaming, which exhibited the highest row-buffer hit rates under OAPM, suffered a 15% increase in access latency, and Web Frontend lost 20% of its performance. The relatively higher MPKI and MLP of Media Streaming limited its performance loss to 1%. Data Serving, MapReduce, and SAT Solver saw access latency improvements of 8%, 12%, and 9%, and performance improvements in the range of 1% to 2%.

Figure 9: Row-buffer hit rate for different page management policies normalized to OAPM. (Open Adaptive, Close Adaptive, RBPP, and ABPP for each workload and category average.)

Figure 10: Average memory access latency for different page management policies normalized to OAPM. (Open Adaptive, Close Adaptive, RBPP, and ABPP for each workload and category average.)


4.2.2. Predictive Page Management Policies. The results of the previous section motivated us to study the behavior of state-of-the-art predictive page management policies. This section compares RBPP [15] and ABPP [16] to the CAPM and OAPM policies in terms of memory access latency, row-buffer hit preservation, and user IPC.

As Figure 9 shows, RBPP preserves 70%, 75%, and 86% of the row-buffer hits for SCOW, TRSW, and DSPW, respectively. ABPP generally preserves fewer row-buffer hits, more so for DSPW. Figure 10 shows that DSPW's access latency is reduced by 6% with RBPP, leading to a 3% increase in user IPC in Figure 11. IPC for SCOW and TRSW degrades by 4% and 2% under RBPP. This correlates with the losses in row-buffer hits in Figure 9.


Figure 11: User IPC for different page management policies normalized to OAPM. (Open Adaptive, Close Adaptive, RBPP, and ABPP for each workload and category average.)


The RBPP and ABPP policies favor capturing row-buffer hits over timely closure of single-access row activations. As a result, they do not avoid most of the penalty due to late closure of rows, and they fail to capture some of the row hits when they prematurely close a row. Overall performance is equal to or slightly lower than with OAPM. These policies were designed and tested for desktop applications such as SPEC CPU2006.

4.2.3. Summary of Page Management Policies Results. This section found that scale-out workloads exhibit a high percentage of single-access row activations. Thus, smarter page management policies are needed to close pages in a timely manner in order to achieve the following two goals: 1) capturing as many row hits as possible, and 2) closing a page as early as necessary to avoid penalizing subsequent accesses to other rows.

4.3. Multi-Channel Memory Systems Study

Modern server processors incorporate several memory channels [11], [12]. This section considers the performance impact of using multi-channel memory controllers, a study motivated by the relatively low memory bandwidth utilization of the studied workloads. Specifically, this section studies the impact of integrating 2 and 4 memory channels in terms of performance, memory access latency, and row-buffer hit rate.

Figure 12: Normalized user IPC as the number of memory channels increases. (1-, 2-, and 4-channel configurations for each workload and category average.)

Table 4: The best performing multi-channel mapping scheme for each workload

Workload            2-channel       4-channel
Data Serving        RoRaBaChCo      RoRaChBaCo
MapReduce           RoRaChBaCo      RoChRaBaCo
SAT Solver          RoChRaBaCo      RoRaChBaCo
Web Frontend        RoChRaBaCo      RoRaBaCoCh
Web Search          RoRaChBaCo      RoRaBaChCo
Media Streaming     RoChRaBaCo      RoChRaBaCo
WSPEC99             RoRaBaCoCh      RoRaChBaCo
TPC-C1              RoRaBaChCo      RoChRaBaCo
TPC-C2              RoRaChBaCo      RoChRaBaCo
TPC-H Q2            RoRaBaChCo      RoChRaBaCo
TPC-H Q6            RoRaChBaCo      RoChRaBaCo
TPC-H Q17           RoRaBaChCo      RoChRaBaCo

The address mapping scheme, that is, how physical addresses are mapped across the off-chip memory system, can impact overall performance. For this reason, we studied a number of address mapping schemes that differ in which address bits select the DRAM channel (Ch), column (Co), bank (Ba), rank (Ra), and row (Ro). The schemes studied are RoRaBaCoCh (the baseline mapping), RoRaBaChCo, RoRaChBaCo, and RoChRaBaCo. In the interest of space, we report the results of the best performing scheme per workload and the averages. The best performing scheme for each workload is reported in Table 4.
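To make the scheme names concrete, the following sketch decodes a physical address under the baseline RoRaBaCoCh ordering for the configuration in Table 2, assuming a 2-channel system for illustration (1 channel bit, 1 rank bit, 3 bank bits, 64B cache blocks, and an 8KB row-buffer, i.e., 128 block-sized columns per row); the helper and field widths are our own derivation, not code from the study.

    BLOCK_BITS = 6    # 64B cache-block offset (below the DRAM mapping fields)
    CH_BITS    = 1    # 2 channels (illustrative)
    CO_BITS    = 7    # 8KB row / 64B blocks = 128 block-sized columns
    BA_BITS    = 3    # 8 banks per rank
    RA_BITS    = 1    # 2 ranks

    def decode_RoRaBaCoCh(addr):
        """Split a physical address into Row|Rank|Bank|Column|Channel fields,
        listed most-significant to least-significant above the block offset."""
        a = addr >> BLOCK_BITS
        ch = a & ((1 << CH_BITS) - 1); a >>= CH_BITS
        co = a & ((1 << CO_BITS) - 1); a >>= CO_BITS
        ba = a & ((1 << BA_BITS) - 1); a >>= BA_BITS
        ra = a & ((1 << RA_BITS) - 1); a >>= RA_BITS
        return dict(row=a, rank=ra, bank=ba, col=co, channel=ch)

    # Consecutive 64B blocks land on different channels under this ordering:
    print(decode_RoRaBaCoCh(0x0000)["channel"], decode_RoRaBaCoCh(0x0040)["channel"])   # 0 1

Because the channel bits sit just above the block offset, sequential accesses that would have hit in the same open row on a 1-channel system are spread across channels, which is consistent with the row-buffer locality loss reported for RoRaBaCoCh below.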

The baseline scheme, RoRaBaCoCh, generally had the worst performance because accesses that would be row hits in the 1-channel system may now map to a different channel; the scheme alternates successive cache blocks between the memory channels, which means that sequential accesses do not map to the same DRAM row. As shown in Figure 13, RoRaBaChCo, RoRaChBaCo, and RoChRaBaCo exhibit better row-buffer hit rates than the baseline. Figure 12 shows that integrating more on-chip memory channels does not significantly enhance SCOW's performance and in some cases hurts it. The performance of Web Frontend dropped by 10% and 9% for the multi-channel systems. The highest gains, of 6% and 8.5% for the 2- and 4-channel configurations, were observed for Data Serving. The average improvement for SCOW was below 1% and 1.7% for the 2- and 4-channel systems. DSPW behaves differently and exhibits 11.5% and 19% average performance improvements for the 2- and 4-channel systems. Finally, TRSW also exhibits average performance improvements of 2.3% and 6% on the 2- and 4-channel systems.


Figure 13: Normalized row-buffer hit rates as the number of memory channels increases. (1-, 2-, and 4-channel configurations for each workload and category average.)

Figure 14: Normalized memory access latency as the number of memory channels increases. (1-, 2-, and 4-channel configurations for each workload and category average.)


Figure 13 and Figure 14 explain why performance did not improve for SCOW and TRSW, in contrast to DSPW. SCOW and TRSW improved the least in terms of row-buffer hit rates and memory access latencies. For both categories, the average row-buffer hit rate increased by 1.3x and 1.6x for the 2- and 4-channel systems, and the average memory access latency of SCOW decreased to 81% and 70% of the baseline; TRSW experiences similar reductions. Meanwhile, DSPW's average hit rates increased by 1.7x and 2.3x on the 2- and 4-channel systems, respectively, and its memory access latency decreased to 64% and 47% of the baseline. Along with the relatively higher MPKI of DSPW, shown in Figure 4, this explains the difference in IPC gains between DSPW and SCOW when multiple channels are used.

Despite the improvements in row-buffer hits and memory access latency, Web Frontend's performance dropped by around 10%. This workload exhibited an 11% and 25% increase in the total number of memory accesses for the 2- and 4-channel systems. The extra memory references are mostly DMA/IO and atomic memory requests, which hit in the row buffers and reduce average latency. However, they increase contention and latency for critical accesses, thus hurting performance. The reported user IPC, a metric of user-level progress per cycle, is expected to go down due to the extra congestion from these additional accesses.

Conclusion: Under the pod-based in-order processor design proposed by Lotfi-Kamran et al. [3], integrating multi-channel memory controllers on-chip, with their higher die area and power consumption, does not improve the performance of scale-out workloads. One channel can satisfy the low off-chip memory bandwidth demand of scale-out workloads under the studied pod configuration. However, increasing the core count would lead to higher demand for off-chip bandwidth, which could benefit from multiple channels.

5. Limitations of this study

The focus of this study is limited to the pod-based in-order scale-out processor design proposed by Lotfi-Kamran et al. [3]. Although their study also proposed an out-of-order scale-out processor design, we studied the in-order design, as it was demonstrated to have higher performance density, i.e., throughput per unit area. Aggressive out-of-order designs might lead to different conclusions about how simple the memory scheduling technique should be and about the needed off-chip memory bandwidth, due to a potential increase in the MLP generated under such architectures.

Lotfi-Kamran et al. assumed a 270 mm2 die, which was estimated to fit three 32-core pods and six memory controllers. This work studies the memory system requirements for one such pod, representative of a lower-end server processor. Future work should consider higher pod counts.

The study did not include the TCM [30] memory scheduling policy, which also targets fairness; our experiments with ATLAS and PAR-BS showed that fairness is not an issue for scale-out workloads.


The study includes a subset of the possible address mapping schemes and did not consider additional permutation-based interleaving schemes. However, the results presented here identify performance deficiencies that could guide such a future study.

This study is limited to re-evaluating previous proposals for memory scheduling algorithms, page management policies, and multi-channel memory controllers. While no novel solution is presented, this study provides directions for simplifying or improving these designs to better match the needs of emerging server workloads. Finally, this study focused solely on performance. Energy and power are equally important considerations. The performance results presented here can be useful for such a future study.

6. Related Work

Rixner et al. introduced the FR-FCFS scheduling technique as well as other techniques that give preference to column access commands, maximizing row-buffer hits and memory throughput [5], [23]. Rixner targeted web server workloads from SPECweb99, which exhibit different memory access behavior than the now widespread scale-out applications. We targeted different workloads and emphasized directions that could simplify memory controller design.

Natarajan et al. studied the impact of several memory controller design aspects on server processor performance [31]. The study investigated open-page vs. close-page policies, in-order vs. out-of-order memory request scheduling, and memory rank interleaving. However, the study was limited to shared-bus multiprocessor architectures and was based on synthetic random address traffic. Our study covers a wider range of memory controller design aspects, includes state-of-the-art proposals for each, and studies full applications.

Barroso et al. introduced Piranha [32], an early architecture that balanced complexity, performance, and energy to better meet the demands of server applications. PicoServer [1] is a simplified-core CMP design for server workloads that also favors relatively small on-chip caches. PicoServer relies on 3D die-stacking technology to provide low-latency DRAM access.

Abts et al. studied intelligent memory controller placement in many-core server processors [33]. The study proposed a diamond placement within the many-core tiles along with an enhanced routing algorithm for requests and replies, namely class-based deterministic routing, to avoid the hot spots caused by the memory controllers. Our study is orthogonal to Abts' placement research, as we investigate the microarchitecture of memory controllers that better suits the needs of emerging cloud workloads.

Hardavellas et al. proposed using CMPs for scale-out server workloads by exploiting heterogeneous architectures and dark silicon to power up only the application-suitable cores [34]. The work was extended by Ferdman et al. [4], where scale-out workloads were shown to behave differently than traditional server workloads. The comparisons showed that commodity server processors are over-provisioned in several ways. Building upon that study, Lotfi-Kamran et al. introduced a pod-based CMP design that increases performance density [3].

7. Conclusions

Previous work characterized the behavior of scale-out workloads and its impact on core architecture and the on-chip memory hierarchy.

This work builds upon these past studies by also studying the off-chip memory access characteristics of scale-out workloads and their interaction with state-of-the-art memory controller policies. The results showed that relatively simple memory scheduling algorithms work best for these workloads, outperforming more advanced algorithms tuned for desktop workloads. The design of these advanced algorithms needs to be revisited for scale-out workloads. We found that scale-out workloads exhibit poor row-buffer locality and, thus, better memory page management policies are needed to take advantage of the locality that exists while avoiding delays for the majority of requests, which are row conflicts.

Finally, additional memory channels did not significantly improve the performance of scale-out workloads. However, other scheduling policies, a different memory mapping, a different core architecture, or different data sets could have led to a different conclusion.

References

[1] T. Kgil, S. D'Souza, A. Saidi, N. Binkert, R. Dreslinski, T. Mudge, S. Reinhardt, and K. Flautner, "PicoServer: Using 3D stacking technology to enable a compact energy efficient chip multiprocessor," in Proceedings of the 12th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XII, 2006, pp. 117–128.

[2] V. Janapa Reddi, B. C. Lee, T. Chilimbi, and K. Vaid, "Web search using mobile cores: Quantifying and mitigating the price of efficiency," in Proceedings of the 37th Annual International Symposium on Computer Architecture, ISCA '10, pp. 314–325.

[3] P. Lotfi-Kamran, B. Grot, M. Ferdman, S. Volos, O. Kocberber, J. Picorel, A. Adileh, D. Jevdjic, S. Idgunji, E. Ozer, and B. Falsafi, "Scale-out processors," in Proceedings of the 39th Annual International Symposium on Computer Architecture, ISCA '12, pp. 500–511.

[4] M. Ferdman, A. Adileh, O. Kocberber, S. Volos, M. Alisafaee, D. Jevdjic, C. Kaynak, A. D. Popescu, A. Ailamaki, and B. Falsafi, "Clearing the clouds: A study of emerging scale-out workloads on modern hardware," in Proceedings of the 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS XVII, 2012, pp. 37–48.

[5] S. Rixner, W. J. Dally, U. J. Kapasi, P. Mattson, and J. D. Owens, "Memory access scheduling," in Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pp. 128–138.

[6] Y. Kim, D. Han, O. Mutlu, and M. Harchol-Balter, "ATLAS: A scalable and high-performance scheduling algorithm for multiple memory controllers," in Proceedings of the 16th Annual International Symposium on High Performance Computer Architecture, HPCA '10, Jan. 2010, pp. 1–12.

[7] O. Mutlu and T. Moscibroda, "Parallelism-aware batch scheduling: Enhancing both performance and fairness of shared DRAM systems," in Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pp. 63–74.

[8] K. J. Nesbit, N. Aggarwal, J. Laudon, and J. E. Smith, "Fair queuing memory systems," in Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, 2006, pp. 208–222.

[9] O. Mutlu and T. Moscibroda, "Stall-time fair memory access scheduling for chip multiprocessors," in Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, 2007, pp. 146–160.

[10] E. Ipek, O. Mutlu, J. F. Martínez, and R. Caruana, "Self-optimizing memory controllers: A reinforcement learning approach," in Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pp. 39–50.

[11] "Intel Xeon Processor E5-2699 v4," http://ark.intel.com/products/91317/Intel-Xeon-Processor-E5-2699-v4-55M-Cache-2_20-GHz.

[12] "AMD Opteron 6300 Series Processors," http://www.amd.com/en-us/products/server/opteron/6000/6300.

[13] T. Moscibroda and O. Mutlu, "Memory performance attacks: Denial of memory service in multi-core systems," in Proceedings of the 16th USENIX Security Symposium, SS'07, pp. 18:1–18:18.

[14] R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed. MIT Press, 1998.

[15] X. Shen, F. Song, H. Meng, S. An, and Z. Zhang, "RBPP: A row based DRAM page policy for the many-core era," in Proceedings of the 20th International Conference on Parallel and Distributed Systems, ICPADS '14, Dec. 2014, pp. 999–1004.

[16] M. Awasthi, D. W. Nellans, R. Balasubramonian, and A. Davis, "Prediction based DRAM row-buffer management in the many-core era," in Proceedings of the 2011 International Conference on Parallel Architectures and Compilation Techniques, PACT '11, pp. 183–184.

[17] S. Kareenahalli, Z. Bogin, and M. Shah, "Adaptive idle timer for a memory device," Jun. 21, 2005, US Patent 6,910,114.

[18] C. Teh, S. Kareenahalli, and Z. Bogin, "Dynamic update adaptive idle timer," Oct. 4, 2007, US Patent App. 11/394,461.

[19] Y. Xu, A. S. Agarwal, and B. T. Davis, "Prediction in dynamic SDRAM controller policies," in Proceedings of the 9th International Workshop on Embedded Computer Systems: Architectures, Modeling, and Simulation, SAMOS '09, pp. 128–138.

[20] S.-I. Park and I.-C. Park, "History-based memory mode prediction for improving memory performance," in Proceedings of the 2003 International Symposium on Circuits and Systems, ISCAS '03, vol. 5, May 2003, pp. V-185–V-188.

[21] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood, "Multifacet's general execution-driven multiprocessor simulator (GEMS) toolset," SIGARCH Comput. Archit. News, vol. 33, no. 4, pp. 92–99, Nov. 2005.

[22] T. F. Wenisch, R. E. Wunderlich, M. Ferdman, A. Ailamaki, B. Falsafi, and J. C. Hoe, "SimFlex: Statistical sampling of computer system simulation," IEEE Micro, vol. 26, no. 4, pp. 18–31, Jul. 2006.

[23] S. Rixner, "Memory controller optimizations for web servers," in Proceedings of the 37th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 37, 2004, pp. 355–366.

[24] W. Zuravleff and T. Robinson, "Controller for a synchronous DRAM that maximizes throughput by allowing memory requests and commands to be issued out of order," May 13, 1997, US Patent 5,630,096.

[25] S. A. Przybylski, Cache and Memory Hierarchy Design: A Performance-Directed Approach. Morgan Kaufmann Publishers Inc., 1990.

[26] C. Sangani, M. Venkatesan, and R. Ramesh, "Phase aware memory scheduling," Stanford University, Electrical Engineering Department, Tech. Rep., 2013.

[27] G. Liu, X. Zhang, D. Wang, Z. Liu, and H. Wang, "Security memory system for mobile device or computer against memory attacks," in Trustworthy Computing and Services, Communications in Computer and Information Science, vol. 426. Springer Berlin Heidelberg, 2014, pp. 1–8.

[28] J. L. Henning, "SPEC CPU2006 benchmark descriptions," SIGARCH Comput. Archit. News, vol. 34, no. 4, pp. 1–17, Sep. 2006.

[29] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta, "The SPLASH-2 programs: Characterization and methodological considerations," in Proceedings of the 22nd Annual International Symposium on Computer Architecture, ISCA '95, pp. 24–36.

[30] Y. Kim, M. Papamichael, O. Mutlu, and M. Harchol-Balter, "Thread cluster memory scheduling: Exploiting differences in memory access behavior," in Proceedings of the 43rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 43, Dec. 2010, pp. 65–76.

[31] C. Natarajan, B. Christenson, and F. Briggs, "A study of performance impact of memory controller features in multi-processor server environment," in Proceedings of the 3rd Workshop on Memory Performance Issues, WMPI '04, pp. 80–87.

[32] L. A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese, "Piranha: A scalable architecture based on single-chip multiprocessing," in Proceedings of the 27th Annual International Symposium on Computer Architecture, ISCA '00, pp. 282–293.

[33] D. Abts, N. D. Enright Jerger, J. Kim, D. Gibson, and M. H. Lipasti, "Achieving predictable performance through better memory controller placement in many-core CMPs," in Proceedings of the 36th Annual International Symposium on Computer Architecture, ISCA '09, pp. 451–461.

[34] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki, "Toward dark silicon in servers," IEEE Micro, vol. 31, no. 4, pp. 6–15, Jul. 2011.