
Exploring Opportunities for Non-Volatile Memories in Big Data Applications

Wei Wei
Institute of Computing Technology, Chinese Academy of Sciences,
University of Chinese Academy of Sciences
[email protected]

Dejun Jiang
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Jin Xiong
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Mingyu Chen
Institute of Computing Technology, Chinese Academy of Sciences
[email protected]

Abstract
A large-capacity memory system allows big data applications to load as much data as possible for in-memory processing, which improves application performance. However, DRAM faces both scalability and energy challenges due to its inherent charging mechanism. Thus, a DRAM-based memory system incurs excessive cost to meet both the capacity and energy requirements of emerging big data workloads. Fortunately, non-volatile memories (NVMs) are emerging with better scalability and lower leakage power. Integrating NVMs into main memory is non-trivial, as NVMs have a few weaknesses, such as asymmetric read and write latency and power. Designing a memory system comprising both DRAM and NVMs requires understanding the memory access behaviors of big data applications. In this paper, we first investigate the memory access patterns of both typical big data workloads and traditional parallel workloads. By doing so, we show the read/write intensity as well as the temporal/spatial locality of big data workloads. We then replay memory access traces of big data applications on a DRAM simulator and a PCM simulator, respectively. We explore design implications of hybrid memory comprising DRAM and PCM.

Keywords  Big data benchmark, memory performance, phase change memory, energy consumption

1. Introduction
Emerging big data applications need to process high volumes of data. For instance, Facebook keeps 75% of its non-image data in main memory [1]. Loading as much data into memory as possible avoids time-consuming disk I/O operations and thus accelerates application performance.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
BPOE '14, March 1st, 2014, Salt Lake City, Utah, USA.
Copyright © 2014 Authors and ICT

A large-capacity memory system is thus expected for running big data applications. However, a DRAM-based memory system faces both scalability and energy challenges due to its working mechanism. It is considered hard for DRAM to scale down to 20nm [22]. Furthermore, DRAM consumes 20% to 40% of the total server energy in data centers [12]. As such, a DRAM-based memory system incurs excessive cost to provide the large capacity needed to run big data applications. Fortunately, non-volatile memories, such as Phase Change Memory [14], Resistive RAM [8], and STT-RAM [6], are emerging as alternative candidates to DRAM-based memory. Compared to DRAM, NVMs have higher density without leakage energy. A few NVMs have operation performance comparable with DRAM. Therefore, NVMs are desirable to serve as a large-capacity memory system for big data applications.

However, NVMs also have a few drawbacks. For example, PCM has asymmetric read/write energy consumption, and the write latency of PCM is higher than that of DRAM. When adopting NVMs in a memory system for running big data applications, one needs to understand the memory access patterns of big data applications. Memory access patterns, such as stage bandwidth, read/write ratio, and temporal and spatial locality, provide design implications for an NVM-based memory system. Without understanding these access patterns, one may allocate PCM space for write-intensive data objects, which degrades memory performance due to the relatively slow PCM writes. Although there are some studies [4, 7, 13] characterizing big data workloads, the opportunities of NVMs in the big data area have not been examined, as far as we know.

In this paper, we explore the opportunities of NVMs in big data applications. We conduct a set of experiments to study the detailed memory access patterns of typical big data applications. By gathering memory traces of big data applications, we analyze the stage bandwidth, read/write ratio, temporal locality, and spatial locality. In addition, we develop a PCM simulator based on DRAMSim2 [17] and replay the memory access traces on the PCM simulator. By doing so, we provide design implications for integrating PCM into the memory system. The contributions of this paper are as follows.

1. We conduct experiments to analyze the memory access patterns of both big data workloads and traditional parallel workloads. We observe that big data workloads usually have similar read intensity and write intensity. In addition, we observe that big data workloads have weaker temporal and spatial locality compared to traditional ones.

2. We use a PCM simulator to explore the opportunities of NVMs, especially PCM, to serve as the memory system for running big data applications. We show the great opportunity for NVMs to reduce energy consumption in big data applications.

3. We propose the adoption of DRAM-PCM hybrid memory as a desirable design choice when integrating NVMs into main memory. We argue that this design choice can fully utilize the advantages of PCM while reducing the negative impact of its long write latency. To make the design more practicable, we suggest that carefully designed data placement is essential.

The rest of the paper is organized as follows. Section 2 presents related work. In Section 3, we describe the applications and methods used in our experimental evaluation. Section 4 shows the memory access pattern analysis results as well as the simulation results on both PCM and DRAM memory. We conclude the paper in Section 5.

2. Related work
2.1 Using NVMs in computer memory architecture
Compared to DRAM, NVMs have higher density. Meanwhile, NVMs are promising to provide read/write performance comparable to DRAM, and even SRAM. Therefore, a few research works integrate NVMs into the cache hierarchy or main memory to increase capacity and reduce energy cost. However, current NVMs also have several weaknesses, such as high dynamic energy and high write latency. Smullen et al. [18] address this issue by redesigning the MTJ to use a smaller free layer. Li et al. [11] adopt STT-RAM together with SRAM to construct a hybrid adaptive on-chip cache that provides low power consumption and low access latency. Recent works also focus on combining a small amount of DRAM and a large amount of PCM to provide a low-latency and low-energy memory system [5, 9, 16]. These works mainly focus on the optimization techniques needed for PCM to be adopted in main memory, and they target traditional workloads. In contrast, this paper mainly focuses on the opportunities of NVMs in big data applications.

2.2 The role of NVMs
Similar to our work, a few recent works also explore the role of NVMs for different types of workloads. Li et al. [10] perform a detailed analysis of the access patterns of memory objects in stack, heap, and global data for real large-scale scientific applications. Their results reveal many opportunities for using NVRAM in scientific applications. Caulfield et al. [3] explore several options for connecting solid-state storage to the host system to evaluate the role of NVMs at the storage level for high-performance and IO-intensive computing. Van Essen et al. [19] develop an NVM simulator to model the impact of future generations of I/O-bus-attached NVM on HPC application performance at scale. This paper differs from these works in that we mainly evaluate the performance and energy improvement when one adopts NVMs in main memory for big data workloads.

2.3 Analysis of big data workloads
Recently, a few research efforts have been made to characterize and understand the features of big data workloads. Chang et al. [4] examine the implications of big data workloads on system design. Jia et al. [7] characterize the micro-architectural characteristics of big data workloads on systems equipped with modern superscalar out-of-order processors. Dimitrov et al. [13] profile the memory access patterns of big data workloads, such as memory footprint, CPI, and cache misses. However, these works do not examine the opportunities of NVMs (e.g., PCM) for big data workloads.

3. Applications and Methodology
In this section, we present the evaluated applications and the evaluation method.

3.1 Applications
We select 5 workloads from BigDataBench 2.1 [20] as the big data workload set; they are representative of big data scenarios. Table 1 shows the features of these workloads. The versions of Hadoop, Hive, and JDK are 1.0.4, 0.11, and 1.7.0, respectively. We also select two applications, radiosity and volrend, from the SPLASH-2 benchmark [21] as the traditional parallel workload set. We conduct experiments using the two sets of workloads and compare their memory access patterns.

3.2 Methodology
We first gather memory traces of the 7 workloads to analyze their memory access patterns. For each workload, we measure stage bandwidth, read counts, and write counts. The stage bandwidth is measured every 10 million requests at runtime, which shows the bandwidth trend of each workload. We calculate the total bandwidth, read bandwidth, and write bandwidth for each stage. We calculate the read-to-write ratio rwRatio for each page to show the write intensity of each workload. Since memory access patterns are related to CPU cache design, we further analyze the temporal locality of each workload by counting the access intervals of each page. We also count the page references and the reference distribution for each workload to show its spatial locality.

We use a 5-node cluster to run the big data workloads, in which one node is the master and the others are slave nodes. Each node is configured with 4 Intel Xeon E5 processors, 8GB of DDR3 memory, and a 1TB disk. We use a Linux operating system with the 2.6.18 kernel. The two traditional workloads run on one node within the cluster. We use a lightweight hardware tool, HMTT [2], to gather memory traces. The memory trace for a workload includes the timestamp of each memory access, the accessed memory address, and the access size. We develop tools to derive evaluation metrics from the raw data.
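To make the derivation of these metrics concrete, the sketch below shows one way to compute stage bandwidth, the per-page rwRatio, and per-page average access intervals from a trace of (timestamp, address, size, is_write) records. The record layout, the 4 KB page size, and the function name analyze() are illustrative assumptions; the actual HMTT trace format and our in-house tools differ in detail.

```python
from collections import defaultdict

PAGE_SIZE = 4096            # assumed 4 KB pages
STAGE_LEN = 10_000_000      # one stage = 10 million requests, as in Section 3.2

def analyze(trace):
    """trace: iterable of (timestamp_ns, address, size_bytes, is_write)."""
    reads = defaultdict(int)          # per-page read counts
    writes = defaultdict(int)         # per-page write counts
    last_access = {}                  # per-page timestamp of previous access
    intervals = defaultdict(list)     # per-page access intervals (ns)
    stages = []                       # (read_bytes, write_bytes, duration_ns) per stage
    stage_rd = stage_wr = 0
    stage_start = None

    for i, (ts, addr, size, is_write) in enumerate(trace):
        page = addr // PAGE_SIZE
        if is_write:
            writes[page] += 1
            stage_wr += size
        else:
            reads[page] += 1
            stage_rd += size
        if page in last_access:
            intervals[page].append(ts - last_access[page])
        last_access[page] = ts

        if stage_start is None:
            stage_start = ts
        if (i + 1) % STAGE_LEN == 0:      # close the current stage
            stages.append((stage_rd, stage_wr, ts - stage_start))
            stage_rd = stage_wr = 0
            stage_start = None

    # rwRatio per page: read count / write count (pages with no writes map to inf)
    rw_ratio = {p: reads[p] / writes[p] if writes[p] else float("inf")
                for p in set(reads) | set(writes)}
    avg_interval = {p: sum(v) / len(v) for p, v in intervals.items()}
    return stages, rw_ratio, avg_interval
```

Dividing the per-stage byte counts by the stage duration yields the read, write, and total bandwidth curves plotted in Section 4.1.1.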

In order to explore the opportunities of NVMs for big data workloads, we use the collected memory traces to evaluate application performance and energy consumption when adopting PCM as the memory system. We develop a PCM simulator, PCMSim, based on DRAMSim2 [17] to simulate running the workloads on PCM-based memory. The PCM simulator receives the collected memory traces as input and outputs statistical results, such as access latency and energy. To compare the performance and energy results with a DRAM-based memory system, we also replay the memory traces on the DRAMSim2 simulator. Table 2 shows the configurations of the two simulators.
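For intuition only, the following sketch shows a first-order version of such a trace-driven comparison: each request is charged a fixed per-access latency and dynamic energy taken from Table 2. This simplification is an assumption, not PCMSim or DRAMSim2 themselves; it ignores bank parallelism, row-buffer behavior, request queuing, and DRAM refresh/background energy, all of which the real simulators model.

```python
# First-order replay: charge each request a fixed device latency and energy.
# Parameters follow Table 2; composing array latency + burst latency per access
# is a simplifying assumption, not the simulators' actual timing model.
DEVICES = {
    "DRAM": {"read_ns": 20.0,  "write_ns": 20.0,   "burst_ns": 9.33,
             "read_pj": 23.76, "write_pj": 28.28,  "burst_pj": 23.55},
    "PCM":  {"read_ns": 36.28, "write_ns": 210.54, "burst_ns": 6.47,
             "read_pj": 20.86, "write_pj": 20.0,   "burst_pj": 7.36},
}

def replay(trace, device):
    """Return (average latency in ns, total dynamic energy in pJ) for a trace."""
    p = DEVICES[device]
    total_ns = total_pj = n = 0
    for _ts, _addr, _size, is_write in trace:
        if is_write:
            total_ns += p["write_ns"] + p["burst_ns"]
            total_pj += p["write_pj"] + p["burst_pj"]
        else:
            total_ns += p["read_ns"] + p["burst_ns"]
            total_pj += p["read_pj"] + p["burst_pj"]
        n += 1
    return total_ns / n, total_pj
```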

4. Experimental results
In this section, we first show the analysis results of the memory access patterns of both big data workloads and traditional parallel workloads. Then, we present the simulation results by comparing running the workloads on PCM memory and DRAM memory separately.

4.1 Memory access patterns
4.1.1 Stage bandwidth
Figure 1 shows the stage bandwidths of 6 workloads. The bandwidth result of workload join query is similar to that of workload select query. Thus, we only show the result of workload select query.


Application Scenarios | Applications | Data Type | Data Size | Software Stack | Application Type | Memory Footprint per Node
Relational Query | Join Query | Table | 7.1GB | Hive | Offline Analytics | 7688MB
Relational Query | Select Query | Table | 7.1GB | Hive | Offline Analytics | 7867MB
Search Engine | PageRank | Graph | 4.2GB | Hadoop | Offline Analytics | 7892MB
Social Network | K-means | Graph | 4.1GB | Hadoop | Offline Analytics | 7283MB
Micro Benchmarks | Terasort | Text | 2GB | Hadoop | Offline Analytics | 7215MB
Traditional Applications | Volrend | N/A | N/A | N/A | Parallel Applications | 6289MB
Traditional Applications | Radiosity | N/A | N/A | N/A | Parallel Applications | 7314MB

Table 1. Features of Applications

Feature | DRAM | PCM
Device: Device width | 16 | 16
Device: Rank size | 512 MB | 512 MB
Latency: Burst latency | 9.33 ns | 6.47 ns
Latency: Read latency | 20 ns | 36.28 ns
Latency: Write latency | 20 ns | 210.54 ns
Energy: Burst read energy | 23.55 pJ | 7.36 pJ
Energy: Burst write energy | 23.55 pJ | 7.36 pJ
Energy: Array read energy | 23.76 pJ | 20.86 pJ
Energy: Array write energy | 28.28 pJ | 20 pJ
Others: Row buffer policy | close | close
Others: Transaction queue | 32 | 32
Others: Command queue | 32 | 32

Table 2. Features of DRAMSim2 and PCMSim

For workload select query, its write bandwidth is higher than the read bandwidth in the first half of the stages. Its write bandwidth then drops and becomes close to the read bandwidth in the second half. On average, all other workloads have higher read bandwidth than write bandwidth. In particular, Figure 1(e) shows that the traditional workload radiosity becomes read-dominant in the second half of its stages, and Figure 1(f) shows that workload volrend is read-intensive during the whole run. We further calculate the ratio of maximum read bandwidth to maximum write bandwidth for each workload. The ratio results show that big data workloads have similar counts of read and write operations. For example, the ratio results of workloads Kmeans, Terasort, and Pagerank are 3.27, 32.41, and 3.12, respectively. In contrast, the ratio result of workload radiosity is 135.25. Therefore, traditional parallel workloads can exhibit a read-intensive feature, while big data workloads usually exhibit similar read intensity and write intensity.

4.1.2 rwRatio
In addition to calculating the ratio of read bandwidth to write bandwidth, we further calculate the rwRatio of each page for all workloads. By doing so, we can investigate the read/write features of these workloads at a finer granularity. We plot the CDF curves of rwRatios for all workloads in Figure 2(a). We observe that the rwRatio of each page is between 0.01 and 100 in almost all workloads. As such, we show the enlarged portion of rwRatio between 0.01 and 100 in Figure 2(b), and the enlarged portion between 0.1 and 10 in Figure 2(c).

Among the traditional workloads, volrend has 60% of pages whose rwRatios are between 1 and 10, and 35% of pages whose rwRatios are greater than 100. Therefore, it is desirable to allocate pages in PCM for workload volrend to achieve near-DRAM read performance while keeping energy consumption low. However, the rwRatio results of the big data workloads are much more diverse. For example, workload join query has a very wide distribution of rwRatio: the pages whose rwRatios are greater than 10 and those smaller than 0.1 each occupy 20%. Therefore, these pages are expected to be allocated in PCM and DRAM, respectively. The rwRatios of the other big data workloads are mostly between 0.1 and 10, which indicates that most pages have roughly equal read and write references. For example, workloads Pagerank and Kmeans both have nearly 75% of pages whose rwRatios are between 1 and 10.
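A natural way to act on these per-page distributions is a threshold policy that steers read-dominated pages to PCM and write-dominated pages to DRAM. The sketch below illustrates such a policy with the thresholds of 10 and 0.1 discussed above; it is an illustrative design sketch, not a mechanism implemented or evaluated in this paper.

```python
def place_pages(rw_ratio, read_heavy=10.0, write_heavy=0.1):
    """Assign pages to PCM, DRAM, or an 'either' pool based on rwRatio.

    rw_ratio: dict mapping page number -> read/write reference ratio.
    Pages read far more than written tolerate PCM's slow writes;
    write-heavy pages stay in DRAM; the rest can go to either pool.
    """
    placement = {}
    for page, ratio in rw_ratio.items():
        if ratio >= read_heavy:
            placement[page] = "PCM"
        elif ratio <= write_heavy:
            placement[page] = "DRAM"
        else:
            placement[page] = "either"
    return placement
```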

4.1.3 Temporal and spatial locality
Figure 3(a) plots the CDF curves of the average access interval of pages in all workloads. In most workloads except kmeans, 80% of pages have an average access interval below 500 ms. In order to make a more detailed analysis, we plot the enlarged distributions of average access interval between 0 and 100 ms in Figure 3(b) and between 0 and 20 ms in Figure 3(c). For workloads radiosity and volrend, 60% of pages have an average access interval below 2.5 ms. In contrast, no more than 10% of pages have average access intervals below 5 ms in kmeans and pagerank. Similarly, workloads terasort, select query, and join query have about 20% of pages whose average access intervals are below 5 ms. As such, compared to traditional workloads, big data workloads have weaker temporal locality, which calls for changes in CPU last-level cache algorithm design.

In order to analyze the spatial locality of these workloads, we calculate the references to each page in all workloads. We sort the pages according to their reference counts. We then calculate the percentage of pages that contribute 80% of the total accesses for each workload. We show the results in Figure 4. Less than 10% of pages contribute 80% of references for workloads hive-join and hive-select, which exhibit strong spatial locality. Hybrid memory comprising DRAM and NVM either employs NVM as an extension of DRAM or places DRAM in front of NVM as a cache. As such, the hybrid memory architecture employing DRAM as a cache is more appropriate for the Relational Query workloads, such as hive-join and hive-select. For workload pagerank, 80% of the references are contributed by more than 35% of the pages, which exhibits weak spatial locality. Considering the high volume of data processed, a big data workload with weak spatial locality requires a carefully designed cache policy to avoid degraded performance due to disk IO overheads.
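The spatial-locality metric used here, the fraction of pages that together account for 80% of all references, can be computed from per-page reference counts as sketched below (the function name and the 80% coverage parameter are illustrative).

```python
def hot_page_fraction(ref_counts, coverage=0.80):
    """Fraction of pages that together contribute `coverage` of all references.

    ref_counts: dict mapping page number -> reference count.
    A small return value (e.g. below 0.10) indicates strong spatial locality.
    """
    counts = sorted(ref_counts.values(), reverse=True)  # hottest pages first
    target = coverage * sum(counts)
    running = 0
    for i, c in enumerate(counts, start=1):
        running += c
        if running >= target:
            return i / len(counts)
    return 1.0
```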

4.2 PCM simulation results
In this section, we explore the opportunities of NVMs as main memory to run big data applications. We use PCM as the candidate memory device. We show the energy and average latency results when directly running big data workloads on both the DRAM and PCM memory systems. We use the conventional memory management stack when using PCM memory. We use DRAM memory as the baseline and normalize all PCM results to the baseline.


Figure 1. Stage bandwidth of workloads. Panels: (a) kmeans, (b) terasort, (c) pagerank, (d) select, (e) radiosity, (f) volrend. Each panel plots total, read, and write bandwidth (GB/s) over execution stages.

Figure 2. CDF of rwRatio among pages. Panels: (a) global, (b) partial (0.01-100), (c) partial (0.1-10). Each panel plots the CDF of rwRatio per page for kmeans, terasort, pagerank, hive-select, hive-join, radiosity, and volrend.

Figure 3. CDF of average access interval. Panels: (a) global, (b) partial (0-100 ms), (c) partial (0-20 ms). Each panel plots the CDF of the average access interval per page (ms) for kmeans, terasort, pagerank, hive-select, hive-join, radiosity, and volrend.


Figure 4. The percentages of the pages contributing 80% of references (y-axis: percentage of pages, 0%-45%).

Figure 5. Comparison of energy consumption (normalized energy, 0%-100%; DRAM vs. PCM).

Figure 6. Average latency comparison in DRAM and PCM (normalized latency, 0-4).

4.2.1 Energy and Latency
Figure 5 shows the energy results of all workloads running on the two types of memory devices. Since PCM has no refresh energy consumption, all workloads achieve lower energy cost on PCM. Among the big data workloads, the maximum energy reduction is 74.28% for terasort, while the minimum energy reduction still reaches 70.98% for kmeans. Big data applications typically require large-capacity memory to accelerate data processing. By adopting low-energy NVMs in memory, one can not only fully utilize the high density of NVM chips to increase capacity, but also significantly reduce the energy cost of the memory system and thus the total cost of data-center servers.

However, as shown in Figure 6, the average latency observed on PCM memory is higher than that on DRAM memory for all workloads. The minimum average latency increase is observed for workload kmeans, whose latency on PCM is 1 times higher than on DRAM.

Figure 7. Comparison of ED and ED2. Panels: (a) normalized ED (0-1.2), (b) normalized ED2 (0-9); DRAM vs. PCM.

In particular, the average latencies of hive-join and hive-select increase by up to 3.6 times compared to the baseline. This is because both workloads have a large fraction of write operations (as shown in Section 4.1.1), and the long write latency of PCM correspondingly increases their average latencies. Although PCM provides better scalability and lower energy for the memory system, the long write latency limits its application in real-world systems. However, note that we employ the conventional memory management stack on top of the underlying PCM memory. Moreover, we only evaluate performance with a memory model comprising only PCM. In fact, in order to avoid the negative impact of the long write latency of PCM, one can use a hybrid memory architecture to exploit the advantages of both DRAM and PCM while reducing the impact of their disadvantages. Correspondingly, one should carefully design data placement policies when using hybrid memory, instead of simply applying the current memory management stack.

4.2.2 ED and ED2
We further plot the ED (energy × delay) and ED2 (energy × delay²) results for all workloads on DRAM and PCM. These two metrics show the tradeoff between performance gain and energy reduction, and are used in studies such as [15, 16]. Figure 7 shows the values of ED and ED2 for the two memory devices. In terms of ED, the workloads with significant read operations perform better on PCM, such as kmeans, terasort, and pagerank. This is because the read latency of PCM is close to that of DRAM, so read-intensive workloads benefit more from the energy reduction, which offsets the overhead of increased latency. However, for the write-intensive workloads hive-join and hive-select, the increased write latency becomes the main factor reducing memory system efficiency. Figure 7(b) shows the ED2 results. It is obvious that the long write latency of PCM significantly reduces the efficiency of the memory system. For example, the ED2 value of workload hive-join on PCM is 8.3 times higher than on DRAM. For latency-sensitive workloads, it is not a good design choice to completely replace DRAM memory with PCM memory even though PCM can provide larger capacity. We suggest that one should consider a hybrid memory system to allow more practical usage of PCM.
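For concreteness, the sketch below shows how normalized ED and ED2 follow from the normalized energy and latency of a workload; the numbers in the example are placeholders, not measurements from our experiments.

```python
def normalized_ed_metrics(energy_pcm, latency_pcm, energy_dram, latency_dram):
    """Return (ED, ED^2) of PCM normalized to the DRAM baseline."""
    e = energy_pcm / energy_dram      # normalized energy
    d = latency_pcm / latency_dram    # normalized average latency
    return e * d, e * d * d

# Placeholder values, not measured results: a workload whose PCM energy is
# 30% of DRAM's but whose average latency is 3x higher.
ed, ed2 = normalized_ed_metrics(0.3, 3.0, 1.0, 1.0)   # ed = 0.9, ed2 = 2.7
```

As the example suggests, ED can still favor PCM when the energy saving outweighs the latency increase, whereas ED2 penalizes the latency increase quadratically.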

5. Conclusion
This paper examines the opportunities of NVMs for big data workloads, which were not explored in previous research. We perform memory access pattern analysis of both emerging big data workloads and traditional parallel workloads. We investigate the read-to-write ratios as well as the temporal and spatial locality of both workload sets. We show that big data workloads exhibit weak temporal and spatial locality compared to traditional workloads. In order to explore the opportunities of NVMs as a memory system for running big data workloads, we further compare the performance and energy results of directly running big data workloads on DRAM memory and PCM memory. We evaluate memory efficiency by replaying memory access traces to DRAM and PCM simulators, respectively. By doing so, we observe a great opportunity for NVMs to significantly reduce the energy consumption of big data workloads. We also propose a hybrid memory architecture as a desirable design choice for integrating NVMs into the main memory system. This design choice can exploit the advantages of NVMs while avoiding their drawbacks.

6. Acknowledgement
This work is supported by the National Basic Research Program of China under grant no. 2011CB302502, the National Science Foundation of China under grants no. 61379042 and no. 61202063, and Huawei Research Program YB2013090048.

References

[1] International Technology Roadmap for Semiconductors: Emerging Research Devices. http://www.itrs.net, 2011.

[2] Y. Bao, M. Chen, Y. Ruan, L. Liu, J. Fan, Q. Yuan, B. Song, and J. Xu. HMTT: A platform independent full-system memory trace monitoring system. In Proceedings of the 2008 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems, pages 229-240, 2008.

[3] A. M. Caulfield, J. Coburn, T. Mollov, A. De, A. Akel, J. He, A. Jagatheesan, R. K. Gupta, A. Snavely, and S. Swanson. Understanding the impact of emerging non-volatile memories on high-performance, IO-intensive computing. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1-11, 2010.

[4] J. Chang, K. T. Lim, J. Byrne, L. Ramirez, and P. Ranganathan. Workload diversity and dynamics in big data analytics: Implications to system designers. In Proceedings of the 2nd Workshop on Architectures and Systems for Big Data, pages 21-26, 2012.

[5] G. Dhiman, R. Ayoub, and T. Rosing. PDRAM: A hybrid PRAM and DRAM main memory system. In Proceedings of the 46th ACM/IEEE Design Automation Conference, pages 664-669, 2009.

[6] B. Dieny, R. Sousa, S. Bandiera, M. Castro Souza, S. Auffret, B. Rodmacq, J. Nozieres, J. Herault, E. Gapihan, I. Prejbeanu, et al. Extended scalability and functionalities of MRAM based on thermally assisted writing. In 2011 IEEE International Electron Devices Meeting, pages 1-3, 2011.

[7] Z. Jia, L. Wang, J. Zhan, L. Zhang, and C. Luo. Characterizing data analysis workloads in data centers. CoRR, 1307.8013, 2013.

[8] A. Kawahara, R. Azuma, Y. Ikeda, K. Kawai, Y. Katoh, K. Tanabe, T. Nakamura, Y. Sumimoto, N. Yamada, N. Nakai, S. Sakamoto, Y. Hayakawa, K. Tsuji, S. Yoneda, A. Himeno, K. Origasa, K. Shimakawa, T. Takagi, T. Mikawa, and K. Aono. An 8Mb multi-layered cross-point ReRAM macro with 443MB/s write throughput. In Proceedings of the 2012 IEEE International Solid-State Circuits Conference, pages 432-434, 2012.

[9] H. G. Lee, S. Baek, C. Nicopoulos, and J. Kim. An energy- and performance-aware DRAM cache architecture for hybrid DRAM/PCM main memory systems. In Proceedings of the 29th IEEE International Conference on Computer Design, pages 381-387, 2011.

[10] D. Li, J. S. Vetter, G. Marin, C. McCurdy, C. Cira, Z. Liu, and W. Yu. Identifying opportunities for byte-addressable non-volatile memory in extreme-scale scientific applications. In Proceedings of the 2012 IEEE International Parallel and Distributed Processing Symposium, pages 945-956, 2012.

[11] J. Li, C. Xue, and Y. Xu. STT-RAM based energy-efficiency hybrid cache for CMPs. In 2011 IEEE/IFIP International Conference on VLSI and System-on-Chip, pages 31-36, 2011.

[12] K. Lim, P. Ranganathan, J. Chang, C. Patel, T. N. Mudge, and S. Reinhardt. Understanding and designing new server architectures for emerging warehouse-computing environments. In Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 315-326, 2008.

[13] M. Dimitrov, K. Kumar, P. Lu, and V. Viswanathan. Memory system characterization of big data workloads. In The 1st Workshop on Benchmarks, Performance Optimization, and Emerging Hardware of Big Data Systems and Applications, 2013.

[14] T. Nirschl, J. B. Phipp, T. D. Happ, G. W. Burr, B. Rajendran, M.-H. Lee, A. Schrott, M. Yang, M. Breitwisch, C.-F. Chen, E. Joseph, M. Lamorey, R. Cheek, S.-H. Chen, S. Zaidi, S. Raoux, Y. C. Chen, Y. Zhu, R. Bergmann, H.-L. Lung, and C. Lam. Write strategies for 2 and 4-bit multi-level phase-change memory. In Proceedings of the IEEE International Electron Devices Meeting, pages 461-464, 2007.

[15] M.-R. Ra, J. Paek, A. B. Sharma, R. Govindan, M. H. Krieger, and M. J. Neely. Energy-delay tradeoffs in smartphone applications. In Proceedings of the 8th International Conference on Mobile Systems, Applications, and Services, 2010.

[16] L. E. Ramos, E. Gorbatov, and R. Bianchini. Page placement in hybrid memory systems. In Proceedings of the International Conference on Supercomputing, pages 85-95, 2011.

[17] P. Rosenfeld, E. Cooper-Balis, and B. Jacob. DRAMSim2: A cycle accurate memory system simulator. Computer Architecture Letters, 10:16-19, 2011.

[18] C. Smullen, V. Mohan, A. Nigam, S. Gurumurthi, and M. Stan. Relaxing non-volatility for fast and energy-efficient STT-RAM caches. In Proceedings of the 17th International Symposium on High Performance Computer Architecture, pages 50-61, 2011.

[19] B. Van Essen, R. Pearce, S. Ames, and M. Gokhale. On the role of NVRAM in data-intensive architectures: An evaluation. In Proceedings of the 26th International Parallel and Distributed Processing Symposium, pages 703-714, 2012.

[20] L. Wang, C. Luo, Y. He, J. Zhan, K. Zhan, X. Li, Y. Zhu, S. Zhang, Q. Yang, B. Qiu, and Z. Jia. BigDataBench: A big data benchmark suite from internet services. In Proceedings of the 20th IEEE International Symposium on High Performance Computer Architecture, 2014.

[21] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24-36, 1995.

[22] C. J. Xue, Y. Zhang, Y. Chen, G. Sun, J. J. Yang, and H. Li. Emerging non-volatile memories: Opportunities and challenges. In Proceedings of the 7th IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, pages 325-334, 2011.